Real-time Codebase Indexing

Setup

If you don't have Postgres installed, please follow the installation guide.

Add the codebase as a source.

Ingest files from the CocoIndex codebase root directory.

@cocoindex.flow_def(name="CodeEmbedding")
def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    """
    Define an example flow that embeds files into a vector database.
    """
    data_scope["files"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="../..",
                                    included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx"],
                                    excluded_patterns=[".*", "target", "**/node_modules"]))
    code_embeddings = data_scope.add_collector()

  • Include files with the extensions .py, .rs, .toml, .md, .mdx.
  • Exclude files and directories starting with ., the target directory in the root, and node_modules under any directory.

flow_builder.add_source will create a table with sub-fields (filename, content). See the documentation for more details.

Process each file and collect the information.

Extract the extension of a filename

We need to pass the language (or file extension) to Tree-sitter to parse the code, so let's define a function to extract the extension of a filename while processing each file. You can find the documentation for custom functions here.

import os

@cocoindex.op.function()
def extract_extension(filename: str) -> str:
    """Extract the extension of a filename."""
    return os.path.splitext(filename)[1]

Then we apply this function while processing each file.

with data_scope["files"].row() as file:
    file["extension"] = file["filename"].transform(extract_extension)

Here we extract the extension of the filename and store it in the extension field.
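
The function simply delegates to os.path.splitext from the standard library, so its behavior on a few sample paths (plain Python, evaluated outside the flow; the paths are just examples) looks like this:

os.path.splitext("src/main.rs")[1]     # ".rs"
os.path.splitext("pyproject.toml")[1]  # ".toml"
os.path.splitext("README.md")[1]       # ".md"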

Split the file into chunks

We chunk the code with Tree-sitter, using the SplitRecursively function to split each file into chunks. SplitRecursively is integrated with Tree-sitter, so you can pass the language (or file extension) to the language parameter. To see all supported language names and extensions, see the documentation here. All major languages are supported, e.g., Python, Rust, JavaScript, TypeScript, Java, C++, etc. If the language is unspecified or not supported, the file is treated as plain text.

with data_scope["files"].row() as file:
    file["chunks"] = file["content"].transform(
        cocoindex.functions.SplitRecursively(),
        language=file["extension"], chunk_size=1000, chunk_overlap=300)

Embed the chunks

We use SentenceTransformerEmbed to embed the chunks. You can refer to the documentation here.

@cocoindex.transform_flow()
def code_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
    """
    Embed the text using a SentenceTransformer model.
    """
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"))

Then, for each chunk, we embed it with the code_to_embedding function and collect the embeddings into the code_embeddings collector.

@cocoindex.transform_flow() is needed to share the transformation between indexing and querying. Since we build a vector index and query against it, the embedding computation needs to be consistent between indexing and querying. See the documentation for more details.

with data_scope["files"].row() as file:
    with file["chunks"].row() as chunk:
        chunk["embedding"] = chunk["text"].call(code_to_embedding)
        code_embeddings.collect(filename=file["filename"], location=chunk["location"],
                                code=chunk["text"], embedding=chunk["embedding"])
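
Note what the shared transform flow buys us: the same code_to_embedding can later be evaluated eagerly on a plain query string, so query vectors live in the same embedding space as the index. A minimal illustration (the query text here is just an example; the full usage appears in the query section below):

# Evaluate the same transform flow on an ad-hoc string at query time.
query_vector = code_to_embedding.eval("how to define a flow")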

Export the embeddings

Export the embeddings to a table.

code_embeddings.export(
    "code_embeddings",
    cocoindex.storages.Postgres(),
    primary_key_fields=["filename", "location"],
    vector_indexes=[cocoindex.VectorIndex("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])

We use cosine similarity to measure the similarity between the query and the indexed data. To learn more about cosine similarity, see Wikipedia.
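
For intuition, cosine similarity is the dot product of two vectors divided by the product of their norms. Here is a minimal sketch in plain Python (numpy is used only for this illustration and is not otherwise needed by the example). Note that pgvector's <=> operator used in the query below returns cosine distance, i.e. 1 - cosine similarity, which is why the search code converts the distance back into a score:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))        # 1.0: parallel vectors are maximally similar
print(1.0 - cosine_similarity(a, b))  # 0.0: the corresponding cosine distance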

Query the index

We match user-provided text with a SQL query, reusing the embedding operation from the indexing flow.

# ConnectionPool comes from the psycopg_pool package (psycopg 3).
from psycopg_pool import ConnectionPool

def search(pool: ConnectionPool, query: str, top_k: int = 5):
    # Get the table name for the export target in the code_embedding_flow above.
    table_name = cocoindex.utils.get_target_storage_default_name(code_embedding_flow, "code_embeddings")
    # Evaluate the transform flow defined above with the input query, to get the embedding.
    query_vector = code_to_embedding.eval(query)
    # Run the query and get the results.
    with pool.connection() as conn:
        with conn.cursor() as cur:
            cur.execute(f"""
                SELECT filename, code, embedding <=> %s::vector AS distance
                FROM {table_name} ORDER BY distance LIMIT %s
            """, (query_vector, top_k))
            return [
                {"filename": row[0], "code": row[1], "score": 1.0 - row[2]}
                for row in cur.fetchall()
            ]

Define a main function to run queries in the terminal.

def main():
    # Initialize the database connection pool.
    pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL"))
    # Run queries in a loop to demonstrate the query capabilities.
    while True:
        try:
            query = input("Enter search query (or Enter to quit): ")
            if query == '':
                break
            # Run the query function with the database connection pool and the query.
            results = search(pool, query)
            print("\nSearch results:")
            for result in results:
                print(f"[{result['score']:.3f}] {result['filename']}")
                print(f"    {result['code']}")
                print("---")
            print()
        except KeyboardInterrupt:
            break

if __name__ == "__main__":
    main()

Run the index setup & update

🎉 Now you are all set!

Run the following command to set up and update the index.

cocoindex update --setup main.py

You'll see the index update status in the terminal.

Test the query

At this point, you can start the CocoIndex server and develop your RAG runtime against the data. To test your index, you can run:

python main.py

When you see the prompt, you can enter your search query, for example: spec.

You can find the search results in the terminal. Each entry contains the score (cosine similarity), the filename, and the matched code snippet.
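
Based on the print statements in main above, each result is rendered in the terminal as:

[<score>] <filename>
    <code snippet>
---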