Build Real-Time Codebase Indexing for AI Code Generation
In this blog, we will show you how to index a codebase for RAG with CocoIndex. CocoIndex provides built-in support for codebase chunking, with native Tree-sitter integration.
Because CocoIndex uses incremental processing under the hood (only reprocessing what has changed), the index is updated in near real time. It can be fully real-time when integrated with a source that pushes change notifications directly, e.g., a code editor. This is useful for code search applications, and for providing relevant context to AI code generation systems, where low latency is critical for a seamless developer experience.
If this tutorial helps you, it'd mean a lot to us if you could ⭐ star CocoIndex on GitHub.
Tree-sitter
CocoIndex leverages Tree-sitter's capabilities to intelligently chunk code based on the actual syntax structure rather than arbitrary line breaks. These syntactically coherent chunks are then used to build a more effective index for RAG systems, enabling more precise code retrieval and better context preservation.
Tree-sitter is a parser generator tool and an incremental parsing library; it is available in Rust 🦀.
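To get a feel for what Tree-sitter sees, here is a minimal sketch, not part of the CocoIndex flow, that parses a Python snippet and prints its top-level syntax nodes. It assumes recent versions of the tree-sitter and tree-sitter-python pip packages (CocoIndex itself does not require them; it bundles Tree-sitter natively):
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

# Build a parser for Python (py-tree-sitter >= 0.23 API).
parser = Parser(Language(tspython.language()))
tree = parser.parse(b"def add(a, b):\n    return a + b\n")

# Top-level nodes are whole syntactic units (e.g., a function definition),
# so chunk boundaries can follow the syntax tree instead of raw line counts.
for node in tree.root_node.children:
    print(node.type, node.start_point, node.end_point)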
You can find the full code of this project here. The indexing path is only ~50 lines of Python, so check it out!
Prerequisites
If you don't have Postgres installed, please follow the installation guide. CocoIndex uses Postgres to keep track of data lineage for incremental processing.
Define Indexing Flow
Flow Design
The flow diagram illustrates how we'll process our codebase:
- Read code files from the local filesystem
- Extract each file's extension to determine the language for Tree-sitter to parse
- Split code into semantic chunks using Tree-sitter
- Generate embeddings for each chunk
- Store in a vector database for retrieval
1. Add the codebase as a source.
@cocoindex.flow_def(name="CodeEmbedding")
def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    """
    Define an example flow that embeds files into a vector database.
    """
    data_scope["files"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="../..",
                                    included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx"],
                                    excluded_patterns=[".*", "target", "**/node_modules"]))
    code_embeddings = data_scope.add_collector()
We ingest the CocoIndex codebase from its root directory. You can change the path to point to the codebase you want to index. We index all files with the extensions .py, .rs, .toml, .md, and .mdx, and skip hidden directories (starting with .), target (in the root), and node_modules (under any directory).
flow_builder.add_source will create a table with sub fields (filename, content). You can refer to the documentation for more details.
2. Process each file and collect the information.
2.1 Extract the extension of a filename
We need to pass the language (or extension) to Tree-sitter to parse the code. Let's define a function to extract the extension of a filename while processing each file. You can find the documentation for custom functions here.
import os

@cocoindex.op.function()
def extract_extension(filename: str) -> str:
    """Extract the extension of a filename."""
    return os.path.splitext(filename)[1]
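Under the hood this is just os.path.splitext, which returns the extension with the dot included. For example:
import os

print(os.path.splitext("main.py")[1])     # ".py"
print(os.path.splitext("src/lib.rs")[1])  # ".rs"
print(os.path.splitext("README")[1])      # "" (no extension -> treated as plain text)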
Then we are going to process each file and collect the information.
with data_scope["files"].row() as file:
    file["extension"] = file["filename"].transform(extract_extension)
Here we extract the extension of the filename and store it in the extension field.
2.2 Split the file into chunks
We will chunk the code with Tree-sitter.
We use the SplitRecursively function to split the file into chunks. It is integrated with Tree-sitter, so you can pass the language to the language parameter. To see all supported language names and extensions, see the documentation here. All the major languages are supported, e.g., Python, Rust, JavaScript, TypeScript, Java, C++, etc. If the language is unspecified or not supported, the file is treated as plain text.
with data_scope["files"].row() as file:
    file["chunks"] = file["content"].transform(
        cocoindex.functions.SplitRecursively(),
        language=file["extension"], chunk_size=1000, chunk_overlap=300)
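To build intuition for the chunk_size and chunk_overlap parameters, here is a naive character-based splitter. This is only an illustration of the parameters, not SplitRecursively itself, which prefers Tree-sitter syntax boundaries over fixed offsets:
def naive_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 300) -> list[str]:
    """Fixed-size chunks; each chunk repeats the trailing chunk_overlap chars of the previous one."""
    step = chunk_size - chunk_overlap  # advance this many chars per chunk
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]
The overlap keeps context that straddles a boundary retrievable from either side; Tree-sitter-based splitting goes further by keeping syntactic units like whole functions intact.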
2.3 Embed the chunks
We use SentenceTransformerEmbed to embed the chunks; you can refer to the documentation here. There are 12k models supported by 🤗 Hugging Face, so you can just pick your favorite model.
@cocoindex.transform_flow()
def code_to_embedding(text: cocoindex.DataSlice[str]) -> cocoindex.DataSlice[list[float]]:
    """
    Embed the text using a SentenceTransformer model.
    """
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"))
Then, for each chunk, we embed it using the code_to_embedding function and collect the embeddings into the code_embeddings collector. The @cocoindex.transform_flow() decorator is needed to share the transformation between indexing and querying: since we build a vector index and query against it, the embedding computation needs to be consistent on both paths. See the documentation for more details.
with data_scope["files"].row() as file:
    with file["chunks"].row() as chunk:
        chunk["embedding"] = chunk["text"].call(code_to_embedding)
        code_embeddings.collect(filename=file["filename"], location=chunk["location"],
                                code=chunk["text"], embedding=chunk["embedding"])
2.4 Export the embeddings
Export the embeddings to a table.
code_embeddings.export(
    "code_embeddings",
    cocoindex.storages.Postgres(),
    primary_key_fields=["filename", "location"],
    vector_indexes=[cocoindex.VectorIndex("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
We use cosine similarity to measure the similarity between the query and the indexed data. To learn more about cosine similarity, see the Wikipedia article.
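Concretely, cosine similarity is the dot product of two vectors divided by the product of their norms. Note that pgvector's <=> operator returns cosine distance, i.e., 1 - similarity, which is why the query below converts it back. A minimal NumPy sketch:
import numpy as np

def cosine_similarity(a, b) -> float:
    # Dot product normalized by both vector lengths; ranges from -1 to 1.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)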
3. Query the index
We match against user-provided text with a SQL query, reusing the embedding operation from the indexing flow.
from psycopg_pool import ConnectionPool

def search(pool: ConnectionPool, query: str, top_k: int = 5):
    # Get the table name for the export target in the code_embedding_flow above.
    table_name = cocoindex.utils.get_target_storage_default_name(code_embedding_flow, "code_embeddings")
    # Evaluate the transform flow defined above with the input query, to get the embedding.
    query_vector = code_to_embedding.eval(query)
    # Run the query and get the results.
    with pool.connection() as conn:
        with conn.cursor() as cur:
            cur.execute(f"""
                SELECT filename, code, embedding <=> %s::vector AS distance
                FROM {table_name} ORDER BY distance LIMIT %s
            """, (query_vector, top_k))
            return [
                {"filename": row[0], "code": row[1], "score": 1.0 - row[2]}
                for row in cur.fetchall()
            ]
Define a main function to run queries from the terminal.
@cocoindex.main_fn()
def _run():
    # Initialize the database connection pool.
    pool = ConnectionPool(os.getenv("COCOINDEX_DATABASE_URL"))
    # Run queries in a loop to demonstrate the query capabilities.
    while True:
        try:
            query = input("Enter search query (or Enter to quit): ")
            if query == '':
                break
            # Run the query function with the database connection pool and the query.
            results = search(pool, query)
            print("\nSearch results:")
            for result in results:
                print(f"[{result['score']:.3f}] {result['filename']}")
                print(f"    {result['code']}")
                print("---")
            print()
        except KeyboardInterrupt:
            break
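To make the script runnable, add an entry point at the bottom of main.py. A minimal sketch, assuming you keep settings such as COCOINDEX_DATABASE_URL in a .env file and have python-dotenv installed:
if __name__ == "__main__":
    from dotenv import load_dotenv
    load_dotenv()  # load COCOINDEX_DATABASE_URL and other settings from .env
    _run()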
The @cocoindex.main_fn() decorator initializes the library with settings loaded from environment variables. See the initialization documentation for more details.
Run the index setup & update
Now you are all set!
Run the following commands to set up and update the index.
python main.py cocoindex setup
python main.py cocoindex update
You'll see the index update states in the terminal.
Test the query
At this point, you can start the CocoIndex server and develop your RAG runtime against the data. To test your index, run:
python main.py
When you see the prompt, enter your search query, for example: spec.
You'll see the search results in the terminal.
Each returned entry contains the score (cosine similarity), the filename, and the code snippet that matched.
Support us
We are constantly improving, and more features and examples are coming soon. If you love this article, please give us a star ⭐ on GitHub to help us grow.
Thanks for reading!