
Index a codebase for RAG

· 8 min read

Code Indexing for RAG

Overview

In this blog, we will show you how to index a codebase for RAG with CocoIndex. CocoIndex is a tool that helps you index and query your data, designed as a framework for building your own data pipelines. It provides built-in support for codebase chunking, with native Tree-sitter support.

Tree-sitter

Tree-sitter is a parser generator tool and an incremental parsing library, available in Rust 🦀 - GitHub. CocoIndex has built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages.

Codebase chunking is the process of breaking down a codebase into smaller, semantically meaningful chunks. CocoIndex leverages Tree-sitter's capabilities to intelligently chunk code based on the actual syntax structure rather than arbitrary line breaks. These semantically coherent chunks are then used to build a more effective index for RAG systems, enabling more precise code retrieval and better context preservation.
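To make this concrete, here is a minimal sketch of the idea - not CocoIndex internals, just an illustration using the tree-sitter and tree-sitter-python packages (API shown for the 0.23+ Python bindings; verify against your installed version):

from tree_sitter import Language, Parser
import tree_sitter_python

# Parse a small Python snippet into a syntax tree.
parser = Parser(Language(tree_sitter_python.language()))
source = b"import os\n\ndef greet(name):\n    return f'Hello, {name}!'\n"
tree = parser.parse(source)

# Top-level syntax nodes (imports, functions, classes, ...) are natural
# chunk boundaries - far more meaningful than arbitrary line breaks.
for node in tree.root_node.children:
    print(node.type, source[node.start_byte:node.end_byte].decode())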

Fast pass 🚀 - you can find the full code here. Only ~50 lines of Python code for the RAG pipeline, check it out 🤗!

Please give CocoIndex a star on GitHub to support us if you like our work. Thank you so much with a warm coconut hug 🥥🤗.

Prerequisites

If you don't have Postgres installed, please refer to the installation guide. CocoIndex uses Postgres to manage the data index; support for other databases is on our roadmap. If you are interested in a particular database, please let us know by creating a GitHub issue or reaching out on Discord.

Define CocoIndex Flow

Let's define the CocoIndex flow to read from a codebase and index it for RAG.

CocoIndex Flow for Code Embedding

The flow diagram above illustrates how we'll process our codebase:

  1. Read code files from the local filesystem
  2. Extract file extensions
  3. Split code into semantic chunks using Tree-sitter
  4. Generate embeddings for each chunk
  5. Store in a vector database for retrieval

Let's implement this flow step by step.

1. Add the codebase as a source

@cocoindex.flow_def(name="CodeEmbedding")
def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    """
    Define an example flow that embeds files into a vector database.
    """
    data_scope["files"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(
            path="../..",
            included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx"],
            excluded_patterns=[".*", "target", "**/node_modules"]))
    code_embeddings = data_scope.add_collector()

In this example, we are going to index the cocoindex codebase from its root directory. You can change the path to the codebase you want to index. We will index all files with the extensions .py, .rs, .toml, .md, and .mdx, and skip directories starting with ., target (in the root), and node_modules (under any directory).

flow_builder.add_source will create a table with the following sub-fields; see the documentation here.

  • filename (key, type: str): the filename of the file, e.g. dir1/file1.md
  • content (type: str if binary is False, otherwise bytes): the content of the file

2. Process each file and collect the information

2.1 Extract the extension of a filename

First, let's define a function to extract the extension of a filename while processing each file. You can find the documentation for custom functions here.

import os

@cocoindex.op.function()
def extract_extension(filename: str) -> str:
    """Extract the extension of a filename."""
    return os.path.splitext(filename)[1]

Then we are going to process each file and collect the information.

    # ...
    with data_scope["files"].row() as file:
        file["extension"] = file["filename"].transform(extract_extension)

Here we extract the extension of the filename and store it in the extension field. For example, if the filename is spec.rs, the extension field will be .rs.

2.2 Split the file into chunks

Next, we are going to split each file into chunks, using the SplitRecursively function. You can find the documentation for the function here.

CocoIndex provides built-in support for Tree-sitter, so you can pass the language to the language parameter. To see all supported language names and extensions, see the documentation here. All the major languages are supported, e.g., Python, Rust, JavaScript, TypeScript, Java, C++, etc. If the language is unspecified or not supported, the content is treated as plain text.

    with data_scope["files"].row() as file:
        # ...
        file["chunks"] = file["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language=file["extension"], chunk_size=1000, chunk_overlap=300)

2.3 Embed the chunks

We will use the SentenceTransformerEmbed function to embed the chunks. You can find the documentation for the function here. There are over 12k models available on 🤗 Hugging Face; you can just pick your favorite model.

def code_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice:
    """
    Embed the text using a SentenceTransformer model.
    """
    return text.transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"))

Then for each chunk, we embed it using the code_to_embedding function, and collect the embeddings to the code_embeddings collector.

We extract this code_to_embedding function instead of directly calling transform(cocoindex.functions.SentenceTransformerEmbed(...)) in place, because we want to share it between the indexing flow definition and the query handler definition. Alternatively, it's also OK to skip the extra function and do it in place - copying a little code is not a big deal, and that's what we did for the quickstart project.

    with data_scope["files"].row() as file:
        # ...
        with file["chunks"].row() as chunk:
            chunk["embedding"] = chunk["text"].call(code_to_embedding)
            code_embeddings.collect(filename=file["filename"], location=chunk["location"],
                                    code=chunk["text"], embedding=chunk["embedding"])

2.4 Export the embeddings

Finally, let's export the embeddings to a table.

    code_embeddings.export(
        "code_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_index=[("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])

3. Set up a query handler for your index

We will use the SimpleSemanticsQueryHandler to query the index. Note that we need to pass the code_to_embedding function to the query_transform_flow parameter, so the query handler uses the same embedding model as the indexing flow.

query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
    name="SemanticsSearch",
    flow=code_embedding_flow,
    target_name="code_embeddings",
    query_transform_flow=code_to_embedding,
    default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)

Define a main function to run the query handler.

from dotenv import load_dotenv  # pip install python-dotenv

@cocoindex.main_fn()
def _run():
    # Run queries in a loop to demonstrate the query capabilities.
    while True:
        try:
            query = input("Enter search query (or Enter to quit): ")
            if query == '':
                break
            results, _ = query_handler.search(query, 10)
            print("\nSearch results:")
            for result in results:
                print(f"[{result.score:.3f}] {result.data['filename']}")
                print(f"    {result.data['code']}")
                print("---")
            print()
        except KeyboardInterrupt:
            break

if __name__ == "__main__":
    load_dotenv(override=True)
    _run()

The @cocoindex.main_fn() decorator initializes the library with settings loaded from environment variables. See documentation for initialization for more details.
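For example, a minimal .env file for the local Postgres setup might look like the line below; the variable name here is our assumption, so double-check it against the initialization documentation:

# Assumed variable name - verify against the CocoIndex initialization docs.
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex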

Run the index setup & update

🎉 Now you are all set!

Run the following commands to set up and update the index.

python main.py cocoindex setup
python main.py cocoindex update

You'll see the index update status in the terminal.

Terminal showing index update process

Test the query

At this point, you can start the cocoindex server and develop your RAG runtime against the data.

To test your index, there are two options:

Option 1: Run the index server in the terminal

python main.py

When you see the prompt, you can enter your search query. For example: spec.

Enter search query (or Enter to quit): spec

You can find the search results in the terminal

Search results in terminal

The returned results - each entry contains the score (cosine similarity), filename, and the code snippet that got matched. At CocoIndex, we use cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY to measure the similarity between the query and the indexed data. You can switch to other metrics too and quickly test them out.

To learn more about cosine similarity, see Wikipedia.
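For intuition, here is a small standalone sketch of the metric itself (not CocoIndex code), using NumPy; the vectors are toy stand-ins for real embedding vectors:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d vectors standing in for real embedding vectors.
query_vec = np.array([0.1, 0.7, 0.2])
chunk_vec = np.array([0.2, 0.6, 0.1])
print(f"{cosine_similarity(query_vec, chunk_vec):.3f}")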

Option 2: Run CocoInsight to understand your data pipeline and data index

CocoInsight is a tool to help you understand your data pipeline and data index. It connects to your local CocoIndex server with zero data retention.

CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3-minute video tutorial about CocoInsight: Watch on YouTube.

Run the CocoIndex server

python main.py cocoindex server -c https://cocoindex.io

Once the server is running, open CocoInsight in your browser. You'll be able to connect to your local CocoIndex server and explore your data pipeline and index.

CocoInsight UI showing data exploration

On the right side, you can see the data flow that we defined.

On the left side, you can see the data index in the data preview.

CocoInsight Data Preview showing indexed code chunks

You can click on any row to see the details of that data entry, including the full content of code chunks and their embeddings.

To evaluate the performance of your index, you can click the + search button on top and enter your query.

Community

We'd love to hear from the community! You can find us on GitHub and Discord.

If you like this post and our work, please support CocoIndex on GitHub with a star ⭐. Thank you with a warm coconut hug 🥥🤗.

On-premise structured extraction with LLM using Ollama

· 7 min read

Structured data extraction with Ollama and CocoIndex

Overview

In this blog, we will show you how to use Ollama to extract structured data with a model that runs locally, so you can deploy it on your own cloud or server.

You can find the full code here. Only ~100 lines of Python code, check it out 🤗!

Please give CocoIndex a star on GitHub to support us if you like our work. Thank you so much with a warm coconut hug 🥥🤗.

Prerequisites

Install Postgres

If you don't have Postgres installed, please refer to the installation guide.

Install Ollama

Ollama allows you to run LLM models on your local machine easily. To get started:

Download and install Ollama. Pull your favorite LLM models with the ollama pull command, e.g.

ollama pull llama3.2
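As an optional sanity check before wiring things into the flow, you can call the local Ollama HTTP API directly to confirm the model responds. This is a minimal sketch assuming Ollama's default port 11434; only the standard library is used:

import json
import urllib.request

payload = {"model": "llama3.2", "prompt": "Say hi in one word.", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # With stream=False, Ollama returns one JSON object with a "response" field.
    print(json.loads(resp.read())["response"])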

Extract Structured Data from Markdown files

1. Define output

We are going to extract the following information from the Python manuals as structured data.

So we define the output data classes as follows. The goal is to extract and populate ModuleInfo.

import dataclasses

import cocoindex

@dataclasses.dataclass
class ArgInfo:
    """Information about an argument of a method."""
    name: str
    description: str

@dataclasses.dataclass
class MethodInfo:
    """Information about a method."""
    name: str
    args: cocoindex.typing.List[ArgInfo]
    description: str

@dataclasses.dataclass
class ClassInfo:
    """Information about a class."""
    name: str
    description: str
    methods: cocoindex.typing.List[MethodInfo]

@dataclasses.dataclass
class ModuleInfo:
    """Information about a Python module."""
    title: str
    description: str
    classes: cocoindex.typing.List[ClassInfo]
    methods: cocoindex.typing.List[MethodInfo]
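To illustrate the target shape, here is a hypothetical ModuleInfo instance, hand-written for illustration only (not actual extractor output):

example = ModuleInfo(
    title="array - Efficient arrays of numeric values",
    description="Defines an object type which can compactly represent an array of basic values.",
    classes=[
        ClassInfo(
            name="array.array",
            description="A mutable sequence of fixed-type numeric values.",
            methods=[
                MethodInfo(
                    name="append",
                    args=[ArgInfo(name="x", description="New value to add")],
                    description="Append a new item to the end of the array.",
                ),
            ],
        ),
    ],
    methods=[],
)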

2. Define CocoIndex Flow

Let's define the CocoIndex flow to extract the structured data from markdowns, which is super simple.

First, let's add the Python docs in markdown as a source. We will illustrate how to load PDFs a few sections below.

@cocoindex.flow_def(name="ManualExtraction")
def manual_extraction_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files"))

    modules_index = data_scope.add_collector()

flow_builder.add_source will create a table with the following sub-fields; see the documentation here.

  • filename (key, type: str): the filename of the file, e.g. dir1/file1.md
  • content (type: str if binary is False, otherwise bytes): the content of the file

Then, let's extract the structured data from the markdown files. It is super easy: you just need to provide the LLM spec and pass down the defined output type.

CocoIndex provides built-in functions (e.g. ExtractByLlm) that process data using LLMs. We provide built-in support for Ollama, which allows you to run LLM models on your local machine easily. You can find the full list of models here. We also support the OpenAI API. You can find the full documentation and instructions here.

    # ...
    with data_scope["documents"].row() as doc:
        doc["module_info"] = doc["content"].transform(
            cocoindex.functions.ExtractByLlm(
                llm_spec=cocoindex.LlmSpec(
                    api_type=cocoindex.LlmApiType.OLLAMA,
                    # See the full list of models: https://ollama.com/library
                    model="llama3.2"
                ),
                output_type=ModuleInfo,
                instruction="Please extract Python module information from the manual."))

After the extraction, we just need to cherry-pick anything we like from the output, using the collect function of the collector defined on the data scope above.

        modules_index.collect(
            filename=doc["filename"],
            module_info=doc["module_info"],
        )

Finally, let's export the extracted data to a table.

    modules_index.export(
        "modules",
        cocoindex.storages.Postgres(table_name="modules_info"),
        primary_key_fields=["filename"],
    )

3. Query and test your index

🎉 Now you are all set!

Run the following commands to set up and update the index.

python main.py cocoindex setup
python main.py cocoindex update

You'll see the index update status in the terminal.

Index Updates

After the index is built, you have a table with the name modules_info. You can query it at any time, e.g., start a Postgres shell:

psql postgres://cocoindex:cocoindex@localhost/cocoindex

And run the SQL query:

SELECT filename, module_info->'title' AS title FROM modules_info;
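If you'd rather query from Python, here is a hedged sketch of the same query with psycopg 3 (pip install psycopg), reusing the connection URL above:

import psycopg

with psycopg.connect("postgres://cocoindex:cocoindex@localhost/cocoindex") as conn:
    # ->> extracts the JSON field as text rather than as a JSON value.
    rows = conn.execute(
        "SELECT filename, module_info->>'title' AS title FROM modules_info"
    ).fetchall()
    for filename, title in rows:
        print(f"{filename}: {title}")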

You can see the structured data extracted from the documents. Here's a screenshot of the extracted module information:

Module Information

CocoInsight

CocoInsight is a tool to help you understand your data pipeline and data index. CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3-minute video tutorial about CocoInsight: Watch on YouTube.

1. Run the CocoIndex server

python main.py cocoindex server -c https://cocoindex.io

Then open the CocoInsight dashboard at https://cocoindex.io/cocoinsight. It connects to your local CocoIndex server with zero data retention.

There are two parts of the CocoInsight dashboard:

CocoInsight Dashboard

  • Flows: You can see the flow you defined, and the data it collects.
  • Data: You can see the data in the data index.

On the data side, you can click on any data and scroll down to see the details. In this data extraction example, you can see the data extracted from the markdown files and the structured data presented in tabular format.

CocoInsight Data

For example, for the array module, you can preview the extracted data by clicking on its row.

CocoInsight Data Preview for Array Module

Lots of great updates coming soon, stay tuned!

Add Summary to the data

Using CocoIndex as a framework, you can easily add any transformation on the data (including LLM summaries), and collect it as part of the data index. For example, let's add a simple summary to each module - like the number of classes and methods - using a simple Python function.

We will add an LLM example later.

1. Define output

First, let's add the structure we want as part of the output definition.

@dataclasses.dataclass
class ModuleSummary:
    """Summary info about a Python module."""
    num_classes: int
    num_methods: int

2. Define CocoIndex Flow

Next, let's define a custom function to summarize the data. You can see the detailed documentation here.

@cocoindex.op.function()
def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
    """Summarize a Python module."""
    return ModuleSummary(
        num_classes=len(module_info.classes),
        num_methods=len(module_info.methods),
    )

3. Plug the function into the flow

    # ...
    with data_scope["documents"].row() as doc:
        # ... after the extraction
        doc["module_summary"] = doc["module_info"].transform(summarize_module)

🎉 Now you are all set!

Run the following commands to set up and update the index.

python main.py cocoindex setup
python main.py cocoindex update

Extract Structured Data from PDF files

Ollama does not support PDF files directly as input, so we need to convert them to markdown first.

To do this, we can plug in a custom function to convert PDFs to markdown. See the full documentation here.

1. Define a function spec

The function spec of a function configures the behavior of a specific instance of the function.

class PdfToMarkdown(cocoindex.op.FunctionSpec):
    """Convert a PDF to markdown."""

2. Define an executor class

The executor class implements the function spec and is responsible for the actual execution of the function.

This class takes PDF content as bytes, saves it to a temporary file, and uses PdfConverter to extract the text content. The extracted text is then returned as a string, converting PDF to markdown format.

It is associated with the function spec via the spec: PdfToMarkdown field.

# Imports for the marker-pdf library are our assumption here - verify the
# module paths against your installed marker version.
import tempfile

from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
class PdfToMarkdownExecutor:
    """Executor for PdfToMarkdown."""

    spec: PdfToMarkdown
    _converter: PdfConverter

    def prepare(self):
        config_parser = ConfigParser({})
        self._converter = PdfConverter(create_model_dict(), config=config_parser.generate_config_dict())

    def __call__(self, content: bytes) -> str:
        with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
            temp_file.write(content)
            temp_file.flush()
            text, _, _ = text_from_rendered(self._converter(temp_file.name))
            return text

You may wonder why we define a spec + executor (instead of using a standalone function) here. The main reason is that there is some heavy preparation work (initializing the parser) that needs to be done before the function is ready to process real data.

3. Plug it into the flow

    # Note the binary = True for PDF
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="manuals", binary=True))
    modules_index = data_scope.add_collector()

    with data_scope["documents"].row() as doc:
        # plug in your custom function here
        doc["markdown"] = doc["content"].transform(PdfToMarkdown())

🎉 Now you are all set!

Run the following commands to set up and update the index.

python main.py cocoindex setup
python main.py cocoindex update

Community

We'd love to hear from the community! You can find us on GitHub and Discord.

If you like this post and our work, please support CocoIndex on GitHub with a star ⭐. Thank you with a warm coconut hug 🥥🤗.