
On-premise structured extraction with LLM using Ollama

· 7 min read

Structured data extraction with Ollama and CocoIndex

Overview

In this blog post, we will show you how to use Ollama to extract structured data, with a pipeline that you can run locally and deploy on your own cloud or server.

You can find the full code here. It is only ~100 lines of Python code, check it out 🤗!

If you like our work, please give CocoIndex a star on GitHub to support us. Thank you so much with a warm coconut hug 🥥🤗.

Install Ollama

Ollama allows you to run LLM models on your local machine easily. To get started:

Download and install Ollama, then pull your favorite LLM models with the ollama pull command, e.g.

ollama pull llama3.2
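
If you want to sanity-check the model before wiring it into the flow, you can chat with it once from the terminal:

ollama run llama3.2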

Extract Structured Data from Markdown files

1. Define output

We are going to extract information about a module's title, description, classes, and methods from the Python manuals as structured data.

So we define the output dataclasses as follows. The goal is to extract and populate ModuleInfo.

import dataclasses
import cocoindex

@dataclasses.dataclass
class ArgInfo:
    """Information about an argument of a method."""
    name: str
    description: str

@dataclasses.dataclass
class MethodInfo:
    """Information about a method."""
    name: str
    args: cocoindex.typing.List[ArgInfo]
    description: str

@dataclasses.dataclass
class ClassInfo:
    """Information about a class."""
    name: str
    description: str
    methods: cocoindex.typing.List[MethodInfo]

@dataclasses.dataclass
class ModuleInfo:
    """Information about a Python module."""
    title: str
    description: str
    classes: cocoindex.typing.List[ClassInfo]
    methods: cocoindex.typing.List[MethodInfo]
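
To make the target shape concrete, here is a hand-written instance of ModuleInfo. The values are purely illustrative; the extraction step below produces structures like this automatically.

# Illustrative only: an example of what a populated ModuleInfo may look like.
example = ModuleInfo(
    title="array - Efficient arrays of numeric values",
    description="Defines an array type that compactly represents sequences of basic numeric values.",
    classes=[
        ClassInfo(
            name="array.array",
            description="A mutable sequence of numeric values constrained by a typecode.",
            methods=[
                MethodInfo(
                    name="append",
                    args=[ArgInfo(name="x", description="The value to append to the array.")],
                    description="Append a new item with value x to the end of the array.",
                ),
            ],
        ),
    ],
    methods=[],
)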

2. Define CocoIndex Flow

Let's define the CocoIndex flow to extract the structured data from the markdown files. It is super simple.

First, let's add the Python docs in markdown as a source. We will illustrate how to load PDFs a few sections below.

@cocoindex.flow_def(name="ManualExtraction")
def manual_extraction_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files"))

    modules_index = data_scope.add_collector()

flow_builder.add_source creates a table with the following sub-fields (see the documentation here):

  • filename (key, type: str): the filename of the file, e.g. dir1/file1.md
  • content (type: str if binary is False, otherwise bytes): the content of the file

Then, let's extract the structured data from the markdown files. It is super easy: you just need to provide the LLM spec and pass in the defined output type.

CocoIndex provides built-in functions (e.g. ExtractByLlm) that process data using LLMs. We provide built-in support for Ollama, which allows you to run LLM models on your local machine easily. You can find the full list of models here. We also support the OpenAI API. You can find the full documentation and instructions here.

    # ...
    with data_scope["documents"].row() as doc:
        doc["module_info"] = doc["content"].transform(
            cocoindex.functions.ExtractByLlm(
                llm_spec=cocoindex.LlmSpec(
                    api_type=cocoindex.LlmApiType.OLLAMA,
                    # See the full list of models: https://ollama.com/library
                    model="llama3.2",
                ),
                output_type=ModuleInfo,
                instruction="Please extract Python module information from the manual."))
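
If you would rather use a hosted model, the same transform works with the OpenAI API mentioned above. Here is a sketch, assuming the API type is named OPENAI and that OPENAI_API_KEY is set in your environment:

        # Sketch: the same extraction with the OpenAI API instead of Ollama.
        # Assumes cocoindex.LlmApiType.OPENAI and an OPENAI_API_KEY environment variable.
        doc["module_info"] = doc["content"].transform(
            cocoindex.functions.ExtractByLlm(
                llm_spec=cocoindex.LlmSpec(
                    api_type=cocoindex.LlmApiType.OPENAI,
                    model="gpt-4o"),
                output_type=ModuleInfo,
                instruction="Please extract Python module information from the manual."))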

After the extraction, we just need to cherry-pick the fields we want from the output, using the collect function of the collector defined on the data scope above.

        modules_index.collect(
            filename=doc["filename"],
            module_info=doc["module_info"],
        )

Finally, let's export the extracted data to a table.

    modules_index.export(
        "modules",
        cocoindex.storages.Postgres(table_name="modules_info"),
        primary_key_fields=["filename"],
    )

3. Query and test your index

🎉 Now you are all set!

Run the following commands to set up and update the index.

python main.py cocoindex setup
python main.py cocoindex update

You'll see the index update states in the terminal.

Index Updates

After the index is built, you have a table with the name modules_info. You can query it at any time, e.g., start a Postgres shell:

psql postgres://cocoindex:cocoindex@localhost/cocoindex

And run the SQL query:

SELECT filename, module_info->'title' AS title, module_summary FROM modules_info;

You can see the structured data extracted from the documents. Here's a screenshot of the extracted module information:

Module Information

CocoInsight

CocoInsight is a tool to help you understand your data pipeline and data index. CocoInsight is in Early Access now (free) 😊 and you found us! Here is a quick 3-minute video tutorial about CocoInsight: Watch on YouTube.

1. Run the CocoIndex server

python main.py cocoindex server -c https://cocoindex.io

Then open the CocoInsight dashboard at https://cocoindex.io/cocoinsight. It connects to your local CocoIndex server with zero data retention.

There are two parts of the CocoInsight dashboard:

CocoInsight Dashboard

  • Flows: You can see the flow you defined, and the data it collects.
  • Data: You can see the data in the data index.

On the data side, you can click on any data and scroll down to see the details. In this data extraction example, you can see the data extracted from the markdown files and the structured data presented in tabular format.

CocoInsight Data

For example, for the array module, you can preview the extracted data by clicking on it.

CocoInsight Data Preview for Array Module

Lots of great updates coming soon, stay tuned!

Add Summary to the data

Using CocoIndex as a framework, you can easily add any transformation to the data (including LLM summaries) and collect it as part of the data index. For example, let's add a simple summary to each module, like the number of classes and methods, using a simple Python function.

We will add an LLM example later.

1. Define output

First, let's add the structure we want as part of the output definition.

@dataclasses.dataclass
class ModuleSummary:
    """Summary info about a Python module."""
    num_classes: int
    num_methods: int

2. Define CocoIndex Flow

Next, let's define a custom function to summarize the data. You can see the detailed documentation here.

@cocoindex.op.function()
def summarize_module(module_info: ModuleInfo) -> ModuleSummary:
    """Summarize a Python module."""
    return ModuleSummary(
        num_classes=len(module_info.classes),
        num_methods=len(module_info.methods),
    )

3. Plug the function into the flow

    # ...
    with data_scope["documents"].row() as doc:
        # ... after the extraction
        doc["module_summary"] = doc["module_info"].transform(summarize_module)

🎉 Now you are all set!

Run the following commands to set up and update the index.

python main.py cocoindex setup
python main.py cocoindex update

Extract Structured Data from PDF files

Ollama does not support PDF files directly as input, so we need to convert them to markdown first.

To do this, we can plug in a custom function to convert PDFs to markdown. See the full documentation here.

1. Define a function spec

A function spec configures the behavior of a specific instance of the function.

class PdfToMarkdown(cocoindex.op.FunctionSpec):
    """Convert a PDF to markdown."""

2. Define an executor class

The executor class is a class that implements the function spec. It is responsible for the actual execution of the function.

This class takes PDF content as bytes, saves it to a temporary file, and uses PdfConverter to extract the text content. The extracted text is then returned as a string, converting PDF to markdown format.

It is associated with the function spec via the spec: PdfToMarkdown field.

import tempfile

# The imports below assume the marker-pdf package for PDF parsing.
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.config.parser import ConfigParser
from marker.output import text_from_rendered

@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
class PdfToMarkdownExecutor:
    """Executor for PdfToMarkdown."""

    spec: PdfToMarkdown
    _converter: PdfConverter

    def prepare(self):
        config_parser = ConfigParser({})
        self._converter = PdfConverter(create_model_dict(), config=config_parser.generate_config_dict())

    def __call__(self, content: bytes) -> str:
        with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as temp_file:
            temp_file.write(content)
            temp_file.flush()
            text, _, _ = text_from_rendered(self._converter(temp_file.name))
            return text

You may wonder why we define a spec + executor here (instead of a standalone function). The main reason is that some heavy preparation work (initializing the parser) needs to be done before the function is ready to process real data.

3. Plug it into the flow

    # Note the binary = True for PDF
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="manuals", binary=True))
    modules_index = data_scope.add_collector()

    with data_scope["documents"].row() as doc:
        # Plug in your custom function here.
        doc["markdown"] = doc["content"].transform(PdfToMarkdown())

🎉 Now you are all set!

Run the following commands to set up and update the index.

python main.py cocoindex setup
python main.py cocoindex update

Community

We'd love to hear from the community! You can find us on GitHub and Discord.

If you like this post and our work, please support CocoIndex on GitHub with a star ⭐. Thank you with a warm coconut hug 🥥🤗.

We are officially open-sourced! 🎉

· 3 min read

CocoIndex is now open source

We are thrilled to announce the open-source release of CocoIndex, the world's first engine that supports both custom transformation logic and incremental updates specialized for data indexing.

CocoIndex combines custom transformation logic and incremental updates

CocoIndex is an ETL framework to prepare data for AI applications such as semantic search and retrieval-augmented generation (RAG). It offers a data-driven programming model that simplifies the creation and maintenance of data indexing pipelines, ensuring data freshness and consistency.

CocoIndex is now open source under the Apache License 2.0. This means the core functionality of CocoIndex is freely available for anyone to use, modify, and distribute. We believe that open sourcing CocoIndex will foster innovation, enable broader adoption, and create a vibrant community of contributors who can help shape its future. By choosing the Apache License 2.0, we're ensuring that both individual developers and enterprises can confidently build upon and integrate CocoIndex into their projects while maintaining the flexibility to create proprietary extensions.

🔥 Key Features

  • Data Flow Programming: Build indexing pipelines by composing transformations like Lego blocks, with built-in state management and observability.
  • Support Custom Logic: Plug in your choice of chunking, embedding, and vector stores. Extend with custom transformations like deduplication and reconciliation.
  • Incremental Updates: Smart state management minimizes re-computation by tracking changes at the file level, with future support for chunk-level granularity.
  • Python SDK: Built with a Rust core 🦀 for performance, exposed through an intuitive Python binding 🐍 for ease of use.

We are moving fast and a lot of features and improvements are coming soon.

🚀 Getting Started

  1. Installation: Install the CocoIndex Python library:

    pip install cocoindex
  2. Set Up Postgres with pgvector Extension: Ensure Docker Compose is installed, then start a Postgres database:

    docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up -d
  3. Define Your Indexing Flow: Create a flow to index your data. For example:

    @cocoindex.flow_def(name="TextEmbedding")
    def text_embedding(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
        data_scope["documents"] = flow_builder.add_source(
            cocoindex.sources.LocalFile(path="markdown_files"))
        doc_embeddings = data_scope.add_collector()

        with data_scope["documents"].row() as doc:
            doc["chunks"] = doc["content"].transform(
                cocoindex.functions.SplitRecursively(language="markdown", chunk_size=300, chunk_overlap=100))

            with doc["chunks"].row() as chunk:
                chunk["embedding"] = chunk["text"].transform(
                    cocoindex.functions.SentenceTransformerEmbed(model="sentence-transformers/all-MiniLM-L6-v2"))

                doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                       text=chunk["text"], embedding=chunk["embedding"])

        doc_embeddings.export(
            "doc_embeddings",
            cocoindex.storages.Postgres(),
            primary_key_fields=["filename", "location"],
            vector_index=[("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
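  4. Run the Flow: Set up and update the index with the CocoIndex CLI, as shown earlier on this page (this sketch assumes the flow above is saved in main.py):

    python main.py cocoindex setup
    python main.py cocoindex update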

For a detailed walkthrough, refer to our Quickstart Guide.

🤗 Community

We are super excited about community contributions of all kinds: code improvements, documentation updates, issue reports, and feature requests on GitHub, as well as discussions in our Discord.

  • GitHub: Please give our repository a star 🤗.
  • Documentation: Check out our documentation for detailed guides and API reference.
  • Discord: Join discussions, seek support, and share your experiences on our Discord server.
  • Social Media: Follow us on Twitter and LinkedIn for updates.

We would love to foster an inclusive, welcoming, and supportive environment. Contributing to CocoIndex should feel collaborative, friendly, and enjoyable for everyone. Together, we can build better AI applications through robust data infrastructure.

Looking forward to seeing what you build with CocoIndex!

Customizable Data Indexing Pipelines

· 4 min read

CocoIndex is the world's first open-source engine that supports both custom transformation logic and incremental updates specialized for data indexing. So, what is custom transformation logic?

Index-as-a-service (or RAG-as-a-service) tends to package a predesigned service and expose two endpoints to users: one to configure the source, and an API to read from the index. Many predefined pipelines for unstructured documents do this. The requirements are fairly simple: parse PDFs, perform some chunking and embedding, and dump into vector stores. This works well if your requirements are simple and primarily focused on document parsing.

We've talked to many developers across various verticals that require data indexing, and being able to customize logic is essential for high-quality data retrieval. For example:

  • Basic choices for pipeline components
    • which parser for different files?
    • how to chunk files (documents with structure normally have different optimal chunking strategies)?
    • which embedding model? which vector database?
  • What should the pipeline do?
    • Is it simple text embedding?
    • Is it building a knowledge graph?
    • Should it perform simple summarization for each source for retrieval without chunking?
  • What additional work is needed to improve pipeline quality?
    • Do we need deduplication?
    • Do we need to look up different sources to enrich our data?
    • Do we need to reconcile and align multiple documents?

Here we'll walk through some examples of the topology of index pipelines, and we can explore more in the future!

Basic embedding

Basic embedding pipeline

In this example, we do the following:

  1. Read from sources, for example, a list of PDFs
  2. For each source file, parse it to markdown with a PDF parser. There are lots of choices out there: Llama Parse, Unstructured IO, Gemini, DeepSeek, etc.
  3. Chunk all the markdown files. This is a way to break text into smaller chunks, or units, to help organize and process information. There are many options here: flat chunks, hierarchical chunks, secondary chunks, and many publications in this area. There are also special chunking strategies for different verticals - for code, tools like Tree-sitter can help parse and chunk based on syntax. Normally the best choice is tied to your document structure and requirements.
  4. Perform embedding for each chunk. There are lots of great choices: Voyage, OpenAI models, etc.
  5. Collect the embeddings in vector stores. Each embedding is normally stored along with metadata, for example, which file it belongs to. There are many great choices for vector stores: ChromaDB, Milvus, Pinecone, and many databases now support vector indexing, for example PostgreSQL (pgvector) and MongoDB.

Anthropic has published a great article about Contextual Retrieval that suggests the combination of vector-based search and TF-IDF.

One way to think about the data flow for such a pipeline is as a combination of TF-IDF and vector search.

In addition to preparing the vector embeddings as in the basic embedding example above, after parsing the source data we can do the following:

  1. For each document, extract keywords along with their frequencies.
  2. Across all documents, group by keyword and sum up the frequencies, storing the result in internal storage.
  3. For each keyword in each document, calculate the TF-IDF score using two inputs: the frequency in the current document and the total frequency across all documents. Store the keyword along with its TF-IDF score (if it is above a certain threshold) in a keyword index, as sketched below.
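
To make step 3 concrete, here is a minimal, framework-agnostic sketch of the TF-IDF computation. The function name and the smoothing choice are illustrative, not part of CocoIndex:

import math
from collections import Counter, defaultdict

def tfidf_scores(docs: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Compute a TF-IDF score per (document, keyword) from extracted keyword lists."""
    term_freq = {doc_id: Counter(keywords) for doc_id, keywords in docs.items()}
    doc_freq = Counter()  # number of documents each keyword appears in
    for counts in term_freq.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    scores: dict[str, dict[str, float]] = defaultdict(dict)
    for doc_id, counts in term_freq.items():
        total = sum(counts.values())
        for keyword, count in counts.items():
            tf = count / total  # frequency within the current document
            idf = math.log(n_docs / (1 + doc_freq[keyword])) + 1  # dampen keywords common to all documents
            scores[doc_id][keyword] = tf * idf
    return scores

Keywords scoring above your chosen threshold then go into the keyword index.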

At query time, we can query both the vector index and the keyword index and combine the results.

Simple Data Lookup/Enrichment example

Sometimes you want to enrich your data with metadata looked up from other sources. For example, if we want to create an index on diagnostic reports, which use ICD-10 (International Classification of Diseases, 10th Revision) codes to describe diseases, we can have a pipeline like this:

Simple Data Lookup/Enrichment

In this example, we do the following:

  • On the first path, build an ICD-10 dictionary by:

    1. For each ICD-10 description document, convert it to markdown with a PDF parser.
    2. Split into items.
    3. For each item, extract the ICD-10 code and description, and collect into a storage.
  • On the second path, for each report

    1. Parse it to markdown with a PDF parser.
    2. Split into items.
    3. For each item, look up the ICD-10 dictionary prepared above and enrich the item with the descriptions for its ICD-10 codes.

Now we have a vector index, built based on diagnostic reports enriched with ICD-10 descriptions.
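
As a rough, framework-agnostic sketch, the enrichment step on the second path could look like this. The regex, the dictionary contents, and the function name are illustrative:

import re

# Hypothetical dictionary built on the first path: ICD-10 code -> description.
ICD10_DESCRIPTIONS = {
    "E11.9": "Type 2 diabetes mellitus without complications",
}

# Simplified pattern for ICD-10-style codes such as "E11.9"; real codes have more rules.
ICD10_CODE = re.compile(r"\b[A-Z][0-9]{2}(?:\.[0-9A-Z]{1,4})?\b")

def enrich_item(item_text: str) -> str:
    """Append descriptions for any known ICD-10 codes found in a report item."""
    notes = [
        f"{code}: {ICD10_DESCRIPTIONS[code]}"
        for code in ICD10_CODE.findall(item_text)
        if code in ICD10_DESCRIPTIONS
    ]
    return item_text if not notes else item_text + "\n" + "\n".join(notes)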

Data Consistency in Indexing Pipelines

· 7 min read

Data Consistency in Indexing Pipelines

An indexing pipeline builds indexes derived from source data. The index should always converge to the current version of the source data. In other words, once a new version of the source data is processed by the pipeline, all data derived from previous versions should no longer exist in the target index storage. This is called the data consistency requirement for an indexing pipeline.

Data Indexing and Common Challenges

· 5 min read

Data Indexing Pipeline

At its core, data indexing is the process of transforming raw data into a format that's optimized for retrieval. Unlike an arbitrary application that may generate new source-of-truth data, indexing pipelines process existing data in various ways while maintaining trackability back to the original source. This intrinsic nature - being a derivative rather than source of truth - creates unique challenges and requirements.

CocoIndex - A Data Indexing Platform for AI Applications

· 4 min read

CocoIndex Cover Image

High-quality data tailored for specific use cases is essential for successful AI applications in production. The old adage "garbage in, garbage out" rings especially true for modern AI systems - when a RAG pipeline or agent workflow is built on poorly processed, inconsistent, or irrelevant data, no amount of prompt engineering or model sophistication can fully compensate. Even the most advanced AI models can't magically make sense of low-quality or improperly structured data.

Welcome to CocoIndex

· 2 min read

Aloha CocoIndex

Welcome to the official CocoIndex blog! We're excited to share our journey in building high-performance indexing infrastructure for AI applications.

CocoIndex is designed to provide exceptional velocity for AI systems that need fast, reliable access to their data. Whether you're building large language models, recommendation systems, or other AI applications, our goal is to make data indexing and retrieval as efficient as possible.