
20 posts tagged with "Examples"

CocoIndex examples and implementation guides


Extracting Intake Forms with BAML and CocoIndex

· 7 min read
Linghua Jin
CocoIndex Maintainer


This tutorial shows how to use BAML together with CocoIndex to build a data pipeline that extracts structured patient information from PDF intake forms. The BAML definitions describe the desired output schema and prompt logic, while CocoIndex orchestrates file input, transformation, and incremental indexing.

We’ll walk through setup, defining the BAML schema, generating the Python client, writing the CocoIndex flow, and running the pipeline. Throughout, we follow best practices (e.g. caching heavy steps) and cite documentation for key concepts.

BAML​

BAML, created by BoundaryML, is a typed prompt engineering language that makes LLM workflows predictable, testable, and production-safe. Instead of treating prompts as fragile strings, BAML lets developers define clear input parameters, output schemas, and model configurations—transforming prompts into strongly typed functions.

CocoIndex​

CocoIndex is a unified data processing engine built for AI-native applications. It lets you define transformations in one declarative workflow—then keeps everything continuously up to date with real-time, incremental processing. Designed for reliability and scale, CocoIndex ensures that every derived artifact (embeddings, metadata, extractions, models) always reflects the latest source data, making it the foundation for fast, consistent RAG, analytics, and automation pipelines.

Flow Overview​


  • Read PDF files from a directory.
  • For each file, call the BAML function to get a structured Patient.
  • Collect results and export to Postgres.

Prerequisites​

  1. Install Postgres if you don't have one.

  2. Install dependencies

    pip install -U cocoindex baml-py
  3. Create a .env file. You can copy it from .env.example first:

    cp .env.example .env

    Then edit the file to fill in your GEMINI_API_KEY.
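A minimal `.env` might look like the following (the value is a placeholder for your own key):

```
GEMINI_API_KEY=your-gemini-api-key
```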

Structured Extraction Component with BAML​

Create a baml_src/ directory for your BAML definitions. We’ll define a schema for patient intake data (nested classes) and a function that prompts Gemini to extract those fields from a PDF. Save this as baml_src/patient.baml.

Define Patient Schema​

Classes: We define BAML classes (Contact, Address, Insurance, etc.) to match the FHIR-inspired patient schema. These compile into typed Pydantic output models. Required fields are non-nullable; optional fields use ?.

Schema

class Contact {
  name string
  phone string
  relationship string
}

class Address {
  street string
  city string
  state string
  zip_code string
}

class Pharmacy {
  name string
  phone string
  address Address
}

class Insurance {
  provider string
  policy_number string
  group_number string?
  policyholder_name string
  relationship_to_patient string
}

class Condition {
  name string
  diagnosed bool
}

class Medication {
  name string
  dosage string
}

class Allergy {
  name string
}

class Surgery {
  name string
  date string
}

class Patient {
  name string
  dob string
  gender string
  address Address
  phone string
  email string
  preferred_contact_method string
  emergency_contact Contact
  insurance Insurance?
  reason_for_visit string
  symptoms_duration string
  past_conditions Condition[]
  current_medications Medication[]
  allergies Allergy[]
  surgeries Surgery[]
  occupation string?
  pharmacy Pharmacy?
  consent_given bool
  consent_date string?
}
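After code generation (covered below), these BAML classes become Pydantic models in baml_client/types.py. As a rough illustration of the shape of the generated types, here is a stdlib-dataclass approximation (the real generated code uses Pydantic and covers every class above; this sketch trims to a few fields):

```python
from dataclasses import dataclass
from typing import Optional

# Rough stdlib-dataclass approximation of the generated types;
# the real baml_client/types.py uses Pydantic models instead.
@dataclass
class Address:
    street: str
    city: str
    state: str
    zip_code: str

@dataclass
class Insurance:
    provider: str
    policy_number: str
    policyholder_name: str
    relationship_to_patient: str
    group_number: Optional[str] = None  # `string?` in BAML maps to Optional

@dataclass
class Patient:  # trimmed to a few fields for brevity
    name: str
    dob: str
    address: Address
    insurance: Optional[Insurance] = None  # `Insurance?` is optional

patient = Patient(
    name="Jane Doe",
    dob="1990-01-01",
    address=Address(street="1 Main St", city="Springfield", state="IL", zip_code="62704"),
)
print(patient.insurance is None)  # optional field defaults to None
```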

Define the BAML function to extract patient info from a PDF​

function ExtractPatientInfo(intake_form: pdf) -> Patient {
  client Gemini
  prompt #"
    Extract all patient information from the following intake form document.
    Please be thorough and extract all available information accurately.

    {{ intake_form }}

    Fill in with "N/A" for required fields if the information is not available.

    {{ ctx.output_format }}
  "#
}

We specify client Gemini and a prompt template. The special variable {{ intake_form }} injects the PDF, and {{ ctx.output_format }} tells BAML to expect the structured format defined by the return type. The prompt explicitly asks Gemini to extract all fields, filling “N/A” if missing.
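Conceptually, rendering substitutes the template variables into the prompt string. A toy model of that substitution (not BAML's actual template engine, which also attaches the PDF as a multimodal input and derives the output format from the return type):

```python
# Toy model of prompt-template rendering: replace each {{ name }}
# placeholder with its value. BAML's real engine does much more.
def render(template: str, variables: dict[str, str]) -> str:
    out = template
    for name, value in variables.items():
        out = out.replace("{{ " + name + " }}", value)
    return out

template = "Extract patient info.\n\n{{ intake_form }}\n\n{{ ctx.output_format }}"
rendered = render(template, {
    "intake_form": "<PDF content>",
    "ctx.output_format": "Answer as JSON matching the Patient schema.",
})
print("{{" not in rendered)  # all placeholders substituted
```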

Configure the LLM client to use Google’s Gemini model​

client<llm> Gemini {
  provider google-ai
  options {
    model gemini-2.5-flash
    api_key env.GEMINI_API_KEY
  }
}

Configure BAML generator​

In the baml_src/ folder, add generator.baml:

generator python_client {
  output_type python/pydantic
  output_dir "../"
  version "0.213.0"
}

The generator block tells baml-cli to create a Python client with Pydantic models in the parent directory.

When we run baml-cli generate, it compiles the .baml definitions into a baml_client/ Python package in your project root. It contains:

  • baml_client/types.py with Pydantic classes (Patient, etc.).
  • baml_client/sync_client.py and async_client.py with a callable b object. For example, b.ExtractPatientInfo(pdf) will return a Patient.

Continuous Data Transformation flow with incremental processing​

Next, we define the data transformation flow with CocoIndex. Once you declare the state and transformation logic, CocoIndex takes care of all state changes for you, from source to target.

CocoIndex Flow​

Declare Flow​

Declare a CocoIndex flow, connect it to the source, and add a data collector to gather processed data.

import os

import cocoindex

@cocoindex.flow_def(name="PatientIntakeExtractionBaml")
def patient_intake_extraction_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(
            path=os.path.join("data", "patient_forms"), binary=True
        )
    )

    patients_index = data_scope.add_collector()

This declares the flow, adds a local file source that reads the PDFs (as raw bytes) from data/patient_forms, and creates a collector that will gather the processed rows.


Define a custom function to use BAML extraction to transform a PDF​

import base64

import baml_py
from baml_client.async_client import b
from baml_client.types import Patient

@cocoindex.op.function(cache=True, behavior_version=1)
async def extract_patient_info(content: bytes) -> Patient:
    # BAML accepts PDFs as base64, so encode the raw bytes first.
    pdf = baml_py.Pdf.from_base64(base64.b64encode(content).decode("utf-8"))
    return await b.ExtractPatientInfo(pdf)

  • The extract_patient_info function is decorated with @cocoindex.op.function(cache=True, behavior_version=1). Setting cache=True causes CocoIndex to cache this function's outputs across incremental runs, so unchanged inputs skip rerunning the LLM. Bump behavior_version (starting at 1) whenever the prompt or logic changes, so cached results are invalidated and recomputed.
  • Inside the function, we convert the bytes to a BAML Pdf (via base64) and then call await b.ExtractPatientInfo(pdf). This returns a Patient instance (a Pydantic model generated from the BAML output type).

Process each document​

  1. Transform each doc with BAML.
  2. Collect the structured output.

with data_scope["documents"].row() as doc:
    doc["patient_info"] = doc["content"].transform(extract_patient_info)

    patients_index.collect(
        filename=doc["filename"],
        patient_info=doc["patient_info"],
    )

This iterates over each document, transforms doc["content"] (the raw bytes) with our extract_patient_info function, stores the result in a new field patient_info, and then collects a row with the filename and the extracted patient info.


Heavily nested data is common in real-world documents, and CocoIndex is natively designed to handle deeply nested data structures.


Export to Postgres​

patients_index.export(
    "patients",
    cocoindex.storages.Postgres(),
    primary_key_fields=["filename"],
)

We export the collected rows to Postgres. This creates and maintains a patients table keyed by filename, automatically updating or deleting rows when inputs change. Because CocoIndex tracks data lineage, it handles updates and deletions of source files incrementally.
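The export semantics can be sketched as "mirror the collected rows into the target, upserting by primary key and deleting stale keys." A conceptual model in plain Python (not CocoIndex's actual Postgres writer):

```python
# Conceptual sketch of keyed export: the target mirrors the collected
# rows, upserting by primary key and deleting keys that disappeared.
def sync_target(target: dict[str, dict], collected_rows: list[dict]) -> None:
    fresh = {row["filename"]: row for row in collected_rows}  # primary key
    # Upsert new or changed rows.
    for key, row in fresh.items():
        target[key] = row
    # Delete rows whose source file no longer exists.
    for key in list(target):
        if key not in fresh:
            del target[key]

table: dict[str, dict] = {}
sync_target(table, [{"filename": "a.pdf", "patient_info": {"name": "Jane"}}])
# On the next run a.pdf is gone and b.pdf appeared:
sync_target(table, [{"filename": "b.pdf", "patient_info": {"name": "Joe"}}])
print(sorted(table))  # a.pdf removed, b.pdf upserted
```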

Running the Pipeline​

Generate the BAML client code (required, in case you didn't do it earlier):

baml-cli generate

This generates the baml_client/ directory with Python code to call your BAML functions.

Update the index:

cocoindex update main

CocoInsight

I used CocoInsight (currently in free beta) to troubleshoot the index generation and understand the data lineage of the pipeline. It simply connects to your local CocoIndex server, with zero pipeline data retention.

cocoindex server -ci main

Composable by Default: Use the Best Components for Your Use Case​

While CocoIndex provides a rich set of building blocks for building LLM pipelines, it is fundamentally designed as an open system. Developers can bring in their preferred transformation components tailored to their domain — from document parsers to structured extractors like BAML.

This flexibility enables deep composability with other open ecosystems. The synergy between CocoIndex and BAML highlights this philosophy: BAML brings powerful prompt-driven schema extraction, while CocoIndex orchestrates and maintains the flow at scale. There’s no lock-in — developers and enterprises experimenting at the frontier can adapt, extend, and integrate freely.

Summary​

By combining BAML and CocoIndex, we get a robust, schema-driven workflow: BAML ensures the prompt-to-schema mapping is correct and type-safe, while CocoIndex handles data ingestion, transformation, and incremental storage. This example extracted patient intake information (names, insurance, medications, etc.) from PDFs, but the pattern applies to any structured data extraction task.

Index PDF elements - text, images with mixed embedding models and metadata

· 7 min read
Linghua Jin
CocoIndex Maintainer

Index PDF elements - text, images with mixed encoders and citations with metadata

PDFs are rich with both text and visual content — from descriptive paragraphs to illustrations and tables. This example builds an end-to-end flow that parses, embeds, and indexes both, with full traceability to the original page.

In this example, we split out both text and images, link them back to page metadata, and enable unified semantic search. We’ll use CocoIndex to define the flow, SentenceTransformers for text embeddings, and CLIP for image embeddings — all stored in Qdrant for retrieval.

Automated invoice processing with AI, Snowflake and CocoIndex - with incremental processing

· 17 min read
Dhilip Subramanian
Data & AI Practitioner, CocoIndex Community Contributor


I recently worked with a clothing manufacturer who wanted to simplify their invoice process. Every day, they receive around 20–22 supplier invoices in PDF format. All these invoices are stored in Azure Blob Storage. The finance team used to open each PDF manually and copy the details into their system. This took a lot of time and effort. On top of that, they already had a backlog of 8,000 old invoices waiting to be processed.

At first, I built a flow using n8n. This solution read the invoices from Azure Blob Storage, used Mistral AI to pull out the fields from each PDF, and then loaded the results into Snowflake. The setup worked fine for a while. But as the number of invoices grew, the workflow started to break. Debugging errors inside a no-code tool like n8n became harder and harder. That’s when I decided to switch to a coding solution.

I came across CocoIndex, an open-source ETL framework designed to transform data for AI, with support for real-time incremental processing. It allowed me to build a pipeline that was both reliable and scalable for this use case.

Fast iterate your indexing strategy - trace back from query to data

· 4 min read
Linghua Jin
CocoIndex Maintainer


We are launching a major feature in both CocoIndex and CocoInsight to help users iterate quickly on their indexing strategy and trace results all the way back to the source data, making the transformation experience more seamlessly integrated with the end goal.

We deeply care about making the overall experience seamless. With the new launch, you can define query handlers, so that you can easily run queries in tools like CocoInsight.

Incrementally Transform Structured + Unstructured Data from Postgres with AI

· 7 min read
Linghua Jin
CocoIndex Maintainer

PostgreSQL Product Indexing Flow

CocoIndex is one framework for building incremental data flows across structured and unstructured sources.

In CocoIndex, AI steps -- like generating embeddings -- are just transforms in the same flow as your other types of transformations, e.g. data mappings, calculations, etc.

Why One Framework for Structured + Unstructured?​

  • One mental model: Treat files, APIs, and databases uniformly; AI steps are ordinary ops.
  • Incremental by default: Use an ordinal column to sync only changes; no fragile glue jobs.
  • Consistency: Embeddings are always derived from the exact transformed row state.
  • Operational simplicity: One deployment, one lineage view, fewer moving parts.
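The "incremental by default" point above can be sketched with a cursor over an ordinal column: only rows whose ordinal advanced past the last seen value are reprocessed. A toy stdlib model (a conceptual sketch, not CocoIndex's actual change tracking):

```python
# Toy sketch of ordinal-based incremental sync: only rows whose ordinal
# (e.g. a monotonically increasing modified-at counter) exceeds the
# last seen value are reprocessed on each pass.
def sync_changes(rows: list[dict], last_ordinal: int) -> tuple[list[dict], int]:
    changed = [r for r in rows if r["ordinal"] > last_ordinal]
    new_last = max((r["ordinal"] for r in rows), default=last_ordinal)
    return changed, new_last

rows = [
    {"id": 1, "ordinal": 5},  # unchanged since the last pass
    {"id": 2, "ordinal": 9},  # updated after the cursor
]
changed, cursor = sync_changes(rows, last_ordinal=5)
print([r["id"] for r in changed])  # only the updated row is reprocessed
```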

This blog introduces the new PostgreSQL source and shows how to take data from a PostgreSQL table as the source, transform it with both AI models and non-AI calculations, and write the results into a new PostgreSQL table for semantic + structured search.

Build a Visual Document Index from multiple formats all at once - PDFs, Images, Slides - with ColPali

· 5 min read
Linghua Jin
CocoIndex Maintainer

Colpali

Do you have a messy collection of scanned documents, PDFs, academic papers, presentation slides, and standalone images — all mixed together with charts, tables, and figures — that you want to process into the same vector space for semantic search or to power an AI agent?

In this example, we’ll walk through how to build a visual document indexing pipeline using ColPali for embedding both PDFs and images — and then query the index using natural language.
We’ll skip OCR entirely — ColPali can directly understand document layouts, tables, and figures from images, making it perfect for semantic search across visual-heavy content.

Index Images with ColPali: Multi-Modal Context Engineering

· 7 min read
Linghua Jin
CocoIndex Maintainer

Colpali

We’re excited to announce that CocoIndex now supports native integration with ColPali — enabling multi-vector, patch-level image indexing using cutting-edge multimodal models.

With just a few lines of code, you can now embed and index images with ColPali’s late-interaction architecture, fully integrated into CocoIndex’s composable flow system.

Bring your own building blocks: Export anywhere with Custom Targets

· 8 min read
Linghua Jin
CocoIndex Maintainer

Custom Targets

We’re excited to announce that CocoIndex now officially supports custom targets — giving you the power to export data to any destination, whether it's a local file, cloud storage, a REST API, or your own bespoke system.

This new capability unlocks a whole new level of flexibility for integrating CocoIndex into your pipelines and allows you to bring your own "building blocks" into our flow model.

Indexing Faces for Scalable Visual Search - Build your own Google Photo Search

· 5 min read
Linghua Jin
CocoIndex Maintainer

Face Detection

CocoIndex supports multi-modal processing natively: it can process both text and images with the same programming model and observe them in the same user flow (in CocoInsight).

In this blog, we’ll walk through a comprehensive example of building a scalable face recognition pipeline using CocoIndex. We’ll show how to extract and embed faces from images, structure the data relationally, and export everything into a vector database for real-time querying.

CocoInsight can now visualize identified sections of an image based on bounding boxes, which makes it easier to understand and evaluate AI extractions, seamlessly attaching computed features in the context of unstructured visual data.

Build Real-Time Product Recommendation Engine with LLM and Graph Database

· 8 min read
Linghua Jin
CocoIndex Maintainer

Product Graph

In this blog, we will build a real-time product recommendation engine with an LLM and a graph database. In particular, we will use the LLM to understand the category (taxonomy) of each product and to enumerate complementary products that users are likely to buy together with the current one (e.g., pencil and notebook). We will use the graph to explore relationships between products, which can further be used for product recommendations or labeling.

Build text embeddings from Google Drive for RAG

· 9 min read

Text Embedding from Google Drive

In this blog, we will show you how to use CocoIndex to build text embeddings from Google Drive for RAG, step by step, including how to set up a Google Cloud service account for Google Drive. CocoIndex is an open-source framework that builds fresh indexes from your data for AI. It is designed to be easy to use and extend.