# CocoIndex Docs — full text

> The complete CocoIndex documentation and example walkthroughs concatenated into one file for LLMs and agents. Each section below is one docs page or example page, in reading order. For a lighter index of pages and examples, see /docs/llms.txt.

---

# CocoIndex overview

Source: https://cocoindex.io/docs/getting_started/overview/

CocoIndex is an ultra-performant framework for building data processing pipelines for AI workloads, with built-in incremental processing.

## Programming model

CocoIndex uses a *declarative*, state-driven programming model. You specify *what* your target should look like as a function of your source data — not *how* to incrementally update it. CocoIndex handles change detection and applies only the necessary updates automatically.

If you’ve used React, spreadsheets, or materialized views, this will feel familiar:
- **React**: declare UI as a function of state → React re-renders what changed
- **Spreadsheets**: declare formulas → cells recompute when inputs change
- **Materialized views**: declare a query → the view refreshes when source tables change
- **CocoIndex**: declare [target states](../programming_guide/target_state) as a function of source → CocoIndex syncs what changed

## CocoIndex features

### High-performance Rust 🦀 engine
CocoIndex executes pipelines on a high-performance Rust engine, delivering resilient and scalable data processing.

### Easy to code
- Write simple transformations in Python without learning new DSLs
- Write batch-style code without worrying about deltas — CocoIndex runs it incrementally in both batch and live mode, continuously updating results. No separate DAGs, operators, or orchestration logic required.

### Incremental & low-latency
CocoIndex tracks fine-grained dependencies and recomputes only what is affected by a change in the input data or the code. End-to-end updates can drop from hours or days to seconds without sacrificing correctness.

### Full lineage & explainability
Every processing step, intermediate result, and execution path is inspectable. This supports transparency obligations such as those in the EU AI Act, as well as enterprise auditability and traceability requirements.

### Open integration model
Sources and targets plug in through a standard, open interface (no vendor lock-in). Leverage the full Python ecosystem for models, functions, and libraries.

### High throughput + controlled concurrency
Pipelines automatically parallelize with managed concurrency and request batching — reducing GPU cost, RPC fanout, and end-to-end latency.

### Fault-tolerant runtime
The engine gracefully retries transient failures and resumes from previous progress after interruptions — eliminating manual backfills and replays.

### Low operational overhead
CocoIndex removes the need for elaborate plumbing: refreshing datasets, maintaining state, handling backfills, ensuring correctness, coordinating GPUs, scaling workers, and managing infra are all handled by the engine.

## Incremental data processing

CocoIndex continuously maintains and tracks state while processing only new or changed data. It is designed to support incremental processing from day zero.

What incremental processing means:
- Avoid unnecessarily recomputing work, based on multi-level change detection:
  - **Component level**: only reprocess source items with changes
  - **Function level**: within an item’s processing, memoize expensive function calls and reuse when possible
  - **Target level**: apply minimum necessary changes (insertions, updates, deletions) to the target
- Support multiple mechanisms to capture source changes (CDC, poll-based) out of the box

You write simple batch-style code — no delta logic, no state handling. CocoIndex automatically runs your pipeline incrementally and keeps the output up to date for serving, training, or feature computation.
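
As a rough illustration of component-level change detection, the sketch below fingerprints each source item and compares the result with fingerprints recorded on a previous run. The names `fingerprint` and `plan_reprocessing` are hypothetical and not part of the CocoIndex API; the engine's actual mechanism may differ.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a stable content hash used to detect changes."""
    return hashlib.sha256(content).hexdigest()


def plan_reprocessing(
    previous: dict[str, str], current_items: dict[str, bytes]
) -> tuple[list[str], list[str]]:
    """Compare fingerprints from the previous run with the current source.

    Returns (keys to process, keys whose outputs should be removed).
    """
    current = {key: fingerprint(data) for key, data in current_items.items()}
    # New items and items whose content hash differs need processing.
    to_process = sorted(k for k, fp in current.items() if previous.get(k) != fp)
    # Items that disappeared from the source need their outputs deleted.
    to_remove = sorted(k for k in previous if k not in current)
    return to_process, to_remove


previous_run = {
    "a.pdf": fingerprint(b"alpha"),
    "b.pdf": fingerprint(b"beta"),
    "c.pdf": fingerprint(b"gamma"),
}
source_now = {"a.pdf": b"alpha", "b.pdf": b"beta v2", "d.pdf": b"delta"}

to_process, to_remove = plan_reprocessing(previous_run, source_now)
print(to_process)  # ['b.pdf', 'd.pdf']
print(to_remove)   # ['c.pdf']
```

The unchanged item `a.pdf` appears in neither list, which is the work an incremental run avoids.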

## Next steps

- [Install CocoIndex](./installation) and follow the [Quickstart](./quickstart) to build your first pipeline in 5 minutes
- Read [Core Concepts](../programming_guide/core_concepts) for the mental model behind CocoIndex

---

# CocoIndex installation

Source: https://cocoindex.io/docs/getting_started/installation/

## Install Python and pip

Before following the steps in this guide:

1. Install [Python](https://wiki.python.org/moin/BeginnersGuide(2f)Download.html). We support Python 3.11 to 3.13.
2. Install [pip](https://pip.pypa.io/en/stable/installation/) - a Python package installer

## Install CocoIndex

### Using pip

```sh
pip install -U cocoindex
```

### Using uv

```sh
uv add cocoindex
```

### Using Poetry

```sh
poetry add cocoindex
```

Or specify in `pyproject.toml`:

```toml
[tool.poetry.dependencies]
cocoindex = { version = "^1.0" }
```

## System requirements

CocoIndex is supported on the following operating systems:

- **macOS**: 10.12+ on x86_64, 11.0+ on arm64
- **Linux**: x86_64 or arm64, glibc 2.28+ (e.g., Debian 10+, Ubuntu 18.10+, Fedora 29+, CentOS/RHEL 8+)
- **Windows**: 10+ on x86_64

---

# CocoIndex quickstart

Source: https://cocoindex.io/docs/getting_started/quickstart/

In this tutorial, we'll build a simple app that converts PDF files to Markdown and saves them to a local directory.

## Overview

1. Read PDF files from a local directory
2. Convert each file to Markdown using Docling
3. Save the Markdown files to an output directory (as **target states**)

You declare the transformation logic with native Python without worrying about changes.

Think: **target_state = transformation(source_state)**

When your source data is updated, or your processing logic changes (for example, switching parsers or tweaking conversion settings), CocoIndex reprocesses only what is affected and keeps your Markdown files up to date.

## Setup

1. Install CocoIndex (see [Installation](./installation) for other package managers) and the Docling dependency:

    ```bash
    pip install -U cocoindex docling
    ```

2. Create a new directory for your project:

    ```bash
    mkdir cocoindex-quickstart
    cd cocoindex-quickstart
    ```

3. Create a `pdf_files/` directory and add your PDF files:

    ```bash
    mkdir pdf_files
    ```
    You can download sample PDF files from the [git repo](https://github.com/cocoindex-io/cocoindex/tree/main/examples/pdf_to_markdown).

4. Create a `.env` file to configure the database path:

    ```bash
    echo "COCOINDEX_DB=./cocoindex.db" > .env
    ```

## Define the app

At a high level, the app has three layers:

1. **App** — binds the pipeline function to concrete input and output paths
2. **Main function** — finds PDF files and mounts one processing component per file
3. **File processing** — converts one PDF to Markdown and declares the output file

Create a new file `main.py`. We'll write the code in the opposite order (file processing first, then the main function, then the App) so that each Python symbol exists before it is referenced.

### Define file processing

This function converts a single PDF to Markdown:

```python title="main.py"
import pathlib

import cocoindex as coco
from cocoindex.connectors import localfs
from cocoindex.resources.file import PatternFilePathMatcher
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

_pipeline_options = PdfPipelineOptions(
    accelerator_options=AcceleratorOptions(device=AcceleratorDevice.CPU)
)
_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=_pipeline_options)
    }
)

@coco.fn(memo=True)
def process_file(
    file: localfs.File,
    outdir: pathlib.Path,
) -> None:
    markdown = _converter.convert(
        file.file_path.resolve()
    ).document.export_to_markdown()
    outname = file.file_path.path.stem + ".md"
    localfs.declare_file(outdir / outname, markdown, create_parent_dirs=True)
```

- **`localfs.File`** — A file object returned by `localfs.walk_dir()`, implementing the [`FileLike`](../common_resources/data_types#filelike) base class. See the [localfs connector](../connectors/localfs) for full details.
- **`memo=True`** — Caches results; unchanged files are skipped on re-runs
- **`localfs.declare_file()`** — Declares a file [target state](../programming_guide/target_state); auto-deleted if source is removed. See [localfs as target](../connectors/localfs#as-target) for the full API.

### Define the main function

```python title="main.py"
@coco.fn
async def app_main(sourcedir: pathlib.Path, outdir: pathlib.Path) -> None:
    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.pdf"]),
    )
    await coco.mount_each(process_file, files.items(), outdir)
```

`mount_each()` mounts one processing component per file. Each item from `files.items()` is a `(key, file)` pair — the key (the file's relative path) becomes the component subpath automatically.

It's up to you to pick the process granularity — it can be at directory level, at file level, or at page level. In this example, because we want to independently convert each file to Markdown, the file level is the most natural choice.
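
To make the `(key, file)` shape concrete, here is a standard-library sketch that pairs each matching file with its path relative to the source directory. `iter_keyed_files` is a hypothetical helper, not the `localfs` implementation. It only illustrates why a relative path works as a stable key: the key stays the same if the source directory itself is moved.

```python
import pathlib
import tempfile
from collections.abc import Iterator


def iter_keyed_files(
    root: pathlib.Path, pattern: str = "**/*.pdf"
) -> Iterator[tuple[str, pathlib.Path]]:
    """Yield (key, path) pairs, where key is the path relative to root."""
    for path in sorted(root.glob(pattern)):
        if path.is_file():
            yield path.relative_to(root).as_posix(), path


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "reports").mkdir()
    (root / "intro.pdf").write_bytes(b"%PDF-")
    (root / "reports" / "q1.pdf").write_bytes(b"%PDF-")
    (root / "notes.txt").write_text("not a PDF")

    keys = [key for key, _ in iter_keyed_files(root)]

print(keys)  # ['intro.pdf', 'reports/q1.pdf']
```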

### Create the App

```python title="main.py"
app = coco.App(
    "PdfToMarkdown",
    app_main,
    sourcedir=pathlib.Path("./pdf_files"),
    outdir=pathlib.Path("./out"),
)
```
This defines a CocoIndex App — the top-level runnable unit in CocoIndex. It binds the main function with its arguments.

## Run the pipeline

From the project directory, run:

```bash
cocoindex update main.py
```

CocoIndex will:

1. Create the `out/` directory
2. Convert each PDF in `pdf_files/` to Markdown in `out/`

Check the output:

```bash
ls out/
# example.md (one .md file for each input PDF)
```

## Incremental updates

The power of CocoIndex is **incremental processing**. Try these:

**Add a new file:**

Add a new PDF to `pdf_files/`, then run:

```bash
cocoindex update main.py
```

Only the new file is processed.

**Modify a file:**

Replace a PDF in `pdf_files/` with an updated version, then run:

```bash
cocoindex update main.py
```

Only the changed file is reprocessed.

**Delete a file:**

```bash
rm pdf_files/example.pdf
cocoindex update main.py
```

The corresponding Markdown file is automatically removed.

## Next steps

- Read [Core Concepts](../programming_guide/core_concepts) to understand the mental model — state-driven programming, processing components, and memoization
- Dive into the [Programming Guide](../programming_guide/app), starting with Apps, to learn how to build more complex pipelines
- Browse more [examples](https://github.com/cocoindex-io/cocoindex/tree/main/examples) for real-world patterns (text embedding, RAG, knowledge graphs)

---

# Use CocoIndex with AI coding agents

Source: https://cocoindex.io/docs/getting_started/ai_coding_agents/

CocoIndex ships an official **agent skill** that teaches AI coding agents how to build CocoIndex v1 pipelines correctly — core concepts, API surface, common patterns, connectors, and best practices, all in one place.

The skill lives in the [`skills/cocoindex/`](https://github.com/cocoindex-io/cocoindex/tree/main/skills/cocoindex) directory of the main repo.

## Why a skill?

CocoIndex v1 is a fundamental redesign from v0. Without context, an LLM trained on older snapshots tends to hallucinate the wrong API (the v0 flow-builder DSL, deprecated decorators, missing `@coco.fn`, unstable component paths). The skill gives the agent a single concise reference covering:

- **Mental model** — `target_state = transform(source_state)`, declarative pipelines, [processing components](../programming_guide/processing_component).
- **Core APIs** — `@coco.fn`, `mount` / `use_mount` / `mount_each`, `ContextKey`, `@coco.lifespan`, target state declarations.
- **Common patterns** — file transformation, vector embedding, LLM extraction, knowledge graphs.
- **Connectors** — PostgreSQL, SQLite, LanceDB, Qdrant, SurrealDB, Apache Doris, LocalFS, S3, Kafka, Google Drive.
- **Best practices** — memoization, stable component paths, vector schema with `Annotated[NDArray, KEY]`.

## Use with Claude Code

Claude Code auto-loads skills from `.claude/skills/` in the current project, or from `~/.claude/skills/` globally.

Project-local install (recommended for repos that build with CocoIndex):

```sh
mkdir -p .claude/skills
git clone --depth=1 https://github.com/cocoindex-io/cocoindex.git /tmp/cocoindex-skill
cp -r /tmp/cocoindex-skill/skills/cocoindex .claude/skills/
```

Global install:

```sh
mkdir -p ~/.claude/skills
git clone --depth=1 https://github.com/cocoindex-io/cocoindex.git /tmp/cocoindex-skill
cp -r /tmp/cocoindex-skill/skills/cocoindex ~/.claude/skills/
```

Once installed, Claude Code picks up the skill automatically when you ask it to build or modify a CocoIndex pipeline.

## Use with other agents

The skill is plain Markdown — `SKILL.md` plus a few reference files. Any agent that accepts file-based context will work:

- **Cursor** — copy `skills/cocoindex/SKILL.md` into `.cursor/rules/cocoindex.md`.
- **Generic AGENTS.md / CLAUDE.md** — concatenate or `@import` `SKILL.md` from your top-level agent instructions file.
- **Custom RAG / agent stack** — index the `skills/cocoindex/` directory like any other documentation source.

## What's inside

```
skills/cocoindex/
├── SKILL.md                       # main entry — concepts, APIs, patterns
└── references/
    ├── api_reference.md           # quick API reference
    ├── connectors.md              # full connector reference
    ├── patterns.md                # detailed pipeline patterns
    ├── setup_project.md           # project setup
    └── setup_database.md          # database setup
```

## Contributing

The skill is versioned alongside the codebase — when the API changes, the skill changes with it. PRs that improve clarity, add examples, or cover new connectors are welcome. See the [contributing guide](../contributing/guide/) for how to get started.

---

# CocoIndex core concepts

Source: https://cocoindex.io/docs/programming_guide/core_concepts/

## Incremental processing

When processing data and storing results in targets (e.g., a database) for knowledge retrieval by AI agents or search systems, both your data and code evolve over time. Reprocessing everything after every change is expensive, slow, and disruptive. Incremental processing solves this by only processing what's changed, and applying those changes to the target.

Implementing incremental processing by hand is complicated because:

- You need to figure out what has changed and what has not.

- You need to think in the time dimension and carefully compute the “delta”, e.g., what needs to be inserted, updated, or deleted in your target.

- You need to track and preserve intermediate states to avoid full recomputation when possible.

- You need to evolve the target schema and backfill data when the code logic changes.

With so many moving parts, when something goes wrong, it is difficult to debug.

## State-driven programming

CocoIndex uses a *declarative* programming model — instead of programming *how* to incrementally process your data and apply changes to your target, you declaratively specify *what* your target should look like, based on the current state of your data source.

**Info**

If you've used React, spreadsheets, or materialized views, this mental model will feel familiar:

- **Spreadsheets**: You declare formulas in cells. When any upstream cell changes, downstream cells automatically recompute to reflect the new state.

- **React**: You declare your UI as a function of state. When state changes, React automatically re-renders the UI to match.

- **Materialized Views**: You declare a query (e.g., SQL) that runs on source tables. When input data changes, the view automatically refreshes to match.

CocoIndex uses the above ideas to formulate a state-driven paradigm for long-running, side-effectful data processing pipelines with the following key concepts:

- **Data transformations**: You read the current state from your source and perform a series of transformations. For example, converting PDFs to markdown files, extracting features or structures, or mapping data to fit a particular schema.

- **Target states**: You output the results to a target such as a relational database, vector database, or file system. Note that the target state is a pure function of the source state (i.e., it has no other side effects). **TargetState = Transform(SourceState)**

- **Incremental processing**: Under the hood, when the source state changes, CocoIndex incrementally processes only the changes needed to update the target, so you don't have to manage it yourself.

## App

An ***app*** is the top-level executable entity in CocoIndex. In an app, you write code to:

- Read state from sources

- Transform the data

- Declare ***target states***: i.e., what the output should look like

CocoIndex then syncs these target states to external systems (Postgres, vector databases, etc.).

For example, an app might read PDFs from a drive, convert them to Markdown, and write the results to a folder.

## Processing Component

In practice, your source often contains many items — files, rows, or entities — each of which can be processed independently. A ***processing component*** groups an item's processing together with its target states. Each processing component runs independently and applies its target states to the external system as soon as it completes, without waiting for the rest of the app.

For example, if you have many files in a drive and want to process them file by file, your processing component would operate at the file level.

Taking this further, suppose you want to split each file into chunks and create embedding vectors for indexing. The processing component can still operate at the file level, but each component now produces multiple target states (one per chunk). CocoIndex applies all target state changes for a file (inserting, updating, and deleting rows in the target database) as a unit: all writes happen after processing completes, and each target backend applies its batch atomically when supported (e.g., within a database transaction).

Let's see what happens when the source state changes in different ways:

  
- When a new file (`c.md`) is added, a new processing component is created for it. Once execution completes, CocoIndex applies the new target states, inserting `vector5` and `vector6` into the vector store.

- When file `b.md` is updated (say its content is reduced to just one chunk instead of two), the processing component's target state changes from `vector3` and `vector4` to just `vector5`. CocoIndex deletes `vector3` and `vector4`, then inserts `vector5` into the vector database, all within a single transaction.

- When file `b.md` is deleted from the source folder, CocoIndex deletes its associated target states (`vector3` and `vector4`) from the vector database in a single transaction.
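
The three scenarios can be mimicked with plain sets. This sketch is illustrative only: `component_delta` is a hypothetical helper, and the vector names are labels rather than real embeddings. For one processing component, it computes which target states to delete and which to insert so that both can be applied together.

```python
def component_delta(
    previous: set[str], declared: set[str]
) -> tuple[set[str], set[str]]:
    """Return (to_delete, to_insert) for a single processing component."""
    return previous - declared, declared - previous


# Target states owned by each component (one per file) after the last run.
owned = {"a.md": {"vector1", "vector2"}, "b.md": {"vector3", "vector4"}}

# File added: c.md has no previous target states, so everything is inserted.
added = component_delta(set(), {"vector5", "vector6"})

# File updated: b.md now yields a single, different chunk.
updated = component_delta(owned["b.md"], {"vector5"})

# File deleted: b.md declares nothing, so its target states are removed.
deleted = component_delta(owned["b.md"], set())

for label, (to_delete, to_insert) in [
    ("added", added),
    ("updated", updated),
    ("deleted", deleted),
]:
    print(label, "delete:", sorted(to_delete), "insert:", sorted(to_insert))
```

Because the delta is scoped to one component, a change to `b.md` never touches the target states owned by `a.md`.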
  

## Function memoization: skip unchanged computations

***Function memoization*** is a technique that allows skipping a function when its input and code are unchanged from a previous run. It is essential for incremental processing — without it, every run would require full recomputation.

In CocoIndex, both processing components and transforms are expressed as functions, so function memoization can be enabled at either level. Using the chunk-embed example:

- *Processing component level*: If a file hasn't changed and the processing logic hasn't changed, the entire processing component is skipped.

- *Transform level*: If the input to the "embed" transform (the chunk text) hasn't changed and the transform logic (e.g., the model) hasn't changed, that specific embedding computation is skipped.

See [Function Memoization](./function#memoization) for more details.

Here's how memoization behaves in different scenarios:

  
**When input file `b.md` changes:**

- The input state `a.md` is unchanged, so the 1st processing component is entirely reused without reprocessing.

- The input state `b.md` changed, so the 2nd processing component must be reprocessed. After splitting, suppose we get two chunks: `chunk3` (identical to before) and `chunk5` (new).

  - `Embed(chunk3)` was memoized previously, so its cached result is reused.

  - `Embed(chunk5)` is new and must be computed.

**When the "Split into chunks" logic changes:**

- All processing components must be reprocessed since the logic changed.

- For the 1st processing component, the new logic produces the same chunks as before. The memoized `Embed` results are reused without recomputation.

- For the 2nd processing component, the new logic produces different chunks (`chunk5` and `chunk6`), so `Embed` must be invoked on them.

As these examples show, memoization can save expensive computations even when logic changes, as long as the intermediate results remain the same.
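
The first scenario can be approximated with a small cache, assuming results are keyed by a logic version plus a hash of the input. `memoized`, `toy_embed`, and the version strings are hypothetical. CocoIndex detects code changes itself, whereas this sketch bumps the version by hand.

```python
import hashlib
from collections.abc import Callable

_cache: dict[tuple[str, str], list[float]] = {}
calls: list[str] = []  # records which inputs were actually computed


def memoized(
    logic_version: str, fn: Callable[[str], list[float]]
) -> Callable[[str], list[float]]:
    """Wrap fn so results are reused when input and logic version both match."""

    def wrapper(text: str) -> list[float]:
        key = (logic_version, hashlib.sha256(text.encode()).hexdigest())
        if key not in _cache:
            calls.append(text)
            _cache[key] = fn(text)
        return _cache[key]

    return wrapper


def toy_embed(text: str) -> list[float]:
    """Stand-in for an expensive embedding call."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


embed = memoized("embed-v1", toy_embed)

# First run: b.md splits into chunk3 and chunk4.
for chunk in ["chunk3", "chunk4"]:
    embed(chunk)

# Second run after b.md changes: chunk3 is identical, chunk5 is new.
calls.clear()
for chunk in ["chunk3", "chunk5"]:
    embed(chunk)

print(calls)  # ['chunk5']
```

Wrapping the same function under a different version string (for example `"embed-v2"`) produces new cache keys, so previously cached inputs are computed again.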
  

## Next steps

Now that you understand the mental model, dive into the Programming Guide to learn how to use these concepts in code:

- [App](./app) — creating and running pipelines
- [Target State](./target_state) — declaring what should exist in external systems
- [Processing Component](./processing_component) — structuring work and mounting components
- [Function](./function) — the `@coco.fn` decorator, memoization, and change detection

---

# The CocoIndex App

Source: https://cocoindex.io/docs/programming_guide/app/

**Note — Prerequisite**
This page builds on [Core Concepts](./core_concepts), which introduces the App and the source → transform → target-state model **with diagrams**. If the App model feels abstract, start there.

An **App** is the top-level runnable unit in CocoIndex.
It names your pipeline and binds a main function with its parameters. When you call `app.update()`, CocoIndex runs that main function as the root [processing component](./processing_component), which can mount child processing components to do work and declare target states.

## Creating an app

To create an App, provide:

1. **An `AppConfig`** (or just a name string) — identifies the pipeline
2. **A main function** — the entry point for your pipeline
3. **Arguments** — any additional arguments to pass to the main function

```python
import pathlib

import cocoindex as coco

@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    ...  # your pipeline logic

app = coco.App(
    coco.AppConfig(name="MyPipeline"),
    app_main,
    sourcedir=pathlib.Path("./data"),
)
```

You can also pass just a name string instead of `AppConfig`:

```python
app = coco.App("MyPipeline", app_main, sourcedir=pathlib.Path("./data"))
```

**Tip**
The main function is usually async. See [How Sync and Async Work Together](./sdk_overview#how-sync-and-async-work-together) for details.

## Updating an app

Call `update()` to execute the pipeline. It returns an `UpdateHandle` that is also `Awaitable`, so the simplest usage stays the same:

```python
# Async — await the result directly (backward-compatible)
result = await app.update()
```

```python
# Sync (blocking) API
result = app.update_blocking()
```

**Parameters:**

- `live`: keeps the app running after the initial scan so live components can continue watching for changes. See [Live Mode](./live_mode).
- `report_to_stdout`: prints periodic progress updates during execution. Pass `True` for the default refresh interval, or a `timedelta` to set a custom interval.
- `full_reprocess`: invalidates existing caches and reprocesses everything. All components re-execute and all target states are re-applied, even if they haven't changed.

When you update an App, CocoIndex:

1. Runs the lifespan setup (if not already done)
2. Executes the main function (the root processing component), which mounts child processing components
3. Compares the declared target states with the previous run and applies only the necessary changes to external systems

Given the same logic and inputs, updates are repeatable. When logic or inputs change, only the affected parts re-execute.

To watch progress beyond the `report_to_stdout` flag, the `UpdateHandle` returned by `app.update()` also exposes stats programmatically — poll with `handle.stats()` or stream with `handle.watch()`. For those structured APIs, and for splitting a run into separately-reported scopes with `coco.stats_group()`, see [Progress monitoring](../advanced_topics/progress_monitoring).

## How an app runs

An App is the top-level runner and entry point. A **processing component** is the unit of incremental execution *within* an app.

- Your app's main function runs as the **root processing component** at the root path.
- Each call to `mount()` or `use_mount()` declares a **child processing component** at a child path. Sugar APIs like `mount_each()` and `mount_target()` also create child components.
- Each processing component declares a set of target states, and CocoIndex syncs them as a unit when that component finishes — all writes happen after processing completes, and each target backend applies its batch atomically when supported.

This is why `app.update()` does not "run everything from scratch": CocoIndex uses the component path tree to decide what can be reused and what must re-run.

For example, an app that processes files might mount one component per file — the per-file fan-out from [Core Concepts](./core_concepts#processing-component), shown here as a path tree:

```text
(root)                         ← app_main component
├── "setup"                    ← declare_dir_target component
└── "process"
    ├── "hello.md"             ← process_file component
    └── "world.md"             ← process_file component
```

See [Processing Component](./processing_component) for how mounting and component paths define these boundaries.
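
To see how individual component paths add up to a tree like the one above, the sketch below nests a flat list of paths and renders it. Representing a path as a tuple of strings is an assumption made for illustration; `build_tree` and `render` are hypothetical helpers, not CocoIndex APIs.

```python
def build_tree(paths: list[tuple[str, ...]]) -> dict:
    """Nest component paths into a dict-of-dicts tree."""
    tree: dict = {}
    for path in paths:
        node = tree
        for part in path:
            node = node.setdefault(part, {})
    return tree


def render(tree: dict, prefix: str = "") -> list[str]:
    """Render the tree with box-drawing connectors, one line per component."""
    lines = []
    items = list(tree.items())
    for i, (name, children) in enumerate(items):
        last = i == len(items) - 1
        connector = "└── " if last else "├── "
        lines.append(f'{prefix}{connector}"{name}"')
        lines.extend(render(children, prefix + ("    " if last else "│   ")))
    return lines


component_paths = [
    ("setup",),
    ("process", "hello.md"),
    ("process", "world.md"),
]
tree = build_tree(component_paths)
print("(root)")
print("\n".join(render(tree)))
```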

## Database path

CocoIndex needs a database path (`db_path`) to store its internal state. This database tracks target states and memoized results from previous runs, enabling CocoIndex to compute what changed and apply only the necessary updates.

The simplest way to configure the database path is via the `COCOINDEX_DB` environment variable:

```bash
export COCOINDEX_DB=./cocoindex.db
```

With `COCOINDEX_DB` set, you can create and run apps without any additional configuration:

```python
import cocoindex as coco

@coco.fn
def app_main() -> None:
    ...  # your pipeline logic

app = coco.App("MyPipeline", app_main)
app.update_blocking()  # Uses COCOINDEX_DB for storage
```

For details on what the internal database stores and how to tune its LMDB settings (e.g., increasing the maximum database size beyond 4 GiB), see [Internal Storage](../advanced_topics/internal_storage).

## Lifespan (optional)

A **lifespan function** defines the CocoIndex runtime lifecycle: its setup runs when the runtime starts (automatically before the first `app.update()`), and its cleanup runs when the runtime stops. Use it to configure CocoIndex settings programmatically or to initialize shared resources that processing components can reuse.

**Tip**
If you only need to set the database path, using the `COCOINDEX_DB` environment variable is simpler than defining a lifespan function.

### Defining a lifespan

Use the `@coco.lifespan` decorator to register a lifespan function. By default, all apps share the same lifespan (unless you explicitly specify an app in a different [*Environment*](../advanced_topics/multiple_environments)). The function receives an `EnvironmentBuilder` for configuration and uses `yield` to separate setup from cleanup:

```python
import pathlib
from typing import AsyncIterator
import cocoindex as coco

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    # Configure CocoIndex's internal database location (overrides COCOINDEX_DB if set)
    builder.settings.db_path = pathlib.Path("./cocoindex.db")
    # Setup: initialize resources here
    yield
    # Cleanup happens automatically when the context exits
```

Setting `db_path` in the lifespan takes precedence over the `COCOINDEX_DB` environment variable. If neither is provided, CocoIndex will raise an error.

The lifespan function can be sync or async:

```python
import pathlib
from typing import Iterator

import cocoindex as coco

@coco.lifespan
def coco_lifespan(builder: coco.EnvironmentBuilder) -> Iterator[None]:
    builder.settings.db_path = pathlib.Path("./cocoindex.db")
    yield
```

You can also use the lifespan to provide resources (like database connections) that processing components can access. See [Context](./context) for details on sharing resources across your pipeline.
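
The setup, `yield`, cleanup split has the same shape as a generator-based context manager in the Python standard library. The sketch below uses `contextlib.contextmanager` as a stand-in driver. How CocoIndex drives the lifespan internally is not shown here, and `toy_lifespan` is a hypothetical example.

```python
from collections.abc import Iterator
from contextlib import contextmanager

events: list[str] = []


def toy_lifespan() -> Iterator[None]:
    """Generator with setup before the yield and cleanup after it."""
    events.append("setup: open shared resources")
    try:
        yield
    finally:
        # Runs when the surrounding block exits, even after an error.
        events.append("cleanup: close shared resources")


# contextmanager runs the generator to the yield on entry and resumes it on exit.
with contextmanager(toy_lifespan)():
    events.append("update: run the pipeline")

print(events)
```

Wrapping the `yield` in `try`/`finally` is a design choice that keeps cleanup running even if work inside the block raises.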

### Explicit lifecycle control (optional)

The lifespan runs automatically the first time any App updates — most users don't need to do anything beyond defining the lifespan and calling `app.update()`.

If you need more explicit control — for example, to know when startup completes for health checks, or to explicitly trigger shutdown — you can manage the lifecycle directly:

```python
# Async API
await coco.start()   # Run lifespan setup
# ... run apps or other operations ...
await coco.stop()    # Run lifespan cleanup
```

```python
# Sync (blocking) API
coco.start_blocking()   # Run lifespan setup
# ... run apps or other operations ...
coco.stop_blocking()    # Run lifespan cleanup
```

Or use the `runtime()` context manager, which supports both sync and async usage:

```python
# Async
async with coco.runtime():
    await app.update()
```

```python
# Sync (blocking)
with coco.runtime():
    app.update_blocking()
```

## Managing apps with CLI

CocoIndex provides a CLI for managing your apps without writing additional code.

### Update an app

Run your app once to sync all target states:

```bash
cocoindex update main.py
```

This executes your pipeline and applies all declared target states to external systems. Add `--live` (or `-L`) to keep the app running and react to source changes continuously — see [Live Mode](./live_mode).

### Drop an app

Remove an app and revert all its target states:

```bash
cocoindex drop main.py
```

This will delete all target states created by the app (e.g., drop tables, delete rows) and clear its internal state.

`drop` is an explicit, foreground operation: any failure during the recursive delete (at the root or in any descendant) raises an exception rather than being silently logged. The internal tracking record for a component whose delete failed is preserved, so the next `drop` (once the underlying problem is fixed) can complete the cleanup. See [Error Handling](../advanced_topics/exception_handlers) for the general principle.

See [CLI Reference](../cli) for more commands and options.

---

# Declaring target state

Source: https://cocoindex.io/docs/programming_guide/target_state/

**Note — Prerequisite**
This page builds on [Core Concepts](./core_concepts), which introduces target states and the declarative *target state = transform(source state)* model **with diagrams**. If the ideas below feel abstract, start there.

A **target state** represents what you want to exist in an external system. You *declare* target states in your code; CocoIndex keeps them in sync with your intent — creating, updating, or removing them as needed.

**Note — Terminology**
A **target** is the external system you write to — a directory, a database table, a vector store collection, etc. In Python, targets are represented by objects like `DirTarget` and `TableTarget`. A **target state** is what you want to exist *in* that target — a specific file, row, or embedding.

CocoIndex treats your declarations as the source of truth: if you stop declaring a target state, CocoIndex will remove it from the target.

Examples of target states:

- A file in a directory
- A row in a database table
- An embedding vector in a vector store

When your source data changes, CocoIndex compares the newly declared target states with those from the previous run and applies only the necessary changes.

## Declaring target states

CocoIndex connectors provide **targets** with `declare_*` methods:

```python
# Declare a file target state
dir_target.declare_file(filename="output.html", content=html)

# Declare a row target state
table_target.declare_row(row=DocEmbedding(...))
```

### Where do targets come from?

Target states can be nested — a directory contains files, a table contains rows. The container itself is a target state you declare, and once it's ready, you get a target to declare child target states within it.

Container target states (like a directory or table) are typically top-level — you can declare them directly. Child target states (like files or rows) require the container to be ready first.

Connectors provide convenience methods that mount the container and return a ready-to-use target in one step, as the examples below show.

### Example: writing a file to a directory

For simple cases where each processing component writes a single file, you can declare the file directly:

```python
from cocoindex.connectors import localfs

# Declare a single file target state directly
localfs.declare_file(outdir / "output.html", html, create_parent_dirs=True)
```

When you need a `DirTarget` to declare multiple files, use the connector's convenience method:

```python
# Mount a directory target, get a ready-to-use DirTarget
dir_target = await localfs.mount_dir_target(outdir)

# Declare child target states (files)
dir_target.declare_file(filename="output.html", content=html)
```

### Example: writing a row to PostgreSQL

This example uses a [`ContextKey`](./context) to reference the database connection — see [Context](./context) for how keys are defined and provided.

```python
import asyncpg
import cocoindex as coco
from cocoindex.connectors import postgres

# Define a ContextKey for the database connection (provided in lifespan)
TARGET_DB = coco.ContextKey[asyncpg.Pool]("target_db")

# Mount a table target, get a ready-to-use TableTarget
table = await postgres.mount_table_target(
    TARGET_DB,
    table_name="doc_embeddings",
    table_schema=await postgres.TableSchema.from_class(
        DocEmbedding, primary_key=["id"]
    ),
)

# Declare a child target state (a row)
table.declare_row(row=DocEmbedding(...))
```

These convenience methods wrap [`mount_target()`](./processing_component#mount_target), which automatically derives the component path from the target's globally unique key. See [Processing Component](./processing_component) for more on mounting APIs.

**Tip — Type safety**
Targets like `DirTarget` and `TableTarget` have two statuses: **pending** (just created) and **resolved** (after the container target state is ready). The type system tracks this — if you try to use a pending target before it's resolved, type checkers like mypy will flag the error.

## How CocoIndex syncs target states

Under the hood, CocoIndex compares your declared target states with the previous run and applies the minimal changes needed — the same create/update/delete sync the [change-scenario diagrams in Core Concepts](./core_concepts#processing-component) walk through, viewed per target state:

<table>
  <thead>
    <tr>
      <th rowspan="2">Target State</th>
      <th colspan="3" style={{textAlign: 'center'}}>CocoIndex's Action</th>
    </tr>
    <tr>
      <th>On first declaration</th>
      <th>When declared differently</th>
      <th>When no longer declared</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>A database table</td>
      <td>Create the table</td>
      <td>Alter the table</td>
      <td>Drop the table</td>
    </tr>
    <tr>
      <td>A row in a database table</td>
      <td>Insert the row</td>
      <td>Update the row</td>
      <td>Delete the row</td>
    </tr>
    <tr>
      <td>A file in a directory</td>
      <td>Create the file</td>
      <td>Update the file</td>
      <td>Delete the file</td>
    </tr>
  </tbody>
</table>

CocoIndex ensures containers exist before their contents are added, and properly cleans up contents when the container changes.
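The table above boils down to a keyed diff between two runs. Here is a minimal sketch, assuming target states can be modeled as a plain dict from a key to a declared value. `plan_sync` and the action names are hypothetical and not part of the CocoIndex API:

```python
from typing import Any


def plan_sync(previous: dict[str, Any], declared: dict[str, Any]) -> dict[str, list[str]]:
    """Classify each target state key as create, update, or delete."""
    plan: dict[str, list[str]] = {"create": [], "update": [], "delete": []}
    for key, value in declared.items():
        if key not in previous:
            plan["create"].append(key)   # first declaration
        elif previous[key] != value:
            plan["update"].append(key)   # declared differently
    for key in previous:
        if key not in declared:
            plan["delete"].append(key)   # no longer declared
    return plan


previous_run = {"a.html": "<p>old</p>", "b.html": "<p>same</p>", "c.html": "<p>gone</p>"}
current_run = {"a.html": "<p>new</p>", "b.html": "<p>same</p>", "d.html": "<p>added</p>"}
plan = plan_sync(previous_run, current_run)
print(plan)
```

Unchanged keys (`b.html` here) produce no action, which is what keeps repeated runs cheap.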

### What happens when a target's schema changes

When you change a container target state's declaration (e.g., add a column to a table schema, change a primary key), CocoIndex detects the change and does its best to alter the target in place. If the change is too large to alter (e.g., changing primary keys), the target is dropped and recreated.

When a target is dropped and recreated, CocoIndex automatically reprocesses all affected components to backfill the data — you don't need to manually trigger `--full-reprocess`. This is handled by the target connector's [child invalidation](../advanced_topics/custom_target_connector#child-invalidation) mechanism, which signals to CocoIndex whether the change is destructive (all children lost) or lossy (some data may be lost).
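One way to picture the alter-versus-recreate decision is as a comparison of two schema descriptions. The sketch below is only an illustration under stated assumptions: `Schema`, `plan_schema_change`, and the rule that a primary-key change is the sole trigger for a recreate are hypothetical. In practice each target connector makes this decision.

```python
from dataclasses import dataclass


@dataclass
class Schema:
    columns: dict[str, str]           # column name -> type name
    primary_key: tuple[str, ...]


def plan_schema_change(old: Schema, new: Schema) -> str:
    """Return 'noop', 'alter', or 'recreate' for a declared schema change."""
    if old == new:
        return "noop"
    if old.primary_key != new.primary_key:
        # Assumed to be too large to alter in place: children must be backfilled.
        return "recreate"
    return "alter"


v1 = Schema(columns={"id": "text", "body": "text"}, primary_key=("id",))
v2 = Schema(columns={"id": "text", "body": "text", "lang": "text"}, primary_key=("id",))
v3 = Schema(columns={"id": "text", "body": "text", "lang": "text"}, primary_key=("id", "lang"))

print(plan_schema_change(v1, v2))
print(plan_schema_change(v2, v3))
```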

## Generic target state APIs

For cases where connector-specific APIs don't cover your needs, CocoIndex provides generic APIs:

- `declare_target_state()` — declare a leaf target state
- `declare_target_state_with_child()` — declare a target state that provides child target states

These are exported from `cocoindex` and used internally by connectors. For defining custom targets, see [Custom Target States Connector](../advanced_topics/custom_target_connector).

---

# The processing component

Source: https://cocoindex.io/docs/programming_guide/processing_component/

**Note — Prerequisite**
This page builds on [Core Concepts](./core_concepts), which introduces processing components, target states, and incremental sync **with diagrams** — the per-file fan-out and the chunk-embedding pipeline it walks through are referenced throughout this page. If the mounting APIs below feel abstract, start there.

Most apps process many independent source items — files, rows, or entities. A **Processing Component** is the unit of execution for one: it runs that item's transformation logic and declares the set of **target states** produced for it.

## Component path

A **component path** is the stable identifier for a processing component across runs (think of it like a path in a tree). CocoIndex uses it to match a component to its previous run, detect what changed for that item, and sync that component's target states as a unit when it finishes. This sync happens per component; CocoIndex does not wait for other components in the same app to complete.

Component paths are hierarchical and form a tree structure. You specify child paths using `coco.component_subpath()` with stable identifiers like string literals, file names, row keys, or entity IDs:

```python
coco.component_subpath(filename)           # e.g., coco.component_subpath("hello.pdf")
coco.component_subpath("user", user_id)    # e.g., coco.component_subpath("user", 12345)
```

Choose paths that are stable for the "same" item (e.g., file path, primary key). If an item disappears and its path is no longer present, CocoIndex cleans up the target states owned by that path (and its sub-paths).

Here's an example component path tree for a pipeline that processes files:

```text
(root)                         ← app_main component
└── process_file
    ├── "hello.pdf"            ← process_file component
    └── "world.pdf"            ← process_file component
```

The tree is populated dynamically as the app runs — each `mount()` / `mount_each()` call adds a subpath.
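The cleanup rule for a removed item can be pictured as a prefix match over paths. Here is a minimal sketch, assuming a component path is modeled as a tuple of keys. `owned_by` and `paths_to_clean` are hypothetical helpers, not CocoIndex APIs:

```python
Path = tuple[str, ...]


def owned_by(path: Path, root: Path) -> bool:
    """True if `path` is `root` itself or one of its sub-paths."""
    return path[: len(root)] == root


def paths_to_clean(previous: set[Path], removed_roots: set[Path]) -> set[Path]:
    """Paths from the previous run that fall under a removed path."""
    return {p for p in previous if any(owned_by(p, r) for r in removed_roots)}


previous_paths = {
    ("process_file", "hello.pdf"),
    ("process_file", "hello.pdf", "chunk-0"),
    ("process_file", "world.pdf"),
}
removed = {("process_file", "hello.pdf")}
to_clean = paths_to_clean(previous_paths, removed)
print(sorted(to_clean))
```

Because paths are compared element by element, `("process_file", "hello.pdf.bak")` would not be treated as a sub-path of `("process_file", "hello.pdf")`.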

See [StableKey](./sdk_overview#stablekey) in the SDK Overview for details on what values can be used in component paths.

## Mount

Mounting is how you declare (instantiate) a processing component within an app at a specific path, so CocoIndex knows that component exists, should run, and owns a set of target states — and can match it against its previous run to sync only what changed.

CocoIndex provides two core mounting APIs:

- **`mount()`** — sets up a processing component in a child path without depending on data from it. This allows the component to refresh independently in live mode.
- **`use_mount()`** — returns a value from the component's execution to the caller. The component at that path cannot refresh independently without re-executing the caller.

**Which one you reach for comes down to a single question: does the caller need a value back?** `use_mount()` consumes the child's return value, which *couples* the two — the call blocks until the child finishes and commits its target states, and the child can't refresh on its own without re-running the caller. `mount()` takes nothing back and returns as soon as the child is scheduled, so the child runs on its own and can [refresh independently in live mode](./live_mode) when its inputs change.

And two sugar APIs that simplify common patterns:

- **`mount_each()`** — mounts one component per item in a keyed iterable
- **`mount_target()`** — mounts a target without an explicit subpath

See also [`map()`](#map) for a utility API that operates within a component without creating new ones.

### Automatic subpath derivation
`mount()`, `use_mount()`, and `mount_each()` all accept an optional `ComponentSubpath` as their first argument. When omitted, the subpath is **auto-derived** from the function name using `Symbol(fn.__name__)`.

```python
# These are equivalent:
await coco.mount(process_file, file, target)
await coco.mount(coco.component_subpath(coco.Symbol("process_file")), process_file, file, target)
```

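This means the component path for a `process_file` function is `parent / Symbol("process_file")`. The function must have a `__name__` attribute. For callables that lack one (e.g., a `functools.partial` object) or whose name is not meaningful (e.g., a lambda, whose `__name__` is always `"<lambda>"`), provide an explicit subpath.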

Since sibling component paths must not collide, you need an explicit subpath when:
- **The same function is mounted more than once** — auto-derived paths would be identical, so each call needs a distinct path (e.g., `coco.component_subpath("session", youtube_id)`)
- **Different functions happen to share a `__name__`** — rare, but possible with wrappers or closures
- **You want a specific path name** — different from the function name
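To see why repeated mounts of one function need explicit subpaths, consider a toy version of name-based derivation. `derive_subpath` below is a hypothetical stand-in that only mimics the `fn.__name__` rule described above. It is not the CocoIndex implementation:

```python
import functools


def derive_subpath(fn) -> str:
    """Toy rule: use the callable's __name__ as its subpath."""
    name = getattr(fn, "__name__", None)
    if name is None:
        raise ValueError("callable has no __name__; an explicit subpath is required")
    return name


def process_file(file):
    return file


def process_session(session):
    return session


# Different functions derive different subpaths.
distinct = {derive_subpath(process_file), derive_subpath(process_session)}

# Mounting the same function twice derives the same subpath both times: a collision.
collision = derive_subpath(process_file) == derive_subpath(process_file)

# Every lambda is named "<lambda>", so the derived name carries no information.
lambda_name = derive_subpath(lambda file: file)

# functools.partial objects have no __name__ attribute at all.
partial_fn = functools.partial(process_file, "hello.pdf")
```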

### `mount()`

Use `mount()` when you don't need a return value from the processing component. It schedules the processing component to run and returns a handle:

```python
handle = await coco.mount(process_file, file, target)
```

With an explicit subpath, for example when mounting multiple components of the same function:

```python
handle = await coco.mount(
    coco.component_subpath("process", filename),
    process_file,
    file,
    target,
)
```

The handle provides a method you can call if you need to wait until the processing component is fully ***ready*** — meaning all its target states have been applied to external systems and all components in its sub-paths are ready:

```python
await handle.ready()
```

You usually only need to call `ready()` when you have logic that depends on the processing component's target states being applied — for example, querying the latest data from a target table after syncing it.

`mount()` also accepts **LiveComponent classes** — components that process continuously and react to changes incrementally instead of rescanning everything. See [Live Components](../advanced_topics/live_component) for details.

### `use_mount()`

Use `use_mount()` when you need the processing component's return value. It mounts the component, waits until it's ready, and returns the value directly:

```python
table = await coco.use_mount(setup_table, table_name="docs")
```

With an explicit subpath:

```python
table = await coco.use_mount(
    coco.component_subpath("setup"),
    setup_table,
    table_name="docs",
)
```

A common use of `use_mount()` is to obtain a [target](./target_state#where-do-targets-come-from) after its container target state is applied.
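The scheduling difference between `mount()` and `use_mount()` resembles the difference between `asyncio.create_task()` and a plain `await`. The sketch below is only an analogy in standard-library asyncio. `child`, `mount_like`, and `use_mount_like` are hypothetical names and say nothing about how CocoIndex is implemented:

```python
import asyncio

events: list[str] = []


async def child(name: str) -> str:
    await asyncio.sleep(0)               # stand-in for real work
    events.append(f"{name} finished")
    return f"{name}-result"


async def mount_like(name: str) -> "asyncio.Task[str]":
    # Returns as soon as the child is scheduled; the caller gets a handle.
    return asyncio.create_task(child(name))


async def use_mount_like(name: str) -> str:
    # Blocks until the child finishes, then hands back its value.
    return await child(name)


async def main() -> tuple[str, list[str]]:
    handle = await mount_like("a")
    events.append("caller continued after mount")
    value = await use_mount_like("b")
    events.append("caller continued after use_mount")
    await handle                         # similar in spirit to `await handle.ready()`
    return value, list(events)


value, log = asyncio.run(main())
print(value, log)
```

The caller moves on before child `a` has done any work, but it cannot move past `use_mount_like("b")` until `b` has finished.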

### `mount_each()`
`mount_each()` mounts one processing component per item in a keyed iterable.

```python
files = localfs.walk_dir(sourcedir, path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]))
await coco.mount_each(process_file, files.items(), target)
```

Each item in the iterable is a `(key, value)` tuple. The value is passed as the first argument to the function, and any additional arguments are passed through. Items are mounted under an [auto-derived subpath](#automatic-subpath-derivation) (`Symbol(fn.__name__)`), so the component path for each item is `parent / Symbol("process_file") / key`.

In the static case this is just the keyed-loop form of [`mount()`](#mount) — the same per-file fan-out shown in [Core Concepts](./core_concepts#processing-component). The snippet above is equivalent to looping over the items and calling `mount(coco.component_subpath(coco.Symbol("process_file"), key), process_file, value, target)` for each one. The per-item `key` is what keeps each component's path stable across runs, so the engine matches, updates, and cleans up each item individually.

You can provide an explicit subpath as the first argument:

```python
await coco.mount_each(coco.component_subpath("files"), process_file, files.items(), target)
```

Source connectors provide an `items()` method that returns `(StableKey, T)` pairs. For example, `localfs.walk_dir(...).items()` yields `(relative_path, File)` tuples.

`mount_each()` also does one thing a hand-written loop can't: when a source connector supports live watching, its `items()` returns a `LiveMapView` or `LiveMapFeed` instead of a plain iterable, and `mount_each()` detects this and automatically adds, updates, and removes per-item components as the source changes. The call to `mount_each()` itself stays the same. See [Live Mode](./live_mode).

### `mount_target()`
`mount_target()` mounts a target without requiring an explicit subpath.

```python
from cocoindex.connectors import localfs

dir_target = await coco.mount_target(localfs.dir_target(outdir))
```

The component path is derived automatically from the target's globally unique key — you don't need to create a `component_subpath` for it. This is sugar over calling `use_mount()` with a target declaration function.

Connectors also provide convenience methods that wrap `mount_target()`:

```python
# Equivalent to the above
dir_target = await localfs.mount_dir_target(outdir)

# PostgreSQL example
table = await postgres.mount_table_target(
    PG_DB,
    table_name="doc_embeddings",
    table_schema=await postgres.TableSchema.from_class(DocEmbedding, primary_key=["id"]),
)
```

### Using `component_subpath` as a context manager

You can use `component_subpath()` as a context manager to create nested paths without repeating common prefixes:

```python
with coco.component_subpath("process"):
    for f in files:
        await coco.mount(
            coco.component_subpath(str(f.relative_path)),
            process_file,
            f,
            target,
        )
```

This is equivalent to:

```python
for f in files:
    await coco.mount(
        coco.component_subpath("process", str(f.relative_path)),
        process_file,
        f,
        target,
    )
```

**Tip**
When iterating over keyed items, prefer [`mount_each()`](#mount_each) — it handles the loop and subpath creation for you.

## How target states sync

The component path tree determines ownership. When a component is no longer mounted at a path (e.g., a source file is deleted), CocoIndex automatically cleans up its target states — and recursively for all its sub-paths.

**Info — Sync Mechanism**

After a processing component finishes, CocoIndex syncs its target states:

1. **Compares** the target states declared in this run against those from the previous run at the same path
2. **Applies changes** as a unit — creating, updating, or deleting target states as needed
3. **Recursively cleans up** sub-paths where components are no longer mounted

All writes happen strictly after processing completes — you never see partial effects from a processing failure or interrupt. Each target backend applies its batch atomically when supported (e.g., within a database transaction), but changes across different target backends are not transactional with each other.

## What happens when a component fails

CocoIndex processes each component in two phases: **processing** (running your function, declaring target states) and **submit** (writing changes to target backends).

### Failure isolation

The framework's general rule: **a call raises on failure if and only if the failed work was on the critical path for the call to return.**

- **`use_mount()`** — you're awaiting the child's *value*, so the child succeeding is on the critical path. The child's exception propagates to the parent.
- **`mount()` and `mount_each()`** — these return as soon as the work is *scheduled* (you get a handle back). The child's execution runs in the background, off your critical path. A failure in one child does **not** affect the parent or siblings — by default the exception is logged and other components continue. One bad file shouldn't take down the entire pipeline.

To react to background failures, you can:

- Install [exception handlers](../advanced_topics/exception_handlers#exception-handlers) — global or scoped — to send alerts, record metrics, or implement custom logic. A handler that raises propagates the failure through `await handle.ready()` if you choose to await it.
- [Monitor app progress](../advanced_topics/progress_monitoring) — `UpdateStats` exposes per-component stats including error counts, so you can detect failures programmatically.

For the full picture — including the critical-path principle applied to every API, interrupted update recovery, and the exception handler API — see [Error Handling](../advanced_topics/exception_handlers).
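The isolation behavior for background children can be illustrated with plain asyncio. This sketch is an analogy rather than CocoIndex code, and `process` and `run_isolated` are hypothetical. A failing child is logged and counted while its siblings finish normally:

```python
import asyncio
import logging

logger = logging.getLogger("pipeline_sketch")


async def process(name: str) -> str:
    if name == "bad.pdf":
        raise ValueError(f"cannot parse {name}")
    return f"{name}: ok"


async def run_isolated(names: list[str]) -> tuple[list[str], int]:
    # return_exceptions=True collects failures as values instead of
    # raising the first one into the caller.
    results = await asyncio.gather(*(process(n) for n in names), return_exceptions=True)
    succeeded: list[str] = []
    errors = 0
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            logger.error("component %s failed: %s", name, result)
            errors += 1
        else:
            succeeded.append(result)
    return succeeded, errors


succeeded, error_count = asyncio.run(run_isolated(["a.pdf", "bad.pdf", "b.pdf"]))
print(succeeded, error_count)
```

The error count plays the role that per-component error stats play in progress monitoring: the caller can inspect it without any child failure having interrupted the run.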

### No rollback, convergent roll-forward

CocoIndex does not roll back partial writes. The two-phase design makes this safe:

- **Processing** is side-effect-free — it only declares target states in memory. If processing fails (e.g., a parsing error), no writes were attempted, so there's nothing to undo.
- **Submit** writes changes to target backends. If a submit fails partway through (e.g., a database connection drops), some writes may have been applied. CocoIndex does not attempt to undo them. Instead, on the next run CocoIndex computes the current desired state, and the target connector reconciles against all possible previous states — converging the target to the correct state regardless of what was partially applied. This is why built-in connectors use convergent operations like upserts (`INSERT ... ON CONFLICT DO UPDATE`) rather than plain inserts.
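The convergence property is easy to check with a small upsert experiment. The sketch uses the standard-library `sqlite3` module as a stand-in backend, which is an assumption; the text above only names the `INSERT ... ON CONFLICT DO UPDATE` pattern. `submit` and its `fail_after` parameter are hypothetical:

```python
import sqlite3
from typing import Optional


def submit(conn: sqlite3.Connection, rows: dict[str, str], fail_after: Optional[int] = None) -> None:
    """Upsert rows one by one; optionally fail partway to simulate a dropped connection."""
    for i, (key, value) in enumerate(rows.items()):
        if fail_after is not None and i >= fail_after:
            raise ConnectionError("simulated failure during submit")
        conn.execute(
            "INSERT INTO docs (id, body) VALUES (?, ?) "
            "ON CONFLICT (id) DO UPDATE SET body = excluded.body",
            (key, value),
        )
        conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT)")
desired = {"a": "alpha", "b": "beta", "c": "gamma"}

try:
    submit(conn, desired, fail_after=2)    # first run: only "a" and "b" are written
except ConnectionError:
    pass

partial = dict(conn.execute("SELECT id, body FROM docs ORDER BY id"))
submit(conn, desired)                       # next run: same desired state, no undo step
final = dict(conn.execute("SELECT id, body FROM docs ORDER BY id"))
print(partial, final)
```

A plain `INSERT` would fail on the second run because `a` and `b` already exist. The upsert reaches the desired rows no matter how far the first run got. This sketch covers writes only; reconciling deletions is a separate concern.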

## How big should a processing component be?

When defining processing components, think about granularity — what one path represents — because it determines the sync boundary for target states.

For example, if you're processing files:

- **Coarse**: one component for "all files" (`coco.component_subpath("process")`)
- **Medium**: one component per file (`coco.component_subpath("process", file_path)`)
- **Fine**: one component per chunk (`coco.component_subpath("process", file_path, chunk_id)`)

In general:

- **Coarse-grained** (fewer, larger components): More target states sync together as a unit, but you only see updates after the larger component finishes.
- **Fine-grained** (more, smaller components): Each component syncs its target states as soon as it finishes, but target states owned by different components do not sync together as a unit.

For small datasets, a single processing component that owns all target states is simple and ensures all target states sync together. As data grows, consider breaking it down into one component per source item (e.g., one per file) to reduce latency: you see each item's target states synced as soon as it's processed, without waiting for the full dataset to complete. This also helps isolate failures to that item.

## Explicit context management

CocoIndex automatically propagates component context through Python's `contextvars`, which works for ordinary function calls (both sync and async). However, in situations where context variables are not preserved (for example, when using `concurrent.futures.ThreadPoolExecutor`), you need to explicitly capture and attach the context.

Use `coco.get_component_context()` to capture the current context, and `context.attach()` to restore it:

```python
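from concurrent.futures import ThreadPoolExecutor

import cocoindex as coco

# `items` and `process_item` are placeholders for your own data and logic.

@coco.fn
def app_main() -> None:
    # Capture the current context
    ctx = coco.get_component_context()

    def worker(item):
        # Attach the context in the worker thread
        with ctx.attach():
            # Now CocoIndex APIs work correctly
            process_item(item)

    with ThreadPoolExecutor() as executor:
        # Consume the iterator so exceptions raised in workers surface here
        list(executor.map(worker, items))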
```

This pattern ensures that CocoIndex can track component relationships and target state ownership even across thread boundaries.
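The underlying Python behavior can be reproduced with the standard library alone. This sketch does not use CocoIndex: `current_component` is a hypothetical context variable, and `contextvars.copy_context()` plays the role that `coco.get_component_context()` and `ctx.attach()` play above.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

current_component = contextvars.ContextVar("current_component", default="<none>")


def read_component(_item: int) -> str:
    return current_component.get()


def run_with_captured_context(items: list[int]) -> list[str]:
    current_component.set("process_file/hello.pdf")
    captured = contextvars.copy_context()    # capture in the calling thread

    def worker(item: int) -> str:
        # Run each call inside its own copy of the captured context,
        # because one Context object cannot be entered by two threads at once.
        return captured.copy().run(read_component, item)

    with ThreadPoolExecutor(max_workers=2) as executor:
        return list(executor.map(worker, items))


seen = run_with_captured_context([1, 2, 3])
print(seen)
```

Without the explicit capture, what a worker thread sees depends on how the thread was started, so capturing and re-entering the context is the portable choice.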

## Processing helpers

### `map()`
`map()` applies an async function to each item in a collection, running all calls concurrently within the current processing component. Unlike [`mount()`](#mount) and [`mount_each()`](#mount_each), it does **not** create child processing components — it's purely concurrent execution (similar to `asyncio.gather()`).

```python
@coco.fn(memo=True)
async def process_file(file: FileLike, table: postgres.TableTarget[DocEmbedding]) -> None:
    chunks = splitter.split(await file.read_text())
    id_gen = IdGenerator()
    await coco.map(process_chunk, chunks, file.file_path.path, id_gen, table)
```

The first argument to the function receives each item; additional arguments are passed through to every call. `map()` returns a `list` of the results, in the same order as the input items.
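For intuition, this calling convention matches a thin wrapper over `asyncio.gather()`. `gather_map` below is a hypothetical standard-library sketch of that convention, not the CocoIndex implementation, and `embed_chunk` is a placeholder workload:

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def gather_map(
    fn: Callable[..., Awaitable[Any]], items: Iterable[Any], *args: Any
) -> list[Any]:
    # Each item becomes the first argument; extra arguments are shared by all calls.
    return list(await asyncio.gather(*(fn(item, *args) for item in items)))


async def embed_chunk(chunk: str, prefix: str) -> str:
    # Shorter chunks sleep longer, so later items tend to finish first.
    await asyncio.sleep(0.01 * (3 - len(chunk)))
    return f"{prefix}:{chunk}"


results = asyncio.run(gather_map(embed_chunk, ["a", "bb", "ccc"], "doc"))
print(results)
```

Even though the calls complete out of order, the result list follows the input order.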

#### When to use `map()` vs `mount_each()`

- Use **`mount_each()`** when each item should be its own processing component — with its own component path, target state ownership, and target states sync boundary.
- Use **`map()`** when you want to process items concurrently *within* the current component, without creating new component boundaries. This is common for sub-item work like processing chunks within a file — the same within-component chunk work the [chunk-embedding pipeline in Core Concepts](./core_concepts#processing-component) walks through.

---

# Declaring functions with @coco.fn

Source: https://cocoindex.io/docs/programming_guide/function/

**Note — Prerequisite**
This page builds on [Core Concepts](./core_concepts), which introduces change detection and function memoization **with diagrams**. If memoization feels abstract, start there.

It's common to factor work into helper functions (for parsing, chunking, embedding, formatting, etc.). In CocoIndex, you can decorate any Python function with `@coco.fn` when you want to add incremental capabilities to it. The decorated function is still a normal Python function: its signature stays the same, and you can call it normally.

```python
@coco.fn
async def process_file(file: FileLike) -> str:
    return await file.read_text()

# Can be called like any normal function
result = await process_file(file)
```

`@coco.fn` preserves the sync/async nature of the underlying function. Decorating a sync function yields a sync function; decorating an async function yields an async function.
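A decorator can keep the wrapped function's sync or async nature by branching on `inspect.iscoroutinefunction()`. The sketch below shows that general Python technique with a hypothetical `traced` decorator. It makes no claim about how `@coco.fn` is implemented:

```python
import asyncio
import functools
import inspect

calls: list[str] = []


def traced(fn):
    """Record each call while preserving whether `fn` is sync or async."""
    if inspect.iscoroutinefunction(fn):
        @functools.wraps(fn)
        async def async_wrapper(*args, **kwargs):
            calls.append(fn.__name__)
            return await fn(*args, **kwargs)
        return async_wrapper

    @functools.wraps(fn)
    def sync_wrapper(*args, **kwargs):
        calls.append(fn.__name__)
        return fn(*args, **kwargs)
    return sync_wrapper


@traced
def double(x: int) -> int:
    return 2 * x


@traced
async def double_async(x: int) -> int:
    return 2 * x
```

`double(2)` can be called directly, while `double_async(3)` still has to be awaited, for example with `asyncio.run(double_async(3))`.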

## How to think about `@coco.fn`

Decorating a function tells CocoIndex that calls to it are part of the incremental update engine. You still write normal Python, but CocoIndex can now:

- Detect when inputs or code have changed (change detection)
- Skip work when nothing has changed (memoization)

This is what lets CocoIndex avoid rerunning expensive steps on every `app.update()`. See [Processing Component](./processing_component) for how decorated functions are mounted at component paths.

If you don't need any of the above for a helper, keep it as a plain Python function.

## Change detection and memoization

Every `@coco.fn` function participates in CocoIndex's change detection system. With `memo=True`, the function's results are cached and reused when nothing has changed. These two mechanisms — detecting changes and acting on them — work together to enable incremental updates. For the big-picture view — how memoization skips unchanged work across both input and code changes — see the [Function memoization section in Core Concepts](./core_concepts#function-memoization-skip-unchanged-computations).

**Tip — Mental model**
A function's behavior is determined by its **logic** (source code, [`deps`](#deps), [`version`](#version)), its **inputs** (arguments), and the **context values** it reads via [`use_context()`](./context#retrieving-values). If none of those have changed, a memoized function returns its cached result without re-executing. You don't need to reason about *how* CocoIndex tracks any of this; just trust the contract.

### Change detection

CocoIndex detects three kinds of changes:

**Logic changes** — the function's source code, [`deps`](#deps) values, and explicit [`version`](#version) bumps. Tracked by `@coco.fn`. Logic fingerprints **propagate transitively** up the call chain — but only through `@coco.fn` boundaries. If `foo` (memoized) calls `bar` (also `@coco.fn`-decorated, with or without `memo=True`), and `bar`'s logic changes, `foo`'s memo is invalidated. A bare Python helper that isn't decorated is invisible to change detection: editing it will not invalidate `foo`. This is why `@coco.fn` matters for **any** function in the call chain, not just memoized ones — see [Common patterns](#common-patterns).

**Input changes** — the function's arguments. Tracked by `@coco.fn`. When you call a function with different arguments, the fingerprints change. Input fingerprints do not propagate transitively.

**Context changes** — [context values](./context) with [`detect_change=True`](./context#change-detection). Tracked by [`use_context()`](./context#retrieving-values) at the call site — independent of `@coco.fn`. Context reads propagate transitively: if a memoized `foo` calls `bar` and `bar` reads a change-detected context value, then changing that value invalidates `foo`'s memo too — even though `foo` itself never called `use_context()`. What matters is whether the key was read *anywhere* during the memoized call, not where in the call chain.

### Memoization

With `memo=True`, the function's result is cached. On subsequent calls, if no logic, input, or read context values have changed, the cached result is reused without executing the function body — it carries over [target states](./target_state) declared during the function's previous invocation and returns its previous return value.

```python
@coco.fn(memo=True)
def process_chunk(chunk: Chunk) -> list[float]:
    # This computation is skipped if chunk, logic, and context are unchanged
    return embed(chunk.text)
```

**Info — Type annotations**
Add a **return type annotation** to memoized functions so CocoIndex can properly reconstruct cached values. Without a type annotation, cached values may deserialize as basic Python types (`dict`, `list`, etc.) instead of their original types. See [Serialization](./serialization) for details on supported types.
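The effect of the annotation can be reproduced with a JSON round trip. This sketch assumes a cache that stores values as JSON, which may differ from CocoIndex's actual serialization. `Chunk`, `split`, `split_untyped`, and `restore` are hypothetical:

```python
import dataclasses
import json
from typing import get_type_hints


@dataclasses.dataclass
class Chunk:
    text: str
    start: int


def split(text: str) -> Chunk:
    return Chunk(text=text, start=0)


def split_untyped(text):
    return Chunk(text=text, start=0)


def restore(fn, payload: str):
    """Rebuild a cached value using the function's return annotation, if any."""
    raw = json.loads(payload)
    return_type = get_type_hints(fn).get("return")
    if dataclasses.is_dataclass(return_type) and isinstance(raw, dict):
        return return_type(**raw)
    return raw                      # no usable annotation: the value stays a plain dict


cached = json.dumps(dataclasses.asdict(split("hello")))
typed_value = restore(split, cached)
untyped_value = restore(split_untyped, cached)
print(typed_value, untyped_value)
```

Both functions produced the same cached payload, but only the annotated one gets a `Chunk` back.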

**Caution — `memo=True` constraints**
A memoized function:

- **Must run inside a [processing component](./processing_component)** for memoization to take effect. Outside a component context (e.g., called directly from a script or test), the function still executes correctly but the cache is bypassed silently — every call runs the body.
- **Cannot mount child components** (`coco.mount(...)` / `coco.use_mount(...)` inside the body). Mounting is a side effect the cache cannot replay; CocoIndex raises an error if a memoized function attempts it. Either drop `memo=True`, or restructure so the mount happens in a non-memoized caller.

#### Cache-hit semantics

When a memoized function's cache hits, its **body does not run** — and neither do any nested `@coco.fn` calls inside it. The cached output is replayed directly: the previous return value is returned, and any target states the function declared on its previous run are carried over.

```python
@coco.fn(memo=True)
async def inner(text: str) -> str:
    print("inner ran")
    return await call_llm(text)

@coco.fn(memo=True)
async def outer(text: str) -> str:
    print("outer ran")
    return await inner(text) + "!"

# First call: both print, both bodies execute
result = await outer("hello")

# Second call with same inputs: nothing prints — outer's cached value is returned
# directly, and inner is not invoked at all.
result = await outer("hello")
```

Propagation (logic, `deps`, context) is recorded during the **previous** invocation and replayed on cache lookup. You don't need to model the internals — change anything that affects behavior and the cache invalidates correctly.

**Note — Exceptions and the cache**
If a memoized function raises, no cache entry is written for that call. The next invocation with the same inputs sees a cache miss and re-executes the body — exceptions never poison the cache, so you don't need to wrap calls defensively.
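The contract described in this section (a hit skips the body, an exception writes no entry) can be sketched with a toy in-memory cache. `toy_memo` is hypothetical and keyed only on positional arguments, whereas CocoIndex's change detection also covers logic and context values:

```python
import functools

runs: list[str] = []


def toy_memo(fn):
    cache: dict[tuple, object] = {}

    @functools.wraps(fn)
    def wrapper(*args):
        if args in cache:
            return cache[args]       # hit: the body is skipped
        result = fn(*args)           # if this raises, nothing is stored
        cache[args] = result
        return result

    return wrapper


@toy_memo
def inner(text: str) -> str:
    runs.append("inner")
    if text == "boom":
        raise RuntimeError("transient failure")
    return text.upper()


@toy_memo
def outer(text: str) -> str:
    runs.append("outer")
    return inner(text) + "!"


outer("hello")    # miss: runs outer, then inner
outer("hello")    # hit: neither body runs
print(runs)
```

Calling `outer("boom")` repeatedly re-executes both bodies every time, because the failed calls never produced a cache entry.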

**Tip — When to memoize**

**Cost:** Function return values must be stored for memoization. Larger return values mean higher storage costs.

**Benefit:** Memoization saves more when:

- The computation is expensive
- The function's caller is reprocessed frequently (due to logic or input changes)

**Examples:**

- ✅ **Embedding functions** — good to memoize. Computation is heavy; return value is fixed-size and not too large.
- ❌ **Splitting text into fixed-size chunks** — usually not worth memoizing. Computation is light; return value can be large.
- ✅ **Processing component for files that are mostly stable between runs** — very beneficial to memoize, since unchanged files are skipped entirely. We can save the cost of reading file content and processing them when they haven't changed.
- 🤔 **Chunk embedding when file-level memoization is already enabled** — still beneficial, but less so for stable files. The benefit increases for files that change frequently, or when your code evolves (e.g., adding more features per file triggers file-level reprocessing, but unchanged chunks can still skip embedding).

### Controlling change detection scope

Three parameters on `@coco.fn` let you customize how logic changes are detected:

- **`logic_tracking`** — controls the *scope* of automatic logic change detection
- **`version`** — provides explicit manual control over when dependent memos are invalidated
- **`deps`** — declares external values (e.g. a module-level prompt string) as part of the function's logic, so changing them invalidates dependent memos

These parameters control logic fingerprinting. Data fingerprinting (for arguments, `deps` values, and context values) is controlled by the objects themselves (see [Memoization Keys & States](../advanced_topics/memoization_keys)).

#### `logic_tracking`

The `logic_tracking` parameter controls whether and how logic changes are detected:

- **`"full"` (default):** Track this function's logic AND all transitively called `@coco.fn` functions' logic. A change anywhere in the call chain invalidates dependent memos.
- **`"self"`:** Track only this function's own logic. Changes in called functions do not propagate through this function.
- **`None`:** Don't track this function's logic at all. Logic changes to this function are invisible to the change detection system.
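The three modes can be pictured as different ways of combining fingerprints along the call graph. In this sketch, `FnNode` and `fingerprint` are hypothetical, the call graph is written out by hand rather than discovered automatically, and treating `None` as a constant fingerprint is an assumption made for illustration:

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FnNode:
    name: str
    source: str                                   # stand-in for the function's code
    callees: list["FnNode"] = field(default_factory=list)
    logic_tracking: Optional[str] = "full"        # "full", "self", or None


def fingerprint(node: FnNode) -> str:
    h = hashlib.sha256()
    if node.logic_tracking is None:
        return h.hexdigest()                      # logic is invisible: constant value
    h.update(node.source.encode())
    if node.logic_tracking == "full":
        for callee in node.callees:
            h.update(fingerprint(callee).encode())
    return h.hexdigest()


bar = FnNode("bar", "return x + 1")
foo_full = FnNode("foo", "return bar(x) * 2", callees=[bar], logic_tracking="full")
foo_self = FnNode("foo", "return bar(x) * 2", callees=[bar], logic_tracking="self")

before = (fingerprint(foo_full), fingerprint(foo_self))
bar.source = "return x + 2"                       # edit the callee's logic
after = (fingerprint(foo_full), fingerprint(foo_self))
print(before[0] != after[0], before[1] == after[1])
```

Editing `bar` changes the fingerprint of the `"full"` caller but leaves the `"self"` caller untouched.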

#### `version`

The `version` parameter lets you explicitly invalidate dependent memos by bumping an integer:

```python
@coco.fn(version=2)
def process_chunk(chunk: Chunk) -> list[float]:
    # Bumping version invalidates all memoized callers, even if code looks the same
    return embed(chunk.text)
```

#### `deps`

The `deps` parameter declares external value(s) the function logic depends on but that aren't visible in its body — for example a prompt string or a model identifier defined at module scope. When the value changes, the function's logic fingerprint changes and dependent memos are invalidated, exactly as if the function body had been edited.

```python
SYSTEM_PROMPT = "You are a helpful assistant. Be concise."

@coco.fn(memo=True, deps=SYSTEM_PROMPT)
def summarize(text: str) -> str:
    # Editing SYSTEM_PROMPT invalidates this function's memo
    # (and propagates to memoized callers) just like a logic change would.
    return call_llm(SYSTEM_PROMPT, text)
```

For multiple dependencies, pass a tuple or dict:

```python
SYSTEM_PROMPT = "..."
MODEL = "claude-haiku-4-5"

@coco.fn(memo=True, deps={"prompt": SYSTEM_PROMPT, "model": MODEL})
def summarize(text: str) -> str:
    return call_llm(SYSTEM_PROMPT, text, model=MODEL)
```

The value is canonicalized through the [memoization-key pipeline](../advanced_topics/memoization_keys), which honors `__coco_memo_key__()`, registered memo key functions, and the standard handling for primitives, dataclasses, and Pydantic models.

**Caution — Snapshotted at decoration time**
`deps` is evaluated **once** when the decorator is applied (typically at module import), not re-evaluated per call. For per-call or per-instance values — instance attributes in a bound method, request-scoped config, anything that changes at runtime — pass them as regular function arguments instead, so the memoization layer observes each new value.

`deps` requires `logic_tracking` to be enabled; combining `deps=<value>` with `logic_tracking=None` raises `ValueError`.
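
The snapshot behavior is ordinary Python decorator semantics, and a plain-Python sketch can show the pitfall. `toy_fn` below is a hypothetical decorator that only records its `deps` argument; it is not `@coco.fn` and does no memoization.

```python
from typing import Any, Callable


def toy_fn(deps: Any) -> Callable:
    """Hypothetical decorator that records `deps` at the moment it is applied."""

    def decorate(func: Callable) -> Callable:
        func.deps_snapshot = deps   # captured once, at decoration time
        return func

    return decorate


SYSTEM_PROMPT = "Be concise."


@toy_fn(deps=SYSTEM_PROMPT)
def summarize(text: str) -> str:
    # The body reads the module-level name each time it is called.
    return f"{SYSTEM_PROMPT} {text}"


SYSTEM_PROMPT = "Be verbose."   # rebinding at runtime, after decoration

snapshot = summarize.deps_snapshot   # still the value seen by the decorator
output = summarize("hi")             # uses the new value
```

The recorded snapshot and the value the body actually uses have drifted apart, which is why runtime-varying values belong in the function's arguments.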

#### Common patterns

**Tip — `@coco.fn` on non-memoized helpers**
You can — and often should — decorate helpers with `@coco.fn` even without `memo=True`. The decorator's job is to make the function's logic *visible* to the change detection system. A bare helper is invisible: editing it will not invalidate any memoized caller, leading to silently stale results. Add `@coco.fn` so its fingerprint propagates; only add `memo=True` if caching its return value is worth it.

These parameters can be set on any `@coco.fn` function — not just memoized ones.

**Fully automatic (default)** — use `logic_tracking="full"` (or omit it) without setting `version`. Any logic change in the function or its callees invalidates dependent memos. This always just works.

```python
@coco.fn
async def process_file(file: FileLike) -> list[Chunk]:
    # Any change here or in called @coco.fn functions invalidates dependent memos
    text = await file.read_text()
    return split_and_embed(text)
```

**Manual, precise control** — use `logic_tracking="self"` with `version`. You decide what counts as a behavior change by bumping `version`, without being affected by implementation detail changes (performance optimizations, logging tweaks, refactoring, etc.).

```python
@coco.fn(logic_tracking="self", version=3)
async def process(data: str) -> str:
    # Bump version when behavior changes (e.g., new output format).
    # Internal refactors or logging changes won't trigger reprocessing.
    return await transform(data)
```

**Opt out of tracking** — use `logic_tracking=None` for functions with a stable contract (where logic changes don't affect output), or functions whose changes don't affect behavior (e.g., logging, performance hints). This prevents unnecessary reprocessing when only internals change.

```python
@coco.fn(logic_tracking=None)
def embed(text: str) -> list[float]:
    # Contract is stable: same input always produces the same embedding.
    # Internal changes (e.g., switching backends) are handled by version bumps.
    return model.encode(text)
```

**Note**
[Context changes](./context#change-detection) are independent of `@coco.fn` and `logic_tracking`. Even with `logic_tracking=None`, a change in a change-detected context value still invalidates dependent memos, because context tracking is done by `use_context()`, not by the decorator.

### Debugging unexpected re-runs

If a memoized function is re-running when you expect a cache hit, walk through the inputs in order:

1. **Logic** — did the function's source code change? Did any `@coco.fn` it transitively calls change? Was a `version` bumped? Was a `deps` value edited?
2. **Inputs** — are the arguments byte-identical to the previous call (after canonicalization)? Custom types with unstable equality are a common source of spurious invalidation — see [Memoization Keys & States](../advanced_topics/memoization_keys).
3. **Context** — did any [`detect_change=True`](./context#change-detection) context value read during the previous invocation change? Remember this propagates transitively through nested `@coco.fn` calls.

If none of those changed and the function still re-runs, the most common cause is a non-stable fingerprint on a custom-typed argument or context value — define `__coco_memo_key__` to make it deterministic.
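
As an illustration of what "deterministic" means here, the sketch below compares a type with no stable key against one that defines `__coco_memo_key__`. It assumes the method takes no arguments and returns a value built from stable fields. `toy_fingerprint` and its `repr()` fallback are hypothetical stand-ins for an unstable default, not CocoIndex's real pipeline.

```python
import hashlib


class OpaqueConfig:
    """No custom key: the toy fallback below uses repr(), which embeds id()."""

    def __init__(self, model: str, temperature: float) -> None:
        self.model = model
        self.temperature = temperature


class KeyedConfig(OpaqueConfig):
    def __coco_memo_key__(self) -> tuple:
        # Only stable, value-based fields go into the key.
        return (self.model, self.temperature)


def toy_fingerprint(value: object) -> str:
    """Hypothetical fingerprint function, for illustration only."""
    key_fn = getattr(value, "__coco_memo_key__", None)
    key = key_fn() if key_fn is not None else repr(value)
    return hashlib.sha256(repr(key).encode()).hexdigest()


# Two equal-looking objects without a stable key fingerprint differently.
a, b = OpaqueConfig("small", 0.0), OpaqueConfig("small", 0.0)
opaque_stable = toy_fingerprint(a) == toy_fingerprint(b)

# With a value-based key, equal content gives an equal fingerprint.
c, d = KeyedConfig("small", 0.0), KeyedConfig("small", 0.0)
keyed_stable = toy_fingerprint(c) == toy_fingerprint(d)
```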

Conversely, if a memo is hitting when you expect invalidation, common causes are:

- A logic change in a helper that is **not** decorated with `@coco.fn`. Add `@coco.fn` to it (no `memo=` needed) so its logic participates in propagation — see [Common patterns](#common-patterns).
- A `deps` value that you changed at runtime: `deps` is [snapshotted at decoration time](#deps), so per-instance or per-request values must be passed as regular arguments instead.
- The function uses [`logic_tracking=None`](#logic_tracking), which opts it out of code-change detection entirely.

### Customizing data fingerprinting

By default, CocoIndex fingerprints function arguments, `deps` values, and context values automatically for most types — primitives, containers, dataclasses, Pydantic models, and picklable objects. For custom types, or when you need multi-level validation (e.g., check mtime first, then content hash), see [Memoization Keys & States](../advanced_topics/memoization_keys).

For per-function overrides — excluding an argument from the memo key, or transforming it just for this function — pass `memo_key={...}` on `@coco.fn` / `@coco.fn.as_async`; see [Override at the call site with `memo_key=`](../advanced_topics/memoization_keys#override-at-the-call-site-with-memo_key).

## Execution capabilities

The following capabilities control *how* the function executes, independent of change detection and memoization.

### Async adapter

Use `@coco.fn.as_async` when you need an **async** interface for a function whose underlying implementation is sync. This is useful for compute-intensive leaf functions, and it is how a sync function gains the async interface that [batching](#batching) and a [runner](#runner) require.

```python
@coco.fn.as_async
def embed(text: str) -> list[float]:
    return model.encode([text])[0]

# External usage: always async, even though the function body is sync
embedding = await embed("hello world")
```

`@coco.fn.as_async` is equivalent to wrapping the function in `asyncio.to_thread()` — the sync function runs on a thread pool and doesn't block the event loop.
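
The equivalence can be sketched with the standard library alone. `toy_as_async` is a hypothetical decorator that only does the `asyncio.to_thread()` wrapping; it has none of the change detection or memoization that `@coco.fn.as_async` adds.

```python
import asyncio
import functools
import threading
from typing import Callable, Tuple


def toy_as_async(func: Callable) -> Callable:
    """Hypothetical adapter: expose a sync function through an async interface."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        # The sync body runs on a worker thread, so the event loop stays free.
        return await asyncio.to_thread(func, *args, **kwargs)

    return wrapper


@toy_as_async
def embed(text: str) -> Tuple[int, bool]:
    on_main_thread = threading.current_thread() is threading.main_thread()
    return len(text), on_main_thread


length, ran_on_main_thread = asyncio.run(embed("hello world"))
```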

You can also call any `@coco.fn`-decorated function asynchronously via the `.as_async()` method, without changing its primary signature:

```python
@coco.fn
def expensive_fn(data: bytes) -> bytes:
    return process(data)

# Primary call is sync:
result = expensive_fn(data)

# Async call via .as_async():
result = await expensive_fn.as_async(data)
```

### Batching

With `batching=True`, multiple concurrent calls to the function are automatically batched together. This is useful for operations that are more efficient when processing multiple inputs at once, such as embedding models.

Batching requires an async interface. If the underlying function is sync, use `@coco.fn.as_async(batching=True)`. If the underlying function is already `async def`, `@coco.fn(batching=True)` works directly.

When batching is enabled:

- The function implementation receives a `list[T]` and returns a `list[R]`
- The external signature becomes `async T -> R` (single input, single output)
- Concurrent calls are collected and processed together

```python
@coco.fn.as_async(batching=True, max_batch_size=32)
def embed(texts: list[str]) -> list[list[float]]:
    # Called with a batch of texts, returns a batch of embeddings
    return model.encode(texts)

# External usage: async, single input, single output
embedding = await embed("hello world")  # Returns list[float]

# Calls issued concurrently (here via asyncio.gather) are batched automatically
embeddings = await asyncio.gather(
    embed("text1"),
    embed("text2"),
    embed("text3"),
)
```

The `max_batch_size` parameter limits how many inputs can be processed in a single batch.
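
To show how single-item calls can turn into batched execution, here is a minimal asyncio sketch. `ToyBatcher` is a hypothetical class written for this illustration; CocoIndex's actual batching strategy is not described here and may differ.

```python
import asyncio
from typing import Any, Callable, List, Tuple


class ToyBatcher:
    """Hypothetical micro-batcher; not CocoIndex's implementation."""

    def __init__(self, batch_fn: Callable[[list], list], max_batch_size: int) -> None:
        self._batch_fn = batch_fn
        self._max_batch_size = max_batch_size
        self._pending: List[Tuple[Any, asyncio.Future]] = []
        self._flush_scheduled = False
        self.batch_sizes: List[int] = []   # recorded for the demo only

    async def __call__(self, item: Any) -> Any:
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        self._pending.append((item, future))
        if not self._flush_scheduled:
            # Defer the flush so calls already scheduled on the loop can join.
            self._flush_scheduled = True
            loop.call_soon(self._flush)
        return await future

    def _flush(self) -> None:
        self._flush_scheduled = False
        pending, self._pending = self._pending, []
        # Split what was collected into groups of at most max_batch_size.
        for start in range(0, len(pending), self._max_batch_size):
            group = pending[start : start + self._max_batch_size]
            results = self._batch_fn([item for item, _ in group])
            self.batch_sizes.append(len(group))
            for (_, future), result in zip(group, results):
                future.set_result(result)


# The implementation takes a list; callers pass one item at a time.
embed = ToyBatcher(lambda texts: [len(t) for t in texts], max_batch_size=2)


async def main() -> list:
    return await asyncio.gather(embed("a"), embed("bb"), embed("ccc"))


lengths = asyncio.run(main())
```

Three concurrent calls with a batch limit of two end up as one batch of two and one batch of one, while each caller still receives its own single result.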

**Tip — When to use batching**

Batching is beneficial when:

- The underlying operation has significant per-call overhead (e.g., GPU kernel launch)
- The operation can process multiple inputs more efficiently than one at a time
- You have concurrent calls from multiple coroutines

Common use cases:

- **Embedding models** — most embedding APIs and models are optimized for batch processing
- **LLM inference** — batch multiple prompts together for better GPU utilization
- **Database operations** — batch inserts or lookups

### Runner

The `runner` parameter allows functions to execute in a specific context, such as a dedicated GPU runner that serializes GPU workloads.

Like batching, a runner requires an async interface. If the underlying function is sync, use `@coco.fn.as_async(runner=...)` to make it async. If the underlying function is already `async def`, `@coco.fn(runner=...)` works directly.

```python
@coco.fn.as_async(runner=coco.GPU)
def gpu_inference(data: bytes) -> bytes:
    # Runs with GPU serialization
    return model.predict(data)

# External usage: async
result = await gpu_inference(data)
```

The `coco.GPU` runner:

- Runs in-process by default, with all functions sharing one queue for serial execution
- Runs sync functions on a dedicated GPU thread to avoid blocking the event loop
- Runs in a subprocess for GPU memory isolation when the environment variable `COCOINDEX_RUN_GPU_IN_SUBPROCESS=1` is set
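
The in-process behavior (a shared queue, serial execution, a dedicated thread) resembles a single-worker executor. The sketch below uses only the standard library to show that shape; the executor, counters, and function names are hypothetical, and this is not how `coco.GPU` is implemented.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# One worker thread: submitted jobs queue up and run strictly one at a time.
_gpu_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="toy-gpu")

_lock = threading.Lock()
_active = 0
max_active = 0   # highest number of jobs observed running at once


def gpu_inference(x: int) -> int:
    global _active, max_active
    with _lock:
        _active += 1
        max_active = max(max_active, _active)
    time.sleep(0.01)   # stand-in for real GPU work
    with _lock:
        _active -= 1
    return x * 2


async def run_on_gpu(x: int) -> int:
    loop = asyncio.get_running_loop()
    # The event loop is not blocked while the worker thread is busy.
    return await loop.run_in_executor(_gpu_executor, gpu_inference, x)


async def main() -> list:
    return await asyncio.gather(*(run_on_gpu(i) for i in range(4)))


results = asyncio.run(main())
_gpu_executor.shutdown()
```

Even though four calls are issued concurrently, at most one job runs at a time.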

You can combine batching with a runner:

```python
@coco.fn.as_async(batching=True, max_batch_size=16, runner=coco.GPU)
def batch_gpu_embed(texts: list[str]) -> list[list[float]]:
    # Batched execution with GPU serialization
    return gpu_model.encode(texts)

# External usage: async
embedding = await batch_gpu_embed("hello world")

# Concurrent calls
embeddings = await asyncio.gather(
    batch_gpu_embed("text1"),
    batch_gpu_embed("text2"),
    batch_gpu_embed("text3"),
)
```

**Note**
By default, `coco.GPU` runs functions in-process, so no pickling is required. When using subprocess mode (`COCOINDEX_RUN_GPU_IN_SUBPROCESS=1`), the function and all its arguments must be picklable since they are serialized for subprocess execution.

---

# Sharing resources via context

Source: https://cocoindex.io/docs/programming_guide/context/

CocoIndex provides a **context** mechanism for sharing resources across your pipeline. This is useful for database connections, API clients, configuration objects, or any resource that multiple processing components need to access.

## ContextKey

A `ContextKey[T]` is a typed key that identifies a resource. Define keys at module level:

```python
import asyncpg
import cocoindex as coco
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder

# Database connection — no change detection (swapping credentials shouldn't reprocess)
PG_DB = coco.ContextKey[asyncpg.Pool]("text_embedding_db")

# Embedding model — with change detection (switching models should reprocess)
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder", detect_change=True)
```

The type parameter (`asyncpg.Pool`, `SentenceTransformerEmbedder`) enables type checking — when you retrieve the value, your editor knows its type.

### Change detection

By default, context keys have **change detection disabled** — changing the provided value between runs does not automatically invalidate memoized functions that consumed it via `use_context()`. To opt in to change detection, pass `detect_change=True`. When enabled, [context changes](./function#change-detection) are their own category — tracked by `use_context()` at the call site, independent of `@coco.fn`. When a fingerprint changes, any memoized function whose execution involved a `use_context()` call on that key is invalidated.

Use `detect_change=True` for resources that affect computation results — models, configuration objects, etc. This ensures memoized functions re-execute when those values change. Resources that don't affect computation results — database connections, loggers, debug flags, monitoring clients — can use the default (`detect_change=False`).

**Tip**
Change detection is transitive: if function `foo` (memoized) calls function `bar`, and `bar` calls `use_context(key)` on a change-detected key, then `foo`'s memo is also invalidated when the context value changes.

## ContextKey as stable identity

Beyond sharing resources, a `ContextKey` also serves as the **stable identity** of the resource it points to. When you anchor sources or targets to a `ContextKey`, CocoIndex treats *the key itself* — not the underlying value — as the identifier across runs.

This has two consequences:

1. **The underlying value can change without losing tracked state.** Rotating credentials, moving a database, or relocating a directory won't invalidate memoization or managed state, as long as the same `ContextKey` is used.

2. **Renaming a `ContextKey` is a breaking change.** Two different keys are two different resources, even if they point to the same physical backend. Existing tracked state will be treated as orphaned. When migrating code, reuse the previous key name to preserve continuity.
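
Both consequences follow from filing tracked state under the key's name. A toy model, with a hypothetical `tracked_state` dictionary and `run_pipeline` function standing in for CocoIndex's persisted state, might look like this:

```python
from typing import Dict, List

# Hypothetical persisted state, filed under the key *name* only.
tracked_state: Dict[str, Dict[str, str]] = {}


def run_pipeline(key_name: str, backend_url: str, docs: Dict[str, str]) -> List[str]:
    """Return the doc ids that need processing on this run."""
    # backend_url is deliberately ignored: the value behind the key is not identity.
    seen = tracked_state.setdefault(key_name, {})
    todo = [doc_id for doc_id, text in docs.items() if seen.get(doc_id) != text]
    seen.update(docs)
    return todo


docs = {"a.md": "hello", "b.md": "world"}
first_run = run_pipeline("text_embedding_db", "postgres://old-host/db", docs)
after_move = run_pipeline("text_embedding_db", "postgres://new-host/db", docs)
after_rename = run_pipeline("embedding_db_v2", "postgres://new-host/db", docs)
```

Moving the backend under the same key name leaves nothing to redo, while renaming the key starts from empty state.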

**Tip — Naming convention**
Pick a `ContextKey` name that reflects the *logical* role of the resource, not its current address. The name is what CocoIndex persists.

- **Applications**: use any descriptive name — e.g., `"text_embedding_db"`, `"docs_root"`.
- **Libraries**: prefix with your package name and a `/` to avoid collisions with application keys or other libraries — e.g., `"my_library/db"`, `"cocoindex.connectors.postgres/pool"`.

## Providing values

In your [lifespan function](./app#defining-a-lifespan), use `builder.provide()` to make resources available:

```python
from typing import AsyncIterator
from cocoindex.connectors import postgres

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    async with await asyncpg.create_pool(DATABASE_URL) as pool:
        builder.provide(PG_DB, pool)
        builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
        yield
```

The resource is available for the lifetime of the environment. When the lifespan exits (after `yield`), any resource acquired through a context manager, such as the `async with` pool above, is cleaned up automatically.
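
The ordering of setup, use, and cleanup can be checked with a standard-library sketch. `ToyPool` and `toy_lifespan` are hypothetical stand-ins for a connection pool and a lifespan function; `contextlib.asynccontextmanager` is used here only to drive the generator, and is not a claim about how `@coco.lifespan` works internally.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, List

events: List[str] = []


class ToyPool:
    """Hypothetical resource with async setup and teardown."""

    async def __aenter__(self) -> "ToyPool":
        events.append("pool opened")
        return self

    async def __aexit__(self, *exc_info) -> None:
        events.append("pool closed")


@asynccontextmanager
async def toy_lifespan(provided: dict) -> AsyncIterator[None]:
    async with ToyPool() as pool:
        provided["db"] = pool   # analogous to builder.provide(PG_DB, pool)
        yield                   # the environment is alive while suspended here
    # Leaving the `async with` block above is what closes the pool.


async def main() -> None:
    provided: dict = {}
    async with toy_lifespan(provided):
        events.append("pipeline running")


asyncio.run(main())
```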

## Retrieving values

In processing components, use `coco.use_context()` to retrieve provided resources:

```python
@coco.fn
async def process_chunk(chunk: Chunk, table: postgres.TableTarget[DocEmbedding]) -> None:
    # Retrieve the embedder from context
    embedding = await coco.use_context(EMBEDDER).embed(chunk.text)
    table.declare_row(row=DocEmbedding(text=chunk.text, embedding=embedding, ...))
```

Some connectors also accept `ContextKey`s directly as a convenience — for example, `postgres.mount_table_target()` takes a `ContextKey[asyncpg.Pool]` and resolves the connection internally:

```python
@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    # PG_DB is resolved internally by the connector
    table = await postgres.mount_table_target(
        PG_DB,
        table_name="doc_embeddings",
        table_schema=await postgres.TableSchema.from_class(DocEmbedding, primary_key=["id"]),
    )
    # ... mount processing components ...
```

## Complete example

Here's a complete pipeline that uses context to share a database connection and an embedding model across processing components:

```python
from __future__ import annotations

import pathlib
from dataclasses import dataclass
from typing import AsyncIterator, Annotated

import asyncpg
from numpy.typing import NDArray

import cocoindex as coco
from cocoindex.connectors import localfs, postgres
from cocoindex.ops.text import RecursiveSplitter
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
from cocoindex.resources.chunk import Chunk
from cocoindex.resources.file import FileLike, PatternFilePathMatcher
from cocoindex.resources.id import IdGenerator

DATABASE_URL = "postgres://cocoindex:cocoindex@localhost/cocoindex"
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

# 1. Define context keys at module level
PG_DB = coco.ContextKey[asyncpg.Pool]("text_embedding_db")
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder", detect_change=True)

_splitter = RecursiveSplitter()


# 2. Provide values in the lifespan
@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    async with await asyncpg.create_pool(DATABASE_URL) as pool:
        builder.provide(PG_DB, pool)
        builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
        yield


# 3. Use EMBEDDER in type annotations (for vector column schema)
@dataclass
class DocEmbedding:
    id: int
    filename: str
    text: str
    embedding: Annotated[NDArray, EMBEDDER]  # dimension resolved from context


# 4. Retrieve values in processing functions
@coco.fn
async def process_chunk(
    chunk: Chunk,
    filename: pathlib.PurePath,
    id_gen: IdGenerator,
    table: postgres.TableTarget[DocEmbedding],
) -> None:
    table.declare_row(
        row=DocEmbedding(
            id=await id_gen.next_id(chunk.text),
            filename=str(filename),
            text=chunk.text,
            embedding=await coco.use_context(EMBEDDER).embed(chunk.text),
        ),
    )


@coco.fn(memo=True)
async def process_file(
    file: FileLike,
    table: postgres.TableTarget[DocEmbedding],
) -> None:
    text = await file.read_text()
    chunks = _splitter.split(text, chunk_size=2000, chunk_overlap=500, language="markdown")
    id_gen = IdGenerator()
    await coco.map(process_chunk, chunks, file.file_path.path, id_gen, table)


# 5. PG_DB used directly by the connector (resolved internally)
@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    table = await postgres.mount_table_target(
        PG_DB,
        table_name="doc_embeddings",
        table_schema=await postgres.TableSchema.from_class(
            DocEmbedding, primary_key=["id"],
        ),
    )
    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]),
    )
    await coco.mount_each(process_file, files.items(), table)


app = coco.App(
    coco.AppConfig(name="TextEmbedding"),
    app_main,
    sourcedir=pathlib.Path("./markdown_files"),
)
```

## Accessing context outside processing components

If you need to access context values outside of CocoIndex processing components — for example, in query/serving logic that shares resources with your indexing pipeline — use `env.get_context()`:

```python
# Sync API
db = coco.default_env().get_context(PG_DB)
```

```python
# Async API
db = (await coco.default_env()).get_context(PG_DB)
```

This is useful when your application runs both indexing and serving in the same process and you want to initialize shared resources (like database connection pools or configuration) once in the lifespan.

**Note**
`default_env()` starts the environment if it hasn't been started yet, which runs the lifespan function. If you're using an explicit environment, call `get_context()` directly on that environment instance.

---

# Running in live mode

Source: https://cocoindex.io/docs/programming_guide/live_mode/

By default, calling `app.update()` runs in **catch-up mode**: it scans all sources, processes what changed since the last run (memoized components are skipped, so unchanged work is not redone), syncs target states, and returns. Targets are caught up to the moment the run started, and that's it — to pick up further changes, you call `update()` again.

So catch-up mode is already incremental — but each call still has to scan sources to discover what changed, and changes are only picked up when you trigger a new run.

**Live mode** keeps the app running after catch-up finishes and lets components stream changes continuously from their sources (e.g. a file system watcher or a database change feed), applying them to target states with very low latency. This is useful when:

- You want near-real-time reactions to source changes, instead of waiting for the next `update()` call
- Your sources can push changes more efficiently than a full rescan can discover them

Two things are needed for live mode to work: the app must be **enabled** to stay running, and somewhere in the component tree a component must **react** to changes.

## Enabling live mode

Pass `live=True` when updating the app:

```python
app.update_blocking(live=True)

# Or async
handle = app.update(live=True)
await handle.result()
```

From the CLI:

```bash
cocoindex update --live my_app.py
# or
cocoindex update -L my_app.py
```

The `live` flag propagates top-down through the component tree — both `coco.mount()` and `coco.use_mount()` inherit `live` from the parent, so children are live when the app is live.

Without `live=True` on the app, the app runs in catch-up mode — everything completes after the initial scan, even if a source supports live watching.

## Reacting to changes

Enabling live mode keeps the app running, but something in the component tree needs to actually watch for changes. That something is a [**LiveComponent**](../advanced_topics/live_component) — a component with a long-running `process_live()` method that delivers incremental updates.

You rarely need to write a `LiveComponent` manually. The two most common patterns are:

### Sources with `LiveMapView` or `LiveMapFeed`

Source connectors can provide live capabilities via two [protocols](../advanced_topics/live_component#livemapfeed-and-livemapview):

- **`LiveMapView`** — the source has scannable current state (e.g., a directory or database table). It does a full scan first, then watches for changes. Example: [`localfs.walk_dir(live=True).items()`](../connectors/localfs#live-file-watching).
- **`LiveMapFeed`** — the source only streams changes, with no snapshot to scan (e.g., a Kafka consumer). All data arrives via the change stream. Example: [`kafka.topic_as_map()`](../connectors/kafka#as-source).

When `mount_each()` receives either, it automatically creates a `LiveComponent` internally that:

1. **Scans current state** (if available) — iterates all items and mounts a processing component for each
2. **Signals readiness** — the initial scan is complete (or the stream has caught up), target states are synced
3. **Watches for changes** — the source delivers incremental updates:
   - New or modified items → re-mount the affected component
   - Deleted items → remove the component and its target states

CocoIndex handles change detection, memoization, and target state reconciliation the same way as in catch-up mode.

Without live support on the source, `mount_each()` falls back to catch-up behavior — a one-time iteration over items and that's it.
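
The three phases above can be sketched as a small asyncio loop. This is a toy model under stated assumptions: `toy_live_component`, the `DELETED` marker, and the `None` shutdown sentinel are hypothetical, and a plain dictionary stands in for mounted components and their target states.

```python
import asyncio

DELETED = object()   # hypothetical marker for "this key was removed"


async def toy_live_component(snapshot: dict, changes: asyncio.Queue,
                             ready: asyncio.Event, mounted: dict) -> None:
    # 1. Scan current state: mount one component per existing item.
    for key, value in snapshot.items():
        mounted[key] = value
    # 2. Signal readiness: the initial scan is complete.
    ready.set()
    # 3. Watch for changes until the demo sends a None sentinel.
    while (change := await changes.get()) is not None:
        key, value = change
        if value is DELETED:
            mounted.pop(key, None)   # deleted item: drop its component
        else:
            mounted[key] = value     # new or modified item: mount again


async def main() -> tuple:
    mounted: dict = {}
    changes: asyncio.Queue = asyncio.Queue()
    ready = asyncio.Event()
    task = asyncio.create_task(
        toy_live_component({"a.md": "v1", "b.md": "v1"}, changes, ready, mounted)
    )
    await ready.wait()
    at_ready = dict(mounted)   # what a catch-up run would have produced
    for change in [("b.md", "v2"), ("a.md", DELETED), ("c.md", "v1"), None]:
        changes.put_nowait(change)
    await task
    return at_ready, mounted


state_at_ready, final_state = asyncio.run(main())
```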

### Periodic refresh with `coco.auto_refresh`

When the source isn't change-aware but you still want fresh data — say, polling a REST endpoint or re-reading a database table that doesn't emit change events — wrap your processor function in [`coco.auto_refresh`](../advanced_topics/live_component#example-periodic-refresh-with-cocoauto_refresh). It runs your function once, signals readiness, then re-runs it on a fixed delay:

```python
import datetime
import cocoindex as coco

async def sync_users(db, target) -> None:
    rows = await db.fetch_all_users()
    for row in rows:
        target.declare_row(row=UserRow(...))

@coco.fn
async def app_main(db, target) -> None:
    await coco.mount(
        coco.auto_refresh(sync_users, interval=datetime.timedelta(minutes=5)),
        db, target,
    )

app = coco.App(coco.AppConfig(name="UserSync"), app_main, db=..., target=...)
app.update_blocking(live=True)
```

**Catch-up compatibility:** in catch-up mode (the default), `auto_refresh` runs `sync_users` once and exits, which is observably identical to mounting `sync_users` directly. The interval is ignored, so the same pipeline can run in either catch-up or live mode, chosen at run time.

**Handling deletes:** each cycle's declarations are reconciled against the previous run. If a row disappears from the source table between polls, `sync_users` simply doesn't declare it that cycle — CocoIndex automatically deletes the corresponding target. You don't need to track deletions yourself.
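
The reconciliation idea reduces to a comparison between two sets of declarations. Here is a minimal sketch with a hypothetical `reconcile` helper; the real engine tracks considerably more than a dictionary of rows.

```python
from typing import Dict, List, Tuple


def reconcile(previous: Dict[int, str],
              declared: Dict[int, str]) -> Tuple[Dict[int, str], List[int]]:
    """Compare this cycle's declarations with the previous cycle's."""
    # Rows that are new or whose value changed need to be written.
    upserts = {
        key: value
        for key, value in declared.items()
        if key not in previous or previous[key] != value
    }
    # Rows that were declared before but not now need to be deleted.
    deletes = sorted(previous.keys() - declared.keys())
    return upserts, deletes


cycle_1 = {1: "ann", 2: "bob"}
cycle_2 = {1: "ann", 3: "cy"}   # user 2 disappeared, user 3 appeared

upserts, deletes = reconcile(cycle_1, cycle_2)
```

The processor function never mentions user 2 in the second cycle, yet the comparison still yields a delete for it.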

### Decoupling stages with an in-memory `LiveMap`

When one part of your pipeline produces keyed data and another consumes it — without going through an external system — put an in-memory [`LiveMap`](../common_resources/live_map) between them. Producers declare `(key, value)` entries; the consumer reads it as a `LiveMapView` via `mount_each`, reacting to adds, updates, and deletes just like a live source.

## Examples

### `localfs` — file watching with `LiveMapView`

The [`localfs`](../connectors/localfs) connector supports live mode via `walk_dir(..., live=True)`, which watches for file system changes using `watchfiles`:

```python
@coco.fn
async def app_main(sourcedir: pathlib.Path, outdir: pathlib.Path) -> None:
    files = localfs.walk_dir(
        sourcedir, recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]),
        live=True,  # items() returns a LiveMapView
    )
    await coco.mount_each(process_file, files.items(), outdir)

app = coco.App(coco.AppConfig(name="FilesTransform"), app_main, sourcedir=..., outdir=...)
app.update_blocking(live=True)
```

**Catch-up compatibility:** `LiveMapView` sources also work without `live=True` — they do the initial full scan and exit cleanly. You can write your pipeline once and choose catch-up or live at run time.

For a complete working example, see [`files_transform`](https://github.com/cocoindex-io/cocoindex/tree/main/examples/files_transform).

### `kafka` — consuming a topic with `LiveMapFeed`

The [`kafka`](../connectors/kafka) connector treats a topic as a live keyed map — each message is an upsert or delete for a key. Since there's no snapshot to scan, it returns a `LiveMapFeed`:

```python
from confluent_kafka.aio import AIOConsumer
from cocoindex.connectors import kafka

@coco.fn
async def app_main() -> None:
    consumer = AIOConsumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "my-group",
        "enable.auto.commit": "false",
    })
    items = kafka.topic_as_map(consumer, ["my-topic"])
    await coco.mount_each(process_message, items)

app = coco.App(coco.AppConfig(name="KafkaConsumer"), app_main)
app.update_blocking(live=True)
```

## Going deeper

The abstractions behind live mode, from most general to most specific:

- **[LiveComponent](../advanced_topics/live_component)** — the underlying protocol for components that react to changes incrementally. Most flexible — full control over the lifecycle.
- **[LiveMapFeed / LiveMapView](../advanced_topics/live_component#livemapfeed-and-livemapview)** — represents a changing collection of keyed items. `mount_each()` uses it to construct a `LiveComponent` automatically. Connector authors implement this to add live support.
- **[`coco.auto_refresh`](../advanced_topics/live_component#example-periodic-refresh-with-cocoauto_refresh)** — wraps a regular processor function as a `LiveComponent` that re-runs on a fixed interval. Use when there's no change-event source.
- **Source connectors** — provide `LiveMapView` (e.g., [`localfs`](../connectors/localfs)) or `LiveMapFeed` (e.g., [`kafka`](../connectors/kafka)) from their source APIs. Users just pass the result to `mount_each()`.

For custom change feeds, fine-grained lifecycle control, or implementing live map protocols on your own connector, see [Live Components](../advanced_topics/live_component).

---

# Value serialization for memoization

Source: https://cocoindex.io/docs/programming_guide/serialization/

## Overview

CocoIndex serializes and caches the return values of [memoized functions](./function#memoization) so that unchanged work can be skipped on subsequent runs. Most Python types work automatically — the key thing to get right is the **return type annotation**, which tells CocoIndex how to reconstruct your objects:

```python
@coco.fn(memo=True)
async def process_chunk(chunk: Chunk) -> Embedding:  # return type annotation
    return embed(chunk.text)
```

Without annotations, values may deserialize as basic Python types (`dict`, `list`, `str`, etc.) instead of their original types.
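
The same effect is easy to reproduce with any serialization format that stores plain data. The sketch below uses `json` from the standard library as a stand-in, not the format CocoIndex uses, and a hypothetical `Embedding` dataclass.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Embedding:
    model: str
    vector: List[float]


# What a cache stores is plain data; the class itself is not in the payload.
cached = json.dumps(asdict(Embedding("toy-model", [0.1, 0.2])))

# Without a target type, decoding can only produce basic Python types.
untyped = json.loads(cached)

# With the type known, the original object can be rebuilt.
typed = Embedding(**json.loads(cached))
```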

**Info — Advanced: other places where serialization and type annotations matter**
Serialization also applies to [memo states](../advanced_topics/memoization_keys#memo-state-validation) and [tracking records](../advanced_topics/custom_target_connector#targethandler-you-implement). If you're implementing these, add type annotations to:

- **`__coco_memo_state__` `prev_state` parameter** — annotate with the state type you return in `MemoStateOutcome(state=...)`. See [Memo state validation](../advanced_topics/memoization_keys#memo-state-validation).
- **`reconcile()` `prev_possible_records` parameter** — annotate with `Collection[YourTrackingRecord]`. See [Custom Target Connector](../advanced_topics/custom_target_connector#targethandler-you-implement).

## Supported types

The following types all work out of the box — no registration needed:

| Category | Types |
|----------|-------|
| **Primitives** | `bool`, `int`, `float`, `str`, `bytes`, `None` |
| **Collections** | `list`, `tuple`, `dict`, `set`, `frozenset` |
| **Dataclasses** | Any `@dataclass` (including frozen) |
| **NamedTuples** | Any `NamedTuple` subclass |
| **Pydantic models** | Any `pydantic.BaseModel` subclass |
| **msgspec Structs** | Any `msgspec.Struct` subclass |
| **Date/time** | `datetime.datetime`, `datetime.date`, `datetime.time`, `datetime.timedelta`, `datetime.timezone` |
| **Other stdlib** | `uuid.UUID`, `complex`, `pathlib.Path`, `pathlib.PurePath` |
| **NumPy** | `numpy.ndarray`, `numpy.dtype` (when numpy is installed) |

More generally, all types [supported by msgspec](https://msgspec.dev/supported-types) work automatically. These types also work when nested inside collections or other structured types.

## Custom types

If your type isn't in the list above, register it with `@coco.serialize_by_pickle`:

```python
import cocoindex as coco

@coco.serialize_by_pickle
class MySpecialType:
    def __init__(self, data):
        self.data = data
```

For third-party types, call it as a regular function:

```python
import cocoindex as coco
from some_library import SomeType

coco.serialize_by_pickle(SomeType)
```

**Caution — Not for dataclasses, NamedTuples, or `msgspec.Struct`**
Don't apply `@coco.serialize_by_pickle` to dataclasses, NamedTuples, or `msgspec.Struct` — these are already supported natively. Applying it only works at the top level; when nested inside another supported type, the native encoding takes precedence and the decorator has no effect.

If serialization fails because of a problematic *field* inside a dataclass, register that field's type with `@coco.serialize_by_pickle` instead.

## Union types

Unions of a custom type with `None` work fine (`MyDataclass | None`). However, unions involving multiple custom types or a custom type with other non-`None` types require tagged `msgspec.Struct` variants.

For example, this **won't work**:

```python
from dataclasses import dataclass
from typing import NamedTuple

@dataclass
class Config:
    value: int

class Settings(NamedTuple):
    config: Config | str  # fails at deserialization
```

**Fix** — wrap each variant in a tagged [`msgspec.Struct`](https://msgspec.dev/structs#tagged-unions). The `tag=True` parameter embeds a type tag in the serialized data so that the correct variant can be identified during deserialization:

```python
from typing import NamedTuple

import msgspec

class ConfigValue(msgspec.Struct, tag=True):
    value: int

class StringValue(msgspec.Struct, tag=True):
    value: str

class Settings(NamedTuple):
    config: ConfigValue | StringValue  # works — variants are distinguished by tag
```
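The role of the tag can be sketched with the standard library alone. This illustrates the tagged-union idea and is not msgspec's actual wire format: the `encode`/`decode` helpers and the `VARIANTS` registry are hypothetical, and plain dataclasses stand in for `msgspec.Struct`.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ConfigValue:
    value: int


@dataclass
class StringValue:
    value: str


# Hypothetical registry: tag (class name) -> variant class.
VARIANTS = {cls.__name__: cls for cls in (ConfigValue, StringValue)}


def encode(obj) -> str:
    # Store the variant name next to the fields so the decoder can pick a class.
    return json.dumps({"type": type(obj).__name__, **asdict(obj)})


def decode(payload: str):
    data = json.loads(payload)
    cls = VARIANTS[data.pop("type")]  # the tag selects the variant
    return cls(**data)


restored = decode(encode(StringValue("hello")))
```

Both variants expose a field named `value`, so the tag is what lets the decoder choose the right class without inspecting the field contents.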

## Troubleshooting

### `DeserializationError: Cannot build msgspec Decoder`

This typically means an unsupported union type. The error message includes a hint about the cause.

**Fix**: Restructure the union to use tagged `msgspec.Struct` variants. See [Union types](#union-types) above.

### `DeserializationError: Failed to deserialize msgspec payload`

The type annotation doesn't match the serialized data. Common causes:

- **Missing return type annotation** on a memoized function — add `-> YourType` to the function signature.
- **Changed type structure** between runs — if you renamed or restructured a dataclass, the cached data won't match. Rebuild the cache by running [`app.update(full_reprocess=True)`](./app#updating-an-app) or [`cocoindex update --full-reprocess`](../cli).
- **Forward reference not resolved** — if your type annotation uses a string forward reference, ensure the type is defined before the function is first called.

### `UnpicklingError: Forbidden global during unpickling`

```
_pickle.UnpicklingError: Forbidden global during unpickling: myapp.models.Summary
```

CocoIndex restricts which types can be deserialized for security. This error means your type isn't in the allow-list. Fix by either:

1. **Converting to a dataclass or NamedTuple** (recommended — supported natively, no registration needed)
2. **Using `@coco.serialize_by_pickle`** to register the type

**Note — Upgrading from older versions**
If you see this error after upgrading, previously cached data may reference types that aren't yet registered. You have two options:

- Add `@coco.serialize_by_pickle` to the type and re-run.
- If the type is already a dataclass or NamedTuple, add `@coco.unpickle_safe` to allow reading the old cached data. Once the cache is rebuilt, the decorator can be removed.

---

# Python SDK overview

Source: https://cocoindex.io/docs/programming_guide/sdk_overview/

This document provides an overview of the CocoIndex Python SDK organization and how async and sync APIs work together.

## Package organization

The CocoIndex SDK is organized into several modules:

### Core package

| Package | Description |
|---------|-------------|
| `cocoindex` | All core APIs — async by default, sync variants have a `_blocking` suffix |

### Sub-packages

| Package | Description |
|---------|-------------|
| `cocoindex.connectors` | Connectors for data sources and targets |
| `cocoindex.resources` | [Common resources](../common_resources/data_types) — shared data types, vector schema annotations, and ID generation utilities |
| `cocoindex.ops` | Built-in operations for common data processing tasks (e.g., text splitting, embedding with SentenceTransformers) |

Import connectors and extras by their specific sub-module:

```python
from cocoindex.connectors import localfs, postgres
from cocoindex.ops.text import RecursiveSplitter
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
from cocoindex.resources.file import FileLike, PatternFilePathMatcher
from cocoindex.resources.chunk import Chunk
```

## Common types

### StableKey

`StableKey` is a type alias defining what values can be used when creating component paths via `coco.component_subpath()`:

```python
StableKey = None | bool | int | str | bytes | uuid.UUID | Symbol | tuple[StableKey, ...]
```

Common examples include strings (like `"setup"` or `"table"`), integers, and UUIDs. Tuples allow composite keys when needed.
`Symbol` provides predefined names that will never conflict with strings (which typically come from runtime data).
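
As a rough illustration of what the alias admits, the sketch below checks a value against the `StableKey` shape recursively. It uses only the standard library; the `Symbol` class here is a hypothetical stand-in for CocoIndex's own type, and `is_stable_key` is not part of the SDK.

```python
import uuid


class Symbol:
    """Hypothetical stand-in for CocoIndex's Symbol type."""

    def __init__(self, name: str) -> None:
        self.name = name


def is_stable_key(value: object) -> bool:
    # Scalars allowed by the alias.
    if value is None or isinstance(value, (bool, int, str, bytes, uuid.UUID, Symbol)):
        return True
    # Tuples are allowed when every element is itself a stable key.
    if isinstance(value, tuple):
        return all(is_stable_key(item) for item in value)
    return False


composite_ok = is_stable_key(("docs", 3, uuid.UUID(int=1)))
float_ok = is_stable_key(1.5)
```

Floats, lists, and dicts fall outside the alias, which is why composite keys are written as tuples.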

Each processing component must be mounted at a unique path. See [Processing Component](./processing_component) for how the component path tree affects target states and ownership.

## Async vs sync APIs

CocoIndex's API is **async-first**. The APIs fall into three categories:

### Orchestration APIs (async only)

The APIs that shape your pipeline are async:

`mount()`, `use_mount()`, `mount_each()`, `mount_target()`, `map()`

### Entry-point APIs (async + sync)

APIs for starting and running your pipeline have both async and sync variants. Sync variants use a `_blocking` suffix:

| Async | Sync (blocking) |
|-------|-----------------|
| `await app.update(...)` | `app.update_blocking(...)` |
| `await app.drop(...)` | `app.drop_blocking(...)` |
| `await coco.start()` | `coco.start_blocking()` |
| `await coco.stop()` | `coco.stop_blocking()` |
| `async with coco.runtime():` | `with coco.runtime():` |

`app.update()` returns an `UpdateHandle` that is also awaitable — `await app.update()` returns the result directly, or you can use the handle for [progress monitoring](../advanced_topics/progress_monitoring). Use the `_blocking` variants for scripts and CLI usage. See [App](./app) for details.

### Processing functions (your choice)

The `@coco.fn` decorator preserves the sync/async nature of your function — your processing functions can be sync or async. See [Function](./function) for details.

## How sync and async work together

Like any async Python program, **async functions can call into sync code, but not the other way around**. In practice, this means higher-level functions (orchestration) tend to be async, while leaf functions (the actual computation) can be sync.

CocoIndex provides two ways for async code to call into sync functions:

- **Mounting** — When you mount a processing component, the function is scheduled on CocoIndex's runtime, not called directly. So an async function can mount a sync processing function.
- **`@coco.fn.as_async`** — Wraps a sync function with an async interface (runs on a thread pool). Useful for compute-intensive leaf functions. See [Function](./function) for details.
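
The thread-pool idea can be pictured with plain `asyncio`. This is an analogy built on the standard library, not CocoIndex's implementation; `fingerprint` and `orchestrate` are hypothetical names.

```python
import asyncio
import hashlib


def fingerprint(data: bytes) -> str:
    # Sync leaf function: pure computation, no awaits.
    return hashlib.sha256(data).hexdigest()


async def orchestrate(payloads: list[bytes]) -> list[str]:
    # asyncio.to_thread runs each sync call on a worker thread,
    # so the event loop stays free while the computation runs.
    return await asyncio.gather(
        *(asyncio.to_thread(fingerprint, payload) for payload in payloads)
    )


results = asyncio.run(orchestrate([b"a", b"b"]))
```

The async caller awaits a sync function without blocking the loop, which is the same direction of call that the two mechanisms above enable.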

### Example: async orchestration mounting sync leaf functions

A typical pipeline has an async main function that orchestrates the pipeline, while leaf functions that do the actual computation can be sync:

```python
import pathlib

import cocoindex as coco
from cocoindex.connectors import localfs
from cocoindex.resources.file import PatternFilePathMatcher
from docling.document_converter import DocumentConverter

_converter = DocumentConverter()

@coco.fn(memo=True)
def process_file(file: localfs.File, outdir: pathlib.Path) -> None:
    # Sync leaf function — does the actual computation
    markdown = _converter.convert(
        file.file_path.resolve()
    ).document.export_to_markdown()
    outname = file.file_path.path.stem + ".md"
    localfs.declare_file(outdir / outname, markdown, create_parent_dirs=True)

@coco.fn
async def app_main(sourcedir: pathlib.Path, outdir: pathlib.Path) -> None:
    # Async — orchestrates the pipeline, mounts child components
    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.pdf"]),
    )
    await coco.mount_each(process_file, files.items(), outdir)

app = coco.App("PdfToMarkdown", app_main,
               sourcedir=pathlib.Path("./pdf_files"), outdir=pathlib.Path("./out"))
```

Here `app_main` is async because it uses mounting APIs (`mount_each`), while `process_file` is sync because it only does computation. The sync `process_file` is mounted as a child component — mounting schedules it on CocoIndex's runtime, so the async parent can mount a sync child without issues.

## Running an app

Run the app with either an async or sync entry point:

```python
import asyncio

# Async entry point
async def main():
    await app.update(report_to_stdout=True)

asyncio.run(main())
```

```python
# Sync entry point (scripts, CLI)
app.update_blocking(report_to_stdout=True)
```

---

# Common data types

Source: https://cocoindex.io/docs/common_resources/data_types/

The `cocoindex.resources` package provides common data models and abstractions shared across connectors and built-in operations. Connectors provide concrete implementations — for example, `localfs.File` implements `FileLike`, and `localfs.FilePath` extends `FilePath`. See individual [connector docs](../connectors/localfs) for connector-specific details.

## File

The file module (`cocoindex.resources.file`) defines base classes and utilities for working with file-like objects.

### FileLike

`FileLike` is a base class for file objects with async read methods. Each connector provides its own subclass (e.g., `localfs.File`, `amazon_s3.S3File`).

```python
from cocoindex.resources.file import FileLike

async def process_file(file: FileLike) -> str:
    text = await file.read_text()
    ...
    return text
```

**Properties:**

- `file_path` — A `FilePath` object representing the file's path. Access the relative path via `file_path.path` (`PurePath`).

**Methods:**

- `async size()` — Return the file size in bytes.
- `async read(size=-1)` — Read file content as bytes. Pass `size` to limit bytes read.
- `async read_text(encoding=None, errors="replace")` — Read as text. Auto-detects encoding via BOM if not specified.

**Memoization:**

`FileLike` objects provide a memoization key based on `file_path` (file identity). When used as arguments to a [memoized function](../programming_guide/function#memoization), CocoIndex uses a two-level validation: it checks the modification time first (cheap), then computes a content fingerprint only if the modification time has changed. This means touching a file or moving it won't cause unnecessary recomputation if the content is unchanged.
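
The two-level check can be sketched with the standard library. This is a minimal illustration of the idea as described above, not CocoIndex's code: `MemoRecord`, `content_fingerprint`, and `needs_recompute` are hypothetical names, and SHA-256 is an assumed fingerprint function.

```python
import hashlib
import os
import tempfile
from dataclasses import dataclass


@dataclass
class MemoRecord:
    mtime_ns: int
    fingerprint: str


def content_fingerprint(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def needs_recompute(path: str, record: MemoRecord) -> bool:
    # Level 1 (cheap): an unchanged modification time means the cache is valid.
    if os.stat(path).st_mtime_ns == record.mtime_ns:
        return False
    # Level 2 (costly): the mtime moved, so compare content fingerprints.
    return content_fingerprint(path) != record.fingerprint


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "doc.txt")
    with open(path, "w") as f:
        f.write("hello")
    record = MemoRecord(os.stat(path).st_mtime_ns, content_fingerprint(path))

    # "Touch": new modification time, same content.
    os.utime(path, ns=(1_000_000_000, 1_000_000_000))
    touched = needs_recompute(path, record)

    # Real edit: new modification time and new content.
    with open(path, "w") as f:
        f.write("changed")
    os.utime(path, ns=(2_000_000_000, 2_000_000_000))
    edited = needs_recompute(path, record)
```

In this sketch a touched file reaches the fingerprint comparison and is found unchanged, while an edited file fails it and triggers recomputation.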

### FilePath

`FilePath` is a base class that combines a **base directory** (with a stable key) and a **relative path**. This enables stable memoization even when the entire directory tree is moved to a different location.

```python
from cocoindex.resources.file import FilePath
```

Each connector provides its own `FilePath` subclass (e.g., `localfs.FilePath`). The base class defines the common interface.

**Properties:**

- `base_dir` — An object that holds the base directory. Its key is used for stable memoization.
- `path` — The path relative to the base directory (`PurePath`).

**Methods:**

- `resolve()` — Resolve to the full path (type depends on the connector, e.g., `pathlib.Path` for local filesystem).

**Path Operations:**

`FilePath` supports most `pathlib.PurePath` operations:

```python
# Join paths with /
config_path = source_dir / "config" / "settings.json"

# Access path properties
config_path.name      # "settings.json"
config_path.stem      # "settings"
config_path.suffix    # ".json"
config_path.parts     # ("config", "settings.json")
config_path.parent    # FilePath pointing to "config/"

# Modify path components
config_path.with_name("other.json")
config_path.with_suffix(".yaml")
config_path.with_stem("config")

# Pattern matching
config_path.match("*.json")  # True

# Convert to POSIX string
config_path.as_posix()  # "config/settings.json"
```

**Memoization:**

`FilePath` provides a memoization key based on `(base_dir.key, path)`. This means:

- Two `FilePath` objects with the same base directory key and relative path have the same memo key
- Moving the entire project directory doesn't invalidate memoization, as long as the same base directory key is used
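
A small standard-library sketch of that key, under the assumption that the base directory's on-disk location is excluded from it. `BaseDir` and `SketchFilePath` are hypothetical stand-ins, not the connector classes.

```python
from dataclasses import dataclass
from pathlib import Path, PurePosixPath


@dataclass(frozen=True)
class BaseDir:
    key: str    # stable key, chosen once
    root: Path  # current location on disk; not part of the memo key


@dataclass(frozen=True)
class SketchFilePath:
    base_dir: BaseDir
    path: PurePosixPath  # relative to the base directory

    def memo_key(self) -> tuple[str, str]:
        return (self.base_dir.key, self.path.as_posix())

    def resolve(self) -> Path:
        return self.base_dir.root / self.path


before = SketchFilePath(BaseDir("docs", Path("/home/alice/project")), PurePosixPath("a/b.md"))
after = SketchFilePath(BaseDir("docs", Path("/mnt/data/project")), PurePosixPath("a/b.md"))
```

Here `before` and `after` resolve to different locations but share a memo key, which is the property that keeps memoization valid after the tree is moved.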

For connector-specific usage (e.g., `register_base_dir`), see the individual connector documentation like [Local File System](../connectors/localfs).

### FilePathMatcher

`FilePathMatcher` is a protocol for filtering files and directories during traversal.

```python
from pathlib import PurePath

from cocoindex.resources.file import FilePathMatcher

class MyMatcher(FilePathMatcher):
    def is_dir_included(self, path: PurePath) -> bool:
        """Return True to traverse this directory."""
        return not path.name.startswith(".")

    def is_file_included(self, path: PurePath) -> bool:
        """Return True to include this file."""
        return path.suffix in (".py", ".md")
```

#### PatternFilePathMatcher

A built-in `FilePathMatcher` implementation using [globset](https://docs.rs/globset/#syntax) patterns:

```python
from cocoindex.resources.file import PatternFilePathMatcher

# Include only Python and Markdown files, exclude tests and hidden dirs
matcher = PatternFilePathMatcher(
    included_patterns=["**/*.py", "**/*.md"],
    excluded_patterns=["**/test_*", "**/.*"],
)
```

**Parameters:**

- `included_patterns` — Glob patterns ([globset](https://docs.rs/globset) syntax) for files to include. Use `**/*.ext` to match at any depth. If `None`, all files are included.
- `excluded_patterns` — Glob patterns ([globset](https://docs.rs/globset) syntax) for files/directories to exclude. Excluded directories are not traversed. A pattern prefixed with `!` negates the exclusion for matching paths (see below).

**Note**
Patterns use [globset](https://docs.rs/globset) semantics: `*.py` matches only in the root directory; use `**/*.py` to match at any depth.

**Gitignore-style negation.** Within `excluded_patterns`, a pattern beginning with `!` *un-excludes* paths that would otherwise be excluded by a preceding pattern. This lets you exclude broadly and then carve out exceptions, instead of enumerating every directory to exclude:

```python
# Exclude all dot-entries, but let .github through
matcher = PatternFilePathMatcher(
    excluded_patterns=[
        "**/.*",           # exclude all dot-files and dot-directories
        "!**/.github/**",  # …except anything under .github
    ]
)
```

Directory traversal honors negations correctly: a directory excluded by a normal pattern is still traversed when a `!` pattern could match its contents, so the kept paths inside it are reachable.

## Chunk

The chunk module (`cocoindex.resources.chunk`) defines types for representing text chunks produced by [text splitters](../ops/text).

### Chunk

A `Chunk` is a frozen dataclass representing a piece of text with its position information in the original document.

```python
from cocoindex.resources.chunk import Chunk
```

**Fields:**

- `text` (`str`) — The text content of the chunk.
- `start` (`TextPosition`) — Start position in the original text.
- `end` (`TextPosition`) — End position in the original text.

### TextPosition

A frozen dataclass representing a position in text.

**Fields:**

- `byte_offset` (`int`) — Byte offset from the start of the text.
- `char_offset` (`int`) — Character offset from the start of the text.
- `line` (`int`) — 1-based line number.
- `column` (`int`) — 1-based column number.
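
To make the difference between the offsets concrete, the sketch below derives all four fields for a character index. It is an illustration only: `Position` and `position_at` are hypothetical, byte offsets are assumed to be UTF-8, and columns are assumed to count characters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    byte_offset: int
    char_offset: int
    line: int
    column: int


def position_at(text: str, char_offset: int) -> Position:
    prefix = text[:char_offset]
    # Lines are 1-based: one more than the newlines seen so far.
    line = prefix.count("\n") + 1
    # Columns are 1-based: characters since the last newline, plus one.
    column = char_offset - (prefix.rfind("\n") + 1) + 1
    # Non-ASCII characters take more than one byte in UTF-8.
    return Position(len(prefix.encode("utf-8")), char_offset, line, column)


pos = position_at("héllo\nwörld", 8)  # points at the "r" on the second line
```

In this input `é` and `ö` each occupy two bytes, so the byte offset (10) runs ahead of the character offset (8).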

**Example:**

```python
from cocoindex.ops.text import RecursiveSplitter

splitter = RecursiveSplitter()
chunks = splitter.split(text, chunk_size=2000, chunk_overlap=500, language="markdown")

for chunk in chunks:
    print(f"[{chunk.start.line}:{chunk.start.column}] {chunk.text[:50]}...")
```

## Embedder

The embedder module (`cocoindex.resources.embedder`) defines a protocol for single-text async embedding.

### Embedder Protocol

```python
from cocoindex.resources.embedder import Embedder

class Embedder(Protocol):
    async def embed(self, text: str) -> NDArray[np.float32]: ...
```

This is the call-site contract that consumers like [`resolve_entities`](../ops/entity_resolution) rely on. Both [`LiteLLMEmbedder`](../ops/litellm) and [`SentenceTransformerEmbedder`](../ops/sentence_transformers) satisfy this protocol — `await embedder.embed("some text")` returns a single `NDArray[np.float32]`.

The protocol is deliberately narrow: it does not include `dimension()` or `__coco_vector_schema__()`, which are concerns of connectors and table-schema creation, not of embedding consumers.

```python
# Any embedder works with resolve_entities:
from cocoindex.ops.entity_resolution import resolve_entities

result = await resolve_entities(
    entities={"Apple Inc.", "Apple"},
    embedder=my_embedder,  # LiteLLMEmbedder, SentenceTransformerEmbedder, or your own
    resolve_pair=my_resolver,
)
```

---

# Vector schema annotations

Source: https://cocoindex.io/docs/common_resources/vector_schema/

The schema module (`cocoindex.resources.schema`) defines types that describe vector columns. CocoIndex connectors use these to automatically configure the correct column type (e.g., `vector(384)` in Postgres, `fixed_size_list<float32>(384)` in LanceDB).

## VectorSchema

A frozen dataclass that describes a vector column's dtype and dimension.

```python
from cocoindex.resources.schema import VectorSchema
import numpy as np

schema = VectorSchema(dtype=np.dtype(np.float32), size=768)
```

**Fields:**

- `dtype` — NumPy dtype of each element (e.g., `np.float32`)
- `size` — Number of dimensions in the vector (e.g., `384`)

You can construct `VectorSchema` directly when using a custom embedding model that doesn't implement `VectorSchemaProvider`:

```python
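import cocoindex as coco
import numpy as np
from cocoindex.connectors import qdrant
from cocoindex.resources.schema import VectorSchema
from qdrant_client import QdrantClient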

# For a custom CLIP model with known dimension
schema = VectorSchema(dtype=np.dtype(np.float32), size=768)

# Use it in a Qdrant vector definition
QDRANT_DB = coco.ContextKey[QdrantClient]("my_qdrant_db")
target_collection = await qdrant.mount_collection_target(
    QDRANT_DB,
    collection_name="image_search",
    schema=await qdrant.CollectionSchema.create(
        vectors=qdrant.QdrantVectorDef(schema=schema, distance="cosine")
    ),
)
```

## VectorSchemaProvider

A protocol for objects that can provide vector schema information. The primary use case is as metadata in `Annotated` type annotations — connectors extract vector column configuration from the annotation automatically.

Any object that implements the `__coco_vector_schema__()` method satisfies this protocol. The built-in [`SentenceTransformerEmbedder`](../ops/sentence_transformers) implements it.

There are three ways to specify vector schema in annotations:

### Using a `ContextKey` (recommended)

Define a [`ContextKey`](../programming_guide/context) for the embedder and use it as the annotation. The connector resolves the key at schema creation time. This is the recommended approach because the embedder is configured once in the lifespan and shared across all functions via context.

```python
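from collections.abc import AsyncIterator
from dataclasses import dataclass
from typing import Annotated

import cocoindex as coco
from numpy.typing import NDArray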
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder

EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder")

@dataclass
class DocEmbedding:
    id: int
    text: str
    embedding: Annotated[NDArray, EMBEDDER]  # dimension resolved from context

# In lifespan, provide the embedder:
@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
    yield

# In coco functions, access the embedder:
embedding = await coco.use_context(EMBEDDER).embed(text)
```

### Using a `VectorSchemaProvider` instance

Pass an embedder instance directly as the annotation. Simpler for scripts where the embedder is a module-level constant.

```python
from dataclasses import dataclass
from typing import Annotated

from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
from numpy.typing import NDArray

embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")

@dataclass
class DocEmbedding:
    id: int
    text: str
    embedding: Annotated[NDArray, embedder]  # dimension inferred from model (384)
```

### Using a `VectorSchema`

Specify dimension and dtype explicitly. Useful when using a custom embedding model that doesn't implement `VectorSchemaProvider`.

```python
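from dataclasses import dataclass
from typing import Annotated

import numpy as np
from cocoindex.resources.schema import VectorSchema
from numpy.typing import NDArray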

@dataclass
class ImageEmbedding:
    id: int
    embedding: Annotated[NDArray, VectorSchema(dtype=np.dtype(np.float32), size=768)]
```

When a connector's `TableSchema.from_class()` encounters an `Annotated[NDArray, annotation]` field, it resolves the annotation — unwrapping `ContextKey` if needed — and calls `__coco_vector_schema__()` to determine the column's dimension and dtype.

## MultiVectorSchema / MultiVectorSchemaProvider

Analogous types for multi-vector columns (e.g., ColBERT-style token-level embeddings). `MultiVectorSchema` wraps a `VectorSchema` describing the individual vectors. Used by connectors like [Qdrant](../connectors/qdrant) that support multi-vector storage.

```python
import numpy as np
from cocoindex.resources.schema import MultiVectorSchema, VectorSchema

multi_schema = MultiVectorSchema(
    vector_schema=VectorSchema(dtype=np.dtype(np.float32), size=128)
)
```

---

# Stable ID generation

Source: https://cocoindex.io/docs/common_resources/id_generation/

The ID module (`cocoindex.resources.id`) provides utilities for generating stable unique IDs and UUIDs that persist across incremental updates.

In an incremental pipeline, using random IDs (like `uuid.uuid4()`) means every reprocessing run generates different IDs for the same data — causing unnecessary churn in your targets (deleting old rows, inserting identical ones with new IDs). CocoIndex's ID utilities produce **stable** IDs: the same inputs produce the same IDs across runs, so unchanged data keeps its identity and targets only see real changes.
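
The contrast between random and stable IDs can be shown with the standard library. This is a conceptual sketch, not how CocoIndex derives its IDs: `NAMESPACE`, `random_id`, and `stable_id` are hypothetical, and name-based `uuid.uuid5` is used only as a convenient deterministic function.

```python
import uuid

# Hypothetical namespace for this illustration.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/my-pipeline")


def random_id(key: str) -> uuid.UUID:
    # Ignores the key: every run yields a new ID for the same data.
    return uuid.uuid4()


def stable_id(key: str) -> uuid.UUID:
    # Derived from the key: the same input yields the same ID on every run.
    return uuid.uuid5(NAMESPACE, key)


run_1 = stable_id("docs/intro.md")
run_2 = stable_id("docs/intro.md")
```

With `stable_id`, a row keyed by `docs/intro.md` keeps its identity between runs, so a target sees no delete-and-reinsert churn for unchanged data.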

## Choosing the right API

| API | Same `dep` produces... | Use when... |
|-----|------------------------|-------------|
| `generate_id(dep)` | **Same** ID every time | Each unique input maps to exactly one ID |
| `IdGenerator.next_id(dep)` | **Distinct** ID each call | You need multiple IDs for potentially non-distinct inputs |

The same distinction applies to `generate_uuid` vs `UuidGenerator`.

## generate_id / generate_uuid

Async functions that return the **same** ID/UUID for the **same** `dep` value. These are idempotent: calling multiple times with identical `dep` yields identical results.

```python
from cocoindex.resources.id import generate_id, generate_uuid

async def process_item(item: Item) -> Row:
    # Same item.key always gets the same ID
    item_id = await generate_id(item.key)
    return Row(id=item_id, data=item.data)

async def process_document(doc: Document) -> Row:
    # Same doc.path always gets the same UUID
    doc_uuid = await generate_uuid(doc.path)
    return Row(id=doc_uuid, content=doc.content)
```

**Parameters:**

- `dep` — Dependency value that determines the ID/UUID. The same `dep` always produces the same result within a component. Defaults to `None`.

**Returns:**

- `generate_id` returns an `int` (IDs start from 1; 0 is reserved)
- `generate_uuid` returns a `uuid.UUID`

## IdGenerator / UuidGenerator

Classes that return a **distinct** ID/UUID on each call, even when called with the same `dep` value. The sequence is stable across runs.

Use these when you need multiple IDs for potentially non-distinct inputs, such as splitting text into chunks where chunks may have identical content but still need unique IDs.

```python
from cocoindex.resources.id import IdGenerator, UuidGenerator

async def process_document(doc: Document) -> list[Row]:
    # Use doc.path to distinguish generators within the same processing component
    id_gen = IdGenerator(deps=doc.path)
    rows = []
    for chunk in split_into_chunks(doc.content):
        # Each call returns a distinct ID, even if chunks are identical
        chunk_id = await id_gen.next_id(chunk.content)
        rows.append(Row(id=chunk_id, content=chunk.content))
    return rows

async def process_with_uuids(doc: Document) -> list[Row]:
    # Use doc.path to distinguish generators within the same processing component
    uuid_gen = UuidGenerator(deps=doc.path)
    rows = []
    for chunk in split_into_chunks(doc.content):
        # Each call returns a distinct UUID, even if chunks are identical
        chunk_uuid = await uuid_gen.next_uuid(chunk.content)
        rows.append(Row(id=chunk_uuid, content=chunk.content))
    return rows
```
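
One way to picture "distinct on each call, yet stable across runs" is to fold an occurrence counter into a deterministic hash. The sketch below is an assumption about the general technique, not CocoIndex's implementation; `SketchUuidGenerator` is hypothetical and synchronous for brevity.

```python
import uuid
from collections import Counter


class SketchUuidGenerator:
    def __init__(self, deps: str) -> None:
        # Generators with different `deps` get different namespaces.
        self._namespace = uuid.uuid5(uuid.NAMESPACE_OID, deps)
        self._seen = Counter()

    def next_uuid(self, dep: str) -> uuid.UUID:
        # The n-th call with a given `dep` hashes (n, dep), so repeated
        # inputs get distinct IDs while the sequence stays reproducible.
        ordinal = self._seen[dep]
        self._seen[dep] += 1
        return uuid.uuid5(self._namespace, f"{ordinal}:{dep}")


first_run = SketchUuidGenerator("docs/intro.md")
ids_1 = [first_run.next_uuid("same text"), first_run.next_uuid("same text")]

second_run = SketchUuidGenerator("docs/intro.md")
ids_2 = [second_run.next_uuid("same text"), second_run.next_uuid("same text")]
```

Two identical chunks receive different IDs within a run, and a second run over the same document reproduces the same sequence.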

**Constructor:**

- `IdGenerator(deps=None)` / `UuidGenerator(deps=None)` — Create a generator. The `deps` parameter distinguishes generators within the same [processing component](../programming_guide/processing_component). Use distinct `deps` values for different generator instances.

**Methods:**

- `async IdGenerator.next_id(dep=None)` — Generate the next unique integer ID (distinct on each call)
- `async UuidGenerator.next_uuid(dep=None)` — Generate the next unique UUID (distinct on each call)

---

# In-memory LiveMap

Source: https://cocoindex.io/docs/common_resources/live_map/

`LiveMap[K, V]` (`cocoindex.resources.live_map`) is an in-memory, keyed collection that sits **between** a producing part of your pipeline and a consuming part. The producing side **declares** `(key, value)` entries; the consuming side reads the map as a [`LiveMapView`](../advanced_topics/live_component#livemapfeed-and-livemapview) via `coco.mount_each`, getting one [processing component](../programming_guide/processing_component) per entry that stays in sync as entries appear, change, and disappear.

Think of it as a connector whose "external system" is an in-process `dict`: entries are declared as [target states](../programming_guide/target_state) — so they participate in CocoIndex's declarative change detection and ownership — and that same `dict` is simultaneously exposed as a live source for downstream components. It lets you split a pipeline into a producer half and a consumer half that are decoupled through one shared, incrementally-maintained collection.

It is designed for [live mode](../programming_guide/live_mode): the producer and consumer run concurrently, and the consumer reacts as the producer updates the map.

## Example

```python
import cocoindex as coco
from cocoindex.resources.live_map import LiveMap


@coco.fn
async def produce_entries(lm: LiveMap[str, str]) -> None:
    # Any component can declare entries — often a live component reading a stream.
    for key, text in fetch_items():
        lm.declare_entry(key, text)


@coco.fn
async def process_entry(value: str) -> None:
    ...  # build something from each entry's value


@coco.fn
async def app_main() -> None:
    lm: LiveMap[str, str] = await LiveMap.create()

    await coco.mount(produce_entries, lm)        # producing side
    await coco.mount_each(process_entry, lm)     # consuming side: one component per entry
```

## Creating a LiveMap

```python
lm: LiveMap[str, str] = await LiveMap.create()
```

`create()` is an async factory and must be called from **inside the app's component tree** (it mounts a backing target). `K` must be a [stable key](../programming_guide/processing_component#component-path); `V` is any value comparable with `==` — no hashability, serialization, or fingerprinting is required, so arbitrary objects (lists, dataclasses, …) work.

## Producing entries

```python
lm.declare_entry(key, value)
```

Call `declare_entry` from inside any component. The entry is a target state **owned by the declaring component**, which gives it normal declarative semantics:

- Declaring a key makes it present, or updates its value.
- The consumer is notified **only when the value actually changes** (compared with `==`) — re-declaring the same value is a no-op for downstream.
- An entry is removed when the component that declared it **stops declaring it** (on a re-run) or disappears. There is no explicit delete verb; deletion follows target-state ownership, exactly like other CocoIndex targets.

Multiple components may declare into the same map, as long as each key has a single owner (two components declaring the same key is a conflict, the same as for any target).
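
The rules above amount to diffing what was declared before against what is declared now. The following sketch shows that diff with plain dicts; it is a simplified illustration, and `diff_entries` with its event names is hypothetical rather than part of the LiveMap API.

```python
def diff_entries(previous: dict, declared: dict) -> list[tuple[str, object]]:
    events = []
    for key, value in declared.items():
        if key not in previous:
            events.append(("added", key))
        elif previous[key] != value:  # values are compared with ==
            events.append(("changed", key))
        # An equal value produces no event: a no-op for downstream.
    for key in previous:
        if key not in declared:
            # No explicit delete: the entry simply stopped being declared.
            events.append(("removed", key))
    return events


events = diff_entries(
    previous={"a": [1], "b": "x", "c": 0},
    declared={"a": [1], "b": "y", "d": 2},
)
```

Key `a` is re-declared with an equal value and produces no event, even though its value is an unhashable list.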

## Consuming entries

A `LiveMap` is a [`LiveMapView`](../advanced_topics/live_component#livemapfeed-and-livemapview), so pass it straight to `coco.mount_each` — it behaves just like a live source:

```python
await coco.mount_each(process_entry, lm)
```

`mount_each` mounts one processing component per entry (keyed by the entry key), scans the current entries first, then reacts to incremental changes — re-mounting a component when its value changes and removing it when its entry is deleted. The processor receives the entry **value**.

A LiveMap supports **one active consumer** (`mount_each`) at a time; a second concurrent consumer raises `RuntimeError`.

## Semantics

- **Live-mode first.** LiveMap is built for [`app.update(live=True)`](../programming_guide/live_mode): the consumer subscribes and reacts as producers update the map. In catch-up mode (plain `app.update()`), the consumer scans whatever entries exist when it runs, which can race with producers running concurrently in the same session. For predictable one-shot results, order the producer ahead of the consumer by awaiting the producer's `handle.ready()` before consuming.
- **In-memory and per-session.** The map lives in process memory and is rebuilt from its producers each time the app starts; its contents do not persist across restarts.

## When to use it

Reach for a LiveMap when you want to decouple a producing stage from a consuming stage through a shared, incrementally-maintained collection. For example: one part of your pipeline watches a stream, extracts entities, and declares them into a map keyed by entity ID; another part builds an index or enrichment for each entity, reacting automatically as entities are added, updated, or removed.

---

# Amazon S3 connector

Source: https://cocoindex.io/docs/connectors/amazon_s3/

The `amazon_s3` connector provides utilities for reading objects from Amazon S3 buckets and S3-compatible services (e.g. MinIO, Tigris).

```python
from cocoindex.connectors import amazon_s3
```

**Note — Installation**
This connector requires the `aiobotocore` library. Install with:

```bash
pip install cocoindex[amazon_s3]
```

## As source

The connector provides three ways to read from S3:

- `list_objects()` — List and iterate over objects in a bucket (with optional prefix and filtering)
- `get_object()` — Fetch a single object by its key
- `read()` — Read object content directly by S3 URI

All of them require an aiobotocore S3 client, which you create and manage yourself:

```python
import aiobotocore.session

session = aiobotocore.session.get_session()
async with session.create_client("s3") as client:
    # Use client with list_objects() or get_object()
    ...

# For S3-compatible services:
async with session.create_client("s3", endpoint_url="http://localhost:9000") as client:
    ...

# For Tigris — single global endpoint, no region selection.
# Pass region_name="auto"; botocore requires a region for SigV4 signing,
# but Tigris ignores the value.
async with session.create_client(
    "s3",
    endpoint_url="https://t3.storage.dev",
    region_name="auto",
) as client:
    ...
```

### list_objects

List objects in an S3 bucket. Returns an `S3Walker` that supports async iteration.

```python
def list_objects(
    client: AioBaseClient,
    bucket_name: str,
    *,
    prefix: str = "",
    path_matcher: FilePathMatcher | None = None,
    max_file_size: int | None = None,
) -> S3Walker
```

**Parameters:**

- `client` — An aiobotocore S3 client.
- `bucket_name` — The S3 bucket name.
- `prefix` — Only list objects whose key starts with this prefix. The prefix is stripped from relative paths in the returned files.
- `path_matcher` — Optional filter for files. Patterns are matched against the relative path (after prefix stripping). See [PatternFilePathMatcher](../common_resources/data_types#patternfilepathmatcher).
- `max_file_size` — Skip objects larger than this size in bytes.

**Returns:** An `S3Walker` that can be used with `async for` loops.
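
The effect of `prefix` and `max_file_size` on the listing can be sketched without touching S3. This is an illustration of the semantics described above, not the connector's code; `ListedObject` and `relative_paths` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ListedObject:
    key: str   # full S3 object key
    size: int  # size in bytes


def relative_paths(objects, prefix="", max_file_size=None):
    result = []
    for obj in objects:
        if not obj.key.startswith(prefix):
            continue  # outside the requested prefix
        if max_file_size is not None and obj.size > max_file_size:
            continue  # larger than the size limit
        # The prefix is stripped, leaving the relative path.
        result.append(obj.key[len(prefix):])
    return result


listing = [
    ListedObject("data/a.json", 10),
    ListedObject("data/sub/big.bin", 5000),
    ListedObject("logs/x.log", 1),
]
paths = relative_paths(listing, prefix="data/", max_file_size=1000)
```

A `path_matcher` would then be applied to these relative paths, so a pattern such as `**/*.json` is written without the `data/` prefix.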

### Iterating files

`list_objects()` returns an `S3Walker` that yields `S3File` objects (implementing the [`FileLike`](../common_resources/data_types#filelike) base class):

```python
import aiobotocore.session
from cocoindex.connectors import amazon_s3

session = aiobotocore.session.get_session()
async with session.create_client("s3") as client:
    async for file in amazon_s3.list_objects(client, "my-bucket", prefix="data/"):
        text = await file.read_text()
        ...
```

See [`FileLike`](../common_resources/data_types#filelike) for details on the file objects.

### Keyed iteration with `items()`

`S3Walker.items()` yields `(str, S3File)` pairs, useful for associating each file with a stable string key (its relative path):

```python
async for key, file in amazon_s3.list_objects(client, "my-bucket").items():
    content = await file.read()
```

### Filtering files

Use `PatternFilePathMatcher` to filter which objects are included. Patterns are matched against the relative path (after prefix stripping):

```python
from cocoindex.connectors import amazon_s3
from cocoindex.resources.file import PatternFilePathMatcher

matcher = PatternFilePathMatcher(included_patterns=["**/*.json"])

async for file in amazon_s3.list_objects(client, "my-bucket", prefix="data/", path_matcher=matcher):
    process(file)
```

### Limiting file size

Use `max_file_size` to skip objects that exceed a size threshold:

```python
# Skip objects larger than 10 MB
async for file in amazon_s3.list_objects(client, "my-bucket", max_file_size=10 * 1024 * 1024):
    process(file)
```

### get_object

Fetch a single object from an S3 bucket by its key.

```python
async def get_object(
    client: AioBaseClient,
    bucket_name_or_uri: str,
    key: str | None = None,
) -> S3File
```

**Parameters:**

- `client` — An aiobotocore S3 client.
- `bucket_name_or_uri` — Either a full S3 URI (`s3://bucket/key`) or the bucket name when `key` is supplied separately.
- `key` — The full S3 object key. Required when `bucket_name_or_uri` is a bucket name; must be omitted when a URI is given.

**Returns:** An `S3File` (FileLike) for the specified object.

**Example:**

```python
import aiobotocore.session
from cocoindex.connectors import amazon_s3

session = aiobotocore.session.get_session()
async with session.create_client("s3") as client:
    # Via S3 URI:
    f = await amazon_s3.get_object(client, "s3://my-bucket/data/config.json")
    data = await f.read()

    # Via bucket name + key:
    f = await amazon_s3.get_object(client, "my-bucket", "data/config.json")
    data = await f.read()
```
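
Both call forms address the same object. As a rough illustration of how an `s3://bucket/key` URI breaks down into the bucket-plus-key form, here is a minimal standard-library sketch. `split_s3_uri` is a hypothetical helper written for this page, not a connector function, and the connector's own parsing may differ:

```python
from urllib.parse import urlsplit


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key). Hypothetical helper."""
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The URL path keeps one leading "/", which is not part of the object key.
    # Caveat: keys containing "?" or "#" would be split off by urlsplit and
    # need separate handling; this sketch ignores that case.
    return parts.netloc, parts.path.removeprefix("/")


bucket, key = split_s3_uri("s3://my-bucket/data/config.json")
```

Here `bucket` and `key` are the two values you would otherwise pass as `bucket_name_or_uri` and `key`.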

### read

Read object content directly from an S3 URI, without fetching metadata first.

```python
async def read(
    client: AioBaseClient,
    uri: str,
    size: int = -1,
) -> bytes
```

**Parameters:**

- `client` — An aiobotocore S3 client.
- `uri` — An S3 URI (`s3://bucket/key`).
- `size` — Number of bytes to read. If -1 (default), read the entire object.

**Returns:** The object content as bytes.

**Example:**

```python
async with session.create_client("s3") as client:
    data = await amazon_s3.read(client, "s3://my-bucket/data/config.json")
```

### S3FilePath

Each file returned by the connector has an `S3FilePath` — a [`FilePath`](../common_resources/data_types#filepath) specialized for S3:

- **Relative path** (`file.file_path.path`) — The object key relative to the walker prefix (or the full key if no prefix was used).
- **Resolved path** (`file.file_path.resolve()`) — The full S3 object key.

For example, with `prefix="data/"` and an object key `"data/docs/readme.md"`:
- `file.file_path.path` → `PurePath("docs/readme.md")`
- `file.file_path.resolve()` → `"data/docs/readme.md"`
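
The relationship between the two paths can be sketched with plain string handling. The helper names below are hypothetical and only mirror the behavior described above; `PurePosixPath` is used so the result is the same on every platform:

```python
from pathlib import PurePosixPath


def relative_path(key: str, prefix: str) -> PurePosixPath:
    """Strip the walker prefix from a full object key (illustrative only)."""
    if not key.startswith(prefix):
        raise ValueError(f"{key!r} does not start with prefix {prefix!r}")
    return PurePosixPath(key[len(prefix):])


def resolved_key(relative: PurePosixPath, prefix: str) -> str:
    """Rebuild the full object key from the prefix and the relative path."""
    return prefix + relative.as_posix()


rel = relative_path("data/docs/readme.md", "data/")
full = resolved_key(rel, "data/")
```

With an empty prefix, both functions leave the key unchanged, which matches the "full key if no prefix was used" case.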

### Example

```python
import aiobotocore.session
import cocoindex as coco
from cocoindex.connectors import amazon_s3
from cocoindex.resources.file import FileLike, PatternFilePathMatcher

@coco.fn
async def app_main(bucket: str) -> None:
    session = aiobotocore.session.get_session()
    async with session.create_client("s3") as client:
        matcher = PatternFilePathMatcher(included_patterns=["**/*.md"])

        walker = amazon_s3.list_objects(
            client, bucket, prefix="docs/", path_matcher=matcher,
        )

        with coco.component_subpath("file"):
            async for key, file in walker.items():
                await coco.mount(
                    coco.component_subpath(key),
                    process_file,
                    file,
                )

@coco.fn(memo=True)
async def process_file(file: FileLike[str]) -> None:
    text = await file.read_text()
    # ... process the file content ...
```

---

# Apache Doris connector

Source: https://cocoindex.io/docs/connectors/doris/

The `doris` connector provides utilities for writing rows to Apache Doris databases, with support for vector indexes (HNSW, IVF) and inverted indexes for full-text search.

```python
from cocoindex.connectors import doris
```

**Note — Dependencies**
This connector requires additional dependencies. Install with:

```bash
pip install cocoindex[doris]
```

## Connection setup

### DorisConnectionConfig

Configure the connection to your Doris cluster:

```python
from cocoindex.connectors import doris

config = doris.DorisConnectionConfig(
    fe_host="localhost",
    database="my_database",
    fe_http_port=8080,
    query_port=9030,
    username="root",
    password="",
)
```

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `fe_host` | `str` | *(required)* | Frontend host address |
| `database` | `str` | *(required)* | Database name |
| `fe_http_port` | `int` | `8080` | Frontend HTTP port (for stream load) |
| `query_port` | `int` | `9030` | MySQL-compatible query port |
| `username` | `str` | `"root"` | Username |
| `password` | `str` | `""` | Password |
| `enable_https` | `bool` | `False` | Use HTTPS for stream load |
| `be_load_host` | `str \| None` | `None` | Override backend host for stream load (defaults to `fe_host`) |
| `batch_size` | `int` | `10000` | Max rows per stream load batch |
| `stream_load_timeout` | `int` | `600` | Timeout (seconds) for stream load |
| `replication_num` | `int` | `1` | Replication factor for new tables |
| `buckets` | `int \| str` | `"auto"` | Bucket count for new tables |

### connect

Create a managed connection:

```python
def connect(config: DorisConnectionConfig) -> ManagedConnection
```

**Example:**

```python
conn = doris.connect(doris.DorisConnectionConfig(
    fe_host="localhost",
    database="my_database",
))
```

## As target

The `doris` connector provides target state APIs for writing rows to tables. CocoIndex tracks what rows should exist and automatically handles upserts and deletions via Doris stream load.
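
Stream load sends rows in batches capped by `batch_size` (see `DorisConnectionConfig`). The connector's internal batching is not shown in this page, so the following is only a sketch of the general idea, with a hypothetical `batched_rows` helper:

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def batched_rows(rows: Iterable[dict], batch_size: int = 10000) -> Iterator[list[dict]]:
    """Yield consecutive lists of at most `batch_size` rows (illustrative only)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(rows)
    # islice pulls up to batch_size rows; an empty list means the input is exhausted.
    while batch := list(islice(it, batch_size)):
        yield batch


sizes = [len(b) for b in batched_rows(({"id": i} for i in range(25)), batch_size=10)]
```

Each yielded batch would correspond to one stream load request, so a smaller `batch_size` trades throughput for smaller individual requests.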

### Declaring target states

#### Setting up a connection

Create a `ContextKey[doris.ManagedConnection]` to identify your connection, then provide it in your lifespan:

**Note**
The key name is load-bearing across runs — it's the stable identity CocoIndex uses to track managed rows. See [ContextKey as stable identity](../programming_guide/context#contextkey-as-stable-identity) before renaming.

```python
from collections.abc import Iterator

import cocoindex as coco
from cocoindex.connectors import doris

DORIS_DB = coco.ContextKey[doris.ManagedConnection]("my_doris")

@coco.lifespan
def coco_lifespan(builder: coco.EnvironmentBuilder) -> Iterator[None]:
    conn = doris.connect(doris.DorisConnectionConfig(
        fe_host="localhost",
        database="my_database",
    ))
    builder.provide(DORIS_DB, conn)
    yield
    # conn is cleaned up after yield
```

#### Tables (parent state)

Declares a table as a target state. Returns a `DorisTableTarget` for declaring rows.

```python
def declare_table_target(
    db: ContextKey[ManagedConnection],
    table_name: str,
    table_schema: TableSchema[RowT],
    *,
    managed_by: Literal["system", "user"] = "system",
    vector_indexes: list[VectorIndexDef] | None = None,
    inverted_indexes: list[InvertedIndexDef] | None = None,
) -> DorisTableTarget[RowT, coco.PendingS]
```

**Parameters:**

- `db` — A `ContextKey[doris.ManagedConnection]` identifying the connection to use.
- `table_name` — Name of the table.
- `table_schema` — Schema definition including columns and primary key (see [Table schema](#table-schema-from-python-class)).
- `managed_by` — Whether CocoIndex manages the table lifecycle (`"system"`) or assumes it exists (`"user"`).
- `vector_indexes` — Optional list of vector index definitions (see [Vector indexes](#vector-indexes)).
- `inverted_indexes` — Optional list of inverted index definitions (see [Inverted indexes](#inverted-indexes)).

**Returns:** A pending `DorisTableTarget`. Use the convenience wrapper `await doris.mount_table_target(...)` to resolve.

#### Rows (child states)

Once a `DorisTableTarget` is resolved, declare rows to be upserted:

```python
def DorisTableTarget.declare_row(
    self,
    *,
    row: RowT,
) -> None
```

**Parameters:**

- `row` — A row object (dict, dataclass, NamedTuple, or Pydantic model). Must include all primary key columns.

### Table schema: from Python class

Define the table structure using a Python class:

```python
@classmethod
async def TableSchema.from_class(
    cls,
    record_type: type[RowT],
    primary_key: list[str],
    *,
    column_overrides: dict[str, DorisType | VectorSchemaProvider] | None = None,
) -> TableSchema[RowT]
```

**Parameters:**

- `record_type` — A record type whose fields define table columns.
- `primary_key` — List of column names forming the primary key.
- `column_overrides` — Optional per-column overrides for type mapping or vector configuration.

**Example:**

```python
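from dataclasses import dataclass
from typing import Annotated

from numpy.typing import NDArray

# `embedder` is a vector schema provider defined elsewhere, for example the
# SentenceTransformerEmbedder used in the full example at the end of this page.
@dataclass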
class DocEmbedding:
    id: int
    text: str
    embedding: Annotated[NDArray, embedder]

schema = await doris.TableSchema.from_class(
    DocEmbedding,
    primary_key=["id"],
)
```

Python types are automatically mapped to Doris types:

| Python Type | Doris Type |
|-------------|------------|
| `bool` | `BOOLEAN` |
| `int` | `BIGINT` |
| `float` | `DOUBLE` |
| `str` | `STRING` |
| `bytes` | `STRING` (base64) |
| `uuid.UUID` | `VARCHAR(36)` |
| `datetime.datetime` | `DATETIME` |
| `datetime.date` | `DATE` |
| `list`, `dict`, nested structs | `JSON` |
| `NDArray` (with vector schema) | `ARRAY<FLOAT>` |

#### DorisType

Use `DorisType` to specify a custom Doris type:

```python
from dataclasses import dataclass
from typing import Annotated

from cocoindex.connectors.doris import DorisType

@dataclass
class MyRow:
    id: Annotated[int, DorisType("INT")]
    value: Annotated[float, DorisType("FLOAT")]
```
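
To make the interplay between the default mapping and a per-field override concrete, here is a standalone sketch that resolves a column type for each dataclass field. It does not use the connector: `TypeOverride` is a hypothetical stand-in for `DorisType`, and the lookup covers only the scalar rows of the table above, falling back to `JSON` for everything else:

```python
import datetime
import uuid
from dataclasses import dataclass, fields
from typing import Annotated, NamedTuple, get_args, get_origin, get_type_hints


class TypeOverride(NamedTuple):
    """Hypothetical stand-in for an explicit column type annotation."""
    doris_type: str


DEFAULT_TYPES = {
    bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE", str: "STRING",
    bytes: "STRING", uuid.UUID: "VARCHAR(36)",
    datetime.datetime: "DATETIME", datetime.date: "DATE",
}


def column_types(record_type: type) -> dict[str, str]:
    """Map each dataclass field to a Doris type string (illustrative only)."""
    hints = get_type_hints(record_type, include_extras=True)
    result = {}
    for f in fields(record_type):
        base, override = hints[f.name], None
        if get_origin(base) is Annotated:
            # Annotated[T, meta...] unpacks to (T, meta...).
            base, *metadata = get_args(base)
            override = next((m for m in metadata if isinstance(m, TypeOverride)), None)
        # An explicit override wins; unknown types fall back to JSON.
        result[f.name] = override.doris_type if override else DEFAULT_TYPES.get(base, "JSON")
    return result


@dataclass
class ExampleRow:
    id: Annotated[int, TypeOverride("INT")]
    name: str
    tags: list[str]


resolved = column_types(ExampleRow)
```

In this sketch `id` resolves to the overridden `INT`, `name` to the default `STRING`, and `tags` to `JSON`.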

### Vector indexes

Doris supports vector similarity search via HNSW and IVF indexes. Define them with `VectorIndexDef`:

```python
from cocoindex.connectors.doris import VectorIndexDef

vector_idx = VectorIndexDef(
    field_name="embedding",
    index_type="HNSW",       # or "IVF"
    metric_type="l2_distance",  # or "cosine_distance"
)
```

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `field_name` | `str` | *(required)* | Column to index |
| `index_type` | `str` | `"HNSW"` | Index type: `"HNSW"` or `"IVF"` |
| `metric_type` | `str` | `"l2_distance"` | Distance metric: `"l2_distance"` or `"cosine_distance"` |
| `max_degree` | `int \| None` | `None` | HNSW max degree |
| `ef_construction` | `int \| None` | `None` | HNSW construction parameter |
| `nlist` | `int \| None` | `None` | IVF number of partitions |

### Inverted indexes

Doris supports inverted indexes for full-text search. Define them with `InvertedIndexDef`:

```python
from cocoindex.connectors.doris import InvertedIndexDef

inverted_idx = InvertedIndexDef(
    field_name="text",
    parser="unicode",  # or "english", "chinese", etc.
)
```

**Parameters:**

- `field_name` — Column to index.
- `parser` — Optional tokenizer for full-text search (e.g., `"unicode"`, `"english"`, `"chinese"`). If `None`, the index supports exact matching only.

### Query helpers

#### build_vector_search_query

Build a vector similarity search SQL query:

```python
def build_vector_search_query(
    table: str,
    vector_field: str,
    query_vector: list[float],
    metric: str = "l2_distance",
    limit: int = 10,
    select_columns: list[str] | None = None,
    where_clause: str | None = None,
) -> str
```

**Example:**

```python
sql = doris.build_vector_search_query(
    table="doc_embeddings",
    vector_field="embedding",
    query_vector=query_vec.tolist(),
    metric="cosine_distance",
    limit=5,
)
```
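
The `metric` argument selects how the distance between the stored vector and `query_vector` is computed. As a reference point, the two metrics can be written in plain Python as follows, assuming the conventional definitions (Euclidean distance, and one minus cosine similarity). This is a sketch for intuition, not the code Doris runs:

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """One minus cosine similarity; undefined for zero-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)
```

For both metrics, a smaller value means the vectors are closer. Cosine distance ignores vector magnitude, which is why it is a common choice for text embeddings.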

#### connect_async

Create an async MySQL connection for running queries:

```python
async def connect_async(
    fe_host: str,
    query_port: int = 9030,
    username: str = "root",
    password: str = "",
    database: str | None = None,
) -> Any  # aiomysql connection
```

### Example

```python
from typing import Annotated, Iterator
from dataclasses import dataclass

from numpy.typing import NDArray

import cocoindex as coco
from cocoindex.connectors import doris
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder

DORIS_DB = coco.ContextKey[doris.ManagedConnection]("my_doris")

embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")

@dataclass
class DocEmbedding:
    id: int
    text: str
    embedding: Annotated[NDArray, embedder]

@coco.lifespan
def coco_lifespan(builder: coco.EnvironmentBuilder) -> Iterator[None]:
    conn = doris.connect(doris.DorisConnectionConfig(
        fe_host="localhost",
        database="my_database",
    ))
    builder.provide(DORIS_DB, conn)
    yield

@coco.fn
async def app_main() -> None:
    table = await doris.mount_table_target(
        DORIS_DB,
        "doc_embeddings",
        await doris.TableSchema.from_class(
            DocEmbedding,
            primary_key=["id"],
        ),
        vector_indexes=[
            doris.VectorIndexDef(
                field_name="embedding",
                index_type="HNSW",
                metric_type="cosine_distance",
            ),
        ],
    )

    # Declare rows; `documents` stands for an iterable of DocEmbedding
    # instances produced elsewhere in your app.
    for doc in documents:
        table.declare_row(row=doc)
```

---

# FalkorDB connector

Source: https://cocoindex.io/docs/connectors/falkordb/

The `falkordb` connector writes records to FalkorDB, a Cypher-compatible graph database that runs as a Redis module. It supports node tables (labels), relationship tables (edge types), per-graph multitenancy (one Redis instance, many isolated graphs), and vector indexes.

```python
from cocoindex.connectors import falkordb
```

**Note — Dependencies**
This connector requires additional dependencies. Install with:

```bash
pip install cocoindex[falkordb]
```

## Connection setup

Create a `ConnectionFactory` and provide it via a `ContextKey`. The factory holds the FalkorDB URI plus the target graph name, and yields a graph handle on demand.

**Note**
The key name is load-bearing across runs — it's the stable identity CocoIndex uses to track managed rows. See [ContextKey as stable identity](../programming_guide/context#contextkey-as-stable-identity) before renaming.

```python
from collections.abc import AsyncIterator
from cocoindex.connectors import falkordb
import cocoindex as coco

KG_DB: coco.ContextKey[falkordb.ConnectionFactory] = coco.ContextKey("kg_db")

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    builder.provide(
        KG_DB,
        falkordb.ConnectionFactory(
            uri="falkor://localhost:6379",
            graph="knowledge_graph",
        ),
    )
    yield
```

### Multitenancy

A single Redis instance can host many fully isolated graphs. Pair each graph with its own `ContextKey` and `ConnectionFactory(graph=...)`:

```python
KG_DB: coco.ContextKey[falkordb.ConnectionFactory] = coco.ContextKey("kg_db")
APIS_DB: coco.ContextKey[falkordb.ConnectionFactory] = coco.ContextKey("apis_db")

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    uri = "falkor://localhost:6379"
    builder.provide(KG_DB, falkordb.ConnectionFactory(uri=uri, graph="knowledge_graph"))
    builder.provide(APIS_DB, falkordb.ConnectionFactory(uri=uri, graph="apis_graph"))
    yield
```

Different `ContextKey`s with different graph names produce fully separate target-state trees — changes to one never spill into the other.

## As target

The `falkordb` connector provides target state APIs for writing records to node tables and relation tables. CocoIndex tracks what records should exist and automatically handles upserts and deletions.

Each `graph.query` call against FalkorDB is its own atomic unit (FalkorDB does not expose multi-statement transactions); the connector orders writes within a batch as **node upserts → relation upserts → relation deletes → node deletes** so dependent edges always see their endpoints.
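
The ordering rule can be pictured as a stable sort over the pending writes of a batch. The sketch below is only an illustration of that rule; the `Write` class and the phase names are hypothetical and do not reflect the connector's internal types:

```python
from dataclasses import dataclass

# Phases in the order described above.
PHASE_ORDER = ("node_upsert", "relation_upsert", "relation_delete", "node_delete")


@dataclass(frozen=True)
class Write:
    kind: str    # one of PHASE_ORDER
    target: str  # free-form description of the node or edge


def order_batch(writes: list[Write]) -> list[Write]:
    """Sort writes by phase; sorted() is stable, so order within a phase is kept."""
    rank = {kind: i for i, kind in enumerate(PHASE_ORDER)}
    return sorted(writes, key=lambda w: rank[w.kind])


batch = [
    Write("node_delete", "Entity:old"),
    Write("relation_upsert", "Entity:a->Entity:b"),
    Write("node_upsert", "Entity:a"),
    Write("relation_delete", "Entity:old->Entity:b"),
    Write("node_upsert", "Entity:b"),
]
ordered_kinds = [w.kind for w in order_batch(batch)]
```

Upserting nodes first guarantees that new edges find their endpoints, and deleting relations before nodes avoids removing a node that still has managed edges attached.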

### Declaring target states

#### Node tables (parent state)

Declares a node label as a target state. Returns a `TableTarget` for declaring records.

```python
def declare_table_target(
    db: ContextKey,
    table_name: str,
    table_schema: TableSchema[RowT] | None = None,
    *,
    primary_key: str = "id",
    managed_by: Literal["system", "user"] = "system",
) -> TableTarget[RowT, coco.PendingS]
```

**Parameters:**

- `db` — A `ContextKey[falkordb.ConnectionFactory]` for the FalkorDB connection.
- `table_name` — The Cypher node label (e.g. `"Document"`).
- `table_schema` — Optional schema definition (see [Table Schema](#table-schema-from-python-class)). FalkorDB does not enforce per-property types server-side, so no per-column DDL is emitted. The schema still participates in CocoIndex's fingerprint, which means two flows declaring the same label must agree on it.
- `primary_key` — Single property name used as the node's primary key. Defaults to `"id"`. Compound primary keys are not supported in v1.0.
- `managed_by` — Whether CocoIndex manages the table lifecycle (`"system"`) or assumes it exists (`"user"`).

**Returns:** A pending `TableTarget`. Use `await falkordb.mount_table_target(KG_DB, ...)` to get a resolved target.

#### Records (child states)

Once a `TableTarget` is resolved, declare records to be upserted (translated to `MERGE (n:Label {pk: $key_0}) SET n += $props`):

```python
def TableTarget.declare_record(
    self,
    *,
    row: RowT,
) -> None
```

**Parameters:**

- `row` — A row object (dict, dataclass, NamedTuple, or Pydantic model). Must include the `primary_key` field declared above.

`declare_row` is an alias for `declare_record`, for compatibility with Postgres and other RDBMS targets.

#### Relation tables (parent state)

Declares a relationship type as a target state. Returns a `RelationTarget` for declaring edges.

```python
def declare_relation_target(
    db: ContextKey,
    table_name: str,
    from_table: TableTarget,
    to_table: TableTarget,
    table_schema: TableSchema[RowT] | None = None,
    *,
    primary_key: str = "id",
    managed_by: Literal["system", "user"] = "system",
) -> RelationTarget[RowT, coco.PendingS]
```

**Parameters:**

- `db` — A `ContextKey[falkordb.ConnectionFactory]` for the FalkorDB connection.
- `table_name` — The Cypher relationship type (e.g. `"MENTION"`).
- `from_table` — The `TableTarget` whose nodes are the *source* endpoints of edges in this relationship.
- `to_table` — The `TableTarget` whose nodes are the *target* endpoints of edges in this relationship.
- `table_schema` — Optional schema for the relationship's own properties (see [Table Schema](#table-schema-from-python-class)). The relationship's `primary_key` field uniquely identifies each edge.
- `primary_key` — Single property name used as the edge's primary key. Defaults to `"id"`.
- `managed_by` — Whether CocoIndex manages the relationship lifecycle (`"system"`) or assumes it exists (`"user"`).

**Returns:** A pending `RelationTarget`. Use `await falkordb.mount_relation_target(KG_DB, ...)` to get a resolved target.

#### Relations (child states)

Once a `RelationTarget` is resolved, declare edges. Each declaration produces a triple-MERGE: source endpoint, target endpoint, then the relationship.

```python
def RelationTarget.declare_relation(
    self,
    *,
    from_id: Any,
    to_id: Any,
    record: RowT | None = None,
) -> None
```

**Parameters:**

- `from_id` — The source node's primary-key value. The connector MERGEs `(s:FromLabel {pk: $from_id})` so endpoints are auto-created if absent.
- `to_id` — The target node's primary-key value. Same MERGE behavior.
- `record` — Optional row object whose fields populate the relationship's properties. Must include the relationship's `primary_key` field if provided.

If `record` is omitted, the connector derives a deterministic edge id from `(from_label, from_id, to_label, to_id)`. This is convenient when an edge has no properties of its own.
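
This page does not specify how the id is derived. One plausible way to build a deterministic id from that tuple is to hash a canonical encoding of it, as in the sketch below. The function name and the choice of SHA-256 over JSON are assumptions for illustration, not the connector's actual scheme:

```python
import hashlib
import json
from typing import Any


def derive_edge_id(from_label: str, from_id: Any, to_label: str, to_id: Any) -> str:
    """Hash the endpoint tuple into a stable hex id (hypothetical scheme)."""
    # A JSON list gives an unambiguous, order-preserving encoding;
    # default=str covers ids such as UUIDs that JSON cannot encode natively.
    payload = json.dumps([from_label, from_id, to_label, to_id], default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


edge_id = derive_edge_id("Entity", "CocoIndex", "Entity", "FalkorDB")
```

Whatever the real scheme is, the property that matters is the one shown here: the same endpoints always yield the same id, so re-declaring an edge updates it instead of creating a duplicate.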

#### Vector indexes (attachment)

Declares a vector index on a column of a node table. Vector indexes are an [attachment](../advanced_topics/custom_target_connector#implementing-attachment-providers) to a `TableTarget`:

```python
def TableTarget.declare_vector_index(
    self,
    *,
    name: str | None = None,
    field: str,
    metric: Literal["cosine", "euclidean", "ip"] = "cosine",
    dimension: int,
) -> None
```

**Parameters:**

- `name` — Optional logical name for the index. Defaults to `f"idx_{table_name}__{field}"`.
- `field` — The node property holding the vector.
- `metric` — Similarity metric: `"cosine"`, `"euclidean"`, or `"ip"` (inner product). Translated to FalkorDB's `similarityFunction` option.
- `dimension` — The vector's dimension. Required.

The connector emits `CREATE VECTOR INDEX FOR (e:Label) ON (e.field) OPTIONS {dimension: N, similarityFunction: '...'}`. Vectors are float32 only — wider vector dtypes are not supported.
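
To see how the arguments fill in that statement, here is a small string-building sketch. Both helpers are hypothetical. The page does not say how each `metric` value maps to a `similarityFunction` string, so the sketch takes the final string as an argument and makes no attempt to translate it:

```python
def default_index_name(table_name: str, field: str) -> str:
    """Default logical index name, following the pattern given above."""
    return f"idx_{table_name}__{field}"


def vector_index_statement(
    label: str, field: str, dimension: int, similarity_function: str
) -> str:
    """Render the CREATE VECTOR INDEX statement (illustrative only).

    Identifiers are interpolated as-is, so they must come from trusted code,
    never from user input.
    """
    return (
        f"CREATE VECTOR INDEX FOR (e:{label}) ON (e.{field}) "
        f"OPTIONS {{dimension: {dimension}, "
        f"similarityFunction: '{similarity_function}'}}"
    )


statement = vector_index_statement("Document", "embedding", 384, "cosine")
name = default_index_name("Document", "embedding")
```

The dimension of 384 is just an example value; it must match the length of the vectors you actually store.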

### Table schema: from Python class

Build a `TableSchema` by introspecting a record type:

```python
@classmethod
async def TableSchema.from_class(
    cls,
    record_type: type[RowT],
    *,
    primary_key: str = "id",
    column_overrides: dict[str, FalkorType | VectorSchemaProvider] | None = None,
) -> TableSchema[RowT]
```

**Parameters:**

- `record_type` — A dataclass, NamedTuple, or Pydantic model.
- `primary_key` — Field name to use as the table's primary key. Defaults to `"id"`.
- `column_overrides` — Optional dict mapping field names to `FalkorType` or `VectorSchemaProvider` to override the default Python-to-FalkorDB type mapping.

**Returns:** A `TableSchema[RowT]` populated from the class's fields.

#### Default Python → FalkorDB type mapping

| Python type | FalkorDB type | Notes |
|---|---|---|
| `bool` | `boolean` | |
| `int`, NumPy integer scalars | `integer` | |
| `float`, NumPy float scalars | `float` | |
| `decimal.Decimal` | `string` | Encoded via `str()` — FalkorDB has no decimal type. |
| `str` | `string` | |
| `bytes` | `string` | Encoded as base64. |
| `uuid.UUID` | `string` | Encoded via `str()`. |
| `datetime.date` / `datetime.datetime` / `datetime.time` | `string` | Encoded via `.isoformat()`. |
| `datetime.timedelta` | `integer` | Encoded as milliseconds (`int(td.total_seconds() * 1000)`). |
| `numpy.ndarray` (with `VectorSchema` annotation) | `vector<float32, N>` | Encoded as `list[float]`. |
| `dict`, list, nested record, `Any` | `map` / `array` | Passed through native parameter binding. |

#### FalkorType

Override the default mapping for a single column with `FalkorType`:

```python
class FalkorType(NamedTuple):
    falkor_type: str
    encoder: ValueEncoder | None = None
```

Use with `typing.Annotated`:

```python
from typing import Annotated
from dataclasses import dataclass
from cocoindex.connectors.falkordb import FalkorType

@dataclass
class Row:
    id: str
    score: Annotated[float, FalkorType("decimal", encoder=str)]
```

The `falkor_type` string is metadata-only — it participates in the schema fingerprint (so two flows declaring the same table must agree) but no DDL is emitted from it.

#### VectorSchemaProvider

For NumPy `ndarray` columns, attach a `VectorSchema` annotation to specify dtype + dimension. See [VectorSchema](../common_resources/vector_schema) for details.

### Table schema: explicit column definitions

Build a `TableSchema` directly from a dict of column definitions when the row type is dynamic:

```python
from cocoindex.connectors.falkordb import TableSchema, ColumnDef

schema = TableSchema(
    columns={
        "filename": ColumnDef(type="string"),
        "title": ColumnDef(type="string"),
        "summary": ColumnDef(type="string", nullable=True),
    },
    primary_key="filename",
)
```

`ColumnDef` fields:

- `type` — The FalkorDB type string (metadata only; see table above).
- `nullable` — Whether the column may be `None`. Defaults to `True`.
- `encoder` — Optional `Callable[[Any], Any]` applied to non-`None` values before they're sent to FalkorDB.
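
An encoder is just a callable from a Python value to something FalkorDB can store. The sketch below writes out a few encoders that match the default mapping table earlier in this section, plus a hypothetical `apply_encoder` helper showing the "non-`None` values only" rule. None of these names are connector APIs:

```python
import base64
import datetime
import decimal
import uuid
from typing import Any, Callable, Optional


def encode_bytes(value: bytes) -> str:
    """bytes are stored as a base64 string."""
    return base64.b64encode(value).decode("ascii")


def encode_timedelta(value: datetime.timedelta) -> int:
    """timedelta is stored as whole milliseconds."""
    return int(value.total_seconds() * 1000)


def encode_temporal(value: Any) -> str:
    """date, datetime, and time are stored via .isoformat()."""
    return value.isoformat()


def apply_encoder(encoder: Optional[Callable[[Any], Any]], value: Any) -> Any:
    """Apply the encoder unless the value is None or no encoder is set."""
    if value is None or encoder is None:
        return value
    return encoder(value)


encoded = {
    "blob": apply_encoder(encode_bytes, b"hi"),
    "price": apply_encoder(str, decimal.Decimal("1.10")),
    "uid": apply_encoder(str, uuid.UUID(int=0)),
    "day": apply_encoder(encode_temporal, datetime.date(2024, 1, 2)),
    "elapsed": apply_encoder(encode_timedelta, datetime.timedelta(seconds=1.5)),
    "missing": apply_encoder(str, None),
}
```

Passing `str` as the encoder for `Decimal` and `UUID` mirrors the `str()` encoding listed in the table.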

### DDL: indexes and constraints

For each managed table, the connector creates the supporting Cypher index on the primary key field on first run:

- For node tables: `CREATE INDEX FOR (e:Label) ON (e.<pk>)`.
- For relation tables: `CREATE INDEX FOR ()-[e:RelType]-() ON (e.<pk>)`.

It then attempts a uniqueness constraint via the `GRAPH.CONSTRAINT CREATE` Redis command (best-effort — failures are logged but do not abort). Indexes and constraints are dropped on `cocoindex drop` or when the table is no longer declared.

When `managed_by="user"` is set, the connector skips DDL entirely — you're responsible for creating and dropping the schema. Record-level upserts and deletes still work.

### Example: Node tables

```python
from collections.abc import AsyncIterator
from dataclasses import dataclass
import cocoindex as coco
from cocoindex.connectors import falkordb

KG_DB: coco.ContextKey[falkordb.ConnectionFactory] = coco.ContextKey("kg_db")


@dataclass
class Document:
    filename: str
    title: str
    summary: str


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    builder.provide(KG_DB, falkordb.ConnectionFactory(
        uri="falkor://localhost:6379", graph="knowledge_graph",
    ))
    yield


@coco.fn
async def app_main() -> None:
    schema = await falkordb.TableSchema.from_class(Document, primary_key="filename")
    documents = await falkordb.mount_table_target(
        KG_DB, "Document", schema, primary_key="filename",
    )
    documents.declare_record(
        row=Document(
            filename="overview.md",
            title="Overview",
            summary="An overview of CocoIndex...",
        )
    )


app = coco.App(coco.AppConfig(name="docs_to_falkordb"), app_main)
```

### Example: Relation tables (knowledge graph)

```python
# Reuses the imports, KG_DB, and the Document dataclass from the node-table example above.
@dataclass
class Entity:
    value: str


@dataclass
class RelationshipRow:
    id: str
    predicate: str


@coco.fn
async def kg_app_main() -> None:
    documents = await falkordb.mount_table_target(
        KG_DB, "Document",
        await falkordb.TableSchema.from_class(Document, primary_key="filename"),
        primary_key="filename",
    )
    entities = await falkordb.mount_table_target(
        KG_DB, "Entity",
        await falkordb.TableSchema.from_class(Entity, primary_key="value"),
        primary_key="value",
    )
    relationships = await falkordb.mount_relation_target(
        KG_DB, "RELATIONSHIP",
        entities, entities,
        await falkordb.TableSchema.from_class(RelationshipRow, primary_key="id"),
        primary_key="id",
    )

    # populate ...
    documents.declare_record(row=Document(filename="overview.md", title="Overview", summary="..."))
    entities.declare_record(row=Entity(value="CocoIndex"))
    entities.declare_record(row=Entity(value="FalkorDB"))
    relationships.declare_relation(
        from_id="CocoIndex",
        to_id="FalkorDB",
        record=RelationshipRow(id="rel-1", predicate="writes_to"),
    )


kg_app = coco.App(coco.AppConfig(name="kg_app"), kg_app_main)
```

The `Entity` table is declared up-front (via `mount_table_target`) so its index and constraint are reconciled before any `RELATIONSHIP` edge MERGEs entity endpoints. The relationship's three-MERGE pattern (source endpoint → target endpoint → edge) means missing endpoints are auto-created — but it's good practice to declare them explicitly so deletion-cascade behavior stays predictable.

---

# Google Drive connector

Source: https://cocoindex.io/docs/connectors/google_drive/

The `google_drive` connector provides utilities for reading files from Google Drive using a service account.

```python
from cocoindex.connectors import google_drive
```

**Note — Dependencies**
This connector requires additional dependencies. Install with:

```bash
pip install cocoindex[google_drive]
```

## As source

The connector provides two ways to read from Google Drive:

- `GoogleDriveSource` — high-level source class with async iteration
- `list_files()` — lower-level function returning a sync iterator

Both require a Google service account with access to the target Drive folders.

### Setting up a service account

1. Create a service account in the [Google Cloud Console](https://console.cloud.google.com/iam-admin/serviceaccounts)
2. Download the JSON credential file
3. Share the target Drive folders with the service account's email address

**Note — Google Workspace CLI**
[`gws`](https://github.com/googleworkspace/cli) is an optional, unofficial Google Workspace CLI. It is actively developed and subject to change, but can be useful for exploring or validating Drive API access before configuring CocoIndex's service-account flow. For example:

```bash
gws auth setup
gws auth login
gws drive files list
```

In headless or agent workflows, `gws` can also read credentials from `GOOGLE_WORKSPACE_CLI_CREDENTIALS_FILE`. CocoIndex still expects the service account JSON path in `service_account_credential_path`; use the `gws` credentials setting for `gws` commands themselves.

### GoogleDriveSource

The primary source class for iterating over Google Drive files.

```python
class GoogleDriveSource(
    *,
    service_account_credential_path: str,
    root_folder_ids: Sequence[str],
    mime_types: Sequence[str] | None = None,
)
```

**Parameters:**

- `service_account_credential_path` — Path to the service account JSON credential file.
- `root_folder_ids` — List of Google Drive folder IDs to scan. Subfolders are traversed recursively.
- `mime_types` — Optional list of MIME types to include. If `None`, all file types are included.

### Iterating files

`GoogleDriveSource` provides async iteration via `files()`, yielding `DriveFile` objects (implementing the [`FileLike`](../common_resources/data_types#filelike) base class):

```python
source = google_drive.GoogleDriveSource(
    service_account_credential_path="./credentials.json",
    root_folder_ids=["1abc...xyz"],
)

async for file in source.files():
    text = await file.read_text()
    ...
```

### Keyed iteration with `items()`

`items()` yields `(str, DriveFile)` pairs, where the key is the file's name path. This is useful with `mount_each()`:

```python
async for key, file in source.items():
    content = await file.read()
```

### Filtering by MIME type

Use `mime_types` to restrict which files are returned:

```python
source = google_drive.GoogleDriveSource(
    service_account_credential_path="./credentials.json",
    root_folder_ids=["1abc...xyz"],
    mime_types=["application/pdf", "text/plain"],
)
```

Google Workspace files (Docs, Sheets, Slides) are automatically exported:

| Google Workspace type | Exported as |
|---|---|
| Google Docs | Plain text |
| Google Sheets | CSV |
| Google Slides | Plain text |

### list_files

A lower-level sync iterator for listing files:

```python
def list_files(spec: GoogleDriveSourceSpec) -> Iterator[DriveFile]
```

**Parameters:**

- `spec` — A `GoogleDriveSourceSpec` with the same fields as `GoogleDriveSource` constructor parameters.

**Returns:** A sync iterator of `DriveFile` objects.
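
Because the iterator is synchronous, stepping it directly inside a coroutine would presumably block the event loop while each page of results is fetched. One generic way to consume any sync iterator from async code is to advance it in a worker thread. The helper below is a standard-library pattern, not part of the connector, and the demo uses a plain list iterator in place of `list_files(spec)`:

```python
import asyncio
from collections.abc import AsyncIterator, Iterator
from typing import TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the sync iterator


async def iterate_in_thread(items: Iterator[T]) -> AsyncIterator[T]:
    """Yield items from a sync iterator, advancing it off the event loop."""
    while True:
        # next(items, _DONE) runs in a worker thread and returns the
        # sentinel instead of raising StopIteration when exhausted.
        item = await asyncio.to_thread(next, items, _DONE)
        if item is _DONE:
            return
        yield item


async def collect(items: Iterator[T]) -> list[T]:
    return [item async for item in iterate_in_thread(items)]


names = asyncio.run(collect(iter(["a.txt", "b.txt"])))
```

If you do not need this level of control, `GoogleDriveSource.files()` already provides async iteration.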

### DriveFile

`DriveFile` implements [`FileLike`](../common_resources/data_types#filelike) with Google Drive-specific behavior:

- `file_path` — A `DriveFilePath` where `resolve()` returns the Google Drive file ID.
- `read()` / `read_text()` — Downloads file content via the Google Drive API. Partial reads (`size` parameter) are not supported.
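
For Google Workspace files, the export table earlier in this page can be read as a lookup from the file's Drive MIME type to an export format. The sketch below spells that out with the standard Google Workspace MIME types. The mapping to `text/plain` and `text/csv` is an assumption based on the "Plain text" and "CSV" entries in that table, and `export_mime_type` is a hypothetical helper:

```python
from typing import Optional

# Drive MIME type of the Workspace file -> assumed export MIME type.
WORKSPACE_EXPORTS = {
    "application/vnd.google-apps.document": "text/plain",      # Google Docs
    "application/vnd.google-apps.spreadsheet": "text/csv",     # Google Sheets
    "application/vnd.google-apps.presentation": "text/plain",  # Google Slides
}


def export_mime_type(mime_type: str) -> Optional[str]:
    """Return the export format for a Workspace file, or None for regular files."""
    return WORKSPACE_EXPORTS.get(mime_type)


docs_export = export_mime_type("application/vnd.google-apps.document")
pdf_export = export_mime_type("application/pdf")
```

Regular files such as PDFs have no entry in the lookup and are downloaded as-is.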

### Example

```python
import cocoindex as coco
from cocoindex.connectors import google_drive
from cocoindex.resources.file import FileLike

@coco.fn(memo=True)
async def process_file(file: FileLike) -> None:
    text = await file.read_text()
    # ... process the file content ...

@coco.fn
async def app_main(credential_path: str, folder_ids: list[str]) -> None:
    source = google_drive.GoogleDriveSource(
        service_account_credential_path=credential_path,
        root_folder_ids=folder_ids,
    )

    with coco.component_subpath("file"):
        async for key, file in source.items():
            await coco.mount(
                coco.component_subpath(key),
                process_file,
                file,
            )

app = coco.App(
    "GoogleDriveIngestion",
    app_main,
    credential_path="./credentials.json",
    folder_ids=["1abc...xyz"],
)
```

---

# Apache Iggy connector

Source: https://cocoindex.io/docs/connectors/iggy/

The `iggy` connector supports [Apache Iggy](https://iggy.apache.org/docs/) as
both a **source** and a **target**. Iggy organizes messages as
**streams → topics → partitions**; CocoIndex follows that model and treats an
Iggy topic partition either as a raw live stream or as an application-keyed
live map.

```python
from cocoindex.connectors import iggy
```

Install the optional dependency:

```bash
pip install cocoindex[iggy]
```

The connector expects an `apache_iggy.IggyClient` that you create, connect, and
provide through your app context. Streams and topics are user-managed:
CocoIndex does not create or drop them.

## As source

### As a live stream

Use `topic_as_stream()` when every Iggy message is an event and downstream code
does not need map-style deletion semantics.

```python
def topic_as_stream(
    client: IggyClient,
    consumer_group: str,
    stream: str,
    topic: str,
    *,
    partition_id: int = 0,
    batch_length: int = 100,
    allow_replay: bool = False,
    initial_high_watermark: int | None = None,
) -> TopicStream
```

`TopicStream.payloads()` adapts the stream to `LiveStream[bytes]` when the
processing logic only needs message payload bytes.

```python
events = iggy.topic_as_stream(
    client,
    consumer_group="cocoindex-worker",
    stream="orders",
    topic="events",
).payloads()
```

### As a live keyed map

Use `topic_as_map()` when message payloads encode an application-level key and
you want CocoIndex to treat the topic as a live map.

```python
def topic_as_map(
    client: IggyClient,
    consumer_group: str,
    stream: str,
    topic: str,
    *,
    key: KeyFn,
    is_deletion: IsDeleteFn | None = None,
    partition_id: int = 0,
    batch_length: int = 100,
    allow_replay: bool = False,
    initial_high_watermark: int | None = None,
) -> LiveMapFeed[StableKey, ReceiveMessage]
```

Iggy Python messages do not expose Kafka-style keys or tombstones, so `key` is
required. Return `None` from `key` to skip a message. Pass `is_deletion` if your
payload format has application-level delete events.

```python
import json


def key(message) -> str | None:
    payload = json.loads(message.payload())
    return payload.get("id")


items = iggy.topic_as_map(
    client,
    consumer_group="cocoindex-worker",
    stream="orders",
    topic="events",
    key=key,
)
```

### Readiness and offsets

The source connector disables Iggy auto-commit and stores offsets after the
downstream readiness handle completes. This mirrors Kafka-style back-pressure:
an offset is stored only after CocoIndex has finished processing the message.

For single-partition topics, the connector can infer the initial high watermark
from Iggy's topic details. For multi-partition topics, pass
`initial_high_watermark` for the consumed partition; the current Python SDK does
not expose per-partition high-watermark callbacks.
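
The ordering described above (process first, store the offset afterwards) can be sketched with plain `asyncio`. This is a conceptual illustration, not the connector's implementation; `consume`, `handle`, and `demo` are hypothetical names.

```python
import asyncio


async def consume(messages, handle, stored_offsets):
    """Store each offset only after its message has been handled."""
    for offset, payload in messages:
        await handle(payload)          # downstream processing must finish first
        stored_offsets.append(offset)  # only then is the offset recorded


async def demo():
    stored = []

    async def handle(payload):
        if payload == b"bad":
            raise ValueError("processing failed")

    try:
        await consume([(0, b"a"), (1, b"bad"), (2, b"c")], handle, stored)
    except ValueError:
        pass
    return stored


stored_after_failure = asyncio.run(demo())
```

Because the message at offset 1 never finishes processing, only offset 0 is stored, so a restart in this sketch would resume from offset 1 rather than skip it.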

## As target

The target connector sends bytes or strings to a user-managed Iggy
stream/topic/partition.

```python
IGGY = coco.ContextKey[IggyClient]("iggy")

target = await iggy.mount_iggy_topic_target(
    IGGY,
    stream="orders",
    topic="derived-events",
    partition=0,
)

target.declare_target_state(key="order-123", value=b'{"status":"ready"}')
```

Deletes need an application-level delete payload because Iggy does not have
Kafka-style tombstones:

```python
import json

target = await iggy.mount_iggy_topic_target(
    IGGY,
    stream="orders",
    topic="derived-events",
    # json.dumps quotes the key as valid JSON (repr() would emit single quotes)
    deletion_value_fn=lambda key: json.dumps({"id": key, "deleted": True}),
)
```

### Target APIs

```python
def declare_iggy_topic_target(
    client: ContextKey[IggyClient],
    stream: str,
    topic: str,
    *,
    partition: int = 0,
    deletion_value_fn: DeletionValueFn | None = None,
) -> IggyTopicTarget[PendingS]
```

```python
async def mount_iggy_topic_target(
    client: ContextKey[IggyClient],
    stream: str,
    topic: str,
    *,
    partition: int = 0,
    deletion_value_fn: DeletionValueFn | None = None,
) -> IggyTopicTarget[ResolvedS]
```

---

# Kafka connector

Source: https://cocoindex.io/docs/connectors/kafka/

The `kafka` connector supports Kafka as both a **source** (consuming messages as a live keyed map, or as a raw event stream) and a **target** (producing messages for declared target states).

```python
from cocoindex.connectors import kafka
```

**Note — Dependencies**
This connector requires additional dependencies. Install with:

```bash
pip install cocoindex[kafka]
```

## As source

The `kafka` connector can treat a Kafka topic as a live keyed map — each message is an upsert or delete for a key. It returns a [`LiveMapFeed`](../advanced_topics/live_component#livemapfeed-and-livemapview) for use with `mount_each()`.

### Setting up a consumer

Create an `AIOConsumer` directly — no `ContextKey` needed. The consumer must be **unsubscribed** (the connector handles subscription internally to manage partition rebalance callbacks).

```python
from confluent_kafka.aio import AIOConsumer

consumer = AIOConsumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "my-group",
    "enable.auto.commit": "false",
})
```

### `topic_as_map()`

```python
def topic_as_map(
    consumer: AIOConsumer,
    topics: list[str],
    *,
    is_deletion: IsDeleteFn | None = None,
) -> LiveMapFeed[bytes | str, Message]:
```

**Parameters:**

- `consumer` — An unsubscribed `AIOConsumer`. Auto-commit should be disabled.
- `topics` — Topics to subscribe to.
- `is_deletion` — Optional predicate `(message: Message) -> bool` for custom deletion detection on non-tombstone messages (see [Deletion handling](#deletion-handling)).

**Returns:** A `LiveMapFeed[bytes | str, Message]` where each item is keyed by the message key and the value is the full `confluent_kafka.Message` object.

### Deletion handling
Messages with `None` value (Kafka tombstones) are **always** treated as deletions. The optional `is_deletion` predicate provides additional deletion logic for non-tombstone messages:

```python
# Default: only tombstones are deletions
items = kafka.topic_as_map(consumer, ["my-topic"])

# Custom: also treat messages with a sentinel value as deletions
items = kafka.topic_as_map(
    consumer, ["my-topic"],
    is_deletion=lambda msg: msg.value() == b"DELETED",
)
```

### Offset management

Offsets are committed automatically with at-least-once semantics. Messages are processed in parallel, but an offset is only committed after all earlier messages in the same partition have been fully processed. Messages with `None` keys are logged as errors and skipped.
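
One way to picture "commit only after all earlier messages are done" is a per-partition tracker that advances across contiguous completed offsets. The sketch below is an assumption about how such bookkeeping could look, not the connector's code; `PartitionProgress` is a hypothetical name. It follows the Kafka convention that the committed offset is the next offset to read.

```python
class PartitionProgress:
    """Tracks finished offsets of one partition and what may be committed."""

    def __init__(self, next_offset: int) -> None:
        self.next_to_commit = next_offset  # first offset not yet committable
        self._done: set[int] = set()

    def mark_done(self, offset: int) -> int:
        """Record a finished message; return the offset that may be committed."""
        self._done.add(offset)
        # Advance only across a contiguous run of completed offsets.
        while self.next_to_commit in self._done:
            self._done.remove(self.next_to_commit)
            self.next_to_commit += 1
        return self.next_to_commit
```

If offset 2 finishes before offsets 0 and 1, nothing can be committed yet. Once 0 and 1 complete, the committable offset jumps to 3. A crash in between replays the unfinished messages, which is what gives at-least-once delivery.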

### Readiness

The feed signals readiness after catching up to the high watermark offsets that existed when consumption started. After that, it continues consuming indefinitely until the app is stopped.
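
The readiness condition can be expressed as a comparison between each partition's current position and the high watermark captured at startup. A small sketch under that reading; `caught_up` is a hypothetical helper, and both dictionaries map partition number to offset.

```python
def caught_up(positions: dict[int, int], high_watermarks: dict[int, int]) -> bool:
    """True once every partition's position has reached its snapshot watermark."""
    return all(positions.get(p, 0) >= hw for p, hw in high_watermarks.items())
```

Messages that arrive after the snapshot do not delay readiness; they are simply processed as part of the ongoing live consumption.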

### Example

```python
import pathlib

from confluent_kafka import Message
from confluent_kafka.aio import AIOConsumer
from cocoindex.connectors import kafka, localfs
import cocoindex as coco


@coco.fn(memo=True)
async def process_message(msg: Message, target: localfs.DirTarget) -> None:
    key = msg.key()
    value = msg.value()
    if isinstance(key, bytes):
        key = key.decode()
    target.declare_file(filename=f"{key}.bin", content=value)


@coco.fn
async def app_main(outdir: pathlib.Path) -> None:
    target = await localfs.mount_dir_target(outdir)

    consumer = AIOConsumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "my-group",
        "enable.auto.commit": "false",
    })
    items = kafka.topic_as_map(consumer, ["my-topic"])
    await coco.mount_each(process_message, items, target)


app = coco.App(
    coco.AppConfig(name="KafkaToFiles"),
    app_main,
    outdir=pathlib.Path("./out"),
)
app.update_blocking(live=True)
```

## As a live stream

Some downstream connectors don't model Kafka messages as a keyed map; they treat each message as an opaque event payload. For example, the [OCI Object Storage](./oci_object_storage#live-bucket-watching) live mode ingests Object Storage event JSON delivered through OCI Streaming. For these cases, the `kafka` connector exposes a topic as a [`LiveStream`](../advanced_topics/live_component#livestream), CocoIndex's keyless counterpart to `LiveMapFeed`: an opaque sequence of messages with the same readiness and back-pressure machinery but no key/delete semantics.

Set up the consumer the same way as for [`topic_as_map()`](#setting-up-a-consumer).

### `topic_as_stream()`

```python
def topic_as_stream(
    consumer: AIOConsumer,
    topics: list[str],
) -> TopicStream:
```

**Parameters:**

- `consumer` — An unsubscribed `AIOConsumer`. Auto-commit should be disabled.
- `topics` — Topics to subscribe to.

**Returns:** A `TopicStream` (a `LiveStream[Message]`) that delivers raw `confluent_kafka.Message` objects to its subscriber. [Offset management](#offset-management) and [readiness](#readiness) behave identically to `topic_as_map()`.

### `TopicStream.payloads()`

```python
def payloads(self) -> LiveStream[bytes]:
```

A view over the same `TopicStream` that yields each message's value (the byte payload) instead of the full `Message`. Null-valued messages (Kafka tombstones) are filtered out of this view; consumers that need tombstone semantics should subscribe to the `TopicStream` directly.

This is the typical input for sources that consume opaque event payloads:

```python
from confluent_kafka.aio import AIOConsumer
from cocoindex.connectors import kafka, oci_object_storage

consumer = AIOConsumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "my-group",
    "enable.auto.commit": "false",
})
topic_stream = kafka.topic_as_stream(consumer, ["object-storage-events"])

walker = oci_object_storage.list_objects(
    client, "my-namespace", "my-bucket",
    live_stream=topic_stream.payloads(),
)
```

For a complete app using this pattern, see the [OCI Object Storage live-mode example](./oci_object_storage#example).

**Caution**
A `TopicStream` (and any `payloads()` view of it) supports at most one active watcher — the underlying consumer can hold only one subscription. A second concurrent `watch()` raises `RuntimeError`.

## As target

The `kafka` connector provides target state APIs for producing messages to Kafka topics. Topics are user-managed (CocoIndex does not create or drop topics) — CocoIndex only produces messages to them.

### Setting up a producer

Create a `ContextKey[AIOProducer]` to identify your producer, then provide it in your lifespan:

**Note**
The key name is load-bearing across runs — it's the stable identity CocoIndex uses to track target state for topics produced through this key. See [ContextKey as stable identity](../programming_guide/context#contextkey-as-stable-identity) before renaming.

```python
from collections.abc import AsyncIterator

from confluent_kafka.aio import AIOProducer
import cocoindex as coco

KAFKA_PRODUCER = coco.ContextKey[AIOProducer]("my_kafka_producer")

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    producer = AIOProducer({"bootstrap.servers": "localhost:9092"})
    builder.provide(KAFKA_PRODUCER, producer)
    yield
```

### Declaring target states

#### Topics (parent state)

Declares a topic as a target state. Returns a `KafkaTopicTarget` for declaring messages.

```python
def declare_kafka_topic_target(
    producer: ContextKey[AIOProducer],
    topic: str,
    *,
    deletion_value_fn: DeletionValueFn | None = None,
) -> KafkaTopicTarget[coco.PendingS]
```

**Parameters:**

- `producer` — A `ContextKey[AIOProducer]` identifying the producer to use.
- `topic` — The Kafka topic name.
- `deletion_value_fn` — Optional callback that produces a deletion value for a given key (see [Deletion handling](#deletion-handling)).

**Returns:** A pending `KafkaTopicTarget`. Use the async convenience wrapper to resolve:

```python
topic_target = await kafka.mount_kafka_topic_target(
    KAFKA_PRODUCER, "my-topic"
)
```

#### Messages (child states)

Once a `KafkaTopicTarget` is resolved, declare target states to produce messages:

```python
def KafkaTopicTarget.declare_target_state(
    self,
    *,
    key: bytes | str,
    value: bytes | str,
) -> None
```

**Parameters:**

- `key` — The message key, used as the stable identity for change detection.
- `value` — The message value.

CocoIndex fingerprints the value and only produces a message when it has changed since the last run.

### Deletion handling

When a previously declared target state is no longer declared, CocoIndex produces a deletion message. The behavior depends on `deletion_value_fn`:

- **Without callback** (default): Produces a message with the key and no value (Kafka tombstone).
- **With callback**: Calls `deletion_value_fn(key)` to produce the deletion value.

```python
# Tombstone on deletion (default)
topic_target = await kafka.mount_kafka_topic_target(
    KAFKA_PRODUCER, "my-topic"
)

# Custom deletion value
topic_target = await kafka.mount_kafka_topic_target(
    KAFKA_PRODUCER, "my-topic",
    deletion_value_fn=lambda key: b'{"deleted": true}',
)
```

### Example

```python
from collections.abc import AsyncIterator

from confluent_kafka.aio import AIOProducer
from cocoindex.connectors import kafka, localfs
import cocoindex as coco

KAFKA_PRODUCER = coco.ContextKey[AIOProducer]("my_kafka_producer")


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    producer = AIOProducer({"bootstrap.servers": "localhost:9092"})
    builder.provide(KAFKA_PRODUCER, producer)
    yield


@coco.fn(memo=True)
async def process_file(
    file: localfs.File, topic_target: kafka.KafkaTopicTarget
) -> None:
    content = await file.read_bytes()
    topic_target.declare_target_state(
        key=file.file_path.path.as_posix().encode(),
        value=content,
    )


@coco.fn
async def app_main() -> None:
    topic_target = await kafka.mount_kafka_topic_target(
        KAFKA_PRODUCER, "file-contents"
    )

    files = localfs.walk_dir(localfs.FilePath(path="./data"))
    await coco.mount_each(process_file, files.items(), topic_target)


app = coco.App(
    coco.AppConfig(name="FilesToKafka"),
    app_main,
)
app.update_blocking(report_to_stdout=True)
```

---

# LanceDB connector

Source: https://cocoindex.io/docs/connectors/lancedb/

The `lancedb` connector provides utilities for writing rows to LanceDB tables, with automatic schema inference from Python classes and support for declaring vector and full-text search (FTS) indexes. CocoIndex manages the table lifecycle — creating, dropping, and evolving the schema — and keeps rows in sync via incremental upserts and deletions.

```python
from cocoindex.connectors import lancedb
```

**Note — Dependencies**
This connector requires additional dependencies. Install with:

```bash
pip install cocoindex[lancedb]
```

## Connection setup

LanceDB connections are created directly via the LanceDB library. CocoIndex exposes thin wrappers:

```python
async def connect_async(uri: str, **options: Any) -> LanceAsyncConnection
def connect(uri: str, **options: Any) -> lancedb.DBConnection
```

**Parameters:**

- `uri` — LanceDB URI (local path like `"./lancedb_data"` or cloud URI like `"s3://bucket/path"`).
- `**options` — Additional options passed directly to `lancedb.connect_async()` / `lancedb.connect()`.

**Returns:** A LanceDB connection.

**Example:**

```python
conn = await lancedb.connect_async("./lancedb_data")
```

## As target

The `lancedb` connector provides target state APIs for writing rows to tables. CocoIndex tracks what rows should exist and automatically handles upserts and deletions.

### Declaring target states

#### Setting up a connection

Create a `ContextKey[lancedb.LanceAsyncConnection]` to identify your LanceDB connection, then provide it in your lifespan:

**Note**
The key name is load-bearing across runs — it's the stable identity CocoIndex uses to track managed tables. See [ContextKey as stable identity](../programming_guide/context#contextkey-as-stable-identity) before renaming.

```python
import cocoindex as coco

LANCE_DB = coco.ContextKey[lancedb.LanceAsyncConnection]("main_db")

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    conn = await lancedb.connect_async(LANCEDB_URI)
    builder.provide(LANCE_DB, conn)
    yield
```

#### Tables (parent state)

Declares a table as a target state. Returns a `TableTarget` for declaring rows.

```python
def declare_table_target(
    db: ContextKey[LanceAsyncConnection],
    table_name: str,
    table_schema: TableSchema[RowT],
    *,
    managed_by: Literal["system", "user"] = "system",
    num_transactions_before_optimize: int = 50,
) -> TableTarget[RowT, coco.PendingS]
```

**Parameters:**

- `db` — A `ContextKey[LanceAsyncConnection]` identifying the connection to use.
- `table_name` — Name of the table.
- `table_schema` — Schema definition including columns and primary key (see [Table Schema](#table-schema-from-python-class)).
- `managed_by` — Whether CocoIndex manages the table lifecycle (`"system"`) or assumes it exists (`"user"`).
- `num_transactions_before_optimize` — Number of successful row mutation batches before scheduling a background LanceDB `table.optimize()` call.

**Returns:** A pending `TableTarget`. Use the convenience wrapper `await lancedb.mount_table_target(LANCE_DB, table_name, table_schema)` to resolve.

#### Rows (child states)

Once a `TableTarget` is resolved, declare rows to be upserted:

```python
def TableTarget.declare_row(
    self,
    *,
    row: RowT,
) -> None
```

**Parameters:**

- `row` — A row object (dict, dataclass, NamedTuple, or Pydantic model). Must include all primary key columns.

#### Vector indexes (attachment)

Declare a vector index on a vector column to accelerate similarity search. Vector indexes are an [attachment](../advanced_topics/custom_target_connector#implementing-attachment-providers) to a `TableTarget`:

```python
def TableTarget.declare_vector_index(
    self,
    *,
    name: str | None = None,
    column: str,
    metric: Literal["cosine", "l2", "dot"] = "cosine",
    index_type: Literal["ivf_pq", "hnsw_pq"] = "ivf_pq",
    num_partitions: int | None = None,
    num_sub_vectors: int | None = None,
    num_bits: int | None = None,
    m: int | None = None,
    ef_construction: int | None = None,
) -> None
```

**Parameters:**

- `name` — Logical index name (defaults to `column`).
- `column` — Vector column to index.
- `metric` — Distance metric: `"cosine"` (default), `"l2"`, or `"dot"`.
- `index_type` — Index algorithm: `"ivf_pq"` (IVF-PQ, default) or `"hnsw_pq"` (HNSW-PQ).
- `num_partitions` — *(IVF-PQ only)* Number of IVF partitions.
- `num_sub_vectors` — *(IVF-PQ / HNSW-PQ)* Number of PQ sub-vectors.
- `num_bits` — *(IVF-PQ / HNSW-PQ)* Number of bits per PQ code.
- `m` — *(HNSW-PQ only)* Maximum number of HNSW edges per node.
- `ef_construction` — *(HNSW-PQ only)* Size of the HNSW candidate list during build.

Parameters left as `None` fall back to LanceDB's defaults.

**Example:**

```python
table.declare_vector_index(column="embedding", metric="cosine")
```

#### FTS indexes (attachment)

Declare a full-text search (FTS) index on a text column to enable keyword and phrase search. Like vector indexes, FTS indexes are an [attachment](../advanced_topics/custom_target_connector#implementing-attachment-providers) to a `TableTarget`:

```python
def TableTarget.declare_fts_index(
    self,
    *,
    name: str | None = None,
    column: str,
    language: str = "English",
    with_position: bool = True,
) -> None
```

**Parameters:**

- `name` — Logical index name (defaults to `column`).
- `column` — Text column to index.
- `language` — Tokenizer language (e.g. `"English"`, `"Chinese"`).
- `with_position` — Whether to store token positions (enables phrase queries). Defaults to `True`.

**Example:**

```python
table.declare_fts_index(column="content")
```

**Note**
Indexes are reconciled as part of the table's target state: changing a declaration replaces the index in place, removing a declaration drops the index, and dropping the table removes all its indexes.
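
The reconciliation described in the note can be pictured as a diff between the indexes that currently exist and the ones declared in this run. The sketch below is conceptual; `reconcile_indexes` is a hypothetical helper, and index specs are modeled as plain dictionaries keyed by index name.

```python
def reconcile_indexes(existing: dict[str, dict], declared: dict[str, dict]):
    """Compare index specs by name and return (create, replace, drop) name lists."""
    create = sorted(n for n in declared if n not in existing)
    replace = sorted(
        n for n in declared if n in existing and declared[n] != existing[n]
    )
    drop = sorted(n for n in existing if n not in declared)
    return create, replace, drop
```

An index whose declaration is unchanged appears in none of the three lists, so it is left untouched.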

### Table schema: from Python class

Define the table structure using a Python class (dataclass, NamedTuple, or Pydantic model):

```python
@classmethod
async def TableSchema.from_class(
    cls,
    record_type: type[RowT],
    primary_key: list[str],
    *,
    column_specs: dict[str, LanceType | VectorSchemaProvider] | None = None,
) -> TableSchema[RowT]
```

**Parameters:**

- `record_type` — A record type whose fields define table columns.
- `primary_key` — List of column names forming the primary key.
- `column_specs` — Optional per-column overrides for type mapping or vector configuration.

**Example:**

```python
@dataclass
class OutputDocument:
    doc_id: str
    title: str
    content: str
    embedding: Annotated[NDArray, embedder]

schema = await lancedb.TableSchema.from_class(
    OutputDocument,
    primary_key=["doc_id"],
)
```

Python types are automatically mapped to PyArrow types:

| Python Type | PyArrow Type |
|-------------|--------------|
| `bool` | `bool` |
| `int` | `int64` |
| `float` | `float64` |
| `str` | `string` |
| `bytes` | `binary` |
| `list`, `dict`, nested structs | `string` (JSON encoded) |
| `NDArray` (with vector schema) | `fixed_size_list<float>` |
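
For orientation, the scalar and container rows of the table can be written as a small lookup. This is a sketch of the mapping only, using type names as strings so it runs without PyArrow; `arrow_type_name` is a hypothetical helper, and vector columns and nested structs are left out.

```python
def arrow_type_name(py_type: type) -> str:
    """Return the PyArrow type name from the table above for a Python type."""
    scalars = {
        bool: "bool",
        int: "int64",
        float: "float64",
        str: "string",
        bytes: "binary",
    }
    if py_type in scalars:
        return scalars[py_type]
    if py_type in (list, dict):
        return "string"  # stored JSON encoded
    raise TypeError(f"no default mapping for {py_type!r}")
```

Types outside the table raise `TypeError` in this sketch; in the connector, that is where an explicit override would come in.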

To override the default mapping, provide a `LanceType` or `VectorSchemaProvider` via:

- **Type annotation** — using `typing.Annotated` on the field
- **`column_specs`** — passing overrides when constructing `TableSchema`

#### LanceType

Use `LanceType` to specify a custom PyArrow type or encoder:

```python
from dataclasses import dataclass
from typing import Annotated
from cocoindex.connectors.lancedb import LanceType
import pyarrow as pa

@dataclass
class MyRow:
    id: Annotated[int, LanceType(pa.int32())]
    value: Annotated[float, LanceType(pa.float32())]
```

#### VectorSchemaProvider

For `NDArray` fields, a `VectorSchemaProvider` annotation specifies the vector dimension and dtype. The annotation accepts a `VectorSchemaProvider`, a `ContextKey`, or an explicit `VectorSchema`. See [Vector Schema](../common_resources/vector_schema#vectorschemaprovider) for details.

### Table schema: explicit column definitions

Define columns directly using `ColumnDef`:

```python
def TableSchema.__init__(
    self,
    columns: dict[str, ColumnDef],
    primary_key: list[str],
) -> None
```

**Example:**

```python
schema = lancedb.TableSchema(
    {
        "doc_id": lancedb.ColumnDef(type=pa.string(), nullable=False),
        "title": lancedb.ColumnDef(type=pa.string()),
        "content": lancedb.ColumnDef(type=pa.string()),
        "embedding": lancedb.ColumnDef(type=pa.list_(pa.float32(), list_size=384)),
    },
    primary_key=["doc_id"],
)
```

### Example

```python
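from collections.abc import AsyncIterator
from dataclasses import dataclass
from typing import Annotated

from numpy.typing import NDArray

import cocoindex as coco
from cocoindex.connectors import lancedb

LANCEDB_URI = "./lancedb_data"

LANCE_DB = coco.ContextKey[lancedb.LanceAsyncConnection]("main_db")

# `embedder` is assumed to be a VectorSchemaProvider defined elsewhere in your app.
@dataclass
class OutputDocument:
    doc_id: str
    title: str
    content: str
    embedding: Annotated[NDArray, embedder]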

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    conn = await lancedb.connect_async(LANCEDB_URI)
    builder.provide(LANCE_DB, conn)
    yield

@coco.fn
async def app_main() -> None:
    # Declare table target state
    table = await lancedb.mount_table_target(
        LANCE_DB,
        "documents",
        await lancedb.TableSchema.from_class(
            OutputDocument,
            primary_key=["doc_id"],
        ),
    )

    # Declare a vector index for similarity search
    table.declare_vector_index(column="embedding", metric="cosine")

    # Declare rows (`documents` is assumed to be an iterable of OutputDocument built elsewhere)
    for doc in documents:
        table.declare_row(row=doc)
```

---

# Local filesystem connector

Source: https://cocoindex.io/docs/connectors/localfs/

The `localfs` connector provides utilities for reading files from and writing files to the local file system.

```python
from cocoindex.connectors import localfs
```

## Stable memoization with FilePath

A key feature of the `localfs` connector is [**stable memoization**](../programming_guide/function) through `FilePath`. When you move your entire project directory, memoization keys remain stable as long as you use the same `ContextKey` key string for the base directory.

### Using ContextKey for stable base directories

Define a `ContextKey[pathlib.Path]` with a stable string identifier. Provide the actual path in your app's lifespan, then pass the key directly to `walk_dir()`, `declare_dir_target()`, and related functions.

```python
from collections.abc import AsyncIterator
import pathlib

import cocoindex as coco
from cocoindex.connectors import localfs

# Define a stable key (the string "source_dir" is the stable memo identifier)
SOURCE_DIR = coco.ContextKey[pathlib.Path]("source_dir")

@coco.fn
async def app_main() -> None:
    async for file in localfs.walk_dir(SOURCE_DIR, recursive=True):
        # file.file_path has a stable memo key based on "source_dir"
        await process(file)

# In your lifespan, provide the actual path:
@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    builder.provide(SOURCE_DIR, pathlib.Path("./data"))
    yield
```

When you move your project to a different location, just update the path in `builder.provide()` — memoization keys remain the same because they're based on the stable key string (`"source_dir"`), not the filesystem path.
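
The idea can be illustrated without CocoIndex: if the memo identity is built from the key string and the relative path, it does not change when the base directory does. The helpers `memo_key` and `resolve` below are hypothetical and only model the concept.

```python
import pathlib


def memo_key(base_key: str, relative_path: str) -> tuple[str, str]:
    """Identity used for memoization: the stable key string plus the relative path."""
    return (base_key, pathlib.PurePosixPath(relative_path).as_posix())


def resolve(
    bases: dict[str, pathlib.PurePosixPath], key: tuple[str, str]
) -> pathlib.PurePosixPath:
    """Turn a memo key into a concrete path using the currently provided base."""
    return bases[key[0]] / key[1]
```

The same key `("source_dir", "docs/a.md")` resolves to different absolute paths depending on which base is provided, while the key itself stays identical across moves.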

### FilePath

`FilePath` combines an optional base directory key and a relative path. Passing a `ContextKey[Path]` directly to any localfs function is equivalent to constructing `FilePath(base_dir=key)`.

`FilePath` supports all `pathlib.PurePath` operations:

```python
SOURCE_DIR = coco.ContextKey[pathlib.Path]("source_dir")
source = localfs.FilePath(base_dir=SOURCE_DIR)

# Create paths using the / operator
config_path = source / "config" / "settings.json"

# Access path properties
print(config_path.name)      # "settings.json"
print(config_path.suffix)    # ".json"
print(config_path.parent)    # FilePath pointing to "config/"

# Resolve to absolute path (requires active component context)
abs_path = config_path.resolve()  # pathlib.Path
```

See [FilePath](../common_resources/data_types#filepath) in Resource Types for full details.

## As source

Use `walk_dir()` to iterate over files in a directory. It returns a `DirWalker` that supports both synchronous and asynchronous iteration.

```python
def walk_dir(
    path: FilePath | Path | ContextKey[Path],
    *,
    live: bool = False,
    recursive: bool = False,
    path_matcher: FilePathMatcher | None = None,
) -> DirWalker
```

**Parameters:**

- `path` — The root directory path to walk through. Can be a `FilePath`, a `pathlib.Path`, or a `ContextKey[Path]` (equivalent to `FilePath(base_dir=path)`).
- `live` — If `True`, `items()` returns a [`LiveMapView`](../advanced_topics/live_component#livemapfeed-and-livemapview) that supports live file watching via `mount_each()`.
- `recursive` — If `True`, recursively walk subdirectories.
- `path_matcher` — Optional filter for files and directories. See [PatternFilePathMatcher](../common_resources/data_types#patternfilepathmatcher).

**Returns:** A `DirWalker` that supports async iteration via `async for`.

### Iterating files

`walk_dir()` returns a `DirWalker` that supports async iteration, yielding `File` objects (implementing the [`FileLike`](../common_resources/data_types#filelike) base class):

```python
async for file in localfs.walk_dir("/path/to/documents", recursive=True):
    text = await file.read_text()
    ...
```

File I/O runs in a thread pool, keeping the event loop responsive.

### Keyed iteration with `items()`

`DirWalker.items()` yields keyed `(str, File)` pairs, useful for associating each file with a stable string key (its relative path):

```python
async for key, file in localfs.walk_dir("/path/to/dir", recursive=True).items():
    content = await file.read()
```

### Filtering files

Use `PatternFilePathMatcher` to filter which files and directories are included:

```python
from cocoindex.connectors import localfs
from cocoindex.resources.file import PatternFilePathMatcher

# Include only .py and .md files, exclude hidden directories and test files
matcher = PatternFilePathMatcher(
    included_patterns=["**/*.py", "**/*.md"],
    excluded_patterns=["**/.*", "**/test_*", "**/__pycache__"],
)

async for file in localfs.walk_dir("/path/to/project", recursive=True, path_matcher=matcher):
    await process(file)
```

### Live file watching

When `live=True`, `items()` returns a [`LiveMapView`](../advanced_topics/live_component#livemapfeed-and-livemapview) instead of a plain `AsyncIterable`. Combined with [`mount_each()`](../programming_guide/processing_component#mount_each), this enables automatic incremental file watching — new, modified, and deleted files are processed without a full rescan:

```python
files = localfs.walk_dir(
    sourcedir, recursive=True,
    path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]),
    live=True,
)
await coco.mount_each(process_file, files.items(), target)
```

See [Live Mode](../programming_guide/live_mode) for how this works and how to enable it on the app.
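
The three kinds of change that live watching reacts to (new, modified, deleted) can be illustrated by comparing two directory snapshots. This is only a conceptual sketch: `diff_snapshots` is a hypothetical helper, snapshots are modeled as `{relative_path: mtime}` dictionaries, and the connector itself may detect changes differently (for example through OS file notifications).

```python
def diff_snapshots(before: dict[str, float], after: dict[str, float]):
    """Compare two {relative_path: mtime} snapshots and classify the changes."""
    added = sorted(p for p in after if p not in before)
    modified = sorted(p for p in after if p in before and after[p] != before[p])
    deleted = sorted(p for p in before if p not in after)
    return added, modified, deleted
```

Only the paths in these three lists would need reprocessing or cleanup; unchanged files are skipped.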

### Example

```python
import pathlib
import cocoindex as coco
from cocoindex.connectors import localfs
from cocoindex.resources.file import FileLike, PatternFilePathMatcher

SOURCE_DIR = coco.ContextKey[pathlib.Path]("source_dir")

@coco.fn
async def app_main() -> None:
    matcher = PatternFilePathMatcher(included_patterns=["**/*.md"])

    async for file in localfs.walk_dir(SOURCE_DIR, recursive=True, path_matcher=matcher):
        await coco.mount(
            coco.component_subpath("file", str(file.file_path.path)),
            process_file,
            file,
        )

@coco.fn(memo=True)
async def process_file(file: FileLike) -> None:
    text = await file.read_text()
    # ... process the file content ...
```

## As target

The `localfs` connector provides target state APIs for writing files. CocoIndex tracks what files should exist and automatically handles creation, updates, and deletion.

### declare_file

Declare a single file target. This is the simplest way to write a file.

```python
@coco.fn
def declare_file(
    path: FilePath | Path | ContextKey[Path],
    content: bytes | str,
    *,
    create_parent_dirs: bool = False,
) -> None
```

**Parameters:**

- `path` — The filesystem path for the file. Can be a `FilePath`, a `pathlib.Path`, or a `ContextKey[Path]`.
- `content` — The file content (bytes or str).
- `create_parent_dirs` — If `True`, create parent directories if they don't exist.

**Example:**

```python
OUTPUT_DIR = coco.ContextKey[pathlib.Path]("output_dir")

@coco.fn
def app_main() -> None:
    coco.mount(
        localfs.declare_file,
        localfs.FilePath("readme.txt", base_dir=OUTPUT_DIR),
        content="Hello, world!",
        create_parent_dirs=True,
    )
```

### declare_dir_target

Declare a directory target for writing multiple files. Returns a `DirTarget` for declaring files within.

```python
@coco.fn
def declare_dir_target(
    path: FilePath | Path | ContextKey[Path],
    *,
    create_parent_dirs: bool = True,
) -> DirTarget[coco.PendingS]
```

**Parameters:**

- `path` — The filesystem path for the directory. Can be a `FilePath`, a `pathlib.Path`, or a `ContextKey[Path]`.
- `create_parent_dirs` — If `True`, create parent directories if they don't exist. Defaults to `True`.

**Returns:** A pending `DirTarget`. Use `await coco.mount_target(...)` or the convenience wrapper `await localfs.mount_dir_target(path)` to resolve.

### DirTarget.declare_file

Declares a file to be written within the directory.

```python
def declare_file(
    self,
    filename: str | PurePath,
    content: bytes | str,
    *,
    create_parent_dirs: bool = False,
) -> None
```

**Parameters:**

- `filename` — The name of the file (can include subdirectory path).
- `content` — The file content (bytes or str).
- `create_parent_dirs` — If `True`, create parent directories within the target directory.

### DirTarget.declare_dir_target

Declares a subdirectory target within the directory.

```python
def declare_dir_target(
    self,
    path: str | PurePath,
    *,
    create_parent_dirs: bool = False,
) -> DirTarget[coco.PendingS]
```

**Parameters:**

- `path` — The path of the subdirectory (relative to this directory).
- `create_parent_dirs` — If `True`, create parent directories.

**Returns:** A `DirTarget` for the subdirectory.

### Target example

```python
import pathlib
import cocoindex as coco
from cocoindex.connectors import localfs
from cocoindex.resources.file import FileLike, PatternFilePathMatcher

SOURCE_DIR = coco.ContextKey[pathlib.Path]("source_dir")
OUTPUT_DIR = coco.ContextKey[pathlib.Path]("output_dir")

@coco.fn
async def app_main() -> None:
    # Declare output directory target using context key
    target = await localfs.mount_dir_target(OUTPUT_DIR)

    # Process files and write outputs
    await coco.mount_each(process_file, localfs.walk_dir(SOURCE_DIR, recursive=True).items(), target)

@coco.fn(memo=True)
async def process_file(file: FileLike, target: localfs.DirTarget) -> None:
    # Transform the file
    content = (await file.read_text()).upper()

    # Write to output with same relative path
    target.declare_file(
        filename=file.file_path.path,
        content=content,
        create_parent_dirs=True,
    )
```

---

# Neo4j connector

Source: https://cocoindex.io/docs/connectors/neo4j/

The `neo4j` connector writes records to [Neo4j](https://neo4j.com), a property graph database. It supports node tables (labels), relationship tables (edge types), per-database multitenancy (one Neo4j cluster, many isolated databases), real Cypher uniqueness constraints, and vector indexes via the `CREATE VECTOR INDEX` DDL form.

```python
from cocoindex.connectors import neo4j
```

**Note — Dependencies**
This connector requires additional dependencies. Install with:

```bash
pip install cocoindex[neo4j]
```

Targets Neo4j 5.18+. Vector-index DDL (`CREATE VECTOR INDEX … OPTIONS { indexConfig: { … } }`) shipped in 5.18 — older 5.x servers will reject the DDL the connector emits.

## Connection setup

Create a `ConnectionFactory` and provide it via a `ContextKey`. The factory holds the Bolt URI, optional auth, and the target database name; it lazily opens a Neo4j async driver and returns a graph handle on demand.

**Note**
The key name is load-bearing across runs — it's the stable identity CocoIndex uses to track managed rows. See [ContextKey as stable identity](../programming_guide/context#contextkey-as-stable-identity) before renaming.

```python
from collections.abc import AsyncIterator
from cocoindex.connectors import neo4j
import cocoindex as coco

KG_DB: coco.ContextKey[neo4j.ConnectionFactory] = coco.ContextKey("kg_db")

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    builder.provide(
        KG_DB,
        neo4j.ConnectionFactory(
            uri="bolt://localhost:7687",
            auth=("neo4j", "cocoindex"),
            database="neo4j",
        ),
    )
    yield
```

`auth` is optional; omit it for unauthenticated dev instances. `database` defaults to `"neo4j"`, the default database that ships with every Neo4j 5 installation.

### Multitenancy

A single Neo4j cluster can host many isolated databases. Pair each database with its own `ContextKey` and `ConnectionFactory(database=...)`:

```python
KG_DB: coco.ContextKey[neo4j.ConnectionFactory] = coco.ContextKey("kg_db")
APIS_DB: coco.ContextKey[neo4j.ConnectionFactory] = coco.ContextKey("apis_db")

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    uri = "bolt://localhost:7687"
    auth = ("neo4j", "cocoindex")
    builder.provide(KG_DB, neo4j.ConnectionFactory(uri=uri, auth=auth, database="kg"))
    builder.provide(APIS_DB, neo4j.ConnectionFactory(uri=uri, auth=auth, database="apis"))
    yield
```

Different `ContextKey`s with different database names produce fully separate target-state trees — changes to one never spill into the other.

## As target

The `neo4j` connector provides target state APIs for writing records to node tables and relation tables. CocoIndex tracks what records should exist and automatically handles upserts and deletions.

Each apply batch is wrapped in a single Neo4j transaction (`tx.commit()` on success, rollback on exception), so partial writes never leak into the database. Within a batch, writes are ordered as **node upserts → relation upserts → relation deletes → node deletes** so dependent edges always see their endpoints.
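
The ordering rule above can be sketched as a stable sort over phases. This is an illustrative model only: `Mutation`, `PHASE_RANK`, and `order_batch` are hypothetical names, not connector internals.

```python
from dataclasses import dataclass

# Lower rank runs first, following the order stated above:
# node upserts -> relation upserts -> relation deletes -> node deletes.
PHASE_RANK = {
    ("node", "upsert"): 0,
    ("relation", "upsert"): 1,
    ("relation", "delete"): 2,
    ("node", "delete"): 3,
}


@dataclass(frozen=True)
class Mutation:
    kind: str    # "node" or "relation"
    action: str  # "upsert" or "delete"
    key: str


def order_batch(mutations: list[Mutation]) -> list[Mutation]:
    # sorted() is stable, so mutations within a phase keep their original order.
    return sorted(mutations, key=lambda m: PHASE_RANK[(m.kind, m.action)])


batch = [
    Mutation("node", "delete", "Entity:old"),
    Mutation("relation", "upsert", "rel-1"),
    Mutation("node", "upsert", "Entity:Neo4j"),
    Mutation("relation", "delete", "rel-0"),
]
ordered = order_batch(batch)
```

Edges are written after the nodes they connect and removed before those nodes are deleted, so no step in the batch sees a dangling endpoint.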

### Declaring target states

#### Node tables (parent state)

Declares a node label as a target state. Returns a `TableTarget` for declaring records.

```python
def declare_table_target(
    db: ContextKey,
    table_name: str,
    table_schema: TableSchema[RowT] | None = None,
    *,
    primary_key: str = "id",
    managed_by: Literal["system", "user"] = "system",
) -> TableTarget[RowT, coco.PendingS]
```

**Parameters:**

- `db` — A `ContextKey[neo4j.ConnectionFactory]` for the Neo4j connection.
- `table_name` — The Cypher node label (e.g. `"Document"`).
- `table_schema` — Optional schema definition (see [Table Schema](#table-schema-from-python-class)). The schema participates in CocoIndex's fingerprint (so two flows declaring the same label must agree); per-property type DDL is not emitted in v1.
- `primary_key` — Single property name used as the node's primary key. Defaults to `"id"`. Compound primary keys are not supported in v1.0.
- `managed_by` — Whether CocoIndex manages the table lifecycle (`"system"`) or assumes it exists (`"user"`).

**Returns:** A pending `TableTarget`. Use `await neo4j.mount_table_target(KG_DB, ...)` to get a resolved target.

#### Records (child states)

Once a `TableTarget` is resolved, declare records to be upserted (translated to `MERGE (n:Label {pk: $key_0}) SET n += $props`):

```python
def TableTarget.declare_record(
    self,
    *,
    row: RowT,
) -> None
```

**Parameters:**

- `row` — A row object (dict, dataclass, NamedTuple, or Pydantic model). Must include the `primary_key` field declared above.

`declare_row` is an alias for `declare_record`, for compatibility with Postgres and other RDBMS targets.
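
As a rough illustration of the `MERGE ... SET n += $props` translation mentioned above, the sketch below builds a query string and parameter map from a row. The helper `render_upsert` is hypothetical, and whether the connector includes the primary key inside `$props` is an assumption made here for simplicity.

```python
from dataclasses import asdict, dataclass


@dataclass
class Document:
    filename: str
    title: str


def render_upsert(label: str, primary_key: str, row: Document) -> tuple[str, dict]:
    """Hypothetical helper: shape of an upsert for one record."""
    props = asdict(row)
    # Backticks quote the label and property name.
    query = f"MERGE (n:`{label}` {{`{primary_key}`: $key_0}}) SET n += $props"
    # Assumption: all fields, including the key, are sent as $props.
    return query, {"key_0": props[primary_key], "props": props}


query, params = render_upsert(
    "Document", "filename", Document(filename="overview.md", title="Overview")
)
```

Because `MERGE` matches on the primary key, declaring the same record twice updates the existing node rather than creating a duplicate.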

#### Relation tables (parent state)

Declares a relationship type as a target state. Returns a `RelationTarget` for declaring edges.

```python
def declare_relation_target(
    db: ContextKey,
    table_name: str,
    from_table: TableTarget,
    to_table: TableTarget,
    table_schema: TableSchema[RowT] | None = None,
    *,
    primary_key: str = "id",
    managed_by: Literal["system", "user"] = "system",
) -> RelationTarget[RowT, coco.PendingS]
```

**Parameters:**

- `db` — A `ContextKey[neo4j.ConnectionFactory]` for the Neo4j connection.
- `table_name` — The Cypher relationship type (e.g. `"MENTION"`).
- `from_table` — The `TableTarget` whose nodes are the *source* endpoints of edges in this relationship.
- `to_table` — The `TableTarget` whose nodes are the *target* endpoints of edges in this relationship.
- `table_schema` — Optional schema for the relationship's own properties. The relationship's `primary_key` field uniquely identifies each edge.
- `primary_key` — Single property name used as the edge's primary key. Defaults to `"id"`.
- `managed_by` — Whether CocoIndex manages the relationship lifecycle (`"system"`) or assumes it exists (`"user"`).

**Returns:** A pending `RelationTarget`. Use `await neo4j.mount_relation_target(KG_DB, ...)` to get a resolved target.

#### Relations (child states)

Once a `RelationTarget` is resolved, declare edges. Each declaration produces a triple-MERGE: source endpoint, target endpoint, then the relationship.

```python
def RelationTarget.declare_relation(
    self,
    *,
    from_id: Any,
    to_id: Any,
    record: RowT | None = None,
) -> None
```

**Parameters:**

- `from_id` — The source node's primary-key value. The connector MERGEs `(s:FromLabel {pk: $from_id})` so endpoints are auto-created if absent.
- `to_id` — The target node's primary-key value. Same MERGE behavior.
- `record` — Optional row object whose fields populate the relationship's properties. Must include the relationship's `primary_key` field if provided.

If `record` is omitted, the connector derives a deterministic edge id of the form `{from_label}_{from_id}_{to_label}_{to_id}`. Convenient when an edge has no properties of its own.
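
The derived id format can be reproduced with a one-line helper. `default_edge_id` is a hypothetical name used for illustration; only the format string comes from the description above.

```python
def default_edge_id(from_label: str, from_id: object, to_label: str, to_id: object) -> str:
    # Same format as documented: {from_label}_{from_id}_{to_label}_{to_id}
    return f"{from_label}_{from_id}_{to_label}_{to_id}"


edge_id = default_edge_id("Entity", "CocoIndex", "Entity", "Neo4j")
```

Since the id depends only on the endpoints, redeclaring the same pair yields the same edge id on every run.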

#### Vector indexes (attachment)

Declares a vector index on a column of a node table. Vector indexes are an [attachment](../advanced_topics/custom_target_connector#implementing-attachment-providers) to a `TableTarget`:

```python
def TableTarget.declare_vector_index(
    self,
    *,
    name: str | None = None,
    field: str,
    metric: Literal["cosine", "euclidean"] = "cosine",
    dimension: int,
) -> None
```

**Parameters:**

- `name` — Optional logical name for the index. Defaults to `f"vec_{table_name}__{field}"`.
- `field` — The node property holding the vector.
- `metric` — Similarity metric: `"cosine"` or `"euclidean"`. Translated to Neo4j's `vector.similarity_function` option.
- `dimension` — The vector's dimension. Required.

The connector emits:

```cypher
CREATE VECTOR INDEX `coco_vec_<Label>__<field>` IF NOT EXISTS
FOR (n:`Label`) ON n.`field`
OPTIONS { indexConfig: {
  `vector.dimensions`: <N>,
  `vector.similarity_function`: '<metric>'
} }
```

Vectors are float32 only.
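
For reference, the template above can be filled in with a small helper. This sketch covers only the default-name case exactly as printed in the template; `render_vector_index_ddl` is a hypothetical function, and how a custom `name` maps to the physical index name is not covered here.

```python
def render_vector_index_ddl(
    label: str, field: str, dimension: int, metric: str = "cosine"
) -> str:
    """Hypothetical helper that fills in the DDL template shown above."""
    if metric not in ("cosine", "euclidean"):
        raise ValueError(f"unsupported metric: {metric!r}")
    if dimension <= 0:
        raise ValueError("dimension must be positive")
    return (
        f"CREATE VECTOR INDEX `coco_vec_{label}__{field}` IF NOT EXISTS\n"
        f"FOR (n:`{label}`) ON n.`{field}`\n"
        "OPTIONS { indexConfig: {\n"
        f"  `vector.dimensions`: {dimension},\n"
        f"  `vector.similarity_function`: '{metric}'\n"
        "} }"
    )


ddl = render_vector_index_ddl("Document", "embedding", 384)
```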

### Table schema: from Python class

Build a `TableSchema` by introspecting a record type:

```python
@classmethod
async def TableSchema.from_class(
    cls,
    record_type: type[RowT],
    *,
    primary_key: str = "id",
    column_overrides: dict[str, Neo4jType | VectorSchemaProvider] | None = None,
) -> TableSchema[RowT]
```

**Parameters:**

- `record_type` — A dataclass, NamedTuple, or Pydantic model.
- `primary_key` — Field name to use as the table's primary key. Defaults to `"id"`.
- `column_overrides` — Optional dict mapping field names to `Neo4jType` or `VectorSchemaProvider` to override the default Python-to-Neo4j type mapping.

**Returns:** A `TableSchema[RowT]` populated from the class's fields.

#### Default Python → Neo4j type mapping

Most types pass through native Bolt encoding — no per-value transform applied:

| Python type | Neo4j type | Notes |
|---|---|---|
| `bool` | `BOOLEAN` | |
| `int`, NumPy integer scalars | `INTEGER` | |
| `float`, NumPy float scalars | `FLOAT` | |
| `decimal.Decimal` | `STRING` | Encoded via `str()` — Neo4j has no decimal type. |
| `str` | `STRING` | |
| `bytes` | `BYTES` | Native Bolt type — no encoder. |
| `uuid.UUID` | `STRING` | Encoded via `str()`. |
| `datetime.date` | `DATE` | Native Bolt type. |
| `datetime.datetime` | `ZONED_DATETIME` | Native Bolt type. |
| `datetime.time` | `LOCAL_TIME` | Native Bolt type. |
| `datetime.timedelta` | `DURATION` | Native Bolt type. |
| `numpy.ndarray` (with `VectorSchema` annotation) | `LIST<FLOAT>` | Encoded via `tolist()`; paired with vector-index DDL. |
| `dict`, list, nested record, `Any` | `MAP` / `LIST<ANY>` | Passed through native parameter binding. |
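
The two `str()` rows in the table are the only stdlib types that need a transform. A minimal sketch of that rule, using a hypothetical `encode_default` function and standard-library types only (NumPy handling is left out):

```python
import datetime
import decimal
import uuid
from typing import Any


def encode_default(value: Any) -> Any:
    """Hypothetical sketch of the default per-value encoding."""
    # Decimal and UUID are listed as STRING in the table, encoded via str().
    if isinstance(value, (decimal.Decimal, uuid.UUID)):
        return str(value)
    # Everything else is passed through to native Bolt encoding unchanged.
    return value


row = {
    "price": decimal.Decimal("19.99"),
    "doc_id": uuid.UUID("12345678-1234-5678-1234-567812345678"),
    "published": datetime.date(2024, 1, 31),
    "views": 42,
}
encoded = {name: encode_default(value) for name, value in row.items()}
```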

#### Neo4jType

Override the default mapping for a single column with `Neo4jType`:

```python
class Neo4jType(NamedTuple):
    neo4j_type: str
    encoder: ValueEncoder | None = None
```

Use with `typing.Annotated`:

```python
from typing import Annotated
from dataclasses import dataclass
from cocoindex.connectors.neo4j import Neo4jType

@dataclass
class Row:
    id: str
    score: Annotated[float, Neo4jType("STRING", encoder=str)]
```

The `neo4j_type` string is metadata-only — it participates in the schema fingerprint (so two flows declaring the same table must agree) but no per-property type DDL is emitted from it.

#### VectorSchemaProvider

For NumPy `ndarray` columns, attach a `VectorSchema` annotation to specify dtype + dimension. See [VectorSchema](../common_resources/vector_schema) for details.

### Table schema: explicit column definitions

Build a `TableSchema` directly from a dict of column definitions when the row type is dynamic:

```python
from cocoindex.connectors.neo4j import TableSchema, ColumnDef

schema = TableSchema(
    columns={
        "filename": ColumnDef(type="STRING"),
        "title": ColumnDef(type="STRING"),
        "summary": ColumnDef(type="STRING", nullable=True),
    },
    primary_key="filename",
)
```

`ColumnDef` fields:

- `type` — The Neo4j type string (metadata only; see table above).
- `nullable` — Whether the column may be `None`. Defaults to `True`.
- `encoder` — Optional `Callable[[Any], Any]` applied to non-`None` values before they're sent to Neo4j.
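
The "applied to non-`None` values" rule for `encoder` can be pictured as follows. `apply_encoders` is a hypothetical helper, not a connector function.

```python
from typing import Any, Callable, Optional

Encoder = Optional[Callable[[Any], Any]]


def apply_encoders(row: dict[str, Any], encoders: dict[str, Encoder]) -> dict[str, Any]:
    """Hypothetical sketch: run each column's encoder, skipping None values."""
    out: dict[str, Any] = {}
    for name, value in row.items():
        encoder = encoders.get(name)
        if encoder is not None and value is not None:
            out[name] = encoder(value)
        else:
            # None bypasses the encoder; columns without one pass through.
            out[name] = value
    return out


encoded = apply_encoders(
    {"score": 0.5, "summary": None, "title": "Overview"},
    {"score": str, "summary": str.upper},
)
```

Skipping `None` means an encoder such as `str.upper` never has to guard against missing values itself.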

### DDL: indexes and constraints

For each managed table, the connector creates supporting Cypher artifacts on first run:

- For node tables: a uniqueness constraint on the primary key —
  ```cypher
  CREATE CONSTRAINT `coco_uniq_<Label>__<pk>` IF NOT EXISTS
  FOR (n:`<Label>`) REQUIRE n.`<pk>` IS UNIQUE
  ```
  Neo4j auto-creates a backing index for each constraint, so a separate `CREATE INDEX` is redundant on nodes.
- For relation tables:
  ```cypher
  CREATE INDEX `coco_idx_rel_<RelType>__<pk>` IF NOT EXISTS
  FOR ()-[r:`<RelType>`]-() ON (r.`<pk>`)
  ```

Indexes and constraints are dropped on `cocoindex drop` or when the table is no longer declared.

When `managed_by="user"` is set, the connector skips DDL entirely — you're responsible for creating and dropping the schema. Record-level upserts and deletes still work.

### Example: Node tables

```python
from collections.abc import AsyncIterator
from dataclasses import dataclass
import cocoindex as coco
from cocoindex.connectors import neo4j

KG_DB: coco.ContextKey[neo4j.ConnectionFactory] = coco.ContextKey("kg_db")


@dataclass
class Document:
    filename: str
    title: str
    summary: str


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    builder.provide(KG_DB, neo4j.ConnectionFactory(
        uri="bolt://localhost:7687",
        auth=("neo4j", "cocoindex"),
        database="neo4j",
    ))
    yield


@coco.fn
async def app_main() -> None:
    schema = await neo4j.TableSchema.from_class(Document, primary_key="filename")
    documents = await neo4j.mount_table_target(
        KG_DB, "Document", schema, primary_key="filename",
    )
    documents.declare_record(
        row=Document(
            filename="overview.md",
            title="Overview",
            summary="An overview of CocoIndex...",
        )
    )


app = coco.App(coco.AppConfig(name="docs_to_neo4j"), app_main)
```

### Example: Relation tables (knowledge graph)

```python
# Continues the node-table example above: reuses its imports, KG_DB, and Document.
@dataclass
class Entity:
    value: str


@dataclass
class RelationshipRow:
    id: str
    predicate: str


@coco.fn
async def kg_app_main() -> None:
    documents = await neo4j.mount_table_target(
        KG_DB, "Document",
        await neo4j.TableSchema.from_class(Document, primary_key="filename"),
        primary_key="filename",
    )
    entities = await neo4j.mount_table_target(
        KG_DB, "Entity",
        await neo4j.TableSchema.from_class(Entity, primary_key="value"),
        primary_key="value",
    )
    relationships = await neo4j.mount_relation_target(
        KG_DB, "RELATIONSHIP",
        entities, entities,
        await neo4j.TableSchema.from_class(RelationshipRow, primary_key="id"),
        primary_key="id",
    )

    # populate ...
    documents.declare_record(row=Document(filename="overview.md", title="Overview", summary="..."))
    entities.declare_record(row=Entity(value="CocoIndex"))
    entities.declare_record(row=Entity(value="Neo4j"))
    relationships.declare_relation(
        from_id="CocoIndex",
        to_id="Neo4j",
        record=RelationshipRow(id="rel-1", predicate="writes_to"),
    )


kg_app = coco.App(coco.AppConfig(name="kg_app"), kg_app_main)
```

The `Entity` table is declared up-front (via `mount_table_target`) so its uniqueness constraint is reconciled before any `RELATIONSHIP` edge MERGEs entity endpoints. The relationship's three-MERGE pattern (source endpoint → target endpoint → edge) means missing endpoints are auto-created — but it's good practice to declare them explicitly so deletion-cascade behavior stays predictable.

---

# OCI Object Storage connector

Source: https://cocoindex.io/docs/connectors/oci_object_storage/

The `oci_object_storage` connector provides utilities for reading objects from Oracle Cloud Infrastructure (OCI) Object Storage buckets, with an optional [live mode](../programming_guide/live_mode) driven by OCI Object Storage events delivered through OCI Streaming.

```python
from cocoindex.connectors import oci_object_storage
```

**Note — Installation**
This connector requires the `oci` SDK. Install with:

```bash
pip install cocoindex[oci_object_storage]
```

For live mode, you also need the Kafka connector to consume OCI Streaming:

```bash
pip install cocoindex[kafka]
```

## As source

The connector provides three ways to read from OCI Object Storage:

- `list_objects()` — List and iterate over objects in a bucket (with optional prefix and filtering)
- `get_object()` — Fetch a single object by its name
- `read()` — Read object content directly without first fetching metadata

All require an `oci.object_storage.ObjectStorageClient`, which you create and manage yourself. The OCI SDK is synchronous; this connector wraps SDK calls with `asyncio.to_thread`, so the public API remains async.

```python
import oci
from oci.object_storage import ObjectStorageClient

config = oci.config.from_file()  # or instance principals, etc.
client = ObjectStorageClient(config)
```

### list_objects

List objects in an OCI Object Storage bucket. Returns an `OCIWalker` that supports async iteration.

```python
def list_objects(
    client: ObjectStorageClient,
    namespace: str,
    bucket_name: str,
    *,
    prefix: str = "",
    path_matcher: FilePathMatcher | None = None,
    max_file_size: int | None = None,
    live_stream: LiveStream[bytes] | None = None,
) -> OCIWalker
```

**Parameters:**

- `client` — An `oci.object_storage.ObjectStorageClient`.
- `namespace` — The OCI Object Storage namespace.
- `bucket_name` — The bucket name.
- `prefix` — Only list objects whose name starts with this prefix. The prefix is stripped from relative paths in the returned files.
- `path_matcher` — Optional filter for files. Patterns are matched against the relative path (after prefix stripping). See [PatternFilePathMatcher](../common_resources/data_types#patternfilepathmatcher).
- `max_file_size` — Skip objects larger than this size in bytes.
- `live_stream` — Optional `LiveStream[bytes]` of OCI Object Storage event payloads. When provided, `OCIWalker.items()` returns a [`LiveMapView`](../advanced_topics/live_component#livemapfeed-and-livemapview) that performs an initial scan and continues watching for changes via the supplied stream. See [Live mode](#live-bucket-watching).

**Returns:** An `OCIWalker` that can be used with `async for` loops.

### Iterating files

`list_objects()` returns an `OCIWalker` that yields `OCIFile` objects (implementing the [`FileLike`](../common_resources/data_types#filelike) base class):

```python
import oci
from oci.object_storage import ObjectStorageClient
from cocoindex.connectors import oci_object_storage

config = oci.config.from_file()
client = ObjectStorageClient(config)

async for file in oci_object_storage.list_objects(client, "my-namespace", "my-bucket", prefix="data/"):
    text = await file.read_text()
    ...
```

See [`FileLike`](../common_resources/data_types#filelike) for details on the file objects.

### Keyed iteration with `items()`

`OCIWalker.items()` yields `(str, OCIFile)` pairs, useful for associating each file with a stable string key (its relative path):

```python
async for key, file in oci_object_storage.list_objects(client, "ns", "my-bucket").items():
    content = await file.read()
```

### Filtering files

Use `PatternFilePathMatcher` to filter which objects are included. Patterns are matched against the relative path (after prefix stripping):

```python
from cocoindex.connectors import oci_object_storage
from cocoindex.resources.file import PatternFilePathMatcher

matcher = PatternFilePathMatcher(included_patterns=["**/*.json"])

async for file in oci_object_storage.list_objects(
    client, "ns", "my-bucket", prefix="data/", path_matcher=matcher,
):
    process(file)
```

### Limiting file size

Use `max_file_size` to skip objects that exceed a size threshold:

```python
# Skip objects larger than 10 MB
async for file in oci_object_storage.list_objects(
    client, "ns", "my-bucket", max_file_size=10 * 1024 * 1024,
):
    process(file)
```

### get_object

Fetch a single object from an OCI bucket by its full object name.

```python
async def get_object(
    client: ObjectStorageClient,
    namespace: str,
    bucket_name: str,
    object_name: str,
) -> OCIFile
```

**Parameters:**

- `client` — An `oci.object_storage.ObjectStorageClient`.
- `namespace` — The OCI namespace.
- `bucket_name` — The bucket name.
- `object_name` — The full object name.

**Returns:** An `OCIFile` (FileLike) for the specified object, with its metadata pre-populated.

**Example:**

```python
f = await oci_object_storage.get_object(
    client, "my-namespace", "my-bucket", "data/config.json",
)
data = await f.read()
```

### read

Read object content directly without first fetching metadata.

```python
async def read(
    client: ObjectStorageClient,
    namespace: str,
    bucket_name: str,
    object_name: str,
    size: int = -1,
) -> bytes
```

**Parameters:**

- `client` — An `oci.object_storage.ObjectStorageClient`.
- `namespace` — The OCI namespace.
- `bucket_name` — The bucket name.
- `object_name` — The full object name.
- `size` — Number of bytes to read. If -1 (default), read the entire object.

**Returns:** The object content as bytes.

**Example:**

```python
data = await oci_object_storage.read(client, "my-namespace", "my-bucket", "data/config.json")
```

### OCIFilePath

Each file returned by the connector has an `OCIFilePath` — a [`FilePath`](../common_resources/data_types#filepath) specialized for OCI Object Storage:

- **Relative path** (`file.file_path.path`) — The object name relative to the walker prefix (or the full name if no prefix was used).
- **Resolved path** (`file.file_path.resolve()`) — The full OCI object name (the value passed to `head_object` / `get_object`).
- **Namespace and bucket** — `file.file_path.namespace`, `file.file_path.bucket_name`.

For example, with `prefix="data/"` and an object named `"data/docs/readme.md"`:
- `file.file_path.path` → `PurePath("docs/readme.md")`
- `file.file_path.resolve()` → `"data/docs/readme.md"`
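
The relationship between the two paths can be sketched with plain string and path operations. Both helpers below are hypothetical; `PurePosixPath` is used so the result is the same on every platform.

```python
from pathlib import PurePosixPath


def relative_object_path(object_name: str, prefix: str = "") -> PurePosixPath:
    """Hypothetical helper: strip the walker prefix from a full object name."""
    if not object_name.startswith(prefix):
        raise ValueError(f"{object_name!r} is outside prefix {prefix!r}")
    return PurePosixPath(object_name[len(prefix):])


def resolve(relative: PurePosixPath, prefix: str = "") -> str:
    """Hypothetical helper: rebuild the full object name from the relative path."""
    return prefix + relative.as_posix()


rel = relative_object_path("data/docs/readme.md", prefix="data/")
full = resolve(rel, prefix="data/")
```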

### OCIFile.exists()

`OCIFile` provides an async `exists()` method that probes whether the object currently exists in OCI:

```python
if await oci_file.exists():
    data = await oci_file.read()
else:
    print("Object no longer exists")
```

Call `exists()` first and branch on the result; on a `True` verdict, subsequent `size()` / `read()` calls reuse the cached metadata without re-probing.
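
The caching behavior described above follows a common probe-once pattern. The sketch below models it against an in-memory stand-in; `FakeStore` and `CachedFile` are hypothetical classes and do not reflect how `OCIFile` is actually implemented.

```python
import asyncio
from typing import Optional


class FakeStore:
    """In-memory stand-in for the object store; counts metadata probes."""

    def __init__(self, objects: dict[str, bytes]) -> None:
        self.objects = objects
        self.probes = 0

    async def head(self, name: str) -> Optional[int]:
        self.probes += 1
        data = self.objects.get(name)
        return None if data is None else len(data)


class CachedFile:
    def __init__(self, store: FakeStore, name: str) -> None:
        self._store = store
        self._name = name
        self._size: Optional[int] = None

    async def exists(self) -> bool:
        # Probe the store and remember the metadata on success.
        self._size = await self._store.head(self._name)
        return self._size is not None

    async def size(self) -> int:
        if self._size is None:  # no cached metadata yet: probe once
            self._size = await self._store.head(self._name)
            if self._size is None:
                raise FileNotFoundError(self._name)
        return self._size


async def demo() -> tuple[bool, int, int]:
    store = FakeStore({"data/config.json": b"{}"})
    f = CachedFile(store, "data/config.json")
    found = await f.exists()
    return found, await f.size(), store.probes


result = asyncio.run(demo())
```

In this model, `size()` after a successful `exists()` costs no extra round trip, which matches the "reuse the cached metadata" behavior the text describes.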

### Live bucket watching

When `live_stream` is provided, `items()` returns a [`LiveMapView`](../advanced_topics/live_component#livemapfeed-and-livemapview) instead of a plain `AsyncIterable`. Combined with [`mount_each()`](../programming_guide/processing_component#mount_each), this enables automatic incremental bucket watching — newly created, updated, and deleted objects are processed without a full rescan.

The typical setup uses **OCI Streaming**. Configure an [OCI Events Rule](https://docs.oracle.com/en-us/iaas/Content/Events/Concepts/eventsoverview.htm) that forwards Object Storage events (`createobject`, `updateobject`, `deleteobject`) to an OCI Streaming stream. Then consume that stream as a `LiveStream[bytes]` via the Kafka connector and pass it as `live_stream`. See [`topic_as_stream()` and `payloads()`](./kafka#as-a-live-stream) for the consumer and stream setup:

```python
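from confluent_kafka.aio import AIOConsumer

import cocoindex as coco
from cocoindex.connectors import kafka, oci_object_storage
from cocoindex.resources.file import PatternFilePathMatcher

# Assumes `client`, `process_file`, and `target` are already set up.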

consumer = AIOConsumer({
    "bootstrap.servers": "<oci-streaming-bootstrap-host>:9092",
    "group.id": "my-group",
    "enable.auto.commit": "false",
})
topic_stream = kafka.topic_as_stream(consumer, ["object-storage-events"])

walker = oci_object_storage.list_objects(
    client, "my-namespace", "my-bucket",
    path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]),
    live_stream=topic_stream.payloads(),
)
await coco.mount_each(process_file, walker.items(), target)
```

See [Live Mode](../programming_guide/live_mode) for how live mode works and how to enable it on the app.

### Example

```python
import pathlib

from confluent_kafka.aio import AIOConsumer
import oci
from oci.object_storage import ObjectStorageClient

import cocoindex as coco
from cocoindex.connectors import kafka, oci_object_storage, localfs
from cocoindex.resources.file import FileLike, PatternFilePathMatcher


@coco.fn(memo=True)
async def process_file(file: FileLike[str], target: localfs.DirTarget) -> None:
    text = await file.read_text()
    target.declare_file(filename=file.file_path.path.as_posix(), content=text)


@coco.fn
async def app_main(
    namespace: str, bucket: str, outdir: pathlib.Path,
) -> None:
    target = await localfs.mount_dir_target(outdir)

    config = oci.config.from_file()
    client = ObjectStorageClient(config)

    consumer = AIOConsumer({
        "bootstrap.servers": "<oci-streaming-host>:9092",
        "group.id": "my-group",
        "enable.auto.commit": "false",
    })
    topic_stream = kafka.topic_as_stream(consumer, ["object-storage-events"])

    matcher = PatternFilePathMatcher(included_patterns=["**/*.md"])
    walker = oci_object_storage.list_objects(
        client, namespace, bucket,
        prefix="docs/",
        path_matcher=matcher,
        live_stream=topic_stream.payloads(),
    )

    await coco.mount_each(process_file, walker.items(), target)


app = coco.App(
    coco.AppConfig(name="OCIToFiles"),
    app_main,
    namespace="my-namespace",
    bucket="my-bucket",
    outdir=pathlib.Path("./out"),
)
app.update_blocking(live=True)
```

---

# Postgres connector

Source: https://cocoindex.io/docs/connectors/postgres/

The `postgres` connector provides utilities for reading rows from and writing rows to PostgreSQL databases, with built-in support for pgvector.

```python
from cocoindex.connectors import postgres
```

**Note — Dependencies**
This connector requires additional dependencies. Install with:

```bash
pip install cocoindex[postgres]
```

## Connection setup

Create an [`asyncpg`](https://magicstack.github.io/asyncpg/current/) connection pool directly:

```python
import asyncpg

pool = await asyncpg.create_pool("postgresql://user:pass@localhost/dbname")
```

The connector handles pgvector extension setup automatically when a table uses vector columns — no special pool initialization is needed.

## As source

Use `PgTableSource` to read rows from a PostgreSQL table. Its `fetch_rows()` method returns a `RowFetcher` that supports both synchronous and asynchronous iteration.

### PgTableSource

```python
class PgTableSource(Generic[RowT]):
    def __init__(
        self,
        pool: asyncpg.Pool,
        *,
        table_name: str,
        columns: Sequence[str] | None = None,
        pg_schema_name: str | None = None,
        row_type: type[RowT] | None = None,
        row_factory: Callable[[dict[str, Any]], RowT] | None = None,
    ) -> None

    def fetch_rows(self) -> RowFetcher[RowT]
```

**Parameters:**

- `pool` — An asyncpg connection pool.
- `table_name` — Name of the table to read from.
- `columns` — List of column names to select. If omitted with `row_type`, uses the record's field names. If omitted without `row_type`, uses `SELECT *`.
- `pg_schema_name` — Optional PostgreSQL schema name (defaults to `"public"`).
- `row_type` — Optional record type (dataclass, NamedTuple, or Pydantic model) for automatic row conversion. When provided, `columns` (if specified) must be a subset of the record's fields.
- `row_factory` — Optional callable to transform each row dict. Mutually exclusive with `row_type`.
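
The interaction between `columns` and `row_type` can be summarized in a short sketch. `resolve_columns` and `build_select` are hypothetical helpers that mirror the rules listed above for a dataclass `row_type`; the exact SQL the connector generates is an assumption.

```python
import dataclasses
from typing import Optional, Sequence


@dataclasses.dataclass
class Product:
    id: int
    name: str
    price: float


def resolve_columns(
    columns: Optional[Sequence[str]], row_type: Optional[type]
) -> Optional[list[str]]:
    """Return the columns to select, or None to mean SELECT *."""
    if row_type is None:
        return list(columns) if columns is not None else None
    field_names = [f.name for f in dataclasses.fields(row_type)]
    if columns is None:
        return field_names  # inferred from the record's fields
    unknown = [c for c in columns if c not in field_names]
    if unknown:
        raise ValueError(f"columns not in {row_type.__name__}: {unknown}")
    return list(columns)


def build_select(
    table_name: str, columns: Optional[list[str]], pg_schema_name: str = "public"
) -> str:
    # Assumption: identifiers are double-quoted; the real query may differ.
    col_sql = "*" if columns is None else ", ".join(f'"{c}"' for c in columns)
    return f'SELECT {col_sql} FROM "{pg_schema_name}"."{table_name}"'


sql_inferred = build_select("products", resolve_columns(None, Product))
sql_star = build_select("products", resolve_columns(None, None))
```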

### Row mapping

By default, rows are returned as `dict[str, Any]`, with PostgreSQL types converted to Python types using [asyncpg's type conversion](https://magicstack.github.io/asyncpg/current/usage.html#type-conversion). You can configure automatic conversion to custom types using `row_type` or `row_factory`.

#### Using `row_type`

Pass a record type (dataclass, NamedTuple, or Pydantic model) to automatically convert rows. When `columns` is omitted, the record's field names are used:

```python
from dataclasses import dataclass

@dataclass
class Product:
    id: int
    name: str
    price: float

source = postgres.PgTableSource(
    pool,
    table_name="products",
    row_type=Product,  # columns inferred as ["id", "name", "price"]
)
```

#### Using `row_factory`

For custom transformations, pass a callable:

```python
source = postgres.PgTableSource(
    pool,
    table_name="products",
    columns=["id", "name", "price"],
    row_factory=lambda row: (row["name"], row["price"] * 1.1),  # Add 10% markup
)
```
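
The three row-mapping modes (plain dict, `row_type`, `row_factory`) can be modeled as one converter choice. `make_converter` is a hypothetical helper that sketches the behavior described above, including the mutual-exclusion rule.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Product:
    id: int
    name: str
    price: float


def make_converter(
    row_type: Optional[type] = None,
    row_factory: Optional[Callable[[dict[str, Any]], Any]] = None,
) -> Callable[[dict[str, Any]], Any]:
    """Hypothetical sketch: pick how each row dict is converted."""
    if row_type is not None and row_factory is not None:
        raise ValueError("row_type and row_factory are mutually exclusive")
    if row_type is not None:
        return lambda row: row_type(**row)
    if row_factory is not None:
        return row_factory
    return lambda row: row  # default: rows stay plain dicts


raw = {"id": 1, "name": "Widget", "price": 10.0}
as_product = make_converter(row_type=Product)(raw)
as_tuple = make_converter(row_factory=lambda r: (r["name"], r["price"] * 2))(raw)
as_dict = make_converter()(raw)
```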

### Iterating rows

`fetch_rows()` returns a `RowFetcher` that supports both sync and async iteration:

```python
# Synchronous iteration
for row in source.fetch_rows():
    print(row.name, row.price)

# Asynchronous iteration (streams rows using a cursor)
async for row in source.fetch_rows():
    print(row.name, row.price)
```

### Example

```python
from dataclasses import dataclass

import asyncpg
import cocoindex as coco
from cocoindex.connectors import postgres

@dataclass
class SourceProduct:
    product_id: str
    name: str
    description: str

@coco.fn
async def app_main(pool: asyncpg.Pool) -> None:
    source = postgres.PgTableSource(
        pool,
        table_name="products",
        row_type=SourceProduct,
    )

    async for product in source.fetch_rows():
        await coco.mount(
            coco.component_subpath("product", product.product_id),
            process_product,
            product,
        )
```

## As target

The `postgres` connector provides target state APIs for writing rows to tables. With it, CocoIndex tracks what rows should exist and automatically handles upserts and deletions.

### Declaring target states

#### Setting up a connection

Create a `ContextKey[asyncpg.Pool]` to identify your connection pool, then provide the pool directly in your lifespan:

**Note**
The key name is load-bearing across runs — it's the stable identity CocoIndex uses to track managed rows. See [ContextKey as stable identity](../programming_guide/context#contextkey-as-stable-identity) before renaming.

```python
from collections.abc import AsyncIterator

import asyncpg
import cocoindex as coco

PG_DB = coco.ContextKey[asyncpg.Pool]("my_db")

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    async with await asyncpg.create_pool(DATABASE_URL) as pool:
        builder.provide(PG_DB, pool)
        yield
```

#### Tables (parent state)

Declares a table as a target state. Returns a `TableTarget` for declaring rows.

```python
def declare_table_target(
    db: ContextKey[asyncpg.Pool],
    table_name: str,
    table_schema: TableSchema[RowT],
    *,
    pg_schema_name: str | None = None,
    managed_by: Literal["system", "user"] = "system",
) -> TableTarget[RowT, coco.PendingS]
```

**Parameters:**

- `db` — A `ContextKey[asyncpg.Pool]` identifying the connection pool to use.
- `table_name` — Name of the table.
- `table_schema` — Schema definition including columns and primary key (see [Table Schema](#table-schema-from-python-class)).
- `pg_schema_name` — Optional PostgreSQL schema name (defaults to `"public"`).
- `managed_by` — Whether CocoIndex manages the table lifecycle (`"system"`) or assumes it exists (`"user"`).

**Returns:** A pending `TableTarget`. Use the convenience wrapper `await postgres.mount_table_target(PG_DB, table_name, table_schema)` to resolve.

#### Rows (child states)

Once a `TableTarget` is resolved, declare rows to be upserted:

```python
def TableTarget.declare_row(
    self,
    *,
    row: RowT,
) -> None
```

**Parameters:**

- `row` — A row object (dict, dataclass, NamedTuple, or Pydantic model). Must include all primary key columns.
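To make "tracks what rows should exist" concrete, the sketch below models the idea in plain Python: compare the rows declared in the previous run with the rows declared now, keyed by primary key, and derive the upserts and deletions. The names `key_of` and `plan_sync` are hypothetical, and this is an illustration of the concept rather than CocoIndex's actual reconciliation code.

```python
from typing import Any, Hashable

Row = dict[str, Any]


def key_of(row: Row, primary_key: list[str]) -> tuple[Hashable, ...]:
    """Build the primary-key tuple that identifies a row."""
    return tuple(row[col] for col in primary_key)


def plan_sync(
    previous: list[Row], declared: list[Row], primary_key: list[str]
) -> tuple[list[Row], list[tuple[Hashable, ...]]]:
    """Return (rows to upsert, primary keys to delete)."""
    prev = {key_of(r, primary_key): r for r in previous}
    want = {key_of(r, primary_key): r for r in declared}
    # Upsert rows that are new or whose content changed.
    upserts = [row for key, row in want.items() if prev.get(key) != row]
    # Delete rows that were declared before but are no longer declared.
    deletes = [key for key in prev if key not in want]
    return upserts, deletes


previous = [
    {"category": "a", "name": "x", "price": 1.0},
    {"category": "a", "name": "y", "price": 2.0},
]
declared = [
    {"category": "a", "name": "x", "price": 1.5},  # changed
    {"category": "b", "name": "z", "price": 3.0},  # new
]
upserts, deletes = plan_sync(previous, declared, ["category", "name"])
```

Unchanged rows produce no work at all, which is why every declared row must carry its full primary key.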

#### Vector indexes (attachment)

Declare a pgvector index on a vector column of the table. CocoIndex tracks the index spec and automatically creates, recreates, or drops the index as needed.

```python
def TableTarget.declare_vector_index(
    self,
    *,
    name: str | None = None,
    column: str,
    metric: Literal["cosine", "l2", "ip"] = "cosine",
    method: Literal["ivfflat", "hnsw"] = "ivfflat",
    lists: int | None = None,
    m: int | None = None,
    ef_construction: int | None = None,
) -> None
```

The actual PostgreSQL index is named `{table_name}__vector__{name}`.

**Parameters:**

- `name` — Logical index name (defaults to `column`).
- `column` — Column to index (must be a vector column).
- `metric` — Distance metric: `"cosine"`, `"l2"`, or `"ip"` (inner product).
- `method` — Index method: `"ivfflat"` or `"hnsw"`.
- `lists` — Number of lists (ivfflat only).
- `m` — Maximum number of connections per layer (hnsw only).
- `ef_construction` — Size of the dynamic candidate list for construction (hnsw only).

**Example:**

```python
# Creates a PostgreSQL index named "products__vector__embedding"
table.declare_vector_index(
    column="embedding",
    metric="cosine",
    method="hnsw",
    m=16,
    ef_construction=64,
)
```
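As a rough guide to what such a declaration corresponds to on the database side, the sketch below builds a pgvector `CREATE INDEX` statement from the same parameters. The helper `vector_index_sql` is hypothetical and the exact statement CocoIndex emits is an assumption; the operator class names and `WITH` options follow pgvector's documented syntax for `vector` columns (`halfvec` columns use different operator classes).

```python
from typing import Literal, Optional

# pgvector operator classes for `vector` columns, keyed by metric.
_OPCLASS = {
    "cosine": "vector_cosine_ops",
    "l2": "vector_l2_ops",
    "ip": "vector_ip_ops",
}


def vector_index_sql(
    table_name: str,
    *,
    column: str,
    name: Optional[str] = None,
    metric: Literal["cosine", "l2", "ip"] = "cosine",
    method: Literal["ivfflat", "hnsw"] = "ivfflat",
    lists: Optional[int] = None,
    m: Optional[int] = None,
    ef_construction: Optional[int] = None,
) -> str:
    """Hypothetical helper: render the DDL for one vector index."""
    # The logical name defaults to the column name.
    index_name = f"{table_name}__vector__{name or column}"
    if method == "ivfflat":
        options = {"lists": lists}
    else:
        options = {"m": m, "ef_construction": ef_construction}
    # Only options that were explicitly set are rendered.
    with_items = [f"{k} = {v}" for k, v in options.items() if v is not None]
    with_clause = f" WITH ({', '.join(with_items)})" if with_items else ""
    return (
        f'CREATE INDEX "{index_name}" ON "{table_name}" '
        f'USING {method} ("{column}" {_OPCLASS[metric]}){with_clause}'
    )


hnsw_sql = vector_index_sql(
    "products", column="embedding", metric="cosine", method="hnsw", m=16, ef_construction=64
)
ivfflat_sql = vector_index_sql("docs", column="text_vec", name="by_text", metric="l2")
```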

#### SQL command attachments

Declare an arbitrary SQL command that CocoIndex manages alongside the table. The setup SQL runs when the attachment is created or changed; the optional teardown SQL runs when the attachment is removed or before re-running setup on change.

```python
def TableTarget.declare_sql_command_attachment(
    self,
    *,
    name: str,
    setup_sql: str,
    teardown_sql: str | None = None,
) -> None
```

**Parameters:**

- `name` — Stable identifier for the attachment.
- `setup_sql` — SQL to execute on creation or change.
- `teardown_sql` — SQL to execute on removal or before re-running setup (optional). If omitted, no cleanup is performed when the attachment is removed.

**Example:**

```python
table.declare_sql_command_attachment(
    name="content_fts_idx",
    setup_sql='CREATE INDEX "content_fts" ON "products" USING gin (to_tsvector(\'english\', "description"))',
    teardown_sql='DROP INDEX IF EXISTS "content_fts"',
)
```
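The lifecycle rules for SQL command attachments can be modelled as a small planning function. This is a sketch of the behaviour described in this section, not the connector's implementation: `Attachment` and `plan_attachment` are hypothetical names, and running the *previous* teardown SQL before a changed setup is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Attachment:
    setup_sql: str
    teardown_sql: Optional[str] = None


def plan_attachment(
    previous: Optional[Attachment], desired: Optional[Attachment]
) -> list[str]:
    """Return the SQL statements to run, in order, to move from previous to desired."""
    if previous == desired:
        return []  # unchanged: nothing to do
    statements: list[str] = []
    # Removal or change: clean up the old attachment first, if it has teardown SQL.
    if previous is not None and previous.teardown_sql is not None:
        statements.append(previous.teardown_sql)
    # Creation or change: run the new setup SQL.
    if desired is not None:
        statements.append(desired.setup_sql)
    return statements


old = Attachment('CREATE INDEX "fts" ON "products" ("name")', 'DROP INDEX IF EXISTS "fts"')
new = Attachment('CREATE INDEX "fts" ON "products" ("description")', 'DROP INDEX IF EXISTS "fts"')

created = plan_attachment(None, old)
changed = plan_attachment(old, new)
removed = plan_attachment(old, None)
```

Note the last case in the parameter list: an attachment without `teardown_sql` leaves whatever its setup created in place when it is removed.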

### Table schema: from Python class

Define the table structure using a Python class (dataclass, NamedTuple, or Pydantic model):

```python
@classmethod
async def TableSchema.from_class(
    cls,
    record_type: type[RowT],
    primary_key: list[str],
    *,
    column_overrides: dict[str, PgType | VectorSchemaProvider] | None = None,
) -> TableSchema[RowT]
```

**Parameters:**

- `record_type` — A record type whose fields define table columns.
- `primary_key` — List of column names forming the primary key.
- `column_overrides` — Optional per-column overrides for type mapping or vector configuration.

**Example:**

```python
@dataclass
class OutputProduct:
    category: str
    name: str
    price: float
    embedding: Annotated[NDArray, embedder]

schema = await postgres.TableSchema.from_class(
    OutputProduct,
    primary_key=["category", "name"],
)
```

Python types are automatically mapped to PostgreSQL types:

| Python Type | PostgreSQL Type |
|-------------|-----------------|
| `bool` | `boolean` |
| `int` | `bigint` |
| `float` | `double precision` |
| `decimal.Decimal` | `numeric` |
| `str` | `text` |
| `bytes` | `bytea` |
| `uuid.UUID` | `uuid` |
| `datetime.date` | `date` |
| `datetime.time` | `time with time zone` |
| `datetime.datetime` | `timestamp with time zone` |
| `datetime.timedelta` | `interval` |
| `list`, `dict`, nested structs | `jsonb` |
| `NDArray` (with vector schema) | `vector(n)` or `halfvec(n)` |

**Note — U+0000 (NUL) in strings**
U+0000 (NUL) is a valid Unicode codepoint, but Postgres cannot store it — neither in `text`-family columns nor inside strings in `jsonb` (the `\u0000` escape is rejected at parse time). CocoIndex automatically strips U+0000 from strings before writing to Postgres, recursively for nested strings and dict keys in `jsonb` payloads. For example, `"Hello\0World"` is written as `"HelloWorld"`.
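A minimal sketch of the kind of recursive stripping described in the note, written in plain Python. `strip_nul` is a hypothetical helper for illustration; the connector's own implementation may differ in detail.

```python
from typing import Any


def strip_nul(value: Any) -> Any:
    """Remove U+0000 from strings, recursing into dicts (keys and values) and sequences."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    if isinstance(value, dict):
        return {strip_nul(k): strip_nul(v) for k, v in value.items()}
    if isinstance(value, list):
        return [strip_nul(v) for v in value]
    if isinstance(value, tuple):
        return tuple(strip_nul(v) for v in value)
    return value  # numbers, None, bytes, etc. pass through unchanged


cleaned_text = strip_nul("Hello\0World")
cleaned_payload = strip_nul({"ti\0tle": ["a\0b", {"note": "x\0"}], "count": 3})
```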

To override the default mapping, provide a `PgType` or `VectorSchemaProvider` via:

- **Type annotation** — using `typing.Annotated` on the field
- **`column_overrides`** — passing overrides when constructing `TableSchema`

#### PgType

Use `PgType` to specify a custom PostgreSQL type:

```python
import datetime
from dataclasses import dataclass
from typing import Annotated

from cocoindex.connectors.postgres import PgType

@dataclass
class MyRow:
    id: Annotated[int, PgType("integer")]           # instead of bigint
    value: Annotated[float, PgType("real")]         # instead of double precision
    created_at: Annotated[datetime.datetime, PgType("timestamp")]  # without timezone
```

Or via `column_overrides`:

```python
schema = await postgres.TableSchema.from_class(
    MyRow,
    primary_key=["id"],
    column_overrides={
        "created_at": postgres.PgType("timestamp"),
    },
)
```

#### VectorSchemaProvider

For `NDArray` fields, a [`VectorSchemaProvider`](../common_resources/vector_schema#vectorschemaprovider) annotation specifies the vector dimension and dtype. The connector has built-in pgvector support and automatically creates the extension when needed. See [Vector Schema](../common_resources/vector_schema#vectorschemaprovider) for the full list of annotation options (`ContextKey`, embedder instance, or explicit `VectorSchema`).

### Table schema: explicit column definitions

Define columns directly using `ColumnDef`:

```python
def TableSchema.__init__(
    self,
    columns: dict[str, ColumnDef],
    primary_key: list[str],
) -> None
```

**Example:**

```python
schema = postgres.TableSchema(
    {
        "category": postgres.ColumnDef(type="text", nullable=False),
        "name": postgres.ColumnDef(type="text", nullable=False),
        "price": postgres.ColumnDef(type="numeric"),
        "embedding": postgres.ColumnDef(type="vector(384)"),
    },
    primary_key=["category", "name"],
)
```

### Example

```python
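from dataclasses import dataclass
from typing import Annotated, AsyncIterator

import asyncpg
import cocoindex as coco
from cocoindex.connectors import postgres
from numpy.typing import NDArray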

DATABASE_URL = "postgresql://localhost/mydb"

PG_DB = coco.ContextKey[asyncpg.Pool]("main_db")

@dataclass
class OutputProduct:
    category: str
    name: str
    description: str
    embedding: Annotated[NDArray, embedder]

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    async with await asyncpg.create_pool(DATABASE_URL) as pool:
        builder.provide(PG_DB, pool)
        yield

@coco.fn
async def app_main() -> None:
    # Declare table target state
    table = await postgres.mount_table_target(
        PG_DB,
        "products",
        await postgres.TableSchema.from_class(
            OutputProduct,
            primary_key=["category", "name"],
        ),
    )

    # Declare rows
    for product in products:
        table.declare_row(row=product)

    # Declare a vector index on the embedding column
    table.declare_vector_index(
        column="embedding",
        metric="cosine",
        method="hnsw",
    )
```

---

# Qdrant connector

Source: https://cocoindex.io/docs/connectors/qdrant/

The `qdrant` connector provides utilities for writing points to Qdrant vector databases, with support for both single and named vectors, as well as multi-vector configurations.

```python
from cocoindex.connectors import qdrant
```

**Note — Dependencies**
This connector requires additional dependencies. Install with:

```bash
pip install cocoindex[qdrant]
```

## Connection setup

`create_client()` creates a Qdrant client connection with optional gRPC support.

```python
def create_client(
    url: str,
    *,
    prefer_grpc: bool = True,
    **kwargs: Any,
) -> QdrantClient
```

**Parameters:**

- `url` — Qdrant server URL (e.g., `"http://localhost:6333"`).
- `prefer_grpc` — Whether to prefer gRPC over HTTP (default: `True`).
- `**kwargs` — Additional arguments passed directly to `QdrantClient`.

**Returns:** A Qdrant client instance.

**Example:**

```python
client = qdrant.create_client("http://localhost:6333")
```

## As target

The `qdrant` connector provides target state APIs for writing points to collections. CocoIndex tracks what points should exist and automatically handles upserts and deletions.

### Declaring target states

#### Setting up a connection

Create a `ContextKey[QdrantClient]` to identify your Qdrant client, then provide it in your lifespan:

**Note**
The key name is load-bearing across runs — it's the stable identity CocoIndex uses to track managed collections. See [ContextKey as stable identity](../programming_guide/context#contextkey-as-stable-identity) before renaming.

```python
from typing import AsyncIterator

from qdrant_client import QdrantClient
import cocoindex as coco
from cocoindex.connectors import qdrant

QDRANT_DB = coco.ContextKey[QdrantClient]("my_vectors")

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    client = qdrant.create_client(QDRANT_URL)
    builder.provide(QDRANT_DB, client)
    yield
```

#### Collections (parent state)

Declares a collection as a target state. Returns a `CollectionTarget` for declaring points.

```python
def declare_collection_target(
    db: ContextKey[QdrantClient],
    collection_name: str,
    schema: CollectionSchema,
    *,
    managed_by: Literal["system", "user"] = "system",
) -> CollectionTarget[coco.PendingS]
```

**Parameters:**

- `db` — A `ContextKey[QdrantClient]` identifying the Qdrant client to use.
- `collection_name` — Name of the collection.
- `schema` — Schema definition specifying vector configurations (see [Collection Schema](#collection-schema)).
- `managed_by` — Whether CocoIndex manages the collection lifecycle (`"system"`) or assumes it exists (`"user"`).

**Returns:** A pending `CollectionTarget`. Use the convenience wrapper `await qdrant.mount_collection_target(QDRANT_DB, collection_name, schema)` to resolve.

#### Points (child states)

Once a `CollectionTarget` is resolved, declare points to be upserted using `qdrant.PointStruct`, which is an alias of `qdrant_client.http.models.PointStruct`:

```python
def CollectionTarget.declare_point(
    self,
    point: qdrant.PointStruct,
) -> None
```

**Parameters:**

- `point` — A `qdrant.PointStruct` (alias of `qdrant_client.http.models.PointStruct`) containing:
  - `id` — Point ID (str, int, or UUID)
  - `vector` — Vector data (single vector or dict of named vectors)
  - `payload` — Optional metadata as a JSON-serializable dict

### Collection schema

Define vector configurations for a collection using `CollectionSchema`. Unlike row-oriented databases, Qdrant uses a point-oriented model where each point has schemaless payload and one or more vectors with predefined dimensions.

```python
class CollectionSchema:
    @classmethod
    async def create(
        cls,
        vectors: QdrantVectorDef | dict[str, QdrantVectorDef],
    ) -> CollectionSchema
```

**Parameters:**

- `vectors` — Either:
  - A single `QdrantVectorDef` for an unnamed vector
  - A dict mapping vector names to `QdrantVectorDef` for named vectors

#### QdrantVectorDef

Specifies vector configuration including dimension, distance metric, and multi-vector settings:

```python
class QdrantVectorDef(NamedTuple):
    schema: VectorSchemaProvider | MultiVectorSchemaProvider
    distance: Literal["cosine", "dot", "euclid"] = "cosine"
    multivector_comparator: Literal["max_sim"] = "max_sim"
```

**Parameters:**

- `schema` — A `VectorSchemaProvider` or `MultiVectorSchemaProvider` that defines vector dimensions
- `distance` — Distance metric for similarity search (default: `"cosine"`)
- `multivector_comparator` — Comparator for multi-vector fields (only applies to `MultiVectorSchemaProvider`)

#### Single (unnamed) vector

For collections with a single unnamed vector:

```python
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder

embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")

schema = await qdrant.CollectionSchema.create(
    vectors=qdrant.QdrantVectorDef(schema=embedder)
)
```

Points use the vector directly:

```python
point = qdrant.PointStruct(
    id="doc-123",
    vector=embedding.tolist(),  # Single vector
    payload={"text": "...", "metadata": {...}},
)
```

#### Named vectors

For collections with multiple named vectors:

```python
from cocoindex.resources.schema import VectorSchema
import numpy as np

schema = await qdrant.CollectionSchema.create(
    vectors={
        "text_embedding": qdrant.QdrantVectorDef(
            schema=VectorSchema(dtype=np.float32, size=384),
            distance="cosine",
        ),
        "image_embedding": qdrant.QdrantVectorDef(
            schema=VectorSchema(dtype=np.float32, size=512),
            distance="dot",
        ),
    }
)
```

Points use a dict of vectors:

```python
point = qdrant.PointStruct(
    id="doc-123",
    vector={
        "text_embedding": text_vec.tolist(),
        "image_embedding": image_vec.tolist(),
    },
    payload={"text": "...", "metadata": {...}},
)
```

#### VectorSchemaProvider

The `schema` field of `QdrantVectorDef` accepts a [`VectorSchemaProvider`](../common_resources/vector_schema#vectorschemaprovider), a `ContextKey`, or an explicit `VectorSchema` to specify the vector dimension and dtype. See [Vector Schema](../common_resources/vector_schema#vectorschemaprovider) for details.

#### Multi-vector support

For multi-vector configurations (multiple vectors per point stored together):

```python
from cocoindex.resources.schema import MultiVectorSchema, VectorSchema
import numpy as np

schema = await qdrant.CollectionSchema.create(
    vectors=qdrant.QdrantVectorDef(
        schema=MultiVectorSchema(
            vector_schema=VectorSchema(dtype=np.float32, size=384)
        ),
        multivector_comparator="max_sim",
    )
)
```
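For intuition about what `"max_sim"` computes, here is a pure-Python sketch of the MaxSim score used for late-interaction models: for each query vector take its best match among the document's vectors, then sum those maxima. The helper names are hypothetical, and dot product is used as the inner similarity here as an assumption; in Qdrant the inner similarity follows the configured distance.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def max_sim(query_vectors: list[list[float]], doc_vectors: list[list[float]]) -> float:
    """Sum, over query vectors, of the best similarity to any document vector."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


query = [[1.0, 0.0], [0.0, 1.0]]
document = [[1.0, 0.0], [0.5, 0.5]]

# First query vector: best match is [1, 0] with similarity 1.0.
# Second query vector: best match is [0.5, 0.5] with similarity 0.5.
score = max_sim(query, document)
```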

### Distance metrics

The `distance` parameter in `QdrantVectorDef` specifies the similarity metric:

- `"cosine"` — Cosine similarity (default, normalized dot product)
- `"dot"` — Dot product similarity
- `"euclid"` — Euclidean distance (L2)
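The three metrics can be compared on a small pair of vectors. This is a plain-Python illustration of the formulas only; it does not call Qdrant.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    # Dot product of the two vectors after normalizing each to unit length.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclid(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

dot_score = dot(a, b)        # 3*4 + 4*3 = 24
cosine_score = cosine(a, b)  # 24 / (5 * 5) = 0.96
euclid_dist = euclid(a, b)   # sqrt(1 + 1)
```

Cosine and dot product are similarities (higher is closer), while Euclidean is a distance (lower is closer).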

### Example: single vector

```python
from qdrant_client import QdrantClient
import cocoindex as coco
from cocoindex.connectors import qdrant
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
from typing import AsyncIterator

QDRANT_URL = "http://localhost:6333"
QDRANT_DB = coco.ContextKey[QdrantClient]("main_vectors")

embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    client = qdrant.create_client(QDRANT_URL)
    builder.provide(QDRANT_DB, client)
    yield

@coco.fn
async def process_document(
    doc_id: str,
    text: str,
    target: qdrant.CollectionTarget,
) -> None:
    embedding = await embedder.embed(text)

    point = qdrant.PointStruct(
        id=doc_id,
        vector=embedding.tolist(),
        payload={"text": text},
    )
    target.declare_point(point)

@coco.fn
async def app_main() -> None:
    # Declare collection target state
    collection = await qdrant.mount_collection_target(
        QDRANT_DB,
        "documents",
        await qdrant.CollectionSchema.create(
            vectors=qdrant.QdrantVectorDef(schema=embedder)
        ),
    )

    # Declare points
    for doc_id, text in documents:
        await coco.mount(
            coco.component_subpath("doc", doc_id),
            process_document,
            doc_id,
            text,
            collection,
        )
```

### Example: named vectors

```python
from cocoindex.resources.schema import VectorSchema
import numpy as np

@coco.fn
async def app_main() -> None:
    collection = await qdrant.mount_collection_target(
        QDRANT_DB,
        "multimodal_docs",
        await qdrant.CollectionSchema.create(
            vectors={
                "text": qdrant.QdrantVectorDef(
                    schema=text_embedder,
                    distance="cosine",
                ),
                "image": qdrant.QdrantVectorDef(
                    schema=VectorSchema(dtype=np.float32, size=512),
                    distance="dot",
                ),
            }
        ),
    )

    # Declare points with named vectors
    for doc in documents:
        point = qdrant.PointStruct(
            id=doc.id,
            vector={
                "text": doc.text_embedding.tolist(),
                "image": doc.image_embedding.tolist(),
            },
            payload={"title": doc.title, "url": doc.url},
        )
        collection.declare_point(point)
```

## Point IDs

Qdrant supports the following point ID types:

- `str` — String identifiers
- `int` — Integer identifiers (unsigned 64-bit)
- `uuid.UUID` — UUID identifiers (converted to string)

All other types are converted to strings automatically.

## Payloads

Point payloads are schemaless JSON objects. Any JSON-serializable Python data structure can be used:

```python
payload = {
    "text": "Document content",
    "metadata": {
        "author": "Alice",
        "tags": ["machine-learning", "nlp"],
        "published": "2024-01-15",
    },
    "stats": {
        "views": 1500,
        "likes": 42,
    },
}
```
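One way to catch payload problems early is to check JSON-serializability before declaring the point. The helper below is a hypothetical convenience built on the standard library, not part of the connector.

```python
import json
from typing import Any


def is_json_serializable(payload: dict[str, Any]) -> bool:
    """Return True if the payload can be encoded as JSON."""
    try:
        json.dumps(payload)
    except (TypeError, ValueError):
        return False
    return True


ok = is_json_serializable({"text": "Document content", "stats": {"views": 1500}})
bad = is_json_serializable({"tags": {"nlp", "ml"}})  # a set is not JSON-serializable
```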

## Vector search

The connector focuses on writing points to Qdrant. For vector search, use the Qdrant client directly:

```python
from qdrant_client.http import models as qdrant_models

# Create a client (or reuse the one you provide in your lifespan)
client = qdrant.create_client("http://localhost:6333")

# Perform search
results = client.search(
    collection_name="documents",
    query_vector=query_embedding.tolist(),
    limit=10,
)

for result in results:
    print(f"Score: {result.score}, ID: {result.id}")
    print(f"Payload: {result.payload}")
```

For named vectors:

```python
results = client.search(
    collection_name="documents",
    query_vector=("text", query_embedding.tolist()),  # Search using "text" vector
    limit=10,
)
```

---

# SQLite connector

Source: https://cocoindex.io/docs/connectors/sqlite/

The `sqlite` connector provides utilities for writing rows to SQLite databases, with optional vector support via the sqlite-vec extension.

```python
from cocoindex.connectors import sqlite
```

**Note — Vector Support**
For vector operations, install the sqlite-vec extension:

```bash
pip install cocoindex[sqlite]
```

Note: The default SQLite library bundled with macOS does not support extensions. Use Homebrew Python (`brew install python`) or build SQLite with extension support.

## Connection setup

### connect

`connect()` creates a managed SQLite connection with sensible defaults, including automatic sqlite-vec loading and thread-safe access.

```python
def connect(
    database: str | Path,
    *,
    timeout: float = 5.0,
    load_vec: bool | Literal["auto"] = "auto",
    **kwargs: Any,
) -> ManagedConnection
```

**Parameters:**

- `database` — Path to the SQLite database file, or `":memory:"` for an in-memory database.
- `timeout` — How long to wait for locks before raising an error.
- `load_vec` — Whether to load the sqlite-vec extension for vector support:
  - `"auto"` (default): Try to load, silently ignore if unavailable.
  - `True`: Load and raise an error if unavailable.
  - `False`: Don't attempt to load.
- `**kwargs` — Additional arguments passed directly to `sqlite3.connect()`.

**Returns:** A `ManagedConnection` with thread-safe access and extension tracking.

**Example:**

```python
managed_conn = sqlite.connect("mydb.sqlite")  # Auto-loads sqlite-vec if available
# Or for in-memory:
managed_conn = sqlite.connect(":memory:")
# Or explicitly require vector support:
managed_conn = sqlite.connect("mydb.sqlite", load_vec=True)
# Or disable auto-loading:
managed_conn = sqlite.connect("mydb.sqlite", load_vec=False)
```

### ManagedConnection

A wrapper around `sqlite3.Connection` that provides thread-safe access and tracks loaded extensions. The connection uses autocommit mode internally.

**Methods:**

- `transaction()` — Context manager that acquires a lock and executes within a transaction (`BEGIN`...`COMMIT`/`ROLLBACK`). Use for write operations that should be atomic.
- `readonly()` — Context manager that acquires a lock for read-only operations. No transaction is started since the connection uses autocommit mode.
- `close()` — Closes the underlying connection.

**Properties:**

- `loaded_extensions` — A read-only `Set[str]` of loaded extension names (e.g., `"sqlite-vec"`).
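The locking and transaction behaviour described above can be sketched with the standard library alone. `LockedConnection` is a hypothetical class written for illustration, not the library's `ManagedConnection`; it assumes a single `threading.Lock` and an autocommit `sqlite3` connection.

```python
import sqlite3
import threading
from contextlib import contextmanager
from typing import Iterator


class LockedConnection:
    """Hypothetical sketch of a lock-guarded autocommit SQLite connection."""

    def __init__(self, database: str) -> None:
        # isolation_level=None puts the sqlite3 module into autocommit mode.
        self._conn = sqlite3.connect(database, isolation_level=None, check_same_thread=False)
        self._lock = threading.Lock()

    @contextmanager
    def transaction(self) -> Iterator[sqlite3.Connection]:
        with self._lock:
            self._conn.execute("BEGIN")
            try:
                yield self._conn
            except Exception:
                self._conn.execute("ROLLBACK")
                raise
            else:
                self._conn.execute("COMMIT")

    @contextmanager
    def readonly(self) -> Iterator[sqlite3.Connection]:
        # No BEGIN: each statement runs on its own in autocommit mode.
        with self._lock:
            yield self._conn

    def close(self) -> None:
        self._conn.close()


db = LockedConnection(":memory:")
with db.transaction() as conn:
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO items VALUES (1, 'kept')")

try:
    with db.transaction() as conn:
        conn.execute("INSERT INTO items VALUES (2, 'rolled back')")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

with db.readonly() as conn:
    names = [name for (name,) in conn.execute("SELECT name FROM items ORDER BY id")]
db.close()
```

The second insert is rolled back by the failed transaction, so only the first row survives.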

## As target

The `sqlite` connector provides target state APIs for writing rows to tables. With it, CocoIndex tracks what rows should exist and automatically handles upserts and deletions.

### Declaring target states

#### Setting up a connection

Create a `ContextKey[sqlite.ManagedConnection]` to identify your SQLite connection, then provide it in your lifespan using `sqlite.managed_connection()`:

**Note**
The key name is load-bearing across runs — it's the stable identity CocoIndex uses to track managed rows. See [ContextKey as stable identity](../programming_guide/context#contextkey-as-stable-identity) before renaming.

```python
from typing import Iterator

import cocoindex as coco
from cocoindex.connectors import sqlite

SQLITE_DB = coco.ContextKey[sqlite.ManagedConnection]("main_db")

@coco.lifespan
def coco_lifespan(builder: coco.EnvironmentBuilder) -> Iterator[None]:
    with sqlite.managed_connection("mydb.sqlite") as conn:
        builder.provide(SQLITE_DB, conn)
        yield
```

#### Tables (parent state)

Declares a table as a target state. Returns a `TableTarget` for declaring rows.

```python
def declare_table_target(
    db: ContextKey[ManagedConnection],
    table_name: str,
    table_schema: TableSchema[RowT],
    *,
    managed_by: Literal["system", "user"] = "system",
    virtual_table_def: Vec0TableDef | None = None,
) -> TableTarget[RowT, coco.PendingS]
```

**Parameters:**

- `db` — A `ContextKey[ManagedConnection]` identifying the connection to use.
- `table_name` — Name of the table.
- `table_schema` — Schema definition including columns and primary key (see [Table Schema](#table-schema-from-python-class)).
- `managed_by` — Whether CocoIndex manages the table lifecycle (`"system"`) or assumes it exists (`"user"`).
- `virtual_table_def` — Optional `Vec0TableDef` to create a vec0 virtual table instead of a regular table.

**Returns:** A pending `TableTarget`. Use the convenience wrapper `await sqlite.mount_table_target(SQLITE_DB, table_name, table_schema)` to resolve.

#### Rows (child states)

Once a `TableTarget` is resolved, declare rows to be upserted:

```python
def TableTarget.declare_row(
    self,
    *,
    row: RowT,
) -> None
```

**Parameters:**

- `row` — A row object (dict, dataclass, NamedTuple, or Pydantic model). Must include all primary key columns.

### Table schema: from Python class

Define the table structure using a Python class (dataclass, NamedTuple, or Pydantic model):

```python
@classmethod
async def TableSchema.from_class(
    cls,
    record_type: type[RowT],
    primary_key: list[str],
    *,
    column_overrides: dict[str, SqliteType | VectorSchemaProvider] | None = None,
) -> TableSchema[RowT]
```

**Parameters:**

- `record_type` — A record type whose fields define table columns.
- `primary_key` — List of column names forming the primary key.
- `column_overrides` — Optional per-column overrides for type mapping or vector configuration.

**Example:**

```python
@dataclass
class OutputProduct:
    category: str
    name: str
    price: float
    embedding: Annotated[NDArray, embedder]

schema = await sqlite.TableSchema.from_class(
    OutputProduct,
    primary_key=["category", "name"],
)
```

Python types are automatically mapped to SQLite type affinities:

| Python Type | SQLite Type |
|-------------|-------------|
| `bool` | `INTEGER` (0/1) |
| `int` | `INTEGER` |
| `float` | `REAL` |
| `decimal.Decimal` | `TEXT` |
| `str` | `TEXT` |
| `bytes` | `BLOB` |
| `uuid.UUID` | `TEXT` |
| `datetime.date` | `TEXT` (ISO format) |
| `datetime.time` | `TEXT` (ISO format) |
| `datetime.datetime` | `TEXT` (ISO format) |
| `datetime.timedelta` | `REAL` (total seconds) |
| `list`, `dict`, nested structs | `TEXT` (JSON) |
| `NDArray` (with vector schema) | `float[N]` (sqlite-vec type, e.g., `float[384]`) |
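To see what these storage forms look like in practice, the sketch below encodes a few Python values according to the table and writes them with the standard `sqlite3` module. The `encode` function is a hypothetical illustration of the mapping, not the connector's encoder, and vector columns are left out.

```python
import datetime
import decimal
import json
import sqlite3
import uuid
from typing import Any


def encode(value: Any) -> Any:
    """Hypothetical encoder following the mapping table above."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return int(value)
    if value is None or isinstance(value, (int, float, str, bytes)):
        return value
    if isinstance(value, (decimal.Decimal, uuid.UUID)):
        return str(value)
    if isinstance(value, (datetime.datetime, datetime.date, datetime.time)):
        return value.isoformat()
    if isinstance(value, datetime.timedelta):
        return value.total_seconds()
    if isinstance(value, (list, dict)):
        return json.dumps(value)
    raise TypeError(f"unsupported type: {type(value).__name__}")


row = (
    True,
    decimal.Decimal("9.99"),
    datetime.date(2025, 1, 15),
    datetime.timedelta(minutes=1, seconds=30),
    {"tags": ["a"]},
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (flag INTEGER, price TEXT, day TEXT, duration REAL, extra TEXT)")
conn.execute("INSERT INTO t VALUES (?, ?, ?, ?, ?)", [encode(v) for v in row])
stored = conn.execute("SELECT * FROM t").fetchone()
conn.close()
```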

To override the default mapping, provide a `SqliteType` or `VectorSchemaProvider` via:

- **Type annotation** — using `typing.Annotated` on the field
- **`column_overrides`** — passing overrides when constructing `TableSchema`

#### SqliteType

Use `SqliteType` to specify a custom SQLite type and optional encoder:

```python
import json
from dataclasses import dataclass
from typing import Annotated

from cocoindex.connectors.sqlite import SqliteType

@dataclass
class MyRow:
    id: int
    value: Annotated[float, SqliteType("REAL")]
    data: Annotated[dict, SqliteType("TEXT", encoder=lambda v: json.dumps(v))]
```

Or via `column_overrides`:

```python
schema = await sqlite.TableSchema.from_class(
    MyRow,
    primary_key=["id"],
    column_overrides={
        "data": sqlite.SqliteType("TEXT", encoder=lambda v: json.dumps(v)),
    },
)
```

#### VectorSchemaProvider

For `NDArray` fields, a [`VectorSchemaProvider`](../common_resources/vector_schema#vectorschemaprovider) annotation specifies the vector dimension and dtype. Vectors are stored as BLOBs in sqlite-vec compatible float32 format. See [Vector Schema](../common_resources/vector_schema#vectorschemaprovider) for the full list of annotation options (`ContextKey`, embedder instance, or explicit `VectorSchema`).
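As an illustration of a float32 BLOB layout, the sketch below packs a vector into raw 4-byte floats with the standard `struct` module and unpacks it again. The helper names are hypothetical, and little-endian byte order is an assumption made here for a deterministic example.

```python
import struct


def to_float32_blob(vector: list[float]) -> bytes:
    """Pack floats as consecutive little-endian float32 values (4 bytes each)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_float32_blob(blob: bytes) -> list[float]:
    """Inverse of to_float32_blob."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


vector = [0.5, -1.0, 2.25]  # values exactly representable in float32
blob = to_float32_blob(vector)
restored = from_float32_blob(blob)
```

A vector of dimension N therefore occupies exactly 4 × N bytes.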

### Table schema: explicit column definitions

Define columns directly using `ColumnDef`:

```python
def TableSchema.__init__(
    self,
    columns: dict[str, ColumnDef],
    primary_key: list[str],
) -> None
```

**Example:**

```python
schema = sqlite.TableSchema(
    {
        "category": sqlite.ColumnDef(type="TEXT", nullable=False),
        "name": sqlite.ColumnDef(type="TEXT", nullable=False),
        "price": sqlite.ColumnDef(type="REAL"),
        "embedding": sqlite.ColumnDef(type="float[384]"),  # sqlite-vec vector type
    },
    primary_key=["category", "name"],
)
```

### Virtual tables

SQLite virtual tables allow custom storage backends and specialized functionality. The `sqlite` connector supports creating virtual tables through the same `declare_table_target()` API used for regular tables.

#### Vec0 virtual tables

The `vec0` module from sqlite-vec provides optimized vector storage for similarity search. Use vec0 virtual tables when:

- You need efficient vector similarity search with built-in indexing
- You want to partition vectors by categories for faster queries
- You're working with large vector datasets

**Requirements:**

- Exactly one `INTEGER` primary key column
- At least one `float[N]` vector column
- The sqlite-vec extension must be loaded (pass `load_vec=True` to get an error if it is unavailable, rather than the silent fallback of `"auto"`)

#### Vec0TableDef

Configure vec0-specific features using `Vec0TableDef`:

```python
from cocoindex.connectors.sqlite import Vec0TableDef

virtual_table_def = Vec0TableDef(
    partition_key_columns=["category"],  # Optional: partition index by these columns
    auxiliary_columns=["metadata"],      # Optional: columns excluded from KNN filters
)
```

**Parameters:**

- `partition_key_columns` — List of column names used to partition the vector index. Queries can filter by partition keys efficiently. Multiple partition keys create a composite partition.
- `auxiliary_columns` — List of column names to mark as auxiliary (stored but not usable in KNN filters). Useful for metadata that doesn't need to participate in similarity search.
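For orientation, the sketch below renders the kind of `CREATE VIRTUAL TABLE` statement these settings correspond to in sqlite-vec, where partition keys are marked with `PARTITION KEY` and auxiliary columns are prefixed with `+`. The helper `vec0_create_sql` is hypothetical, and the exact statement the connector emits is an assumption.

```python
from typing import Sequence


def vec0_create_sql(
    table_name: str,
    columns: dict[str, str],
    primary_key: str,
    partition_key_columns: Sequence[str] = (),
    auxiliary_columns: Sequence[str] = (),
) -> str:
    """Hypothetical helper: render vec0 DDL from column types and vec0 settings."""
    parts = []
    for name, col_type in columns.items():
        if name == primary_key:
            parts.append(f"{name} {col_type} PRIMARY KEY")
        elif name in partition_key_columns:
            parts.append(f"{name} {col_type} PARTITION KEY")
        elif name in auxiliary_columns:
            parts.append(f"+{name} {col_type}")  # "+" marks an auxiliary column
        else:
            parts.append(f"{name} {col_type}")  # vector or regular metadata column
    return f"CREATE VIRTUAL TABLE {table_name} USING vec0({', '.join(parts)})"


ddl = vec0_create_sql(
    "documents",
    {
        "id": "INTEGER",
        "category": "TEXT",
        "content": "TEXT",
        "embedding": "float[384]",
        "metadata": "TEXT",
    },
    primary_key="id",
    partition_key_columns=["category"],
    auxiliary_columns=["metadata"],
)
```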

#### Creating vec0 virtual tables

Pass `virtual_table_def` to `declare_table_target()`, or to the `mount_table_target()` convenience wrapper as shown here:

```python
@dataclass
class VectorDocument:
    id: int
    category: str
    content: str
    embedding: Annotated[NDArray, embedder]  # e.g., float[384]
    metadata: str

# Create vec0 virtual table with partition key and auxiliary column
table = await sqlite.mount_table_target(
    SQLITE_DB,
    "documents",
    await sqlite.TableSchema.from_class(
        VectorDocument,
        primary_key=["id"],
    ),
    virtual_table_def=sqlite.Vec0TableDef(
        partition_key_columns=["category"],
        auxiliary_columns=["metadata"],
    ),
)
```

**Warning — Current Limitations**
Vec0 virtual tables have the following limitations:

- **Schema changes are not supported incrementally**. When you modify the table schema (add/remove columns, change `virtual_table_def` settings), the table will be recreated and **existing data will be lost**.
- **Switching between regular and virtual tables** will also recreate the table and clear existing data.

A future schema versioning mechanism will allow preserving row data across table recreations.

### Example

```python
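from dataclasses import dataclass
from typing import Annotated, Iterator

import cocoindex as coco
from cocoindex.connectors import sqlite
from numpy.typing import NDArray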

DATABASE_PATH = "mydb.sqlite"

SQLITE_DB = coco.ContextKey[sqlite.ManagedConnection]("main_db")

@dataclass
class OutputProduct:
    category: str
    name: str
    description: str
    embedding: Annotated[NDArray, embedder]

@coco.lifespan
def coco_lifespan(builder: coco.EnvironmentBuilder) -> Iterator[None]:
    with sqlite.managed_connection(DATABASE_PATH, load_vec=True) as conn:  # Enable vector support
        builder.provide(SQLITE_DB, conn)
        yield

@coco.fn
async def app_main() -> None:
    # Declare table target state
    table = await sqlite.mount_table_target(
        SQLITE_DB,
        "products",
        await sqlite.TableSchema.from_class(
            OutputProduct,
            primary_key=["category", "name"],
        ),
    )

    # Declare rows
    for product in products:
        table.declare_row(row=product)
```

### Example: Vec0 Virtual Table

```python
import cocoindex as coco
from cocoindex.connectors import sqlite
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
from dataclasses import dataclass
from typing import Annotated, Iterator
from numpy.typing import NDArray

DATABASE_PATH = "vectors.sqlite"
SQLITE_DB = coco.ContextKey[sqlite.ManagedConnection]("vec_db")

embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")

@dataclass
class VectorDocument:
    id: int
    category: str
    title: str
    content: str
    embedding: Annotated[NDArray, embedder]  # float[384]
    metadata: str  # Will be marked as auxiliary

@coco.lifespan
def coco_lifespan(builder: coco.EnvironmentBuilder) -> Iterator[None]:
    with sqlite.managed_connection(DATABASE_PATH, load_vec=True) as conn:
        builder.provide(SQLITE_DB, conn)
        yield

@coco.fn
async def app_main() -> None:
    # Create vec0 virtual table with partition key and auxiliary column
    table = await sqlite.mount_table_target(
        SQLITE_DB,
        "documents",
        await sqlite.TableSchema.from_class(
            VectorDocument,
            primary_key=["id"],
        ),
        virtual_table_def=sqlite.Vec0TableDef(
            partition_key_columns=["category"],  # Partition index by category
            auxiliary_columns=["metadata"],       # Store but don't index for KNN
        ),
    )

    # Declare document rows
    docs = [
        VectorDocument(
            id=1,
            category="tech",
            title="Introduction to AI",
            content="Artificial intelligence is...",
            embedding=await embedder.embed("Artificial intelligence is..."),
            metadata='{"source": "blog", "date": "2025-01-15"}',
        ),
        # ... more documents
    ]

    for doc in docs:
        table.declare_row(row=doc)
```

---

# SurrealDB connector

Source: https://cocoindex.io/docs/connectors/surrealdb/

The `surrealdb` connector provides utilities for writing records to SurrealDB databases, with support for normal tables, relation (graph edge) tables, optional schema enforcement, and vector indexes.

```python
from cocoindex.connectors import surrealdb
```

**Note — Dependencies**
This connector requires additional dependencies. Install with:

```bash
pip install cocoindex[surrealdb]
```

## Connection setup

Create a `ConnectionFactory` and provide it via a `ContextKey`. It holds connection parameters and creates authenticated connections on demand.

**Note**
The key name is load-bearing across runs — it's the stable identity CocoIndex uses to track managed rows. See [ContextKey as stable identity](../programming_guide/context#contextkey-as-stable-identity) before renaming.

```python
from typing import Iterator

from cocoindex.connectors import surrealdb
import cocoindex as coco

SURREAL_DB: coco.ContextKey[surrealdb.ConnectionFactory] = coco.ContextKey("main_db")

@coco.lifespan
def coco_lifespan(builder: coco.EnvironmentBuilder) -> Iterator[None]:
    builder.provide(
        SURREAL_DB,
        surrealdb.ConnectionFactory(
            url="ws://localhost:8000/rpc",
            namespace="test",
            database="test",
            credentials={"username": "root", "password": "root"},
        ),
    )
    yield
```

## As target

The `surrealdb` connector provides target state APIs for writing records to normal tables and relation tables. CocoIndex tracks what records should exist and automatically handles upserts and deletions.

All tables within the same database share a single transaction sink, so changes across related tables and relations are applied atomically.

### Declaring target states

#### Normal tables (parent state)

Declares a table as a target state. Returns a `TableTarget` for declaring records.

```python
def declare_table_target(
    db: ContextKey,
    table_name: str,
    table_schema: TableSchema[RowT] | None = None,
    *,
    managed_by: Literal["system", "user"] = "system",
) -> TableTarget[RowT, coco.PendingS]
```

**Parameters:**

- `db` — A `ContextKey[surrealdb.ConnectionFactory]` for the SurrealDB connection.
- `table_name` — Name of the table.
- `table_schema` — Optional schema definition (see [Table Schema](#table-schema-from-python-class)). When provided, the table is `SCHEMAFULL`; when omitted, the table is `SCHEMALESS`.
- `managed_by` — Whether CocoIndex manages the table lifecycle (`"system"`) or assumes it exists (`"user"`).

**Returns:** A pending `TableTarget`. Use `await surrealdb.mount_table_target(SURREAL_DB, ...)` to get a resolved target.

#### Records (child states)

Once a `TableTarget` is resolved, declare records to be upserted:

```python
def TableTarget.declare_record(
    self,
    *,
    row: RowT,
) -> None
```

**Parameters:**

- `row` — A row object (dict, dataclass, NamedTuple, or Pydantic model). Must include an `id` field.

`declare_row` is an alias for `declare_record`, for compatibility with Postgres and other RDBMS targets.

#### Relation tables (parent state)

Declares a relation (graph edge) table. Returns a `RelationTarget` for declaring relation records.

```python
def declare_relation_target(
    db: ContextKey,
    table_name: str,
    from_table: TableTarget | Collection[TableTarget],
    to_table: TableTarget | Collection[TableTarget],
    table_schema: TableSchema[RowT] | None = None,
    *,
    managed_by: Literal["system", "user"] = "system",
) -> RelationTarget[RowT, coco.PendingS]
```

**Parameters:**

- `db` — A `ContextKey[surrealdb.ConnectionFactory]` for the SurrealDB connection.
- `table_name` — Name of the relation table.
- `from_table` — Source table(s). Pass a single `TableTarget` or a collection for polymorphic relations.
- `to_table` — Target table(s). Same rules as `from_table`.
- `table_schema` — Optional schema. The schema does **not** require an `id` field (unlike normal tables).
- `managed_by` — Whether CocoIndex manages the table lifecycle.

**Returns:** A pending `RelationTarget`. Use `await surrealdb.mount_relation_target(SURREAL_DB, ...)` to get a resolved target.

#### Relations (child states)

Once a `RelationTarget` is resolved, declare relation records:

```python
def RelationTarget.declare_relation(
    self,
    *,
    from_id: Any,
    to_id: Any,
    record: RowT | None = None,
    from_table: TableTarget | None = None,
    to_table: TableTarget | None = None,
) -> None
```

**Parameters:**

- `from_id` — ID of the source record.
- `to_id` — ID of the target record.
- `record` — Optional data fields for the relation. The `id` field is optional: when absent, the record id is auto-derived from the endpoints as `"{from_table}_{from_id}_{to_table}_{to_id}"`.
- `from_table` / `to_table` — Required when the relation was declared with multiple (polymorphic) source/target tables.
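
As a rough illustration of the auto-derived id format described for `record`, the sketch below rebuilds the same string in plain Python. The helper name `derive_relation_id` is hypothetical and not part of the connector; it only mirrors the documented `"{from_table}_{from_id}_{to_table}_{to_id}"` pattern.

```python
def derive_relation_id(from_table: str, from_id: object, to_table: str, to_id: object) -> str:
    """Hypothetical helper: mirrors the documented auto-derived relation id format."""
    return f"{from_table}_{from_id}_{to_table}_{to_id}"


# A "likes" edge from person "alice" to post "p1"
relation_id = derive_relation_id("person", "alice", "post", "p1")
print(relation_id)
```

Because the derived id depends only on the endpoints, declaring the same pair twice without an explicit `id` would map to the same id under this pattern.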

#### Vector indexes (attachment)

Declare a vector index on a field of the table. CocoIndex tracks the index spec and automatically creates, recreates, or drops the index as needed.

```python
def TableTarget.declare_vector_index(
    self,
    *,
    name: str | None = None,
    field: str,
    metric: Literal["cosine", "euclidean", "manhattan"] = "cosine",
    method: Literal["mtree", "hnsw"] = "mtree",
    dimension: int | None = None,
    vector_type: Literal["f32", "f64", "i16", "i32", "i64"] = "f32",
) -> None
```

**Parameters:**

- `name` — Index name (defaults to `idx_{table}__{field}`).
- `field` — Field to index (must be a vector/array field).
- `metric` — Distance metric: `"cosine"`, `"euclidean"`, or `"manhattan"`.
- `method` — Index method: `"mtree"` or `"hnsw"`.
- `dimension` — Vector dimension (required).
- `vector_type` — Vector element type: `"f32"`, `"f64"`, `"i16"`, `"i32"`, or `"i64"`.
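
To make the default index name concrete, here is a minimal sketch of the `idx_{table}__{field}` pattern noted for `name`. The function `default_index_name` is a hypothetical stand-in for illustration, not a connector API.

```python
def default_index_name(table: str, field: str) -> str:
    """Hypothetical helper: builds the documented default name idx_{table}__{field}."""
    # Note the double underscore between the table name and the field name.
    return f"idx_{table}__{field}"


index_name = default_index_name("products", "embedding")
print(index_name)
```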

### Table schema: from Python class

Define the table structure using a Python class (dataclass, NamedTuple, or Pydantic model):

```python
@classmethod
async def TableSchema.from_class(
    cls,
    record_type: type[RowT],
    *,
    column_overrides: dict[str, SurrealType | VectorSchemaProvider] | None = None,
) -> TableSchema[RowT]
```

**Parameters:**

- `record_type` — A record type whose fields define table columns. For normal tables, must include an `id` field. For relation tables, `id` is optional.
- `column_overrides` — Optional per-column overrides for type mapping or vector configuration.

**Example:**

```python
@dataclass
class Product:
    id: str
    name: str
    price: float
    embedding: Annotated[NDArray, embedder]

schema = await surrealdb.TableSchema.from_class(Product)
```

Python types are automatically mapped to SurrealDB types:

| Python Type | SurrealDB Type |
|-------------|----------------|
| `bool` | `bool` |
| `int` | `int` |
| `float` | `float` |
| `decimal.Decimal` | `decimal` |
| `str` | `string` |
| `bytes` | `bytes` |
| `uuid.UUID` | `uuid` |
| `datetime.datetime` | `datetime` |
| `datetime.date` | `datetime` |
| `datetime.time` | `datetime` |
| `datetime.timedelta` | `duration` |
| `list`, `dict`, nested structs | `object` |
| `NDArray` (with vector schema) | `array<float, N>` |
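
The scalar rows of the table above can be read as a simple lookup. The sketch below encodes them in a plain dictionary for illustration; `DEFAULT_SURREAL_TYPES` and `surreal_type_for` are hypothetical names, the real connector may resolve types differently (for example for nested structs), and the `NDArray` row is left out because its type depends on the vector dimension.

```python
import datetime
import decimal
import uuid

# Hypothetical lookup table copied from the documented default mapping.
DEFAULT_SURREAL_TYPES: dict[type, str] = {
    bool: "bool",
    int: "int",
    float: "float",
    decimal.Decimal: "decimal",
    str: "string",
    bytes: "bytes",
    uuid.UUID: "uuid",
    datetime.datetime: "datetime",
    datetime.date: "datetime",
    datetime.time: "datetime",
    datetime.timedelta: "duration",
    list: "object",
    dict: "object",
}


def surreal_type_for(py_type: type) -> str:
    """Return the documented default SurrealDB type name for an exact Python type."""
    return DEFAULT_SURREAL_TYPES[py_type]


print(surreal_type_for(datetime.timedelta))
```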

#### SurrealType

Use `SurrealType` to override the default type mapping:

```python
from dataclasses import dataclass
from typing import Annotated
from cocoindex.connectors.surrealdb import SurrealType

@dataclass
class MyRow:
    id: str
    value: Annotated[float, SurrealType("decimal")]
```

Or via `column_overrides`:

```python
schema = await surrealdb.TableSchema.from_class(
    MyRow,
    column_overrides={"value": surrealdb.SurrealType("decimal")},
)
```

#### VectorSchemaProvider

For `NDArray` fields, a [`VectorSchemaProvider`](../common_resources/vector_schema#vectorschemaprovider) annotation specifies the vector dimension and dtype. See [Vector Schema](../common_resources/vector_schema#vectorschemaprovider) for the full list of annotation options.

### Table schema: explicit column definitions

Define columns directly using `ColumnDef`:

```python
def TableSchema.__init__(
    self,
    columns: dict[str, ColumnDef],
    *,
    row_type: type[RowT] | None = None,
) -> None
```

**Example:**

```python
schema = surrealdb.TableSchema(
    {
        "id": surrealdb.ColumnDef(type="string", nullable=False),
        "name": surrealdb.ColumnDef(type="string", nullable=False),
        "price": surrealdb.ColumnDef(type="float"),
    },
)
```

### Example: Normal tables

```python
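from dataclasses import dataclass
from typing import Annotated, Iterator

import cocoindex as coco
from cocoindex.connectors import surrealdb
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
from numpy.typing import NDArray

SURREAL_DB: coco.ContextKey[surrealdb.ConnectionFactory] = coco.ContextKey("main_db")

# Embedder that also provides the vector schema for the `embedding` field (384 dimensions)
embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")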

@dataclass
class Product:
    id: str
    name: str
    price: float
    embedding: Annotated[NDArray, embedder]

@coco.lifespan
def coco_lifespan(builder: coco.EnvironmentBuilder) -> Iterator[None]:
    builder.provide(
        SURREAL_DB,
        surrealdb.ConnectionFactory(
            url="ws://localhost:8000/rpc",
            namespace="test",
            database="test",
            credentials={"username": "root", "password": "root"},
        ),
    )
    yield

@coco.fn
async def app_main() -> None:
    # Declare table target state
    table = await surrealdb.mount_table_target(
        SURREAL_DB,
        "products",
        await surrealdb.TableSchema.from_class(Product),
    )

    # Declare records
    for product in products:
        table.declare_record(row=product)

    # Declare a vector index
    table.declare_vector_index(
        field="embedding",
        metric="cosine",
        method="hnsw",
        dimension=384,
    )
```

### Example: Relation tables

```python
@dataclass
class Person:
    id: str
    name: str

@dataclass
class Post:
    id: str
    title: str

@coco.fn
async def app_main() -> None:
    person_schema = await surrealdb.TableSchema.from_class(Person)
    person_target = await surrealdb.mount_table_target(SURREAL_DB, "person", person_schema)
    for p in persons:
        person_target.declare_record(row=p)

    post_schema = await surrealdb.TableSchema.from_class(Post)
    post_target = await surrealdb.mount_table_target(SURREAL_DB, "post", post_schema)
    for p in posts:
        post_target.declare_record(row=p)

    # Declare a relation table (schemaless, no id needed)
    likes_target = await surrealdb.mount_relation_target(
        SURREAL_DB,
        "likes",
        from_table=person_target,
        to_table=post_target,
    )

    # Declare relations — id is auto-derived from endpoints
    for like in likes:
        likes_target.declare_relation(
            from_id=like["person_id"],
            to_id=like["post_id"],
        )
```

---

# Turbopuffer connector

Source: https://cocoindex.io/docs/connectors/turbopuffer/

The `turbopuffer` connector provides utilities for writing rows to [Turbopuffer](https://turbopuffer.com/) namespaces, with support for both single and named vectors.

```python
from cocoindex.connectors import turbopuffer
```

**Note — Dependencies**
This connector requires additional dependencies. Install with:

```bash
pip install cocoindex[turbopuffer]
```

## Connection setup

Turbopuffer uses a single client object that owns the API key and region. Construct one using `AsyncTurbopuffer`:

```python
import os

from cocoindex.connectors import turbopuffer

client = turbopuffer.AsyncTurbopuffer(
    region="gcp-us-central1",
    api_key=os.environ["TURBOPUFFER_API_KEY"],
)
```

`turbopuffer.AsyncTurbopuffer` is re-exported from the [Turbopuffer Python SDK](https://github.com/turbopuffer/turbopuffer-python); importing it directly via `from turbopuffer import AsyncTurbopuffer` works too.

## As target

The `turbopuffer` connector provides target state APIs for writing rows to namespaces. CocoIndex tracks what rows should exist and automatically handles upserts and deletions. Turbopuffer creates namespaces implicitly on the first write, so there is no separate "create namespace" step — but the connector still tracks namespace-level configuration (vector schema and distance metric) and clears the namespace if it must be rebuilt.

### Declaring target states

#### Setting up a connection

Create a `ContextKey[AsyncTurbopuffer]` to identify your client, then provide it in your lifespan:

**Note**
The key name is load-bearing across runs — it's the stable identity CocoIndex uses to track managed namespaces. See [ContextKey as stable identity](../programming_guide/context#contextkey-as-stable-identity) before renaming.

```python
import os
from typing import AsyncIterator

from cocoindex.connectors import turbopuffer
import cocoindex as coco

TPUF = coco.ContextKey[turbopuffer.AsyncTurbopuffer]("my_vectors")

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    client = turbopuffer.AsyncTurbopuffer(
        region="gcp-us-central1",
        api_key=os.environ["TURBOPUFFER_API_KEY"],
    )
    builder.provide(TPUF, client)
    yield
```

#### Namespaces (parent state)

Declares a namespace as a target state. Returns a `NamespaceTarget` for declaring rows.

```python
def declare_namespace_target(
    db: ContextKey[AsyncTurbopuffer],
    namespace_name: str,
    schema: NamespaceSchema,
    *,
    managed_by: Literal["system", "user"] = "system",
) -> NamespaceTarget[coco.PendingS]
```

**Parameters:**

- `db` — A `ContextKey[AsyncTurbopuffer]` identifying the client to use.
- `namespace_name` — Name of the namespace.
- `schema` — Schema definition specifying vector configuration and distance metric (see [Namespace schema](#namespace-schema)).
- `managed_by` — Whether CocoIndex manages the namespace lifecycle (`"system"`) or assumes it exists (`"user"`).

**Returns:** A pending `NamespaceTarget`. Use the convenience wrapper `await turbopuffer.mount_namespace_target(TPUF, namespace_name, schema)` to resolve.

#### Rows (child states)

Once a `NamespaceTarget` is resolved, declare rows to be upserted using `turbopuffer.Row`:

```python
def NamespaceTarget.declare_row(
    self,
    row: turbopuffer.Row,
) -> None
```

`Row` is a small dataclass:

```python
@dataclass
class Row:
    id: str | int
    vector: Sequence[float] | np.ndarray | dict[str, Sequence[float] | np.ndarray]
    attributes: dict[str, Any] | None = None
```

- `id` — Document id (string or integer).
- `vector` — For an unnamed-vector schema, pass a single sequence. For a named-vectors schema, pass a dict mapping vector field name to its sequence.
- `attributes` — Non-vector attributes (text, tags, metadata, etc.). Turbopuffer infers attribute types from the data.

### Namespace schema

Define vector configuration and distance metric for a namespace using `NamespaceSchema`:

```python
class NamespaceSchema:
    @classmethod
    async def create(
        cls,
        vectors: VectorDef | dict[str, VectorDef],
        *,
        distance: Literal["cosine_distance", "euclidean_squared"] = "cosine_distance",
    ) -> NamespaceSchema
```

**Parameters:**

- `vectors` — Either:
  - A single `VectorDef` for an unnamed vector (stored under turbopuffer's default `"vector"` field).
  - A dict mapping vector names to `VectorDef` for named vectors.
- `distance` — Distance metric applied to all vector columns in the namespace. Turbopuffer applies a single distance metric per namespace.

#### VectorDef

Specifies a vector field's dimension and dtype:

```python
class VectorDef(NamedTuple):
    schema: VectorSchemaProvider | ContextKey[VectorSchemaProvider]
```

The `schema` field accepts a [`VectorSchemaProvider`](../common_resources/vector_schema#vectorschemaprovider), a `ContextKey`, or an explicit `VectorSchema`. The dtype on the `VectorSchema` (must be `np.float32` or `np.float16`) controls turbopuffer's vector type — `[N]f32` or `[N]f16`.

#### Single (unnamed) vector

For namespaces with a single unnamed vector:

```python
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder

embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")

schema = await turbopuffer.NamespaceSchema.create(
    vectors=turbopuffer.VectorDef(schema=embedder),
)
```

Rows pass the vector directly:

```python
target.declare_row(turbopuffer.Row(
    id="doc-123",
    vector=embedding,
    attributes={"text": "...", "tags": ["a", "b"]},
))
```

#### Named vectors

Namespaces can have multiple named vector columns (turbopuffer supports up to two per namespace). The name `"id"` is reserved for the row id and cannot be used as a vector field name.

```python
from cocoindex.resources.schema import VectorSchema
import numpy as np

schema = await turbopuffer.NamespaceSchema.create(
    vectors={
        "text_embedding": turbopuffer.VectorDef(
            schema=VectorSchema(dtype=np.float32, size=384),
        ),
        "image_embedding": turbopuffer.VectorDef(
            schema=VectorSchema(dtype=np.float32, size=512),
        ),
    },
    distance="cosine_distance",
)
```

Rows pass a dict of vectors:

```python
target.declare_row(turbopuffer.Row(
    id="doc-123",
    vector={
        "text_embedding": text_vec,
        "image_embedding": image_vec,
    },
    attributes={"title": "..."},
))
```

### Distance metrics

Turbopuffer applies a single `distance_metric` per namespace. Supported values:

- `"cosine_distance"` — Cosine distance (default).
- `"euclidean_squared"` — Squared Euclidean distance.

### Example

```python
from typing import AsyncIterator
import os
import cocoindex as coco
from cocoindex.connectors import turbopuffer
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder

TPUF = coco.ContextKey[turbopuffer.AsyncTurbopuffer]("main_vectors")

embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    client = turbopuffer.AsyncTurbopuffer(
        region="gcp-us-central1",
        api_key=os.environ["TURBOPUFFER_API_KEY"],
    )
    builder.provide(TPUF, client)
    yield

@coco.fn
async def process_document(
    doc_id: str,
    text: str,
    target: turbopuffer.NamespaceTarget,
) -> None:
    embedding = await embedder.embed(text)
    target.declare_row(turbopuffer.Row(
        id=doc_id,
        vector=embedding,
        attributes={"text": text},
    ))

@coco.fn
async def app_main() -> None:
    namespace = await turbopuffer.mount_namespace_target(
        TPUF,
        "documents",
        await turbopuffer.NamespaceSchema.create(
            vectors=turbopuffer.VectorDef(schema=embedder),
        ),
    )

    for doc_id, text in documents:
        await coco.mount(
            coco.component_subpath("doc", doc_id),
            process_document,
            doc_id,
            text,
            namespace,
        )
```

## Row IDs

Turbopuffer rows are identified by `str` or `int`. UUIDs should be passed as strings.

## Attributes

Row attributes are schemaless; turbopuffer infers attribute types from the values you write. Supported scalar types include `string`, `int`, `uint`, `float`, `bool`, `uuid`, and `datetime`, plus their array variants. See [Turbopuffer's schema reference](https://turbopuffer.com/docs/write#schema) for the full list.

Reserved attribute names depend on the schema; putting any reserved name in `Row.attributes` raises a `ValueError`:

- `id` is always reserved — it's the row id.
- For an unnamed-vector schema, `vector` is also reserved (it's the wire-level vector field).
- For a named-vectors schema, each declared vector field name is reserved instead.
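
The reserved-name rules above can be sketched as a small check. This is an illustrative approximation, not the connector's actual validation code: `reserved_attribute_names` and `check_attributes` are hypothetical helpers, and passing `vector_names=None` stands in for an unnamed-vector schema.

```python
from typing import Any, Iterable, Optional


def reserved_attribute_names(vector_names: Optional[Iterable[str]] = None) -> set[str]:
    """Hypothetical helper: names that may not appear in Row.attributes."""
    reserved = {"id"}  # always reserved: the row id
    if vector_names is None:
        reserved.add("vector")  # unnamed-vector schema
    else:
        reserved.update(vector_names)  # named-vectors schema
    return reserved


def check_attributes(
    attributes: dict[str, Any],
    vector_names: Optional[Iterable[str]] = None,
) -> None:
    """Raise ValueError if any attribute key collides with a reserved name."""
    clash = reserved_attribute_names(vector_names).intersection(attributes)
    if clash:
        raise ValueError(f"reserved attribute names used: {sorted(clash)}")


check_attributes({"text": "hello"})  # fine for an unnamed-vector schema
check_attributes({"vector": "v1"}, ["text_embedding"])  # fine for a named-vectors schema
```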

## Vector search

The connector focuses on writing rows. For vector search, use the turbopuffer client directly:

```python
ns = client.namespace("documents")
result = await ns.query(
    rank_by=("vector", "ANN", query_embedding.tolist()),
    top_k=10,
)
```

---

# Valkey connector

Source: https://cocoindex.io/docs/connectors/valkey/

The `valkey` connector provides utilities for writing documents to Valkey with vector search support, using the valkey-search module for similarity search indexes.

```python
from cocoindex.connectors import valkey
```

**Note — Dependencies**
This connector requires additional dependencies. Install with:

```bash
pip install cocoindex[valkey]
```

Requires a Valkey server with the [valkey-search](https://github.com/valkey-io/valkey-search) module loaded.

## Connection setup

`create_client_config()` creates a configuration for connecting to Valkey.

```python
def create_client_config(
    host: str = "localhost",
    port: int = 6379,
    *,
    password: str | None = None,
    use_tls: bool = False,
    client_name: str | None = "cocoindex_vector_store",
    **kwargs: Any,
) -> GlideClientConfiguration
```

**Parameters:**

- `host` — Valkey server host (default: `"localhost"`).
- `port` — Valkey server port (default: `6379`).
- `password` — Optional authentication password.
- `use_tls` — Whether to use TLS for the connection (default: `False`).
- `client_name` — Client name for the connection, visible in `CLIENT LIST` and monitoring dashboards (default: `"cocoindex_vector_store"`). Pass `None` to disable.
- `**kwargs` — Additional keyword arguments passed to `GlideClientConfiguration`.

For advanced configurations not covered by these parameters, construct `GlideClientConfiguration` directly.

**Returns:** A `GlideClientConfiguration` instance.

**Example:**

```python
from glide import GlideClient
config = valkey.create_client_config("localhost", 6379)
client = await GlideClient.create(config)
```

## As target

The `valkey` connector provides target state APIs for writing documents to search indexes. CocoIndex tracks what documents should exist and automatically handles upserts and deletions.

### Declaring target states

#### Setting up a connection

Create a `ContextKey[GlideClient]` to identify your Valkey client, then provide it in your lifespan:

**Note**
The key name is load-bearing across runs — it's the stable identity CocoIndex uses to track managed indexes. See [ContextKey as stable identity](../programming_guide/context#contextkey-as-stable-identity) before renaming.

```python
from typing import AsyncIterator

from glide import GlideClient
import cocoindex as coco
from cocoindex.connectors import valkey

VALKEY_DB = coco.ContextKey[GlideClient]("my_valkey")

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    config = valkey.create_client_config("localhost", 6379)
    client = await GlideClient.create(config)
    builder.provide(VALKEY_DB, client)
    yield
    await client.close()
```

#### Indexes (parent state)

Declares a search index as a target state. Returns an `IndexTarget` for declaring documents.

```python
from cocoindex.connectorkits.target import ManagedBy

def declare_index_target(
    db: ContextKey[GlideClient],
    index_name: str,
    schema: IndexSchema,
    *,
    managed_by: ManagedBy = ManagedBy.SYSTEM,
) -> IndexTarget[coco.PendingS]
```

**Parameters:**

- `db` — A `ContextKey[GlideClient]` identifying the Valkey client to use.
- `index_name` — Name of the search index.
- `schema` — Schema definition specifying vector configuration (see [Index Schema](#index-schema)).
- `managed_by` — `ManagedBy.SYSTEM` (default): CocoIndex creates and drops the index automatically. `ManagedBy.USER`: assumes the index already exists and never drops it.

**Returns:** A pending `IndexTarget`. Use the convenience wrapper `await valkey.mount_index_target(VALKEY_DB, index_name, schema)` to resolve.

#### Documents (child states)

Once an `IndexTarget` is resolved, declare documents to be upserted:

```python
def IndexTarget.declare_document(
    self,
    document: valkey.Document,
) -> None
```

**Parameters:**

- `document` — A `valkey.Document` containing:
  - `id` — Document ID (string)
  - `vector` — Vector data (list of floats or numpy array)
  - `payload` — Optional metadata dict (values must be str, int, or float)

### Index schema

Define vector configuration for an index using `IndexSchema`.

```python
class IndexSchema:
    @classmethod
    async def create(
        cls,
        vectors: VectorDef,
        fields: list[FieldDef] | None = None,
    ) -> IndexSchema
```

**Parameters:**

- `vectors` — A `VectorDef` specifying the vector field configuration.
- `fields` — Optional list of `FieldDef` for indexed payload fields (enables search/filtering).

#### VectorDef

Specifies vector configuration including dimension, distance metric, and algorithm:

```python
class VectorDef(NamedTuple):
    schema: VectorSchemaProvider | ContextKey[VectorSchemaProvider]
    distance: Literal["cosine", "l2", "ip"] = "cosine"
    algorithm: Literal["hnsw", "flat"] = "hnsw"
```

**Parameters:**

- `schema` — A `VectorSchemaProvider` or `ContextKey` that defines vector dimensions.
- `distance` — Distance metric for similarity search (default: `"cosine"`).
- `algorithm` — Vector index algorithm (default: `"hnsw"`).

#### VectorSchemaProvider

The `schema` field of `VectorDef` accepts a [`VectorSchemaProvider`](../common_resources/vector_schema#vectorschemaprovider), a `ContextKey`, or an explicit `VectorSchema` to specify the vector dimension and dtype. See [Vector Schema](../common_resources/vector_schema#vectorschemaprovider) for details.

### Distance metrics

The `distance` parameter in `VectorDef` specifies the similarity metric:

- `"cosine"` — Cosine similarity (default)
- `"l2"` — Euclidean distance (L2)
- `"ip"` — Inner product (dot product)

### Index algorithms

The `algorithm` parameter specifies the vector indexing strategy:

- `"hnsw"` — Hierarchical Navigable Small World graph (default, best for most use cases)
- `"flat"` — Brute-force flat index (exact results, slower for large datasets)

### Data storage

Documents are stored as Valkey HASH keys with a prefix pattern:

- Key format: `{index_name}:{document_id}`
- Vector field: stored as binary float32 blob in a `vector` hash field
- Payload fields: stored as individual hash fields (string values)

The search index is created with `FT.CREATE ON HASH PREFIX 1 {index_name}:` to automatically index all documents with the matching prefix.
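
To illustrate the storage layout, the sketch below builds a key in the `{index_name}:{document_id}` format and packs a vector into a float32 blob with the standard `struct` module. The helper names are hypothetical, and little-endian byte order is an assumption chosen to match the `"<"` format used in the vector search query later on this page.

```python
import struct


def document_key(index_name: str, document_id: str) -> str:
    """Hypothetical helper: key in the documented {index_name}:{document_id} format."""
    return f"{index_name}:{document_id}"


def pack_float32(vector: list[float]) -> bytes:
    """Pack floats as a little-endian float32 blob (4 bytes per element)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def unpack_float32(blob: bytes) -> list[float]:
    """Inverse of pack_float32."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


key = document_key("documents", "doc-1")
blob = pack_float32([1.0, 0.5, -2.0])
print(key, len(blob))
```

A blob holds four bytes per dimension, so a 384-dimensional embedding occupies 1536 bytes in the `vector` hash field.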

### Example

```python
from glide import GlideClient
import cocoindex as coco
from cocoindex.connectors import valkey
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
from typing import AsyncIterator

VALKEY_DB = coco.ContextKey[GlideClient]("main_vectors")
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder")

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    config = valkey.create_client_config("localhost", 6379)
    client = await GlideClient.create(config)
    builder.provide(VALKEY_DB, client)
    builder.provide(EMBEDDER, SentenceTransformerEmbedder("all-MiniLM-L6-v2"))
    yield
    await client.close()

@coco.fn
async def process_document(
    doc_id: str,
    text: str,
    target: valkey.IndexTarget,
) -> None:
    embedder = coco.use_context(EMBEDDER)
    embedding = await embedder.embed(text)

    target.declare_document(valkey.Document(
        id=doc_id,
        vector=embedding.tolist(),
        payload={"text": text},
    ))

@coco.fn
async def app_main() -> None:
    embedder = coco.use_context(EMBEDDER)

    index = await valkey.mount_index_target(
        VALKEY_DB,
        "documents",
        await valkey.IndexSchema.create(
            vectors=valkey.VectorDef(schema=embedder, distance="cosine"),
        ),
    )

    for doc_id, text in documents:
        await coco.mount(
            coco.component_subpath("doc", doc_id),
            process_document,
            doc_id,
            text,
            index,
        )
```

## Vector search

The connector focuses on writing documents to Valkey. For vector search queries, use the Valkey GLIDE client directly:

```python
from glide import GlideClient
from glide.async_commands import ft
from glide.async_commands.ft import FtSearchOptions
import struct

# Build a KNN query; `embedding` is the query vector as a sequence of floats
query_vec = struct.pack(f"<{len(embedding)}f", *embedding)
query = "*=>[KNN 10 @vector $query_vec AS score]"

results = await ft.search(
    client,
    "documents",
    query,
    options=FtSearchOptions(params={"query_vec": query_vec}),
)
```

---

# zvec connector

Source: https://cocoindex.io/docs/connectors/zvec/

The `zvec` connector writes documents to [zvec](https://zvec.org), an embedded, in-process vector database. zvec runs inside your application — no server or daemon — and stores each collection in a directory on disk.

```python
from cocoindex.connectors import zvec
```

**Note — Installation**
zvec is an optional dependency:

```bash
pip install cocoindex[zvec]
```

## Connection setup

### connect

`connect()` creates a `ManagedConnection` rooted at a base directory. Each collection lives in a subdirectory under it.

```python
def connect(base_path: str | Path, *, enable_mmap: bool = True) -> ManagedConnection
```

**Parameters:**

- `base_path` — Directory under which collections are stored. Created if missing.
- `enable_mmap` — Whether zvec uses memory-mapped I/O for data files.

### ManagedConnection

A handle to the base directory. zvec takes an exclusive write lock per open collection, so `ManagedConnection` caches open handles by collection name and reuses them.

**Methods:**

- `collection_path(name)` — Path to a collection's directory.
- `close()` — Release all open collection handles (drops their write locks).

For a lifespan, use `managed_connection()`, which closes handles on exit:

```python
def managed_connection(
    base_path: str | Path, *, enable_mmap: bool = True
) -> Iterator[ManagedConnection]
```

## As target

The `zvec` connector tracks which documents should exist in a collection and automatically handles upserts and deletions. zvec's native upsert is used directly, and documents are removed by id when they are no longer declared.

### Declaring target states

#### Setting up a connection

Create a `ContextKey[zvec.ManagedConnection]` to identify your connection, then provide it in your lifespan:

**Note**
The key name is load-bearing across runs — it's the stable identity CocoIndex uses to track managed documents. See [ContextKey as stable identity](../programming_guide/context#contextkey-as-stable-identity) before renaming.

```python
from typing import Iterator

import cocoindex as coco
from cocoindex.connectors import zvec

ZVEC_DB = coco.ContextKey[zvec.ManagedConnection]("main_db")

@coco.lifespan
def coco_lifespan(builder: coco.EnvironmentBuilder) -> Iterator[None]:
    with zvec.managed_connection("./zvec_data") as conn:
        builder.provide(ZVEC_DB, conn)
        yield
```

#### Collections (parent state)

Declares a collection as a target state. Returns a `CollectionTarget` for declaring documents.

```python
def declare_collection_target(
    db: ContextKey[ManagedConnection],
    collection_name: str,
    schema: CollectionSchema[RowT],
    *,
    managed_by: Literal["system", "user"] = "system",
) -> CollectionTarget[RowT, coco.PendingS]
```

**Parameters:**

- `db` — A `ContextKey[ManagedConnection]` identifying the connection.
- `collection_name` — Name of the collection (a subdirectory under the connection's base path).
- `schema` — Schema definition (see [Collection schema](#collection-schema-from-python-class)).
- `managed_by` — Whether CocoIndex manages the collection lifecycle (`"system"`, creating and destroying it) or assumes it already exists (`"user"`, documents only).

**Returns:** A pending `CollectionTarget`. Use `await zvec.mount_collection_target(ZVEC_DB, collection_name, schema)` to resolve.

#### Documents (child states)

Once a `CollectionTarget` is resolved, declare documents to be upserted:

```python
def CollectionTarget.declare_row(self, *, row: RowT) -> None
```

The primary-key value becomes the document `id` (converted to `str`).

### Collection schema: from Python class

Define the collection structure using a Python class (dataclass, NamedTuple, or Pydantic model):

```python
@classmethod
async def CollectionSchema.from_class(
    cls,
    record_type: type[RowT],
    primary_key: list[str],
    *,
    column_overrides: dict[str, ZvecType | ZvecVectorDef | ZvecFtsType | VectorSchemaProvider] | None = None,
) -> CollectionSchema[RowT]
```

**Parameters:**

- `record_type` — A record type whose fields define the document structure.
- `primary_key` — Exactly one column name. Its value becomes the document `id`.
- `column_overrides` — Optional per-column overrides for type mapping or vector configuration.

**Note — Single primary key**
zvec documents have a single string `id`, so `primary_key` must name exactly one column. Its value is converted to `str` to form the id. Composite primary keys are not supported.

**Note — At least one vector field**
zvec is a vector database: every collection must declare at least one vector field (dense or sparse).

**Example:**

```python
from dataclasses import dataclass
from typing import Annotated
import numpy as np
from numpy.typing import NDArray
from cocoindex.connectors import zvec
from cocoindex.resources.schema import VectorSchema

@dataclass
class Doc:
    id: str
    title: str
    year: int
    embedding: Annotated[NDArray[np.float32], VectorSchema(dtype=np.dtype(np.float32), size=384)]

schema = await zvec.CollectionSchema.from_class(Doc, primary_key=["id"])
```

Scalar Python types map to zvec field types as follows:

| Python Type | zvec `DataType` |
|-------------|-----------------|
| `bool` | `BOOL` |
| `int` | `INT64` |
| `float` | `DOUBLE` |
| `str` | `STRING` |
| `bytes` | `STRING` (base64) |
| `uuid.UUID` | `STRING` |
| `decimal.Decimal` | `STRING` |
| `datetime.date` / `time` / `datetime` | `STRING` (ISO format) |
| `datetime.timedelta` | `DOUBLE` (total seconds) |
| `list[str]` / `list[int]` / `list[float]` / `list[bool]` | `ARRAY_STRING` / `ARRAY_INT64` / `ARRAY_DOUBLE` / `ARRAY_BOOL` |
| other `list`, `dict`, nested structs | `STRING` (JSON) |
| `NDArray` (with vector schema) | `VECTOR_FP32` (float32) or `VECTOR_FP16` (float16) |

Scalar fields get an inverted index by default so they can be used in query filters. The primary-key column maps to the document `id` and is not stored as a separate field.
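
To make the `STRING` and `DOUBLE` encodings in the table concrete, here is a minimal sketch using only the standard library. It is an illustration, not the connector's encoder: the real mapping is driven by the declared field types, while this hypothetical `encode_value` helper inspects runtime values for brevity.

```python
import base64
import datetime
import decimal
import json
import uuid


def encode_value(value: object) -> object:
    """Encode a Python value roughly as the mapping table describes."""
    if isinstance(value, (bool, int, float, str)):
        return value  # BOOL / INT64 / DOUBLE / STRING pass through
    if isinstance(value, bytes):
        return base64.b64encode(value).decode("ascii")  # STRING (base64)
    if isinstance(value, (uuid.UUID, decimal.Decimal)):
        return str(value)  # STRING
    if isinstance(value, datetime.timedelta):
        return value.total_seconds()  # DOUBLE (total seconds)
    if isinstance(value, (datetime.datetime, datetime.date, datetime.time)):
        return value.isoformat()  # STRING (ISO format)
    if isinstance(value, list) and all(
        isinstance(v, (str, int, float, bool)) for v in value
    ):
        return value  # primitive lists correspond to the ARRAY_* types
    return json.dumps(value)  # other lists, dicts, nested structs: STRING (JSON)


print(encode_value(b"hi"))                                # aGk=
print(encode_value(datetime.timedelta(minutes=1, seconds=30)))  # 90.0
print(encode_value(datetime.date(2024, 1, 2)))            # 2024-01-02
```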

#### ZvecType

Override the scalar type, encoder, or indexing for a field:

```python
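from dataclasses import dataclass
from typing import Annotated

import numpy as np
import zvec
from numpy.typing import NDArray
from cocoindex.connectors.zvec import ZvecType
from cocoindex.resources.schema import VectorSchema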

@dataclass
class MyRow:
    id: str
    # Store as INT32 instead of INT64, without a filter index.
    count: Annotated[int, ZvecType(zvec.DataType.INT32, indexed=False)]
    embedding: Annotated[NDArray[np.float32], VectorSchema(dtype=np.dtype(np.float32), size=384)]
```

#### ZvecFtsType

Mark a `str` field as full-text (FTS) indexed. The field is stored as a `STRING` field with a zvec FTS index, so you can run full-text match queries against it directly through zvec. Requires zvec >= 0.5.

```python
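from dataclasses import dataclass
from typing import Annotated

import numpy as np
from numpy.typing import NDArray
from cocoindex.connectors.zvec import ZvecFtsType
from cocoindex.resources.schema import VectorSchema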

@dataclass
class Doc:
    id: str
    body: Annotated[str, ZvecFtsType(tokenizer_name="standard", filters=("lowercase",))]
    embedding: Annotated[NDArray[np.float32], VectorSchema(dtype=np.dtype(np.float32), size=384)]
```

`ZvecFtsType` options: `tokenizer_name` (e.g. `"standard"`, `"jieba"`), `filters` (token filters applied after tokenization, default `("lowercase",)`), and `extra_params`. The connector writes the field and its FTS index; querying it (for example with `zvec.Query(field_name=..., fts=zvec.Fts(match_string=...))`) happens directly against zvec, since this connector only handles the write path.

### Vectors

A collection can declare multiple named vector fields, dense and sparse, in one schema. zvec supports querying across them with reranking at read time.

#### Dense vectors

A NumPy `ndarray` field with a `VectorSchema` becomes a dense vector. The element dtype selects the zvec type: `float32` → `VECTOR_FP32`, `float16` → `VECTOR_FP16`. zvec's dense index only accepts these two; for smaller storage, keep a float32 vector and set `quantize`. Tune the HNSW index with `ZvecVectorDef`:

```python
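from dataclasses import dataclass
from typing import Annotated

import numpy as np
from numpy.typing import NDArray
from cocoindex.connectors.zvec import ZvecVectorDef
from cocoindex.resources.schema import VectorSchema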

@dataclass
class Doc:
    id: str
    embedding: Annotated[
        NDArray[np.float32],
        VectorSchema(dtype=np.dtype(np.float32), size=384),
        ZvecVectorDef(metric="cosine", quantize="int8"),
    ]
```

`ZvecVectorDef` options: `metric` (`"cosine"`, `"ip"`, `"l2"`) and `quantize` (`"none"`, `"fp16"`, `"int8"`, `"int4"`).

#### Sparse vectors

Mark a `dict[int, float]` field (mapping dimension → weight) as sparse with `ZvecVectorDef(sparse=True)`:

```python
@dataclass
class Doc:
    id: str
    sparse: Annotated[dict[int, float], ZvecVectorDef(sparse=True)]
```
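
For readers unfamiliar with the `dict[int, float]` shape, the sketch below builds one from term frequencies. It is only an illustration of the data structure: the vocabulary and the `to_sparse` helper are hypothetical, and in practice the weights would typically come from a sparse encoder rather than raw counts.

```python
from collections import Counter

# Hypothetical fixed vocabulary: token -> dimension index.
VOCAB = {"vector": 0, "database": 1, "index": 2, "query": 3}


def to_sparse(text: str) -> dict[int, float]:
    """Map text to {dimension: weight} using normalized term frequency."""
    tokens = [t for t in text.lower().split() if t in VOCAB]
    counts = Counter(tokens)
    total = sum(counts.values())
    # Only non-zero dimensions are stored, which is what makes the vector sparse.
    return {VOCAB[tok]: n / total for tok, n in counts.items()}


print(to_sparse("Vector index for a vector database"))
# {0: 0.5, 2: 0.25, 1: 0.25}
```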

## Full example

```python
import pathlib
from dataclasses import dataclass
from typing import Annotated, Iterator

import cocoindex as coco
import numpy as np
from numpy.typing import NDArray
from cocoindex.connectors import zvec
from cocoindex.resources.schema import VectorSchema

ZVEC_DB = coco.ContextKey[zvec.ManagedConnection]("main_db")


@dataclass
class Doc:
    id: str
    title: str
    embedding: Annotated[
        NDArray[np.float32], VectorSchema(dtype=np.dtype(np.float32), size=384)
    ]


@coco.lifespan
def coco_lifespan(builder: coco.EnvironmentBuilder) -> Iterator[None]:
    with zvec.managed_connection("./zvec_data") as conn:
        builder.provide(ZVEC_DB, conn)
        yield


@coco.fn
async def index_docs(docs: list[Doc]) -> None:
    target = await zvec.mount_collection_target(
        ZVEC_DB,
        "docs",
        await zvec.CollectionSchema.from_class(Doc, primary_key=["id"]),
    )
    for doc in docs:
        target.declare_row(row=doc)
```

---

# Entity resolution

Source: https://cocoindex.io/docs/ops/entity_resolution/

The `cocoindex.ops.entity_resolution` package resolves a set of raw entity names into a deduplicated canonical map. It finds near-duplicate names via embedding similarity (FAISS), then asks a pair-resolver (typically an LLM) to confirm matches and pick the better canonical name.

```python
from cocoindex.ops.entity_resolution import resolve_entities
```

**Note — Dependencies**
This module requires additional dependencies. Install with:

```bash
# With built-in LLM resolver (recommended)
pip install "cocoindex[entity_resolution_llm]"

# Core only (for custom resolver implementations)
pip install "cocoindex[entity_resolution]"
```

## Basic usage

```python
import cocoindex as coco
from cocoindex.ops.entity_resolution import resolve_entities
from cocoindex.ops.entity_resolution.llm_resolver import LlmPairResolver
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder

EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder", detect_change=True)
LLM_MODEL = coco.ContextKey[str]("llm_model", detect_change=True)

@coco.fn(memo=True)
async def resolve_my_entities(raw_entities: set[str]) -> dict[str, str | None]:
    result = await resolve_entities(
        entities=raw_entities,
        embedder=coco.use_context(EMBEDDER),
        resolve_pair=LlmPairResolver(model=coco.use_context(LLM_MODEL)),
    )
    return result.to_dict()
```

## `ResolvedEntities`

`resolve_entities` returns a `ResolvedEntities` object — a read-only wrapper around the dedup map with safe chain-walking:

```python
result = await resolve_entities(...)

result.canonical_of("Microsoft Corp.")  # -> "Microsoft"
result.canonical_of("Microsoft")        # -> "Microsoft" (self-canonical)
result.canonicals()                     # -> {"Microsoft", "OpenAI", ...}
result.groups()                         # -> {"Microsoft": {"Microsoft", "Microsoft Corp."}, ...}
result.to_dict()                        # -> {"Microsoft": None, "Microsoft Corp.": "Microsoft", ...}
```

## Existing-canonical handling

If some entity names are already established as canonical (e.g., they have on-disk files or database records), you can pass `is_existing_canonical` to influence how matches involving those names are resolved.

### Without `is_existing_canonical` (default)

The resolver decides which side becomes canonical on every match, and no name receives special treatment.

```python
result = await resolve_entities(
    entities=raw_entities,
    embedder=embedder,
    resolve_pair=resolver,
)
```

### With `is_existing_canonical`

Pass a sync predicate that returns `True` for names you consider already-established canonicals. The `existing_policy` parameter controls how strongly that status is enforced:

#### `PINNED` (default)

Existing canonicals are immutable. They are seeded directly as canonicals without consulting the resolver, and two existing canonicals never merge. All other entities are resolved in a second pass; if one of them matches an existing canonical, the existing canonical always wins.

```python
result = await resolve_entities(
    entities=raw_entities,
    embedder=embedder,
    resolve_pair=resolver,
    is_existing_canonical=lambda name: name in existing_files,
    # existing_policy defaults to PINNED
)
```

#### `PREFERRED`

A softer policy: existing-canonical status breaks ties, but the resolver is always consulted. When exactly one side of a match is an existing canonical, that side wins regardless of the resolver's verdict. When both sides or neither side is an existing canonical, the resolver decides.

```python
from cocoindex.ops.entity_resolution import ExistingCanonicalPolicy

result = await resolve_entities(
    entities=raw_entities,
    embedder=embedder,
    resolve_pair=resolver,
    is_existing_canonical=lambda name: name in existing_files,
    existing_policy=ExistingCanonicalPolicy.PREFERRED,
)
```
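
The tie-breaking rule for `PREFERRED` can be written out as a small decision function. This is a sketch of the rule as described above, not the library's code; `preferred_winner` and its `resolver_pick` argument (the name the resolver would have chosen as canonical) are hypothetical.

```python
from collections.abc import Callable


def preferred_winner(
    entity: str,
    candidate: str,
    resolver_pick: str,
    is_existing_canonical: Callable[[str], bool],
) -> str:
    """Pick the canonical name for a confirmed match under a PREFERRED-style rule."""
    entity_existing = is_existing_canonical(entity)
    candidate_existing = is_existing_canonical(candidate)
    if entity_existing != candidate_existing:
        # Exactly one side is an existing canonical: it wins outright.
        return entity if entity_existing else candidate
    # Both or neither are existing canonicals: defer to the resolver's choice.
    return resolver_pick


existing = {"Microsoft"}
print(preferred_winner(
    "Microsoft Corp.", "Microsoft", "Microsoft Corp.", lambda n: n in existing
))  # Microsoft
```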

## Events

Pass `on_resolution` to receive one `ResolutionEvent` per entity. Events are delivered after resolution finishes, in the deterministic order a sequential implementation would have emitted them (`PREFERRED`: sorted by entity; `PINNED`: existing canonicals from the first pass, then the remaining entities from the second pass):

```python
from cocoindex.ops.entity_resolution import ResolutionEvent

def log_resolution(event: ResolutionEvent) -> None:
    if event.decision and event.decision.matched:
        print(f"  {event.entity!r} -> {event.canonical!r}")

result = await resolve_entities(
    entities=raw_entities,
    embedder=embedder,
    resolve_pair=resolver,
    on_resolution=log_resolution,
)
```

Each event includes:
- `entity` / `canonical` — the entity and its resolved canonical
- `candidates` — what was passed to the resolver (empty if no resolver call)
- `decision` — the resolver's raw verdict (compare with `canonical` to detect policy overrides)
- `repointed` — if a prior canonical was demoted, its name
- `seeded` — `True` for pinned existing-canonical entities seeded without resolver

## `resolve_entities` parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `entities` | `Iterable[str]` | *(required)* | Raw entity names. Duplicates collapsed; iterated in sorted order. |
| `embedder` | [`Embedder`](../common_resources/data_types#embedder) | *(required)* | Async single-text embedder. `LiteLLMEmbedder` and `SentenceTransformerEmbedder` both work. |
| `resolve_pair` | `PairResolver` | *(required)* | Pair-resolution callback. See [`LlmPairResolver`](#llmpairresolver) for the built-in option. |
| `is_existing_canonical` | `Callable[[str], bool] \| None` | `None` | Sync predicate for existing-canonical detection. |
| `existing_policy` | `ExistingCanonicalPolicy` | `PINNED` | How to treat existing canonicals. Ignored when `is_existing_canonical` is `None`. |
| `on_resolution` | `Callable[[ResolutionEvent], None] \| None` | `None` | Sync callback fired once per entity after resolution finishes, in deterministic per-policy order. |
| `max_distance` | `float` | `0.3` | Maximum cosine distance for a name to count as a candidate. The default `0.3` corresponds to cosine similarity >= 0.7. |
| `top_n` | `int` | `5` | Max candidates surfaced per entity. |
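
The relationship between `max_distance` and similarity is simply distance = 1 - similarity. The sketch below works through that arithmetic in plain Python; it is illustrative only (the package performs the actual lookup with FAISS), and `cosine_distance` and `is_candidate` are hypothetical helper names.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def is_candidate(a: list[float], b: list[float], max_distance: float = 0.3) -> bool:
    """A pair qualifies when its distance is within the threshold."""
    return cosine_distance(a, b) <= max_distance


query = [1.0, 0.0]
print(is_candidate(query, [0.8, 0.6]))  # similarity 0.8, distance 0.2 -> True
print(is_candidate(query, [0.6, 0.8]))  # similarity 0.6, distance 0.4 -> False
```

Lowering `max_distance` makes matching stricter and reduces the number of pairs sent to the resolver.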

---

## `LlmPairResolver`

`cocoindex.ops.entity_resolution.llm_resolver` provides `LlmPairResolver`, a built-in resolver that uses an LLM (via [litellm](https://docs.litellm.ai/)) to decide pair matches. It sends each `(entity, candidates)` pair to the model and returns a structured decision — with automatic validation and retry when the model hallucinates a name not in the candidate list.

### Usage

```python
from cocoindex.ops.entity_resolution import resolve_entities
from cocoindex.ops.entity_resolution.llm_resolver import LlmPairResolver

resolver = LlmPairResolver(model="openai/gpt-4o-mini")

result = await resolve_entities(
    entities={"Barack Obama", "Barack H. Obama", "OpenAI"},
    embedder=embedder,
    resolve_pair=resolver,
)
```

### Model strings

`LlmPairResolver` uses the same litellm model-string format as [`LiteLLMEmbedder`](./litellm):

| Provider | Example |
|----------|---------|
| OpenAI | `openai/gpt-4o-mini` |
| Anthropic | `anthropic/claude-haiku-4-5` |
| Google (Gemini) | `gemini/gemini-2.0-flash` |
| Groq | `groq/llama-3.3-70b-versatile` |

See the [LiteLLM docs](https://docs.litellm.ai/docs/providers) for the full list of 100+ supported providers.

### Entity type hints

Pass `entity_type` to tailor the prompt for specific entity categories:

```python
person_resolver = LlmPairResolver(model="openai/gpt-4o-mini", entity_type="person")
tech_resolver = LlmPairResolver(model="openai/gpt-4o-mini", entity_type="technology")
```

This adds context to the system prompt, helping the model make better judgments (e.g., being more conservative with personal names).

### Extra guidance

Append domain-specific rules to the default prompt via `extra_guidance`:

```python
resolver = LlmPairResolver(
    model="openai/gpt-4o-mini",
    entity_type="organization",
    extra_guidance=(
        "A parent organization and its subsidiary/division are DISTINCT things. "
        "'Amazon' is not the same as 'AWS'. 'Google' is not the same as 'YouTube'."
    ),
)
```

`extra_guidance` is for **domain rules only** — do not include output-format instructions.

### Validation and retry

After each LLM call, the response is validated:

1. **Structural**: the JSON response must parse into the expected schema.
2. **Semantic**: `matched` (if non-null) must be one of the supplied `candidates`. If it is not, the LLM is re-prompted with explicit feedback and the conversation continues.

The default retry budget is 2. If the budget is exhausted, the resolver returns no match.

```python
resolver = LlmPairResolver(model="openai/gpt-4o-mini", retries=3)  # more retries
resolver = LlmPairResolver(model="openai/gpt-4o-mini", retries=1)  # fewer retries
```

### Memoization

Each unique `(entity, candidates)` pair's decision is persisted across runs via [`@coco.fn(memo=True)`](../programming_guide/function). Changing the model or prompt invalidates the cache. No additional memoization wrapper is needed.

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model` | `str` | *(required)* | Litellm model string (e.g. `"openai/gpt-4o-mini"`). |
| `entity_type` | `str \| None` | `None` | Entity type hint woven into the prompt. |
| `extra_guidance` | `str \| None` | `None` | Domain rules appended to the default prompt. |
| `retries` | `int` | `2` | Max retries on invalid `matched` output. |

---

## Custom resolvers

For cases where the built-in `LlmPairResolver` doesn't fit (rule-based matching, different LLM framework, etc.), implement the `PairResolver` protocol:

```python
from cocoindex.ops.entity_resolution import PairResolver, PairDecision, CanonicalSide

class MyResolver:
    async def __call__(self, entity: str, candidates: list[str]) -> PairDecision:
        # `candidates` is a non-empty, de-duplicated list of canonical names
        # with cosine similarity to `entity` above the threshold.
        # The matching rules below are illustrative placeholders; swap in your own.
        name = entity.casefold()
        for candidate in candidates:
            if candidate.casefold() == name:
                # Match: existing candidate stays canonical (default).
                return PairDecision(matched=candidate)
            if candidate.casefold().startswith(name + " "):
                # Match: new entity is the better (shorter) canonical name,
                # e.g. entity "Microsoft" vs candidate "Microsoft Corp.".
                return PairDecision(matched=candidate, canonical=CanonicalSide.NEW)

        # No match:
        return PairDecision()

Any async callable with the signature `(entity: str, candidates: list[str]) -> PairDecision` works — no subclassing required. `PairDecision.canonical` is advisory: the `existing_policy` may override it (see [Existing-canonical handling](#existing-canonical-handling)).

---

# LiteLLM embeddings & speech-to-text

Source: https://cocoindex.io/docs/ops/litellm/

The `cocoindex.ops.litellm` module provides integration with the [LiteLLM](https://docs.litellm.ai/) library for text embeddings (`LiteLLMEmbedder`) and speech-to-text transcription (`LiteLLMTranscriber`).

```python
from cocoindex.ops.litellm import LiteLLMEmbedder, LiteLLMTranscriber
```

**Note — Dependencies**
This module requires additional dependencies. Install with:

```bash
pip install "cocoindex[litellm]"
```

## Overview

The `LiteLLMEmbedder` class is a wrapper around LiteLLM's embedding API that:

- Implements `VectorSchemaProvider` for seamless integration with CocoIndex connectors
- Supports 100+ embedding providers (OpenAI, Azure, Vertex AI, Cohere, Bedrock, etc.) through a unified API
- Provides a simple async `embed()` method
- Passes through all additional arguments to the LiteLLM embedding API
- Returns properly typed numpy arrays

## Basic usage

### Creating an embedder

All extra keyword arguments are passed through to every `litellm.aembedding` call. See [Supported providers](#supported-providers) for provider-specific model strings and configuration.

```python
from cocoindex.ops.litellm import LiteLLMEmbedder

embedder = LiteLLMEmbedder("text-embedding-3-small")

# With explicit API key and base URL
embedder = LiteLLMEmbedder("text-embedding-3-small", api_key="sk-...", api_base="https://my-proxy.example.com")

# With custom dimensions (OpenAI text-embedding-3 models)
embedder = LiteLLMEmbedder("text-embedding-3-small", dimensions=512)

# With a timeout (seconds)
embedder = LiteLLMEmbedder("text-embedding-3-small", timeout=30)
```

### Embedding text

The `embed()` method converts text into a `numpy.ndarray` of `float32`. It's an async method — use `await` when calling it:

```python
# In a CocoIndex function
embedding = await embedder.embed("Hello, world!")

# Use the embedding in a dataclass row, store in a vector database, etc.
table.declare_row(row=DocEmbedding(text="Hello, world!", embedding=embedding))
```

### Using as a type annotation

The `LiteLLMEmbedder` implements [`VectorSchemaProvider`](../common_resources/vector_schema#vectorschemaprovider), which means it can be used directly as metadata in `Annotated` type annotations. This is the recommended way to declare vector columns — CocoIndex connectors automatically extract the vector dimension and dtype from the annotation when creating tables.

```python
from dataclasses import dataclass
from typing import Annotated
from numpy.typing import NDArray

embedder = LiteLLMEmbedder("text-embedding-3-small")

@dataclass
class DocEmbedding:
    id: int
    filename: str
    text: str
    embedding: Annotated[NDArray, embedder]
```

When you pass this dataclass to a connector's `TableSchema.from_class()`, the connector automatically reads the embedder annotation to determine the vector column's dimension and dtype. For example, with Postgres:

```python
from cocoindex.connectors import postgres

table_schema = await postgres.TableSchema.from_class(
    DocEmbedding,
    primary_key=["id"],
)
target_table = await postgres.mount_table_target(
    PG_DB,
    table_name="doc_embeddings",
    table_schema=table_schema,
    pg_schema_name="my_schema",
)
```

The connector automatically creates the appropriate `vector(N)` column. See the [Connectors](../connectors/postgres) docs for other supported backends (LanceDB, Qdrant, SQLite).

## Speech-to-text

The `LiteLLMTranscriber` class is a wrapper around LiteLLM's transcription API that turns audio into text through any LiteLLM-supported speech-to-text provider (OpenAI Whisper, ElevenLabs, Groq, and more) using a unified interface.

### Creating a transcriber

All extra keyword arguments are passed through to every `litellm.atranscription` call (e.g., `api_key`, `api_base`, `language`, `extra_body`).

```python
from cocoindex.ops.litellm import LiteLLMTranscriber

transcriber = LiteLLMTranscriber("whisper-1")

# With an explicit API key
transcriber = LiteLLMTranscriber("whisper-1", api_key="sk-...")

# With a default language hint applied to every call
transcriber = LiteLLMTranscriber("whisper-1", language="en")
```

### Transcribing audio

The `transcribe()` method takes a `FileLike` object (such as a `localfs.File`) containing audio data and returns the transcribed text as a `str`. It's an async method — use `await` when calling it. Per-call keyword arguments override the defaults provided at construction time:

```python
# In a CocoIndex function, `file` is a FileLike (e.g. localfs.File)
transcript = await transcriber.transcribe(file)

# Override or add per-call options
transcript = await transcriber.transcribe(file, response_format="verbose_json")
```
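
The override behavior amounts to a dictionary merge in which per-call values win. The class below is a minimal sketch of that pattern, not `LiteLLMTranscriber` itself; `DefaultedCall` and `build_kwargs` are hypothetical names.

```python
class DefaultedCall:
    """Hold constructor-time defaults and merge per-call overrides on top."""

    def __init__(self, model: str, **defaults: object) -> None:
        self.model = model
        self.defaults = defaults

    def build_kwargs(self, **per_call: object) -> dict[str, object]:
        # Later keys win, so per-call values override constructor defaults.
        return {"model": self.model, **self.defaults, **per_call}


call = DefaultedCall("whisper-1", language="en")
print(call.build_kwargs(language="fr", response_format="verbose_json"))
# {'model': 'whisper-1', 'language': 'fr', 'response_format': 'verbose_json'}
```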

A complete pipeline that walks local audio files, transcribes each one, and stores the transcripts in Postgres is available in the [`audio_to_text` example](https://github.com/cocoindex-io/cocoindex/tree/main/examples/audio_to_text).

### Supported transcription providers

`LiteLLMTranscriber` accepts any model string supported by LiteLLM's transcription API. A few common examples:

| Model | Model string | Environment variables |
|-------|-------------|-----------------------|
| OpenAI Whisper | `whisper-1` | `OPENAI_API_KEY` |
| ElevenLabs Scribe | `elevenlabs/scribe_v1` | `ELEVENLABS_API_KEY` |
| Groq Whisper Large V3 | `groq/whisper-large-v3` | `GROQ_API_KEY` |

See the [LiteLLM audio transcription docs](https://docs.litellm.ai/docs/audio_transcription) for the full list of providers and model strings.

## Supported providers

Below are common providers with their model strings and configuration. The `litellm` module is re-exported from `cocoindex.ops.litellm` for setting provider-specific variables. See the [LiteLLM embedding docs](https://docs.litellm.ai/docs/embedding/supported_embedding) for the full list.

### Ollama

| Model                    | Model string                  |
|--------------------------|-------------------------------|
| Nomic Embed Text         | `ollama/nomic-embed-text`     |
| MXBai Embed Large        | `ollama/mxbai-embed-large`    |
| All MiniLM               | `ollama/all-minilm`           |
| Snowflake Arctic Embed   | `ollama/snowflake-arctic-embed` |
| BGE M3                   | `ollama/bge-m3`               |

No API key required. Ollama must be running locally (default `http://localhost:11434`). Pull the model first with `ollama pull <model-name>`.

```python
embedder = LiteLLMEmbedder("ollama/nomic-embed-text", api_base="http://localhost:11434")
```

### OpenAI

| Model | Model string |
|-------|-------------|
| Text Embedding 3 Small | `text-embedding-3-small` |
| Text Embedding 3 Large | `text-embedding-3-large` |
| Text Embedding Ada 002 | `text-embedding-ada-002` |

**Environment variables:** `OPENAI_API_KEY`

```python
embedder = LiteLLMEmbedder("text-embedding-3-small")
```

### Azure OpenAI

| Model | Model string |
|-------|-------------|
| Text Embedding 3 Small | `azure/<your-deployment-name>` |
| Text Embedding Ada 002 | `azure/<your-deployment-name>` |

The model string uses your Azure deployment name, not the OpenAI model name.

**Environment variables:** `AZURE_API_KEY`, `AZURE_API_BASE`, `AZURE_API_VERSION`

```python
embedder = LiteLLMEmbedder(
    "azure/my-deployment-name",
    api_key="your-azure-api-key",
    api_base="https://my-resource.openai.azure.com",
    api_version="2024-02-01",
)
```

### Gemini (Google AI Studio)

| Model | Model string |
|-------|-------------|
| Text Embedding 004 | `gemini/text-embedding-004` |

**Environment variables:** `GEMINI_API_KEY`

```python
embedder = LiteLLMEmbedder("gemini/text-embedding-004")
```

### Vertex AI

| Model | Model string |
|-------|-------------|
| Text Embedding 004 | `vertex_ai/text-embedding-004` |
| Text Multilingual Embedding 002 | `vertex_ai/text-multilingual-embedding-002` |
| Textembedding Gecko | `vertex_ai/textembedding-gecko` |

**Environment variables:** `GOOGLE_APPLICATION_CREDENTIALS` (path to service account JSON)

**Additional configuration:** Set project and location via the `litellm` module or environment variables `VERTEXAI_PROJECT` and `VERTEXAI_LOCATION`:

```python
from cocoindex.ops.litellm import LiteLLMEmbedder, litellm

litellm.vertex_project = "my-gcp-project"
litellm.vertex_location = "us-central1"

embedder = LiteLLMEmbedder("vertex_ai/text-embedding-004")
```

### AWS Bedrock

| Model | Model string |
|-------|-------------|
| Titan Text Embeddings V2 | `bedrock/amazon.titan-embed-text-v2:0` |
| Titan Text Embeddings V1 | `bedrock/amazon.titan-embed-text-v1` |
| Cohere Embed English | `bedrock/cohere.embed-english-v3` |
| Cohere Embed Multilingual | `bedrock/cohere.embed-multilingual-v3` |

**Environment variables:** `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION_NAME`

```python
embedder = LiteLLMEmbedder("bedrock/amazon.titan-embed-text-v2:0")
```

### Mistral AI

| Model | Model string |
|-------|-------------|
| Mistral Embed | `mistral/mistral-embed` |

**Environment variables:** `MISTRAL_API_KEY`

```python
embedder = LiteLLMEmbedder("mistral/mistral-embed")
```

### Voyage AI

| Model | Model string |
|-------|-------------|
| Voyage 3.5 | `voyage/voyage-3.5` |
| Voyage 3.5 Lite | `voyage/voyage-3.5-lite` |
| Voyage Code 3 | `voyage/voyage-code-3` |

**Environment variables:** `VOYAGE_API_KEY`

```python
embedder = LiteLLMEmbedder("voyage/voyage-3.5")
```

### Cohere

| Model | Model string |
|-------|-------------|
| Embed English V3 | `cohere/embed-english-v3.0` |
| Embed English Light V3 | `cohere/embed-english-light-v3.0` |
| Embed Multilingual V3 | `cohere/embed-multilingual-v3.0` |

**Environment variables:** `COHERE_API_KEY`

**Additional configuration:** V3 models require an `input_type` parameter (defaults to `"search_document"`; use `"search_query"` for queries):

```python
embedder = LiteLLMEmbedder("cohere/embed-english-v3.0", input_type="search_document")
```

### Nebius AI

| Model | Model string |
|-------|-------------|
| BGE EN ICL | `nebius/BAAI/bge-en-icl` |

**Environment variables:** `NEBIUS_API_KEY`

```python
embedder = LiteLLMEmbedder("nebius/BAAI/bge-en-icl")
```

---

# Sentence Transformers embeddings

Source: https://cocoindex.io/docs/ops/sentence_transformers/

The `cocoindex.ops.sentence_transformers` module provides integration with the [sentence-transformers](https://www.sbert.net/) library for text embeddings.

```python
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
```

**Note — Dependencies**
This module requires additional dependencies. Install with:

```bash
pip install "cocoindex[sentence_transformers]"
```

## Overview

The `SentenceTransformerEmbedder` class is a wrapper around SentenceTransformer models that:

- Implements `VectorSchemaProvider` for seamless integration with CocoIndex connectors
- Handles model caching and thread-safe GPU access automatically
- Provides a simple `embed()` method
- Returns properly typed numpy arrays

## Basic usage

### Creating an embedder

```python
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder

# Initialize embedder with a pre-trained model
embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")
```

### Embedding text

The `embed()` method converts text into a `numpy.ndarray` of `float32`. It supports both sync and async usage:

```python
# In a CocoIndex function
embedding = await embedder.embed("Hello, world!")

# Use the embedding in a dataclass row, store in a vector database, etc.
table.declare_row(row=CodeEmbedding(code="Hello, world!", embedding=embedding))
```

### Using as a type annotation

The `SentenceTransformerEmbedder` implements [`VectorSchemaProvider`](../common_resources/vector_schema#vectorschemaprovider), which means it can be used directly as metadata in `Annotated` type annotations. This is the recommended way to declare vector columns — CocoIndex connectors automatically extract the vector dimension and dtype from the annotation when creating tables.

```python
from dataclasses import dataclass
from typing import Annotated
from numpy.typing import NDArray

embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")

@dataclass
class CodeEmbedding:
    id: int
    filename: str
    code: str
    embedding: Annotated[NDArray, embedder]  # vector(384) with float32
    start_line: int
    end_line: int
```

When you pass this dataclass to a connector's `TableSchema.from_class()`, the connector automatically reads the embedder annotation to determine the vector column's dimension and dtype. For example, with Postgres:

```python
from cocoindex.connectors import postgres

table_schema = await postgres.TableSchema.from_class(
    CodeEmbedding,
    primary_key=["id"],
)
target_table = await postgres.mount_table_target(
    PG_DB,
    "code_embeddings",
    table_schema,
    pg_schema_name="my_schema",
)
```

The connector automatically creates the appropriate `vector(384)` column. See the [Connectors](../connectors/postgres) docs for other supported backends (LanceDB, Qdrant, SQLite).

## Example: text embedding pipeline

Here's a complete example of a text embedding pipeline (based on the [text_embedding example](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding)):

```python
import pathlib
from dataclasses import dataclass
from typing import Annotated, AsyncIterator

import asyncpg
from numpy.typing import NDArray

import cocoindex as coco
from cocoindex.connectors import localfs, postgres
from cocoindex.ops.text import RecursiveSplitter
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
from cocoindex.resources.chunk import Chunk
from cocoindex.resources.file import FileLike, PatternFilePathMatcher
from cocoindex.resources.id import IdGenerator

PG_DB = coco.ContextKey[asyncpg.Pool]("pg_db")

_embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")
_splitter = RecursiveSplitter()

@dataclass
class DocEmbedding:
    id: int
    filename: str
    chunk_start: int
    chunk_end: int
    text: str
    embedding: Annotated[NDArray, _embedder]

@coco.fn
async def process_chunk(
    chunk: Chunk,
    filename: pathlib.PurePath,
    id_gen: IdGenerator,
    table: postgres.TableTarget[DocEmbedding],
) -> None:
    table.declare_row(
        row=DocEmbedding(
            id=await id_gen.next_id(chunk.text),
            filename=str(filename),
            chunk_start=chunk.start.char_offset,
            chunk_end=chunk.end.char_offset,
            text=chunk.text,
            embedding=await _embedder.embed(chunk.text),
        ),
    )

@coco.fn(memo=True)
async def process_file(
    file: FileLike,
    table: postgres.TableTarget[DocEmbedding],
) -> None:
    text = await file.read_text()
    chunks = _splitter.split(
        text, chunk_size=2000, chunk_overlap=500, language="markdown"
    )
    id_gen = IdGenerator()
    await coco.map(process_chunk, chunks, file.file_path.path, id_gen, table)

@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    target_table = await postgres.mount_table_target(
        PG_DB,
        "doc_embeddings",
        await postgres.TableSchema.from_class(
            DocEmbedding,
            primary_key=["id"],
        ),
        pg_schema_name="public",
    )

    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]),
    )
    await coco.mount_each(process_file, files.items(), target_table)
```

## Configuration options

### Model selection

You can use any model from the [sentence-transformers library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html):

```python
# Small, fast model (384 dimensions)
embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")

# Larger, more accurate model (768 dimensions)
embedder = SentenceTransformerEmbedder("sentence-transformers/all-mpnet-base-v2")

# Multilingual model
embedder = SentenceTransformerEmbedder("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Local model
embedder = SentenceTransformerEmbedder("/path/to/local/model")
```

### Normalization

By default, embeddings are normalized to unit length (suitable for cosine similarity):

```python
# Default: normalized embeddings
embedder = SentenceTransformerEmbedder(
    "sentence-transformers/all-MiniLM-L6-v2",
    normalize_embeddings=True  # Default
)

# Disable normalization if needed
embedder = SentenceTransformerEmbedder(
    "sentence-transformers/all-MiniLM-L6-v2",
    normalize_embeddings=False
)
```
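
To see why unit-length embeddings suit cosine similarity, the sketch below uses plain Python only (no CocoIndex or sentence-transformers calls). The helpers `normalize`, `dot`, and `cosine_similarity` are illustrative names, not library functions. It shows that for normalized vectors the dot product alone equals the cosine similarity.

```python
import math

def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Full cosine formula, valid for vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]

# After normalization, the cheaper dot product gives the same answer.
fast = dot(normalize(a), normalize(b))
full = cosine_similarity(a, b)
print(math.isclose(fast, full))
```

With `normalize_embeddings=False`, a similarity search would need the full cosine formula (or a metric that accounts for vector length) to get comparable scores.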

---

# Text operations

Source: https://cocoindex.io/docs/ops/text/

The `cocoindex.ops.text` module provides operations for text processing.

```python
from cocoindex.ops.text import RecursiveSplitter, SeparatorSplitter
```

Features include:

- Code language detection
- Text chunking and splitting
- Syntax-aware code splitting

## Available functions and classes

### `detect_code_language()`

Detect the programming language from a filename.

**Usage:**

```python
from cocoindex.ops.text import detect_code_language

language = detect_code_language(filename="main.py")
print(language)  # "python"

language = detect_code_language(filename="app.rs")
print(language)  # "rust"

language = detect_code_language(filename="unknown.xyz")
print(language)  # None
```

### `SeparatorSplitter`

Split text by regex separators.

**Usage:**

```python
from cocoindex.ops.text import SeparatorSplitter

splitter = SeparatorSplitter()

text = "First sentence. Second sentence. Third sentence."
chunks = splitter.split(
    text,
    chunk_size=100,
    chunk_overlap=20,
    separators=[r"\.\s+"]  # Split on periods followed by whitespace
)

for chunk in chunks:
    print(chunk.text)
```
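
To preview where a separator pattern would cut, you can try the regex with the standard `re` module first. This only illustrates the pattern itself; it is not how `SeparatorSplitter` works internally, and it ignores `chunk_size` and `chunk_overlap`.

```python
import re

text = "First sentence. Second sentence. Third sentence."
separator = re.compile(r"\.\s+")  # same pattern as in the usage example

# re.split drops the matched separator, so "period + whitespace" is removed.
# The final period stays because no whitespace follows it.
pieces = separator.split(text)
for piece in pieces:
    print(piece)
```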

### `RecursiveSplitter`

Advanced text chunking with language awareness and syntax-aware splitting for code. Returns [`Chunk`](../common_resources/data_types#chunk) objects with position information.

**Features:**
- Supports many programming languages
- Preserves code structure
- Customizable chunk sizes and overlap
- Returns [`Chunk`](../common_resources/data_types#chunk) objects with start/end positions (line, column, byte/char offsets)

**Usage:**

```python
from cocoindex.ops.text import RecursiveSplitter

splitter = RecursiveSplitter()

# Split markdown text
text = "# Title\n\nParagraph 1.\n\nParagraph 2."
chunks = splitter.split(
    text,
    chunk_size=2000,
    chunk_overlap=500,
    language="markdown"
)

for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Start: line {chunk.start.line}, char {chunk.start.char_offset}")
    print(f"End: line {chunk.end.line}, char {chunk.end.char_offset}")
```

**Language-aware code splitting:**

```python
# Split Python code
python_code = '''
def hello():
    print("Hello, world!")

def goodbye():
    print("Goodbye!")
'''

chunks = splitter.split(
    python_code,
    chunk_size=1000,
    min_chunk_size=300,
    chunk_overlap=300,
    language="python"
)
```

**Supported languages:**

Languages with syntax-aware (tree-sitter) splitting — splits at logical boundaries like functions, classes, and blocks:

| Language | `language=` value | Extensions |
|---|---|---|
| Astro | `"astro"` | `.astro` |
| C | `"c"` | `.c`, `.h` |
| C++ | `"cpp"` | `.cpp`, `.cc`, `.cxx`, `.c++` |
| C# | `"c_sharp"` | `.cs` |
| CSS | `"css"` | `.css` |
| Fortran | `"fortran"` | `.f`, `.f90`, `.f95` |
| Go | `"go"` | `.go` |
| HTML | `"html"` | `.html`, `.htm` |
| Java | `"java"` | `.java` |
| JavaScript | `"javascript"` | `.js`, `.mjs`, `.cjs`, `.jsx` |
| JSON | `"json"` | `.json`, `.jsonc` |
| Julia | `"julia"` | `.jl` |
| Kotlin | `"kotlin"` | `.kt`, `.kts` |
| Markdown | `"markdown"` | `.md` |
| Pascal | `"pascal"` | `.pas` |
| PHP | `"php"` | `.php` |
| Python | `"python"` | `.py` |
| R | `"r"` | `.r`, `.R` |
| Ruby | `"ruby"` | `.rb` |
| Rust | `"rust"` | `.rs` |
| Scala | `"scala"` | `.scala` |
| Solidity | `"solidity"` | `.sol` |
| SQL | `"sql"` | `.sql` |
| Svelte | `"svelte"` | `.svelte` |
| Swift | `"swift"` | `.swift` |
| TOML | `"toml"` | `.toml` |
| TSX | `"tsx"` | `.tsx` |
| TypeScript | `"typescript"` | `.ts` |
| Vue | `"vue"` | `.vue` |
| XML | `"xml"` | `.xml` |
| YAML | `"yaml"` | `.yaml`, `.yml` |

Many additional languages use separator-based splitting (e.g. `"bash"`, `"dart"`, `"elixir"`, `"elm"`, `"haskell"`, `"lua"`, `"perl"`, and more). Pass the language name string to `language=`; use [`detect_code_language()`](#detect_code_language) to infer it from a filename.

### `CustomLanguageConfig`

Define custom language splitting rules.

**Usage:**

```python
from cocoindex.ops.text import CustomLanguageConfig, RecursiveSplitter

# Create custom language config for abstracts
abstract_config = CustomLanguageConfig(
    language_name="abstract",
    separators_regex=[
        r"[.?!]+\s+",  # Sentence boundaries
        r"[:;]\s+",     # Clause boundaries
        r",\s+",        # Comma boundaries
        r"\s+",         # Whitespace
    ]
)

splitter = RecursiveSplitter(custom_languages=[abstract_config])

chunks = splitter.split(
    "This is a sample abstract. It has multiple sentences...",
    chunk_size=500,
    chunk_overlap=150,
    language="abstract"
)
```
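
The ordered `separators_regex` list suggests a coarse-to-fine strategy: try sentence boundaries first, and fall back to finer separators only for pieces that are still too long. The sketch below illustrates that general idea with the standard `re` module. It is an assumption about the approach, not CocoIndex's implementation: `split_recursive` is a hypothetical helper, and it ignores overlap and the merging of small pieces.

```python
import re

def split_recursive(text: str, separators: list[str], max_len: int) -> list[str]:
    """Split with the first separator; recurse on pieces that are still too long."""
    if len(text) <= max_len or not separators:
        return [text]
    first, rest = separators[0], separators[1:]
    pieces: list[str] = []
    for part in re.split(first, text):
        if part:  # skip empty strings produced by adjacent separators
            pieces.extend(split_recursive(part, rest, max_len))
    return pieces

separators = [
    r"[.?!]+\s+",  # Sentence boundaries
    r"[:;]\s+",    # Clause boundaries
    r",\s+",       # Comma boundaries
    r"\s+",        # Whitespace
]

chunks = split_recursive(
    "Short one. A much longer sentence, with a clause.",
    separators,
    max_len=25,
)
for chunk in chunks:
    print(chunk)
```

Here the second sentence exceeds `max_len`, so it is split again at the comma, while the first sentence is kept whole.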

## API reference

For detailed API documentation, refer to the module docstrings:

```python
from cocoindex.ops import text

help(text.RecursiveSplitter)
help(text.SeparatorSplitter)
help(text.detect_code_language)
help(text.CustomLanguageConfig)
```

---

# Concurrency control

Source: https://cocoindex.io/docs/advanced_topics/concurrency_control/

CocoIndex executes each [processing component](../programming_guide/processing_component) as a unit of concurrent work. By default, up to **1024** processing components can be in flight per app. When components perform resource-intensive work (e.g., calling external APIs, running ML models), you may want a tighter limit.

## Setting the limit

Set `max_inflight_components` in `AppConfig`:

```python
app = coco.App(
    coco.AppConfig(name="MyPipeline", max_inflight_components=4),
    app_main,
    sourcedir=pathlib.Path("./data"),
)
```

With `max_inflight_components=4`, at most 4 processing components execute at the same time. When a component finishes, the next pending one starts.

Setting `max_inflight_components=1` serializes all components — only one runs at a time.
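
The effect of an in-flight limit can be pictured with a plain `asyncio.Semaphore`. This is a conceptual sketch only: the CocoIndex engine enforces the limit itself, and `run_all` and `component` are hypothetical names.

```python
import asyncio

async def run_all(num_components: int, max_inflight: int) -> int:
    """Run dummy components under a limit and return the peak number in flight."""
    slots = asyncio.Semaphore(max_inflight)
    inflight = 0
    peak = 0

    async def component() -> None:
        nonlocal inflight, peak
        async with slots:              # wait until a slot is free
            inflight += 1
            peak = max(peak, inflight)
            await asyncio.sleep(0.01)  # stand-in for real work
            inflight -= 1

    await asyncio.gather(*(component() for _ in range(num_components)))
    return peak

peak = asyncio.run(run_all(num_components=20, max_inflight=4))
print(peak)
```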

You can also set the limit via the `COCOINDEX_MAX_INFLIGHT_COMPONENTS` environment variable:

```bash
export COCOINDEX_MAX_INFLIGHT_COMPONENTS=4
```

**Precedence:** `AppConfig` value > environment variable > default (1024).
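
The precedence rule can be written out as a small resolver. This is a sketch of the rule as stated above, not CocoIndex code: `resolve_max_inflight` is a hypothetical helper, and parsing the variable with `int()` is an assumption.

```python
import os
from collections.abc import Mapping
from typing import Optional

DEFAULT_MAX_INFLIGHT = 1024
ENV_VAR = "COCOINDEX_MAX_INFLIGHT_COMPONENTS"

def resolve_max_inflight(
    config_value: Optional[int],
    environ: Optional[Mapping[str, str]] = None,
) -> int:
    """Apply: AppConfig value > environment variable > default."""
    env = os.environ if environ is None else environ
    if config_value is not None:
        return config_value          # explicit AppConfig value wins
    raw = env.get(ENV_VAR)
    if raw is not None:
        return int(raw)              # assumption: the variable holds an integer
    return DEFAULT_MAX_INFLIGHT

print(resolve_max_inflight(4, {ENV_VAR: "8"}))
print(resolve_max_inflight(None, {ENV_VAR: "8"}))
print(resolve_max_inflight(None, {}))
```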

## Deadlock prevention

When a parent component mounts a child, the parent releases its concurrency slot so the child can make progress. This prevents deadlocks in nested mount scenarios — even with `max_inflight_components=1`, a parent mounting a child will not block forever.
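
The slot hand-off can be sketched with a semaphore of size one. This is a simplified model under stated assumptions, not the engine's scheduler: `run_component` is a hypothetical function, and it awaits the child inline purely to keep the example short (real mounts are scheduled in the background).

```python
import asyncio

async def run_component(
    slots: asyncio.Semaphore, name: str, children: list[str], log: list[str]
) -> None:
    await slots.acquire()
    log.append(f"start {name}")
    for child in children:
        slots.release()            # parent gives up its slot while the child runs
        try:
            await run_component(slots, child, [], log)
        finally:
            await slots.acquire()  # take a slot back before continuing
    log.append(f"end {name}")
    slots.release()

async def main() -> list[str]:
    log: list[str] = []
    slots = asyncio.Semaphore(1)   # comparable to max_inflight_components=1
    # The timeout would fire if the parent and child deadlocked on the one slot.
    await asyncio.wait_for(
        run_component(slots, "parent", ["child"], log), timeout=1.0
    )
    return log

log = asyncio.run(main())
print(log)
```

If the parent kept its slot while waiting, the child could never acquire one and the `wait_for` call would time out.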

---

# Progress monitoring

Source: https://cocoindex.io/docs/advanced_topics/progress_monitoring/

Most runs only need the basics covered in [App](../programming_guide/app#updating-an-app): awaiting `app.update()` for the result, or the built-in stdout progress display — `report_to_stdout=True` on `app.update_blocking()` (or the CLI), or `await coco.show_progress(handle)` with the async API. `report_to_stdout` accepts a `timedelta` to set the refresh interval (`True` uses the default); `show_progress` takes a `refresh_interval` keyword.

This page covers the **structured** progress APIs for cases that need more: a daemon streaming progress to a client, a custom dashboard, or splitting a large pipeline into independently-reported phases.

## Structured update stats

`app.update()` returns an `UpdateHandle`. Besides being awaitable, it exposes the same stats that drive the terminal display — as Python objects you can read while the update runs.

### Polling

`handle.stats()` returns a snapshot of the current counters as an `UpdateStats` — or `None` before the handle has started running. It keeps working after completion, returning the final stats:

```python
handle = app.update()
result = await handle.result()   # run to completion
final_stats = handle.stats()     # UpdateStats with the final counters
```

`app.update()` starts lazily — the handle begins running on the first `await` (or `watch()`), so `stats()` returns `None` until then. To read stats *while* an update is in flight, poll `handle.stats()` from a separate task, or use `watch()` (below) to receive each change as it happens.

### Streaming

`handle.watch()` is an async iterator that yields an `UpdateSnapshot` whenever the stats change — `RUNNING` while processing, then a final `READY` snapshot carrying the result:

```python
handle = app.update()
async for snapshot in handle.watch():
    print(snapshot.stats.total.num_finished, "items processed")
    # snapshot.status is UpdateStatus.RUNNING or UpdateStatus.READY
    # snapshot.result is set on the final snapshot, when the iterator ends
```

On a processing error, `watch()` raises the exception directly — handle it with a normal `try`/`except` around the loop.

In [live mode](../programming_guide/live_mode), `watch()` does not stop at the first `READY`: it keeps yielding `RUNNING` snapshots as live components deliver incremental updates, ending only when the app stops.

### Stats types

```python
class UpdateStats:
    by_component: dict[str, ComponentStats]  # keyed by processor name
    total: ComponentStats                    # summed across all processors

class ComponentStats:
    num_execution_starts: int
    num_unchanged: int
    num_adds: int
    num_deletes: int
    num_reprocesses: int
    num_errors: int
    # derived helpers:
    num_processed: int    # unchanged + adds + deletes + reprocesses
    num_finished: int     # num_processed + num_errors
    num_in_progress: int  # max(0, execution_starts - num_finished)

class UpdateSnapshot:
    stats: UpdateStats
    status: UpdateStatus      # RUNNING | READY
    result: R | None          # set only on the final snapshot
```
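
The derived helpers follow directly from the raw counters. The stand-in class below reproduces the arithmetic from the comments above so you can check expected values; `ComponentStatsSketch` is a hypothetical name, not the CocoIndex class.

```python
from dataclasses import dataclass

@dataclass
class ComponentStatsSketch:
    num_execution_starts: int = 0
    num_unchanged: int = 0
    num_adds: int = 0
    num_deletes: int = 0
    num_reprocesses: int = 0
    num_errors: int = 0

    @property
    def num_processed(self) -> int:
        # unchanged + adds + deletes + reprocesses
        return (self.num_unchanged + self.num_adds
                + self.num_deletes + self.num_reprocesses)

    @property
    def num_finished(self) -> int:
        # processed items plus items that ended in an error
        return self.num_processed + self.num_errors

    @property
    def num_in_progress(self) -> int:
        # started but not yet finished, never negative
        return max(0, self.num_execution_starts - self.num_finished)

stats = ComponentStatsSketch(
    num_execution_starts=10,
    num_unchanged=3,
    num_adds=2,
    num_reprocesses=1,
    num_errors=1,
)
print(stats.num_processed, stats.num_finished, stats.num_in_progress)
```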

## Scoped reports with `stats_group`

By default every component's stats roll up into one report for the whole `app.update()`. For a large pipeline you often want to watch a *part* of it on its own — "indexing the docs tree" separately from "indexing the code tree", or one expensive phase by itself.

`coco.stats_group(title)` opens a scope: everything mounted inside the block aggregates into a **separate** report under `title`, **split out** of the parent (the parent report no longer counts that work — there's no double counting). It's a plain `with` block, used inside a processing component like [`coco.component_subpath`](../programming_guide/processing_component#using-component_subpath-as-a-context-manager):

```python
@coco.fn
async def app_main(docs_dir, code_dir, target):
    with coco.stats_group("Indexing docs", report_to_stdout=True):
        files = localfs.walk_dir(docs_dir, ...)
        await coco.mount_each(process_doc, files.items(), target)

    with coco.stats_group("Indexing code", report_to_stdout=True):
        files = localfs.walk_dir(code_dir, ...)
        await coco.mount_each(process_code, files.items(), target)
```

With `report_to_stdout=True`, the group's progress is also printed to stdout, labeled by its title, alongside the main `report_to_stdout` display without disrupting it. Pass a `timedelta` instead of `True` to set the group's refresh interval.

The block yields a `StatsGroupHandle` with the same `stats()` and `watch()` as `UpdateHandle` (a group has no return value, so there's no `result()`):

```python
with coco.stats_group("Indexing docs") as sg:
    await coco.mount_each(process_doc, files.items(), target)

# Exit is non-blocking, so the mounted work keeps running after the block —
# `sg` stays valid; drain its watch() to follow that work to completion:
async for snapshot in sg.watch():
    send_to_dashboard(snapshot.stats)
```

### Semantics

- **Synchronous and non-blocking.** It's a regular `with` (not `async with`), even though the body `await`s mounts. Leaving the block does **not** wait for the work to finish — it only marks where member registration stops. The group becomes `READY` asynchronously once the block has exited *and* every component mounted inside it is ready; observe that via `sg.watch()` / `sg.stats()`.
- **Identity is unchanged.** Grouping only redirects where stats are *reported*. Component paths, change detection, target-state ownership, and the run's overall completion are unaffected — the grouped components are still ordinary children of the surrounding component.
- **Nesting.** Groups may nest. The **innermost** group owns a mount's stats; an outer group's readiness still waits for the inner group's work to finish.
- **Live members.** A group containing [live components](./live_component) becomes `READY` after their initial catch-up and reports their ongoing incremental stats, just like the root.

---

# Memoization keys & states

Source: https://cocoindex.io/docs/advanced_topics/memoization_keys/

As described in [Function — Change detection](../programming_guide/function#change-detection), CocoIndex detects [logic, input, and context changes](../programming_guide/function#change-detection) to decide whether a memo can be reused. Function arguments, [`deps`](../programming_guide/function#deps) values, and [context values](../programming_guide/context#change-detection) with `detect_change=True` are all fingerprinted through the same **data fingerprinting** pipeline. By default, most types are fingerprinted automatically. This page covers how to customize that pipeline — how objects are fingerprinted and validated:

- **Memoization keys** — how to control what CocoIndex uses as the fingerprint for your objects.
- **Memo states** — how to add post-fingerprint validation to check freshness beyond simple equality.

## How data fingerprinting works

For each data value (function argument, `deps` value, or context value), CocoIndex derives a canonical form with this precedence:

1. If the object implements **`__coco_memo_key__()`**, CocoIndex uses its return value.
2. Otherwise, if you registered a **memo key function** for the object's type, CocoIndex uses that.
3. Otherwise, CocoIndex falls back to structural canonicalization for a limited set of primitives/containers.

The following types are handled automatically (no custom key needed):

- **Primitives**: `None`, `bool`, `int`, `float`, `str`, `bytes`, `bytearray`, `memoryview`
- **Containers**: `list`, `tuple`, `dict`, `set`, `frozenset` (recursively canonicalized)
- **Dataclass instances**: all fields included in definition order
- **Pydantic v2 models**: all fields included
- **Class objects** (`type`): identified by module and qualified name
- **Other picklable objects**: used as a fallback via `pickle`

The canonical forms are combined into a deterministic fingerprint. If the fingerprint matches a cached entry, the cached result is reused — unless **memo states** indicate it's stale (see [Memo state validation](#memo-state-validation) below).
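
The precedence above can be sketched in a few lines of standard-library Python. This is an illustration of the idea, not CocoIndex's fingerprinting code: `KEY_FUNCTIONS`, `canonical`, and `fingerprint` are hypothetical names, only a few of the listed types are handled, the registry lookup uses the exact type for brevity, and SHA-256 over `repr` is an arbitrary choice made for this sketch.

```python
import hashlib
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, Callable

# Hypothetical registry standing in for registered memo key functions.
KEY_FUNCTIONS: dict[type, Callable[[Any], object]] = {}

def canonical(value: Any) -> object:
    """Derive a canonical form, following the three-step precedence."""
    # 1. The object supplies its own key.
    if hasattr(value, "__coco_memo_key__"):
        return canonical(value.__coco_memo_key__())
    # 2. A key function registered for the type (exact match only here).
    key_fn = KEY_FUNCTIONS.get(type(value))
    if key_fn is not None:
        return canonical(key_fn(value))
    # 3. Structural fallback for a few built-in shapes.
    if value is None or isinstance(value, (bool, int, float, str, bytes)):
        return (type(value).__name__, value)
    if isinstance(value, (list, tuple)):
        return (type(value).__name__, tuple(canonical(v) for v in value))
    if is_dataclass(value) and not isinstance(value, type):
        return (type(value).__qualname__,
                tuple(canonical(getattr(value, f.name)) for f in fields(value)))
    raise TypeError(f"no canonical form for {type(value).__name__}")

def fingerprint(*values: Any) -> str:
    """Combine canonical forms into one deterministic digest."""
    combined = repr(tuple(canonical(v) for v in values))
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()

class Versioned:
    def __init__(self, stable_id: str, version: int, payload: str) -> None:
        self.stable_id, self.version, self.payload = stable_id, version, payload

    def __coco_memo_key__(self) -> object:
        return (self.stable_id, self.version)  # payload is deliberately ignored

@dataclass
class Point:
    x: int
    y: int

# Same id and version give the same fingerprint, whatever the payload.
print(fingerprint(Versioned("a", 1, "x")) == fingerprint(Versioned("a", 1, "y")))
```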

## Customizing the memoization key

### Define `__coco_memo_key__` (when you control the type)

Implement a method on your class that returns a stable, deterministic value:

```python
class MyType:
    def __coco_memo_key__(self) -> object:
        # Return small primitives / tuples.
        return (...)
```

Return something that uniquely identifies the **semantic content** your function depends on:

- **Good**: small tuples of primitives, e.g. `(stable_id, version)`
- **Bad**: memory addresses, unstable UUIDs, open file handles, `datetime.now()`, or large raw payloads

**Example — DB row:**

```python
class UserRow:
    def __init__(self, user_id: int, updated_at: int) -> None:
        self.user_id = user_id
        self.updated_at = updated_at

    def __coco_memo_key__(self) -> object:
        return ("users", self.user_id, self.updated_at)
```

### Register a key function (when you don't control the type)

If you can't add `__coco_memo_key__` (stdlib / third-party types), register a handler:

```python
from pathlib import Path
from cocoindex import register_memo_key_function

def path_key(p: Path) -> object:
    p = p.resolve()
    st = p.stat()
    return (str(p), st.st_mtime_ns, st.st_size)

register_memo_key_function(Path, path_key)
```

- Registration is **MRO-aware**: if you register both a base class and a subclass, the **most specific** match wins.
- Your key function must return the same kinds of stable objects as `__coco_memo_key__` (small primitives/tuples).
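
"Most specific match wins" is the behavior you get from walking a type's method resolution order. The sketch below shows that lookup with a stand-in registry; `register`, `lookup`, and `_registry` are hypothetical names, not the CocoIndex internals.

```python
from typing import Any, Callable, Optional

KeyFn = Callable[[Any], object]
_registry: dict[type, KeyFn] = {}  # hypothetical stand-in registry

def register(cls: type, key_fn: KeyFn) -> None:
    _registry[cls] = key_fn

def lookup(obj: Any) -> Optional[KeyFn]:
    """Walk the MRO from most to least specific and return the first match."""
    for cls in type(obj).__mro__:
        if cls in _registry:
            return _registry[cls]
    return None

class Base: ...
class Child(Base): ...
class Other(Base): ...

register(Base, lambda obj: "base-key")
register(Child, lambda obj: "child-key")

child, other = Child(), Other()
print(lookup(child)(child))  # Child has its own handler
print(lookup(other)(other))  # Other falls back to the Base handler
```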

### Override at the call site with `memo_key=`

The two approaches above are **type-level**: every memoized function sees the same fingerprint for a given object. To customize fingerprinting **only for a specific function**, pass `memo_key=` to `@coco.fn` (or `@coco.fn.as_async`). It maps parameter names to either a callable (transform the value before fingerprinting) or `None` (exclude the parameter entirely).

```python
@coco.fn(memo=True, memo_key={"entry": lambda e: (e.name, e.version), "extra": None})
def transform(entry: SourceDataEntry, extra: str) -> str:
    ...
```

Each entry in `memo_key`:

- **Callable** — applied to the argument; its return value is fingerprinted in place of the original. Semantically the same as `__coco_memo_key__()` on the type, but scoped to this one function. Useful when the type's default fingerprint is correct everywhere *else*, and only this function should treat the argument differently.
- **`None`** — the parameter is excluded from the memo key. Changing its value never invalidates the cache. Useful for arguments that don't affect the result (logger handles, debug flags, request-scoped context that isn't part of the computation).
- **Not listed** — the parameter is fingerprinted normally (type-level `__coco_memo_key__`, registered handler, or default canonicalization).

It works for every parameter kind:

- **Positional / keyword parameters** — referenced by name.
- **`self`** — methods can pass `memo_key={"self": lambda self: self.prefix}` to memoize on selected instance state instead of the whole instance. Changes to other attributes on the same instance won't invalidate the memo.
- **`*args`** — name the varargs parameter; the callable receives the full tuple and must return a tuple. `None` excludes all variadic positional arguments.
- **`**kwargs`** — name the varkw parameter; the callable receives a dict and must return a dict. `None` excludes all variadic keyword arguments.

`memo_key` is validated at decoration time: unknown parameter names raise `ValueError`, and values that are neither callable nor `None` raise `TypeError`.
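
Decoration-time validation of this kind can be built on `inspect.signature`. The sketch below is a hypothetical re-creation of the two checks described above (`validate_memo_key` is not a CocoIndex function); it also shows why `*args` and `**kwargs` are addressed by their parameter names.

```python
import inspect
from typing import Any, Callable, Optional

def validate_memo_key(
    fn: Callable[..., Any],
    memo_key: dict[str, Optional[Callable[[Any], object]]],
) -> None:
    """Raise ValueError for unknown names, TypeError for bad values."""
    params = inspect.signature(fn).parameters
    for name, transform in memo_key.items():
        if name not in params:
            raise ValueError(f"memo_key names unknown parameter {name!r}")
        if transform is not None and not callable(transform):
            raise TypeError(f"memo_key[{name!r}] must be callable or None")

def transform(entry, extra, *args, **kwargs):
    ...

# Valid: a callable, an exclusion, and the varargs parameter by name.
validate_memo_key(transform, {"entry": repr, "extra": None, "args": None})

for bad in ({"missing": None}, {"entry": 42}):
    try:
        validate_memo_key(transform, bad)
    except (ValueError, TypeError) as exc:
        print(type(exc).__name__, exc)
```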

**Tip — Picking the right tool**
- You control the type and the fingerprint is the same wherever it's used → **`__coco_memo_key__`**.
- You don't control the type but want a global handler → **`register_memo_key_function`**.
- Only *this* function should treat an argument specially (transform or skip it) → **`memo_key=`**.

## Memo state validation

Sometimes fingerprint matching alone isn't enough to decide whether a cached result is valid. For example:

- **Multi-level validation**: for files, check the modified time first (cheap), and only read the file for a content fingerprint when the time doesn't match.
- **Async validation**: for an S3 object, send a HEAD request to check freshness — an inherently async operation.
- **Stateful validation**: for HTTP resources, store the last fetch time and use `If-Modified-Since` on the next run.

Memo state validation addresses these by letting you attach a **state function** to your objects. It runs *after* a fingerprint match, giving you a chance to check freshness before the cached result is reused.

### How it works

When CocoIndex finds a fingerprint match, it calls each state function with the stored state from the previous run:

1. **First run** (no previous state): `prev_state` is `coco.NON_EXISTENCE`. Use `coco.is_non_existence(prev_state)` to detect this.
2. **Subsequent runs**: `prev_state` is whatever you returned last time.

Your state function returns a `coco.MemoStateOutcome(state=..., memo_valid=...)`:

- **`state`** — the current state value. CocoIndex stores it for the next run.
- **`memo_valid`** (`bool`, defaults to `False`) — whether the cached result is still valid.

This decouples "has the state changed?" from "can we reuse the memo?":

- `MemoStateOutcome(state=new_state)` → cache is invalid (default). Function re-executes, new state is stored. On the first run (no previous cache), simply return the initial state without setting `memo_valid`.
- `MemoStateOutcome(state=same_state, memo_valid=True)` → nothing changed, cached result reused, no state update needed.
- `MemoStateOutcome(state=new_state, memo_valid=True)` → state changed but cached result is still valid (e.g. mtime changed but content hash unchanged). The new state is persisted so the next run uses the updated state.
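
The three outcomes can be traced with a small stand-in for the engine side. Everything here is hypothetical scaffolding (`MISSING` stands in for `coco.NON_EXISTENCE`, `Outcome` for `coco.MemoStateOutcome`, and `call_memoized` for the engine's reuse decision); it only models the rules listed above.

```python
from dataclasses import dataclass
from typing import Any, Callable

MISSING = object()  # stand-in for coco.NON_EXISTENCE

@dataclass
class Outcome:
    """Stand-in for coco.MemoStateOutcome."""
    state: Any
    memo_valid: bool = False

def call_memoized(cache: dict, key: str,
                  state_fn: Callable[[Any], Outcome],
                  compute: Callable[[], Any]) -> Any:
    entry = cache.get(key)
    prev_state = MISSING if entry is None else entry["state"]
    outcome = state_fn(prev_state)
    if entry is not None and outcome.memo_valid:
        result = entry["result"]   # cached result is still valid
    else:
        result = compute()         # stale or first run: re-execute
    # Persist the returned state either way, so the next run sees it.
    cache[key] = {"state": outcome.state, "result": result}
    return result

# Demo: state is (mtime, content); only a content change invalidates the memo.
source = {"mtime": 1, "content": "hello"}
executions: list[str] = []

def source_state(prev: Any) -> Outcome:
    current = (source["mtime"], source["content"])
    if prev is MISSING:
        return Outcome(state=current)
    return Outcome(state=current, memo_valid=current[1] == prev[1])

def compute() -> str:
    executions.append(source["content"])
    return source["content"].upper()

cache: dict = {}
first = call_memoized(cache, "doc", source_state, compute)   # first run: executes
source["mtime"] = 2
second = call_memoized(cache, "doc", source_state, compute)  # mtime only: reused
source["content"] = "world"
third = call_memoized(cache, "doc", source_state, compute)   # content changed: executes
print(executions)
```

The second call reuses the cached result but still stores the new mtime, which matches the "state changed but cached result is still valid" case.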

### Define `__coco_memo_state__` (when you control the type)

**Info — Type annotations**
Annotate the `prev_state` parameter with its expected type (matching what you return in `MemoStateOutcome(state=...)`) so CocoIndex can properly reconstruct stored state values. See [Serialization](../programming_guide/serialization) for details on supported types.

Add a `__coco_memo_state__` method alongside `__coco_memo_key__`:

```python
import os
import hashlib
from pathlib import Path
import cocoindex as coco

class LocalFile:
    def __init__(self, path: Path) -> None:
        self.path = path

    def __coco_memo_key__(self) -> object:
        # Identity only — which file is it?
        return str(self.path.resolve())

    def __coco_memo_state__(self, prev_state: tuple[int, str] | coco.NonExistenceType) -> coco.MemoStateOutcome:
        st = os.stat(self.path)
        new_mtime = st.st_mtime_ns
        if coco.is_non_existence(prev_state):
            # First run — compute initial state (memo_valid defaults to False,
            # which is fine since there's no previous cache to reuse)
            content_hash = hashlib.sha256(self.path.read_bytes()).hexdigest()
            return coco.MemoStateOutcome(state=(new_mtime, content_hash))

        prev_mtime, prev_hash = prev_state
        if new_mtime == prev_mtime:
            # mtime unchanged — definitely reusable, no content read needed
            return coco.MemoStateOutcome(state=prev_state, memo_valid=True)
        # mtime changed: read content and compare hashes (same hash function as the first run)
        content_hash = hashlib.sha256(self.path.read_bytes()).hexdigest()
        return coco.MemoStateOutcome(state=(new_mtime, content_hash), memo_valid=content_hash == prev_hash)
```

**Tip — Keys vs states for files**
Without state validation, you'd include `mtime` and `size` directly in the memo key:
```python
def __coco_memo_key__(self):
    st = os.stat(self.path)
    return (str(self.path.resolve()), st.st_mtime_ns, st.st_size)
```
This works for simple cases. State validation becomes useful when you need multi-level checks (e.g. check mtime first, then content hash only if it differs), async operations, or stored metadata like ETags. With the `MemoStateOutcome` return, you can update the state (e.g. new mtime) without invalidating the cache when the content hasn't actually changed.

### Register a state function (when you don't control the type)

Pass a `state_fn` keyword argument to `register_memo_key_function`. The state function receives the object as its first argument and `prev_state` as its second. Annotate `prev_state` with the expected type:

```python
from pathlib import Path
import cocoindex as coco
from cocoindex import register_memo_key_function

def path_key(p: Path) -> object:
    return str(p.resolve())

def path_state(p: Path, prev_state: tuple[int, int] | coco.NonExistenceType) -> coco.MemoStateOutcome:
    st = p.stat()
    new_state = (st.st_mtime_ns, st.st_size)
    memo_valid = not coco.is_non_existence(prev_state) and new_state == prev_state
    return coco.MemoStateOutcome(state=new_state, memo_valid=memo_valid)

register_memo_key_function(Path, path_key, state_fn=path_state)
```

### Async state methods

A state method can return an `Awaitable`. CocoIndex handles this automatically:

- **In an async CocoIndex function**: awaitables from all state methods are gathered concurrently.
- **In a sync CocoIndex function**: if no event loop is running, CocoIndex uses `asyncio.run()`. If a loop is already running, it raises an error — switch to an async function or use `@coco.fn.as_async`.

```python
import cocoindex as coco

class S3Object:
    def __init__(self, bucket: str, key: str) -> None:
        self.bucket = bucket
        self.key = key

    def __coco_memo_key__(self) -> object:
        return (self.bucket, self.key)

    async def __coco_memo_state__(self, prev_state: str | coco.NonExistenceType) -> coco.MemoStateOutcome:
        etag = await self._head_object()
        memo_valid = not coco.is_non_existence(prev_state) and etag == prev_state
        return coco.MemoStateOutcome(state=etag, memo_valid=memo_valid)

    async def _head_object(self) -> str:
        ...  # boto3 / aioboto3 HEAD call
```

## Preventing memoization

Some types maintain internal state that makes memoization semantically incorrect. For example, a generator that tracks call counts would produce wrong results if memoized.

### Inherit from `NotMemoKeyable` (when you control the type)

```python
import cocoindex as coco

class MyStatefulGenerator(coco.NotMemoKeyable):
    def __init__(self) -> None:
        self._counter = 0

    def next_value(self) -> int:
        self._counter += 1
        return self._counter
```

### Register as not memo-keyable (when you don't control the type)

```python
import cocoindex as coco
from some_library import StatefulGenerator

coco.register_not_memo_keyable(StatefulGenerator)
```

In either case, attempting to use the type as a memo key raises a clear error.

## Best practices

- **Keep keys small and deterministic**: use identifiers and versions, not full payloads. No `id(obj)`, pointer addresses, or random values.
- **Separate identity from freshness**: put stable identifiers (file path, URL, primary key) in the key. Put freshness checks (mtime, ETag, version) in the state.
- **Use state validation for expensive checks**: if freshness validation is costly (content hashing, network calls), a state function lets you do it only when the fingerprint matches, and only when a cheap pre-check (mtime) fails.
- **Use `MemoStateOutcome(state=new_state, memo_valid=True)` for cheap state updates**: when a cheap property changes (mtime) but the expensive check (content hash) confirms nothing meaningful changed, return `memo_valid=True` while updating the state. This avoids re-executing the function and avoids re-checking the expensive property next time.
- **Mark stateful types as `NotMemoKeyable`**: prevent subtle bugs from incorrect memoization of types with side effects.

---

# Error handling & exception handlers

Source: https://cocoindex.io/docs/advanced_topics/exception_handlers/

This page covers the full picture of failure behavior in CocoIndex — from how components fail in isolation, through what happens during interrupted updates, to the APIs for observing and reacting to errors in production.

For a quick overview of failure isolation and the two-phase model, see [What happens when a component fails](../programming_guide/processing_component#what-happens-when-a-component-fails) in the Processing Component guide.

## The guiding principle

**Whether a CocoIndex call raises on failure depends on what's on the critical path for the call to finish and return.** Things on that path that fail surface as exceptions; things off that path don't.

| API | What it takes to return | Raises on failure? |
|---|---|---|
| `coco.mount`, `coco.mount_each`, `LiveComponentOperator.update`/`.delete` | the work is *scheduled* (returns a handle) | **No** — execution runs in the background |
| `await handle.ready()` (any of the above) | reaches *ready state* | **No by default** — logged at `ERROR`; opt into propagation by installing a handler that raises |
| `await LiveComponentOperator.update_full()` | reaches *ready state* of the full cycle | **No by default** — same shape as `handle.ready()` |
| `await coco.use_mount(...)` | the inner component's *value* is produced | **Yes** — component success is on the critical path |
| `await app.drop()` (and `cocoindex drop`) | GC succeeds at *every* level | **Yes**, at any depth — leaking target state would otherwise go silent |

**Why `ready`/`update_full` tolerate errors by default:** `handle.ready()` is usually not awaited (callers spawn many mounts in a loop and rely on the parent's `wait_for_children`), so an exception raised there would have no awaiter to receive it and would silently disappear. Logging the failure gives operators a signal regardless. The same applies to `update_full`: a periodic refresh that died on the first cycle failure would be worse than one that logs and retries on the next tick.

**Why `app.drop` raises at every level:** clearing tracking records is only safe if every component's delete actually ran its sink calls. If a descendant delete failed silently, `await app.drop()` would return successfully while target state is still live in your database, which is a leak. So failures at any depth surface as an exception.

## How to opt into propagation

Install an exception handler (described below). A handler that **returns normally** swallows; a handler that **raises** propagates through `handle.ready()` (or `update_full()`) up to your awaiting code. For `app.drop`, propagation is built in — you don't need to install anything.

[Live components](./live_component) follow the same principle: their `update`/`delete`/`update_full` calls route failures through this chain with `mount_kind="process_live"`. The operator also exposes [`report_exception(exc)`](./live_component#report_exception) for surfacing errors raised in `process_live`'s body that aren't already routed by `update`/`delete`/`update_full`.

## Processing and submit phases

CocoIndex processes each component in two phases:

1. **Processing** — runs your function, declares target states in memory. This phase is side-effect-free. If it fails (e.g., a parsing error, an API timeout), no writes were attempted.
2. **Submit** — writes changes to target backends. This phase only runs after processing completes successfully.

This separation means a processing failure never leaves partial data in your targets.
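
A minimal model of the two phases, using a plain dict as the "target". This is a sketch of the principle only: `run_component`, `good`, and `bad` are hypothetical, and a real submit phase also handles updates and deletes.

```python
from typing import Callable

Rows = dict[str, str]

def run_component(process: Callable[[Rows], None], target: Rows) -> bool:
    declared: Rows = {}
    try:
        process(declared)        # processing phase: only touches memory
    except Exception:
        return False             # nothing was written to the target
    target.update(declared)      # submit phase: runs only after success
    return True

def good(rows: Rows) -> None:
    rows["a"] = "alpha"
    rows["b"] = "beta"

def bad(rows: Rows) -> None:
    rows["c"] = "gamma"              # declared in memory...
    raise ValueError("parse error")  # ...but processing fails before submit

target: Rows = {}
ok_good = run_component(good, target)
ok_bad = run_component(bad, target)
print(ok_good, ok_bad, target)
```

The row declared by `bad` never reaches `target`, because its processing phase did not complete.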

## Interrupted updates and recovery

An update can be interrupted by various events: a process kill (SIGKILL), Ctrl+C (SIGINT), an unhandled exception, or a target backend failure during submit.

**What state is left behind?**

CocoIndex's internal database (LMDB) uses transactions, so its own state is always consistent even after a crash. CocoIndex tracks all possible states a target could be in — if an update is interrupted partway through a commit, both the old and new states are retained as possibilities. This ensures no state is ever lost.

**Recovery is automatic.** On the next `app.update()`, CocoIndex computes the current desired state and reconciles against all possible previous states. The target connector converges the target to the correct state regardless of whether the previous commit partially succeeded or never ran.

For details on how target handlers deal with multiple possible previous states after an interruption, see [Custom Target Connector — Handle multiple previous states](./custom_target_connector#handle-multiple-previous-states).
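
As a rough mental model of reconciling against several possible previous states, consider a single key whose previous value is uncertain. The function below is a hypothetical toy (the name `plan_action` and the string actions are assumptions made for this sketch); the real handler interface is described on the linked Custom Target Connector page.

```python
from typing import Optional

ABSENT = None  # marks "the target does not exist"


def plan_action(desired: Optional[str], possible_previous: set[Optional[str]]) -> str:
    """Pick an action that is correct for every possible previous state."""
    if possible_previous == {desired}:
        return "noop"      # the target is certainly already correct
    if desired is ABSENT:
        return "delete"    # must tolerate the target being gone already
    return "upsert"        # must tolerate the target existing or not
```

The point of the sketch is that the chosen action has to be safe whichever of the possible previous states is the real one, which is why an interrupted commit leads to an idempotent `upsert` or `delete` rather than a skipped write.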

## Monitoring errors

`app.update()` returns an `UpdateHandle` that exposes processing stats, including error counts:

```python
handle = app.update()

# Poll stats at any time
stats = handle.stats()
if stats is not None:
    print(f"Errors: {stats.total.num_errored}")

# Stream progress
async for snapshot in handle.watch():
    print(f"{snapshot.stats.total.num_errored} errors so far")
```

See [Progress monitoring](./progress_monitoring) for the full `UpdateHandle` API.

## Exception handlers

For background-mounted components (`mount()` and `mount_each()`), you can register **exception handlers** to observe or react to failures — for example, to send alerts, record metrics, or implement custom logic.

CocoIndex supports two levels of exception handlers:

- **Global (environment-level)**: registered once in your lifespan function; applies to all background mounts in the environment.
- **Scoped**: an async context manager that applies to all `mount()` / `mount_each()` calls made within it.

**Note**
Exception handlers only apply to `mount()` and `mount_each()`. `use_mount()` propagates errors directly to the caller since the parent has an explicit dependency on the result.

### Global exception handler

Register a handler inside your `@coco.lifespan` function using `builder.set_exception_handler()`:

```python
import cocoindex as coco

@coco.lifespan
def lifespan(builder: coco.EnvironmentBuilder):
    def on_error(exc: BaseException, ctx: coco.ExceptionContext) -> None:
        print(f"[{ctx.env_name}] {ctx.mount_kind} failed at {ctx.stable_path}: {exc}")

    builder.set_exception_handler(on_error)
    yield
```

This replaces the default "log error" behavior for all background mounts in the environment.

### Scoped exception handler

Use `coco.exception_handler()` as an async context manager to apply a handler to a specific dynamic scope:

```python
@coco.fn
async def process_all(files):
    def on_error(exc: BaseException, ctx: coco.ExceptionContext) -> None:
        print(f"Failed processing {ctx.stable_path}: {exc}")

    async with coco.exception_handler(on_error):
        for f in files:
            await coco.mount(coco.component_subpath(str(f.path)), process_file, f)
```

The handler applies to all `mount()` / `mount_each()` calls within the `async with` block, including those in nested functions called from within the block.

### Handler type

Both sync and async handlers are supported:

```python
from typing import Awaitable

# Sync handler
def sync_handler(exc: BaseException, ctx: coco.ExceptionContext) -> None:
    ...

# Async handler
async def async_handler(exc: BaseException, ctx: coco.ExceptionContext) -> None:
    await send_alert(exc, ctx)
```

The type alias is:

```python
from typing import Awaitable, Callable

from cocoindex import ExceptionContext

ExceptionHandler = Callable[
    [BaseException, ExceptionContext],
    None | Awaitable[None],
]
```

### `ExceptionContext` fields

Your handler receives an `ExceptionContext` dataclass with information about the failure:

| Field | Type | Description |
|---|---|---|
| `env_name` | `str` | Name of the CocoIndex environment |
| `stable_path` | `str` | Full stable path of the failing component |
| `processor_name` | `str \| None` | Name of the processor (best-effort) |
| `mount_kind` | `"mount" \| "mount_each" \| "delete_background" \| "process_live"` | Source of the failure: `"mount"`/`"mount_each"` for background build errors, `"delete_background"` for sweep-driven deletions, `"process_live"` for exceptions surfaced from a live component's `process_live` via `LiveComponentOperator.report_exception` (e.g. a failed `coco.auto_refresh` cycle) |
| `parent_stable_path` | `str \| None` | Stable path of the parent component |
| `is_background` | `bool` | Always `True` for exception handler invocations |
| `source` | `"component" \| "handler"` | `"component"` for the original failure; `"handler"` if a handler itself raised |
| `original_exception` | `BaseException \| None` | The original component exception, set only when `source == "handler"` |

### Handler stacking and fallback

Handlers are stacked: the most specific (innermost) handler runs first.

If the innermost handler raises an exception, the next outer handler is called with that new exception. In this case `ctx.source` is `"handler"` and `ctx.original_exception` holds the original component error.

This continues up the stack. If all handlers raise (or no handler is registered), CocoIndex falls back to the built-in behavior: logging the error at `ERROR` level, with no crash.

```python
@coco.lifespan
def lifespan(builder: coco.EnvironmentBuilder):
    builder.settings.db_path = "..."

    def global_handler(exc: BaseException, ctx: coco.ExceptionContext) -> None:
        if ctx.source == "handler":
            # A handler itself failed — exc is the handler's exception,
            # ctx.original_exception is the original component error.
            print(f"Handler error: {exc}; original: {ctx.original_exception}")
        else:
            print(f"Component error: {exc}")

    builder.set_exception_handler(global_handler)
    yield

@coco.fn
async def _root() -> None:
    def inner_handler(exc: BaseException, ctx: coco.ExceptionContext) -> None:
        print(f"inner: {exc}")
        raise RuntimeError("inner handler failed")  # falls through to global_handler

    async with coco.exception_handler(inner_handler):
        await coco.mount(coco.component_subpath("child"), _child)
```

If you never register a handler, the default behavior applies: exceptions from background mounts are logged at `ERROR` level and sibling components continue unaffected.
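
The stacking and fallback rules can also be read as a small dispatch loop. The sketch below simulates them without CocoIndex: `Ctx` is a trimmed-down, hypothetical stand-in for `ExceptionContext`, and `dispatch` is an illustrative function, not part of the library.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("handler_chain_sketch")


@dataclass
class Ctx:
    """Hypothetical, trimmed-down stand-in for ExceptionContext."""
    source: str  # "component" or "handler"
    original_exception: Optional[BaseException] = None


Handler = Callable[[BaseException, Ctx], None]


def dispatch(exc: BaseException, handlers_innermost_first: list[Handler]) -> str:
    """Walk the stack outward until a handler returns normally."""
    original = exc
    ctx = Ctx(source="component")
    for handler in handlers_innermost_first:
        try:
            handler(exc, ctx)
            return "handled"
        except Exception as handler_exc:
            # The next outer handler sees the handler's own exception.
            exc = handler_exc
            ctx = Ctx(source="handler", original_exception=original)
    # No handler registered, or every handler raised: log and carry on.
    logger.error("unhandled background failure: %s", exc)
    return "logged"


seen: list[tuple[str, str, str]] = []


def failing_inner(exc: BaseException, ctx: Ctx) -> None:
    raise RuntimeError("inner handler failed")


def recording_outer(exc: BaseException, ctx: Ctx) -> None:
    seen.append((str(exc), ctx.source, str(ctx.original_exception)))
```

Dispatching a `ValueError("boom")` through `[failing_inner, recording_outer]` records the inner handler's error with `source` set to `"handler"` and the original `ValueError` attached.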

---

# Internal storage configuration

Source: https://cocoindex.io/docs/advanced_topics/internal_storage/

CocoIndex uses an [LMDB](http://www.lmdb.tech/doc/) database to persist its internal state. This database tracks target states and [memoization](../programming_guide/function) results from previous runs, enabling CocoIndex to detect what changed and apply only the necessary updates.

## Database path

CocoIndex needs a database path (`db_path`) to know where to store this internal state. The simplest way to set it is via the `COCOINDEX_DB` environment variable:

```bash
export COCOINDEX_DB=./cocoindex.db
```

You can also set it programmatically in a [lifespan function](../programming_guide/app#lifespan-optional):

```python
import pathlib
from typing import Iterator
import cocoindex as coco

@coco.lifespan
def coco_lifespan(builder: coco.EnvironmentBuilder) -> Iterator[None]:
    builder.settings.db_path = pathlib.Path("./cocoindex.db")
    yield
```

Or pass it directly when creating a [`Settings`](./multiple_environments) object:

```python
settings = coco.Settings(db_path=pathlib.Path("./cocoindex.db"))
```

Setting `db_path` in the lifespan or `Settings` takes precedence over the `COCOINDEX_DB` environment variable. If no `db_path` is set and `COCOINDEX_DB` is not defined, CocoIndex raises an error.
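
The precedence rule can be summarized in a few lines. `resolve_db_path` is a hypothetical helper written for this page, and the `ValueError` is an assumption; the exception type CocoIndex actually raises is not specified here.

```python
import os
import pathlib
from typing import Mapping, Optional


def resolve_db_path(
    explicit: Optional[pathlib.Path],
    environ: Mapping[str, str] = os.environ,
) -> pathlib.Path:
    """Hypothetical helper: an explicit setting wins over COCOINDEX_DB."""
    if explicit is not None:
        return explicit
    from_env = environ.get("COCOINDEX_DB")
    if from_env:
        return pathlib.Path(from_env)
    # Assumed error type, chosen only for this sketch.
    raise ValueError("db_path is not set and COCOINDEX_DB is not defined")
```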

## LMDB tuning

The LMDB database has two tunable settings, grouped under `coco.LmdbSettings` and attached to `Settings` as `db_settings`. The defaults work well for most use cases — you only need to adjust them for large-scale deployments.

| Setting | Default | Env Variable | Description |
|---------|---------|-------------|-------------|
| `max_dbs` | `1024` | `COCOINDEX_LMDB_MAX_DBS` | Maximum number of named LMDB databases. Must be ≥ 1. |
| `map_size` | `4294967296` (4 GiB) | `COCOINDEX_LMDB_MAP_SIZE` | Maximum size of the LMDB memory map in bytes. Must be > 0; rounded up to the nearest multiple of the system page size. |

### When to adjust

- **Increase `map_size`** if you encounter LMDB "map full" errors. This happens when the accumulated internal state (target states + memoization cache) exceeds the configured `map_size` (4 GiB by default). On 64-bit systems, `map_size` is a virtual address space reservation, so setting it larger than needed is safe and does not consume physical memory.
- **Increase `max_dbs`** if you have an unusually large number of apps sharing a single database directory.
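
To see what "rounded up to the nearest multiple of the system page size" means for a given `map_size`, the arithmetic can be sketched as follows. `round_up_to_page` is a hypothetical helper for illustration; which layer performs the rounding in practice is not stated here.

```python
import mmap


def round_up_to_page(map_size: int, page_size: int = mmap.PAGESIZE) -> int:
    """Round `map_size` up to the next multiple of `page_size`."""
    if map_size <= 0:
        raise ValueError("map_size must be > 0")
    # Ceiling division, then scale back up to bytes.
    return -(-map_size // page_size) * page_size
```

With a 4096-byte page, the 4 GiB default (`4294967296`) is already aligned and stays unchanged, while `4294967297` would become `4294971392`.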

### Configuration

Via environment variables:

```bash
export COCOINDEX_LMDB_MAP_SIZE=8589934592   # 8 GiB
export COCOINDEX_LMDB_MAX_DBS=2048
```

Or programmatically in a lifespan function:

```python
@coco.lifespan
def coco_lifespan(builder: coco.EnvironmentBuilder) -> Iterator[None]:
    builder.settings.db_path = pathlib.Path("./cocoindex.db")
    builder.settings.db_settings.map_size = 8 * 1024 * 1024 * 1024  # 8 GiB
    builder.settings.db_settings.max_dbs = 2048
    yield
```

Or when creating a `Settings` object directly:

```python
settings = coco.Settings(
    db_path=pathlib.Path("./cocoindex.db"),
    db_settings=coco.LmdbSettings(
        map_size=8 * 1024 * 1024 * 1024,  # 8 GiB
        max_dbs=2048,
    ),
)
```

When using `Settings.from_env()`, the LMDB settings are automatically loaded from their environment variables if set; otherwise, the defaults apply.

**Note — Legacy keyword arguments**
For backward compatibility, `Settings` still accepts `lmdb_max_dbs` and `lmdb_map_size` as keyword arguments, and exposes them as attributes (e.g., `settings.lmdb_map_size = ...`). These read and write the same underlying values as `settings.db_settings.max_dbs` / `settings.db_settings.map_size`. Passing both `db_settings=` and the legacy keywords in the same `Settings(...)` call raises `ValueError`.

---

# Multiple environments

Source: https://cocoindex.io/docs/advanced_topics/multiple_environments/

By default, all CocoIndex apps share the same environment, which manages a shared database and context. For most use cases, this default behavior is sufficient. However, you can create multiple isolated environments when you need:

- **Library development**: Libraries that depend on CocoIndex can use their own environment to avoid sharing state with other libraries or the application
- **Database isolation**: Different apps using separate databases
- **Multi-tenant deployments**: Isolated data per tenant
- **Testing**: Isolated test environments that don't interfere with production

## Creating an Environment

Create an explicit environment by providing `Settings` with a `db_path`:

```python
import cocoindex as coco
import pathlib

env = coco.Environment(coco.Settings.from_env(db_path=pathlib.Path("./my_db.db")))
```

You can optionally name the environment for easier identification:

```python
env = coco.Environment(
    coco.Settings.from_env(db_path=pathlib.Path("./my_db.db")),
    name="production"
)
```

`Settings` also accepts `db_settings` (a `coco.LmdbSettings` instance) for tuning the internal LMDB database. See [Internal Storage](./internal_storage#lmdb-tuning) for details.

## Associating Apps with an Environment

Pass the environment to `AppConfig` when creating an app:

```python
app = coco.App(
    coco.AppConfig(name="MyApp", environment=env),
    main_fn,
)
```

Apps that don't specify an environment use the default environment (configured via `@coco.lifespan` or the `COCOINDEX_DB` environment variable).

## Example: Multiple Environments

This example creates two apps in different environments, each with its own database:

```python
import cocoindex as coco
import pathlib

# Create two environments with separate databases
env1 = coco.Environment(coco.Settings.from_env(db_path=pathlib.Path("./db1/cocoindex.db")))
env2 = coco.Environment(coco.Settings.from_env(db_path=pathlib.Path("./db2/cocoindex.db")))

@coco.fn
def build1() -> None:
    # ... pipeline logic for env1 ...
    pass

@coco.fn
def build2() -> None:
    # ... pipeline logic for env2 ...
    pass

# Apps in different environments
app1 = coco.App(coco.AppConfig(name="App1", environment=env1), build1)
app2 = coco.App(coco.AppConfig(name="App2", environment=env2), build2)
```

## Same App Name in Different Environments

Apps with the same name can coexist in different environments. This is useful for multi-tenant scenarios or running the same pipeline against different databases:

```python
import cocoindex as coco
import pathlib
from typing import Iterator

# Named environment for "alpha" tenant
env_alpha = coco.Environment(
    coco.Settings.from_env(db_path=pathlib.Path("./alpha/cocoindex.db")),
    name="alpha"
)

# Default environment via lifespan
@coco.lifespan
def _lifespan(builder: coco.EnvironmentBuilder) -> Iterator[None]:
    builder.settings.db_path = pathlib.Path("./default/cocoindex.db")
    yield

@coco.fn
def build() -> None:
    # ... pipeline logic ...
    pass

# Same app name, different environments
app_alpha = coco.App(coco.AppConfig(name="MyApp", environment=env_alpha), build)
app_default = coco.App("MyApp", build)  # Uses default environment
```

## CLI Support

When working with multiple environments, the CLI groups apps by their environment. Append `@env_name` to the app specifier to target an app in a specific environment:

```bash
# List all apps grouped by environment
cocoindex ls ./multi_env.py

# Update a specific app in a named environment
cocoindex update ./multi_env.py:MyApp@alpha

# Update the app in the default environment
cocoindex update ./multi_env.py:MyApp@default
```

## Testing with Isolated Environments

For tests, create isolated environments to avoid interference between test runs:

```python
import cocoindex as coco
import pathlib
import tempfile

def create_test_env(test_name: str) -> coco.Environment:
    db_path = pathlib.Path(tempfile.mkdtemp()) / f"{test_name}.db"
    return coco.Environment(coco.Settings.from_env(db_path=db_path))

# In your test
def test_my_pipeline():
    env = create_test_env("test_my_pipeline")
    app = coco.App(
        coco.AppConfig(name="TestApp", environment=env),
        my_main_fn,
    )
    app.update_blocking()
    # ... assertions ...
```

## When to Use Multiple Environments

| Use Case | Approach |
|----------|----------|
| Single app, single database | Default environment (no explicit environment needed) |
| Multiple apps sharing state | Default environment |
| Apps needing separate databases | Explicit environments with different `db_path` |
| Multi-tenant isolation | Named environments per tenant |
| Testing | Temporary isolated environments |

---

# Building live components

Source: https://cocoindex.io/docs/advanced_topics/live_component/

By default, a [processing component](../programming_guide/processing_component) runs in **catch-up mode**: on each `app.update()`, it declares all target states and mounts all sub-components from scratch. CocoIndex makes the update incremental by skipping [memoized](../programming_guide/function) sub-components and reconciling target states at the end; the component then exits. This works well when the dataset is small enough to scan fully each cycle.

When the dataset is large or you need to react to changes continuously (e.g., watching a file system), you want the component itself to stay running and react incrementally. A **live component** does an initial full scan (same as catch-up mode), then keeps running and reacts to individual changes without rescanning everything.

## The LiveComponent protocol

A live component is a class with three methods:

```python
class MyLiveComponent:
    def __init__(self, folder: pathlib.Path, target: localfs.DirTarget) -> None:
        """Receive arguments from the mount() call."""
        self.folder = folder
        self.target = target

    async def process(self) -> None:
        """Full processing — mount all children, declare all target states."""
        ...

    async def process_live(self, operator: coco.LiveComponentOperator) -> None:
        """Continuous processing — orchestrate full and incremental updates."""
        ...
```

- **`__init__`** receives arguments passed to `coco.mount()`.
- **`process()`** does a full scan — mounts children via `coco.mount()` and declares target states, just like a traditional component function. Called indirectly via `operator.update_full()`.
- **`process_live(operator)`** is the long-running entry point. It orchestrates full and incremental updates using the operator.

CocoIndex detects a live component by checking if the class has both `process` and `process_live` methods.

## LiveComponentOperator

The `operator` passed to `process_live()` provides five methods:

| Method | Description |
|--------|-------------|
| `await operator.update_full()` | Run `process()` with a full submission phase (GCs removed children). Blocks until fully ready. |
| `await operator.update(subpath, fn, *args, **kwargs)` | Mount a child component incrementally. |
| `await operator.delete(subpath)` | Delete a child component. |
| `await operator.mark_ready()` | Signal that processing has caught up to the time `process_live()` was called. |
| `await operator.report_exception(exc)` | Surface a `process_live`-body exception through the parent's exception handler chain. |

### `update_full()`

Triggers a full processing cycle: calls `process()`, submits target states, waits for all children to be ready, and garbage-collects children that are no longer mounted. This is the same mechanism as a traditional component's update cycle.

Exceptions raised inside `process()` (or its descendants) are routed through the parent's [exception handler chain](./exception_handlers#exception-handlers) — same as background `coco.mount()` failures — and **do not propagate** to the caller. This lets long-running `process_live` loops (such as periodic refreshers) keep going across transient cycle failures while still surfacing the failure to operators.

### `update()` and `delete()`

Mount or delete individual child components without a full scan. These are concurrent with each other but serialized with `update_full()` — if `update_full()` is running, incremental operations wait until it finishes.

When multiple operations target the same subpath, only the latest one (by invocation order) takes effect.
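
The "latest one wins" rule can be illustrated with a small model. `Op` and `coalesce` are hypothetical names used only in this sketch; they do not describe how the operator is implemented internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Op:
    kind: str                 # "update" or "delete"
    subpath: str
    value: Optional[str] = None


def coalesce(ops_in_invocation_order: list[Op]) -> dict[str, Op]:
    """Keep only the latest operation for each subpath."""
    latest: dict[str, Op] = {}
    for op in ops_in_invocation_order:
        latest[op.subpath] = op  # a later op replaces an earlier one
    return latest
```

For example, an `update` of `a.txt` followed by a `delete` of `a.txt` leaves only the `delete` in effect for that subpath.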

Failures in the mounted child (or in the delete itself) are routed through the parent's [exception handler chain](./exception_handlers#exception-handlers) with `mount_kind="process_live"` and `stable_path` set to the *child's* path — same shape as background `coco.mount` failures, never propagated to `handle.ready()`. The live loop is unaffected by child failures.

### `mark_ready()`

Signals to the parent that the live component has caught up. The parent's `await handle.ready()` returns when `mark_ready()` is called. If `process_live()` returns without calling `mark_ready()`, it is called automatically.

### `report_exception()`

Routes an exception raised somewhere in `process_live`'s body — but **outside** an `update_full` / `update` / `delete` call — through the parent's [exception handler chain](./exception_handlers#exception-handlers). Use it for failures the framework can't see on its own, such as an external watcher emitting a malformed event:

```python
async def process_live(self, operator):
    await operator.update_full()
    await operator.mark_ready()
    async for event in watcher.events():
        try:
            subpath, value = parse_event(event)
        except ParseError as exc:
            await operator.report_exception(exc)
            continue
        await operator.update(subpath, process_one, value)
```

The handler receives an `ExceptionContext` with `mount_kind="process_live"` and `stable_path` set to the live component's own path. The full Python traceback is preserved when `exc.__traceback__` is set (that is, for caught exceptions). If no handler is registered, CocoIndex falls back to ERROR-level logging.

You **don't** need `report_exception` around `update_full` / `update` / `delete` themselves — those route their own errors automatically.

## Example: file system watcher

A component that watches a local folder and processes each file:

```python
import pathlib
import cocoindex as coco

class FolderWatcher:
    def __init__(self, folder: pathlib.Path, target) -> None:
        self.folder = folder
        self.target = target

    async def process(self) -> None:
        """Full scan — mount a child for every file in the folder."""
        for path in self.folder.iterdir():
            if path.is_file():
                await coco.mount(
                    coco.component_subpath(path.name),
                    process_file,
                    path,
                    self.target,
                )

    async def process_live(self, operator: coco.LiveComponentOperator) -> None:
        # 1. Set up the file watcher before the full scan so no events are missed.
        watcher = setup_watchdog(self.folder)

        # 2. Full scan.
        await operator.update_full()

        # 3. Signal readiness — parent can proceed.
        await operator.mark_ready()

        # 4. React to changes.
        async for event in watcher.events():
            subpath = coco.component_subpath(event.filename)
            if event.is_update:
                await operator.update(subpath, process_file, event.path, self.target)
            elif event.is_delete:
                await operator.delete(subpath)
```

The parent mounts it like any other component:

```python
@coco.fn
async def app_main(folder: pathlib.Path, outdir: pathlib.Path) -> None:
    # Set up the target in the parent (use_mount is not allowed inside process()).
    target = await localfs.mount_dir_target(outdir)
    # Mount the live component.
    await coco.mount(FolderWatcher, folder, target)
```

## Example: traditional component equivalent

A traditional single-function component:

```python
@coco.fn
async def process_all(data) -> None:
    for key, value in data.items():
        coco.declare_target_state(target.target_state(key, value))
```

is equivalent to this LiveComponent:

```python
class ProcessAll:
    def __init__(self, data):
        self.data = data

    async def process(self) -> None:
        for key, value in self.data.items():
            coco.declare_target_state(target.target_state(key, value))

    async def process_live(self, operator: coco.LiveComponentOperator) -> None:
        await operator.update_full()
        # mark_ready() is called automatically on return.
```

The key difference: the LiveComponent version can later be extended to handle incremental changes in `process_live()` without changing `process()`.

## Example: periodic refresh with `coco.auto_refresh`

A common pattern is "run a regular processor on a fixed schedule." `coco.auto_refresh` wraps any processor function as a LiveComponent that re-runs it on an interval — no need to write `process_live` yourself.

```python
import datetime
import cocoindex as coco

async def sync_users(db, target) -> None:
    rows = await db.fetch_all_users()
    for row in rows:
        target.declare_row(row=UserRow(...))

@coco.fn
async def app_main(db, target) -> None:
    await coco.mount(
        coco.auto_refresh(sync_users, interval=datetime.timedelta(minutes=5)),
        db, target,
    )
```

Behavior:
- The first cycle runs immediately, then `mark_ready` is signaled to the parent.
- In live mode, the loop continues: `sleep(interval) → run process_fn → ...` with a **fixed delay** between cycles (cycles never overlap; a slow cycle just stretches the gap).
- In catch-up mode, the first cycle runs and `auto_refresh` terminates — observationally identical to mounting `process_fn` directly. The interval is ignored.
- Cycle failures are routed through the parent's [exception handler chain](./exception_handlers#exception-handlers) (see `update_full()` above); the loop keeps going.
- **Each cycle reconciles target states against the previous cycle.** If `process_fn` no longer declares some target state (e.g. a row that vanished from your source table), CocoIndex automatically deletes the corresponding target — same garbage-collection model as a regular `coco.mount` re-run. You write the function as if processing from scratch each time; the framework handles the diff. See [Target State](../programming_guide/target_state) for the underlying model.

The returned class's `__init__` accepts the same positional and keyword arguments as `process_fn` — anything you'd pass to `coco.mount(process_fn, ...)` you can pass to `coco.mount(coco.auto_refresh(process_fn, interval=...), ...)`.
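
The scheduling behavior listed above (immediate first cycle, fixed delay, no overlap, failures reported without stopping the loop) can be sketched with plain `asyncio`. This is not the source of `coco.auto_refresh`: `fixed_delay_loop`, `on_error`, and `max_cycles` are hypothetical, and `max_cycles` exists only so the sketch terminates. Readiness signaling is omitted.

```python
import asyncio
from typing import Awaitable, Callable


async def fixed_delay_loop(
    run_cycle: Callable[[], Awaitable[None]],
    on_error: Callable[[BaseException], None],
    interval_seconds: float,
    live: bool,
    max_cycles: int,
) -> int:
    """Run cycles with a fixed delay between them; return how many ran."""
    cycles = 0
    while cycles < max_cycles:
        if cycles > 0:
            # The delay starts after the previous cycle ends,
            # so cycles never overlap.
            await asyncio.sleep(interval_seconds)
        try:
            await run_cycle()
        except Exception as exc:
            on_error(exc)  # report the failure and keep looping
        cycles += 1
        if not live:
            break  # catch-up mode: one cycle, interval ignored
    return cycles


calls: list[int] = []
errors: list[BaseException] = []


async def flaky_cycle() -> None:
    """Example cycle that fails on its second run."""
    calls.append(len(calls))
    if len(calls) == 2:
        raise RuntimeError("transient failure")
```

Because the sleep begins only after a cycle finishes, a slow cycle lengthens the gap between starts rather than causing two cycles to run at once.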

## LiveMapFeed and LiveMapView

**Tip**
For a user-facing introduction to live mode and when to use it, see [Live Mode](../programming_guide/live_mode). This section covers the protocol details for connector authors and advanced use cases.

Two protocols represent live keyed collections that `mount_each()` can consume. Choose based on whether your source can enumerate its current state:

- **`LiveMapView`** — the source has enumerable current state that can be scanned (e.g., a directory listing, database table). Supports both full scans via `__aiter__()` and incremental changes via `watch()`.
- **`LiveMapFeed`** — the source only streams changes, with no "current state" to scan (e.g., a Kafka consumer, a webhook event stream). Provides only `watch()`.

```python
class LiveMapFeed(Protocol[K, V]):
    async def watch(self, subscriber: LiveMapSubscriber[K, V]) -> None: ...

class LiveMapView(LiveMapFeed[K, V], Protocol[K, V]):
    def __aiter__(self) -> AsyncIterator[tuple[K, V]]: ...
```

`LiveMapView` extends `LiveMapFeed` — any `LiveMapView` is also a valid `LiveMapFeed`. When `mount_each()` receives either protocol as its `items` argument, it automatically creates an internal `LiveComponent`:

- **`LiveMapView`**: `__aiter__()` yields all `(key, value)` pairs for full scans (called inside the internal `process()`). `watch()` handles the full lifecycle.
- **`LiveMapFeed`**: `process()` is a no-op (no snapshot to scan). All work happens in `watch()`.

### LiveMapSubscriber

The `subscriber` passed to `watch()` mirrors `LiveComponentOperator`:

| LiveMapSubscriber | LiveComponentOperator | Description |
|---|---|---|
| `await subscriber.update_all()` | `await operator.update_full()` | Full re-scan of all items |
| `await subscriber.mark_ready()` | `await operator.mark_ready()` | Signal readiness |
| `await subscriber.update(key, value)` | `await operator.update(subpath, fn, ...)` | Incremental update; returns `ComponentMountHandle` |
| `await subscriber.delete(key)` | `await operator.delete(subpath)` | Incremental delete; returns `ComponentMountHandle` |

A typical `watch()` implementation for a `LiveMapView`:

```python
async def watch(self, subscriber: LiveMapSubscriber[K, V]) -> None:
    await subscriber.update_all()     # initial full scan
    await subscriber.mark_ready()     # signal readiness
    # ... watch for changes and call subscriber.update()/delete() ...
```
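
To show how the pieces fit together, here is a self-contained sketch that follows the method names of the protocol without importing CocoIndex. `InMemoryView` and `RecordingSubscriber` are hypothetical classes written for this page; in particular, the real `update()` and `delete()` return a `ComponentMountHandle`, while this test double returns nothing, and the way `update_all()` drives the scan here is an assumption.

```python
import asyncio
from typing import AsyncIterator, Optional


class RecordingSubscriber:
    """Hypothetical test double with the subscriber's method names."""

    def __init__(self, view: "InMemoryView") -> None:
        self.view = view
        self.log: list[tuple] = []

    async def update_all(self) -> None:
        # A full scan reads every item from the view's __aiter__().
        async for key, value in self.view:
            self.log.append(("scan", key, value))

    async def mark_ready(self) -> None:
        self.log.append(("ready",))

    async def update(self, key: str, value: str) -> None:
        self.log.append(("update", key, value))

    async def delete(self, key: str) -> None:
        self.log.append(("delete", key))


class InMemoryView:
    """Hypothetical view over a dict plus a fixed list of change events."""

    def __init__(
        self,
        items: dict[str, str],
        changes: list[tuple[str, str, Optional[str]]],
    ) -> None:
        self._items = items
        self._changes = changes

    def __aiter__(self) -> AsyncIterator[tuple[str, str]]:
        return self._scan()

    async def _scan(self) -> AsyncIterator[tuple[str, str]]:
        for key, value in self._items.items():
            yield key, value

    async def watch(self, subscriber: RecordingSubscriber) -> None:
        await subscriber.update_all()   # initial full scan
        await subscriber.mark_ready()   # signal readiness
        for kind, key, value in self._changes:
            if kind == "update":
                await subscriber.update(key, value)
            else:
                await subscriber.delete(key)
```

Running `watch()` against the recording subscriber yields the scan entries first, then the readiness marker, then the incremental changes in order.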

### Implementing live support for a connector

To add live support to a source connector, make your source return an object that implements `LiveMapView` (if the source has scannable state) or `LiveMapFeed` (if it only streams changes):

- **`LiveMapView` example**: The [`localfs`](../connectors/localfs) connector's `DirWalker` — `walk_dir(..., live=True).items()` returns a `LiveMapView` backed by `watchfiles`.
- **`LiveMapFeed` example**: The [`kafka`](../connectors/kafka) connector — `topic_as_map()` returns a `LiveMapFeed` that consumes messages from Kafka topics.

### Single-subscriber contract

A live feed — `LiveMapView`, `LiveMapFeed`, or `LiveStream` — is **single-subscriber**. Its `watch()` typically owns an exclusive underlying resource (a consumer subscription, an OS file watch, a single in-memory change channel) that can't be fanned out to two concurrent callers — e.g. two subscriptions would race one consumer's offset commits. So a given feed instance supports **one active `watch()` (one `mount_each`) at a time**; consuming it from two places concurrently is unsupported. Fan-out to multiple subscribers, if ever needed, belongs in a layer above the feed.

Enforce this in your `watch()` with `cocoindex.connectorkits.SingleWatcherGuard`, which raises `RuntimeError` on a second concurrent call:

```python
from cocoindex.connectorkits import SingleWatcherGuard

class MyView:
    def __init__(self) -> None:  # plus any connector-specific arguments
        self._watch_guard = SingleWatcherGuard("MyView")

    async def watch(self, subscriber) -> None:
        with self._watch_guard:
            ...  # existing body
```
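
To make the contract concrete, here is a sketch of the kind of check such a guard performs. `OneAtATimeGuard` is a hypothetical stand-in written for this page; it is not the source of `SingleWatcherGuard` and may differ from it in detail.

```python
class OneAtATimeGuard:
    """Hypothetical context manager allowing one active `with` block."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._active = False

    def __enter__(self) -> "OneAtATimeGuard":
        if self._active:
            raise RuntimeError(f"{self._name} is already being watched")
        self._active = True
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self._active = False  # release so a later watch() can start
```

A second entry while the first is still active raises `RuntimeError`; once the first block exits, the guard can be entered again.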

## LiveStream

Some change sources are not naturally keyed maps — they deliver opaque event payloads (e.g. cloud-storage event notifications, raw message buses) that the consuming connector translates into map updates on its own terms. For these, CocoIndex provides **`LiveStream`**: a keyless sibling of `LiveMapFeed` — an unkeyed sequence of messages without any "key" or "delete" semantics. Connectors typically wrap a user-supplied `LiveStream` into their own `LiveMapView` (or `LiveMapFeed`) by translating each event into the appropriate key-level operation.

```python
class LiveStream(Protocol[M]):
    async def watch(self, subscriber: LiveStreamSubscriber[M]) -> None: ...

class LiveStreamSubscriber(Protocol[M]):
    async def send(self, message: M) -> ReadyAwaitable: ...
    async def mark_ready(self) -> None: ...
```

`send()` is awaited by the producer to obtain a `ReadyAwaitable` handle; the handle's `ready()` is then awaited later (often by an internal offset/commit tracker) to signal that the message has been fully processed. The stream awaits each `send()` inline, so a slow subscriber naturally throttles the producer — at most one message is in flight per stream, bounding adapter memory regardless of producer rate.

`mark_ready()` must return promptly — it is awaited inline in the producer's loop. Subscribers that need to wait on additional preconditions (e.g. a pending scan) should spawn a background task and return.
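
The back-pressure property can be demonstrated with a toy producer and subscriber. Both are hypothetical: the real `send()` returns a `ReadyAwaitable`, which this sketch leaves out, and `SlowSubscriber` and `produce` are names invented for illustration.

```python
import asyncio


class SlowSubscriber:
    """Hypothetical subscriber that takes a moment to accept each message."""

    def __init__(self) -> None:
        self.in_flight = 0
        self.max_in_flight = 0
        self.received: list[bytes] = []

    async def send(self, message: bytes) -> None:
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(0.001)  # simulate slow downstream work
        self.received.append(message)
        self.in_flight -= 1


async def produce(subscriber: SlowSubscriber, messages: list[bytes]) -> None:
    """Await each send() inline, so the next message waits for the last."""
    for message in messages:
        await subscriber.send(message)
```

Because `produce` awaits every `send()` before moving on, the subscriber never sees more than one message in flight, however fast the producer could otherwise go.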

### Implementing a LiveStream-based connector

A connector built on `LiveStream` typically:

1. Accepts a user-supplied `LiveStream[...]` as a constructor argument — the user wires up the underlying transport.
2. Wraps the stream into a `LiveMapView` (or `LiveMapFeed`): `__aiter__()` runs an initial scan if the source has scannable state; `watch()` subscribes to the stream and translates each event into `subscriber.update(key, value)` / `subscriber.delete(key)`.
3. Lets the underlying `LiveStream` handle offset commits, readiness signaling, and back-pressure.
4. Guards its `watch()` against concurrent subscription — see [Single-subscriber contract](#single-subscriber-contract) — with `connectorkits.SingleWatcherGuard`.
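
Step 2 above, translating an opaque event into a key-level operation, might look like the following. The JSON event shape (`type`, `object`, `etag`) is invented for this sketch and does not match any specific provider's format; `translate_event` is likewise a hypothetical helper.

```python
import json
from typing import Optional


def translate_event(payload: bytes) -> tuple[str, str, Optional[str]]:
    """Map one opaque payload to ("update" | "delete", key, value)."""
    event = json.loads(payload)
    key = event["object"]
    if event["type"] == "deleted":
        return ("delete", key, None)
    # Any other event type is treated as an update in this sketch.
    return ("update", key, event.get("etag"))
```

A connector's `watch()` would then call `subscriber.update(key, value)` or `subscriber.delete(key)` depending on the first element of the returned tuple.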

**Example**: The [`oci_object_storage`](../connectors/oci_object_storage#live-bucket-watching) connector takes a `LiveStream[bytes]` (typically [`kafka.topic_as_stream(...).payloads()`](../connectors/kafka#as-a-live-stream)) and translates OCI Object Storage events into bucket-keyed updates. The Kafka connector exposes `topic_as_stream()` precisely so that downstream connectors like this one can plug into Kafka without inheriting the keyed-map semantics of `topic_as_map()`.

## Live mode vs catch-up mode

See [Live Mode](../programming_guide/live_mode) for how the two modes are enabled at the app level and an overview of how they work.

For a manual `LiveComponent`, the mode controls whether `process_live()` continues running after `mark_ready()`:

- **Live mode** (`live=True`): `process_live()` continues after `mark_ready()` — the component keeps watching for changes.
- **Catch-up mode** (`live=False`, default): `process_live()` terminates as soon as `mark_ready()` is awaited. No code after `await operator.mark_ready()` executes, so the component behaves like a traditional one-shot processor.

This lets you use the same `LiveComponent` class in both modes without code changes.
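
The difference between the two modes can be simulated with a toy operator. Everything here is hypothetical, including the private exception used to stop `process_live` in catch-up mode: it is one way to reproduce the observable behavior, not a description of the mechanism CocoIndex uses.

```python
import asyncio


class _StopAfterReady(Exception):
    """Private signal used by this sketch only."""


class ToyOperator:
    def __init__(self, live: bool) -> None:
        self.live = live
        self.ready = False

    async def mark_ready(self) -> None:
        self.ready = True
        if not self.live:
            raise _StopAfterReady  # catch-up mode: stop right here


async def run_process_live(process_live, live: bool) -> bool:
    """Run `process_live` and report whether readiness was signaled."""
    operator = ToyOperator(live)
    try:
        await process_live(operator)
    except _StopAfterReady:
        pass
    return operator.ready


steps: list[str] = []


async def example_process_live(operator: ToyOperator) -> None:
    steps.append("full scan")
    await operator.mark_ready()
    steps.append("watch for changes")  # reached in live mode only
```

With `live=False` only the full scan step runs; with `live=True` the code after `mark_ready()` runs as well.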

## Restrictions

### No `use_mount()` inside `process()`

`process()` may only call `coco.mount()` (background child mounts). Any setup that requires `use_mount()` — such as declaring target tables — must be done in the **parent** component before mounting the LiveComponent. This keeps the controller's provider set stable across full and incremental updates.

### Not allowed in `use_mount()`

LiveComponent classes can only be used with `coco.mount()` and `operator.update()`. Passing a LiveComponent class to `coco.use_mount()` raises a `TypeError`.

**Tip**
LiveComponent classes cannot be passed to `mount_each()` either, but you can get live watching behavior more easily by passing a [`LiveMapFeed` or `LiveMapView`](#livemapfeed-and-livemapview): `mount_each()` automatically creates an internal LiveComponent when it detects one.

## Readiness

The parent's `await handle.ready()` returns when `mark_ready()` is called inside `process_live()`, regardless of whether `process_live()` is still running.

```mermaid
sequenceDiagram
    participant Parent
    participant LC as LiveComponent
    participant Op as Operator
    participant Children

    Parent->>LC: mount()
    activate LC
    Note over LC: process_live(operator)

    LC->>Op: update_full()
    activate Op
    Op->>LC: process()
    activate LC
    LC->>Children: mount(A), mount(B), ...
    deactivate LC
    Note over Op: submit + GC<br/>wait children ready
    Op-->>LC: done
    deactivate Op

    LC->>Op: mark_ready()
    Op-->>Parent: readiness resolved

    rect rgb(240, 248, 255)
        Note over LC,Children: Live mode only
        LC->>Op: update(A)
        Op->>Children: run A
        LC->>Op: delete(B)
        Op->>Children: delete B
        Note over LC: continues...
    end
    deactivate LC
```

If `process_live()` returns without calling `mark_ready()`, `mark_ready()` is called automatically, so the parent will not hang.

---

# Building a custom target connector

Source: https://cocoindex.io/docs/advanced_topics/custom_target_connector/

A **custom target connector** is the mechanism that connects CocoIndex's declarative target state system to external systems. When you call methods like `dir_target.declare_file()` or `table_target.declare_row()`, a target connector handles the actual synchronization — determining what changed and applying those changes to the external system.

## When to create a custom target connector

Most users will use built-in connectors (like `localfs` or `postgres`) and never need to create their own. Consider creating a custom target connector when:

- You need to integrate with an external system not covered by existing connectors
- You need custom change detection logic (e.g., content-based fingerprinting)
- You need to manage hierarchical target states (containers with children)

**Tip — Start Simple**
For simple use cases where you just need to write data to an external system without sophisticated change tracking, consider using a regular function with memoization instead. Target state providers are most valuable when you need CocoIndex to track and clean up target states automatically.

## Key data types

This section introduces the key data types. Each is marked as either **you implement** or **CocoIndex provides** to clarify responsibilities.

### TargetHandler *(you implement)*

A `TargetHandler` implements the reconciliation logic. It's a protocol with one required method and one optional method:

```python
class TargetHandler(Protocol[ValueT, TrackingRecordT, OptChildHandlerT]):
    def reconcile(
        self,
        key: StableKey,
        desired_target_state: ValueT | NonExistenceType,
        prev_possible_records: Collection[TrackingRecordT],
        prev_may_be_missing: bool,
        /,
    ) -> TargetReconcileOutput[Any, TrackingRecordT, OptChildHandlerT] | None:
        ...

    # Optional: override to support attachment types (see "Implementing attachment providers")
    def attachments(self) -> dict[str, TargetHandler]:
        return {}  # Default: no attachments
```

**Type Parameters:**

- `ValueT`: The specification for the target state (e.g., file content, row data)
- `TrackingRecordT`: What's stored to detect changes on future runs
- `OptChildHandlerT`: The child handler type, or `None` for leaf targets

**Parameters:**

- `key`: `StableKey` — a union of `None | bool | int | str | bytes | uuid.UUID | Symbol | tuple[StableKey, ...]`
- `desired_target_state`: What the user declared, or `NON_EXISTENCE` if no longer declared
- `prev_possible_records`: Tracking records from previous runs (may have multiple due to interrupted updates)
- `prev_may_be_missing`: If `True`, the target state might not exist in the external system

**Returns:**

- `TargetReconcileOutput` if an action is needed
- `None` if no changes are required

**Warning — Non-blocking**
The `reconcile()` method must be **non-blocking**. It should only compare states and return an action — actual I/O operations happen later in the `TargetActionSink`.

**Info — Type annotations**
Annotate the `prev_possible_records` parameter with `Collection[YourTrackingRecord]` so CocoIndex can properly reconstruct stored tracking records during deserialization. See [Serialization](../programming_guide/serialization) for details on supported types.

### Tracking record *(you define)*

A **tracking record** captures the essential information needed to detect changes. Good tracking records:

- Are **minimal**: Only include what's needed for change detection
- Are **deterministic**: Same input always produces the same record
- Are **serializable**: Must be persistable (typically a NamedTuple or dataclass). Dataclasses and NamedTuples are serialized with msgspec automatically. For types requiring pickle, use `@coco.serialize_by_pickle`.

```python
# Example: File tracking record
@dataclass(frozen=True, slots=True)
class _FileTrackingRecord:
    fingerprint: bytes  # Content hash for change detection
```

**Tip — Fingerprinting**
For content-based change detection, use the `connectorkits.fingerprint` utilities. This lets you detect changes without storing the full content:

```python
from cocoindex.connectorkits.fingerprint import fingerprint_bytes, fingerprint_str, fingerprint_object

# For raw bytes
fp = fingerprint_bytes(content)

# For strings
fp = fingerprint_str(text)

# For arbitrary objects (uses memo key mechanism)
fp = fingerprint_object(obj)
```
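To make the idea concrete without depending on CocoIndex, here is a minimal sketch that uses `hashlib` as a stand-in for the fingerprint utilities. The helper names `toy_fingerprint` and `has_changed` and the choice of BLAKE2b are assumptions for illustration; the real `connectorkits.fingerprint` functions may use a different algorithm and digest size.

```python
import hashlib


def toy_fingerprint(content: bytes) -> bytes:
    """Hypothetical stand-in for fingerprint_bytes(): a fixed-size content digest."""
    return hashlib.blake2b(content, digest_size=16).digest()


def has_changed(new_content: bytes, stored_fingerprint: bytes) -> bool:
    """Compare against the stored digest instead of the stored content."""
    return toy_fingerprint(new_content) != stored_fingerprint


# What a tracking record would keep: 16 bytes, regardless of content size.
stored = toy_fingerprint(b"hello world")
```

With this in place, `has_changed(b"hello world", stored)` is false and `has_changed(b"hello, world", stored)` is true, and the tracking record never holds the content itself.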

### Action and TargetActionSink *(you implement)*

An **action** (you define) describes what operation to perform on the external system:

```python
# Example: File action
class _FileAction(NamedTuple):
    path: pathlib.Path
    content: bytes | None  # None means delete
```

A **TargetActionSink** batches and executes actions:

```python
# Sync sink
sink = coco.TargetActionSink.from_fn(apply_actions)

# Async sink
sink = coco.TargetActionSink.from_async_fn(apply_actions_async)
```

The sink function receives `context_provider` as its first argument (for looking up connections from the environment), followed by a sequence of actions. For container targets, it returns child handler definitions:

```python
from collections.abc import Sequence

import cocoindex as coco

def apply_actions(
    context_provider: coco.ContextProvider,
    actions: Sequence[_FileAction],
) -> list[coco.ChildTargetDef[_ChildHandler] | None]:
    outputs = []
    for action in actions:
        if action.content is None:
            action.path.unlink(missing_ok=True)
            outputs.append(None)
        else:
            action.path.write_bytes(action.content)
            # Return a child handler for directories. `is_directory` is assumed
            # to be an extra field on the action; the `_FileAction` shown above
            # does not define it.
            if action.is_directory:
                outputs.append(coco.ChildTargetDef(handler=_ChildHandler(action.path)))
            else:
                outputs.append(None)
    return outputs
```

### TargetReconcileOutput *(you return)*

`TargetReconcileOutput` bundles what `reconcile()` returns when an action is needed:

```python
class TargetReconcileOutput(Generic[ActionT, TrackingRecordT, OptChildHandlerT], NamedTuple):
    action: ActionT                                    # What to do
    sink: TargetActionSink[ActionT, OptChildHandlerT]  # How to execute it
    tracking_record: TrackingRecordT | NonExistenceType  # What to remember
    child_invalidation: Literal["destructive", "lossy"] | None = None  # For container targets
```

The `child_invalidation` field is only relevant for **container targets** (those with children). See [Child invalidation](#child-invalidation) for details.

### TargetStateProvider *(CocoIndex provides)*

A `TargetStateProvider` is a factory that creates `TargetState` objects. You don't implement this class — CocoIndex gives you one when you register a handler or declare a target state with children.

```python
# You get a provider from registration
provider = coco.register_root_target_states_provider("my.target", handler)

# Or from declaring a parent target state
child_provider = coco.declare_target_state_with_child(parent_target_state)

# Or from an attachment on a resolved child provider (see "Implementing attachment providers")
att_provider = child_provider.attachment("vector_index")
```

### TargetState *(CocoIndex provides)*

A `TargetState` wraps a key and spec. You create these using the provider, then declare them:

```python
# Create a target state
target_state = provider.target_state(key, spec)

# Declare it for reconciliation
coco.declare_target_state(target_state)
```

## Implementing root target states

This section covers root target states — those not nested inside another target.

### Life of a root target state

Here is what happens at runtime:

1. **Registration**: You define a `TargetHandler` and call `register_root_target_states_provider()`. CocoIndex returns a `TargetStateProvider` — a factory for creating target states associated with your handler.

2. **Declaration**: During execution, user code calls `provider.target_state(key, spec)` to create `TargetState` objects, then `declare_target_state()` to declare them. CocoIndex collects all declared target states.

3. **Reconciliation**: When the processing unit finishes, CocoIndex calls your handler's `reconcile()` method for each target state. For declared target states, `desired_target_state` contains the spec; for previously declared but now missing states, `desired_target_state` is `NON_EXISTENCE` (triggering cleanup). Your `reconcile()` compares the desired state with previous records and returns `TargetReconcileOutput` if an action is needed, or `None` if no changes are required.

4. **Action Execution**: CocoIndex batches actions by their `TargetActionSink` and executes them. The sink applies changes to the external system (database writes, file operations, API calls, etc.).

5. **Tracking Persistence**: After successful execution, CocoIndex persists the new tracking records. On the next run, these become the `prev_possible_records` for change detection.

**Note — Multiple Previous States**
Due to interrupted updates, `prev_possible_records` may contain multiple records. CocoIndex tracks all possible states until a successful update confirms the current state. Your reconciliation logic should handle this by generating actions that work correctly regardless of which previous state actually holds in the external system.
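The five stages above can be traced with a toy model that has no CocoIndex dependency. Everything in it is hypothetical: `NON_EXISTENCE` is a local sentinel, plain dicts stand in for the tracking store and the external system, and the sketch assumes that a key with no tracking record may be missing externally. It illustrates the flow, not the engine.

```python
import hashlib

NON_EXISTENCE = object()  # toy sentinel standing in for coco.NON_EXISTENCE


def fingerprint(value: str) -> bytes:
    return hashlib.sha256(value.encode("utf-8")).digest()


def reconcile(key, desired, prev_records, prev_may_be_missing):
    """Compare desired vs. tracked state; return (action, record) or None."""
    if desired is NON_EXISTENCE:
        if not prev_records and not prev_may_be_missing:
            return None
        return ("delete", key, None), NON_EXISTENCE
    fp = fingerprint(desired)
    if not prev_may_be_missing and all(rec == fp for rec in prev_records):
        return None
    return ("upsert", key, desired), fp


def run_once(declared: dict, tracking: dict, external: dict) -> list:
    """One toy update cycle: reconcile every key, apply actions, persist records."""
    actions, new_records = [], {}
    for key in sorted(declared.keys() | tracking.keys()):
        desired = declared.get(key, NON_EXISTENCE)
        prev = tracking.get(key, [])
        # Toy assumption: no tracking record means the target may be missing.
        out = reconcile(key, desired, prev, not prev)
        if out is not None:
            action, record = out
            actions.append(action)
            new_records[key] = record
    for op, key, value in actions:  # the "sink": apply the whole batch
        if op == "delete":
            external.pop(key, None)
        else:
            external[key] = value
    for key, record in new_records.items():  # persist tracking after success
        if record is NON_EXISTENCE:
            tracking.pop(key, None)
        else:
            tracking[key] = [record]
    return actions
```

Running `run_once` twice with the same declarations produces actions only the first time. Dropping a key from the declarations on a later run produces a delete action for it, which mirrors the `NON_EXISTENCE` cleanup path.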

### Step 1: Define your types

Start by defining the types for your provider:

```python
from typing import Any, Collection, NamedTuple, Sequence
from dataclasses import dataclass
import cocoindex as coco

# Key: StableKey is used to identify target states — it's a union type:
#   None | bool | int | str | bytes | uuid.UUID | Symbol | tuple[StableKey, ...]

# Value: What the user declares
@dataclass
class _RowSpec:
    data: dict[str, Any]

# Tracking Record: What to persist for change detection
@dataclass(frozen=True, slots=True)
class _RowTrackingRecord:
    fingerprint: bytes

# Action: What operation to perform
class _RowAction(NamedTuple):
    key: coco.StableKey
    data: dict[str, Any] | None  # None = delete
```

### Step 2: Implement the handler

```python
class _RowHandler(coco.TargetHandler[_RowSpec, _RowTrackingRecord]):
    """Handler for database rows."""

    def __init__(self, connection: DatabaseConnection, table: str):
        self._conn = connection
        self._table = table
        self._sink = coco.TargetActionSink.from_async_fn(self._apply_actions)

    async def _apply_actions(
        self, context_provider: coco.ContextProvider, actions: Sequence[_RowAction]
    ) -> None:
        # Connection was passed in __init__ — context_provider not needed here,
        # but must be accepted per the sink protocol.
        for action in actions:
            if action.data is None:
                await self._conn.delete(self._table, action.key)
            else:
                await self._conn.upsert(self._table, action.key, action.data)

    def _compute_fingerprint(self, data: dict[str, Any]) -> bytes:
        from cocoindex.connectorkits.fingerprint import fingerprint_object
        return fingerprint_object(data)

    def reconcile(
        self,
        key: coco.StableKey,
        desired_target_state: _RowSpec | coco.NonExistenceType,
        prev_possible_records: Collection[_RowTrackingRecord],
        prev_may_be_missing: bool,
        /,
    ) -> coco.TargetReconcileOutput[_RowAction, _RowTrackingRecord] | None:
        # Handle deletion
        if coco.is_non_existence(desired_target_state):
            if not prev_possible_records and not prev_may_be_missing:
                return None  # Nothing to delete
            return coco.TargetReconcileOutput(
                action=_RowAction(key=key, data=None),
                sink=self._sink,
                tracking_record=coco.NON_EXISTENCE,
            )

        # Handle upsert
        target_fp = self._compute_fingerprint(desired_target_state.data)

        # Skip if unchanged
        if not prev_may_be_missing and all(
            prev.fingerprint == target_fp for prev in prev_possible_records
        ):
            return None

        return coco.TargetReconcileOutput(
            action=_RowAction(key=key, data=desired_target_state.data),
            sink=self._sink,
            tracking_record=_RowTrackingRecord(fingerprint=target_fp),
        )
```

### Step 3: Register the provider

For root-level target states (not nested within another target), register a provider:

```python
_row_provider = coco.register_root_target_states_provider(
    "mycompany.io/mydb/row",  # Unique provider name
    _RowHandler(connection, table),
)
```

### Step 4: Create user-facing APIs

Wrap the provider in a user-friendly API:

```python
class TableTarget:
    """User-facing API for declaring rows."""

    def __init__(self, provider: coco.TargetStateProvider[_RowSpec, None]):
        self._provider = provider

    def declare_row(self, *, row: dict[str, Any], key: tuple[str, ...]) -> None:
        spec = _RowSpec(data=row)
        target_state = self._provider.target_state(key, spec)
        coco.declare_target_state(target_state)
```

## Implementing container targets

Container targets (directories, tables) have children (files, rows). This section covers how non-root target states work and how to implement them.

### Non-root target states

For targets **nested inside another target** (e.g., files inside a directory), the lifecycle is similar to root targets but **how you get the provider is different**.

For root targets, you call `register_root_target_states_provider()` and immediately get a provider with your handler. For non-root targets, the handler comes from the **parent's sink execution**:

1. **Declaration**: Call `declare_target_state_with_child(parent_ts)` — returns an **unresolved** child provider immediately
2. **Resolution**: When the parent reconciles and its sink executes, the sink returns `ChildTargetDef(handler=...)`. CocoIndex resolves the child provider with this handler.
3. **Usage**: The child provider can now create child target states, which follow the same reconciliation → execution → tracking flow as root targets.

The child handler often needs context from the parent's action execution. For example, a file handler needs to know the directory path that was created. By returning the handler from the parent's sink, the handler has access to this runtime context.

### Child invalidation

When a container target undergoes certain changes, the child target states may be affected. The `child_invalidation` field in `TargetReconcileOutput` lets you signal this to CocoIndex:

- **`"destructive"`** — The container change destroys all existing children (e.g., a primary key change that requires dropping and recreating a table). CocoIndex will ignore all previous tracking records for children under this container and treat them as new.

- **`"lossy"`** — The container change may cause data loss for existing children (e.g., a schema change that removes columns). CocoIndex will force an upsert for all children by setting `prev_may_be_missing=True`, even if their data appears unchanged.

- **`None`** (default) — No impact on children. Normal change detection applies.

Set `child_invalidation` in the **parent handler's** `reconcile()` method when you detect that the container itself has changed in a way that affects its children:

```python
class _DirHandler(coco.TargetHandler[_DirSpec, _DirTrackingRecord, _EntryHandler]):
    def reconcile(self, key, desired_target_state, prev_possible_records, prev_may_be_missing, /):
        # Detect if the container change is destructive or lossy
        invalidation = None
        if self._is_destructive_change(desired_target_state, prev_possible_records):
            invalidation = "destructive"
        elif self._is_lossy_change(desired_target_state, prev_possible_records):
            invalidation = "lossy"

        return coco.TargetReconcileOutput(
            action=_DirAction(...),
            sink=self._sink,
            tracking_record=_DirTrackingRecord(...),
            child_invalidation=invalidation,
        )
```

### Step 1: Define parent and child handlers

The parent handler reconciles the container itself. The child handler reconciles entries within it:

```python
# Parent handler for directory
class _DirHandler(coco.TargetHandler[_DirSpec, _DirTrackingRecord, _EntryHandler]):
    def reconcile(self, key, desired_target_state, prev_possible_records, prev_may_be_missing, /):
        # Reconcile the directory itself
        ...

# Child handler for entries within a directory
class _EntryHandler(coco.TargetHandler[_EntrySpec, _EntryTrackingRecord]):
    def __init__(self, base_path: pathlib.Path):
        self._base_path = base_path

    def reconcile(self, key, desired_target_state, prev_possible_records, prev_may_be_missing, /):
        # Reconcile files/subdirs within the directory
        path = self._base_path / key
        ...
```

### Step 2: Return child handlers from the sink

The parent's sink creates the container and returns child handlers:

```python
import shutil
from collections.abc import Sequence

def _apply_dir_actions(
    context_provider: coco.ContextProvider,
    actions: Sequence[_DirAction],
) -> list[coco.ChildTargetDef[_EntryHandler] | None]:
    outputs = []
    for action in actions:
        if action.should_delete:
            shutil.rmtree(action.path, ignore_errors=True)
            outputs.append(None)  # No child handler for deleted directories
        else:
            action.path.mkdir(parents=True, exist_ok=True)
            # Return child handler with the created path
            outputs.append(coco.ChildTargetDef(handler=_EntryHandler(action.path)))
    return outputs
```

### Step 3: Create user-facing API

The user-facing API uses `declare_target_state_with_child()` and exposes methods for declaring children:

```python
import pathlib
from typing import cast


class DirTarget:
    """User-facing API for declaring files in a directory."""

    def __init__(self, provider: coco.TargetStateProvider[_EntrySpec, None]):
        self._provider = provider

    def declare_file(self, filename: str, content: bytes) -> None:
        spec = _EntrySpec(content=content)
        target_state = cast(
            coco.TargetState[None],
            self._provider.target_state(filename, spec),
        )
        coco.declare_target_state(target_state)


@coco.fn
def declare_dir_target(path: pathlib.Path) -> DirTarget:
    """Declare a directory target and return an API for declaring files."""
    parent_ts = _root_provider.target_state(
        key=str(path),
        value=_DirSpec(),
    )
    # Child provider is pending until parent sink runs
    child_provider = coco.declare_target_state_with_child(parent_ts)
    return DirTarget(child_provider)
```

## Implementing attachment providers

Attachment providers let a child handler expose **auxiliary target states** alongside its regular children. For example, a PostgreSQL table handler manages rows as regular children, but can also manage vector indexes and SQL command attachments as separate attachment types — each tracked independently.

### When to use attachments

Use attachments when a target has auxiliary state beyond its primary children — indexes, triggers, materialized views, or any side-resource that should be managed alongside the main data. Attachments use **symbol keys** as namespace separators so they never conflict with regular child keys.

### Target state path hierarchy

Attachments create additional levels in the target state path using symbol keys (denoted with `@` prefix in documentation):

```
@my_connector/table                    [level 1 — root provider]
  (db_key, schema, table_name)         [level 2 — table state]
    @vector_index                      [level 3 — attachment namespace (symbol key)]
      index_name_1                     [level 4 — attachment instance]
      index_name_2                     [level 4]
    @sql_command_attachment             [level 3 — another attachment namespace]
      cmd_name_1                       [level 4]
    row_pk_1                           [level 3 — regular child (row)]
    row_pk_2                           [level 3]
```

The symbol keys (`@vector_index`, `@sql_command_attachment`) are path namespaces — not target states themselves. They separate attachment instances from regular children at the same level.
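The following sketch shows why a symbol key cannot collide with a regular child key. `ToySymbol` is a hypothetical stand-in for CocoIndex's symbol type, and the tuple paths are only a way to picture the hierarchy above, not the engine's internal representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySymbol:
    """Hypothetical stand-in for a symbol key; never equal to a plain string."""

    name: str


table_path = ("@my_connector/table", ("my_pg", "public", "docs"))

# A regular child whose primary key happens to be the string "vector_index".
row_path = table_path + ("vector_index",)

# An attachment instance under the symbol-keyed namespace of the same name.
index_path = table_path + (ToySymbol("vector_index"), "index_name_1")

# Both can be tracked side by side because the level-3 keys differ in type.
tracked = {row_path: "row record", index_path: "index record"}
```

Because `ToySymbol("vector_index")` and the string `"vector_index"` compare unequal, the row and the attachment namespace occupy distinct paths even though their names match.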

### How it works

1. **Handler implements `attachments()`**: The child handler (e.g., row handler) returns a dict of all supported attachment types and their handlers.
2. **Engine eagerly registers attachment providers**: When the child handler is resolved, the engine calls `attachments()` and registers all returned types. This ensures orphaned attachments can be cleaned up even when not declared in the current run.
3. **User code calls `provider.attachment()`**: On a resolved child provider, this retrieves the cached attachment sub-provider.
4. **Target states declared under the attachment provider** are tracked independently from regular children.

### Step 1: Implement the attachment handler

An attachment handler is just a regular `TargetHandler` — it implements `reconcile()` and has an action sink:

```python
class _VectorIndexSpec(NamedTuple):
    column: str
    metric: str
    method: str

class _VectorIndexAction(NamedTuple):
    name: str
    spec: _VectorIndexSpec | None  # None means delete

class _VectorIndexHandler:
    def __init__(self, pool, table_name, schema_name):
        self._pool = pool
        self._table_name = table_name
        self._schema_name = schema_name
        self._sink = coco.TargetActionSink.from_async_fn(self._apply_actions)

    async def _apply_actions(
        self, context_provider: coco.ContextProvider, actions: Sequence[_VectorIndexAction]
    ) -> None:
        async with self._pool.acquire() as conn:
            for action in actions:
                if action.spec is None:
                    await conn.execute(f'DROP INDEX IF EXISTS "{action.name}"')
                else:
                    await conn.execute(f'CREATE INDEX "{action.name}" ...')

    def reconcile(self, key, desired_target_state, prev_possible_records, prev_may_be_missing, /):
        # Standard reconcile pattern — compare fingerprints, return action or None
        ...
```

### Step 2: Add `attachments()` to the parent handler

The parent handler (which manages regular children) returns a dict of all supported attachment types:

```python
class _RowHandler(coco.TargetHandler[_RowValue, _RowFingerprint]):
    def __init__(self, pool, table_name, schema_name, table_schema):
        self._pool = pool
        self._table_name = table_name
        self._schema_name = schema_name
        # ...

    def attachments(self) -> dict[str, _VectorIndexHandler | _SqlCommandHandler]:
        return {
            "vector_index": _VectorIndexHandler(self._pool, self._table_name, self._schema_name),
            "sql_command_attachment": _SqlCommandHandler(self._pool, self._table_name, self._schema_name),
        }

    def reconcile(self, ...):
        # Regular row reconciliation
        ...
```

### Step 3: Expose attachment APIs on the user-facing target

The user-facing target class calls `provider.attachment()` to get the attachment sub-provider, then declares target states on it:

```python
class TableTarget:
    def __init__(self, provider, table_schema):
        self._provider = provider
        self._table_schema = table_schema

    def declare_row(self, *, row):
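        # Regular child target state. `pk_values` and `row_dict` are assumed to be
        # derived from `row` and the table schema (derivation elided here).
        coco.declare_target_state(self._provider.target_state(pk_values, row_dict))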

    def declare_vector_index(self, *, name, column, metric="cosine", method="ivfflat"):
        spec = _VectorIndexSpec(column=column, metric=metric, method=method)
        att_provider = self._provider.attachment("vector_index")
        coco.declare_target_state(att_provider.target_state(name, spec))
```

**Tip — Tracking records for teardown**
When an attachment has a teardown step (like `DROP INDEX`), store the full spec as the tracking record instead of a fingerprint. This lets you recover the teardown information from `prev_possible_records` when the attachment is deleted or changed. See the `_SqlCommandHandler` in the PostgreSQL connector for an example.

## Best practices

### Use `ContextKey` for external resource identity

When a target connector manages state in an external resource (a database, a search index, an object store, etc.), use a `ContextKey` string as part of the target state key — not connection parameters like host, port, or credentials.

Target state keys must be stable across runs for correct reconciliation. CocoIndex uses keys to match current declarations with previously tracked states. If the key is stable, previously tracked states are associated with the current target, so CocoIndex can correctly reconcile — e.g., deleting rows that are no longer declared. If the key changes (because a connection parameter changed), CocoIndex cannot associate previous tracked states with the current target, and treats the target as being in a cleared state — losing the ability to clean up old data.

`ContextKey` solves this by providing a user-defined stable logical name (e.g., `"my_pg"`) that is decoupled from transient connection details:

```python
# User creates a stable logical name for the resource
db = coco.ContextKey[asyncpg.Pool]("my_pg")

# Target connector uses db.key (the string "my_pg") in the target state key
class _TableKey(NamedTuple):
    db_key: str            # Stable — from ContextKey.key
    schema_name: str | None
    table_name: str

key = _TableKey(db_key=db.key, schema_name=schema_name, table_name=table_name)

# At action time, resolve the live connection from context_provider
pool = context_provider.get(key.db_key, asyncpg.Pool)
```

This way, changing a password, switching replicas, or rotating credentials won't invalidate tracked states or break reconciliation.
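The effect of key stability can be shown with plain dictionaries. In this sketch the two key classes, the `previously_tracked` helper, and the connection strings are all hypothetical; a dict stands in for CocoIndex's tracking store.

```python
from typing import NamedTuple


class ConnectionParamKey(NamedTuple):
    """Anti-pattern: the key embeds transient connection details."""

    dsn: str
    table_name: str


class LogicalKey(NamedTuple):
    """Recommended shape: the key embeds a stable logical name."""

    db_key: str
    table_name: str


def previously_tracked(tracking: dict, key) -> list[str]:
    """Rows tracked under `key` on earlier runs (empty if the key is unknown)."""
    return tracking.get(key, [])


# State recorded by an earlier run, under each keying scheme.
by_params = {ConnectionParamKey("postgres://app:old-pw@db1/prod", "docs"): ["row-1"]}
by_name = {LogicalKey("my_pg", "docs"): ["row-1"]}

# After a password rotation, only the logical key still finds the old state.
lost = previously_tracked(by_params, ConnectionParamKey("postgres://app:new-pw@db1/prod", "docs"))
kept = previously_tracked(by_name, LogicalKey("my_pg", "docs"))
```

Here `lost` is empty while `kept` still contains `"row-1"`: with the parameter-based key, the previously tracked row can no longer be found and therefore cannot be cleaned up.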

See `_TableKey` in [`cocoindex/connectors/postgres/_target.py`](https://github.com/cocoindex-io/cocoindex/blob/main/python/cocoindex/connectors/postgres/_target.py) and [`cocoindex/connectors/surrealdb/_target.py`](https://github.com/cocoindex-io/cocoindex/blob/main/python/cocoindex/connectors/surrealdb/_target.py) for reference implementations.

### Idempotent actions

Actions should be idempotent — applying the same action multiple times should have the same effect as applying it once:

```python
# Good: Idempotent
path.mkdir(parents=True, exist_ok=True)
path.unlink(missing_ok=True)
await conn.execute("INSERT ... ON CONFLICT DO UPDATE ...")

# Bad: Not idempotent
path.mkdir()  # Fails if exists
await conn.execute("INSERT ...")  # Fails on duplicate key
```
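A quick way to check that a set of filesystem actions is idempotent is to apply each one twice. The sketch below does that in a temporary directory; `apply_twice` is a hypothetical helper for illustration, not part of CocoIndex.

```python
import pathlib
import tempfile


def apply_twice() -> tuple[bool, bool]:
    """Apply each idempotent filesystem action twice; report the final state."""
    with tempfile.TemporaryDirectory() as tmp:
        target_dir = pathlib.Path(tmp) / "out" / "nested"
        target_file = target_dir / "data.bin"

        for _ in range(2):  # the second pass simulates a repeated action
            target_dir.mkdir(parents=True, exist_ok=True)
            target_file.write_bytes(b"payload")
        created = target_file.read_bytes() == b"payload"

        for _ in range(2):  # deleting an already-deleted file is a no-op
            target_file.unlink(missing_ok=True)
        removed = not target_file.exists()

        return created, removed
```

Both passes succeed and leave the same end state. Replacing `mkdir(parents=True, exist_ok=True)` with a bare `mkdir()` would raise `FileExistsError` on the second pass.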

### Handle multiple previous states

Due to interrupted updates, `prev_possible_records` may contain multiple records. Design your reconciliation logic to handle this:

```python
# Check if ALL previous states match (conservative approach)
if not prev_may_be_missing and all(
    prev.fingerprint == target_fp for prev in prev_possible_records
):
    return None  # Safe to skip
```
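The same check can be exercised in isolation to see how it behaves when an interrupted update leaves more than one candidate record. This sketch works on raw fingerprints rather than tracking record objects, and `can_skip` is a hypothetical helper name.

```python
from typing import Collection


def can_skip(
    target_fp: bytes,
    prev_fps: Collection[bytes],
    prev_may_be_missing: bool,
) -> bool:
    """Conservative skip check: every candidate previous state must already match."""
    return not prev_may_be_missing and all(fp == target_fp for fp in prev_fps)


# One confirmed previous state that already matches: safe to skip.
unchanged = can_skip(b"v2", [b"v2"], False)

# An interrupted update left two candidates; the target may still hold v1.
ambiguous = can_skip(b"v2", [b"v1", b"v2"], False)

# The record matches, but the target itself may be gone: write it again.
maybe_missing = can_skip(b"v2", [b"v2"], True)
```

Only the first case skips the write. In the ambiguous case, re-applying the action is the safe choice because any of the candidate states could be the one present in the external system.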

### Efficient change detection

Choose tracking records that enable efficient change detection without storing full content:

| Scenario | Tracking Record |
|----------|-----------------|
| File content | Content hash (fingerprint) |
| Database row | Row data hash |
| Schema/structure | Schema definition |
| Directory existence | `None` (presence is enough) |

### Shared action sinks

If all instances of a handler use the same action logic, create a shared sink. The action function must accept `context_provider` as its first argument:

```python
import cocoindex as coco

def _apply_actions(
    context_provider: coco.ContextProvider, actions: Sequence[_MyAction]
) -> None:
    ...

# Module-level shared sink
_shared_sink = coco.TargetActionSink.from_fn(_apply_actions)

class _MyHandler(coco.TargetHandler[...]):
    def reconcile(self, ...):
        return coco.TargetReconcileOutput(
            action=...,
            sink=_shared_sink,  # Reuse the same sink
            tracking_record=...,
        )
```

## Complete example: local file system

Here's a simplified version of the `localfs` connector showing the complete pattern:

```python
from __future__ import annotations
import pathlib
from dataclasses import dataclass
from typing import Collection, NamedTuple, Sequence
import cocoindex as coco
from cocoindex.connectorkits.fingerprint import fingerprint_bytes


# Types
_FileContent = bytes
_FileFingerprint = bytes


class _FileAction(NamedTuple):
    path: pathlib.Path
    content: _FileContent | None  # None = delete


@dataclass(frozen=True, slots=True)
class _FileTrackingRecord:
    fingerprint: _FileFingerprint


# Action execution
def _apply_actions(context_provider: coco.ContextProvider, actions: Sequence[_FileAction]) -> None:
    for action in actions:
        if action.content is None:
            action.path.unlink(missing_ok=True)
        else:
            action.path.parent.mkdir(parents=True, exist_ok=True)
            action.path.write_bytes(action.content)


_file_sink = coco.TargetActionSink[_FileAction, None].from_fn(_apply_actions)


# Handler
class _FileHandler(coco.TargetHandler[_FileContent, _FileTrackingRecord]):
    __slots__ = ("_base_path",)
    _base_path: pathlib.Path

    def __init__(self, base_path: pathlib.Path):
        self._base_path = base_path

    def reconcile(
        self,
        key: coco.StableKey,
        desired_target_state: _FileContent | coco.NonExistenceType,
        prev_possible_records: Collection[_FileTrackingRecord],
        prev_may_be_missing: bool,
        /,
    ) -> coco.TargetReconcileOutput[_FileAction, _FileTrackingRecord] | None:
        path = self._base_path / key

        if coco.is_non_existence(desired_target_state):
            if not prev_possible_records and not prev_may_be_missing:
                return None
            return coco.TargetReconcileOutput(
                action=_FileAction(path=path, content=None),
                sink=_file_sink,
                tracking_record=coco.NON_EXISTENCE,
            )

        target_fp = fingerprint_bytes(desired_target_state)

        if not prev_may_be_missing and all(
            prev.fingerprint == target_fp for prev in prev_possible_records
        ):
            return None

        return coco.TargetReconcileOutput(
            action=_FileAction(path=path, content=desired_target_state),
            sink=_file_sink,
            tracking_record=_FileTrackingRecord(fingerprint=target_fp),
        )
```

See the full implementations in:

- [`cocoindex/connectors/localfs/target.py`](https://github.com/cocoindex-io/cocoindex/blob/main/python/cocoindex/connectors/localfs/_target.py) — File system targets
- [`cocoindex/connectors/postgres/target.py`](https://github.com/cocoindex-io/cocoindex/blob/main/python/cocoindex/connectors/postgres/_target.py) — PostgreSQL tables and rows

---

# The CLI reference

Source: https://cocoindex.io/docs/cli/

The CocoIndex CLI is a standalone tool for managing and inspecting your apps.

## Invoke the CLI

Once CocoIndex is installed, you can invoke the CLI directly using the `cocoindex` command. Most commands require an `APP_TARGET` argument, which tells the CLI where your app definitions are located.

### APP_TARGET Format

The `APP_TARGET` can be:

1. A **Python module name** that contains your app definitions (e.g., `main`, `my_package.apps`).
    You can also use `--app-dir <path>` to specify the base directory to load the module from.

2. A **path to a Python file** defining your apps (e.g., `main.py`, `path/to/my_apps.py`).

    The file is loaded as a top-level Python module, so relative imports will not work because its parent package is not defined (similar to how `python main.py` resolves imports).

3. For commands that operate on a *specific app* (like `show`, `update`), you can combine the application reference with an app name:
    * `my_package.apps:MyApp`
    * `path/to/my_apps.py:MyApp`
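
To make the `APP_TARGET` shapes above concrete, here is a minimal sketch of how such a string could be split into its two parts. The helpers `split_app_target` and `is_file_target` are hypothetical names used only for illustration; they are not part of the CocoIndex CLI, and the real parsing rules may differ.

```python
from __future__ import annotations


def split_app_target(app_target: str) -> tuple[str, str | None]:
    """Split 'module_or_path[:AppName]' into (module_or_path, app_name).

    Hypothetical helper: a sketch of the format, not the CLI's own parser.
    """
    head, sep, tail = app_target.rpartition(":")
    # Treat the suffix as an app name only if it looks like a Python identifier,
    # so a target without an app name is returned unchanged.
    if sep and head and tail.isidentifier():
        return head, tail
    return app_target, None


def is_file_target(module_or_path: str) -> bool:
    """Assumption for this sketch: a target ending in '.py' is a file path."""
    return module_or_path.endswith(".py")


module_target = split_app_target("my_package.apps:MyApp")
file_target = split_app_target("path/to/my_apps.py:MyApp")
bare_target = split_app_target("main")
```

In this sketch, a bare module name such as `main` yields no app name, which matches commands that accept a whole module.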

### Environment Variables

You can set environment variables in an environment file.

* By default, the `cocoindex` CLI searches upward from the current directory for a `.env` file.
* You can use `--env-file <path>` to specify one explicitly:

    ```sh
    cocoindex --env-file path/to/custom.env <COMMAND> ...
    ```

Loaded variables do *NOT* override existing system ones.
If no file is found, only existing system environment variables are used.
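
The lookup and precedence rules above can be sketched in plain Python. This is an illustration under stated assumptions, not the CLI's implementation: `find_env_file` and `load_env_file` are hypothetical helpers, the parser handles only simple `NAME=value` lines, and a plain dict stands in for `os.environ`.

```python
from __future__ import annotations

import pathlib
import tempfile


def find_env_file(start: pathlib.Path) -> pathlib.Path | None:
    """Search `start` and then each parent directory for a `.env` file."""
    for directory in (start, *start.parents):
        candidate = directory / ".env"
        if candidate.is_file():
            return candidate
    return None


def load_env_file(path: pathlib.Path, environ: dict[str, str]) -> None:
    """Load NAME=value lines without overriding variables that already exist."""
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        # setdefault keeps an existing value, so system variables win.
        environ.setdefault(name.strip(), value.strip())


with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / ".env").write_text("API_KEY=from_file\nREGION=from_file\n", encoding="utf-8")
    nested = root / "pkg" / "sub"
    nested.mkdir(parents=True)

    env = {"API_KEY": "from_system"}  # stands in for os.environ
    found = find_env_file(nested)     # found two levels up from `nested`
    if found is not None:
        load_env_file(found, env)
```

After this runs, `API_KEY` keeps its existing value while `REGION` is filled in from the file.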

### Global Options

CocoIndex CLI supports the following global options:

* `-e, --env-file <path>`: Load environment variables from a specified `.env` file. If not provided, `.env` in the current directory is loaded if it exists.
* `-d, --app-dir <path>`: Load apps from the specified directory. It will be treated as part of `PYTHONPATH`. Defaults to the current directory.
* `-V, --version`: Show the CocoIndex version and exit.
* `--help`: Show the main help message and exit.

## Subcommands Reference

### `drop`

Drop an app and all its target states.

This will:

- Revert all target states created by the app (e.g., drop tables, delete rows)
- Clear the app's internal state database

`APP_TARGET`: `path/to/app.py`, `module`, `path/to/app.py:app_name`, or
`module:app_name`.

**Usage:**

```bash
cocoindex drop [OPTIONS] APP_TARGET
```

**Options:**

| Option | Description |
|--------|-------------|
| `-f, --force` | Skip confirmation prompt. |
| `-q, --quiet` | Avoid printing anything to the standard output, e.g. statistics. |
| `--help` | Show this message and exit. |

---

### `init`

Initialize a new CocoIndex project.

Creates a new project directory with starter files:

1. `main.py` (main application file)
2. `pyproject.toml` (project metadata and dependencies)
3. `README.md` (quick start guide)

`PROJECT_NAME`: Name of the project (defaults to current directory name if
not specified).

**Usage:**

```bash
cocoindex init [OPTIONS] [PROJECT_NAME]
```

**Options:**

| Option | Description |
|--------|-------------|
| `--dir DIRECTORY` | Directory to create the project in. |
| `--help` | Show this message and exit. |

---

### `ls`

List all apps.

If `APP_TARGET` (`path/to/app.py` or `module`) is provided, lists apps
defined in that module and their persisted status, grouped by environment.

If `APP_TARGET` is omitted and `--db` is provided, lists all apps from the
specified database.

**Usage:**

```bash
cocoindex ls [OPTIONS] [APP_TARGET]
```

**Options:**

| Option | Description |
|--------|-------------|
| `--db TEXT` | Path to database to list apps from (only used when APP_TARGET is not specified). |
| `--help` | Show this message and exit. |

---

### `show`

Show the app's stable paths.

If `APP_TARGET` is provided, loads the app from the module.
Otherwise, `--db` and `--app-name` can be used to inspect an app
directly from its database without loading the module.

**Usage:**

```bash
cocoindex show [OPTIONS] [APP_TARGET] [STABLE_PATH]
```

**Options:**

| Option | Description |
|--------|-------------|
| `--db TEXT` | Path to database (used with --app-name when APP_TARGET is not specified). |
| `--app-name TEXT` | App name to inspect (used with --db when APP_TARGET is not specified). |
| `--tree` | Display stable paths as a tree with component annotations. |
| `-l, --long` | Display detailed information in multi-line format. |
| `-r, --recursive` | Show all children recursively (requires stable_path). |
| `-p, --parents` | Show all parent paths (requires stable_path). |
| `--help` | Show this message and exit. |

---

### `update`

Run an app in catch-up mode. With `--live`, run in live mode.

`APP_TARGET`: `path/to/app.py`, `module`, `path/to/app.py:app_name`, or
`module:app_name`.

**Usage:**

```bash
cocoindex update [OPTIONS] APP_TARGET
```

**Options:**

| Option | Description |
|--------|-------------|
| `-f, --force` | Skip confirmation prompt. |
| `-q, --quiet` | Avoid printing anything to the standard output, e.g. statistics. |
| `--reset` | Drop existing setup before updating (equivalent to running 'cocoindex drop' first). |
| `--full-reprocess` | Reprocess everything and invalidate existing caches. |
| `-L, --live` | Run in live mode (live components continue processing after initial update). |
| `--preview` | Compute target actions without applying them. Prints planned actions. |
| `--help` | Show this message and exit. |

---

# Frequently asked questions (FAQ)

Source: https://cocoindex.io/docs/faq/

## Change detection

### Why do logic changes propagate transitively but input changes don't?

In the call chain `foo(a)` → `bar(b)`:

- **Logic changes propagate**: if `bar`'s logic changes (code, `deps`, `version`), the output of `foo(a)` could be different too, so `foo`'s memo must be invalidated.
- **Input changes don't propagate**: `b` is the result of applying part of `foo`'s logic to `a`. As long as `foo`'s logic and `a` are unchanged, `b` won't change — there's nothing to propagate.

### How does logic change propagation work?

Logic changes propagate based on **runtime invocations**, not static call graphs. Two consequences:

- **Unannotated functions don't break the chain.** If `f1()` → `f2()` → `f3()`, and `f1` and `f3` are decorated with `@coco.fn` but `f2` is not, a logic change in `f3` still invalidates `f1`'s memo.
- **Conditional calls are tracked precisely.** If `f1()` calls `f2()` only in one branch, then invocations of `f1()` that didn't call `f2()` are not invalidated when `f2`'s logic changes — only invocations that actually called `f2()` are affected.
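
The two consequences above can be reproduced with a toy model. The sketch below is not CocoIndex's implementation: `ToyEngine` is a hypothetical class in which each memo entry records the logic versions of every tracked function that actually ran during that invocation, and a version number stands in for a function's logic fingerprint.

```python
from __future__ import annotations

from typing import Any, Callable


class ToyEngine:
    """Toy memoizer: each memo entry records the logic versions it actually ran."""

    def __init__(self, versions: dict[str, int]) -> None:
        self.versions = versions                 # function name -> current logic version
        self.memo: dict[tuple[str, Any], tuple[Any, dict[str, int]]] = {}
        self._active: list[dict[str, int]] = []  # one "seen" dict per in-flight call
        self.executed: list[str] = []            # log of real (non-memoized) executions

    def invoke(self, name: str, fn: Callable[[Any], Any], arg: Any) -> Any:
        cached = self.memo.get((name, arg))
        if cached is not None:
            result, seen = cached
            # Valid only if every piece of logic this invocation ran is unchanged.
            if all(self.versions[n] == v for n, v in seen.items()):
                self._report(seen)
                return result
        seen = {name: self.versions[name]}
        self._active.append(seen)
        try:
            result = fn(arg)
        finally:
            self._active.pop()
        self.executed.append(f"{name}({arg!r})")
        self.memo[(name, arg)] = (result, seen)
        self._report(seen)
        return result

    def _report(self, seen: dict[str, int]) -> None:
        # Tell the nearest tracked caller which logic this call depended on.
        if self._active:
            self._active[-1].update(seen)


engine = ToyEngine({"foo": 1, "bar": 1})


def bar(b: int) -> int:
    return b * 2


def helper(a: int) -> int:
    # Untracked function: the call to bar still reaches foo's record.
    return engine.invoke("bar", bar, a + 1)


def foo(a: int) -> int:
    return helper(a) if a > 0 else 0  # bar runs only on this branch


engine.invoke("foo", foo, 1)    # runs foo and bar
engine.invoke("foo", foo, -1)   # runs foo only
engine.executed.clear()

engine.versions["bar"] = 2      # simulate a logic change in bar
engine.invoke("foo", foo, 1)    # re-executed: this invocation called bar
engine.invoke("foo", foo, -1)   # memo hit: this invocation never called bar
```

Because dependencies are recorded while the code runs, the untracked `helper` does not hide `bar` from `foo`, and the `foo(-1)` invocation stays cached.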

### What about hidden dependencies like global variables or files?

Like any memoization system (e.g., `@functools.cache`), CocoIndex's change detection assumes functions depend only on their declared inputs. If a function reads a global variable, a file, or external state not passed through arguments, changes to those won't be detected automatically.

CocoIndex provides mechanisms to capture some of these dependencies:

- **[`deps`](./programming_guide/function#deps)** — declares module-level values (like a prompt string or model name) as part of the function's logic. Changes to these values invalidate dependent memos, just like any other logic change. Note: `deps` is snapshotted once at decoration time.
- **[`use_context()`](./programming_guide/context)** — retrieves shared resources via `ContextKey`. With [`detect_change=True`](./programming_guide/context#change-detection), changes to the provided value invalidate dependent memos.

For per-call values that change at runtime, pass them as regular function arguments instead.
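
The `@functools.cache` comparison can be seen directly in standard-library Python. The sketch below uses no CocoIndex API; `SETTINGS`, `build_prompt`, and `build_prompt_explicit` are illustrative names.

```python
import functools

SETTINGS = {"prefix": "Summarize: "}  # hidden, module-level dependency


@functools.cache
def build_prompt(text: str) -> str:
    # Reads SETTINGS, which is not part of the cache key.
    return SETTINGS["prefix"] + text


@functools.cache
def build_prompt_explicit(prefix: str, text: str) -> str:
    # The prefix is a declared input, so it is part of the cache key.
    return prefix + text


first = build_prompt("hello")
SETTINGS["prefix"] = "Translate: "
stale = build_prompt("hello")  # cache hit: the change goes unnoticed
fresh = build_prompt_explicit(SETTINGS["prefix"], "hello")
```

The first function keeps returning the old prompt after the setting changes, while the second picks up the new value because it arrives as an argument.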

## Target states and syncing

### What happens if my pipeline crashes mid-update?

CocoIndex's internal state is always consistent — even after a crash or `kill -9`. On the next `app.update()`, CocoIndex automatically recovers: it computes the current desired state and reconciles against all possible previous states, converging the target to the correct state. No manual cleanup is needed. See [Error Handling — Interrupted updates and recovery](./advanced_topics/exception_handlers#interrupted-updates-and-recovery) for details.
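
As a rough illustration of reconciling against all possible previous states, consider a target whose last update may have been interrupted. The sketch below is a simplification, not the engine's code: `needs_write` is a hypothetical helper, and raw bytes stand in for whatever tracking record a target keeps.

```python
from collections.abc import Collection


def needs_write(
    desired: bytes,
    possible_previous: Collection[bytes],
    may_be_missing: bool,
) -> bool:
    """Skip the write only when every possible previous state already matches."""
    if may_be_missing:
        return True
    return any(prev != desired for prev in possible_previous)


# An update from v1 to v2 was interrupted: the target may hold either value.
after_crash = needs_write(b"v2", [b"v1", b"v2"], may_be_missing=False)

# After a successful run, only v2 is possible, so nothing needs to happen.
after_recovery = needs_write(b"v2", [b"v2"], may_be_missing=False)
```

Writing whenever any candidate state differs is what lets the target converge without knowing exactly where the crash happened.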

### Are target state writes transactional across targets?

Not across targets. When a processing component finishes, CocoIndex sends all its target state changes to each target backend as a unit — all writes happen after processing completes, never partially during execution. Each target backend applies its batch atomically when supported (e.g., within a database transaction). But changes across *different* target backends (e.g., Postgres and local files) are not transactional with each other. See [How target states sync](./programming_guide/processing_component#how-target-states-sync) for details.

---

# Setup dev environment

Source: https://cocoindex.io/docs/contributing/setup_dev_environment/

Follow the steps below to build CocoIndex locally from the latest codebase, for example when you are changing CocoIndex functionality and want to test it.

- 🦀 [Install Rust](https://rust-lang.org/tools/install)

    If you don't have Rust installed, run

    ```sh
    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    ```

    Already have Rust? Make sure it's up to date

    ```sh
    rustup update
    ```

- Install [uv](https://docs.astral.sh/uv/) for Python project management:

    ```sh
    # macOS / Linux
    curl -LsSf https://astral.sh/uv/install.sh | sh

    # Windows
    powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
    ```

- Set up your local development environment:

  - Install and enable prek hooks. This ensures all checks run automatically before each commit:

    ```sh
    uv run prek install
    ```

  - (Optionally) Install all optional dependencies:

    ```sh
    uv sync --all-extras
    ```

- Build the library. Run this at the root of the cocoindex directory:

    ```sh
    uv run maturin develop
    ```

    This step needs to be repeated whenever you make changes to the Rust code.

## Running examples

Before running a specific example, set extra environment variables that expose extra traces, enable the dev UI, and so on:

```sh
. ./.env.lib_debug
```

To run examples during development, you need to use the local editable version of cocoindex.
`.env.lib_debug` provides a `coco-dev-run` function to make it more convenient.

```sh
# Navigate to an example directory
cd examples/text_embedding

# Run with your local cocoindex changes
coco-dev-run cocoindex update main
```

The `coco-dev-run` function runs `uv run --with-editable $COCOINDEX_DEV_ROOT`, which ensures the example uses your locally built cocoindex package instead of the published version. The `COCOINDEX_DEV_ROOT` variable is automatically set to the repo root when you source `.env.lib_debug`.

## Troubleshooting

### `cargo test` fails with `ModuleNotFoundError: encodings` (embedded Python can't find stdlib)

On some setups (notably when using `uv venv` with a `uv`-managed Python), `cargo test` may crash with:

`ModuleNotFoundError: No module named 'encodings'`

This can happen when the embedded Python interpreter (used by Rust tests) cannot locate the Python stdlib (you may see `sys.prefix=/install` in the crash output).

Workaround:

- Run cargo tests via:

  `./dev/run_cargo_test.sh -p cocoindex --lib`

This wrapper sets `PYTHONHOME`/`PYTHONPATH` for that command only, so embedded Python can locate the stdlib and site-packages.

**Info**

The cargo-test prek hook uses this wrapper.

---

# Contributing guide

Source: https://cocoindex.io/docs/contributing/guide/

[CocoIndex](https://github.com/cocoindex-io/cocoindex) is an open source project. We are respectful, open and friendly. This guide explains how to get involved and contribute to [CocoIndex](https://github.com/cocoindex-io/cocoindex).

Our [Discord server](https://discord.com/invite/zpA9S2DR7s) is always open.
If you are unsure about anything, it is a good place to discuss! We'd love to collaborate and will always be friendly.

## Good first issues

We tag issues with the ["good first issue"](https://github.com/cocoindex-io/cocoindex/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) label for beginner contributors.

## How to contribute
- If you decide to take an issue, we recommend leaving a comment on it, such as **`Can I work on this issue?`**, so we can assign it to you. This helps you and others avoid duplicating work.
- For larger features, we recommend discussing them with us first in our [Discord server](https://discord.com/invite/zpA9S2DR7s) to coordinate the design and work.

## Submit your code
CocoIndex is committed to the highest standards of code quality. Please ensure your code is thoroughly tested before submitting a PR.

To submit your code:

1. Fork the [CocoIndex repository](https://github.com/cocoindex-io/cocoindex)
2. [Create a new branch](https://docs.github.com/en/desktop/making-changes-in-a-branch/managing-branches-in-github-desktop) on your fork
3. Make your changes
4. Run the prek checks. They are triggered automatically on `git commit` once you have installed the prek hooks with `prek install` (see [Setup Development Environment](setup_dev_environment)).

    **Tip**

    To run them manually (same as CI):

    ```sh
    prek run --all-files
    ```

5. [Open a Pull Request (PR)](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork) when your work is ready for review

In your PR description, please include:
- Description of the changes
- Motivation and context
- Note if it's a breaking change
- Reference any related GitHub issues

A core team member will review your PR within one business day and provide feedback on any required changes. Once approved and all tests pass, the reviewer will squash and merge your PR into the main branch.

Your contribution will then be part of CocoIndex! We'll highlight your contribution in our release notes 🌴.

---

# Join the community

Source: https://cocoindex.io/docs/about/community/

Welcome with a huge coconut hug 🥥⋆｡˚🤗.

We are super excited about community contributions of all kinds, whether it's code improvements, documentation updates, issue reports, or feature requests on [GitHub](https://github.com/cocoindex-io/cocoindex), or discussions in our [Discord](https://discord.com/invite/zpA9S2DR7s).

We would love to foster an inclusive, welcoming, and supportive environment. Contributing to CocoIndex should feel collaborative, friendly and enjoyable for everyone. Together, we can build better AI applications through robust data infrastructure.

**Tip — Start hacking CocoIndex**
Check out our [Contributing guide](../contributing/guide) to get started!

## Connect with us

Join our community channels to get help, share ideas, and connect with other developers:

- [Discord Community](https://discord.com/invite/zpA9S2DR7s) - Chat with the community and get real-time support
- [Twitter](https://x.com/cocoindex_io) - Follow us for updates and announcements
- [LinkedIn](https://www.linkedin.com/company/cocoindex/about/) - Connect professionally
- [YouTube](https://www.youtube.com/@cocoindex-io) - Watch tutorials and demos

## Get support

Need help? Here are the best ways to get support:

- Email us at [hi@cocoindex.io](mailto:hi@cocoindex.io)
- Check our [documentation](https://cocoindex.io/docs)
- Read our [blog posts](https://cocoindex.io/blogs)
- Ask questions in our [Discord Server](https://discord.com/invite/zpA9S2DR7s)

---

# Anonymous usage telemetry

Source: https://cocoindex.io/docs/about/telemetry/

CocoIndex sends a small number of anonymous events to help us understand how
the library is used across platforms and runtimes. Tracking is strictly
anonymous, release-only, and opt-out.

## What we collect

On release builds, CocoIndex POSTs a JSON event to our telemetry endpoint
(the [Scarf](https://about.scarf.sh/) gateway) at four points in the library's
lifecycle:

- `init` — once per process, when CocoIndex is imported.
- `app_create` — when a `cocoindex.App` is constructed.
- `app_update` — each time `app.update()` is called.
- `app_drop` — each time `app.drop()` is called.

Each event body contains exactly three fields:

```json
{ "event": "app_update", "platform": "aarch64-macos", "lang": "python3.11" }
```

The package identity and version travel in the request URL, for example
`https://cocoindex.gateway.scarf.sh/python-1.0.0a1`.

## What we don't collect

CocoIndex telemetry never sends:

- Your data, flow definitions, or function code.
- Source/target connector configuration, credentials, URLs, or database names.
- App names, component names, file paths, or any user-provided identifiers.
- Stats about row counts, processing times, or memory usage.
- A persistent machine or user identifier. Requests are not tagged; the gateway
  only sees whatever an HTTP request normally carries (IP address, which Scarf
  uses for coarse geolocation and then discards).

## When telemetry is active

Telemetry runs only in **release builds** of the native extension (the
wheels published to PyPI). Local development installs via `maturin develop`
produce a debug build and never send any events — the check is compile-time,
so the code path doesn't exist in debug binaries.

Delivery is non-blocking: each event is dispatched on a background task with a
5-second timeout. Failures are logged at INFO level and never affect your app.

## Opting out

Set the `COCOINDEX_DISABLE_USAGE_TRACKING` environment variable to any non-empty
value other than `0` before starting your program:

```bash
export COCOINDEX_DISABLE_USAGE_TRACKING=1
```

When this variable is set, CocoIndex skips all telemetry initialization — no
events are ever constructed or dispatched.
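
The opt-out rule can be expressed as a small predicate. This is a Python sketch of the rule as stated above, not the actual check inside the native extension; `usage_tracking_disabled` is a hypothetical name.

```python
from collections.abc import Mapping


def usage_tracking_disabled(environ: Mapping[str, str]) -> bool:
    """True when the variable is set to any non-empty value other than '0'."""
    value = environ.get("COCOINDEX_DISABLE_USAGE_TRACKING", "")
    return value != "" and value != "0"


unset = usage_tracking_disabled({})
opted_out = usage_tracking_disabled({"COCOINDEX_DISABLE_USAGE_TRACKING": "1"})
zero = usage_tracking_disabled({"COCOINDEX_DISABLE_USAGE_TRACKING": "0"})
```

Pass `os.environ` as the argument to check the current process.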

## Why we collect this

Knowing which platforms and Python versions are in active use helps us prioritize
wheel builds, platform support, and compatibility testing. If you have feedback
about what we track, please [open an issue](https://github.com/cocoindex-io/cocoindex/issues)
or reach out in our [Discord](https://discord.com/invite/zpA9S2DR7s).

---

# Example Walkthroughs

---

# Example: Semantic Search 101

Source: https://cocoindex.io/docs/examples/text-embedding/

![Semantic Search 101 with CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/text-embedding/cover.png)

We'll take a folder of Markdown files and turn it into a [vector index](https://github.com/pgvector/pgvector) you can search in plain English — the foundation under every RAG and semantic-search system. Point it at your docs, and "how does incremental processing work?" finds the right passage even when it shares no keywords with the text.

The whole pipeline is ordinary `async` Python and your own types. The heavy lifting — [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/), change tracking, managed targets — runs in a Rust engine underneath, so only what changed gets re-embedded and re-upserted.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding)

## Flow overview

![CocoIndex text embedding flow: read Markdown, split into chunks, embed each chunk, and store the vectors in Postgres with pgvector](https://cocoindex.io/blobs/docs-v1/img/examples/text-embedding/flow-v1.png)

From a high level, these are the steps:

1. Read Markdown files from a local directory.
2. [Split each file into overlapping chunks](https://cocoindex.io/docs/ops/text/), then [embed](https://cocoindex.io/docs/ops/sentence_transformers/) every chunk.
3. Store the chunks and their embeddings in Postgres (as [target states](https://cocoindex.io/docs/programming_guide/target_state/)).

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

> **New to embeddings?** An [*embedding*](https://cocoindex.io/docs/ops/sentence_transformers/) is a list of numbers (a vector) that captures the *meaning* of a piece of text, so passages with similar meaning land close together in vector space. A [*vector index*](https://cocoindex.io/docs/common_resources/vector_schema/) stores those vectors and finds the nearest ones to your query fast. That's what lets search match by meaning instead of exact words.

## Setup

- A running Postgres with the [pgvector](https://github.com/pgvector/pgvector) extension. CocoIndex supports [many targets](https://cocoindex.io/docs/connectors/postgres/), so you can pick another store.

  ```sh
  export POSTGRES_URL="postgres://cocoindex:cocoindex@localhost/cocoindex"
  ```

- Install CocoIndex and the dependencies this example uses:

  ```sh
  pip install -U "cocoindex[postgres,sentence_transformers]" asyncpg pgvector numpy python-dotenv
  ```

- A few `.md` files to index. Grab the [sample files](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding/markdown_files) from the repo, or drop your own notes into a `markdown_files/` directory.

## Define the data and shared resources

[Apps](https://cocoindex.io/docs/programming_guide/app/) are the top-level runnable unit in CocoIndex. Before the App, we set up two things the rest of the code builds on. `DocEmbedding` defines one row of the output table — each chunk of text becomes one row, with its filename, location, text, and embedding vector. `coco_lifespan` provides the [shared resources](https://cocoindex.io/docs/programming_guide/context/) every step needs — the Postgres connection pool and the embedding model — once at startup.

```python
import os
import pathlib
from dataclasses import dataclass
from typing import AsyncIterator, Annotated

import asyncpg
from numpy.typing import NDArray

import cocoindex as coco
from cocoindex.connectors import localfs, postgres
from cocoindex.ops.text import RecursiveSplitter
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
from cocoindex.resources.chunk import Chunk
from cocoindex.resources.file import FileLike, PatternFilePathMatcher
from cocoindex.resources.id import IdGenerator

DATABASE_URL = os.getenv("POSTGRES_URL", "postgres://cocoindex:cocoindex@localhost/cocoindex")
TABLE_NAME = "doc_embeddings"
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

PG_DB = coco.ContextKey[asyncpg.Pool]("text_embedding_db")
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder", detect_change=True)

_splitter = RecursiveSplitter()


@dataclass
class DocEmbedding:
    id: int
    filename: str
    chunk_start: int
    chunk_end: int
    text: str
    embedding: Annotated[NDArray, EMBEDDER]


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    async with asyncpg.create_pool(DATABASE_URL) as pool:
        builder.provide(PG_DB, pool)
        builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
        yield
```

`embedding: Annotated[NDArray, EMBEDDER]` ties the vector column to the embedder, so its dimensions are inferred automatically — and if you swap the model later, CocoIndex notices (`detect_change=True`) and re-embeds.

## Process a file

![One processing component per file: each file is chunked and embedded, producing DocEmbedding rows written to Postgres](https://cocoindex.io/blobs/docs-v1/img/examples/text-embedding/stage-file-process.png)

`process_file` runs once per file. It reads the file, [splits the text](https://cocoindex.io/docs/ops/text/) into overlapping chunks, and maps each chunk to `process_chunk`.

```python
@coco.fn(memo=True)
async def process_file(
    file: FileLike,
    table: postgres.TableTarget[DocEmbedding],
) -> None:
    text = await file.read_text()
    chunks = _splitter.split(
        text, chunk_size=2000, chunk_overlap=500, language="markdown"
    )
    id_gen = IdGenerator()
    await coco.map(process_chunk, chunks, file.file_path.path, id_gen, table)
```

Chunking keeps each embedded unit small and focused, and the overlap means an idea that straddles a boundary still lands whole in at least one chunk.
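
To see what overlap does, here is a deliberately naive character-window splitter. It is not how `RecursiveSplitter` works (that splitter is structure-aware); `naive_chunks` is a hypothetical helper that only illustrates `chunk_size` and `chunk_overlap`, with tiny values so the result is easy to read.

```python
def naive_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, int, str]]:
    """Return (start, end, text) windows; consecutive windows share `chunk_overlap` chars."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: list[tuple[int, int, str]] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append((start, end, text[start:end]))
        if end == len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks


chunks = naive_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
texts = [chunk_text for _, _, chunk_text in chunks]
# texts == ["abcdefghij", "ghijklmnop", "mnopqrstuv", "stuvwxyz"]
```

Each window repeats the last four characters of the previous one, so content near a boundary appears intact in at least one chunk.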

[`@coco.fn`](https://cocoindex.io/docs/programming_guide/function/) with [`memo=True`](https://cocoindex.io/docs/advanced_topics/memoization_keys/) is what makes this incremental: if a file's content and this function's code are both unchanged, the whole file is skipped on the next run. `coco.map` fans out to one `process_chunk` call per chunk.
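
The skip rule (unchanged content and unchanged code) can be modeled in a few lines. This is a toy sketch rather than the engine's mechanism: `run_once` is a hypothetical function, a SHA-256 digest stands in for the content fingerprint, and a version string stands in for the function's code fingerprint.

```python
import hashlib


def fingerprint(data: str) -> str:
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def run_once(
    files: dict[str, str],
    code_version: str,
    memo: dict[str, tuple[str, str]],
) -> list[str]:
    """Return the names of files that had to be reprocessed on this run."""
    processed = []
    for name, content in sorted(files.items()):
        key = (fingerprint(content), code_version)
        if memo.get(name) == key:
            continue  # same content and same code: skip the whole file
        processed.append(name)
        memo[name] = key
    return processed


memo: dict[str, tuple[str, str]] = {}
files = {"a.md": "alpha", "b.md": "beta"}

first_run = run_once(files, "v1", memo)    # everything is new
files["b.md"] = "beta, edited"
second_run = run_once(files, "v1", memo)   # only the edited file
third_run = run_once(files, "v2", memo)    # code changed: everything again
```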

## Process a chunk

`process_chunk` embeds the chunk with the shared embedder and declares the target row.

```python
@coco.fn
async def process_chunk(
    chunk: Chunk,
    filename: pathlib.PurePath,
    id_gen: IdGenerator,
    table: postgres.TableTarget[DocEmbedding],
) -> None:
    table.declare_row(
        row=DocEmbedding(
            id=await id_gen.next_id(chunk.text),
            filename=str(filename),
            chunk_start=chunk.start.char_offset,
            chunk_end=chunk.end.char_offset,
            text=chunk.text,
            embedding=await coco.use_context(EMBEDDER).embed(chunk.text),
        ),
    )
```

We use [`SentenceTransformerEmbedder`](https://cocoindex.io/docs/ops/sentence_transformers/) with `all-MiniLM-L6-v2` — a small, fast model that runs locally with no API key. There are 12k+ sentence-transformer models on [Hugging Face](https://huggingface.co/models?other=sentence-transformers), so swap in whichever you prefer. `table.declare_row` declares the row as a target state; CocoIndex handles inserting, updating, or deleting it to match.

## Define the main function

![mount_each fans out one processing component per file, from the Markdown source to the Postgres target](https://cocoindex.io/blobs/docs-v1/img/examples/text-embedding/stage-main-function.png)

`app_main` wires the source to the target. It mounts the Postgres table (with a [vector index](https://cocoindex.io/docs/common_resources/vector_schema/)), walks the source directory, and mounts one [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) per file.

```python
@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    target_table = await postgres.mount_table_target(
        PG_DB,
        table_name=TABLE_NAME,
        table_schema=await postgres.TableSchema.from_class(
            DocEmbedding, primary_key=["id"],
        ),
    )
    target_table.declare_vector_index(column="embedding")

    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]),
        live=True,  # watch for changes; pass -L to `cocoindex update` to run live
    )
    await coco.mount_each(process_file, files.items(), target_table)
```

`mount_table_target` creates and manages the Postgres table for you: schema, the pgvector index, idempotent upserts, and orphan cleanup when a file disappears. `live=True` makes the [filesystem source](https://cocoindex.io/docs/connectors/localfs/) [watch for changes](https://cocoindex.io/docs/programming_guide/live_mode/), and `mount_each` runs one component per file so the engine can track and update them independently.

## Create the App

![A CocoIndex App binds the source, the transform, and the target state into one runnable unit](https://cocoindex.io/blobs/docs-v1/img/examples/text-embedding/stage-create-app.png)

Bind `app_main` into a `coco.App` and point it at the folder of Markdown files.

```python
app = coco.App(
    coco.AppConfig(name="TextEmbeddingV1"),
    app_main,
    sourcedir=pathlib.Path("./markdown_files"),
)
```

That is the entire indexing path.

## Run the pipeline

Run the [`cocoindex` CLI](https://cocoindex.io/docs/cli/) to build and update the index. Choose catch-up (scan, sync, exit) or live (catch up, then keep watching):

```sh
# Catch-up run
cocoindex update main

# Live run: keep watching for file changes
cocoindex update -L main
```

## Query the index

Match user text against the index with a plain SQL query, reusing the *same* embedder from the indexing flow so indexing and querying stay consistent.

```python
async def query_once(pool, embedder, query: str, *, top_k: int = 5) -> None:
    query_vec = await embedder.embed(query)
    async with pool.acquire() as conn:
        rows = await conn.fetch(
            f"""
            SELECT filename, text, embedding <=> $1 AS distance
            FROM "{TABLE_NAME}"
            ORDER BY distance ASC
            LIMIT $2
            """,
            query_vec, top_k,
        )
    for r in rows:
        score = 1.0 - float(r["distance"])
        print(f"[{score:.3f}] {r['filename']}")
        print(f"    {r['text']}")
        print("---")
```

The `<=>` operator is pgvector's cosine distance. We turn it into a similarity score and print the filename and the matching chunk. Run a search straight from the command line:

```bash
python main.py "what is self-attention?"
```

The most semantically similar chunks come back ranked — even when they share none of the words in your query. That's the whole point of a vector index.
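
The `1.0 - distance` conversion in `query_once` can be checked by hand with small vectors. The sketch below computes cosine distance in pure Python using the standard definition (one minus cosine similarity); `cosine_distance` is an illustrative helper, and the vectors are made up.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])                # exactly 1

# The score printed by query_once is 1 - distance.
score_same = 1.0 - same_direction
score_orthogonal = 1.0 - orthogonal
```

Vectors pointing the same way score close to 1 regardless of length, and unrelated (orthogonal) vectors score 0.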

## Incremental updates

CocoIndex keeps the index in sync with your files and does the **minimum work** to get there. You never compute a diff or write update logic: you change something, and CocoIndex works out exactly what to embed, upsert, and delete. Two pieces make this work. `@coco.fn(memo=True)` decides what to *recompute* — a file is skipped when its content and the function's code are both unchanged. `mount_table_target` decides what to *write* — each row's [`id`](https://cocoindex.io/docs/common_resources/id_generation/) is derived from its chunk's text, so it upserts only the rows that actually changed and deletes rows whose source is gone.

- **A file is added** — only that file is chunked and embedded, and its rows are inserted. The rest is untouched.
- **A file is edited** — it is re-chunked; chunks whose text is unchanged keep their `id` and embedding and are left as-is, genuinely new chunks are embedded and inserted, and chunks that no longer exist are deleted.
- **A file is deleted** — its rows are removed from the target automatically.

The same machinery covers **logic** changes too: tune the chunk size or swap the embedding model, and CocoIndex compares the new output against what is already in Postgres and applies only the difference. A catch-up run (`cocoindex update main`) does this once and exits; live mode (`cocoindex update -L main`) keeps watching and applies each change with low latency.
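
The effect of deriving each row's `id` from its chunk text can be sketched as a set difference. This is a simplified model, not the behavior of `IdGenerator` or `mount_table_target`: `chunk_id` uses a truncated SHA-256 digest as a stand-in, and `plan_changes` is a hypothetical helper.

```python
import hashlib


def chunk_id(text: str) -> str:
    # Stand-in for a content-derived id; the real id scheme may differ.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def plan_changes(old_chunks: list[str], new_chunks: list[str]) -> dict[str, list[str]]:
    """Compare chunk ids before and after an edit and list the resulting row changes."""
    old = {chunk_id(text): text for text in old_chunks}
    new = {chunk_id(text): text for text in new_chunks}
    return {
        "insert": sorted(new[i] for i in new.keys() - old.keys()),
        "delete": sorted(old[i] for i in old.keys() - new.keys()),
        "keep": sorted(new[i] for i in new.keys() & old.keys()),
    }


plan = plan_changes(
    old_chunks=["intro", "setup", "usage"],
    new_chunks=["intro", "setup v2", "usage"],
)
```

Only the edited chunk produces an insert and a delete; the unchanged chunks keep their ids and need no new embedding.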

## Run it

The full, runnable example is in the CocoIndex repo: [examples/text_embedding](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding). Once this clicks, [Index Your Codebase](https://cocoindex.io/docs/examples/index-codebase/) is the natural next step — the same flow with syntax-aware chunking for code.

---

# Example: Index Your Codebase

Source: https://cocoindex.io/docs/examples/index-codebase/

![Index Your Codebase for AI Agents with CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/index-codebase/cover.png)

In this tutorial, we'll build a live semantic index over a codebase with [CocoIndex](https://github.com/cocoindex-io/cocoindex). Point it at a repo, and you get a vector index you can search in natural language ("where do we embed chunks?") that updates itself as you edit — the kind of fresh, low-latency context an agent needs.

The whole pipeline is ordinary `async` Python and your own types. The heavy lifting — incremental processing, change tracking, managed targets — runs in a Rust engine underneath, so only what changed gets re-embedded and re-upserted.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/code_embedding)

## Use cases

![CocoIndex codebase indexing use cases](https://cocoindex.io/blobs/docs-v1/img/examples/index-codebase/usecase-agents.png)

- **Code context for agents** — semantic context for Claude, Codex, OpenCode, Factory instead of file-by-file reading.
- **Code search** — natural-language and semantic search over your repo.
- **Review & refactor agents** — context for code review, security analysis, and large-scale refactoring.

## Why CocoIndex for codebase indexing

A codebase is hard to keep indexed well, and it exercises most of what CocoIndex was built for:

- **Syntax-aware chunking** is built in. Tree-sitter integration means chunks follow real code structure (functions, classes, blocks) instead of arbitrary line windows, for every major language.
- **Incremental updates** suit code that changes constantly. CocoIndex re-embeds only the chunks that changed and upserts only the rows that changed, so a one-line edit never triggers a full re-index.
- **Live updates** keep the index current. With `live=True` on the filesystem source and `cocoindex update -L`, the index keeps watching and applies changes with low latency.
- **Plain Python** keeps it customizable. Pick your embedding model, chunking strategy, and vector database.
- **Consistent indexing and query.** The same embedder is shared between the indexing path and the query path, so what you index is what you search against.

![Why CocoIndex for codebase indexing: syntax-aware chunking, incremental updates, live updates, a Rust core, plain Python, and a consistent embedder across indexing and query.](https://cocoindex.io/blobs/docs-v1/img/examples/index-codebase/why-cocoindex.png)

## Flow overview

![CocoIndex flow for code embedding](https://cocoindex.io/blobs/docs-v1/img/examples/index-codebase/flow-v1.png)

From a high level, these are the steps:

1. Read code files from a local directory.
2. Split each file into syntax-aware chunks with Tree-sitter, then embed every chunk.
3. Store the chunks and their embeddings in Postgres (as [target states](https://cocoindex.io/docs/programming_guide/target_state/)).

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

## Setup

- A running Postgres with the [pgvector](https://github.com/pgvector/pgvector) extension. CocoIndex supports [many targets](https://cocoindex.io/docs/connectors/postgres/), so you can pick another store.

  ```sh
  export POSTGRES_URL="postgres://cocoindex:cocoindex@localhost/cocoindex"
  ```

- Install CocoIndex and the dependencies this example uses:

  ```sh
  pip install -U "cocoindex[postgres,sentence_transformers]" asyncpg pgvector numpy python-dotenv
  ```

## Define the data and shared resources

[Apps](https://cocoindex.io/docs/programming_guide/app/) are the top-level runnable unit in CocoIndex. Before the App, we set up two things the rest of the code builds on. `CodeEmbedding` defines one row of the output table — each chunk of code becomes one row, with its text, location, and embedding vector. `coco_lifespan` provides the [shared resources](https://cocoindex.io/docs/programming_guide/context/) every step needs — the Postgres connection pool and the embedding model — once at startup.

```python
import os
import pathlib
from dataclasses import dataclass
from typing import AsyncIterator, Annotated

import asyncpg
from numpy.typing import NDArray

import cocoindex as coco
from cocoindex.connectors import localfs, postgres
from cocoindex.ops.text import RecursiveSplitter, detect_code_language
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
from cocoindex.resources.chunk import Chunk
from cocoindex.resources.file import FileLike, PatternFilePathMatcher
from cocoindex.resources.id import IdGenerator

DATABASE_URL = os.getenv("POSTGRES_URL", "postgres://cocoindex:cocoindex@localhost/cocoindex")
TABLE_NAME = "code_embeddings"
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

PG_DB = coco.ContextKey[asyncpg.Pool]("code_embedding_db")
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder", detect_change=True)

_splitter = RecursiveSplitter()


@dataclass
class CodeEmbedding:
    id: int
    filename: str
    code: str
    embedding: Annotated[NDArray, EMBEDDER]
    start_line: int
    end_line: int


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    async with asyncpg.create_pool(DATABASE_URL) as pool:
        builder.provide(PG_DB, pool)
        builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
        yield
```

`embedding: Annotated[NDArray, EMBEDDER]` ties the vector column to the embedder, so its dimensions are inferred automatically.

## Process a file

![One processing component per file: each file is chunked with Tree-sitter and embedded, producing CodeEmbedding rows.](https://cocoindex.io/blobs/docs-v1/img/examples/index-codebase/stage-file-process.png)

`process_file` runs once per file. It reads the file, detects the language so Tree-sitter can parse it, [splits the code](https://cocoindex.io/docs/ops/text/) along the syntax tree, and maps each chunk to `process_chunk`.

```python
@coco.fn(memo=True)
async def process_file(
    file: FileLike,
    table: postgres.TableTarget[CodeEmbedding],
) -> None:
    text = await file.read_text()
    language = detect_code_language(filename=str(file.file_path.path.name))
    chunks = _splitter.split(
        text,
        chunk_size=1000,
        min_chunk_size=300,
        chunk_overlap=300,
        language=language,
    )
    id_gen = IdGenerator()
    await coco.map(process_chunk, chunks, file.file_path.path, id_gen, table)
```

CocoIndex uses Tree-sitter to chunk code along its actual syntax structure rather than arbitrary line breaks. Because each chunk is a coherent syntactic unit, retrieval returns whole functions or blocks instead of fragments cut mid-statement. All major languages are supported; unknown types fall back to plain text.
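
To get a feel for what "chunking along syntax structure" means, here is a minimal sketch that uses Python's standard-library `ast` module instead of Tree-sitter. It is a stand-in for illustration only: it handles Python source alone, ignores size limits and overlap, and `CodeChunk` and `split_top_level` are hypothetical names, not part of CocoIndex.

```python
import ast
from dataclasses import dataclass


@dataclass
class CodeChunk:
    text: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive


def split_top_level(source: str) -> list[CodeChunk]:
    """Return one chunk per top-level statement (import, function, class, ...)."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        start = node.lineno
        # Decorators sit above the `def`/`class` line, so pull them into the chunk.
        decorators = getattr(node, "decorator_list", [])
        if decorators:
            start = min(d.lineno for d in decorators)
        end = node.end_lineno
        chunks.append(CodeChunk("\n".join(lines[start - 1:end]), start, end))
    return chunks


SOURCE = '''import os

def load(path):
    with open(path) as f:
        return f.read()

class Index:
    def add(self, doc):
        self.docs.append(doc)
'''

for chunk in split_top_level(SOURCE):
    print(chunk.start_line, chunk.end_line, chunk.text.splitlines()[0])
# 1 1 import os
# 3 5 def load(path):
# 7 9 class Index:
```

Each chunk is a complete statement with its own line range, which is the property that lets search results point at whole functions rather than fragments.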

[`@coco.fn`](https://cocoindex.io/docs/programming_guide/function/) with [`memo=True`](https://cocoindex.io/docs/advanced_topics/memoization_keys/) is what makes this incremental: if a file's content and this function's code are both unchanged, the whole file is skipped on the next run. `coco.map` fans out to one `process_chunk` call per chunk.
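
The skip rule can be pictured as a cache whose key combines the input content with a fingerprint of the function itself. The sketch below is an assumption-laden illustration, not how CocoIndex stores memo state: `memo_key` and `run_memoized` are hypothetical, the cache is an in-memory dict, and the code fingerprint is a crude hash of the function's bytecode and constants.

```python
import hashlib

_memo: dict[str, int] = {}
calls = {"count": 0}


def count_words(text: str) -> int:
    """Pretend this is expensive (chunking plus embedding)."""
    calls["count"] += 1
    return len(text.split())


def memo_key(fn, content: str) -> str:
    digest = hashlib.sha256()
    # Crude code fingerprint: changes when the function body or its constants change.
    digest.update(fn.__code__.co_code)
    digest.update(repr(fn.__code__.co_consts).encode("utf-8"))
    # Content fingerprint: changes when the file content changes.
    digest.update(content.encode("utf-8"))
    return digest.hexdigest()


def run_memoized(fn, content: str) -> int:
    key = memo_key(fn, content)
    if key not in _memo:
        _memo[key] = fn(content)  # only runs on a cache miss
    return _memo[key]


run_memoized(count_words, "def a(): pass")      # computed
run_memoized(count_words, "def a(): pass")      # skipped: same content, same code
run_memoized(count_words, "def a(): return 1")  # computed: content changed
print(calls["count"])  # 2
```

Three requests result in two real computations, because the second request matches both the content and the code of the first.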

Here is what chunking produces: each file is split into syntactic chunks, each with its own location and text.

![Each file split into chunks, with the location and text of every chunk](https://cocoindex.io/blobs/docs-v1/img/examples/index-codebase/flow-chunk.png)

## Process a chunk

`process_chunk` embeds the chunk with the shared embedder and declares the target row.

```python
@coco.fn
async def process_chunk(
    chunk: Chunk,
    filename: pathlib.PurePath,
    id_gen: IdGenerator,
    table: postgres.TableTarget[CodeEmbedding],
) -> None:
    embedding = await coco.use_context(EMBEDDER).embed(chunk.text)
    table.declare_row(
        row=CodeEmbedding(
            id=await id_gen.next_id(chunk.text),
            filename=str(filename),
            code=chunk.text,
            embedding=embedding,
            start_line=chunk.start.line,
            end_line=chunk.end.line,
        ),
    )
```

We use [`SentenceTransformerEmbedder`](https://cocoindex.io/docs/ops/sentence_transformers/) with `all-MiniLM-L6-v2`; there are 12k+ sentence-transformer models on [Hugging Face](https://huggingface.co/models?other=sentence-transformers), so swap in whichever you prefer. `chunk.start.line` and `chunk.end.line` carry through, so search results point straight at the lines that matched.

## Define the main function

![mount_each fans out one processing component per file, from the codebase source to the Postgres target.](https://cocoindex.io/blobs/docs-v1/img/examples/index-codebase/stage-main-function.png)

`app_main` wires the source to the target. It mounts the Postgres table (with a [vector index](https://cocoindex.io/docs/common_resources/vector_schema/)), walks the codebase, and mounts one [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) per file.

```python
@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    target_table = await postgres.mount_table_target(
        PG_DB,
        table_name=TABLE_NAME,
        table_schema=await postgres.TableSchema.from_class(
            CodeEmbedding, primary_key=["id"],
        ),
    )
    target_table.declare_vector_index(column="embedding")

    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(
            included_patterns=["**/*.py", "**/*.rs", "**/*.toml", "**/*.md", "**/*.mdx"],
            excluded_patterns=["**/.*", "**/target", "**/node_modules"],
        ),
        live=True,  # watch for changes; pass -L to `cocoindex update` to run live
    )
    await coco.mount_each(process_file, files.items(), target_table)
```

`mount_table_target` creates and manages the Postgres table for you: schema, the pgvector index, idempotent upserts, and orphan cleanup when a file disappears. `live=True` makes the [filesystem source](https://cocoindex.io/docs/connectors/localfs/) [watch for changes](https://cocoindex.io/docs/programming_guide/live_mode/), and `mount_each` runs one component per file so the engine can track and update them independently.
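
The include and exclude patterns decide which files become processing components. As a rough mental model, the sketch below approximates them with the standard-library `fnmatch`: a file is dropped if any path component matches an excluded name, and kept only if its file name matches an included pattern. The names `is_indexed`, `INCLUDED_NAMES`, and `EXCLUDED_PARTS` are hypothetical, and the real `PatternFilePathMatcher` glob rules may differ in the details.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Simplified equivalents of the included/excluded patterns used above.
INCLUDED_NAMES = ["*.py", "*.rs", "*.toml", "*.md", "*.mdx"]
EXCLUDED_PARTS = [".*", "target", "node_modules"]


def is_indexed(path: str) -> bool:
    """Include by file name, exclude by any path component."""
    parts = PurePosixPath(path).parts
    if any(fnmatch(part, pat) for part in parts for pat in EXCLUDED_PARTS):
        return False
    return any(fnmatch(parts[-1], pat) for pat in INCLUDED_NAMES)


for candidate in [
    "src/main.py",
    "docs/guide.mdx",
    ".git/hooks/pre_commit.py",
    "target/debug/build.rs",
    "web/node_modules/pkg/README.md",
    "assets/logo.png",
]:
    print(candidate, is_indexed(candidate))
```

Only the first two paths are indexed: the next three sit under an excluded directory, and the last has no included extension.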

## Create the App

![A CocoIndex App binds source, transform, and target state into one runnable unit.](https://cocoindex.io/blobs/docs-v1/img/examples/index-codebase/stage-create-app.png)

Bind `app_main` into a `coco.App` and point it at the codebase root.

```python
app = coco.App(
    coco.AppConfig(name="CodeEmbeddingV1"),
    app_main,
    sourcedir=pathlib.Path(__file__).parent / ".." / "..",  # index from repo root
)
```

That is the entire indexing path.

## Run the pipeline

Run the [`cocoindex` CLI](https://cocoindex.io/docs/cli/) to set up and update the index. Choose catch-up (scan, sync, exit) or live (catch up, then keep watching):

```sh
# Catch-up run
cocoindex update main

# Live run: keep watching for file changes
cocoindex update -L main
```

## Query the index

Match user text against the index with a plain SQL query, reusing the *same* embedder from the indexing flow so indexing and querying stay consistent.

```python
async def query_once(pool, embedder, query: str, *, top_k: int = 5) -> None:
    query_vec = await embedder.embed(query)
    async with pool.acquire() as conn:
        rows = await conn.fetch(
            f"""
            SELECT filename, code, embedding <=> $1 AS distance, start_line, end_line
            FROM "{TABLE_NAME}"
            ORDER BY distance ASC
            LIMIT $2
            """,
            query_vec, top_k,
        )
    for r in rows:
        score = 1.0 - float(r["distance"])
        print(f"[{score:.3f}] {r['filename']} (L{r['start_line']}-L{r['end_line']})")
        print(f"    {r['code']}")
        print("---")
```

The `<=>` operator is pgvector's cosine distance. We turn it into a similarity score (`1.0 - distance`) and print the filename, the matched line range, and the code snippet.
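
For intuition about the score, cosine distance is one minus the cosine of the angle between two vectors, so subtracting it from one gives the cosine similarity back. The sketch below reproduces that arithmetic in pure Python with toy two-dimensional vectors; `cosine_distance` and `similarity` are illustrative helpers, not part of the example code.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def similarity(a: list[float], b: list[float]) -> float:
    """The score printed by the query code: 1 - distance."""
    return 1.0 - cosine_distance(a, b)


query = [1.0, 0.0]
print(round(similarity(query, [2.0, 0.0]), 3))   # 1.0  (same direction)
print(round(similarity(query, [0.0, 3.0]), 3))   # 0.0  (orthogonal)
print(round(similarity(query, [-1.0, 0.0]), 3))  # -1.0 (opposite direction)
```

The distance ranges from 0 to 2, so the score ranges from 1 (best match) down to -1, and vector length does not affect it.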

```bash
python main.py "embedding"
```

The search results print in the terminal:

![Search results in the terminal](https://cocoindex.io/blobs/docs-v1/img/examples/index-codebase/search-results.png)

## Incremental updates

CocoIndex keeps the index in sync with the codebase and does the **minimum work** to get there. You never compute a diff or write update logic: you change something, and CocoIndex works out exactly what to embed, upsert, and delete. Two pieces make this work. `@coco.fn(memo=True)` decides what to *recompute* — a file is skipped when its content and the function's code are both unchanged, and an embedding is reused when its chunk text is unchanged. `mount_table_target` decides what to *write* — each row's `id` is derived from its chunk's content, so it upserts only the rows that actually changed and deletes rows whose source is gone.

The same machinery covers two kinds of change: changes to your **data** (the code being indexed) and changes to your **logic** (the pipeline itself).

**Data changes.**

- **A file is added** — only that file is chunked and embedded, and its rows are inserted. The rest of the repo is untouched.
- **A file is deleted** — its rows are removed from the target.
- **A file is edited** — the file is re-chunked, and the new chunks usually overlap the old ones. Chunks whose text is unchanged keep their `id` and embedding, so they are left as-is; genuinely new chunks are embedded and inserted; chunks that no longer exist are deleted. Edit one function and only that function's chunks are re-embedded, even though the whole file was re-read.

![A file edited and re-chunked: unchanged chunks are reused with no re-embedding, a removed chunk's row is deleted, and a new chunk is embedded and inserted.](https://cocoindex.io/blobs/docs-v1/img/examples/index-codebase/incremental-diff.png)

**Logic changes.** A pipeline change is reconciled the same way — CocoIndex compares the new output against what is already in Postgres and applies only the difference.

- **Change the file patterns** (`included_patterns` / `excluded_patterns`) — files that now match are added automatically; files that no longer match have their rows deleted automatically.
- **Tune the chunking** (chunk size, overlap) — files are re-chunked, producing the same partial-overlap diff shown above: unchanged chunks are no-ops, new chunks are embedded and inserted, dropped chunks are deleted.
- **Swap the embedding model** — the vectors genuinely change, so all embeddings are recomputed, but row identity is stable: it is an in-place update of the `embedding` column, with no rows added or removed.

A catch-up run (`cocoindex update main`) does this once and exits; live mode (`cocoindex update -L main`) keeps watching and applies each change with low latency, so the index stays current as you code.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/code_embedding](https://github.com/cocoindex-io/cocoindex/tree/main/examples/code_embedding).

## CocoIndex Code

If you'd rather not wire the pipeline yourself, [CocoIndex Code](https://github.com/cocoindex-io/cocoindex-code) is an end-to-end implementation of exactly this indexing, packaged as a CLI and an MCP server. It does the same thing this example does (AST-aware chunking, incremental re-index on file changes, local embeddings by default), hardened for production.

![CocoIndex Code: semantic code search for coding agents, as a CLI and MCP server](https://cocoindex.io/blobs/docs-v1/img/examples/index-codebase/cocoindex-code.png)

You can plug it straight into your coding agent or code-review agent:

- **Claude Code skill:** `npx skills add cocoindex-io/cocoindex-code`, then invoke `/ccc`.
- **MCP server:** `claude mcp add cocoindex-code -- ccc mcp` (Codex, OpenCode, Cursor, and any MCP client work the same way).
- **CLI:** `ccc index` to build the index, `ccc search "where we embed chunks"` to query it.

---

# Example: Multi-codebase Summarization

Source: https://cocoindex.io/docs/examples/multi-codebase-summarization/

![Multi-Codebase Summarization](https://cocoindex.io/blobs/docs-v1/img/examples/multi-codebase-summarization/cover.png)

Your code is the source of truth. In this tutorial, we'll build a pipeline that automatically generates a one-page wiki for each project in a list and uses incremental processing so the pages never go out of date. Think of it as building your own deep wiki that is always fresh.

For example, for each [cocoindex example project](https://github.com/cocoindex-io/cocoindex/tree/main/examples), we can have an auto-one-pager like this:

![markdown](https://cocoindex.io/blobs/docs-v1/img/examples/multi-codebase-summarization/markdown.png)

## Overview

This example uses structured LLM outputs to analyze code and generate documentation at scale:

1. Scan top-level subdirectories, treating each as a separate project
2. Extract structured information from each file using an LLM (classes, functions, relationships)
3. Aggregate file-level data into project-level summaries
4. Generate Markdown documentation with Mermaid diagrams

You declare the transformation logic with native Python without worrying about changes.

Think:
**target_state = transformation(source_state)**

When your source data is updated or your processing logic changes (for example, switching to a different model or updating your LLM extraction logic), CocoIndex reprocesses only what is affected, which keeps your wikis up to date in production.

## Setup

1. Install CocoIndex and dependencies:

    ```bash
    pip install 'cocoindex>=1.0.2' instructor litellm pydantic
    ```

2. Create a new directory for your project:

    ```bash
    mkdir multi-codebase-summarization
    cd multi-codebase-summarization
    ```

3. Set up your LLM environment variables:

    ```bash
    export GEMINI_API_KEY="your-api-key"
    export LLM_MODEL="gemini/gemini-2.5-flash"  # Or any LiteLLM-supported model
    ```

4. Create a `.env` file to configure the database path:

    ```bash
    echo "COCOINDEX_DB=./cocoindex.db" > .env
    ```

5. Create a `projects/` directory with one subdirectory per Python project:

    ```bash
    mkdir projects
    ```

    The resulting layout looks like this:

    ```text
    projects/
    ├── my_project_1/
    │   ├── main.py
    │   └── utils.py
    ├── my_project_2/
    │   └── app.py
    └── ...
    ```

## Define the app

Define a CocoIndex App — the top-level runnable unit in CocoIndex.

![App Definition](https://cocoindex.io/blobs/docs-v1/img/examples/multi-codebase-summarization/app.svg)

```python title="main.py"
from __future__ import annotations

import os
import pathlib
from typing import Collection

import instructor
from litellm import acompletion
from pydantic import BaseModel, Field

import cocoindex as coco
from cocoindex.connectors import localfs
from cocoindex.resources.file import FileLike, PatternFilePathMatcher

from models import CodebaseInfo

LLM_MODEL = os.environ.get("LLM_MODEL", "gemini/gemini-2.5-flash")


app = coco.App(
    "MultiCodebaseSummarization",
    app_main,
    root_dir=pathlib.Path("./projects"),
    output_dir=pathlib.Path("./output"),
)
```

- The app scans `projects/` and outputs documentation to `output/`

[→ App](/docs/programming_guide/app)

## Define the main function

![App Definition](https://cocoindex.io/blobs/docs-v1/img/examples/multi-codebase-summarization/main.svg)

In the main function, we walk through each project in the subdirectories and process it.

You choose the processing granularity. It can be
- at the directory level, one unit per project (for example, [code_embedding](https://github.com/cocoindex-io/cocoindex/tree/main/examples/code_embedding) is one project containing multiple files),
- at the file level,
- or at even smaller units (for example, the page level or the semantic-unit level).

In this example, we have a [projects folder](https://github.com/cocoindex-io/cocoindex/tree/main/examples) containing 20+ projects. It is natural to pick granularity at the directory level for each project, because we want to create a wiki page per project.

```python title="main.py"
@coco.fn
async def app_main(
    root_dir: pathlib.Path,
    output_dir: pathlib.Path,
) -> None:
    """Scan subdirectories and generate documentation for each project."""
    for entry in root_dir.resolve().iterdir():
        if not entry.is_dir() or entry.name.startswith("."):
            continue
        project_name = entry.name

        files = [
            f
            async for f in localfs.walk_dir(
                entry,
                recursive=True,
                path_matcher=PatternFilePathMatcher(
                    included_patterns=["**/*.py"],
                    excluded_patterns=["**/.*", "**/__pycache__"],
                ),
            )
        ]

        if files:
            await coco.mount(
                coco.component_subpath("project", project_name),
                process_project,
                project_name,
                files,
                output_dir,
            )
```

The main function does two things:

1. **Find all projects** — Loop through each subdirectory in `root_dir`, treating each as a separate project.

2. **Mount a processing component for each project** — For each project with Python files, `await coco.mount()` sets up a processing component. CocoIndex handles the execution and tracks dependencies automatically.

**Why processing components?** A processing component groups an item's processing together with its target states. Each component runs independently and in parallel. In this case, when `project_a` finishes, its results are applied to the external system immediately, without waiting for `project_b` or any other project.
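
The independence property can be illustrated with plain `asyncio`. This is a sketch of the behavior only, not of CocoIndex's scheduler: `process_project` here is a toy coroutine, `asyncio.sleep` stands in for the LLM work, and appending to the `applied` list stands in for writing the target state.

```python
import asyncio

applied: list[str] = []


async def process_project(name: str, seconds: float) -> None:
    await asyncio.sleep(seconds)  # stand-in for extraction and aggregation
    applied.append(f"{name}.md")  # stand-in for applying the target state


async def main() -> None:
    # project_b starts first but is slow; project_a is fast.
    # Each one applies its result as soon as it is done.
    await asyncio.gather(
        process_project("project_b", 0.2),
        process_project("project_a", 0.01),
    )


asyncio.run(main())
print(applied)  # ['project_a.md', 'project_b.md']
```

`project_a.md` lands first even though `project_b` was started earlier, because neither component waits for the other.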

To learn more about processing components, you can read the documentation:
[→ Processing Component](/docs/programming_guide/processing_component)

## Process each project

For each project, we will:
1. use an LLM to extract information from each file,
2. aggregate the extractions into a project-level summary,
3. output the summary as Markdown documentation with a Mermaid diagram.

![Process Project](https://cocoindex.io/blobs/docs-v1/img/examples/multi-codebase-summarization/project.svg)

```python title="main.py"
@coco.fn(memo=True)
async def process_project(
    project_name: str,
    files: Collection[localfs.File],
    output_dir: pathlib.Path,
) -> None:
    """Process a project: extract, aggregate, and output markdown."""
    # Extract info from each file concurrently.
    file_infos = await coco.map(extract_file_info, files)

    # Aggregate into project-level summary
    project_info = await aggregate_project_info(project_name, file_infos)

    # Generate and output markdown
    markdown = generate_markdown(project_name, project_info, file_infos)
    localfs.declare_file(
        output_dir / f"{project_name}.md", markdown, create_parent_dirs=True
    )
```

**Concurrent processing with async.** By using `coco.map()`, all file extractions run concurrently while keeping CocoIndex function calls visible to the pipeline. This is significantly faster than sequential processing, especially when making LLM API calls.
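
To see why concurrency matters for I/O-bound work such as LLM calls, compare awaiting calls one by one with awaiting them together. The sketch below uses `asyncio.gather` as a stand-in for the fan-out; it does not show how `coco.map()` is implemented, and `extract`, `sequential`, `concurrent`, and `timed` are hypothetical helpers with `asyncio.sleep` simulating a 0.1 second API call.

```python
import asyncio
import time


async def extract(name: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for one LLM API call
    return f"summary of {name}"


async def sequential(names: list[str]) -> list[str]:
    # Each call finishes before the next one starts.
    return [await extract(n) for n in names]


async def concurrent(names: list[str]) -> list[str]:
    # All calls are in flight at once; results keep the input order.
    return list(await asyncio.gather(*(extract(n) for n in names)))


async def timed(fn, names: list[str]) -> tuple[list[str], float]:
    start = time.perf_counter()
    result = await fn(names)
    return result, time.perf_counter() - start


names = ["main.py", "utils.py", "models.py"]
seq_result, seq_time = asyncio.run(timed(sequential, names))
con_result, con_time = asyncio.run(timed(concurrent, names))

print(seq_result == con_result)  # True
print(f"sequential {seq_time:.2f}s, concurrent {con_time:.2f}s")
```

Both versions return the same summaries in the same order; the concurrent one takes roughly as long as a single call instead of the sum of all calls.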

[→ Function](/docs/programming_guide/function)

## Extract file information with LLM

Now let's take a look at the details for each transformation.
For file extraction, we define a structure using Pydantic and use [Instructor](https://github.com/567-labs/instructor) to extract with LLMs.

![Extract files](https://cocoindex.io/blobs/docs-v1/img/examples/multi-codebase-summarization/extraction.svg)

### Define the data models

The key to structured LLM outputs is defining clear Pydantic models.

![Define Models](https://cocoindex.io/blobs/docs-v1/img/examples/multi-codebase-summarization/extraction-models.svg)

```python title="models.py"
from pydantic import BaseModel, Field


class FunctionInfo(BaseModel):
    """Information about a public function."""
    name: str = Field(description="Function name")
    signature: str = Field(
        description="Function signature, e.g. 'async def foo(x: int) -> str'"
    )
    is_coco_function: bool = Field(
        description="Whether decorated with @coco.fn"
    )
    summary: str = Field(description="Brief summary of what the function does")


class ClassInfo(BaseModel):
    """Information about a public class."""
    name: str = Field(description="Class name")
    summary: str = Field(description="Brief summary of what the class represents")


class CodebaseInfo(BaseModel):
    """Extracted information from Python code."""
    name: str = Field(description="File path or project name")
    summary: str = Field(description="Brief summary of purpose and functionality")
    public_classes: list[ClassInfo] = Field(default_factory=list)
    public_functions: list[FunctionInfo] = Field(default_factory=list)
    mermaid_graphs: list[str] = Field(
        default_factory=list,
        description="Mermaid graphs showing function relationships"
    )
```

### Extract file info

The core extraction function uses memoization to cache LLM results:

````python title="main.py"
_instructor_client = instructor.from_litellm(acompletion, mode=instructor.Mode.JSON)

@coco.fn(memo=True)
async def extract_file_info(file: FileLike) -> CodebaseInfo:
    """Extract structured information from a single Python file using LLM."""
    content = await file.read_text()
    file_path = str(file.file_path.path)

    prompt = f"""Analyze the following Python file and extract structured information.

File path: {file_path}

```python
{content}
```

Instructions:
1. Identify all PUBLIC classes (not starting with _) and summarize their purpose
2. Identify all PUBLIC functions (not starting with _) and summarize their purpose
3. If this file contains CocoIndex apps (coco.App), create Mermaid graphs showing the
   function call relationships (see the mermaid_graphs field description for format)
4. Provide a brief summary of the file's purpose
"""

    result = await _instructor_client.chat.completions.create(
        model=LLM_MODEL,
        response_model=CodebaseInfo,
        messages=[{"role": "user", "content": prompt}],
    )
    return CodebaseInfo.model_validate(result.model_dump())
````

**Why `memo=True` matters:** LLM calls are expensive. With memoization, CocoIndex caches the result keyed by the file content and the function's code. If you run the pipeline again without changing either, the cached result is used and no LLM call is made.

[→ Function](/docs/programming_guide/function)

## Aggregate project information

For projects with multiple files, we aggregate into a unified summary:

![Aggregate files](https://cocoindex.io/blobs/docs-v1/img/examples/multi-codebase-summarization/aggregate.svg)

```python title="main.py"
@coco.fn
async def aggregate_project_info(
    project_name: str,
    file_infos: list[CodebaseInfo],
) -> CodebaseInfo:
    """Aggregate multiple file extractions into a project-level summary."""
    if not file_infos:
        return CodebaseInfo(
            name=project_name, summary="Empty project with no Python files."
        )

    # Single file - just update the name
    if len(file_infos) == 1:
        info = file_infos[0]
        return CodebaseInfo(
            name=project_name,
            summary=info.summary,
            public_classes=info.public_classes,
            public_functions=info.public_functions,
            mermaid_graphs=info.mermaid_graphs,
        )

    # Multiple files - use LLM to create unified summary
    files_text = "\n\n".join(
        f"### {info.name}\n"
        f"Summary: {info.summary}\n"
        f"Classes: {', '.join(c.name for c in info.public_classes) or 'None'}\n"
        f"Functions: {', '.join(f.name for f in info.public_functions) or 'None'}"
        for info in file_infos
    )

    # Collect all mermaid graphs from files
    all_graphs = [g for info in file_infos for g in info.mermaid_graphs]

    prompt = f"""Aggregate the following Python files into a project-level summary.

Project name: {project_name}

Files:
{files_text}

Create a unified CodebaseInfo that:
1. Summarizes the overall project purpose (not individual files)
2. Lists the most important public classes across all files
3. Lists the most important public functions across all files
4. For mermaid_graphs: create a single unified graph showing how the CocoIndex
   components connect across the project (if applicable)
"""

    result = await _instructor_client.chat.completions.create(
        model=LLM_MODEL,
        response_model=CodebaseInfo,
        messages=[{"role": "user", "content": prompt}],
    )
    result = CodebaseInfo.model_validate(result.model_dump())

    # Keep original file-level graphs if LLM didn't generate a unified one
    if not result.mermaid_graphs and all_graphs:
        result.mermaid_graphs = all_graphs

    return result
```

This function combines file-level extractions into a single project summary:

- **Single file project** — Just use that file's info directly (no extra LLM call needed)
- **Multi-file project** — Ask the LLM to synthesize all file summaries into one cohesive project overview

The result is a unified `CodebaseInfo` that represents the entire project, not individual files.

[→ Function](/docs/programming_guide/function)

## Generate markdown output

Create the Markdown output for each project.

![Create Markdown](https://cocoindex.io/blobs/docs-v1/img/examples/multi-codebase-summarization/markdown.svg)

```python title="main.py"
@coco.fn
def generate_markdown(
    project_name: str, info: CodebaseInfo, file_infos: list[CodebaseInfo]
) -> str:
    """Generate markdown documentation from project info."""
    lines = [
        f"# {project_name}",
        "",
        "## Overview",
        "",
        info.summary,
        "",
    ]

    if info.public_classes or info.public_functions:
        lines.extend(["## Components", ""])

        if info.public_classes:
            lines.append("**Classes:**")
            for cls in info.public_classes:
                lines.append(f"- `{cls.name}`: {cls.summary}")
            lines.append("")

        if info.public_functions:
            lines.append("**Functions:**")
            for fn in info.public_functions:
                marker = " ★" if fn.is_coco_function else ""
                lines.append(f"- `{fn.signature}`{marker}: {fn.summary}")
            lines.append("")

    if info.mermaid_graphs:
        lines.extend(["## CocoIndex Pipeline", ""])
        for graph in info.mermaid_graphs:
            graph_content = graph.strip()
            if not graph_content.startswith("```"):
                lines.append("```mermaid")
                lines.append(graph_content)
                lines.append("```")
            else:
                lines.append(graph_content)
            lines.append("")

    if len(file_infos) > 1:
        lines.extend(["## File Details", ""])
        for fi in file_infos:
            lines.extend([f"### {fi.name}", "", fi.summary, ""])

    lines.extend(["---", "", "*★ = CocoIndex function*"])
    return "\n".join(lines)
```

This function converts the structured `CodebaseInfo` into readable documentation:

- **Overview** — Project summary at the top
- **Components** — Lists classes and functions with descriptions (★ marks CocoIndex functions)
- **Pipeline diagram** — Mermaid graphs showing how functions connect
- **File details** — For multi-file projects, includes per-file summaries

## Run the pipeline

```bash
cocoindex update main.py
```

CocoIndex will:

1. Scan each subdirectory in `projects/`
2. Extract structured information from Python files using the LLM
3. Aggregate file summaries into project summaries
4. Generate Markdown files in `output/`

Check the output:

```bash
ls output/
# my_project_1.md my_project_2.md ...
```

## Incremental updates

The real power shows when you make changes:

**Modify a file:**

Edit a Python file in one of your projects, then run:

```bash
cocoindex update main.py
```

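Only the modified file is re-analyzed by the LLM; unchanged files use cached results. The summary for the project that contains the file is then regenerated from the fresh and cached file results.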

**Add a new project:**

Add a new subdirectory with Python files:

```bash
mkdir projects/new_project
# add .py files
cocoindex update main.py
```

Only the new project is processed.

## Key patterns demonstrated

This example showcases several powerful patterns:

1. **Structured LLM outputs** with Instructor + Pydantic models
2. **Memoized LLM calls** to avoid redundant API costs
3. **Async concurrent processing** with `coco.map()`
4. **Hierarchical aggregation** (file → project)
5. **Incremental processing** for efficient updates

---

# Example: PDF → Markdown

Source: https://cocoindex.io/docs/examples/pdf-to-markdown/

![PDF to Markdown](https://cocoindex.io/blobs/docs-v1/img/examples/pdf-to-markdown/cover.png)

In this tutorial, we'll build a simple app that converts PDF files to Markdown and saves them to a local directory.

## Overview

![App example showing PDF to Markdown conversion](https://cocoindex.io/blobs/docs-v1/img/examples/pdf-to-markdown/app-example.svg)

1. Read PDF files from a local directory
2. Convert each file to Markdown using Docling
3. Save the Markdown files to an output directory (as **target states**)

You declare the transformation logic in plain Python and leave change handling to CocoIndex.

Think:
**target_state = transformation(source_state)**

When your source data is updated or your processing logic changes (for example, when you switch parsers or tweak conversion settings), CocoIndex reprocesses only the affected files and keeps the Markdown output up to date.

## Setup

1. Install CocoIndex and dependencies:

    ```bash
    pip install 'cocoindex>=1.0.0' docling
    ```

2. Create a new directory for your project:

    ```bash
    mkdir pdf-to-markdown
    cd pdf-to-markdown
    ```

3. Create a `pdf_files/` directory and add your PDF files:

    ```bash
    mkdir pdf_files
    ```
    You can download sample PDF files from the [git repo](https://github.com/cocoindex-io/cocoindex/tree/main/examples/pdf_to_markdown).

4. Create a `.env` file to configure the database path:

    ```bash
    echo "COCOINDEX_DB=./cocoindex.db" > .env
    ```

## Define the app

Define a CocoIndex App, the top-level runnable unit in CocoIndex. Start with the imports that `main.py` needs.

![App Definition](https://cocoindex.io/blobs/docs-v1/img/examples/pdf-to-markdown/app-def.svg)

```python title="main.py"
import pathlib

import cocoindex as coco
from cocoindex.connectors import localfs
from cocoindex.resources.file import PatternFilePathMatcher
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
```

[→ CocoIndex App](/docs/programming_guide/app)

### Define the main function

![App Definition](https://cocoindex.io/blobs/docs-v1/img/examples/pdf-to-markdown/components.svg)

In the main function, we walk through each file in the source directory and process it.

```python title="main.py"
@coco.fn
async def app_main(sourcedir: pathlib.Path, outdir: pathlib.Path) -> None:
    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.pdf"]),
    )
    await coco.mount_each(process_file, files.items(), outdir)
```
For each file, `coco.mount_each()` mounts a processing component. You choose
the processing granularity: it can be at the directory level, the file level,
or the page level.

In this example, each file is converted to Markdown independently, so the file level is the natural choice.

[→ Processing Component](/docs/programming_guide/processing_component)

### Define file processing

![File Process](https://cocoindex.io/blobs/docs-v1/img/examples/pdf-to-markdown/file-process.svg)

For a file, we use Docling to convert it to Markdown. The converter follows
Docling's [explicit accelerator configuration](https://docling-project.github.io/docling/_generated/examples/run_with_accelerator/)
pattern and is pinned to CPU for portability across local machines. The
Docling accelerator docs were checked on 2026-05-31; Docling documents CPU as
the mode that works everywhere, while MPS/CUDA/XPU depend on compatible
hardware and PyTorch builds.

```python title="main.py"
_pipeline_options = PdfPipelineOptions(
    accelerator_options=AcceleratorOptions(device=AcceleratorDevice.CPU)
)
_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=_pipeline_options)
    }
)

@coco.fn(memo=True)
def process_file(
    file: localfs.File,
    outdir: pathlib.Path,
) -> None:
    markdown = _converter.convert(
        file.file_path.resolve()
    ).document.export_to_markdown()
    outname = file.file_path.path.stem + ".md"
    localfs.declare_file(outdir / outname, markdown, create_parent_dirs=True)
```

We use `@coco.fn` with `memo=True` to create a memoized function that processes each file.

[→ Function](/docs/programming_guide/function)

### Create the App

```python title="main.py"
app = coco.App(
    "PdfToMarkdown",
    app_main,
    sourcedir=pathlib.Path("./pdf_files"),
    outdir=pathlib.Path("./out"),
)
```

## Run the pipeline

Run the pipeline:

```bash
cocoindex update main.py
```

CocoIndex will:

1. Create the `out/` directory
2. Convert each PDF in `pdf_files/` to Markdown in `out/`

Check the output:

```bash
ls out/
# example.md (one .md file for each input PDF)
```

## Incremental updates

The power of CocoIndex is **incremental processing**. Try these:

**Add a new file:**

Add a new PDF to `pdf_files/`, then run:

```bash
cocoindex update main.py
```

Only the new file is processed.

**Modify a file:**

Replace a PDF in `pdf_files/` with an updated version, then run:

```bash
cocoindex update main.py
```

Only the changed file is reprocessed.

**Delete a file:**

```bash
rm pdf_files/example.pdf
cocoindex update main.py
```

The corresponding Markdown file is automatically removed.
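
All three cases follow from treating the output files as declared target states: each run states which files should exist, and anything else is brought in line. The sketch below illustrates that reconciliation with the standard library only. It is not how CocoIndex implements `localfs.declare_file`; `reconcile` is a hypothetical helper, and it compares file contents directly instead of using tracked state.

```python
import pathlib
import tempfile


def reconcile(outdir: pathlib.Path, desired: dict[str, str]) -> dict[str, list[str]]:
    """Make `outdir` contain exactly the files in `desired` ({name: content})."""
    outdir.mkdir(parents=True, exist_ok=True)
    existing = {p.name for p in outdir.iterdir() if p.is_file()}
    actions: dict[str, list[str]] = {"written": [], "unchanged": [], "deleted": []}

    for name, content in sorted(desired.items()):
        target = outdir / name
        if name in existing and target.read_text(encoding="utf-8") == content:
            actions["unchanged"].append(name)  # already up to date
        else:
            target.write_text(content, encoding="utf-8")  # new or changed
            actions["written"].append(name)

    # Anything on disk that is no longer declared is stale and gets removed.
    for name in sorted(existing - set(desired)):
        (outdir / name).unlink()
        actions["deleted"].append(name)
    return actions


with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp) / "out"
    first = reconcile(out, {"a.md": "# A", "b.md": "# B"})
    # Second run: a.pdf was deleted upstream, b.pdf changed, c.pdf is new.
    second = reconcile(out, {"b.md": "# B v2", "c.md": "# C"})
    remaining = sorted(p.name for p in out.iterdir())

print(first, second, remaining)
```

The second call never mentions `a.md`, which is why it disappears: deletion is a consequence of the declaration, not a separate cleanup step.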

---

# Example: Podcasts → Knowledge Graph

Source: https://cocoindex.io/docs/examples/podcast-to-knowledge-graph/

![Turn podcasts into a knowledge graph with LLM and CocoIndex](https://cocoindex.io/blobs/docs-v1/img/examples/podcast-to-knowledge-graph/cover.png)

Podcasts are one of the richest sources of expert knowledge on the internet. A single Lex Fridman or Dwarkesh Patel episode can contain dozens of substantive claims about people, technologies, and organizations — but it's all locked inside hours of audio. You can't query any of it, and you can't cross-reference what two different guests said about the same topic.

In this tutorial, we'll build a [CocoIndex](https://github.com/cocoindex-io/cocoindex) pipeline that turns YouTube podcast episodes into a queryable knowledge graph. It downloads audio, transcribes with speaker diarization, uses an LLM to extract structured statements and entities, resolves duplicates across episodes, and stores everything in [SurrealDB](https://github.com/surrealdb/surrealdb) as a graph.

![Podcast episodes flowing through CocoIndex with typed LLM extraction into a SurrealDB knowledge graph](https://cocoindex.io/blobs/docs-v1/img/examples/podcast-to-knowledge-graph/diagram.png)

The whole pipeline is ordinary `async` Python and your own types. The heavy lifting — [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/), change tracking, managed graph targets — runs in a Rust engine underneath, so re-running only processes new or changed episodes.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/conversation_to_knowledge)

## What we're building

Here's the knowledge graph schema — five node types connected by four relationship types:

![Knowledge graph schema: session, statement, person, tech, and org nodes joined by session_statement, person_session, person_statement, and statement_mentions](https://cocoindex.io/blobs/docs-v1/img/examples/podcast-to-knowledge-graph/schema.png)

A **session** is one podcast episode. A **statement** is a thematic claim extracted from the conversation — e.g. "Scaling laws suggest that larger models will continue to improve." Each statement is linked to who said it and what it mentions. **Person**, **tech**, and **org** are named entities.

The tricky part: the same entity appears under different names across episodes ("GPT-4", "GPT4", "OpenAI's GPT-4"). We collapse these with entity resolution — more on that below.

## Pipeline overview

The pipeline runs in three phases.

![Three phases: per-session processing, entity resolution, knowledge base creation, landing in SurrealDB](https://cocoindex.io/blobs/docs-v1/img/examples/podcast-to-knowledge-graph/pipeline-overview.png)

1. **Per-session processing** — for each episode: download, transcribe, and extract metadata, speakers, and statements with an LLM. Sessions and statements are written immediately; they don't need cross-episode dedup.
2. **Entity resolution** — collect every raw entity name across episodes and deduplicate them with embedding similarity + LLM confirmation.
3. **Knowledge base creation** — write the canonical entities and all relationships.

![Detailed pipeline: podcast URLs fetched, transcribed, extracted into raw entities and statements, then resolved into canonical Person, Tech, and Org nodes in SurrealDB](https://cocoindex.io/blobs/docs-v1/img/examples/podcast-to-knowledge-graph/overview.png)

You [declare the transformation](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python; CocoIndex works out what to insert, update, and delete. Think: **target_state = transformation(source_state)**.

## Phase 1: per-session processing

![Phase 1 — per-session: iterate podcast URLs, fetch audio, transcribe, and extract entities and statements](https://cocoindex.io/blobs/docs-v1/img/examples/podcast-to-knowledge-graph/phase1.png)

Each session goes through a multi-step pipeline, starting from a YouTube URL.

### Fetch the transcript

We download the audio with `yt-dlp` and transcribe it with AssemblyAI, which returns speaker-diarized utterances ("Speaker A", "Speaker B", …) plus YouTube metadata.

```python title="conv_knowledge/fetch.py"
@coco.fn(memo=True)
async def fetch_transcript(youtube_id: str) -> SessionTranscript:
    url = f"https://www.youtube.com/watch?v={youtube_id}"
    with tempfile.TemporaryDirectory() as tmpdir:
        # Download audio via yt-dlp, convert to mp3 (FFmpeg)
        ydl_opts = {"format": "bestaudio/best", "outtmpl": f"{tmpdir}/audio.mp3", ...}
        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            info = ydl.extract_info(url, download=True)
        # Transcribe with AssemblyAI speaker diarization
        transcript = aai.Transcriber().transcribe(
            f"{tmpdir}/audio.mp3", aai.TranscriptionConfig(speaker_labels=True)
        )
    utterances = [Utterance(speaker=u.speaker, text=u.text) for u in transcript.utterances]
    return SessionTranscript(utterances=utterances, yt_title=info["title"], ...)
```

[`@coco.fn(memo=True)`](https://cocoindex.io/docs/advanced_topics/memoization_keys/) **memoizes** the function: if you've already fetched and transcribed a video, re-running skips it entirely — essential when you're iterating on downstream extraction and don't want to re-download hours of audio every time.

### Two-step LLM extraction

![Incremental relationship extraction: episodes processed into persons, sessions, statements, techs, and orgs](https://cocoindex.io/blobs/docs-v1/img/examples/podcast-to-knowledge-graph/relationship.png)

There's a bootstrapping problem: to attribute statements correctly, the LLM needs to know who the speakers are — but the raw transcript only has generic labels like "Speaker A". So extraction runs in two passes, both using a shared `format_transcript()` that swaps diarization labels for names.

**Step 1 — identify speakers and extract metadata.** Format the transcript with generic labels, give the LLM the YouTube metadata as context, and get back typed speaker identifications. The output is a Pydantic model, enforced by [instructor](https://github.com/instructor-ai/instructor) over [LiteLLM](https://cocoindex.io/docs/ops/litellm/):

```python title="conv_knowledge/extract.py"
@coco.fn(memo=True)
async def extract_metadata(reformatted_transcript: str, transcript: SessionTranscript) -> SessionMetadata:
    client = instructor.from_litellm(litellm.acompletion, mode=instructor.Mode.JSON)
    return await client.chat.completions.create(
        model=coco.use_context(LLM_MODEL),
        response_model=SessionMetadata,
        messages=[{"role": "system", "content": METADATA_PROMPT}, {"role": "user", "content": ...}],
    )
```

```python title="conv_knowledge/models.py"
class SpeakerIdentification(pydantic.BaseModel):
    label: str   # "A", "B"
    name: str    # "Lex Fridman" — unidentifiable speakers are excluded

class SessionMetadata(pydantic.BaseModel):
    name: str
    description: str | None
    date: str | None
    speakers: list[SpeakerIdentification]
```

**Step 2 — extract statements with real names.** Now that "Speaker A" is "Lex Fridman", reformat the transcript with real names and extract thematic statements, each with its speakers and mentioned entities:

```python title="conv_knowledge/models.py"
class RawStatement(pydantic.BaseModel):
    statement: str               # "Scaling laws suggest larger models will improve"
    speakers: list[str]          # ["Lex Fridman"]
    mentioned_person: list[str]  # ["Ilya Sutskever"]
    mentioned_tech: list[str]    # ["Large language model"]
    mentioned_org: list[str]     # ["OpenAI"]
```

Every name must be **self-contained** — the prompt forbids pronouns, speaker labels, or contextual references — because statements from different episodes get cross-referenced later, and a name like "he" or "the host" is meaningless outside its transcript.
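
Both extraction passes rely on the shared `format_transcript()` helper, which this walkthrough does not show. Below is a minimal sketch of what such a helper could look like, assuming one output line per utterance and a `{label: name}` mapping built from the step 1 result. The `Utterance` dataclass is a stand-in for the example's own type, and the signature is an assumption rather than the repository's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    speaker: str  # diarization label, e.g. "A"
    text: str


def format_transcript(
    utterances: list[Utterance], names: Optional[dict[str, str]] = None
) -> str:
    """Render one line per utterance, preferring real names over labels."""
    names = names or {}
    lines = []
    for u in utterances:
        # Speakers that were not identified keep their generic label.
        who = names.get(u.speaker, f"Speaker {u.speaker}")
        lines.append(f"{who}: {u.text}")
    return "\n".join(lines)


utterances = [
    Utterance("A", "Welcome to the show."),
    Utterance("B", "Thanks for having me."),
    Utterance("C", "Quick question from the audience."),
]
step1_text = format_transcript(utterances)  # generic labels only
step2_text = format_transcript(utterances, {"A": "Lex Fridman", "B": "Jane Doe"})
print(step2_text)
```

In this sketch the first call feeds speaker identification, and the second call produces the text used for statement extraction once the names are known.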

### Declare the session and statements

After extraction we [declare](https://cocoindex.io/docs/programming_guide/target_state/) the session and its statements as records in SurrealDB. IDs come from CocoIndex's [`IdGenerator`](https://cocoindex.io/docs/common_resources/id_generation/), which is stable — the same inputs always yield the same ID, so re-running never duplicates. `next_id(content)` folds the content in, so an ID stays stable even if statement order changes.

```python title="conv_knowledge/app.py"
id_gen = IdGenerator(youtube_id)
session_id = await id_gen.next_id()
session_table.declare_record(row=Session(id=session_id, youtube_id=youtube_id, name=metadata.name, transcript=step2_text, ...))

for stmt in stmt_extraction.statements:
    stmt_id = await id_gen.next_id(stmt.statement)
    statement_table.declare_record(row=Statement(id=stmt_id, statement=stmt.statement))
    session_statement_rel.declare_relation(from_id=session_id, to_id=stmt_id)
```
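
To see why content-derived IDs survive reordering, consider the stdlib-only sketch below. It is not CocoIndex's `IdGenerator`, whose internals are not described here; `StableIdGenerator` is a hypothetical, synchronous class that hashes a namespace, the content, and an occurrence counter.

```python
import hashlib
from collections import Counter


class StableIdGenerator:
    """Derive IDs from (namespace, content, occurrence), not from call order."""

    def __init__(self, namespace: str) -> None:
        self._namespace = namespace
        self._seen = Counter()  # how many times each content value was used

    def next_id(self, content: str = "") -> int:
        occurrence = self._seen[content]
        self._seen[content] += 1
        payload = f"{self._namespace}\x00{content}\x00{occurrence}".encode("utf-8")
        digest = hashlib.sha256(payload).digest()
        return int.from_bytes(digest[:8], "big") >> 1  # 63-bit non-negative int


def ids_for(statements: list[str]) -> dict[str, int]:
    gen = StableIdGenerator("video-123")  # placeholder namespace
    return {s: gen.next_id(s) for s in statements}


original = ids_for(["Scaling helps.", "Data matters."])
reordered = ids_for(["Data matters.", "Scaling helps."])
print(original == reordered)
```

Because position never enters the hash, swapping two statements leaves both IDs untouched. The occurrence counter only matters when the same content appears more than once under one namespace.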

Each session runs as an independent [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) via [`coco.use_mount`](https://cocoindex.io/docs/programming_guide/app/), keyed by the YouTube ID — so adding an episode only processes that episode:

```python title="conv_knowledge/app.py"
raw = await coco.use_mount(
    coco.component_subpath("session", youtube_id),
    process_session, youtube_id,
    session_table, statement_table, session_statement_rel,
)
```

`process_session` returns the raw entity names and statement linkages that Phases 2 and 3 need. Sessions and statements are already in SurrealDB; the raw entities are carried forward for dedup.

## Phase 2: entity resolution

![Phase 2 — raw Tech, Person, and Org names deduplicated via embedding similarity and LLM confirmation into canonical entities](https://cocoindex.io/blobs/docs-v1/img/examples/podcast-to-knowledge-graph/phase2.png)

Now we have a pile of raw names from every episode, with the same entity under many spellings ("GPT-4" vs "GPT4", "Sam Altman" vs "Samuel Altman"). CocoIndex ships an [`entity_resolution`](https://cocoindex.io/docs/ops/entity_resolution/) utility that collapses them: it embeds each name, finds near-matches by vector similarity, and asks an LLM to confirm only the close pairs — cheap embeddings filter the field, expensive LLM calls happen only where it's ambiguous.

```python title="conv_knowledge/app.py"
@coco.fn(memo=True)
async def _resolve_entities(all_raw_entities: set[str]) -> dict[str, str | None]:
    result = await resolve_entities(
        entities=all_raw_entities,
        embedder=coco.use_context(EMBEDDER),               # Snowflake/snowflake-arctic-embed-xs
        resolve_pair=LlmPairResolver(model=coco.use_context(RESOLUTION_LLM_MODEL)),
    )
    return result.to_dict()  # {"Apple Inc.": None, "Apple": "Apple Inc.", "AAPL": "Apple Inc."}
```
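
The two-stage idea (a cheap similarity filter, then an expensive confirmation only for close pairs) can be illustrated without any model. The sketch below is not the `entity_resolution` implementation: it substitutes `difflib` string similarity for embedding similarity and a plain callback for the LLM. `resolve_names`, `confirm`, and `threshold` are hypothetical names chosen for this illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional


def resolve_names(
    names: set[str],
    confirm: Callable[[str, str], bool],
    threshold: float = 0.6,
) -> dict[str, Optional[str]]:
    """Map each name to its canonical form, or to None if it is canonical."""
    canonicals: list[str] = []
    mapping: dict[str, Optional[str]] = {}
    # Longer names first, so the fuller spelling tends to become canonical.
    for name in sorted(names, key=lambda n: (-len(n), n)):
        match = None
        for candidate in canonicals:
            similarity = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
            # Stage 1 is the cheap filter; stage 2 runs only for close pairs.
            if similarity >= threshold and confirm(name, candidate):
                match = candidate
                break
        mapping[name] = match
        if match is None:
            canonicals.append(name)
    return mapping


confirm_calls: list[tuple[str, str]] = []


def _normalize(s: str) -> str:
    return "".join(ch for ch in s.lower() if ch.isalnum())


def same_after_normalizing(a: str, b: str) -> bool:
    """Stand-in for the LLM check: equal once case and punctuation are dropped."""
    confirm_calls.append((a, b))
    return _normalize(a) == _normalize(b)


resolved = resolve_names({"GPT-4", "GPT4", "Sam Altman"}, same_after_normalizing)
print(resolved, confirm_calls)
```

With three names there are three possible pairs, but the confirmation callback runs only once, for the pair whose spellings are close. The same reasoning is what keeps LLM usage low in the real utility.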

Resolution runs independently per entity type, so CocoIndex processes person, tech, and org concurrently:

```python title="conv_knowledge/app.py"
entity_dedup = dict(zip(
    [cfg.name for cfg in ENTITY_TYPES],
    await asyncio.gather(*(
        coco.use_mount(coco.component_subpath("resolve", cfg.name),
                       _resolve_entities, _collect_all_raw(all_session_raw, cfg.name))
        for cfg in ENTITY_TYPES
    )),
))
```

A smaller, cheaper model handles these confirmations (configurable via `RESOLUTION_LLM_MODEL`).

## Phase 3: knowledge base creation

With the dedup maps ready, we write the final graph. Canonical entities become nodes; every relationship uses resolved names. `resolve_canonical(name, dedup)` chases the dedup chain to the root — `resolve_canonical("AAPL", dedup)` → `"Apple Inc."`.

```python title="conv_knowledge/app.py"
@coco.fn
async def create_knowledge_base(all_session_raw, entity_dedup, entity_tables, ...):
    # Canonical entity nodes (name is the id)
    for cfg in ENTITY_TYPES:
        for name, upstream in entity_dedup[cfg.name].items():
            if upstream is None:                      # this name is canonical
                entity_tables[cfg.name].declare_record(row=Entity(id=name, name=name))

    # Relationships, using canonical names
    for session_raw in all_session_raw:
        for stmt in session_raw.statements:
            for cfg in ENTITY_TYPES:
                dedup = entity_dedup[cfg.name]
                for canonical in {resolve_canonical(e, dedup) for e in getattr(stmt.raw, f"mentioned_{cfg.name}")}:
                    statement_mentions_rel.declare_relation(
                        from_id=stmt.id, to_id=canonical,
                        to_table=entity_tables[cfg.name])   # polymorphic target
```
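
`resolve_canonical` itself is not shown in this walkthrough. A minimal sketch consistent with the description above follows, assuming the dedup map has the `{name: upstream_or_None}` shape returned by `_resolve_entities`. The cycle guard and the pass-through for unknown names are assumptions added for safety, not documented behavior.

```python
from typing import Optional


def resolve_canonical(name: str, dedup: dict[str, Optional[str]]) -> str:
    """Follow upstream links until reaching a name with no upstream."""
    seen = {name}
    current = name
    # A missing key is treated like None: the name is its own canonical form.
    while (upstream := dedup.get(current)) is not None:
        if upstream in seen:
            raise ValueError(f"cycle in dedup map at {upstream!r}")
        seen.add(upstream)
        current = upstream
    return current


dedup = {"Apple Inc.": None, "Apple": "Apple Inc.", "AAPL": "Apple"}
print(resolve_canonical("AAPL", dedup))
```

In this sample map, `"AAPL"` points at `"Apple"`, which in turn points at `"Apple Inc."`, so the lookup takes two hops before it reaches the root.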

The `statement_mentions` relationship is **polymorphic** — its target can be a person, tech, or org table — and `to_table` tells CocoIndex which table the target ID belongs to. The targets themselves are mounted once in `app_main`:

```python title="conv_knowledge/app.py"
statement_mentions_rel = await surrealdb.mount_relation_target(
    SURREAL_DB, "statement_mentions", statement_table,
    [entity_tables[cfg.name] for cfg in ENTITY_TYPES],   # polymorphic TO
)
```

## Incremental updates

This isn't a one-shot job — you'll add episodes over time and evolve the schema. CocoIndex's memoization and component model make both efficient.

**Adding episodes.** A new URL re-runs the pipeline, but only the new episode is processed: `fetch_transcript` and both extraction steps are memoized for existing episodes, entity resolution reuses cached embeddings and decisions and only makes fresh LLM calls for genuinely new names, and the declarative targets diff the rest. Removing an episode deletes its component — so its session, statements, and relationships are cleaned out of SurrealDB automatically.

**Evolving the schema.** Say you add a `Product` entity type:

| Pipeline step | What happens | Why |
|---|---|---|
| Fetch transcript | **Reused** | Memoized, input unchanged |
| Step 1: speaker identification | **Reused** | Prompt unchanged |
| Step 2: statement extraction | **Re-runs** | Extraction prompt changed |
| Entity resolution (person, tech, org) | **Reused** | Raw entities unchanged |
| Entity resolution (product) | **Runs fresh** | New type |
| Knowledge base creation | **Re-declared** | New nodes + relationships |

The expensive operations — download, transcription, speaker ID — are fully reused. Add one entity type across 50 episodes and you re-run only the statement-extraction calls plus resolution for the new type.

## Run the pipeline

You'll need Python 3.11+, [FFmpeg](https://ffmpeg.org/), Docker, an [AssemblyAI API key](https://www.assemblyai.com/) (transcription), and an OpenAI API key (extraction).

Start SurrealDB:

```sh
docker run -d --name surrealdb --user root -p 8787:8000 \
  -v surrealdb-data:/data surrealdb/surrealdb:latest \
  start --user root --pass root surrealkv:/data/database
```

Set keys and install:

```sh
export ASSEMBLYAI_API_KEY="..."
export OPENAI_API_KEY="sk-..."
pip install -e .
```

Add YouTube URLs to `input/sample.txt` (one per line, `#` for comments), then build the graph — incremental, so re-running skips episodes already processed:

```sh
cocoindex update conv_knowledge.app
```

## Explore the results

SurrealDB ships [Surrealist](https://surrealdb.com/surrealist), a built-in explorer. Connect to `ws://localhost:8787`, namespace `cocoindex`, database `yt_conversations`. The graph view shows persons (blue) linked to the statements (pink) they made:

![SurrealDB graph view: persons and statements joined by person_statement and statement_mentions edges](https://cocoindex.io/blobs/docs-v1/img/examples/podcast-to-knowledge-graph/surreal_statement_person.png)

You can also run analytical queries — for example, which technologies are mentioned by the most distinct people across every episode:

```surql
SELECT name,
  array::len(array::distinct(
    <-statement_mentions<-statement<-person_statement<-person.id
  )) AS person_count
FROM tech
ORDER BY person_count DESC
LIMIT 15;
```

![Surrealist results: technologies ranked by distinct people mentioning them — artificial intelligence, language model, machine learning, …](https://cocoindex.io/blobs/docs-v1/img/examples/podcast-to-knowledge-graph/surreal_top_mentioned_tech.png)

A few more:

```surql
-- Each statement with the people who made it
SELECT <-person_statement<-person.name AS speaker, statement FROM statement;

-- Everything involved in each statement
SELECT statement,
  ->statement_mentions->person.name AS persons,
  ->statement_mentions->tech.name AS techs,
  ->statement_mentions->org.name AS orgs
FROM statement;
```

## Run it

The full, runnable example is in the CocoIndex repo: [examples/conversation_to_knowledge](https://github.com/cocoindex-io/cocoindex/tree/main/examples/conversation_to_knowledge). Got a podcast, a meeting archive, or any other corpus you want to turn into a graph? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: Docs → Knowledge Graph

Source: https://cocoindex.io/docs/examples/docs-to-knowledge-graph/

![Turn documentation into a knowledge graph with LLM extraction and CocoIndex](https://cocoindex.io/blobs/docs-v1/img/examples/docs-to-knowledge-graph/cover.png)

Documentation is a web of concepts pretending to be a list of files. "Incremental processing relies on change detection", "a target receives declared target states" — every page asserts relationships like these, but they're locked in prose. You can search docs for keywords; you can't ask how the concepts connect.

In this tutorial, we'll build a [CocoIndex](https://github.com/cocoindex-io/cocoindex) pipeline that turns a folder of Markdown docs into a concept knowledge graph in [Neo4j](https://neo4j.com/). For each document, an LLM extracts a summary plus a set of `(subject, predicate, object)` triples — *"engine detects source changes"*, *"triple becomes relationship in graph"* — and the triples become a property graph you can query in Cypher.

The whole pipeline is ordinary `async` Python and your own types. The heavy lifting — [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/), change tracking, managed graph targets — runs in a Rust engine underneath, so editing one doc re-extracts only that doc, and the graph reconciles itself: no orphaned nodes, no stale edges, no cleanup scripts.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/docs_to_knowledge_graph)

## Use cases

- **GraphRAG over your docs** — retrieval that follows concept relationships instead of (or alongside) vector similarity.
- **Agent memory and context** — give an agent a queryable map of your product's concepts and how they relate.
- **Docs navigation and gap analysis** — which pages cover which concepts, which concepts are mentioned everywhere but defined nowhere.

## What we're building

The graph schema is small — two node types, two relationship types:

![Graph schema: Document nodes connected to Entity nodes by MENTION edges, and Entity nodes connected to each other by RELATIONSHIP edges carrying the predicate](https://cocoindex.io/blobs/docs-v1/img/examples/docs-to-knowledge-graph/schema.png)

- **`Document`** nodes — one per Markdown file, keyed by filename, with an LLM-generated `title` and `summary`.
- **`Entity`** nodes — one per distinct concept named in any triple, keyed by the concept name.
- **`RELATIONSHIP`** edges — `Entity → Entity`, with the `predicate` stored as an edge property.
- **`MENTION`** edges — `Document → Entity`, recording which document named which concept.

Here's the result in Neo4j Browser, built from a docs folder — documents (cyan) at the center of the concepts (pink) they mention:

![The resulting graph in Neo4j Browser: Document and Entity nodes joined by MENTION and RELATIONSHIP edges](https://cocoindex.io/blobs/docs-v1/img/examples/docs-to-knowledge-graph/neo4j-browser.png)

## Why CocoIndex for knowledge graphs

A knowledge graph over living docs is exactly the kind of pipeline that's easy to demo and hard to keep correct:

- **LLM extraction is expensive.** [Memoization](https://cocoindex.io/docs/advanced_topics/memoization_keys/) caches every extraction by content — edit one doc and only that doc hits the LLM again. A no-change re-run makes zero LLM calls.
- **Graphs accumulate garbage.** Delete a doc, rename a concept, tighten a prompt — and a hand-rolled pipeline leaves orphaned nodes and stale edges behind. In CocoIndex, nodes and edges are [target states](https://cocoindex.io/docs/programming_guide/target_state/): you declare what should exist, and the engine inserts, updates, and deletes the difference.
- **Cross-document steps need cross-document tracking.** Entities are shared between docs, so they can't be owned by any single file's processing. The two-phase shape below — per-file fan-out, then one graph pass — maps directly onto CocoIndex's [processing components](https://cocoindex.io/docs/programming_guide/processing_component/).
- **Plain Python.** Extraction is [instructor](https://github.com/instructor-ai/instructor) over [LiteLLM](https://docs.litellm.ai/) with your own Pydantic models — swap in any provider, prompt, or schema.

## Pipeline overview

![CocoIndex flow: Markdown docs walked from the filesystem, per-doc LLM extraction declaring Document nodes and carrying triples forward, then a single graph-building pass declaring Entity nodes and edges into Neo4j](https://cocoindex.io/blobs/docs-v1/img/examples/docs-to-knowledge-graph/flow-v1.png)

The pipeline runs in two phases:

1. **Per-file extraction** — for each Markdown file: extract a summary and the relationship triples with an LLM. The `Document` node is declared here; the triples are carried forward.
2. **Graph building** — one pass over all triples declares the deduplicated `Entity` nodes and the `RELATIONSHIP` / `MENTION` edges.

You [declare the transformation](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python; CocoIndex works out what to insert, update, and delete. Think: **target_state = transformation(source_state)**.

## Define the graph schema

Nodes and edges are plain dataclasses. Each becomes a Neo4j label (or relationship type), with one field as the primary key:

```python title="main.py"
@dataclass
class Document:
    filename: str  # primary key
    title: str
    summary: str


@dataclass
class Entity:
    value: str  # primary key — the concept name


@dataclass
class Relationship:
    """RELATIONSHIP edge payload. ``id`` is a stable hash of the triple so the
    same (subject, predicate, object) always maps to a single edge; the
    ``predicate`` is stored as an edge property."""

    id: int
    predicate: str
```

`MENTION` carries no payload, so it gets no schema at all — the Neo4j connector derives its identity from the (document, entity) endpoints: one edge per pair.
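
The docstring above calls `id` a stable hash of the triple, but this section does not show how it is computed. One way to get such an ID is sketched below, assuming a SHA-256 digest truncated to a signed 64-bit integer; `triple_id` is a hypothetical helper, not necessarily what the example uses. Python's built-in `hash()` would not work here, because by default it is salted per process for strings.

```python
import hashlib


def triple_id(subject: str, predicate: str, obj: str) -> int:
    """Map a (subject, predicate, object) triple to a stable 64-bit integer."""
    # The unit separator keeps ("ab", "c", "d") distinct from ("a", "bc", "d").
    payload = "\x1f".join((subject, predicate, obj)).encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


edge_a = triple_id("engine", "detects", "source changes")
edge_b = triple_id("engine", "detects", "source changes")
edge_c = triple_id("source changes", "detects", "engine")
print(edge_a == edge_b, edge_a == edge_c)
```

Equal triples always produce the same ID across runs and machines, so declaring the same relationship from two documents yields one edge rather than two.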

## Shared resources: the lifespan

The [lifespan](https://cocoindex.io/docs/programming_guide/context/) provides what every step needs — the Neo4j connection factory and the LLM model id — once at startup, via [context keys](https://cocoindex.io/docs/programming_guide/context/):

```python title="main.py"
KG_DB = coco.ContextKey[neo4j.ConnectionFactory]("kg_db")
LLM_MODEL = coco.ContextKey[str]("llm_model", detect_change=True)


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    builder.provide(
        KG_DB,
        neo4j.ConnectionFactory(
            uri=os.environ.get("NEO4J_URI", "bolt://localhost:7687"),
            auth=(
                os.environ.get("NEO4J_USER", "neo4j"),
                os.environ.get("NEO4J_PASSWORD", "cocoindex"),
            ),
            database=os.environ.get("NEO4J_DATABASE", "neo4j"),
        ),
    )
    builder.provide(LLM_MODEL, os.environ.get("LLM_MODEL", "openai/gpt-5.4"))
    yield
```

Note `detect_change=True` on `LLM_MODEL`: the model id participates in change detection. Point `LLM_MODEL` at a different model and CocoIndex knows every memoized extraction is stale — the whole corpus re-extracts on the next run, with no cache to clear manually. The model is any [LiteLLM provider string](https://docs.litellm.ai/docs/providers); set `LLM_MODEL=ollama/llama3.2` to run extraction locally with no API key.

## LLM extraction

Extraction is typed end to end: Pydantic models describe what we want, instructor enforces them. The field descriptions double as instructions to the model:

```python title="main.py"
class ExtractedRelationship(pydantic.BaseModel):
    subject: str = pydantic.Field(
        description="The concept the statement is about, e.g. 'CocoIndex'."
    )
    predicate: str = pydantic.Field(
        description="How subject relates to object, e.g. 'supports'."
    )
    object: str = pydantic.Field(
        description="The related concept, e.g. 'Incremental Processing'."
    )


class RelationshipList(pydantic.BaseModel):
    relationships: list[ExtractedRelationship] = pydantic.Field(default_factory=list)
```

Two memoized functions call the LLM — one for the summary, one for the triples:

```python title="main.py"
@coco.fn(memo=True)
async def extract_relationships(content: str) -> list[Triple]:
    client = instructor.from_litellm(litellm.acompletion, mode=instructor.Mode.JSON)
    result = await client.chat.completions.create(
        model=coco.use_context(LLM_MODEL),
        response_model=RelationshipList,
        messages=[
            {"role": "system", "content": RELATIONSHIP_PROMPT},
            {"role": "user", "content": content},
        ],
    )
    validated = RelationshipList.model_validate(result.model_dump())
    return [Triple(r.subject, r.predicate, r.object) for r in validated.relationships]
```

[`@coco.fn(memo=True)`](https://cocoindex.io/docs/programming_guide/function/) is what makes iteration affordable: the result is cached keyed by the document content (and the function's own code). Unchanged docs never hit the LLM again. The prompt steers extraction toward *"concepts, not code"* — salient noun-phrase subjects and objects, short verb-phrase predicates, only relationships supported by the text.
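
The caching rule (same content and same code means no new call) can be demonstrated with a small in-memory decorator. This is a sketch of the idea only, not how `@coco.fn(memo=True)` is implemented: `memoized_by_content` is a hypothetical name, the cache is not persisted, and the code fingerprint is a rough one built from the function's bytecode and constants.

```python
import hashlib
from typing import Callable

_cache: dict[str, str] = {}


def memoized_by_content(fn: Callable[[str], str]) -> Callable[[str], str]:
    """Cache results under a key derived from the function's code and its input."""
    code = fn.__code__
    code_fp = hashlib.sha256(
        code.co_code + repr(code.co_consts).encode("utf-8")
    ).hexdigest()

    def wrapper(content: str) -> str:
        key = hashlib.sha256(f"{code_fp}:{content}".encode("utf-8")).hexdigest()
        if key not in _cache:
            _cache[key] = fn(content)  # the expensive call happens only here
        return _cache[key]

    return wrapper


calls: list[str] = []


@memoized_by_content
def fake_extract(content: str) -> str:
    """Stand-in for an LLM call; records each real invocation."""
    calls.append(content)
    return content.upper()


first = fake_extract("doc a")
second = fake_extract("doc a")  # served from the cache
third = fake_extract("doc b")   # new content, so a real call
print(calls)
```

Only two of the three invocations reach `fake_extract`. Editing the function body would change `code_fp` and therefore every key, which mirrors why a code change invalidates earlier results.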

## Phase 1: per-file extraction

![Phase 1 — one processing component per doc: each file goes through memoized LLM extraction, declares its Document node into Neo4j, and returns DocTriples for phase 2](https://cocoindex.io/blobs/docs-v1/img/examples/docs-to-knowledge-graph/stage-phase1.png)

`process_file` runs once per document: extract the summary, declare the `Document` node, extract the triples, and return them for phase 2.

```python title="main.py"
@coco.fn(memo=True)
async def process_file(
    file: localfs.File,
    document_table: neo4j.TableTarget[Document],
) -> DocTriples:
    content = await file.read_text()
    filename = file.file_path.path.as_posix()

    summary = await extract_summary(content)
    document_table.declare_record(
        row=Document(filename=filename, title=summary.title, summary=summary.summary)
    )

    triples = await extract_relationships(content)
    return DocTriples(filename=filename, triples=triples)
```

Each file runs as its own [processing component](https://cocoindex.io/docs/programming_guide/processing_component/), mounted in `app_main` and keyed by the file path:

```python title="main.py"
file_coros = []
async for path_key, file in files.items():
    file_coros.append(
        coco.use_mount(
            coco.component_subpath("file", path_key),
            process_file,
            file,
            document_table,
        )
    )
docs: list[DocTriples] = list(await asyncio.gather(*file_coros))
```

Why a component per file? Ownership. The component at `("file", path_key)` owns that document's `Document` node — if the file disappears, so does the component, and CocoIndex deletes its node (and the `MENTION` edges pointing from it) automatically. [`coco.use_mount`](https://cocoindex.io/docs/programming_guide/app/) returns each file's triples, and `asyncio.gather` runs all files concurrently.
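The ownership rule can be pictured with a small sketch. This is a toy model, not the engine's data structures: the `Declared` mapping and `states_to_delete` helper are hypothetical names, and target states are represented as plain strings.

```python
# Hypothetical model: component path -> target states that component declared.
Declared = dict[tuple[str, str], set[str]]


def states_to_delete(previous: Declared, current: Declared) -> set[str]:
    """States owned by components that existed last run but not this run."""
    gone = previous.keys() - current.keys()
    return set().union(*(previous[path] for path in gone))


run_1: Declared = {
    ("file", "a.md"): {"Document(a.md)"},
    ("file", "b.md"): {"Document(b.md)"},
}
# b.md was deleted from the source folder, so its component is never mounted.
run_2: Declared = {("file", "a.md"): {"Document(a.md)"}}

print(states_to_delete(run_1, run_2))  # prints {'Document(b.md)'}
```

Because each state has exactly one owning path, cleanup needs no knowledge of what the component did, only that it is gone.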

## Phase 2: build the concept graph

![Phase 2 — a single build_graph component: all docs' triples are deduplicated into Entity nodes and RELATIONSHIP / MENTION edges, declared into Neo4j](https://cocoindex.io/blobs/docs-v1/img/examples/docs-to-knowledge-graph/stage-phase2.png)

A single component takes every file's triples and declares the cross-document parts of the graph: deduplicated `Entity` nodes and the two edge types.

```python title="main.py"
@coco.fn
async def build_graph(
    docs: list[DocTriples],
    entity_table: neo4j.TableTarget[Entity],
    relationship_rel: neo4j.RelationTarget[Relationship],
    mention_rel: neo4j.RelationTarget[Any],
) -> None:
    entities: set[str] = set()
    mentions: set[tuple[str, str]] = set()  # (filename, entity value)

    for doc in docs:
        for t in doc.triples:
            entities.add(t.subject)
            entities.add(t.object)
            mentions.add((doc.filename, t.subject))
            mentions.add((doc.filename, t.object))

            rel_id = await generate_id((t.subject, t.predicate, t.object))
            relationship_rel.declare_relation(
                from_id=t.subject,
                to_id=t.object,
                record=Relationship(id=rel_id, predicate=t.predicate),
            )

    for value in entities:
        entity_table.declare_record(row=Entity(value=value))

    for filename, entity in mentions:
        mention_rel.declare_relation(from_id=filename, to_id=entity)
```

Two details carry the correctness:

- **Stable edge identity.** [`generate_id`](https://cocoindex.io/docs/common_resources/id_generation/) hashes the triple, so the same `(subject, predicate, object)` always maps to the same edge — re-asserting a fact in another doc is a no-op, not a duplicate.
- **Entities live here, not in phase 1.** Concepts are shared across documents, so no single file's component can own them. The graph component owns the entity set as one target state; when the set of triples changes, CocoIndex diffs it — entities no longer named anywhere are deleted from Neo4j along with their edges.

This is plain Python doing set-dedup in memory — no framework abstractions. The declarative part is only at the boundary: `declare_record` / `declare_relation` say what should exist, and the engine reconciles.
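As a rough illustration of stable edge identity, the sketch below hashes a triple into a deterministic id and shows that the same fact asserted by two docs lands on one edge. `triple_id` is a hypothetical stand-in written with `hashlib`; it is not the `generate_id` function from CocoIndex, whose exact hashing scheme is not shown here.

```python
import hashlib


def triple_id(subject: str, predicate: str, obj: str) -> str:
    """Hypothetical helper: deterministic id derived from the triple's content."""
    raw = "\x1f".join((subject, predicate, obj))  # separator avoids ambiguous joins
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


doc_a = [("CocoIndex", "supports", "Incremental Processing")]
doc_b = [
    ("CocoIndex", "supports", "Incremental Processing"),  # same fact, another doc
    ("CocoIndex", "runs on", "Rust"),
]

edges: dict[str, tuple[str, str, str]] = {}
for triples in (doc_a, doc_b):
    for t in triples:
        edges[triple_id(*t)] = t  # same id overwrites in place, never duplicates

print(len(edges))  # prints 2
```

Three declarations produce two edges, which is the behavior the first bullet above describes.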

## Wire it up: app_main

`app_main` mounts the targets and runs the two phases. Node tables come first, because relation targets are declared *between* two node tables — that's how the connector knows each edge's endpoint labels and keys:

```python title="main.py"
@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    document_table = await neo4j.mount_table_target(
        KG_DB,
        "Document",
        await neo4j.TableSchema.from_class(Document, primary_key="filename"),
        primary_key="filename",
    )
    entity_table = await neo4j.mount_table_target(
        KG_DB,
        "Entity",
        await neo4j.TableSchema.from_class(Entity, primary_key="value"),
        primary_key="value",
    )

    relationship_rel = await neo4j.mount_relation_target(
        KG_DB,
        "RELATIONSHIP",
        entity_table,
        entity_table,
        await neo4j.TableSchema.from_class(Relationship, primary_key="id"),
        primary_key="id",
    )
    mention_rel = await neo4j.mount_relation_target(
        KG_DB, "MENTION", document_table, entity_table
    )

    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md", "**/*.mdx"]),
    )
    # ... phase 1 fan-out (above), then:
    await coco.mount(
        coco.component_subpath("build_graph"),
        build_graph,
        docs,
        entity_table,
        relationship_rel,
        mention_rel,
    )


app = coco.App(
    coco.AppConfig(name="DocsToKnowledgeGraph"),
    app_main,
    sourcedir=pathlib.Path("./markdown_files"),
)
```

That's the entire pipeline — one file, ~200 lines.

## Run the pipeline

You'll need a Neo4j instance and an LLM key. Start Neo4j with Docker:

```sh
docker run -d \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/cocoindex \
  --name cocoindex-neo4j \
  neo4j:5.26-community
```

Set up the environment and install:

```sh
cp .env.example .env   # fill in OPENAI_API_KEY (or set LLM_MODEL=ollama/llama3.2)
pip install -e .
```

The example ships a small `markdown_files/` folder of sample docs so it runs out of the box. Build the graph:

```sh
cocoindex update main
```

To graph your own docs, drop `.md` / `.mdx` files into `markdown_files/` — or point `sourcedir` at your real docs folder — and re-run.

## Explore the graph

Open [Neo4j Browser](http://localhost:7474) (`neo4j` / `cocoindex`) and ask the graph questions:

```cypher
// Everything
MATCH p=()-->() RETURN p LIMIT 200

// How concepts relate
MATCH (a:Entity)-[r:RELATIONSHIP]->(b:Entity)
RETURN a.value, r.predicate, b.value

// Concepts mentioned in the most documents
MATCH (d:Document)-[:MENTION]->(e:Entity)
RETURN e.value, count(DISTINCT d) AS docs
ORDER BY docs DESC LIMIT 10
```

## Incremental updates

This is where the declarative model pays for itself. You never compute a diff or write cleanup logic — change something, re-run `cocoindex update main`, and CocoIndex works out the minimum set of LLM calls and graph writes.

**Data changes.**

- **Edit one doc** — only that doc's component re-runs and re-extracts. If its triples changed, `build_graph` re-runs and diffs the graph: new entities and edges are inserted, ones no longer supported anywhere are deleted. Every other doc's extraction is served from the memo cache.
- **Add a doc** — one new component, one extraction, plus the graph diff.
- **Delete a doc** — its component disappears, so its `Document` node and `MENTION` edges are cleaned up automatically; concepts only that doc introduced vanish from the entity set on the next graph pass.
- **Nothing changed** — the run completes in a fraction of a second with zero LLM calls.

**Logic changes** are reconciled the same way:

- **Tighten the extraction prompt** — the function's code changed, so all docs re-extract; the graph then diffs against what's in Neo4j and applies only the difference.
- **Swap the LLM** — `LLM_MODEL` has `detect_change=True`, so changing the env var invalidates every memoized extraction. No cache to clear, no manual rebuild.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/docs_to_knowledge_graph](https://github.com/cocoindex-io/cocoindex/tree/main/examples/docs_to_knowledge_graph).

One natural next step: the LLM will sometimes name the same concept two ways ("CocoIndex" vs "Cocoindex"). The [meeting notes graph example](https://github.com/cocoindex-io/cocoindex/tree/main/examples/meeting_notes_graph_neo4j) adds an embedding + LLM [entity-resolution](https://cocoindex.io/docs/ops/entity_resolution/) pass that collapses near-duplicates — it drops into this pipeline between the two phases. For a bigger end-to-end graph build (transcription, multi-entity schemas, polymorphic edges), see [Turn Podcasts into a Knowledge Graph](https://cocoindex.io/docs/examples/podcast-to-knowledge-graph).

Got a docs folder, a wiki, or a pile of specs you want to turn into a graph? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: Meeting Notes → Knowledge Graph

Source: https://cocoindex.io/docs/examples/meeting-notes-to-knowledge-graph/

![Turn meeting notes into a self-updating knowledge graph with LLM extraction and CocoIndex](https://cocoindex.io/blobs/docs-v1/img/examples/meeting-notes-to-knowledge-graph/cover.png)

Meeting notes are a graph pretending to be a folder of documents. Every note records who ran the meeting, who showed up, what got decided, and who owns each task — relationships between people, meetings, and tasks. But they're written as prose, scattered across a shared drive, so you can full-text search them and not much else. You can't ask *"what is Alice on the hook for across every meeting this quarter?"* or *"which meetings did the platform team actually attend?"*

In this tutorial, we'll build a [CocoIndex](https://github.com/cocoindex-io/cocoindex) pipeline that turns a Google Drive folder of Markdown meeting notes into a queryable knowledge graph in [Neo4j](https://neo4j.com/). An LLM extracts the organizer, participants, and tasks from each meeting; an embedding + LLM **entity-resolution** pass collapses the same person written five different ways into one node; and the result is a property graph you can query in Cypher.

The whole pipeline is ordinary `async` Python and your own types. The heavy lifting — [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/), change tracking, managed graph targets — runs in a Rust engine underneath, so editing one note re-extracts only that note, and the graph reconciles itself: no orphaned people, no stale edges, no cleanup scripts.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/meeting_notes_graph_neo4j)

## What we're building

We're modeling meeting notes as a **property graph** — the data model behind [Neo4j](https://cocoindex.io/docs/connectors/neo4j/) and most graph databases. Two ideas carry it:

- **Nodes** are the things: a `Person`, a `Meeting`, a `Task`. Each node has a *label* (its type) and a bag of *properties* (`name`, `time`, `note`, …).
- **Relationships** are the connections between things — and they're first-class. A relationship is *typed* and *directed* (`Person -[:ATTENDED]-> Meeting`), and it can carry properties of its own. Here, the `ATTENDED` edge holds an `is_organizer` flag, so "ran the meeting" and "showed up" are the same edge with different data.

That second idea is the whole point: in a property graph, "who is on the hook for what" stops being prose buried in a document and becomes an edge you can traverse in a query. The schema here is small — three node types, three relationship types:

![Property graph schema: Person, Meeting, and Task nodes joined by ATTENDED (carrying is_organizer), DECIDED, and ASSIGNED_TO edges, with sample node and edge properties labeled](https://cocoindex.io/blobs/docs-v1/img/examples/meeting-notes-to-knowledge-graph/schema.png)

- **`Meeting`** nodes — one per meeting section, keyed by a stable integer id derived from `(note_file, date)`.
- **`Person`** nodes — canonical organizers, participants, and task assignees, deduplicated by an embedding + LLM entity-resolution pass (so "Alice", "Alice Chen", and "alice c." collapse to a single node).
- **`Task`** nodes — tasks decided in meetings, keyed by description.
- **`ATTENDED`** edges — `Person → Meeting`, carrying an `is_organizer` flag.
- **`DECIDED`** edges — `Meeting → Task`.
- **`ASSIGNED_TO`** edges — `Person → Task`.

The source is one or more Google Drive folders shared with a service account. The flow watches for changes and keeps the graph up to date incrementally.

## Why CocoIndex for meeting-note graphs

A knowledge graph over living notes is easy to demo and hard to keep correct. Three things make it tricky, and each maps onto a CocoIndex primitive:

- **LLM extraction is expensive.** [Memoization](https://cocoindex.io/docs/advanced_topics/memoization_keys/) caches every extraction by content — edit one note and only that note hits the LLM again. A no-change re-run makes zero LLM calls.
- **People are written inconsistently.** The same person shows up as "Alice", "Alice Chen", and "alice c." across notes. Names are shared *across* files, so no single note's processing can own a `Person` node — and a graph full of near-duplicate people is useless. CocoIndex ships an [`entity_resolution`](https://cocoindex.io/docs/ops/entity_resolution/) op that collapses them, and a [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) model that owns cross-file nodes in one place.
- **Graphs accumulate garbage.** Delete a note, reassign a task, tighten a prompt — a hand-rolled pipeline leaves orphaned nodes and stale edges behind. In CocoIndex, nodes and edges are [target states](https://cocoindex.io/docs/programming_guide/target_state/): you declare what should exist, and the engine inserts, updates, and deletes the difference.

Extraction is [instructor](https://github.com/instructor-ai/instructor) over [LiteLLM](https://docs.litellm.ai/) with your own Pydantic models — swap in any provider, prompt, or schema.

## Pipeline overview

![CocoIndex flow: Google Drive meeting notes through three phases — per-note extraction declaring Meeting and Task nodes, person entity resolution, and a final person-relations pass — landing in a Neo4j property graph](https://cocoindex.io/blobs/docs-v1/img/examples/meeting-notes-to-knowledge-graph/flow-v1.png)

The pipeline runs in three phases:

1. **Per-file extraction** — read each note from Google Drive, split it by Markdown headings into meeting sections, and for each section LLM-extract a structured meeting (date, note, organizer, participants, tasks). `Meeting` and `Task` nodes plus `DECIDED` edges are declared here; raw person names are carried forward.
2. **Person entity resolution** — collect every raw person name across all notes and deduplicate them with embedding similarity + LLM confirmation, producing a canonical-name mapping.
3. **Person-touching relations** — declare the canonical `Person` nodes, then wire up the `ATTENDED` and `ASSIGNED_TO` edges using resolved names.

You [declare the transformation](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python; CocoIndex works out what to insert, update, and delete. Think: **target_state = transformation(source_state)**.
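A minimal sketch of what **target_state = transformation(source_state)** implies: recompute the desired target from the source, then diff it against what already exists. The `transformation` and `reconcile` functions and the dict-based "database" below are hypothetical, chosen only to show the three kinds of change.

```python
def transformation(source: dict[str, str]) -> dict[str, str]:
    """Hypothetical transformation: one target record per source file."""
    return {f"doc:{name}": text.strip().lower() for name, text in source.items()}


def reconcile(current: dict[str, str], desired: dict[str, str]):
    """Split the difference between two target states into three change sets."""
    inserts = {k: v for k, v in desired.items() if k not in current}
    updates = {k: v for k, v in desired.items() if k in current and current[k] != v}
    deletes = [k for k in current if k not in desired]
    return inserts, updates, deletes


target = transformation({"a.md": "Alpha", "b.md": "Beta"})
# Next run: a.md was edited, b.md was deleted, c.md was added.
desired = transformation({"a.md": "Alpha v2", "c.md": "Gamma"})

inserts, updates, deletes = reconcile(target, desired)
print(inserts, updates, deletes)
```

The pipeline code only ever writes the equivalent of `transformation`; the diffing step is the engine's job.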

## Define the graph schema

Nodes and edges are plain dataclasses. Each becomes a Neo4j label (or relationship type), with one field as the primary key:

```python title="main.py"
@dataclass
class Meeting:
    id: int  # stable id generated from (note_file, date)
    note_file: str
    time: datetime.date
    note: str


@dataclass
class Person:
    name: str  # canonical


@dataclass
class Task:
    description: str


@dataclass
class AttendedRel:
    """ATTENDED edge payload — just the organizer flag. The relation's identity
    is auto-derived from its (person, meeting) endpoints, so one edge exists per
    (person, meeting) pair."""

    is_organizer: bool
```

`DECIDED` and `ASSIGNED_TO` carry no payload, so they get no schema at all — the connector derives each edge's identity from its endpoints: one edge per `(meeting, task)` or `(person, task)` pair.

These dataclasses are the bridge between plain Python and the property graph. Each record you declare into a node table becomes a **node**, with its fields as properties; each relation you declare becomes a typed, directed **edge** between two nodes, with its payload (like `is_organizer`) as edge properties:

![A meeting record becoming a Meeting node, and an attendance record becoming an ATTENDED edge from a Person node to a Meeting node carrying is_organizer](https://cocoindex.io/blobs/docs-v1/img/examples/meeting-notes-to-knowledge-graph/records-to-graph.png)

So a `Meeting(note_file=…, time=…, note=…)` becomes a `Meeting` node carrying those fields, and an `ATTENDED` relation from a person to that meeting becomes a `Person -[:ATTENDED {is_organizer: true}]-> Meeting` edge. You declare the records and relations in Python; the [Neo4j connector](https://cocoindex.io/docs/connectors/neo4j/) creates, updates, and removes the corresponding nodes and edges to match.

## Shared resources: the lifespan

The [lifespan](https://cocoindex.io/docs/programming_guide/context/) provides what every step needs — the graph connection factory, two LLM model ids, and the embedder — once at startup, via [context keys](https://cocoindex.io/docs/programming_guide/context/):

```python title="main.py"
KG_DB = coco.ContextKey[neo4j.ConnectionFactory]("kg_db")
LLM_MODEL = coco.ContextKey[str]("llm_model", detect_change=True)
RESOLUTION_LLM_MODEL = coco.ContextKey[str]("resolution_llm_model", detect_change=True)
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder", detect_change=True)


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    builder.provide(
        KG_DB,
        neo4j.ConnectionFactory(
            uri=os.environ.get("NEO4J_URI", "bolt://localhost:7687"),
            auth=(os.environ.get("NEO4J_USER", "neo4j"),
                  os.environ.get("NEO4J_PASSWORD", "cocoindex")),
            database=os.environ.get("NEO4J_DATABASE", "neo4j"),
        ),
    )
    builder.provide(LLM_MODEL, os.environ.get("LLM_MODEL", "openai/gpt-5.4"))
    builder.provide(RESOLUTION_LLM_MODEL, os.environ.get("RESOLUTION_LLM_MODEL", "openai/gpt-5-mini"))
    builder.provide(EMBEDDER, SentenceTransformerEmbedder("Snowflake/snowflake-arctic-embed-xs"))
    yield
```

Two models, on purpose: a stronger model (`LLM_MODEL`) does the structured extraction; a smaller, cheaper one (`RESOLUTION_LLM_MODEL`) confirms entity-resolution pairs. Both are [LiteLLM provider strings](https://docs.litellm.ai/docs/providers), so `LLM_MODEL=ollama/llama3.2` runs extraction locally with no API key.

Note `detect_change=True` on the model ids and the embedder: they participate in change detection. Point `LLM_MODEL` at a different model and CocoIndex knows every memoized extraction is stale — the corpus re-extracts on the next run, with no cache to clear by hand.
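One way to picture "participating in change detection" is a cache key that folds in every tracked context value alongside the input. The sketch below is an assumption about the general idea, not CocoIndex's actual key derivation; `context_aware_key` and the `tracked` dict are hypothetical.

```python
import hashlib


def context_aware_key(section_text: str, tracked: dict[str, str]) -> str:
    """Hypothetical key: input text plus every change-tracked context value."""
    h = hashlib.sha256(section_text.encode("utf-8"))
    for name in sorted(tracked):  # sorted so the key does not depend on dict order
        h.update(f"\x1f{name}={tracked[name]}".encode("utf-8"))
    return h.hexdigest()


text = "Weekly planning sync"  # sample input, same on every call
key_a = context_aware_key(text, {"llm_model": "openai/gpt-5.4"})
key_b = context_aware_key(text, {"llm_model": "openai/gpt-5.4"})
key_c = context_aware_key(text, {"llm_model": "ollama/llama3.2"})

print(key_a == key_b, key_a == key_c)  # prints True False
```

With the model id in the key, switching models makes every old entry unreachable without anyone clearing a cache.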

## Split notes into meetings

A single note file often holds several meetings, one per Markdown heading. We split on `#` / `##` headings preceded by a blank line. The split consumes the `#` markers but leaves each heading's text at the top of its section:

```python title="main.py"
_HEADING_RE = re.compile(r"\n\n##?\s+")


def _split_meetings(text: str) -> list[str]:
    parts = _HEADING_RE.split("\n\n" + text)
    return [p.strip() for p in parts if p.strip()]
```

## LLM extraction

Extraction is typed end to end: Pydantic models describe what we want, instructor enforces them, and the field descriptions double as instructions to the model.

```python title="main.py"
class ExtractedPerson(pydantic.BaseModel):
    name: str = pydantic.Field(description="Full name of the person, as written in the note.")


class ExtractedTask(pydantic.BaseModel):
    description: str = pydantic.Field(description="Concise, standalone description of the task.")
    assigned_to: list[ExtractedPerson] = pydantic.Field(default_factory=list)


class ExtractedMeeting(pydantic.BaseModel):
    time: datetime.date = pydantic.Field(description="Date of the meeting (YYYY-MM-DD).")
    note: str = pydantic.Field(description="A brief summary of the meeting section.")
    organizer: ExtractedPerson = pydantic.Field(description="The person who organized or led the meeting.")
    participants: list[ExtractedPerson] = pydantic.Field(default_factory=list)
    tasks: list[ExtractedTask] = pydantic.Field(default_factory=list)
```

One memoized function turns a Markdown section into a typed `ExtractedMeeting`:

```python title="main.py"
@coco.fn(memo=True)
async def extract_meeting(section_text: str) -> ExtractedMeeting:
    client = instructor.from_litellm(litellm.acompletion, mode=instructor.Mode.JSON)
    result = await client.chat.completions.create(
        model=coco.use_context(LLM_MODEL),
        response_model=ExtractedMeeting,
        messages=[
            {"role": "system", "content": EXTRACT_PROMPT},
            {"role": "user", "content": section_text},
        ],
    )
    return ExtractedMeeting.model_validate(result.model_dump())
```

[`@coco.fn(memo=True)`](https://cocoindex.io/docs/programming_guide/function/) is what makes iteration affordable: the result is cached, keyed by the section text (and the function's own code). Unchanged meeting sections never hit the LLM again.

## Phase 1: per-file extraction

![Phase 1 — one process_file component per note: split into meetings, memoized LLM extraction, declare Meeting and Task nodes into Neo4j, and carry MeetingExtraction forward to phases 2 and 3](https://cocoindex.io/blobs/docs-v1/img/examples/meeting-notes-to-knowledge-graph/stage-phase1.png)

`process_file` runs once per note. For each meeting section it extracts the structured meeting, declares the `Meeting` node, declares a `Task` node + `DECIDED` edge per task, and returns the raw (unresolved) person names for phase 2:

```python title="main.py"
@coco.fn(memo=True)
async def process_file(
    file: google_drive.DriveFile,
    meeting_table: neo4j.TableTarget[Meeting],
    task_table: neo4j.TableTarget[Task],
    decided_rel: neo4j.RelationTarget[Any],
) -> list[MeetingExtraction]:
    text = await file.read_text()
    note_file = file.file_path.path.as_posix()
    id_generator = IdGenerator()
    extractions = []
    for section in _split_meetings(text):
        extracted = await extract_meeting(section)
        meeting_id = await id_generator.next_id(extracted.time)

        meeting_table.declare_record(
            row=Meeting(id=meeting_id, note_file=note_file,
                        time=extracted.time, note=extracted.note)
        )
        for task in extracted.tasks:
            task_table.declare_record(row=Task(description=task.description))
            decided_rel.declare_relation(from_id=meeting_id, to_id=task.description)

        extractions.append(MeetingExtraction(
            meeting_id=meeting_id,
            organizer=extracted.organizer.name,
            participants=[p.name for p in extracted.participants],
            task_assignees=[(t.description, [a.name for a in t.assigned_to])
                            for t in extracted.tasks],
        ))
    return extractions
```

The `Meeting` id comes from CocoIndex's [`IdGenerator`](https://cocoindex.io/docs/common_resources/id_generation/): `next_id(content)` folds the content in and is stable, so the same meeting always maps to the same node — re-running never duplicates.

Each note runs as its own [processing component](https://cocoindex.io/docs/programming_guide/processing_component/), mounted in `app_main` and keyed by the file path:

```python title="main.py"
file_coros = []
async for path_key, file in source.items():
    file_coros.append(
        coco.use_mount(
            coco.component_subpath("file", path_key),
            process_file, file, meeting_table, task_table, decided_rel,
        )
    )
per_file = list(await asyncio.gather(*file_coros))
all_meetings = [m for ms in per_file for m in ms]
```

Why a component per file? **Ownership.** The component at `("file", path_key)` owns that note's `Meeting` and `Task` nodes — if the file disappears, so does the component, and CocoIndex deletes its nodes (and the `DECIDED` edges) automatically. [`coco.use_mount`](https://cocoindex.io/docs/programming_guide/app/) returns each file's extractions, and `asyncio.gather` runs all files concurrently. `Person` nodes are deliberately *not* declared here — people are shared across notes, so they wait for phases 2 and 3.

## Phase 2: resolve people

![Phase 2 — a single resolve_persons pass: the set of raw person names from every note is deduplicated by embedding similarity plus LLM confirmation into a canonical-name map](https://cocoindex.io/blobs/docs-v1/img/examples/meeting-notes-to-knowledge-graph/stage-phase2.png)

This is the step that separates a useful graph from a messy one. We have a pile of raw names from every note — "Alice", "Alice Chen", "alice c." — and we want one `Person` node per actual person. CocoIndex's [`entity_resolution`](https://cocoindex.io/docs/ops/entity_resolution/) op embeds each name, finds near-matches by vector similarity, and asks an LLM to confirm *only* the close pairs — cheap embeddings filter the field, the expensive model runs only where it's genuinely ambiguous:

```python title="main.py"
@coco.fn(memo=True)
async def _resolve_persons(raw_persons: set[str]) -> ResolvedEntities:
    return await resolve_entities(
        entities=raw_persons,
        embedder=coco.use_context(EMBEDDER),                       # snowflake-arctic-embed-xs
        resolve_pair=LlmPairResolver(model=coco.use_context(RESOLUTION_LLM_MODEL)),
    )
```

It runs as its own component over the deduplicated set of every name seen in phase 1:

```python title="main.py"
raw_persons: set[str] = set()
for m in all_meetings:
    raw_persons.add(m.organizer)
    raw_persons.update(m.participants)
    for _desc, assignees in m.task_assignees:
        raw_persons.update(assignees)

persons = await coco.use_mount(
    coco.component_subpath("resolve_persons"), _resolve_persons, raw_persons,
)
```

Because it's a memoized component keyed by the name set, resolution only re-runs when the set of raw names actually changes — and even then it reuses cached embeddings and only makes fresh LLM calls for genuinely new pairs. The result, `persons`, maps any raw name to its canonical form via `persons.canonical_of(name)`.
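The filter-then-confirm shape of entity resolution can be sketched in plain Python. Everything here is a stand-in: `similarity` uses `difflib` string ratios instead of embeddings, `confirm_same` is a trivial rule in place of an LLM call, and `resolve` is a hypothetical helper, not the `resolve_entities` op.

```python
from difflib import SequenceMatcher
from itertools import combinations

stats = {"confirm_calls": 0}  # how often the "expensive" check runs


def similarity(a: str, b: str) -> float:
    """Cheap stand-in for embedding similarity (string ratio, not vectors)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def confirm_same(a: str, b: str) -> bool:
    """Stand-in for LLM confirmation: do the first name tokens agree?"""
    stats["confirm_calls"] += 1
    return a.lower().split()[0] == b.lower().split()[0]


def resolve(names: set[str], threshold: float = 0.6) -> dict[str, str]:
    parent = {n: n for n in names}  # union-find over names

    def find(n: str) -> str:
        while parent[n] != n:
            n = parent[n]
        return n

    for a, b in combinations(sorted(names), 2):
        # Cheap filter first; the costly check only runs on close pairs.
        if similarity(a, b) >= threshold and confirm_same(a, b):
            parent[find(a)] = find(b)

    groups: dict[str, list[str]] = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)

    mapping: dict[str, str] = {}
    for members in groups.values():
        canonical = max(sorted(members), key=len)  # longest spelling wins
        for n in members:
            mapping[n] = canonical
    return mapping


mapping = resolve({"Alice", "Alice Chen", "alice c.", "Bob"})
print(mapping["alice c."], mapping["Bob"], stats["confirm_calls"])
```

Of the six possible pairs, only the three Alice pairs clear the similarity threshold and reach `confirm_same`, which is the cost profile the phase is built around.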

## Phase 3: people, attendance, and assignments

![Phase 3 — a single create_person_relations pass: MeetingExtraction from phase 1 and the resolved names from phase 2 are combined to declare Person nodes plus ATTENDED and ASSIGNED_TO edges into Neo4j](https://cocoindex.io/blobs/docs-v1/img/examples/meeting-notes-to-knowledge-graph/stage-phase3.png)

With the canonical mapping in hand, one component declares the `Person` nodes and the two person-touching edge types — the cross-file part of the graph that no single note could own:

```python title="main.py"
@coco.fn
async def create_person_relations(
    meetings: list[MeetingExtraction],
    persons: ResolvedEntities,
    person_table: neo4j.TableTarget[Person],
    attended_rel: neo4j.RelationTarget[Any],
    assigned_rel: neo4j.RelationTarget[Any],
) -> None:
    for canonical_name in persons.canonicals():
        person_table.declare_record(row=Person(name=canonical_name))

    for m in meetings:
        # ATTENDED — organizer flag wins on collision, so a person listed as both
        # organizer and participant gets a single edge with is_organizer=true.
        attendees: dict[str, bool] = {persons.canonical_of(m.organizer): True}
        for p in m.participants:
            attendees.setdefault(persons.canonical_of(p), False)
        for canonical, is_organizer in attendees.items():
            attended_rel.declare_relation(
                from_id=canonical, to_id=m.meeting_id,
                record=AttendedRel(is_organizer=is_organizer),
            )

        # ASSIGNED_TO — dedup per (canonical person, task).
        for task_desc, assignees in m.task_assignees:
            for canonical in {persons.canonical_of(a) for a in assignees}:
                assigned_rel.declare_relation(from_id=canonical, to_id=task_desc)
```

Two details carry the correctness here. Resolution happens *before* aggregation, so two raw names that resolve to the same person collapse into one `ATTENDED` edge instead of two. And because the canonical names are the primary keys, re-asserting the same attendance or assignment from another note is a no-op, not a duplicate.
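Both details can be checked in isolation with a small sketch. The `canonical` mapping and `declare_attendance` helper are hypothetical, and a dict keyed by `(person, meeting_id)` stands in for the declared `ATTENDED` edges.

```python
# Hypothetical phase 2 output: raw name -> canonical name.
canonical = {"Alice": "Alice Chen", "alice c.": "Alice Chen", "Bob": "Bob"}


def declare_attendance(
    attended: dict[tuple[str, int], bool],
    meeting_id: int,
    organizer: str,
    participants: list[str],
) -> None:
    # Resolve names first, then aggregate, so raw spellings collapse together.
    attendees = {canonical[organizer]: True}
    for p in participants:
        attendees.setdefault(canonical[p], False)  # never downgrades the organizer
    for person, is_organizer in attendees.items():
        attended[(person, meeting_id)] = is_organizer  # keyed by endpoints


attended: dict[tuple[str, int], bool] = {}
declare_attendance(attended, 1, "alice c.", ["Alice", "Bob"])
declare_attendance(attended, 1, "alice c.", ["Alice", "Bob"])  # re-run: no new edges

print(attended)
```

"alice c." and "Alice" collapse into one organizer edge for Alice Chen, and declaring the same attendance twice still leaves exactly two edges.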

## Wire it up: app_main

`app_main` mounts the targets and runs the three phases. Node tables come first, because relation targets are declared *between* two node tables — that's how the connector knows each edge's endpoint labels and keys:

```python title="main.py"
@coco.fn
async def app_main() -> None:
    meeting_table = await neo4j.mount_table_target(
        KG_DB, "Meeting",
        await neo4j.TableSchema.from_class(Meeting, primary_key="id"), primary_key="id")
    person_table = await neo4j.mount_table_target(
        KG_DB, "Person",
        await neo4j.TableSchema.from_class(Person, primary_key="name"), primary_key="name")
    task_table = await neo4j.mount_table_target(
        KG_DB, "Task",
        await neo4j.TableSchema.from_class(Task, primary_key="description"), primary_key="description")

    # ATTENDED carries is_organizer; DECIDED and ASSIGNED_TO carry no payload, so
    # they mount without a schema and the connector derives PKs from endpoints.
    attended_rel = await neo4j.mount_relation_target(KG_DB, "ATTENDED", person_table, meeting_table)
    decided_rel = await neo4j.mount_relation_target(KG_DB, "DECIDED", meeting_table, task_table)
    assigned_rel = await neo4j.mount_relation_target(KG_DB, "ASSIGNED_TO", person_table, task_table)

    source = google_drive.GoogleDriveSource(
        service_account_credential_path=os.environ["GOOGLE_SERVICE_ACCOUNT_CREDENTIAL"],
        root_folder_ids=[f.strip() for f in os.environ["GOOGLE_DRIVE_ROOT_FOLDER_IDS"].split(",") if f.strip()],
    )

    # Phase 1: per-file fan-out (above) → all_meetings
    # Phase 2: persons = resolve_persons(all raw names)
    # Phase 3: declare Person nodes + person edges
    await coco.mount(
        coco.component_subpath("person_relations"),
        create_person_relations, all_meetings, persons, person_table, attended_rel, assigned_rel,
    )


app = coco.App(coco.AppConfig(name="MeetingNotesGraphNeo4j"), app_main)
```

That's the whole pipeline — one file, ~250 lines.

## Run the pipeline

You'll need a Neo4j instance, an LLM key, and a Google Drive service account. Start Neo4j with Docker:

```sh
docker run -d \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/cocoindex \
  --name cocoindex-neo4j \
  neo4j:5.26-community
```

The browser UI is at <http://localhost:7474> (log in with `neo4j` / `cocoindex`).

Set up the environment: copy `.env.example` to `.env` and fill in the values below, or export them in your shell:

```sh
export OPENAI_API_KEY="your-openai-api-key"          # or set LLM_MODEL=ollama/llama3.2
export GOOGLE_SERVICE_ACCOUNT_CREDENTIAL=/absolute/path/to/service_account.json
export GOOGLE_DRIVE_ROOT_FOLDER_IDS=folderId1,folderId2
export NEO4J_URI=bolt://localhost:7687
export NEO4J_USER=neo4j
export NEO4J_PASSWORD=cocoindex
export LLM_MODEL=openai/gpt-5.4
export RESOLUTION_LLM_MODEL=openai/gpt-5-mini         # smaller model for entity resolution
```

The Google Drive source reads Markdown notes from one or more folders shared with the service account — see [Setting up a service account](https://cocoindex.io/docs/connectors/google_drive/#setting-up-a-service-account) for the folder IDs and sharing steps. Install and build the graph:

```sh
uv pip install -e .
cocoindex update main
```

## Explore the graph

Open [Neo4j Browser](http://localhost:7474) (`neo4j` / `cocoindex`) and ask the graph questions:

![The resulting graph in Neo4j Browser: Person, Meeting, and Task nodes joined by ATTENDED, DECIDED, and ASSIGNED_TO edges](https://cocoindex.io/blobs/docs-v1/img/examples/meeting-notes-to-knowledge-graph/neo4j-browser.png)

```cypher
// Everything
MATCH p=()-->() RETURN p LIMIT 100

// Who attended which meetings (including organizer; one edge per attendee)
MATCH (p:Person)-[:ATTENDED]->(m:Meeting)
RETURN p.name, m.note_file, m.time

// Tasks decided in meetings
MATCH (m:Meeting)-[:DECIDED]->(t:Task)
RETURN m.note_file, m.time, t.description

// Everything one person is on the hook for
MATCH (p:Person {name: "Alice Chen"})-[:ASSIGNED_TO]->(t:Task)
RETURN t.description

// Meetings someone organized
MATCH (p:Person)-[r:ATTENDED {is_organizer: true}]->(m:Meeting)
RETURN p.name, m.note_file, m.time
```

## Incremental updates

This is where the declarative model pays for itself. You never compute a diff or write cleanup logic — change something, re-run `cocoindex update main`, and CocoIndex works out the minimum set of LLM calls and graph writes.

**Data changes.**

- **Edit one note** — only that note's component re-runs and re-extracts its sections. Its `Meeting` / `Task` nodes are diffed; if it introduced or dropped a person, phase 2 reruns and phase 3 reconciles the edges. Every other note is served from the memo cache.
- **Add a note** — one new component, a handful of extractions, plus the resolution and graph diff.
- **Delete a note** — its component disappears, so its `Meeting` and `Task` nodes and `DECIDED` edges are cleaned up automatically; people only that note introduced fall out of the canonical set on the next resolution pass.
- **Nothing changed** — the run completes in a fraction of a second with zero LLM calls.

**Logic changes** are reconciled the same way:

- **Tighten the extraction prompt** — the function's code changed, so all sections re-extract; the graph then diffs against what's in the database and applies only the difference.
- **Swap the LLM** — `LLM_MODEL` has `detect_change=True`, so changing the env var invalidates every memoized extraction. No cache to clear, no manual rebuild.
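
The invalidation rules above amount to keying each memoized extraction on everything it depends on. Here is a minimal sketch of that idea, assuming a fingerprint built from the note text, the function's source, and the model name. `memo_key` is a hypothetical helper for illustration, not CocoIndex's internal fingerprinting, which may work differently:

```python
import hashlib


def memo_key(note_text: str, fn_source: str, llm_model: str) -> str:
    """Hypothetical fingerprint of everything a memoized extraction depends on."""
    h = hashlib.sha256()
    for part in (note_text, fn_source, llm_model):
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


baseline = memo_key("## Standup\n- Alice: ship v2", "def extract(): ...", "openai/gpt-5-mini")
```

Changing any one of the three inputs yields a different key, so the cached result is no longer found and the extraction re-runs. Identical inputs reproduce the key, which is what lets an unchanged note be served from the cache.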

## Run it

The full, runnable example is in the CocoIndex repo: [examples/meeting_notes_graph_neo4j](https://github.com/cocoindex-io/cocoindex/tree/main/examples/meeting_notes_graph_neo4j).

This pipeline is the [docs knowledge graph](https://cocoindex.io/docs/examples/docs-to-knowledge-graph) plus an entity-resolution pass — the natural next step when the LLM names the same thing two ways. For a bigger end-to-end build (transcription, multi-entity schemas, polymorphic edges), see [Turn Podcasts into a Knowledge Graph](https://cocoindex.io/docs/examples/podcast-to-knowledge-graph).

Got a shared drive full of meeting notes, standup logs, or design docs you want to turn into a graph? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: CSV → Kafka

Source: https://cocoindex.io/docs/examples/csv-to-kafka/

![CSV in, Kafka out — row by row, live, with CocoIndex](https://cocoindex.io/blobs/docs-v1/img/examples/csv-to-kafka/cover.png)

We'll take a folder of CSV files and turn it into a live [Kafka](https://kafka.apache.org/) stream — each row published as a JSON message, keyed by its primary key. Edit a cell and, within a second, exactly one message for that one row lands on the topic. Add a row, get one new message. Delete a file, and every row from it is tombstoned.

The whole pipeline is ordinary `async` Python, and the Kafka topic is just a [target](https://cocoindex.io/docs/connectors/kafka/) you declare — the same way you'd declare a Postgres table or a vector index. CocoIndex's Rust engine does the [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/) underneath: it tracks what each row last looked like and produces a message only for rows that *actually changed* — no producer loop, no dedup bookkeeping, no "did I already send this?" logic.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/csv_to_kafka)

## Flow overview

![CocoIndex CSV → Kafka flow: watch a folder of CSV files, run one process_csv component per file that turns each row into a JSON message, and declare it as a Kafka topic target state](https://cocoindex.io/blobs/docs-v1/img/examples/csv-to-kafka/flow-v1.png)

From a high level, these are the steps:

1. Watch a local directory of CSV files (live).
2. For each file, parse rows with `csv.DictReader` and turn each row into a JSON value keyed by its first column.
3. Declare each row as a [target state](https://cocoindex.io/docs/programming_guide/target_state/) on a Kafka topic — CocoIndex produces the upserts and deletes.

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

> **Why CSV?** It's the format that shows up everywhere and gets respect nowhere — BI exports, vendor dumps, spreadsheets parked in a shared drive. CSV files look structured but live like unstructured assets: dropped into a folder, edited at random, with no notifications and no schema contract. Turning a directory of them into a clean, row-keyed, diff-only Kafka stream is the same pattern that carries over to PDFs, codebases, and wikis — CSV just keeps the parser out of the way.

## Setup

- A running Kafka broker. Any broker the [`confluent_kafka`](https://github.com/confluentinc/confluent-kafka-python) client can reach works — a local `localhost:9092`, or a managed one like [StreamNative](https://streamnative.io/) with SASL. If you don't have one, a single-container [Redpanda](https://redpanda.com/) (Kafka-API compatible) is the quickest local broker:

  ```sh
  docker run -d --name redpanda -p 9092:9092 redpandadata/redpanda:latest \
    redpanda start --mode dev-container --smp 1 \
    --kafka-addr PLAINTEXT://0.0.0.0:9092 --advertise-kafka-addr PLAINTEXT://localhost:9092

  # CocoIndex never creates topics — create the one it produces into:
  docker exec redpanda rpk topic create cocoindex-csv-rows
  ```

- Install CocoIndex with the Kafka extra:

  ```sh
  pip install -U "cocoindex[kafka]"
  ```

- A `data/` folder with a couple of CSV files. The example ships these:

  ```csv
  # data/products.csv
  sku,name,category,price
  SKU001,Wireless Mouse,Electronics,29.99
  SKU002,Mechanical Keyboard,Electronics,89.99
  SKU003,USB-C Hub,Accessories,45.00
  ```

  The first column (`sku`) is the row's primary key — it becomes the Kafka message key.
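
To make the key/value shape concrete, here is a stdlib-only sketch of what the sample file turns into. `rows_to_messages` is a hypothetical helper that mirrors the parsing step from the flow overview without any CocoIndex or Kafka calls:

```python
import csv
import io
import json

SAMPLE_CSV = """\
sku,name,category,price
SKU001,Wireless Mouse,Electronics,29.99
SKU002,Mechanical Keyboard,Electronics,89.99
SKU003,USB-C Hub,Accessories,45.00
"""


def rows_to_messages(text: str) -> list[tuple[str, str]]:
    """Return one (key, JSON value) pair per CSV row, keyed by the first column."""
    reader = csv.DictReader(io.StringIO(text))
    headers = reader.fieldnames
    if not headers:
        return []  # empty file: nothing to publish
    first_col = headers[0]
    return [(row[first_col], json.dumps(row)) for row in reader]


messages = rows_to_messages(SAMPLE_CSV)
for key, value in messages:
    print(key, value)
```

Note that `csv` does no type inference, so `price` stays a string (`"29.99"`) in the JSON value. Consumers that need numbers have to convert them.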

## Shared resources: the Kafka producer

The Kafka producer is created once at app startup in a [`lifespan`](https://cocoindex.io/docs/programming_guide/context/) hook and stashed in a [`ContextKey`](https://cocoindex.io/docs/programming_guide/context/), so the rest of the pipeline can grab it without threading it through every call:

```python title="main.py"
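import csv
import io
import json
from collections.abc import AsyncIterator

import cocoindex as coco
from cocoindex.connectors import kafka, localfs
from confluent_kafka.aio import AIOProducer

# KAFKA_BOOTSTRAP_SERVERS, KAFKA_SASL_USERNAME, KAFKA_SASL_PASSWORD, and KAFKA_TOPIC
# are module-level settings loaded from `.env` (definitions omitted from this excerpt).
KAFKA_PRODUCER = coco.ContextKey[AIOProducer]("kafka_producer")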


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    config: dict[str, str] = {"bootstrap.servers": KAFKA_BOOTSTRAP_SERVERS}
    if KAFKA_SASL_USERNAME:
        config.update({
            "sasl.mechanism": "PLAIN",
            "security.protocol": "SASL_SSL",
            "sasl.username": KAFKA_SASL_USERNAME,
            "sasl.password": KAFKA_SASL_PASSWORD,
        })
    producer = AIOProducer(config)
    builder.provide(KAFKA_PRODUCER, producer)
    yield
```

The SASL block is what a managed broker (StreamNative or similar) wants. For a local broker you can drop it and just point `bootstrap.servers` at `localhost:9092`. The `ContextKey` also does double duty later: CocoIndex's state store identifies the topic by the context key the producer is registered under plus the topic name, rather than by the connection settings, so rotating the SASL password or swapping the broker endpoint doesn't make it re-broadcast every row.

## Process a file

![One process_csv component per CSV file, fanned out with mount_each: each file's rows become (key, value) target states on the Kafka topic](https://cocoindex.io/blobs/docs-v1/img/examples/csv-to-kafka/stage-process-csv.png)

`process_csv` runs once per file. It reads the text, parses rows with `csv.DictReader` (the header row becomes the keys), and declares each row as a target state — key from the first column, value the JSON-encoded row:

```python title="main.py"
@coco.fn(memo=True)
async def process_csv(file: FileLike, topic_target: kafka.KafkaTopicTarget) -> None:
    text = await file.read_text()
    reader = csv.DictReader(io.StringIO(text))

    headers = reader.fieldnames
    if not headers:
        return
    first_col = headers[0]

    for row in reader:
        key_value = row.get(first_col)
        if key_value is not None:
            value = json.dumps(row)
            topic_target.declare_target_state(key=key_value, value=value)
```

[`@coco.fn(memo=True)`](https://cocoindex.io/docs/programming_guide/function/) makes the per-file work [incremental](https://cocoindex.io/docs/advanced_topics/memoization_keys/): if a file's contents and this function's code are both unchanged, `process_csv` doesn't even run. Each file runs as its own [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) (mounted below), so the engine tracks each file's rows independently — and when a file disappears, its rows are cleaned off the topic automatically.

## Declare states, not messages

The one line worth pausing on is `declare_target_state` — deliberately *not* `send_message()` or `produce()`.

```python
topic_target.declare_target_state(key=key_value, value=value)
```

CocoIndex is [state-driven](https://cocoindex.io/docs/programming_guide/core_concepts/): like a spreadsheet cell or a SQL materialized view, you describe what the target *should be* as a function of the source, and the engine figures out the transitions. You don't compute deltas, and you don't write separate insert / update / delete code paths.

![Declared target states above the line and the Kafka messages they produce below: editing one CSV cell, adding a row, and removing a row yield exactly the upsert and delete messages needed — the unchanged row produces nothing](https://cocoindex.io/blobs/docs-v1/img/examples/csv-to-kafka/state-vs-messages.svg)

Kafka makes this vivid because its wire model is the opposite of state: a topic is a *log of events*, not a snapshot. CocoIndex owns the gap. When you call `declare_target_state(key=k, value=v)`:

- **`k` is new, or `v` changed** → it produces an **upsert** message `(k, v)`.
- **`k` was declared before but isn't this time** → it produces a **delete** message: by default `(k, None)`, which Kafka treats as a tombstone, or the value returned by your `deletion_value_fn` if you supplied one.
- **`k` was declared with the same `v`** → **nothing is sent.** No message, no broker round-trip, no consumer wakeup.

Messages are derived from state transitions; you only ever talk about states. It's the same shape as the [Postgres target](https://cocoindex.io/docs/connectors/postgres/) (`declare_target_state` → INSERT / UPDATE / DELETE) — the wire ops differ, the API doesn't, because the semantics are the same. The payoff: one `process_csv` is correct on the first run, every subsequent run, and after a crash-restart — there's no separate "initial load" versus "incremental update" path.
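
The three rules above fit in a few lines. This is a conceptual sketch only: `reconcile` is a hypothetical function over plain dicts, not the engine's implementation, and it assumes the default delete message `(k, None)`:

```python
from typing import Optional

Message = tuple[str, Optional[str]]


def reconcile(previous: dict[str, str], declared: dict[str, str]) -> list[Message]:
    """Derive the messages that move the topic from `previous` to `declared`."""
    messages: list[Message] = []
    for key, value in declared.items():
        if key not in previous or previous[key] != value:
            messages.append((key, value))  # new key or changed value: upsert
        # same key, same value: nothing is sent
    for key in previous:
        if key not in declared:
            messages.append((key, None))  # no longer declared: delete
    return messages


previous = {"SKU001": "29.99", "SKU002": "89.99", "SKU003": "45.00"}
declared = {"SKU001": "29.99", "SKU002": "79.99", "SKU004": "12.50"}
messages = reconcile(previous, declared)
```

Here `SKU001` is unchanged and produces nothing, `SKU002` and `SKU004` produce upserts, and `SKU003` produces a delete. Running `reconcile` again with the same declared state on both sides returns an empty list, which is the "nothing changed" case.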

## Define the main function

`app_main` wires the source to the target. It mounts the Kafka topic, walks `./data` for CSV files as a live source, and mounts one component per file:

```python title="main.py"
@coco.fn
async def app_main() -> None:
    topic_target = await kafka.mount_kafka_topic_target(KAFKA_PRODUCER, KAFKA_TOPIC)

    files = localfs.walk_dir(
        localfs.FilePath(path="./data"),
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.csv"]),
        live=True,  # watch for changes; pass -L to `cocoindex update` to run live
    )
    await coco.mount_each(process_csv, files.items(), topic_target)


app = coco.App(coco.AppConfig(name="CsvToKafka"), app_main)
```

Two things to notice:

1. `mount_kafka_topic_target(...)` resolves the producer from the context key and hands back a target handle. The topic itself is **user-managed** — CocoIndex never creates or deletes topics, it just produces into one you already own.
2. `localfs.walk_dir(..., live=True)` makes the [filesystem source](https://cocoindex.io/docs/connectors/localfs/) a [live source](https://cocoindex.io/docs/programming_guide/live_mode/): it scans once, then keeps watching `./data` (via [`watchfiles`](https://github.com/samuelcolvin/watchfiles)) and pushes incremental updates downstream. [`mount_each`](https://cocoindex.io/docs/programming_guide/app/) runs one `process_csv` component per file.

That's the whole pipeline — one file, ~60 lines.

## Run the pipeline

Copy `.env.example` to `.env` and fill in your Kafka bootstrap (and SASL creds if your broker needs them), then run the [`cocoindex` CLI](https://cocoindex.io/docs/cli/). Choose catch-up (scan, sync, exit) or live (catch up, then keep watching):

```sh
# Catch-up run: reconcile the topic up to now, then exit
cocoindex update main.py

# Live run: catch up, then keep watching ./data and produce on every change
cocoindex update -L main.py
```

Live mode differs from catch-up by **one keyword argument and one flag**: `live=True` on `walk_dir`, and `-L` on the CLI. `process_csv` and the Kafka target don't change. The reconciliation logic is identical, and the flag only controls whether the app scans once and exits or keeps watching. There's no separate "streaming" code path to maintain.

## Looking at the topic

Here's the `cocoindex-csv-rows` topic after a run, in StreamNative's hosted console (any Kafka consumer shows the same thing):

![Messages on the Kafka topic after running the CSV → Kafka pipeline: keys are the row primary keys (SKU001, SKU002, …) and values are the JSON-encoded rows](https://cocoindex.io/blobs/docs-v1/img/examples/csv-to-kafka/streamnative-topic.png)

Keys are the row's first column (`SKU001`, `SKU002`, …); values are the JSON-encoded rows. Edit a CSV locally and a new message with the *same key* appears — so log-compacted topics and key-based consumers always see the current state. Each key hashes to a partition via Kafka's default partitioner, exactly as it would with a hand-rolled producer.
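
Why same-key messages matter for consumers can be sketched as a replay: fold the log into a dict where later values win and a `None` value removes the key, and you get the current state. `materialize` is a hypothetical helper, and the log below is made-up sample data:

```python
from typing import Iterable, Optional


def materialize(log: Iterable[tuple[str, Optional[str]]]) -> dict[str, str]:
    """Replay (key, value) messages in order and return the resulting state."""
    state: dict[str, str] = {}
    for key, value in log:
        if value is None:
            state.pop(key, None)  # tombstone: forget the key
        else:
            state[key] = value  # upsert: latest value wins
    return state


log = [
    ("SKU001", '{"price": "29.99"}'),
    ("SKU002", '{"price": "89.99"}'),
    ("SKU001", '{"price": "24.99"}'),  # edit: same key, new value
    ("SKU002", None),                  # delete
]
current = materialize(log)
```

After the replay only `SKU001` remains, with its latest value.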

## Incremental updates

This is where the declarative model pays for itself. You never compute a diff or write produce logic — change something, and CocoIndex works out the minimum set of messages to bring the topic in line. It keeps an [internal state store](https://cocoindex.io/docs/advanced_topics/internal_storage/) remembering the last value sent for every key, and that store survives restarts, so stopping and restarting never re-broadcasts unchanged rows.

- **Edit one cell** — exactly one upsert message, for that one row. Every other row is silent.
- **Add a row** — one new upsert message.
- **Delete a row** — one delete message for its key.
- **Add a CSV file** — `process_csv` runs once for it and publishes its rows.
- **Delete a CSV file** — every row from it gets a delete message.
- **Nothing changed** — a re-run produces zero messages.

A catch-up run (`cocoindex update main.py`) does this once and exits; live mode (`cocoindex update -L main.py`) keeps watching and applies each change with sub-second latency.
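
The "survives restarts" property can be illustrated with a toy store that persists the last declared values to disk. This is a sketch under stated assumptions: `LastSentStore` and its JSON file are hypothetical stand-ins, and CocoIndex's internal state store is not implemented this way:

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class LastSentStore:
    """Hypothetical file-backed record of the last value sent per key."""

    def __init__(self, path: Path) -> None:
        self._path = path

    def sync(self, declared: dict[str, str]) -> list[tuple[str, Optional[str]]]:
        previous = json.loads(self._path.read_text()) if self._path.exists() else {}
        messages: list[tuple[str, Optional[str]]] = [
            (k, v) for k, v in declared.items() if previous.get(k) != v
        ]
        messages += [(k, None) for k in previous if k not in declared]
        self._path.write_text(json.dumps(declared))  # persist for the next run
        return messages


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "state.json"
    rows = {"SKU001": "29.99", "SKU002": "89.99"}
    first_run = LastSentStore(path).sync(rows)
    after_restart = LastSentStore(path).sync(rows)  # new instance, same file
    after_delete = LastSentStore(path).sync({"SKU001": "29.99"})
```

The first run emits two upserts, the simulated restart emits nothing because the persisted state already matches, and dropping `SKU002` emits a single delete.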

## Run it

The full, runnable example is in the CocoIndex repo: [examples/csv_to_kafka](https://github.com/cocoindex-io/cocoindex/tree/main/examples/csv_to_kafka). The natural next step is the consumer side — [kafka_to_lancedb](https://github.com/cocoindex-io/cocoindex/tree/main/examples/kafka_to_lancedb) reads JSON messages off a topic and dispatches them into LanceDB tables, so the same declarative flow that *produces* changes can *consume* them too.

Got a folder of exports, a vendor dump, or any tabular data you want on a topic? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: Semantic Search over PDFs

Source: https://cocoindex.io/docs/examples/pdf-embedding/

![Semantic Search over PDFs with CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/pdf-embedding/cover.png)

We'll take a folder of PDFs — papers, RFCs, manuals, contracts — and turn it into a [vector index](https://github.com/pgvector/pgvector) you can search in plain English. The trick PDFs add over plain text: they have to be *parsed* first. We use [docling](https://github.com/docling-project/docling) to convert each PDF to clean Markdown, then chunk, embed, and store the vectors in Postgres.

The whole pipeline is ordinary `async` Python and your own types. The heavy lifting — [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/), change tracking, managed targets — runs in a Rust engine underneath, so only changed PDFs get re-parsed and re-embedded. The one genuinely expensive step (PDF parsing) runs on a [GPU runner](https://cocoindex.io/docs/programming_guide/function/) so it doesn't block the event loop.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/pdf_embedding)

## Flow overview

![CocoIndex PDF embedding flow: walk a folder of PDFs, convert each to Markdown with docling on a GPU runner, split into chunks, embed each chunk, and store the vectors in Postgres with pgvector](https://cocoindex.io/blobs/docs-v1/img/examples/pdf-embedding/flow-v1.png)

From a high level, these are the steps:

1. Read PDF files from a local directory (live).
2. [Convert each PDF to Markdown](https://github.com/docling-project/docling) with docling, [split it into overlapping chunks](https://cocoindex.io/docs/ops/text/), then [embed](https://cocoindex.io/docs/ops/sentence_transformers/) every chunk.
3. Store the chunks and their embeddings in Postgres (as [target states](https://cocoindex.io/docs/programming_guide/target_state/)).

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

## Setup

- A running Postgres with the [pgvector](https://github.com/pgvector/pgvector) extension. The repo ships a compose file:

  ```sh
  docker compose -f dev/postgres.yaml up -d
  export POSTGRES_URL="postgres://cocoindex:cocoindex@localhost/cocoindex"
  ```

- Install CocoIndex and the dependencies this example uses (docling pulls in the PDF parser):

  ```sh
  pip install -U "cocoindex[postgres,sentence_transformers]" asyncpg pgvector numpy docling python-dotenv
  ```

- A few PDFs to index. The example ships a `pdf_files/` folder with a couple of papers and an RFC — or drop your own in.

## Define the data and shared resources

`PdfEmbedding` defines one row of the output table — each chunk of text becomes one row, with its filename, character offsets, text, and embedding vector. `coco_lifespan` provides the [shared resources](https://cocoindex.io/docs/programming_guide/context/) every step needs — the Postgres connection pool and the embedding model — once at startup.

```python title="main.py"
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
PG_DB = coco.ContextKey[asyncpg.Pool]("pdf_embedding_db")
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder", detect_change=True)

_splitter = RecursiveSplitter()


@dataclass
class PdfEmbedding:
    id: int
    filename: str
    chunk_start: int
    chunk_end: int
    text: str
    embedding: Annotated[NDArray, EMBEDDER]


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    async with asyncpg.create_pool(os.environ["POSTGRES_URL"]) as pool:
        builder.provide(PG_DB, pool)
        builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
        yield
```

`embedding: Annotated[NDArray, EMBEDDER]` ties the vector column to the embedder, so its dimensions are inferred automatically — and if you swap the model later, CocoIndex notices (`detect_change=True`) and re-embeds.

## Convert PDFs to Markdown

This is the one step text embedding doesn't have. [docling](https://github.com/docling-project/docling) reads the PDF and exports clean Markdown — preserving headings, tables, and reading order, which is exactly what makes the downstream chunks coherent.

```python title="main.py"
@functools.cache
def pdf_converter() -> DocumentConverter:
    pipeline_options = PdfPipelineOptions(
        accelerator_options=AcceleratorOptions(device=AcceleratorDevice.CPU)
    )
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )


@coco.fn.as_async(runner=coco.GPU)
def pdf_to_markdown(content: bytes) -> str:
    source = DocumentStream(name="input.pdf", stream=io.BytesIO(content))
    return pdf_converter().convert(source).document.export_to_markdown()
```

Two things make this hold up at scale:

- **`@coco.fn.as_async(runner=coco.GPU)`** wraps a *synchronous*, CPU/GPU-heavy function so CocoIndex runs it on a dedicated [GPU runner](https://cocoindex.io/docs/programming_guide/function/) instead of blocking the async event loop. PDF parsing is the slow part of this pipeline; offloading it keeps the rest of the flow responsive.
- **`@functools.cache`** builds the docling `DocumentConverter` once and reuses it across every PDF — model load happens a single time, not per file.
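
Both points can be seen in miniature with the standard library alone. In this sketch `asyncio.to_thread` stands in for the dedicated runner and `str.upper` stands in for the docling converter. Both substitutions are assumptions made to keep the example self-contained:

```python
import asyncio
import functools

load_count = 0


@functools.cache
def get_converter():
    """Stand-in for an expensive model load; the result is cached after the first call."""
    global load_count
    load_count += 1
    return str.upper  # placeholder "converter"


def convert_sync(doc: str) -> str:
    # Blocking work: in the real pipeline this is the docling conversion.
    return get_converter()(doc)


async def convert(doc: str) -> str:
    # Run the blocking call off the event loop so other tasks keep making progress.
    return await asyncio.to_thread(convert_sync, doc)


async def main() -> list[str]:
    return [await convert("first pdf"), await convert("second pdf")]


results = asyncio.run(main())
```

Two documents are converted, but `load_count` ends at 1: the "model" is built on the first call and reused afterwards.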

## Process a file

![One processing component per PDF: convert to Markdown, chunk, embed each chunk, and declare PdfEmbedding rows into Postgres](https://cocoindex.io/blobs/docs-v1/img/examples/pdf-embedding/stage-file-process.png)

`process_file` runs once per PDF. It converts the PDF to Markdown, [splits the text](https://cocoindex.io/docs/ops/text/) into overlapping chunks, and maps each chunk to `process_chunk`.

```python title="main.py"
@coco.fn(memo=True)
async def process_file(
    file: FileLike,
    table: postgres.TableTarget[PdfEmbedding],
) -> None:
    markdown = await pdf_to_markdown(await file.read())
    chunks = _splitter.split(
        markdown, chunk_size=2000, chunk_overlap=500, language="markdown"
    )
    id_gen = IdGenerator()
    await coco.map(process_chunk, chunks, file.file_path.path, id_gen, table)
```

[`@coco.fn`](https://cocoindex.io/docs/programming_guide/function/) with [`memo=True`](https://cocoindex.io/docs/advanced_topics/memoization_keys/) is what makes this incremental: if a PDF's content and this function's code are both unchanged, the whole file is skipped on the next run — so you never re-run docling on a PDF you've already parsed. `coco.map` fans out to one `process_chunk` call per chunk.
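
To see what `chunk_size=2000, chunk_overlap=500` implies, here is a deliberately naive fixed-window splitter. It is not how `RecursiveSplitter` works. `split_with_overlap` is a hypothetical helper that only illustrates the size and overlap arithmetic:

```python
def split_with_overlap(
    text: str, chunk_size: int = 2000, chunk_overlap: int = 500
) -> list[tuple[int, int, str]]:
    """Return (start, end, chunk_text) windows that overlap by `chunk_overlap` chars."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: list[tuple[int, int, str]] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append((start, end, text[start:end]))
        if end == len(text):
            break  # last window reached the end of the text
        start += step
    return chunks


spans = [(start, end) for start, end, _ in split_with_overlap("x" * 5000)]
```

For a 5000-character document the windows start every 1500 characters, so each chunk shares its last 500 characters with the next one. A sentence that straddles a boundary therefore appears whole in at least one chunk.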

## Process a chunk

`process_chunk` embeds the chunk with the shared embedder and declares the target row.

```python title="main.py"
@coco.fn
async def process_chunk(
    chunk: Chunk,
    filename: pathlib.PurePath,
    id_gen: IdGenerator,
    table: postgres.TableTarget[PdfEmbedding],
) -> None:
    table.declare_row(
        row=PdfEmbedding(
            id=await id_gen.next_id(chunk.text),
            filename=str(filename),
            chunk_start=chunk.start.char_offset,
            chunk_end=chunk.end.char_offset,
            text=chunk.text,
            embedding=await coco.use_context(EMBEDDER).embed(chunk.text),
        ),
    )
```

We use [`SentenceTransformerEmbedder`](https://cocoindex.io/docs/ops/sentence_transformers/) with `all-MiniLM-L6-v2` — a small, fast model that runs locally with no API key. `table.declare_row` declares the row as a target state; CocoIndex handles inserting, updating, or deleting it to match. Each row's `id` is derived from the chunk text, so a chunk that survives a re-parse keeps its row.
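
One way to picture a text-derived `id`: hash the chunk text together with an occurrence counter, so repeated text within a file still gets distinct ids. This is an assumption-laden sketch, not the `IdGenerator` implementation. `TextIdGenerator` and its hashing scheme are hypothetical:

```python
import hashlib
from collections import Counter


class TextIdGenerator:
    """Hypothetical per-file generator of stable ids derived from chunk text."""

    def __init__(self) -> None:
        self._seen = Counter()

    def next_id(self, text: str) -> int:
        occurrence = self._seen[text]
        self._seen[text] += 1
        digest = hashlib.sha256(f"{occurrence}:{text}".encode("utf-8")).digest()
        # Keep 63 bits so the value fits a signed 64-bit integer column.
        return int.from_bytes(digest[:8], "big") >> 1


first_gen = TextIdGenerator()
first_parse = [first_gen.next_id(t) for t in ("intro", "methods", "intro")]

second_gen = TextIdGenerator()
second_parse = [second_gen.next_id(t) for t in ("intro", "methods (revised)", "intro")]
```

The two `intro` chunks keep their ids across parses while the revised chunk gets a new one, which is the property that lets unchanged rows stay put when a PDF is replaced.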

## Define the main function

![mount_each fans out one processing component per PDF, from the filesystem source to the Postgres target](https://cocoindex.io/blobs/docs-v1/img/examples/pdf-embedding/stage-main-function.png)

`app_main` wires the source to the target. It mounts the Postgres table, walks the source directory for PDFs, and mounts one [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) per file.

```python title="main.py"
@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    target_table = await postgres.mount_table_target(
        PG_DB,
        table_name=TABLE_NAME,
        table_schema=await postgres.TableSchema.from_class(
            PdfEmbedding, primary_key=["id"],
        ),
        pg_schema_name=PG_SCHEMA_NAME,   # "coco_examples"
    )

    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.pdf"]),
        live=True,  # watch for changes; pass -L to `cocoindex update` to run live
    )
    await coco.mount_each(process_file, files.items(), target_table)


app = coco.App(
    coco.AppConfig(name="PdfEmbeddingV1"),
    app_main,
    sourcedir=pathlib.Path("./pdf_files"),
)
```

`mount_table_target` creates and manages the Postgres table for you — schema, idempotent upserts, and orphan cleanup when a PDF disappears. `live=True` makes the [filesystem source](https://cocoindex.io/docs/connectors/localfs/) [watch for changes](https://cocoindex.io/docs/programming_guide/live_mode/), and `mount_each` runs one component per file so the engine can track and update them independently.

> **No vector index here.** To keep the example minimal, this flow doesn't declare a vector index, so queries do a sequential scan — fine for a few PDFs. For a larger corpus, add one line — `target_table.declare_vector_index(column="embedding")` — exactly as the [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/) example does, and pgvector serves approximate-nearest-neighbor queries instead.

## Run the pipeline

Run the [`cocoindex` CLI](https://cocoindex.io/docs/cli/) to build and update the index. Choose catch-up (scan, sync, exit) or live (catch up, then keep watching):

```sh
# Catch-up run
cocoindex update main

# Live run: keep watching for file changes
cocoindex update -L main
```

## Query the index

Match user text against the index with a plain SQL query, reusing the *same* embedder from the indexing flow so indexing and querying stay consistent.

```python title="main.py"
async def query_once(pool, embedder, query: str, *, top_k: int = 5) -> None:
    query_vec = await embedder.embed(query)
    async with pool.acquire() as conn:
        rows = await conn.fetch(
            f"""
            SELECT filename, text, embedding <=> $1 AS distance
            FROM "{PG_SCHEMA_NAME}"."{TABLE_NAME}"
            ORDER BY distance ASC
            LIMIT $2
            """,
            query_vec, top_k,
        )
    for r in rows:
        score = 1.0 - float(r["distance"])
        print(f"[{score:.3f}] {r['filename']}")
        print(f"    {r['text']}")
        print("---")
```

The `<=>` operator is pgvector's cosine distance. We turn it into a similarity score and print the filename and the matching chunk. Run a search straight from the command line:

```bash
python main.py "what is attention?"
```

With the sample papers indexed, the most semantically similar passages come back ranked — even when they share none of the words in your query. That's the whole point of a vector index.
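
The score arithmetic in `query_once` can be checked by hand. The sketch below reimplements cosine distance in plain Python (pgvector computes it in the database) and converts it to the same `1.0 - distance` score. The function names and toy vectors are illustrative only:

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 minus cosine similarity, the quantity the `<=>` operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def score(a: list[float], b: list[float]) -> float:
    """Similarity score as printed by the query code: 1.0 - distance."""
    return 1.0 - cosine_distance(a, b)


same_direction = score([1.0, 2.0], [2.0, 4.0])
orthogonal = score([1.0, 0.0], [0.0, 1.0])
opposite = score([1.0, 0.0], [-1.0, 0.0])
```

Vectors pointing the same way score close to 1 regardless of length, unrelated (orthogonal) vectors score 0, and opposite vectors score -1.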

## Incremental updates

CocoIndex keeps the index in sync with your PDFs and does the **minimum work** to get there. You never compute a diff or write update logic. Two pieces make this work. `@coco.fn(memo=True)` decides what to *recompute* — a PDF is skipped when its bytes and the function's code are both unchanged, so docling never re-parses an unchanged file. `mount_table_target` decides what to *write* — each row's [`id`](https://cocoindex.io/docs/common_resources/id_generation/) is derived from its chunk's text, so it upserts only the rows that actually changed and deletes rows whose source is gone.

- **A PDF is added** — only that file is parsed, chunked, and embedded; its rows are inserted. The rest is untouched.
- **A PDF is replaced** — it is re-parsed and re-chunked; chunks whose text is unchanged keep their `id` and embedding, genuinely new chunks are embedded and inserted, and chunks that no longer exist are deleted.
- **A PDF is deleted** — its rows are removed from the target automatically.

The same machinery covers **logic** changes too: tune the chunk size or swap the embedding model, and CocoIndex compares the new output against what's already in Postgres and applies only the difference. A catch-up run (`cocoindex update main`) does this once and exits; live mode (`cocoindex update -L main`) keeps watching and applies each change with low latency.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/pdf_embedding](https://github.com/cocoindex-io/cocoindex/tree/main/examples/pdf_embedding). If your inputs are already plain text or Markdown, [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/) is the same flow without the docling step; if you want the Markdown itself as the output, see [PDF → Markdown](https://cocoindex.io/docs/examples/pdf-to-markdown/).

Got a folder of papers, reports, or scanned docs you want to search by meaning? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: Search Images by Text

Source: https://cocoindex.io/docs/examples/image-search/

![Search images by text with CLIP and CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/image-search/cover.png)

We'll take a folder of images and make it searchable in plain English — type *"long neck"* and get the giraffe back, with no tags, no captions, no manual labeling. The trick is [CLIP](https://openai.com/research/clip): it embeds images **and** text into the *same* vector space, so a text query and a matching picture land near each other. We store the image vectors in [Qdrant](https://qdrant.tech/) and serve search through a small FastAPI + React app.

The whole pipeline is ordinary `async` Python and your own types. The heavy lifting — [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/), change tracking, the managed Qdrant collection — runs in a Rust engine underneath, and the flow runs in [live mode](https://cocoindex.io/docs/programming_guide/live_mode/) inside the API server, so dropping a new photo into the folder updates the index within a second.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/image_search)

## Flow overview

![CocoIndex image search indexing flow: walk a folder of images, embed each with the CLIP image encoder, and declare a point into a Qdrant collection](https://cocoindex.io/blobs/docs-v1/img/examples/image-search/flow-v1.png)

The indexing path is short — there's no text to chunk, just one embedding per image:

1. Read image files from a local directory (live).
2. Embed each image with the [CLIP](https://huggingface.co/openai/clip-vit-large-patch14) image encoder.
3. Store the vector in Qdrant (as a [point](https://cocoindex.io/docs/connectors/qdrant/), keyed by a stable id, with the filename in the payload).

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

## One embedding space for images and text

This is the idea the whole example rests on. CLIP was trained to pull an image and its caption *together* in vector space, so the image of a giraffe and the words "long neck" end up close — even though one is pixels and the other is text. That means **indexing and querying use the same model, two different encoders**:

```python title="pipeline.py"
@functools.cache
def get_clip_model() -> tuple[CLIPModel, CLIPProcessor]:
    model = CLIPModel.from_pretrained(CLIP_MODEL_NAME)       # openai/clip-vit-large-patch14
    processor = CLIPProcessor.from_pretrained(CLIP_MODEL_NAME)
    return model, processor


def embed_image_bytes(img_bytes: bytes) -> list[float]:     # indexing side
    model, processor = get_clip_model()
    image = Image.open(io.BytesIO(img_bytes)).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        out = model.get_image_features(**inputs)
    return _projected_features(out)[0].tolist()


def embed_query(text: str) -> list[float]:                  # query side
    model, processor = get_clip_model()
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model.get_text_features(**inputs)
    return _projected_features(out)[0].tolist()
```

Both produce a 768-d vector in the same space, so a cosine search with a text vector finds the nearest *image* vectors. `@functools.cache` loads the (large) CLIP model once and reuses it for every image and every query.
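To make the "nearest image vector" idea concrete, here is a minimal, dependency-free sketch of cosine ranking. The 3-d vectors and file names are made-up stand-ins for real 768-d CLIP embeddings, and in the example Qdrant performs this search, not hand-written Python.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-d stand-ins for 768-d CLIP image embeddings.
image_vectors = {
    "giraffe.jpg": [0.9, 0.1, 0.2],
    "cat.jpg": [0.1, 0.9, 0.3],
    "dog.jpg": [0.2, 0.8, 0.4],
}

# Hypothetical stand-in for the text embedding of a query.
query_vector = [0.8, 0.2, 0.1]

# Rank image names by similarity to the query, best match first.
ranked = sorted(
    image_vectors,
    key=lambda name: cosine(query_vector, image_vectors[name]),
    reverse=True,
)
```

With these toy values, `ranked[0]` is `"giraffe.jpg"` because its vector points in nearly the same direction as the query vector.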

## Setup

- A running [Qdrant](https://qdrant.tech/):

  ```sh
  docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
  export QDRANT_URL="http://localhost:6334/"
  ```

- Install CocoIndex and the dependencies this example uses:

  ```sh
  pip install -U "cocoindex[qdrant]" torch transformers pillow fastapi "uvicorn[standard]" python-dotenv
  ```

- A few images. The example ships an `img/` folder (a cat, a dog, an elephant, a giraffe) — or drop your own `.jpg` / `.png` files in.

## Shared resources: the Qdrant client

The [lifespan](https://cocoindex.io/docs/programming_guide/context/) provides the Qdrant client once at startup, via a [context key](https://cocoindex.io/docs/programming_guide/context/):

```python title="pipeline.py"
QDRANT_DB = coco.ContextKey[QdrantClient]("image_search_qdrant")


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    client = qdrant.create_client(qdrant_url(), prefer_grpc=True)
    builder.provide(QDRANT_DB, client)
    yield
```

## Process an image

![One process_file component per image, fanned out with mount_each: each image is CLIP-embedded and declared as a Qdrant point](https://cocoindex.io/blobs/docs-v1/img/examples/image-search/stage-file-process.png)

`process_file` runs once per image: read the bytes, embed with CLIP, and declare a Qdrant point keyed by a stable id derived from the path, with the filename in the payload.

```python title="pipeline.py"
@coco.fn(memo=True)
async def process_file(file: FileLike, target: qdrant.CollectionTarget) -> None:
    content = await file.read()
    embedding = embed_image_bytes(content)
    point = qdrant.PointStruct(
        id=_image_id(file.file_path.path),                  # uuid5 of the path — stable
        vector=embedding,
        payload={"filename": str(file.file_path.path)},
    )
    target.declare_point(point)
```

[`@coco.fn(memo=True)`](https://cocoindex.io/docs/programming_guide/function/) makes it [incremental](https://cocoindex.io/docs/advanced_topics/memoization_keys/): an unchanged image is never re-embedded. Each image runs as its own [processing component](https://cocoindex.io/docs/programming_guide/processing_component/), so the engine tracks them independently — delete an image and its point is removed from Qdrant automatically. `declare_point` declares the point as a [target state](https://cocoindex.io/docs/programming_guide/target_state/); CocoIndex upserts or deletes to match.
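The `_image_id` helper is not shown in this walkthrough. As a rough sketch of what "uuid5 of the path" can look like, the function below derives a deterministic UUID from a path string. The function name and the choice of `uuid.NAMESPACE_URL` are assumptions for illustration; the example's real helper may differ.

```python
import uuid

# Hypothetical namespace; the example's real `_image_id` helper is not shown here.
IMAGE_NAMESPACE = uuid.NAMESPACE_URL


def image_id(path: str) -> str:
    """Derive a deterministic UUID string from a file path."""
    return str(uuid.uuid5(IMAGE_NAMESPACE, path))


first = image_id("img/giraffe.jpg")
again = image_id("img/giraffe.jpg")   # same path, same id
other = image_id("img/cat.jpg")       # different path, different id
```

Because the id depends only on the path, replacing a file's contents keeps the same point id, which is what lets an update overwrite the existing point instead of creating a second one.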

## Define the main function

`app_main` mounts the Qdrant collection — sizing the vector to CLIP's projection dimension and using cosine distance — then walks the image folder and mounts one component per file:

```python title="pipeline.py"
@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    model, _ = get_clip_model()
    dim = model.config.projection_dim   # 768 for ViT-L/14

    target_collection = await qdrant.mount_collection_target(
        QDRANT_DB,
        collection_name=QDRANT_COLLECTION,   # "ImageSearch"
        schema=await qdrant.CollectionSchema.create(
            vectors=qdrant.QdrantVectorDef(
                schema=VectorSchema(dtype=np.dtype(np.float32), size=dim),
                distance="cosine",
            )
        ),
    )

    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(
            included_patterns=["**/*.jpg", "**/*.jpeg", "**/*.png"]
        ),
        live=True,   # api.py runs the app with live=True
    )
    await coco.mount_each(process_file, files.items(), target_collection)


app = coco.App(
    coco.AppConfig(name="ImageSearchQdrantV1"),
    app_main,
    sourcedir=pathlib.Path("./img"),
)
```

`mount_collection_target` creates and manages the Qdrant collection for you: schema, idempotent upserts, and cleanup when an image disappears. The vector size is read from the model's config, so switching to a CLIP variant with a different projection dimension requires no change to the schema code.

## Run it as a service

Unlike the batch examples, image search runs as a server. `api.py` is a FastAPI app whose [lifespan](https://fastapi.tiangolo.com/advanced/events/) starts the CocoIndex flow in **live mode** in the background — it blocks startup until the initial sweep finishes (so the collection is queryable), then keeps watching `img/` while it serves requests. There's no separate "build the index" step.

```python title="api.py"
@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncIterator[None]:
    global _client  # module-level, so the /search handler below can reach it
    async with coco.runtime():
        _client = qdrant.create_client(pipeline.qdrant_url(), prefer_grpc=True)

        # Start a live update; block until the initial sweep is READY, then run on.
        update_handle = pipeline.app.update(live=True)
        async for snap in update_handle.watch():
            if snap.status is coco.UpdateStatus.READY:
                break
        update_task = asyncio.create_task(update_handle.result())
        try:
            yield
        finally:
            update_task.cancel()


app = FastAPI(lifespan=lifespan)


@app.get("/search")
async def search(q: str, limit: int = 5) -> dict:
    query_embedding = pipeline.embed_query(q)               # text → CLIP vector
    results = pipeline._qdrant_search(_client, pipeline.QDRANT_COLLECTION, query_embedding, limit)
    return {"results": [{"filename": (r.payload or {}).get("filename"), "score": r.score} for r in results]}
```
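The startup pattern here (wait for a "ready" signal, keep the worker running in the background, cancel it on shutdown) can be sketched with plain `asyncio`. Everything below is a hypothetical stand-in: `live_worker` plays the role of the live update and an `asyncio.Event` plays the role of the READY status. None of it is CocoIndex API.

```python
import asyncio


async def live_worker(ready: asyncio.Event, log: list[str]) -> None:
    """Hypothetical stand-in for a live update: initial sweep, then keep watching."""
    log.append("initial sweep done")
    ready.set()                      # signal that serving can begin
    try:
        while True:                  # keep "watching" until cancelled
            await asyncio.sleep(0.01)
    except asyncio.CancelledError:
        log.append("worker cancelled")
        raise


async def serve() -> list[str]:
    log: list[str] = []
    ready = asyncio.Event()
    task = asyncio.create_task(live_worker(ready, log))
    await ready.wait()               # block startup until the index is queryable
    log.append("serving requests")
    task.cancel()                    # shutdown: stop the background worker
    try:
        await task                   # let the worker finish its cleanup
    except asyncio.CancelledError:
        pass
    return log


events = asyncio.run(serve())
```

One design note: this sketch awaits the cancelled task so the worker's cleanup runs before shutdown completes. Whether the real server needs that extra step depends on what the update handle does on cancellation.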

Start the server, then the frontend:

```sh
python -m uvicorn api:app --reload --host 0.0.0.0 --port 8000

cd frontend && npm install && npm run dev   # http://localhost:5173
```

## Search it

The React app sends your query to `/search` as a GET request; the endpoint embeds the text with CLIP and runs a cosine search in Qdrant. Here it is answering *"long neck"*: the giraffe ranks first, followed by the other animals in order of CLIP similarity to the query, none of which was ever tagged with a word:

![The image search app: a query for "long neck" returns the giraffe first (score 0.231), then elephant, cat, and dog, ranked by CLIP similarity — alongside the indexed images and their 768-element embeddings](https://cocoindex.io/blobs/docs-v1/img/examples/image-search/search-results.png)

That's the whole point of a shared image-text space: the match is by *meaning*, not metadata.
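If you want to call the endpoint without the frontend, the request is a plain GET with `q` and `limit` query parameters, matching the handler's signature. A small sketch for building that URL follows. The base URL assumes the server is running locally on port 8000 as in the `uvicorn` command above, and `search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed base URL: the uvicorn command above binds port 8000.
BASE_URL = "http://localhost:8000"


def search_url(query: str, limit: int = 5) -> str:
    """Build the GET URL for the /search endpoint with `q` and `limit` parameters."""
    return f"{BASE_URL}/search?{urlencode({'q': query, 'limit': limit})}"


url = search_url("long neck")
```

Here `url` is `http://localhost:8000/search?q=long+neck&limit=5`. Per the handler above, the response is a JSON object with a `results` list whose items carry `filename` and `score`.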

## Incremental updates

Because the flow runs live inside the server, the index tracks the folder with no extra work from you:

- **Add an image** — `process_file` runs once for it, embeds it, and upserts one Qdrant point. It's searchable within a second.
- **Replace an image** — same id (derived from the path), new vector; the point is updated in place.
- **Delete an image** — its component disappears and the point is removed from Qdrant.
- **Restart the server** — the initial sweep reconciles against what's already in Qdrant and re-embeds nothing that's unchanged.

Swap the CLIP model and CocoIndex re-embeds everything against the new space; leave it alone and a restart is nearly free.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/image_search](https://github.com/cocoindex-io/cocoindex/tree/main/examples/image_search). For higher-fidelity retrieval, [image_search_colpali](https://github.com/cocoindex-io/cocoindex/tree/main/examples/image_search_colpali) swaps CLIP for the multi-vector ColPali model with Qdrant MaxSim; for the text equivalent, see [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/).

Got a photo library, a product catalog, or a screenshot pile you want to search by meaning? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: Audio to Text

Source: https://cocoindex.io/docs/examples/audio-to-text/

![Audio to Text with CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/audio-to-text/cover.png)

We'll take a folder of audio files — voice memos, meeting recordings, podcast clips — and turn each one into a searchable transcript. CocoIndex walks the directory, sends every file to a [LiteLLM](https://docs.litellm.ai/) transcription model, and writes the text to Postgres as one row per file, keyed by filename. The result is a plain table you can query, join, or feed into a downstream embedding pipeline.

The whole thing is ordinary `async` Python and your own types. The heavy lifting — [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/), change tracking, managed targets — runs in a Rust engine underneath, so only files that are new or changed get re-transcribed, and removed files have their rows cleaned up automatically.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/audio_to_text)

## Flow overview

![CocoIndex audio-to-text flow: read audio files from a local directory, transcribe each with LiteLLM, and store one transcript row per file in Postgres](https://cocoindex.io/blobs/docs-v1/img/examples/audio-to-text/flow-v1.png)

From a high level, these are the steps:

1. Read audio files from a local directory (recursively, matching common audio extensions).
2. [Transcribe each file](https://cocoindex.io/docs/ops/litellm/) with a LiteLLM speech-to-text model (`whisper-1` by default).
3. Store one transcript per file in Postgres (as a [target state](https://cocoindex.io/docs/programming_guide/target_state/)), keyed by filename — no chunking.

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

## Setup

- A running Postgres. CocoIndex supports [many targets](https://cocoindex.io/docs/connectors/postgres/), so you can pick another store. If you don't have one, start a local instance:

  ```sh
  docker compose -f dev/postgres.yaml up -d
  export POSTGRES_URL="postgres://cocoindex:cocoindex@localhost/cocoindex"
  ```

- LiteLLM credentials for the transcription model. For the default `whisper-1`, set your OpenAI key:

  ```sh
  export OPENAI_API_KEY="..."
  ```

- Install CocoIndex with the extras this example uses:

  ```sh
  pip install -U "cocoindex[litellm,postgres]" asyncpg
  ```

- A few audio files to transcribe. Drop them into an `audio_files/` directory — the example recursively picks up `.aac`, `.aiff`, `.flac`, `.m4a`, `.mp3`, `.ogg`, `.wav`, and `.webm`.

## Define the data and shared resources

[Apps](https://cocoindex.io/docs/programming_guide/app/) are the top-level runnable unit in CocoIndex. Before the App, we set up the pieces the rest of the code builds on. `AudioTranscription` defines one row of the output table — each audio file becomes one row, with its filename and transcript text. `_transcriber` is the LiteLLM model, created once and reused. `coco_lifespan` provides the [shared resource](https://cocoindex.io/docs/programming_guide/context/) every step needs — the Postgres connection pool — once at startup.

```python title="main.py"
import os
import pathlib
from dataclasses import dataclass
from typing import AsyncIterator

import asyncpg

import cocoindex as coco
from cocoindex.connectors import localfs, postgres
from cocoindex.ops.litellm import LiteLLMTranscriber
from cocoindex.resources.file import PatternFilePathMatcher

DATABASE_URL = os.getenv("POSTGRES_URL", "postgres://cocoindex:cocoindex@localhost/cocoindex")
TABLE_NAME = "audio_transcriptions"
PG_SCHEMA_NAME = "coco_examples"
PG_DB = coco.ContextKey[asyncpg.Pool]("audio_to_text_db")

_transcriber = LiteLLMTranscriber("whisper-1")


@dataclass
class AudioTranscription:
    filename: str
    text: str


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    async with await asyncpg.create_pool(DATABASE_URL) as pool:
        builder.provide(PG_DB, pool)
        yield
```

[`LiteLLMTranscriber("whisper-1")`](https://cocoindex.io/docs/ops/litellm/) wraps LiteLLM's transcription API, so you can swap in any model LiteLLM supports — `elevenlabs/scribe_v1`, a self-hosted endpoint, whatever — by changing that one string (and the matching credential).

## Process a file

![One processing component per file: each audio file is transcribed with LiteLLM, producing one AudioTranscription row written to Postgres](https://cocoindex.io/blobs/docs-v1/img/examples/audio-to-text/stage-file-process.png)

`process_file` runs once per file. It reads the audio, [transcribes it](https://cocoindex.io/docs/ops/litellm/), and declares a single target row — no chunking, one row per file.

```python title="main.py"
@coco.fn(memo=True)
async def process_file(
    file: localfs.File,
    table: postgres.TableTarget[AudioTranscription],
) -> None:
    transcript = await _transcriber.transcribe(file)
    table.declare_row(
        row=AudioTranscription(
            filename=str(file.file_path.path),
            text=transcript,
        ),
    )
```

`_transcriber.transcribe(file)` reads the file's bytes and calls the LiteLLM model, returning plain text. `table.declare_row` declares that row as a target state; CocoIndex handles inserting, updating, or deleting it to match. Because the filename is the primary key, the table doubles as an index of which files have been transcribed.

[`@coco.fn`](https://cocoindex.io/docs/programming_guide/function/) with [`memo=True`](https://cocoindex.io/docs/advanced_topics/memoization_keys/) is what makes this incremental: if a file's content and this function's code are both unchanged, the whole file is skipped on the next run — so you don't pay for the same transcription twice.
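As a mental model only, memoization of this kind behaves like a cache keyed on both the input content and a fingerprint of the code. The sketch below is a toy illustration, not CocoIndex's implementation: `CODE_VERSION`, `memo_key`, and `fake_transcribe` are all hypothetical names, and the real engine tracks code changes for you.

```python
import hashlib
from typing import Callable

# Hypothetical version tag standing in for "this function's code".
CODE_VERSION = "v1"

_cache: dict[str, str] = {}
calls: list[str] = []


def memo_key(content: bytes, code_version: str) -> str:
    """Fingerprint of the inputs that decide whether work can be skipped."""
    return hashlib.sha256(code_version.encode() + b"\0" + content).hexdigest()


def transcribe_once(content: bytes, transcribe: Callable[[bytes], str]) -> str:
    """Run `transcribe` only when this (code, content) pair has not been seen."""
    key = memo_key(content, CODE_VERSION)
    if key not in _cache:
        _cache[key] = transcribe(content)
    return _cache[key]


def fake_transcribe(content: bytes) -> str:
    """Stand-in for the paid transcription call; records each invocation."""
    calls.append(content.decode())
    return content.decode().upper()


transcribe_once(b"hello", fake_transcribe)   # first run: computed
transcribe_once(b"hello", fake_transcribe)   # unchanged: served from cache
transcribe_once(b"hello!", fake_transcribe)  # content changed: recomputed
```

After these three calls, `fake_transcribe` has run only twice. Changing `CODE_VERSION` would change every key, which mirrors how a logic change invalidates earlier results.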

## Define the main function

`app_main` wires the source to the target. It mounts the Postgres table, walks the source directory for audio files, and mounts one [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) per file.

```python title="main.py"
@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    target_table = await postgres.mount_table_target(
        PG_DB,
        table_name=TABLE_NAME,
        table_schema=await postgres.TableSchema.from_class(
            AudioTranscription,
            primary_key=["filename"],
        ),
        pg_schema_name=PG_SCHEMA_NAME,
    )

    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(
            included_patterns=[
                "**/*.aac", "**/*.aiff", "**/*.flac", "**/*.m4a",
                "**/*.mp3", "**/*.ogg", "**/*.wav", "**/*.webm",
            ],
        ),
    )
    await coco.mount_each(process_file, files.items(), target_table)
```

`mount_table_target` creates and manages the Postgres table for you: schema, idempotent upserts, and orphan cleanup when a file disappears. `primary_key=["filename"]` is what makes each file map to exactly one row. `mount_each` runs one component per file so the engine can track and update them independently.
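To see why a primary key makes writes idempotent, here is a small sketch that uses the standard library's SQLite in place of Postgres. It is an illustration of the upsert idea only: the source does not show which statements CocoIndex actually issues, and the `upsert` helper is hypothetical. Postgres accepts the same `ON CONFLICT ... DO UPDATE` clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audio_transcriptions (filename TEXT PRIMARY KEY, text TEXT)"
)


def upsert(filename: str, text: str) -> None:
    """Insert the row, or overwrite it when the filename already exists."""
    conn.execute(
        "INSERT INTO audio_transcriptions (filename, text) VALUES (?, ?) "
        "ON CONFLICT(filename) DO UPDATE SET text = excluded.text",
        (filename, text),
    )


upsert("memo.mp3", "first draft")
upsert("memo.mp3", "first draft")      # repeating the write changes nothing
upsert("memo.mp3", "corrected text")   # same key, new text: updated in place

rows = conn.execute("SELECT filename, text FROM audio_transcriptions").fetchall()
```

The table ends up with exactly one row for `memo.mp3`, holding the latest text, no matter how many times the write is repeated.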

## Create the App

Bind `app_main` into a `coco.App` and point it at the folder of audio files.

```python title="main.py"
app = coco.App(
    "AudioToText",
    app_main,
    sourcedir=pathlib.Path("./audio_files"),
)
```

That is the entire indexing path.

## Run the pipeline

Run the [`cocoindex` CLI](https://cocoindex.io/docs/cli/) to build and update the table:

```sh
cocoindex update main.py
```

The target table is `coco_examples.audio_transcriptions`, with `filename` as the primary key and `text` as the transcript. Check the results with plain SQL:

```sh
psql "$POSTGRES_URL" -c \
  'SELECT filename, left(text, 200) AS preview FROM coco_examples.audio_transcriptions ORDER BY filename;'
```

## Incremental updates

CocoIndex keeps the table in sync with your files and does the **minimum work** to get there. You never compute a diff or write update logic: you change something, and CocoIndex works out exactly what to transcribe, upsert, and delete. Two pieces make this work. `@coco.fn(memo=True)` decides what to *recompute* — a file is skipped when its content and the function's code are both unchanged, so an unchanged file never hits the transcription API again. `mount_table_target` decides what to *write* — each row is keyed by `filename`, so it upserts only the rows that actually changed and deletes rows whose source file is gone.

- **A file is added** — only that file is transcribed, and its one row is inserted. The rest is untouched.
- **A file is changed** — it is re-transcribed and its row is updated in place. Files with identical content keep their cached transcript and are left as-is.
- **A file is deleted** — its row is removed from the target automatically.

The same machinery covers **logic** changes too: swap the transcription model and CocoIndex re-transcribes against the new model, comparing the result with what is already in Postgres and applying only the difference. Re-running `cocoindex update main.py` does this once and exits.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/audio_to_text](https://github.com/cocoindex-io/cocoindex/tree/main/examples/audio_to_text). Once this clicks, [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/) is the natural next step — embed those transcripts and search them by meaning.

If CocoIndex helps you, star us on [GitHub](https://github.com/cocoindex-io/cocoindex) and come say hi in our [Discord](https://discord.com/invite/zpA9S2DR7s).

---

# Example: Trending Topics from HackerNews

Source: https://cocoindex.io/docs/examples/hackernews-trending-topics/

![Trending Topics from HackerNews](https://cocoindex.io/blobs/docs-v1/img/examples/hackernews-trending-topics/cover.png)

What is the tech community talking about right now? In this tutorial, we'll build a pipeline that scrapes recent HackerNews stories and their comment threads, uses an LLM to pull out the topics each message is about, and stores everything in Postgres so you can rank what's trending and search by topic.

The data source here isn't a folder of files — it's a public HTTP API. We fetch threads on the fly with the [Algolia HackerNews API](https://hn.algolia.com/api), so this example doubles as a recipe for plugging any custom source into a CocoIndex pipeline. Each run catches up on the latest stories, and because CocoIndex memoizes per-message work, re-running only does the new work.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/hn_trending_topics)

## Flow overview

![Flow](https://cocoindex.io/blobs/docs-v1/img/examples/hackernews-trending-topics/flow-v1.png)

1. Fetch a list of recent thread IDs from the Algolia HackerNews API
2. For each thread, fetch the story and all of its comments
3. Extract topics from each message (thread + every comment) using an LLM
4. Store messages and their topics as rows in two Postgres tables

You declare the transformation logic in native Python, without worrying about how updates propagate.

Think:
**target_state = transformation(source_state)**

When the HackerNews feed moves on, or your processing logic changes (for example, you switch to a different model or refine the topic-extraction prompt), CocoIndex reprocesses only what is affected and keeps your `hn_messages` and `hn_topics` tables in sync.

## Define the data models

We model the HackerNews content we scrape, and the rows we want in Postgres, as plain Python dataclasses. The scraped `Thread` and `Comment` are what we pull from the API; `HnMessage` and `HnTopic` are the table schemas.

```python title="main.py"
@dataclass
class Comment:
    id: str
    author: str | None
    text: str | None
    created_at: datetime | None


@dataclass
class Thread:
    id: str
    author: str | None
    text: str
    url: str | None
    created_at: datetime | None
    comments: list[Comment]


@dataclass
class HnMessage:
    """Schema for hn_messages table."""

    id: str
    thread_id: str
    content_type: str
    author: str | None
    text: str | None
    url: str | None
    created_at: datetime | None


@dataclass
class HnTopic:
    """Schema for hn_topics table."""

    topic: str
    message_id: str
    thread_id: str
    content_type: str
    created_at: datetime | None
```

A thread and each of its comments both become an `HnMessage` (distinguished by `content_type`). Every extracted topic becomes an `HnTopic` row keyed on `(topic, message_id)`, so the same topic mentioned in many messages shows up many times — exactly what we want for ranking.

## Extract topics with an LLM

The core transformation is a single CocoIndex function: text in, a list of topics out. We use [litellm](https://docs.litellm.ai/docs/providers) so any provider works, and ask the model to return structured JSON validated by a small Pydantic model.

```python title="models.py"
class TopicsResponse(BaseModel):
    """Response containing a list of extracted topics."""

    topics: list[str] = Field(
        description="""List of extracted topics.

Each topic can be a product name, technology, model, people, company name, business domain, etc.
Capitalize for proper nouns and acronyms only.
Use the form that is clear alone.
Avoid acronyms unless very popular and unambiguous for common people even without context.
..."""
    )
```

```python title="main.py"
@coco.fn(memo=True)
async def extract_topics(text: str | None) -> list[str]:
    """Extract topics from text using LLM."""
    if not text or not text.strip():
        return []

    response = await acompletion(
        model=LLM_MODEL,
        messages=[
            {
                "role": "user",
                "content": f"Extract topics from the following text:\n\n{text[:4000]}",
            }
        ],
        response_format=TopicsResponse,
    )

    content = response.choices[0].message.content
    return TopicsResponse.model_validate_json(content).topics
```

The field description in `TopicsResponse` serves as the prompt and does the heavy lifting. Its full text (abbreviated above) tells the model to normalize phrases into separate topics ("books for autistic kids" → "book", "autistic", "autistic kids"), keep proper nouns canonical ("PostgreSQL", "Claude"), and emit multiple aliases for the same thing ("JFK", "John Kennedy"). That normalization is what makes the trending ranking meaningful later.
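For intuition about what validating the model's reply buys you, here is a dependency-free stand-in for the check that `TopicsResponse.model_validate_json` performs on the `topics` field. `parse_topics` is a hypothetical helper written for this sketch; the example itself relies on Pydantic, which does this and more.

```python
import json


def parse_topics(content: str) -> list[str]:
    """Parse a JSON object of the form {"topics": [...]} and check each item is a string."""
    data = json.loads(content)
    topics = data.get("topics") if isinstance(data, dict) else None
    if not isinstance(topics, list) or not all(isinstance(t, str) for t in topics):
        raise ValueError("expected an object with a 'topics' list of strings")
    return topics


topics = parse_topics('{"topics": ["PostgreSQL", "Claude"]}')
```

A reply such as `{"topics": "rust"}` raises `ValueError` instead of silently producing a malformed row downstream.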

[→ Function](https://cocoindex.io/docs/programming_guide/function)

## Fetch from the HackerNews API

The source is a custom one — two small async functions over the Algolia HN API. `fetch_thread_list` returns the most recent story IDs; `fetch_thread` pulls one story and recursively flattens its comment tree.

```python title="main.py"
async def fetch_thread_list(
    session: aiohttp.ClientSession, max_results: int = MAX_THREADS
) -> list[str]:
    """Fetch list of recent thread IDs from HackerNews."""
    search_url = "https://hn.algolia.com/api/v1/search_by_date"
    params: dict[str, str | int] = {"tags": "story", "hitsPerPage": max_results}

    async with session.get(search_url, params=params) as response:
        response.raise_for_status()
        data = await response.json()
        return [hit["objectID"] for hit in data.get("hits", []) if hit.get("objectID")]


async def fetch_thread(session: aiohttp.ClientSession, thread_id: str) -> Thread:
    """Fetch a single thread with all its comments."""
    item_url = f"https://hn.algolia.com/api/v1/items/{thread_id}"

    async with session.get(item_url) as response:
        response.raise_for_status()
        data = await response.json()

        comments: list[Comment] = []

        # Parse comments recursively
        def parse_comments(parent: dict[str, Any]) -> None:
            for child in parent.get("children", []):
                if comment_id := child.get("id"):
                    ctime = child.get("created_at")
                    comments.append(
                        Comment(
                            id=str(comment_id),
                            author=child.get("author"),
                            text=child.get("text"),
                            created_at=datetime.fromisoformat(ctime) if ctime else None,
                        )
                    )
                parse_comments(child)

        parse_comments(data)

        ctime = data.get("created_at")
        text = data.get("title", "")
        if more_text := data.get("text"):
            text += "\n\n" + more_text

        return Thread(
            id=thread_id,
            author=data.get("author"),
            text=text,
            url=data.get("url"),
            created_at=datetime.fromisoformat(ctime) if ctime else None,
            comments=comments,
        )
```

These are ordinary `async def` functions — no special CocoIndex decorators. Any HTTP API, queue, or third-party SDK can be a source this way: you fetch the data in plain Python and hand it to the pipeline.
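The recursive flattening in `fetch_thread` is easy to try offline. The sketch below runs the same depth-first walk over a made-up payload shaped like the nested `children` lists the function reads. The sample data and the `flatten_ids` name are illustrative assumptions, not real API output.

```python
from typing import Any

# Made-up payload shaped like the nested `children` lists that `fetch_thread` walks.
sample_item: dict[str, Any] = {
    "id": 1,
    "children": [
        {"id": 2, "text": "first", "children": [
            {"id": 3, "text": "reply", "children": []},
        ]},
        {"id": 4, "text": "second", "children": []},
    ],
}


def flatten_ids(item: dict[str, Any]) -> list[str]:
    """Collect comment ids depth-first, parents before their replies."""
    ids: list[str] = []

    def visit(parent: dict[str, Any]) -> None:
        for child in parent.get("children", []):
            if comment_id := child.get("id"):
                ids.append(str(comment_id))
            visit(child)

    visit(item)
    return ids


comment_ids = flatten_ids(sample_item)
```

The result is `["2", "3", "4"]`: the story itself (id 1) is not a comment, and each reply follows its parent.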

## Process each thread

Each thread is processed by its own component. `process_thread` fetches the thread, extracts topics for the story and every comment, and declares the resulting rows.

![Process thread](https://cocoindex.io/blobs/docs-v1/img/examples/hackernews-trending-topics/stage-file-process.png)

```python title="main.py"
@coco.fn
async def process_thread(
    thread_id: str,
    targets: TableTargets,
) -> None:
    """Fetch and process a single thread and its comments."""
    async with aiohttp.ClientSession() as session:
        thread = await fetch_thread(session, thread_id)
    thread_topics = await extract_topics(thread.text)

    # Declare thread message row
    targets.messages.declare_row(
        row=HnMessage(
            id=thread.id,
            thread_id=thread.id,
            content_type="thread",
            author=thread.author,
            text=thread.text,
            url=thread.url,
            created_at=thread.created_at,
        ),
    )
    # Declare thread topic rows
    for topic in thread_topics:
        targets.topics.declare_row(
            row=HnTopic(
                topic=topic,
                message_id=thread.id,
                thread_id=thread.id,
                content_type="thread",
                created_at=thread.created_at,
            ),
        )
    # Process comments
    for comment in thread.comments:
        comment_topics = await extract_topics(comment.text)

        targets.messages.declare_row(
            row=HnMessage(
                id=comment.id,
                thread_id=thread.id,
                content_type="comment",
                author=comment.author,
                text=comment.text,
                url="",
                created_at=comment.created_at,
            ),
        )
        for topic in comment_topics:
            targets.topics.declare_row(
                row=HnTopic(
                    topic=topic,
                    message_id=comment.id,
                    thread_id=thread.id,
                    content_type="comment",
                    created_at=comment.created_at,
                ),
            )
```

You *declare* what rows should exist — you don't write inserts or deletes. When this component finishes, CocoIndex diffs the declared rows against the previous run at the same component path and applies only the create/update/delete needed. If a thread drops out of the feed, the rows it owned are cleaned up automatically.
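The diffing step can be pictured as a comparison of two dictionaries keyed by primary key. This is a conceptual sketch only, with hypothetical ids and a hypothetical `diff_rows` helper; it is not how the engine is implemented.

```python
def diff_rows(
    previous: dict[str, str], declared: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Return (rows to upsert, keys to delete) that turn `previous` into `declared`."""
    upserts = {
        key: value
        for key, value in declared.items()
        if previous.get(key) != value
    }
    deletes = sorted(key for key in previous if key not in declared)
    return upserts, deletes


# Hypothetical message ids mapped to their text, for two consecutive runs.
previous_run = {"100": "old story", "101": "a comment", "102": "deleted later"}
current_run = {"100": "old story", "101": "an edited comment", "103": "new comment"}

upserts, deletes = diff_rows(previous_run, current_run)
```

Row `100` is untouched, `101` and `103` are upserted, and `102` is deleted, so the write volume tracks the size of the change, not the size of the table.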

**Why a component per thread?** A processing component groups one thread's work together with its target rows. Each one runs independently and in parallel, and its rows are committed to Postgres as soon as that thread is done — no waiting for the rest of the batch.

[→ Processing Component](https://cocoindex.io/docs/programming_guide/processing_component)

## Wire up the app

The main function mounts the two Postgres table targets, fetches the recent thread IDs, and fans out one `process_thread` component per thread with `mount_each`.

```python title="main.py"
@dataclass
class TableTargets:
    """Container for table targets."""

    messages: postgres.TableTarget[HnMessage]
    topics: postgres.TableTarget[HnTopic]


@coco.fn
async def app_main() -> None:
    """Main pipeline function."""
    # Set up table targets
    messages_table = await postgres.mount_table_target(
        PG_DB,
        table_name="hn_messages",
        table_schema=await postgres.TableSchema.from_class(
            HnMessage, primary_key=["id"]
        ),
        pg_schema_name="coco_examples",
    )
    topics_table = await postgres.mount_table_target(
        PG_DB,
        table_name="hn_topics",
        table_schema=await postgres.TableSchema.from_class(
            HnTopic, primary_key=["topic", "message_id"]
        ),
        pg_schema_name="coco_examples",
    )
    targets = TableTargets(messages=messages_table, topics=topics_table)

    # Fetch thread IDs from HackerNews
    async with aiohttp.ClientSession() as session:
        thread_ids = await fetch_thread_list(session)

    # Process threads (each component fetches its own thread data)
    await coco.mount_each(process_thread, ((tid, tid) for tid in thread_ids), targets)


app = coco.App(
    coco.AppConfig(name="HNTrendingTopics"),
    app_main,
)
```

`mount_each` takes one `(component_key, *args)` tuple per item, so each thread gets a stable component path keyed on its ID. The `TableSchema.from_class` calls derive the SQL columns straight from the dataclasses, and the Postgres pool is provided once in the lifespan:

```python title="main.py"
@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    builder.settings.db_path = pathlib.Path("./cocoindex.db")
    async with asyncpg.create_pool(DATABASE_URL) as pool:
        builder.provide(PG_DB, pool)
        yield
```

[→ App](https://cocoindex.io/docs/programming_guide/app)

## Setup

1. Install CocoIndex and dependencies:

    ```bash
    pip install 'cocoindex[postgres]>=1.0.7' asyncpg aiohttp litellm pydantic python-dotenv
    ```

2. Start Postgres if you don't have one running:

    ```bash
    docker compose -f dev/postgres.yaml up -d
    ```

3. Set your Postgres connection and LLM credentials (the default model is `gemini/gemini-2.5-flash`):

    ```bash
    export POSTGRES_URL="postgres://cocoindex:cocoindex@localhost/cocoindex"
    export GEMINI_API_KEY="your-api-key"
    # Optional: any litellm model id, then set the matching provider key
    # export LLM_MODEL="gemini/gemini-2.5-flash"
    ```

    You can also put these in a `.env` file in the example directory — `python main.py` loads it automatically.

## Run the pipeline

Build the index — a one-shot catch-up over the latest threads (this example doesn't use a live source):

```bash
cocoindex update main
```

CocoIndex will:

1. Fetch the most recent `MAX_THREADS` (default 10) story IDs from the Algolia HN API
2. Fetch each story and its comments, and run LLM topic extraction on every message
3. Write rows into `coco_examples.hn_messages` and `coco_examples.hn_topics`

Then explore the results. Show the top trending topics and drop into a search loop:

```bash
python main.py
```

Or jump straight to a topic search:

```bash
python main.py "rust"
```

The trending score is computed in SQL: a thread-level mention counts for more than a comment-level one, grouped by topic and ordered by score.
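
To make the weighting concrete, here is a minimal sketch of that kind of query. It runs against an in-memory SQLite table so it is self-contained; the `content_type` column and the 5-to-1 weights are assumptions for illustration, and the example's real Postgres query may differ.

```python
import sqlite3

# Assumed weights for illustration; the real example may use other values.
THREAD_WEIGHT = 5
COMMENT_WEIGHT = 1


def trending_topics(rows, limit=10):
    """Rank topics by weighted mentions.

    rows: iterable of (topic, message_id, content_type) tuples, where
    content_type is "thread" or "comment" (a hypothetical column).
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE hn_topics ("
            " topic TEXT, message_id TEXT, content_type TEXT,"
            " PRIMARY KEY (topic, message_id))"
        )
        conn.executemany("INSERT INTO hn_topics VALUES (?, ?, ?)", rows)
        # A thread-level mention adds THREAD_WEIGHT, a comment adds COMMENT_WEIGHT.
        return conn.execute(
            """
            SELECT topic,
                   SUM(CASE WHEN content_type = 'thread' THEN ? ELSE ? END) AS score
            FROM hn_topics
            GROUP BY topic
            ORDER BY score DESC, topic ASC
            LIMIT ?
            """,
            (THREAD_WEIGHT, COMMENT_WEIGHT, limit),
        ).fetchall()
    finally:
        conn.close()
```

With these weights, a topic named in one story title and one comment outranks a topic that only shows up in three comments.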

## Incremental updates

The real power shows when you run the pipeline again:

**Catch up on new stories:**

```bash
cocoindex update main
```

New threads in the feed get processed; threads already in the database reuse their committed rows. Because per-message extraction is tracked, the expensive LLM calls only run for content CocoIndex hasn't seen.

**Change the extraction logic:**

Edit the topic-extraction prompt or switch `LLM_MODEL`, then run `cocoindex update main` again. CocoIndex detects the changed logic and re-extracts, keeping the `hn_topics` table consistent with your new logic — no manual migration.

## Run it

Full source on GitHub:

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/hn_trending_topics)

If CocoIndex helps, give us a star ⭐ on [GitHub](https://github.com/cocoindex-io/cocoindex) and join the conversation on [Discord](https://discord.com/invite/zpA9S2DR7s) — we'd love to hear what you build.

---

# Example: Index Academic Papers

Source: https://cocoindex.io/docs/examples/paper-metadata/

![Index academic papers and extract metadata for AI agents with CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/paper-metadata/cover.png)

We'll take a folder of academic PDFs and pull out the parts you actually want to query — **title, authors, abstract** — as structured, typed rows. The first page of a paper holds almost all of this, so we read just that page, hand the text to an LLM with a strict schema, and get back clean JSON. The same metadata is then embedded so you can search papers by meaning, not just by exact words.

The whole pipeline is ordinary `async` Python and your own types. The heavy lifting — [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/), change tracking, managed targets — runs in a Rust engine underneath, so only changed PDFs get re-extracted and re-embedded. One file fans out into three Postgres tables — paper metadata, an author-to-paper index, and embeddings — and CocoIndex keeps all three in sync for you.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/paper_metadata)

## Flow overview

![CocoIndex paper metadata flow: walk a folder of PDFs, read the first page, LLM-extract title/authors/abstract into a typed model, embed the title and abstract chunks, and store metadata, authors, and embeddings in three Postgres tables](https://cocoindex.io/blobs/docs-v1/img/examples/paper-metadata/flow-v1.png)

From a high level, these are the steps:

1. Read PDF files from a local directory, watching it for changes when run in live mode.
2. Pull the first page out of each PDF, [extract its text](https://github.com/py-pdf/pypdf), and ask an LLM to return `title`, `authors`, and `abstract` as structured JSON.
3. Embed the title and the [abstract chunks](https://cocoindex.io/docs/ops/text/), then declare the metadata, the author index, and the [embeddings](https://cocoindex.io/docs/ops/sentence_transformers/) into Postgres (as [target states](https://cocoindex.io/docs/programming_guide/target_state/)).

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

## Setup

- A running Postgres with the [pgvector](https://github.com/pgvector/pgvector) extension. The repo ships a compose file:

  ```sh
  docker compose -f dev/postgres.yaml up -d
  export POSTGRES_URL="postgres://cocoindex:cocoindex@localhost/cocoindex"
  ```

- An [OpenAI API key](https://platform.openai.com/) for the extraction step:

  ```sh
  export OPENAI_API_KEY="your_key"
  ```

- Install CocoIndex and the dependencies this example uses:

  ```sh
  pip install -U "cocoindex[postgres,sentence_transformers]" asyncpg pgvector numpy pypdf openai pydantic python-dotenv
  ```

- A few PDFs to index. The example ships a `papers/` folder with a handful of well-known papers — or drop your own in.

## Define the schema you want back

Before touching the pipeline, pin down the shape of the metadata. These [Pydantic](https://docs.pydantic.dev/) models are what we ask the LLM to fill in — `model_validate_json` rejects anything that doesn't match, so a malformed response fails loudly instead of writing junk to the database.

```python title="models.py"
class AuthorModel(BaseModel):
    name: str
    email: str | None = None
    affiliation: str | None = None


class PaperMetadataModel(BaseModel):
    title: str
    authors: list[AuthorModel] = Field(default_factory=list)
    abstract: str
```

## Define the data and shared resources

Each output table maps to one dataclass: `PaperMetadataRow` is one row per paper, `AuthorPaperRow` is one row per (author, paper) pair — an index you can join against — and `MetadataEmbeddingRow` is one embedded chunk of text. `coco_lifespan` provides the [shared resources](https://cocoindex.io/docs/programming_guide/context/) every step needs — the Postgres connection pool and the embedding model — once at startup.

```python title="main.py"
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
PG_DB = coco.ContextKey[asyncpg.Pool]("paper_metadata_db")
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder", detect_change=True)


@dataclass
class PaperMetadataRow:
    filename: str
    title: str
    authors: list[dict[str, str | None]]
    abstract: str
    num_pages: int


@dataclass
class AuthorPaperRow:
    author_name: str
    filename: str


@dataclass
class MetadataEmbeddingRow:
    id: uuid.UUID
    filename: str
    location: str
    text: str
    embedding: Annotated[NDArray, EMBEDDER]


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    async with asyncpg.create_pool(os.environ["POSTGRES_URL"]) as pool:
        builder.provide(PG_DB, pool)
        builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
        yield
```

`embedding: Annotated[NDArray, EMBEDDER]` ties the vector column to the embedder, so its dimensions are inferred automatically — and if you swap the model later, CocoIndex notices (`detect_change=True`) and re-embeds.
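
If `Annotated` metadata is new to you, the sketch below shows the general mechanism a framework can use to read it back. This is not CocoIndex's implementation: `VectorSpec`, `Row`, and `vector_dims` are hypothetical names, and a plain `list[float]` stands in for `NDArray`. The 384 is the output size of `all-MiniLM-L6-v2`.

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class VectorSpec:
    """Hypothetical marker standing in for the embedder: carries the dimension."""

    dim: int


@dataclass
class Row:
    text: str
    embedding: Annotated[list[float], VectorSpec(dim=384)]


def vector_dims(cls) -> dict[str, int]:
    """Return {field_name: dimension} for fields annotated with VectorSpec."""
    dims: dict[str, int] = {}
    # include_extras=True keeps the Annotated metadata instead of stripping it.
    for name, hint in get_type_hints(cls, include_extras=True).items():
        if get_origin(hint) is Annotated:
            for meta in get_args(hint)[1:]:
                if isinstance(meta, VectorSpec):
                    dims[name] = meta.dim
    return dims
```

The column type and its vector size live next to the field they describe, so there is no separate schema file to keep in sync.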

## Read the first page and extract metadata

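Three small functions do the extraction. `extract_basic_info` counts the pages and slices the first page out of the PDF, returning both in a small `PaperBasicInfo` record (`num_pages`, `first_page`). `pdf_to_markdown` pulls the text off that page (plain text from pypdf's `extract_text`, despite the function's name), and `extract_metadata` hands it to the LLM with a strict instruction to return only the three fields we want.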

```python title="main.py"
@coco.fn
def extract_basic_info(content: bytes) -> PaperBasicInfo:
    """Count the pages and re-serialize page one as a standalone PDF."""
    reader = PdfReader(io.BytesIO(content))
    output = io.BytesIO()
    writer = PdfWriter()
    writer.add_page(reader.pages[0])  # raises IndexError on a zero-page PDF
    writer.write(output)
    return PaperBasicInfo(num_pages=len(reader.pages), first_page=output.getvalue())


@coco.fn
def pdf_to_markdown(content: bytes) -> str:
    """Return the first page's plain text, or "" if there is no page or no text."""
    reader = PdfReader(io.BytesIO(content))
    return (reader.pages[0].extract_text() if reader.pages else "") or ""


@coco.fn
def extract_metadata(markdown: str) -> PaperMetadataModel:
    response = openai_client().chat.completions.create(
        model=LLM_MODEL,
        messages=[
            {"role": "system", "content": (
                "You extract metadata from academic paper first pages. "
                "Return only JSON with keys: title, authors, abstract. "
                "authors is a list of {name, email, affiliation}. "
                "Use null for missing fields."
            )},
            {"role": "user", "content": markdown[:4000]},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    content = response.choices[0].message.content
    if not content:
        raise RuntimeError("LLM returned empty content.")
    return PaperMetadataModel.model_validate_json(content)
```

Only the first page is read, and the prompt is capped at 4,000 characters (`markdown[:4000]`). That is almost always enough to cover the title block and abstract, and it keeps the token cost bounded regardless of how long the paper is. `response_format={"type": "json_object"}` constrains the reply to valid JSON, `temperature=0` makes it as repeatable as the model allows (it does not guarantee identical output across calls), and `PaperMetadataModel.model_validate_json` parses it straight into our typed model.
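
The character cap is easy to isolate and test on its own. A minimal sketch, assuming a hypothetical `build_messages` helper (the example builds the message list inline instead) and an abbreviated system prompt:

```python
# Hypothetical helper; the example builds this list inline in extract_metadata.
SYSTEM_PROMPT = (
    "You extract metadata from academic paper first pages. "
    "Return only JSON with keys: title, authors, abstract."
)
MAX_PROMPT_CHARS = 4000


def build_messages(
    first_page_text: str, max_chars: int = MAX_PROMPT_CHARS
) -> list[dict[str, str]]:
    """Build the chat messages, truncating the page text to max_chars."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        # Slicing never raises: shorter text passes through unchanged.
        {"role": "user", "content": first_page_text[:max_chars]},
    ]
```

Because the slice is by characters rather than tokens, the token count is only bounded approximately, which is usually all you need for cost control.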

## Process a file

![One processing component per PDF: read the first page, LLM-extract metadata, embed the title and abstract chunks, and declare rows into three Postgres tables](https://cocoindex.io/blobs/docs-v1/img/examples/paper-metadata/stage-file-process.png)

`process_file` runs once per PDF and ties the steps together. It extracts the metadata, then declares the rows: one metadata row, one author-index row per author, and one embedding row for the title plus one for each abstract chunk.

```python title="main.py"
@coco.fn(memo=True)
async def process_file(
    file: FileLike,
    metadata_table: postgres.TableTarget[PaperMetadataRow],
    author_table: postgres.TableTarget[AuthorPaperRow],
    embedding_table: postgres.TableTarget[MetadataEmbeddingRow],
) -> None:
    content = await file.read()
    basic_info = extract_basic_info(content)
    first_page_md = pdf_to_markdown(basic_info.first_page)
    metadata = extract_metadata(first_page_md)

    metadata_table.declare_row(
        row=PaperMetadataRow(
            filename=str(file.file_path.path),
            title=metadata.title,
            authors=[a.model_dump() for a in metadata.authors],
            abstract=metadata.abstract,
            num_pages=basic_info.num_pages,
        ),
    )

    for author in metadata.authors:
        if author.name:
            author_table.declare_row(
                row=AuthorPaperRow(
                    author_name=author.name,
                    filename=str(file.file_path.path),
                ),
            )

    title_embedding = await coco.use_context(EMBEDDER).embed(metadata.title)
    embedding_table.declare_row(
        row=MetadataEmbeddingRow(
            id=uuid.uuid4(), filename=str(file.file_path.path),
            location="title", text=metadata.title, embedding=title_embedding,
        ),
    )

    abstract_chunks = _abstract_splitter.split(
        metadata.abstract, chunk_size=500, min_chunk_size=200,
        chunk_overlap=150, language="abstract",
    )
    for chunk in abstract_chunks:
        embedding_table.declare_row(
            row=MetadataEmbeddingRow(
                id=uuid.uuid4(), filename=str(file.file_path.path),
                location="abstract", text=chunk.text,
                embedding=await coco.use_context(EMBEDDER).embed(chunk.text),
            ),
        )
```

[`@coco.fn`](https://cocoindex.io/docs/programming_guide/function/) with [`memo=True`](https://cocoindex.io/docs/advanced_topics/memoization_keys/) is what makes this incremental: if a PDF's content and this function's code are both unchanged, the whole file is skipped on the next run — so you never pay for the LLM call or the embeddings on a PDF you've already processed. We embed the title as one row and the abstract as a few overlapping chunks (a [`RecursiveSplitter`](https://cocoindex.io/docs/ops/text/) tuned to break on sentence boundaries), and `location` marks which is which so a search can tell a title hit from an abstract hit. `table.declare_row` declares each row as a target state; CocoIndex handles inserting, updating, or deleting it to match.
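
As a mental model for `memo=True`, the toy below skips a call when both the function's code and its input bytes are unchanged. It is a conceptual sketch only, not how CocoIndex implements memoization: `FingerprintMemo` is a hypothetical class, and hashing the bytecode is just a rough proxy for "the function's code".

```python
import hashlib


class FingerprintMemo:
    """Toy memo table: skip a call when code and input are both unchanged."""

    def __init__(self) -> None:
        self._results: dict[str, object] = {}
        self.calls = 0  # number of times the wrapped function actually ran

    @staticmethod
    def _fingerprint(fn, content: bytes) -> str:
        # Hash the compiled bytecode and constants together with the input.
        h = hashlib.sha256()
        h.update(fn.__code__.co_code)
        h.update(repr(fn.__code__.co_consts).encode())
        h.update(b"\0")
        h.update(content)
        return h.hexdigest()

    def run(self, fn, content: bytes):
        key = self._fingerprint(fn, content)
        if key not in self._results:
            self.calls += 1
            self._results[key] = fn(content)
        return self._results[key]


def count_bytes(content: bytes) -> int:
    return len(content)


memo = FingerprintMemo()
memo.run(count_bytes, b"%PDF-1.7 sample")  # runs the function
memo.run(count_bytes, b"%PDF-1.7 sample")  # skipped: same code, same bytes
```

Changing either the bytes or the function body produces a new fingerprint, which is why both a replaced PDF and an edited function trigger a re-run.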

## Define the main function

`app_main` wires the source to the targets. It mounts the three Postgres tables, walks the source directory for PDFs, and mounts one [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) per file.

```python title="main.py"
@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    metadata_table = await postgres.mount_table_target(
        PG_DB, table_name=TABLE_METADATA,
        table_schema=await postgres.TableSchema.from_class(
            PaperMetadataRow, primary_key=["filename"],
        ),
        pg_schema_name=PG_SCHEMA_NAME,   # "coco_examples_v1"
    )
    author_table = await postgres.mount_table_target(
        PG_DB, table_name=TABLE_AUTHOR_PAPERS,
        table_schema=await postgres.TableSchema.from_class(
            AuthorPaperRow, primary_key=["author_name", "filename"],
        ),
        pg_schema_name=PG_SCHEMA_NAME,
    )
    embedding_table = await postgres.mount_table_target(
        PG_DB, table_name=TABLE_EMBEDDINGS,
        table_schema=await postgres.TableSchema.from_class(
            MetadataEmbeddingRow, primary_key=["id"],
        ),
        pg_schema_name=PG_SCHEMA_NAME,
    )

    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.pdf"]),
        live=True,  # watch for changes; pass -L to `cocoindex update` to run live
    )
    await coco.mount_each(
        process_file, files.items(), metadata_table, author_table, embedding_table
    )


app = coco.App(
    coco.AppConfig(name="PaperMetadataV1"),
    app_main,
    sourcedir=pathlib.Path("./papers"),
)
```

Each `mount_table_target` creates and manages a Postgres table for you — schema, idempotent upserts, and orphan cleanup when a PDF disappears. Note the different primary keys: paper metadata is keyed by `filename`, the author index by the `(author_name, filename)` pair, and the embeddings by a generated `id`. `live=True` makes the [filesystem source](https://cocoindex.io/docs/connectors/localfs/) [watch for changes](https://cocoindex.io/docs/programming_guide/live_mode/), and `mount_each` runs one component per file so the engine can track and update each PDF independently while writing into all three tables.

> **No vector index here.** To keep the example minimal, this flow doesn't declare a vector index, so queries do a sequential scan — fine for a few papers. For a larger corpus, add one line — `embedding_table.declare_vector_index(column="embedding")` — exactly as the [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/) example does, and pgvector serves approximate-nearest-neighbor queries instead.

## Run the pipeline

Run the [`cocoindex` CLI](https://cocoindex.io/docs/cli/) to build and update the index. Choose catch-up (scan, sync, exit) or live (catch up, then keep watching):

```sh
# Catch-up run
cocoindex update main

# Live run: keep watching for file changes
cocoindex update -L main
```

## Query the index

Match user text against the embeddings with a plain SQL query, reusing the *same* embedder from the indexing flow so indexing and querying stay consistent.

```python title="main.py"
async def query_once(pool, embedder, query: str, *, top_k: int = 5) -> None:
    query_vec = await embedder.embed(query)
    async with pool.acquire() as conn:
        rows = await conn.fetch(
            f"""
            SELECT filename, location, text, embedding <=> $1 AS distance
            FROM "{PG_SCHEMA_NAME}"."{TABLE_EMBEDDINGS}"
            ORDER BY distance ASC
            LIMIT $2
            """,
            query_vec, top_k,
        )
    for r in rows:
        score = 1.0 - float(r["distance"])
        print(f"[{score:.3f}] {r['filename']} ({r['location']})")
        print(f"    {r['text']}")
        print("---")
```

The `<=>` operator is pgvector's cosine distance. We turn it into a similarity score and print the filename, whether the hit was a `title` or an `abstract` chunk, and the matching text. Run a search straight from the command line:

```bash
python main.py "graph neural networks"
```

With the sample papers indexed, the most semantically similar titles and abstracts come back ranked — even when they share none of the words in your query. That's the whole point of embedding the metadata.
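
The `1.0 - distance` conversion in `query_once` follows from how cosine distance is defined: it is one minus cosine similarity. A pure-Python sketch of the same arithmetic, using hypothetical helper names and plain lists instead of pgvector:

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def similarity_score(distance: float) -> float:
    """Invert the distance back into a similarity, as query_once does."""
    return 1.0 - distance
```

Distance ranges from 0 (same direction) to 2 (opposite direction), so the score ranges from 1 down to -1. Sorting by distance ascending is the same as sorting by score descending.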

## Incremental updates

CocoIndex keeps the three tables in sync with your PDFs and does the **minimum work** to get there. You never compute a diff or write update logic. Two pieces make this work. `@coco.fn(memo=True)` decides what to *recompute* — a PDF is skipped when its bytes and the function's code are both unchanged, so neither the LLM nor the embedder ever runs on an unchanged file. `mount_table_target` decides what to *write* — it upserts only the rows that actually changed and deletes rows whose source is gone, across all three tables.

- **A PDF is added** — only that file is read, extracted, and embedded; its metadata, author, and embedding rows are inserted. The rest is untouched.
- **A PDF is replaced** — it is re-extracted; the metadata row is updated, author rows are reconciled against the new author list, and the embeddings are recomputed.
- **A PDF is deleted** — all of its rows are removed from all three tables automatically.
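
The three cases above boil down to a diff between the rows a run declares and the rows already stored. The sketch below shows that idea with plain dicts keyed by primary key. It is a conceptual illustration under that assumption, not the engine's actual reconciliation code, and `reconcile` is a hypothetical name.

```python
def reconcile(existing: dict, declared: dict):
    """Diff declared rows against stored rows, both keyed by primary key.

    Returns (inserts, updates, deletes):
      inserts: declared rows with no stored counterpart
      updates: declared rows whose stored value differs
      deletes: keys that are stored but no longer declared
    """
    inserts = {k: v for k, v in declared.items() if k not in existing}
    updates = {
        k: v for k, v in declared.items() if k in existing and existing[k] != v
    }
    deletes = [k for k in existing if k not in declared]
    return inserts, updates, deletes
```

Rows that are declared again with the same value fall into none of the three buckets, which is why an unchanged paper costs no writes.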

The same machinery covers **logic** changes too: tweak the prompt, swap `gpt-4o` for another model, or change the embedding model, and CocoIndex compares the new output against what's already in Postgres and applies only the difference. A catch-up run (`cocoindex update main`) does this once and exits; live mode (`cocoindex update -L main`) keeps watching and applies each change with low latency.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/paper_metadata](https://github.com/cocoindex-io/cocoindex/tree/main/examples/paper_metadata). If you just want to search PDFs by meaning without the structured extraction, [Semantic Search over PDFs](https://github.com/cocoindex-io/cocoindex/tree/main/examples/pdf_embedding) chunks and embeds the full text instead; if you want the Markdown itself as the output, see [PDF → Markdown](https://cocoindex.io/docs/examples/pdf-to-markdown/).

Got a folder of papers, reports, or filings you want to turn into structured, searchable rows? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: Patient Intake Forms to Typed JSON with BAML

Source: https://cocoindex.io/docs/examples/patient-intake-baml/

![Patient Intake Forms to Typed JSON with BAML and CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/patient-intake-baml/cover.png)

We'll take a folder of patient intake forms — the messy, multi-section PDFs a clinic hands you on a clipboard — and turn each one into a clean, validated JSON record: demographics, insurance, medications, allergies, surgeries, consent. The hard part isn't reading the PDF; it's getting back data that *matches a schema* every time, so downstream code can trust it. We use [BAML](https://boundaryml.com/) to declare that schema and run a single type-safe extraction per form against a Gemini vision model.

The whole pipeline is ordinary `async` Python. You write a [BAML schema](https://docs.boundaryml.com/) for the `Patient` type and one extraction function, wrap it in a [CocoIndex function](https://cocoindex.io/docs/programming_guide/function/), and let the Rust engine underneath handle [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/) — only changed PDFs get re-extracted, and the LLM call (the one genuinely expensive step) is skipped entirely for forms you've already processed.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/patient_intake_extraction_baml)

## Flow overview

![CocoIndex patient intake extraction flow: walk a folder of PDF intake forms, run one BAML Gemini-vision extraction per form into a typed Patient model, dump it to JSON, and write one JSON file per form to a local directory](https://cocoindex.io/blobs/docs-v1/img/examples/patient-intake-baml/flow-v1.png)

From a high level, these are the steps:

1. Read PDF intake forms from a local directory.
2. [Extract a typed `Patient`](https://boundaryml.com/) from each PDF with one BAML function call to a Gemini vision model, then serialize it to JSON.
3. Write one JSON file per form to an output directory (as [target states](https://cocoindex.io/docs/programming_guide/target_state/)).

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

## Define the schema in BAML

The schema lives in `baml_src/patient.baml`, not in Python. You describe the shape of a patient record as BAML classes, and the same file holds the extraction function and the model client. BAML turns this into a strongly-typed Pydantic client, and at runtime it parses the model's output into that schema. The prompt asks the model to write `"N/A"` for required fields that the form leaves blank.

```baml title="baml_src/patient.baml"
class Patient {
  name string
  dob string
  gender string
  address Address
  phone string
  email string
  preferred_contact_method string
  emergency_contact Contact
  insurance Insurance?
  reason_for_visit string
  symptoms_duration string
  past_conditions Condition[]
  current_medications Medication[]
  allergies Allergy[]
  surgeries Surgery[]
  occupation string?
  pharmacy Pharmacy?
  consent_given bool
  consent_date string?
}

function ExtractPatientInfo(intake_form: pdf) -> Patient {
  client Gemini
  prompt #"
    Extract all patient information from the following intake form document.
    Please be thorough and extract all available information accurately.

    {{ _.role("user") }}
    {{ intake_form }}

    Fill in with "N/A" for required fields if the information is not available.

    {{ ctx.output_format }}
  "#
}
```

The function signature is the contract: `ExtractPatientInfo` takes a `pdf` and returns a `Patient`. `{{ ctx.output_format }}` is where BAML injects the schema into the prompt, and the `Gemini` client (declared in the same file, pointing at `gemini-2.5-flash`) reads PDFs natively as vision input — no separate parse or OCR step. Nested types like `Address`, `Insurance`, and the `Condition[]` / `Medication[]` lists are defined the same way; see the full `patient.baml` in the repo.

Running `baml generate` compiles this into a `baml_client/` package you import from Python — `b.ExtractPatientInfo(...)` and the `Patient` Pydantic model.

## Wrap BAML in a CocoIndex function

`extract_patient_info` is the single transform: PDF bytes in, a typed `Patient` out. BAML's `baml_py.Pdf.from_base64` takes the PDF as a base64-encoded string, so the raw bytes are encoded first, and the generated `b.ExtractPatientInfo` does the typed extraction.

```python title="main.py"
import base64
import pathlib

import cocoindex as coco
from cocoindex.resources.file import FileLike, PatternFilePathMatcher
from cocoindex.connectors import localfs
from baml_client import b
from baml_client.types import Patient
import baml_py


@coco.fn
async def extract_patient_info(content: bytes) -> Patient:
    """Extract patient information from PDF content using BAML."""
    pdf = baml_py.Pdf.from_base64(base64.b64encode(content).decode("utf-8"))
    return await b.ExtractPatientInfo(pdf)
```

The return type is `Patient` — the actual Pydantic class BAML generated, not a dict — so everything downstream is typed and validated. There's no prompt engineering or response parsing here; that all lives in the BAML schema, and the LLM call is one `await`.
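
The only non-obvious line in `extract_patient_info` is the encoding step, so here it is in isolation. A small stdlib-only sketch with a hypothetical helper name:

```python
import base64


def to_base64_text(content: bytes) -> str:
    """Encode raw bytes as a base64 string.

    b64encode returns bytes, so decode them to get a str.
    """
    return base64.b64encode(content).decode("utf-8")


encoded = to_base64_text(b"%PDF-1.7")
# Decoding restores the original bytes exactly.
restored = base64.b64decode(encoded)
```

Base64 output is pure ASCII, so decoding it as UTF-8 cannot fail.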

## Process a file

![One processing component per intake form: read the PDF, extract a typed Patient with BAML, serialize to JSON, and declare a JSON file into the output directory](https://cocoindex.io/blobs/docs-v1/img/examples/patient-intake-baml/stage-file-process.png)

`process_patient_form` runs once per PDF. It reads the file, runs the BAML extraction, dumps the typed `Patient` to JSON, and declares one output file named after the source form.

```python title="main.py"
@coco.fn(memo=True)
async def process_patient_form(file: FileLike, outdir: pathlib.Path) -> None:
    """Process a patient intake form PDF and extract structured information."""
    content = await file.read()
    patient_info = await extract_patient_info(content)
    patient_json = patient_info.model_dump_json(indent=2)
    output_filename = file.file_path.path.stem + ".json"
    localfs.declare_file(
        outdir / output_filename, patient_json, create_parent_dirs=True
    )
```

[`@coco.fn`](https://cocoindex.io/docs/programming_guide/function/) with [`memo=True`](https://cocoindex.io/docs/advanced_topics/memoization_keys/) is what makes this incremental: if a PDF's content and this function's code are both unchanged, the whole file is skipped on the next run — so you never pay for a second Gemini call on a form you've already extracted. `patient_info.model_dump_json(indent=2)` serializes the validated model, and `localfs.declare_file` declares the JSON file as a [target state](https://cocoindex.io/docs/programming_guide/target_state/); CocoIndex handles writing, updating, or deleting it to match.
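
One thing to keep in mind with `stem + ".json"` naming: the output name drops the source directory, so two forms with the same file name in different subfolders would map to the same JSON file. Whether that can happen depends on how the source directory is laid out and walked, so treat it as an assumption to check. The helper names below are hypothetical.

```python
import pathlib
from collections import defaultdict


def output_name(source: pathlib.PurePath) -> str:
    """Mirror the example's naming rule: file stem plus .json."""
    return source.stem + ".json"


def find_collisions(sources):
    """Group source paths by output name and keep names claimed more than once."""
    by_name = defaultdict(list)
    for src in sources:
        by_name[output_name(src)].append(src)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}
```

If collisions are possible in your data, one option is to build the output path from the relative source path instead of the stem alone.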

## Define the main function

`app_main` wires the source to the target. It walks the source directory for PDFs and mounts one [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) per file with `mount_each`.

```python title="main.py"
@coco.fn
async def app_main(sourcedir: pathlib.Path, outdir: pathlib.Path) -> None:
    """Main application function that processes patient intake forms."""
    files = localfs.walk_dir(
        sourcedir,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.pdf"]),
    )
    await coco.mount_each(process_patient_form, files.items(), outdir)


app = coco.App(
    coco.AppConfig(name="PatientIntakeExtractionBaml"),
    app_main,
    sourcedir=pathlib.Path("./data/patient_forms"),
    outdir=pathlib.Path("./output_patients"),
)
```

The [filesystem source](https://cocoindex.io/docs/connectors/localfs/) walks `data/patient_forms/` for files matching `**/*.pdf`, and `mount_each` runs one component per form so the engine can track and update each independently. `coco.App` binds the main function to its arguments (the source and output directories) to form a runnable unit.

## Setup

- Install CocoIndex and the dependencies this example uses (BAML ships the client generator and the Python runtime):

  ```sh
  pip install -U cocoindex baml-py pydantic python-dotenv
  ```

- Generate the BAML client from the schema. This compiles `baml_src/patient.baml` into the `baml_client/` package that `main.py` imports:

  ```sh
  baml generate
  ```

- The extraction uses a Gemini vision model. Put your key in a `.env` file in the example directory (it's auto-loaded when you run):

  ```sh
  echo "GEMINI_API_KEY=your_api_key_here" > .env
  ```

- A few intake forms to extract. The example ships a `data/patient_forms/` folder with a handful of artificial PDFs — or drop your own in.

## Run the pipeline

Run the [`cocoindex` CLI](https://cocoindex.io/docs/cli/) to run the extraction and keep the output directory up to date. A catch-up run scans the source, extracts, writes, and exits:

```sh
cocoindex update main.py
```

This reads each PDF in `data/patient_forms/`, extracts a `Patient`, and writes one JSON file per form to `output_patients/`. Check the output:

```sh
ls output_patients/
# Patient_Intake_Form_David_Artificial.json
# Patient_Intake_Form_Emily_Artificial.json
# ...one .json per intake PDF
```

Each file is a schema-validated patient record with the same shape every time (required fields the form left blank come back as `"N/A"`), ready to load into a database, a chart, or another pipeline.

## Incremental updates

CocoIndex keeps the output in sync with your intake forms and does the **minimum work** to get there. You never compute a diff or write update logic. Two pieces make this work. `@coco.fn(memo=True)` decides what to *recompute* — a form is skipped when its bytes and the function's code are both unchanged, so Gemini never re-extracts an unchanged PDF. `localfs.declare_file` decides what to *write* — the engine compares declared output files against what's on disk and applies only the difference.

- **A form is added** — only that PDF is extracted; its JSON file is written. The rest is untouched.
- **A form is replaced** — it is re-extracted and its JSON file is rewritten; every other form is left alone.
- **A form is deleted** — its JSON file is removed from the output directory automatically.

The same machinery covers **logic** changes too: edit the BAML schema (add a field, tighten a type) or swap the model, run `baml generate` again, and the next `cocoindex update main.py` re-extracts and rewrites — applying only the difference against what's already there.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/patient_intake_extraction_baml](https://github.com/cocoindex-io/cocoindex/tree/main/examples/patient_intake_extraction_baml). The exact same task — intake PDFs to a typed `Patient` — has a [DSPy twin](https://github.com/cocoindex-io/cocoindex/tree/main/examples/patient_intake_extraction_dspy) that swaps BAML for a DSPy signature and module, so you can compare the two structured-extraction libraries side by side on one flow.

Got a stack of forms, invoices, or reports you want to turn into validated records? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: Patient Intake Forms to Typed JSON with DSPy

Source: https://cocoindex.io/docs/examples/patient-intake-dspy/

![Patient Intake Extraction with DSPy on CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/patient-intake-dspy/cover.png)

We'll take a folder of patient intake PDFs — names, addresses, insurance, medications, allergies, consent — and turn each one into a clean, validated JSON record. The hard part isn't the file plumbing; it's that intake forms are *visual*: checkboxes, hand-filled fields, tables of medications. So instead of extracting text, we render each PDF page to an image and let a vision model read the form the way a person would. [DSPy](https://github.com/stanfordnlp/dspy) handles the prompting — you declare a typed `Signature` and it produces a [Pydantic](https://docs.pydantic.dev/) `Patient`, no prompt strings to hand-tune.

The whole pipeline is ordinary `async` Python and your own types. The heavy lifting — [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/), change tracking, managed targets — runs in a Rust engine underneath, so only changed forms get re-extracted, and each one becomes exactly one JSON file on disk.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/patient_intake_extraction_dspy)

## Flow overview

![CocoIndex patient intake flow: walk a folder of PDF intake forms, render each page to an image, extract a typed Patient with a DSPy ChainOfThought vision module on Gemini, and write one JSON file per form to a local directory](https://cocoindex.io/blobs/docs-v1/img/examples/patient-intake-dspy/flow-v1.png)

From a high level, these are the steps:

1. Read PDF intake forms from a local directory.
2. [Render each page to an image](https://pymupdf.readthedocs.io/) with PyMuPDF, then extract a typed `Patient` with a [DSPy `ChainOfThought`](https://github.com/stanfordnlp/dspy) vision module on Gemini.
3. Write one JSON file per form to a local directory (as [target states](https://cocoindex.io/docs/programming_guide/target_state/)).

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

## Setup

- A [Gemini API key](https://aistudio.google.com/apikey) — the extraction runs on a Gemini vision model. The example auto-loads a `.env` file:

  ```sh
  echo "GEMINI_API_KEY=your_api_key_here" > .env
  ```

- Install CocoIndex and the dependencies this example uses (DSPy for the extraction, PyMuPDF to rasterize PDFs, Pillow for the image bytes DSPy passes along):

  ```sh
  pip install -U cocoindex dspy-ai pymupdf pillow pydantic python-dotenv
  ```

- A few intake PDFs to extract. The example ships a `data/patient_forms/` folder with a handful of artificial forms — or drop your own in.

## Define the output schema

The output is just Python types. `Patient` is a Pydantic model that describes everything we want off the form, with nested models for the pieces that have structure of their own — address, insurance, medications, allergies, surgeries. This *is* the contract: DSPy fills it in, Pydantic validates it, and it serializes straight to JSON.

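```python title="models.py"
import datetime

from pydantic import BaseModel, Field


class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str


class Medication(BaseModel):
    name: str
    dosage: str


# Contact, Insurance, Condition, Allergy, and Surgery are nested models
# of the same kind; they are omitted from this excerpt.

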
class Patient(BaseModel):
    """Complete patient information extracted from intake form."""

    name: str
    dob: datetime.date
    gender: str
    address: Address
    phone: str
    email: str
    emergency_contact: Contact
    insurance: Insurance | None = None
    reason_for_visit: str
    past_conditions: list[Condition] = Field(default_factory=list)
    current_medications: list[Medication] = Field(default_factory=list)
    allergies: list[Allergy] = Field(default_factory=list)
    surgeries: list[Surgery] = Field(default_factory=list)
    consent_given: bool
    consent_date: str | None = None
```

Optional fields (`insurance`, `consent_date`) and `default_factory=list` collections mean a form that doesn't mention medications produces an empty list, not a failure — the model bends to whatever the form actually contains.

## Declare the extraction with DSPy

Rather than write a prompt, you declare a [DSPy `Signature`](https://github.com/stanfordnlp/dspy): a list of form-page images comes in, a typed `Patient` comes out. `ChainOfThought` wraps it so the model reasons before it answers, which helps on dense, checkbox-heavy forms. DSPy compiles the typed in/out into the actual prompt for you.

```python title="main.py"
class PatientExtractionSignature(dspy.Signature):
    """Extract structured patient information from a medical intake form image."""

    form_images: list[dspy.Image] = dspy.InputField(
        desc="Images of the patient intake form pages"
    )
    patient: Patient = dspy.OutputField(
        desc="Extracted patient information with all available fields filled"
    )


class PatientExtractor(dspy.Module):
    """DSPy module for extracting patient information from intake form images."""

    def __init__(self) -> None:
        super().__init__()
        self.extract = dspy.ChainOfThought(PatientExtractionSignature)

    def forward(self, form_images: list[dspy.Image]) -> Patient:
        result = self.extract(form_images=form_images)
        return result.patient
```

The model is configured once at module load — `dspy.configure(lm=dspy.LM("gemini/gemini-2.5-flash"))` — so the same vision LM is reused for every form. Because the `OutputField` is typed as `Patient`, DSPy asks the model for that exact shape and hands you back a validated Pydantic object, not a string to parse.

## Render the PDF and extract

`extract_patient` is the one custom transform. It rasterizes every page of the PDF to a PNG with PyMuPDF (at 2× scale, so small print stays legible), wraps each page as a `dspy.Image`, and runs the extractor. No text extraction, no Markdown conversion — the model reads the rendered form directly.

```python title="main.py"
@coco.fn
def extract_patient(pdf_content: bytes) -> Patient:
    """Extract patient information from PDF content."""
    form_images = []
    # The context manager closes the document even if rendering raises.
    with pymupdf.open(stream=pdf_content, filetype="pdf") as pdf_doc:
        for page in pdf_doc:
            # Render at 2x scale so small print stays legible.
            pix = page.get_pixmap(matrix=pymupdf.Matrix(2, 2))
            img_bytes = pix.tobytes("png")
            form_images.append(dspy.Image(img_bytes))

    # One ChainOfThought call sees every page of the form at once.
    extractor = PatientExtractor()
    return extractor(form_images=form_images)
```

[`@coco.fn`](https://cocoindex.io/docs/programming_guide/function/) makes this a CocoIndex function so the engine can track it. Rendering at `Matrix(2, 2)` matters: forms are full of small, hand-entered text, and the extra resolution is the difference between the model reading a zip code and guessing one.
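
To put numbers on what the 2× matrix buys, here is a small arithmetic sketch that needs no PDF library. It assumes PDF page sizes are measured in points (72 per inch) and that an unscaled render produces one pixel per point; `rendered_size` is a hypothetical helper for illustration, not part of PyMuPDF or CocoIndex.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def rendered_size(width_pt: float, height_pt: float, zoom: float) -> tuple[int, int, float]:
    """Return (width_px, height_px, dpi) for a page rendered with a uniform zoom."""
    return round(width_pt * zoom), round(height_pt * zoom), POINTS_PER_INCH * zoom


# US Letter is 612 x 792 points.
print(rendered_size(612, 792, 1))  # (612, 792, 72)
print(rendered_size(612, 792, 2))  # (1224, 1584, 144)
```

Doubling the zoom quadruples the pixel count, so the trade-off is legibility against image size sent to the model.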

## Process a file

![One processing component per PDF: render to images, extract a Patient with DSPy, and declare a JSON file into the output directory](https://cocoindex.io/blobs/docs-v1/img/examples/patient-intake-dspy/stage-file-process.png)

`process_patient_form` runs once per PDF. It reads the file's bytes, extracts the `Patient`, serializes it to pretty-printed JSON, and declares one output file named after the source form.

```python title="main.py"
@coco.fn(memo=True)
async def process_patient_form(file: FileLike, outdir: pathlib.Path) -> None:
    """Process a patient intake form PDF and extract structured information."""
    content = await file.read()
    patient_info = extract_patient(content)
    patient_json = patient_info.model_dump_json(indent=2)
    output_filename = file.file_path.path.stem + ".json"
    localfs.declare_file(
        outdir / output_filename, patient_json, create_parent_dirs=True
    )
```

[`@coco.fn`](https://cocoindex.io/docs/programming_guide/function/) with [`memo=True`](https://cocoindex.io/docs/advanced_topics/memoization_keys/) is what makes this incremental: if a form's content and this function's code are both unchanged, the whole file is skipped on the next run — so you never re-run the (paid, slow) vision extraction on a form you've already processed. `localfs.declare_file` declares the JSON as a [target state](https://cocoindex.io/docs/programming_guide/target_state/); CocoIndex writes, rewrites, or deletes it to match.
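
The skip rule can be pictured with a toy memo table. This is only a conceptual sketch of "same content and same code means no rerun"; it is not how CocoIndex implements memoization, and `CODE_VERSION`, `fingerprint`, and `process` are all hypothetical names.

```python
import hashlib

CODE_VERSION = "v1"  # stand-in for a fingerprint of the function's code
_memo: dict[str, str] = {}  # file name -> fingerprint seen on the last run
calls: list[str] = []  # records which files actually reached the expensive step


def fingerprint(content: bytes, code_version: str) -> str:
    """Combine the input bytes and the code version into one digest."""
    return hashlib.sha256(code_version.encode() + b"\0" + content).hexdigest()


def process(name: str, content: bytes) -> bool:
    """Run the expensive step only if content or code changed. Returns True if it ran."""
    fp = fingerprint(content, CODE_VERSION)
    if _memo.get(name) == fp:
        return False  # unchanged: skip
    calls.append(name)  # the paid vision call would happen here
    _memo[name] = fp
    return True


print(process("a.pdf", b"form A"))   # True  (first run)
print(process("a.pdf", b"form A"))   # False (unchanged, skipped)
print(process("a.pdf", b"form A2"))  # True  (content changed)
```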

## Define the main function

`app_main` wires the source to the target. It walks the source directory for PDFs and mounts one [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) per file.

```python title="main.py"
@coco.fn
async def app_main(sourcedir: pathlib.Path, outdir: pathlib.Path) -> None:
    """Main application function that processes patient intake forms."""
    files = localfs.walk_dir(
        sourcedir,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.pdf"]),
    )
    await coco.mount_each(process_patient_form, files.items(), outdir)


app = coco.App(
    coco.AppConfig(name="PatientIntakeExtractionDSPy"),
    app_main,
    sourcedir=pathlib.Path("./data/patient_forms"),
    outdir=pathlib.Path("./output_patients"),
)
```

[`walk_dir`](https://cocoindex.io/docs/connectors/localfs/) scans the [filesystem source](https://cocoindex.io/docs/connectors/localfs/) for `*.pdf` files, and [`mount_each`](https://cocoindex.io/docs/programming_guide/processing_component/) runs one component per file so the engine can track and update them independently. Each component owns exactly one output JSON, so the mapping from form to record is one-to-one — and when a form disappears, its JSON is cleaned up automatically.

## Run the pipeline

Use the [`cocoindex` CLI](https://cocoindex.io/docs/cli/) to run the pipeline once. It scans the forms, extracts each one, writes the JSON files, and exits:

```sh
cocoindex update main.py
```

Each PDF in `data/patient_forms/` becomes a JSON file in `output_patients/`, named after the source form:

```sh
ls output_patients/
# Patient_Intake_Form_David_Artificial.json
# Patient_Intake_Form_Emily_Artificial.json
# ...
```

Open one and you'll see the full `Patient` record — name, date of birth, address, insurance, the medication and allergy lists, consent — extracted straight from the rendered form and validated against the schema.
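
The naming rule is plain `pathlib`: take the stem and append `.json`. The sketch below assumes `file.file_path.path` behaves like a standard `pathlib` path; `output_name` is a hypothetical helper that mirrors the expression in `process_patient_form`.

```python
import pathlib


def output_name(source_path: str) -> str:
    """Map a source PDF path to its JSON output file name."""
    return pathlib.PurePosixPath(source_path).stem + ".json"


print(output_name("Patient_Intake_Form_David_Artificial.pdf"))
# Patient_Intake_Form_David_Artificial.json
print(output_name("clinic_a/form.pdf"))
# form.json
```

Because only the stem is used, two forms with the same file name in different subfolders would map to the same output name under this rule, so keep form names unique if you nest directories.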

## Incremental updates

CocoIndex keeps the output JSON in sync with your forms and does the **minimum work** to get there. You never compute a diff or write update logic. Two pieces make this work. `@coco.fn(memo=True)` decides what to *recompute* — a form is skipped when its bytes and the function's code are both unchanged, so the vision model never re-reads a form you've already extracted. `mount_each` decides what to *write* — each component owns one JSON file, so the engine creates, rewrites, or deletes exactly the files that changed.

- **A form is added** — only that PDF is rendered and extracted; its JSON is written. The rest is untouched.
- **A form is replaced** — it is re-rendered and re-extracted, and its single JSON is rewritten.
- **A form is deleted** — its JSON is removed from the output directory automatically.

The same machinery covers **logic** changes too: add a field to the `Patient` model or switch the LM, and CocoIndex re-runs the extraction and rewrites the affected JSON. A catch-up run (`cocoindex update main.py`) does this once and exits.
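
As a mental model for the three cases above, you can think of the engine comparing what was declared last run with what is declared now. The sketch below is illustrative only: `plan_sync` is a hypothetical function, and the real engine tracks state internally rather than diffing dictionaries like this.

```python
def plan_sync(previous: dict[str, str], declared: dict[str, str]) -> dict[str, list[str]]:
    """Compare last run's outputs with the newly declared ones.

    Both arguments map output file name -> file content.
    """
    return {
        "create": sorted(declared.keys() - previous.keys()),
        "rewrite": sorted(
            name
            for name in declared.keys() & previous.keys()
            if declared[name] != previous[name]
        ),
        "delete": sorted(previous.keys() - declared.keys()),
    }


previous = {"david.json": "{...v1}", "emily.json": "{...v1}", "old.json": "{...}"}
declared = {"david.json": "{...v1}", "emily.json": "{...v2}", "new.json": "{...}"}
print(plan_sync(previous, declared))
# {'create': ['new.json'], 'rewrite': ['emily.json'], 'delete': ['old.json']}
```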

## Run it

The full, runnable example is in the CocoIndex repo: [examples/patient_intake_extraction_dspy](https://github.com/cocoindex-io/cocoindex/tree/main/examples/patient_intake_extraction_dspy). Prefer to define your schema and extraction in a typed DSL instead of Python? The twin example [Patient Intake Extraction with BAML](https://github.com/cocoindex-io/cocoindex/tree/main/examples/patient_intake_extraction_baml) runs the exact same flow with [BAML](https://boundaryml.com/) doing the extraction. For a Postgres-backed structured-extraction flow with embeddings, see [Paper Metadata](https://github.com/cocoindex-io/cocoindex/tree/main/examples/paper_metadata).

Got a stack of forms, invoices, or scanned records you want as clean structured data? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: Postgres as a Source

Source: https://cocoindex.io/docs/examples/postgres-source/

![Postgres as a source with CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/postgres-source/cover.png)

Most data already lives in a database. This example takes an existing Postgres table of products, reads it row by row, derives a couple of fields, [embeds](https://cocoindex.io/docs/ops/sentence_transformers/) each row, and writes the result — including the vector — back into Postgres with [pgvector](https://github.com/pgvector/pgvector). Point it at any table and you have a semantic index over your structured data, kept in sync as the table changes.

The whole pipeline is ordinary `async` Python and your own types. The heavy lifting — [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/), change tracking, managed targets — runs in a Rust engine underneath, so only the rows that changed get re-embedded and re-upserted.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/postgres_source)

## Flow overview

![CocoIndex Postgres source flow: read product rows from a Postgres table, derive fields and embed each row, and store the vectors in Postgres with pgvector](https://cocoindex.io/blobs/docs-v1/img/examples/postgres-source/flow-v1.png)

From a high level, these are the steps:

1. Read product rows from an existing Postgres table with [`PgTableSource`](https://cocoindex.io/docs/connectors/postgres/).
2. For each row, derive a description and a `total_value`, then [embed](https://cocoindex.io/docs/ops/sentence_transformers/) the description.
3. Store the enriched rows and their embeddings in Postgres (as [target states](https://cocoindex.io/docs/programming_guide/target_state/)).

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

> **New to embeddings?** An [*embedding*](https://cocoindex.io/docs/ops/sentence_transformers/) is a list of numbers (a vector) that captures the *meaning* of a piece of text, so rows with similar meaning land close together in vector space. A [*vector index*](https://cocoindex.io/docs/common_resources/vector_schema/) stores those vectors and finds the nearest ones to your query fast. That's what lets search match by meaning instead of exact words.

## Setup

- A running Postgres with the [pgvector](https://github.com/pgvector/pgvector) extension. The same instance can hold both the source table and the target — or set `SOURCE_DATABASE_URL` to read from a separate database.

  ```sh
  export POSTGRES_URL="postgres://cocoindex:cocoindex@localhost/cocoindex"
  export SOURCE_DATABASE_URL="postgres://cocoindex:cocoindex@localhost/cocoindex"
  ```

- Install CocoIndex and the dependencies this example uses:

  ```sh
  pip install -U "cocoindex[postgres,sentence_transformers]" asyncpg pgvector numpy python-dotenv
  ```

- A source table to read from. Create `source_products` with the sample rows from the repo:

  ```sh
  psql "$SOURCE_DATABASE_URL" -f ./prepare_source_data.sql
  ```

## Define the data and shared resources

[Apps](https://cocoindex.io/docs/programming_guide/app/) are the top-level runnable unit in CocoIndex. Before the App, we set up the data shapes and the [shared resources](https://cocoindex.io/docs/programming_guide/context/) the rest of the code builds on. `SourceProduct` describes one row read from the source table; `OutputProduct` describes one row written to the target, with the two derived fields and the embedding vector. `coco_lifespan` provides everything every step needs — a Postgres pool for the target, a pool for the source, and the embedding model — once at startup.

```python title="main.py"
DATABASE_URL = os.getenv("POSTGRES_URL", "postgres://cocoindex:cocoindex@localhost/cocoindex")
SOURCE_DATABASE_URL = os.getenv("SOURCE_DATABASE_URL", DATABASE_URL)
TABLE_NAME = "output"
PG_SCHEMA_NAME = "coco_examples_v1"

EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
PG_DB = coco.ContextKey[asyncpg.Pool]("postgres_source_db")
SOURCE_POOL = coco.ContextKey[asyncpg.Pool]("source_pool")
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder", detect_change=True)


@dataclass
class SourceProduct:
    product_category: str
    product_name: str
    description: str
    price: float
    amount: int


@dataclass
class OutputProduct:
    product_category: str
    product_name: str
    description: str
    price: float
    amount: int
    total_value: float
    embedding: Annotated[NDArray, EMBEDDER]


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    async with (
        asyncpg.create_pool(DATABASE_URL) as target_pool,
        asyncpg.create_pool(SOURCE_DATABASE_URL) as source_pool,
    ):
        builder.provide(PG_DB, target_pool)
        builder.provide(SOURCE_POOL, source_pool)
        builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
        yield
```

`embedding: Annotated[NDArray, EMBEDDER]` ties the vector column to the embedder, so its dimensions are inferred automatically — and if you swap the model later, CocoIndex notices (`detect_change=True`) and re-embeds.
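
If `Annotated` is new to you, the sketch below shows the general Python mechanism for attaching metadata to a field type and reading it back. It uses only the standard library; the `"embedder"` string is a stand-in for a context key, and how CocoIndex itself consumes the annotation is not shown here.

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_type_hints


@dataclass
class Row:
    name: str
    embedding: Annotated[list[float], "embedder"]  # "embedder" is stand-in metadata


def annotated_metadata(cls: type) -> dict[str, tuple]:
    """Return {field: metadata} for every field declared with Annotated."""
    hints = get_type_hints(cls, include_extras=True)
    return {
        field: get_args(hint)[1:]
        for field, hint in hints.items()
        if hasattr(hint, "__metadata__")
    }


print(annotated_metadata(Row))  # {'embedding': ('embedder',)}
```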

## Process a row

![One processing component per row: each source row is derived and embedded, producing an OutputProduct row written to Postgres](https://cocoindex.io/blobs/docs-v1/img/examples/postgres-source/stage-file-process.png)

`process_product` runs once per source row. It builds a `full_description` from the category, name, and body, computes `total_value`, embeds the description, and declares the target row.

```python title="main.py"
@coco.fn(memo=True)
async def process_product(
    product: SourceProduct,
    table: postgres.TableTarget[OutputProduct],
) -> None:
    full_description = f"Category: {product.product_category}\nName: {product.product_name}\n\n{product.description}"
    total_value = product.price * product.amount
    embedding = await coco.use_context(EMBEDDER).embed(full_description)
    table.declare_row(
        row=OutputProduct(
            product_category=product.product_category,
            product_name=product.product_name,
            description=product.description,
            price=product.price,
            amount=product.amount,
            total_value=total_value,
            embedding=embedding,
        ),
    )
```

We embed the composed description rather than the raw body, so the category and name carry weight in the vector — a search for "wireless audio" matches even when the body never says it. We use [`SentenceTransformerEmbedder`](https://cocoindex.io/docs/ops/sentence_transformers/) with `all-MiniLM-L6-v2`, a small, fast model that runs locally with no API key; there are 12k+ sentence-transformer models on [Hugging Face](https://huggingface.co/models?other=sentence-transformers), so swap in whichever you prefer.
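
To see exactly what text gets embedded, here is the derivation step on its own, with no database or model involved. The sample row is hypothetical (it is not taken from `prepare_source_data.sql`), and `derive` is a helper name introduced only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class SourceProduct:
    product_category: str
    product_name: str
    description: str
    price: float
    amount: int


def derive(product: SourceProduct) -> tuple[str, float]:
    """Return (full_description, total_value) using the same formulas as process_product."""
    full_description = (
        f"Category: {product.product_category}\n"
        f"Name: {product.product_name}\n\n"
        f"{product.description}"
    )
    return full_description, product.price * product.amount


# Hypothetical sample row for illustration.
sample = SourceProduct("Electronics", "Earbuds", "Noise-cancelling, 24h battery.", 25.0, 4)
text, total = derive(sample)
print(text)
print(total)  # 100.0
```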

[`@coco.fn`](https://cocoindex.io/docs/programming_guide/function/) with [`memo=True`](https://cocoindex.io/docs/advanced_topics/memoization_keys/) is what makes this incremental: if a row's content and this function's code are both unchanged, the row is skipped on the next run. `table.declare_row` declares the row as a [target state](https://cocoindex.io/docs/programming_guide/target_state/); CocoIndex handles inserting, updating, or deleting it to match.

## Define the main function

`app_main` wires the source to the target. It mounts the Postgres target table, opens the source table, and mounts one [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) per source row.

```python title="main.py"
@coco.fn
async def app_main() -> None:
    target_table = await postgres.mount_table_target(
        PG_DB,
        table_name=TABLE_NAME,
        table_schema=await postgres.TableSchema.from_class(
            OutputProduct,
            primary_key=["product_category", "product_name"],
        ),
        pg_schema_name=PG_SCHEMA_NAME,
    )

    source = postgres.PgTableSource(
        coco.use_context(SOURCE_POOL),
        table_name="source_products",
        row_type=SourceProduct,
    )

    await coco.mount_each(
        process_product,
        source.fetch_rows().items(lambda p: (p.product_category, p.product_name)),
        target_table,
    )


app = coco.App(
    coco.AppConfig(name="PostgresSourceV1"),
    app_main,
)
```

[`PgTableSource`](https://cocoindex.io/docs/connectors/postgres/) reads the table — passing `row_type=SourceProduct` maps each row straight into the dataclass and selects exactly its fields. `fetch_rows().items(...)` streams rows over a cursor and tags each one with a [stable key](https://cocoindex.io/docs/programming_guide/processing_component/), here the `(product_category, product_name)` composite primary key. `mount_table_target` creates and manages the Postgres target table for you: schema, idempotent upserts, and orphan cleanup when a source row disappears. `mount_each` runs one component per row so the engine can track and update them independently.
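
Why the key matters is easiest to see on a tiny example. This sketch uses a hypothetical `Product` dataclass and `product_key` function that mirror the lambda passed to `items(...)`; it only illustrates key identity, not the connector itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    product_category: str
    product_name: str
    price: float


def product_key(p: Product) -> tuple[str, str]:
    """Composite key: the same fields as the source table's primary key."""
    return (p.product_category, p.product_name)


original = Product("Audio", "Earbuds", 25.0)
repriced = Product("Audio", "Earbuds", 19.0)
renamed = Product("Audio", "Earbuds Pro", 25.0)

print(product_key(original) == product_key(repriced))  # True
print(product_key(original) == product_key(renamed))   # False
```

A price change keeps the key, so the row keeps its identity and is updated in place. Changing a key field produces a different key, so under this model the old row becomes an orphan and the renamed one is treated as new.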

## Run the pipeline

Run the [`cocoindex` CLI](https://cocoindex.io/docs/cli/) to build and update the index. The Postgres source runs as a one-shot catch-up — it scans the source table, syncs the target, and exits:

```sh
cocoindex update main
```

## Query the index

Match user text against the index with a plain SQL query, reusing the *same* embedder from the indexing flow so indexing and querying stay consistent.

```python title="main.py"
async def query_once(pool, embedder, query: str, *, top_k: int = TOP_K) -> None:
    query_vec = await embedder.embed(query)
    async with pool.acquire() as conn:
        rows = await conn.fetch(
            f"""
            SELECT
                product_category, product_name, description,
                amount, total_value,
                embedding <=> $1 AS distance
            FROM "{PG_SCHEMA_NAME}"."{TABLE_NAME}"
            ORDER BY distance ASC
            LIMIT $2
            """,
            query_vec, top_k,
        )
    for r in rows:
        score = 1.0 - float(r["distance"])
        print(f"[{score:.3f}] {r['product_category']} | {r['product_name']} | {r['amount']} | {r['total_value']}")
        print(f"    {r['description']}")
        print("---")
```

The `<=>` operator is pgvector's cosine distance. We turn it into a similarity score and print the derived fields alongside the description. Run a search straight from the command line:

```bash
python main.py "wireless headphones"
```

The most semantically similar products come back ranked — even when they share none of the words in your query. That's the whole point of a vector index.
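
The distance-to-score conversion in `query_once` is easier to trust once you have seen it on toy vectors. Below is a pure-Python cosine distance, written as a sketch of the quantity `<=>` computes; the two-dimensional vectors are made up for illustration and real embeddings have hundreds of dimensions.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]  # longer, but pointing the same way
orthogonal = [0.0, 3.0]      # unrelated direction

print(1.0 - cosine_distance(query, same_direction))  # 1.0
print(1.0 - cosine_distance(query, orthogonal))      # 0.0
```

Vector length does not affect the score, only direction does, which is why ordering by distance ascending puts the closest meanings first.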

## Incremental updates

CocoIndex keeps the target in sync with the source table and does the **minimum work** to get there. You never compute a diff or write update logic: the source row changes, and CocoIndex works out exactly what to re-embed, upsert, and delete. Two pieces make this work. `@coco.fn(memo=True)` decides what to *recompute* — a row is skipped when its content and the function's code are both unchanged. `mount_table_target` decides what to *write* — each output row's primary key is derived from the source row's `(product_category, product_name)`, so it upserts only the rows that actually changed and deletes rows whose source is gone.

- **A row is added** — only that row is derived and embedded, and it is inserted. The rest is untouched.
- **A row is edited** — it is re-derived; if the embedded description changed it is re-embedded, and the target row is updated in place.
- **A row is deleted** — its row is removed from the target automatically.

The same machinery covers **logic** changes too: tweak how `full_description` is composed or swap the embedding model, and CocoIndex compares the new output against what is already in Postgres and applies only the difference. Each `cocoindex update main` does this once and exits; re-run it after the source table changes to bring the index back in sync.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/postgres_source](https://github.com/cocoindex-io/cocoindex/tree/main/examples/postgres_source). If this is useful, a ⭐ on [GitHub](https://github.com/cocoindex-io/cocoindex) helps, and the [Discord](https://discord.com/invite/zpA9S2DR7s) is the place to ask questions and share what you build.

---

# Example: Transform a Folder of Files

Source: https://cocoindex.io/docs/examples/files-transform/

![Transform a folder of files with CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/files-transform/cover.png)

We'll take a folder of Markdown files and render each one to HTML, writing the results to a second folder that stays in sync with the source. No database, no embeddings, no API keys — just files in, files out. It's the smallest complete CocoIndex pipeline, and the cleanest way to see the **source → transform → target** shape that every larger example is built from.

The transform is your own ordinary `async` function. The heavy lifting — [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/), change tracking, watching the directory, and keeping the output folder in sync — runs in a Rust engine underneath, so only the files that actually changed get re-rendered and re-written.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/files_transform)

## Flow overview

![CocoIndex files transform flow: watch a directory of Markdown, render each file to HTML with markdown-it-py, and write the .html outputs to a local folder](https://cocoindex.io/blobs/docs-v1/img/examples/files-transform/flow-v1.png)

From a high level, these are the steps:

1. Read Markdown files from a local directory, [watching for changes](https://cocoindex.io/docs/programming_guide/live_mode/).
2. Render each file to HTML with [markdown-it-py](https://github.com/executablebooks/markdown-it-py).
3. Write each `.html` file to an output folder (as [target states](https://cocoindex.io/docs/programming_guide/target_state/)) on the [local filesystem](https://cocoindex.io/docs/connectors/localfs/).

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

## Process a file

![One processing component per file: each Markdown file is rendered to HTML and written as a file target on the local filesystem](https://cocoindex.io/blobs/docs-v1/img/examples/files-transform/stage-file-process.png)

`process_file` runs once per file. It reads the Markdown, renders it to HTML, derives an output name from the source path, and declares the output file as a target state.

```python title="main.py"
import pathlib

import cocoindex as coco
from cocoindex.resources.file import FileLike, PatternFilePathMatcher
from cocoindex.connectors import localfs
from markdown_it import MarkdownIt

_markdown_it = MarkdownIt("gfm-like")


@coco.fn(memo=True)
async def process_file(file: FileLike, outdir: pathlib.Path) -> None:
    html = _markdown_it.render(await file.read_text())
    outname = "__".join(file.file_path.path.parts) + ".html"
    localfs.declare_file(outdir / outname, html, create_parent_dirs=True)
```

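The transform itself is two steps in a single line: read the text, then render it. The output name joins the source path parts with `__` and appends `.html`, so `subdir/file.md` becomes `subdir__file.md.html`. This gives every file a flat name in the output folder, and names stay distinct unless a file or directory name already contains `__` or begins or ends with `_`.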
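
Running the naming expression on a couple of paths shows exactly which names come out. The sketch assumes `file.file_path.path` behaves like a relative `pathlib` path; `html_name` is a hypothetical helper wrapping the same expression used in `process_file`.

```python
import pathlib


def html_name(relative_path: str) -> str:
    """Flatten a relative source path into a single output file name."""
    return "__".join(pathlib.PurePosixPath(relative_path).parts) + ".html"


print(html_name("intro.md"))        # intro.md.html
print(html_name("subdir/file.md"))  # subdir__file.md.html
```

Note that the original `.md` suffix is kept, since the expression joins the full path parts rather than the stem.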

[`localfs.declare_file`](https://cocoindex.io/docs/connectors/localfs/) declares the `.html` file as a [target state](https://cocoindex.io/docs/programming_guide/target_state/) on the local filesystem. You describe the file you *want to exist*; CocoIndex handles writing it, overwriting it when the content changes, and deleting it when the source Markdown is gone.

[`@coco.fn`](https://cocoindex.io/docs/programming_guide/function/) with [`memo=True`](https://cocoindex.io/docs/advanced_topics/memoization_keys/) is what makes this incremental: if a file's content and this function's code are both unchanged, the whole file is skipped on the next run, and its HTML output is left exactly as it is.

## Define the main function

`app_main` wires the source to the target. It walks the source directory for Markdown files and mounts one [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) per file.

```python title="main.py"
@coco.fn
async def app_main(sourcedir: pathlib.Path, outdir: pathlib.Path) -> None:
    files = localfs.walk_dir(
        sourcedir,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]),
        live=True,
    )
    await coco.mount_each(process_file, files.items(), outdir)
```

[`walk_dir`](https://cocoindex.io/docs/connectors/localfs/) lists the source folder, filtered to `*.md` by the [`PatternFilePathMatcher`](https://cocoindex.io/docs/connectors/localfs/). `live=True` makes the [filesystem source](https://cocoindex.io/docs/connectors/localfs/) [watch for changes](https://cocoindex.io/docs/programming_guide/live_mode/), and [`mount_each`](https://cocoindex.io/docs/programming_guide/processing_component/) runs one component per file so the engine can track and update each one independently — add, edit, or delete a Markdown file and only that file's HTML moves.

## Create the App

Bind `app_main` into a [`coco.App`](https://cocoindex.io/docs/programming_guide/app/), pointing it at the source folder and the output folder.

```python title="main.py"
app = coco.App(
    coco.AppConfig(name="FilesTransform"),
    app_main,
    sourcedir=pathlib.Path("./data"),
    outdir=pathlib.Path("./output_html"),
)
```

That is the entire pipeline — about 25 lines.

## Setup

- No external services required. Install CocoIndex and markdown-it-py:

  ```sh
  pip install -U cocoindex "markdown-it-py[linkify,plugins]"
  ```

- A few `.md` files to convert. Grab the [sample files](https://github.com/cocoindex-io/cocoindex/tree/main/examples/files_transform/data) from the repo, or drop your own notes into a `data/` directory.

## Run the pipeline

Run the [`cocoindex` CLI](https://cocoindex.io/docs/cli/) to build the output folder. Choose catch-up (scan, sync, exit) or live (catch up, then keep watching):

```sh
# Catch-up run
cocoindex update main

# Live run: keep watching for file changes
cocoindex update -L main
```

The converted files appear in `./output_html/`, one `.html` per source `.md`.

## Incremental updates

CocoIndex keeps the output folder in sync with your source files and does the **minimum work** to get there. You never compute a diff or write update logic: you change something, and CocoIndex works out exactly what to re-render and re-write. Two pieces make this work. `@coco.fn(memo=True)` decides what to *recompute* — a file is skipped when its content and the function's code are both unchanged. `localfs.declare_file` decides what to *write* — the output file is created, overwritten, or deleted to match the declared target state.

- **A file is added** — only that file is rendered, and its `.html` is written. The rest is untouched.
- **A file is edited** — it is re-rendered and its `.html` is overwritten in place.
- **A file is deleted** — its `.html` output is removed from the target folder automatically.

The same machinery covers **logic** changes too: change the markdown-it preset or the output naming, and CocoIndex compares the new output against what is already on disk and applies only the difference. A catch-up run (`cocoindex update main`) does this once and exits; live mode (`cocoindex update -L main`) keeps watching and applies each change with low latency.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/files_transform](https://github.com/cocoindex-io/cocoindex/tree/main/examples/files_transform). This is the minimal building block — once it clicks, swap the transform for chunking and embedding and you have [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/), or point the same flow at a Postgres or vector target.

If this helped, [give CocoIndex a star on GitHub](https://github.com/cocoindex-io/cocoindex) and come say hi in our [Discord](https://discord.com/invite/zpA9S2DR7s) — we'd love to see what you build.

---

# Example: Consume Kafka into LanceDB

Source: https://cocoindex.io/docs/examples/kafka-to-lancedb/

![Kafka in, LanceDB out — one message at a time, routed by shape, with CocoIndex](https://cocoindex.io/blobs/docs-v1/img/examples/kafka-to-lancedb/cover.png)

We'll take a live [Kafka](https://kafka.apache.org/) topic of JSON messages and fan them into [LanceDB](https://lancedb.com/) tables — each message parsed, inspected, and routed to the table that matches its shape. A message with a `sku` field becomes a row in `products`; one with an `emp_id` field becomes a row in `employees`. This is the consumer side of [csv-to-kafka](https://github.com/cocoindex-io/cocoindex/tree/main/examples/csv_to_kafka): the same declarative flow that *produced* the topic now *consumes* it.

The whole pipeline is ordinary `async` Python. Kafka is just a [source](https://cocoindex.io/docs/connectors/kafka/) you treat as a keyed map, and each LanceDB table is a [target](https://cocoindex.io/docs/connectors/lancedb/) you declare rows on. CocoIndex's Rust engine does the [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/) underneath: it consumes one message per processing component, writes the row, and only then commits the Kafka offset — so a crash mid-flight replays from the last durably-written message, with no consumer loop and no offset bookkeeping in your code.
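
The write-then-commit ordering is what makes a crash safe, and a toy simulation shows why. This is a conceptual sketch with an in-memory list standing in for the topic and a dict standing in for the table; `consume` and `crash_before` are hypothetical, and none of this is the connector's actual code.

```python
def consume(messages, committed, table, crash_before=None):
    """Process messages from `committed` onward and return the new committed offset.

    Each message is a (key, value) pair. The offset advances only after the
    row is written, so a crash leaves `committed` at the last written message.
    """
    for offset in range(committed, len(messages)):
        if offset == crash_before:
            return committed  # simulated crash: nothing written for this offset
        key, value = messages[offset]
        table[key] = value      # 1. write the row (keyed, so a replay is harmless)
        committed = offset + 1  # 2. only then advance the offset
    return committed


messages = [("p1", "widget"), ("p2", "gadget"), ("p3", "gizmo")]
table: dict[str, str] = {}

committed = consume(messages, 0, table, crash_before=2)
print(committed, table)  # 2 {'p1': 'widget', 'p2': 'gadget'}

committed = consume(messages, committed, table)
print(committed, table)  # 3 {'p1': 'widget', 'p2': 'gadget', 'p3': 'gizmo'}
```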

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/kafka_to_lancedb)

## Flow overview

![CocoIndex Kafka → LanceDB flow: subscribe a topic as a keyed map, run one process_message component per message that parses the JSON and dispatches by shape, and declare each row on the products or employees LanceDB table](https://cocoindex.io/blobs/docs-v1/img/examples/kafka-to-lancedb/flow-v1.png)

From a high level, these are the steps:

1. Subscribe to a Kafka topic as a [live keyed map](https://cocoindex.io/docs/connectors/kafka/) — each message is an item keyed by its Kafka message key.
2. For each message, decode the value and `json.loads` it into a row dict.
3. Dispatch by shape: a `sku` field declares a `Product` [row](https://cocoindex.io/docs/programming_guide/target_state/) on the `products` table; an `emp_id` field declares an `Employee` row on the `employees` table.

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

> **Why two tables from one topic?** A topic is often a firehose of heterogeneous events — orders, users, inventory, whatever a service emits — sharing a transport but not a schema. The consumer's job is to *sort the mail*: read each envelope, decide what it is, and put it where it belongs. Branching on a discriminator field (`sku` vs `emp_id` here, but just as easily an `event_type` or a [JSON Schema](https://json-schema.org/) `$id`) and declaring a typed row is the same pattern whether the destination is LanceDB, Postgres, or a vector index.

## Setup

- A running Kafka broker with a topic to consume. Any broker the [`confluent_kafka`](https://github.com/confluentinc/confluent-kafka-python) client can reach works: a local one at `localhost:9092`, or a managed one like [StreamNative](https://streamnative.io/) with SASL. The easiest way to populate the topic is to run [csv-to-kafka](https://github.com/cocoindex-io/cocoindex/tree/main/examples/csv_to_kafka) first.

- Install CocoIndex with the Kafka and LanceDB extras:

  ```sh
  pip install -U "cocoindex[kafka,lancedb]"
  ```

- A local directory for LanceDB (`./lancedb_data` by default). LanceDB is embedded — there's no server to run; the tables are just files on disk.

## Shared resources: the LanceDB connection

The LanceDB connection is opened once at app startup in a [`lifespan`](https://cocoindex.io/docs/programming_guide/context/) hook and stashed in a [`ContextKey`](https://cocoindex.io/docs/programming_guide/context/), so the rest of the pipeline can grab it without threading it through every call:

```python title="main.py"
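from collections.abc import AsyncIterator

import cocoindex as coco
from cocoindex.connectors import kafka, lancedb

# LANCEDB_URI is the path to the LanceDB directory (./lancedb_data by default).
LANCE_DB = coco.ContextKey[lancedb.LanceAsyncConnection]("kafka_to_lancedb_db")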


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    conn = await lancedb.connect_async(LANCEDB_URI)
    builder.provide(LANCE_DB, conn)
    yield
```

The `ContextKey` does double duty later: CocoIndex's state store identifies each table by *which key the connection was anchored to* plus the table name — so pointing `LANCEDB_URI` at a new path is what gives you a fresh database, and reusing the same path reconnects to the existing tables without re-ingesting anything.

## Define the row schemas

Each table has a typed row. These are plain dataclasses — CocoIndex maps them to the LanceDB/PyArrow column types for you:

```python title="main.py"
@dataclass
class Product:
    sku: str
    name: str
    category: str
    price: float


@dataclass
class Employee:
    emp_id: str
    first_name: str
    last_name: str
    department: str
    email: str
```

The dataclass *is* the schema. When we mount the table below, [`TableSchema.from_class`](https://cocoindex.io/docs/connectors/lancedb/) reads these fields and their types to build the table, with the primary key you nominate. A `Product` is keyed by `sku`, an `Employee` by `emp_id` — the same primary keys that keyed the Kafka messages on the way in.
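
To make "the dataclass is the schema" concrete, here is a minimal stdlib-only sketch of reading column names and types off a dataclass. `describe_schema` and the `_COLUMN_TYPES` table are hypothetical illustrations, not what `TableSchema.from_class` does internally; the real mapping to LanceDB/PyArrow types lives in CocoIndex.

```python
from dataclasses import dataclass, fields
from typing import get_type_hints


@dataclass
class Product:
    sku: str
    name: str
    category: str
    price: float


# Hypothetical mapping from Python types to column type names (illustration only).
_COLUMN_TYPES = {str: "string", float: "float64", int: "int64", bool: "bool"}


def describe_schema(cls: type, primary_key: list[str]) -> dict:
    """Return column names, column types, and the primary key for a dataclass."""
    hints = get_type_hints(cls)  # resolves annotations even when they are strings
    columns = {f.name: _COLUMN_TYPES[hints[f.name]] for f in fields(cls)}
    missing = [k for k in primary_key if k not in columns]
    if missing:
        raise ValueError(f"primary key fields not in {cls.__name__}: {missing}")
    return {"columns": columns, "primary_key": list(primary_key)}


schema = describe_schema(Product, primary_key=["sku"])
print(schema)
```

The point of the sketch is only that field order, names, and types are all recoverable from the class, so there is no second schema definition to keep in sync.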

## Process a message

![One process_message component per Kafka message, fanned out with mount_each: each message is parsed and dispatched by shape to the products or employees LanceDB table](https://cocoindex.io/blobs/docs-v1/img/examples/kafka-to-lancedb/stage-file-process.png)

`process_message` runs once per message. It decodes the value, parses the JSON, and dispatches on shape — declaring a typed row on whichever table matches:

```python title="main.py"
@coco.fn
async def process_message(
    msg: Message,
    products_table: lancedb.TableTarget[Product],
    employees_table: lancedb.TableTarget[Employee],
) -> None:
    value = msg.value()
    if value is None:
        return
    text = value.decode() if isinstance(value, bytes) else value
    row = json.loads(text)

    if "sku" in row:
        products_table.declare_row(
            row=Product(**{**row, "price": float(row["price"])}),
        )
    elif "emp_id" in row:
        employees_table.declare_row(row=Employee(**row))
```

Each message runs as its own [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) (mounted below), so the engine tracks each one independently. The component owns whichever row it declares; when its offset is committed, that row is durably in LanceDB. The `value()` may be `bytes` or `str` depending on the broker, so we normalize before `json.loads`. A message that matches neither shape declares nothing — it's quietly skipped, no row, no error.

## Declare rows, not writes

The line worth pausing on is `declare_row` — deliberately *not* `insert()` or `upsert()`.

```python
products_table.declare_row(row=Product(...))
```

CocoIndex is [state-driven](https://cocoindex.io/docs/programming_guide/core_concepts/): like a spreadsheet cell or a SQL materialized view, you describe what the row *should be* as a function of the source, and the engine figures out the transition. You don't write separate insert / update / delete code paths. Each time a component runs, the engine compares the rows it declares with `declare_row(row=r)` against the rows it declared on its previous run:

- **the primary key is new, or the row changed** → it's **upserted** into the table.
- **the primary key was declared before but isn't this time** → that row is **removed**.
- **the same row is declared with the same values** → **nothing is written.** No round-trip, no rewrite.

It's the same `declare_*` shape as the [Postgres target](https://cocoindex.io/docs/connectors/postgres/) and the [Kafka target](https://cocoindex.io/docs/connectors/kafka/) on the producer side — the storage differs, the API doesn't, because the semantics are the same. The payoff: `process_message` is correct on the first message, every subsequent message, and after a crash-restart — there's no separate "initial load" versus "incremental update" path.
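
The three outcomes above amount to a diff between two sets of declared rows. Here is a minimal sketch of that reconciliation, assuming rows are plain dicts keyed by primary key; `reconcile` and `Plan` are hypothetical names for illustration and do not reflect how the Rust engine is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    upserts: dict[str, dict]  # key -> row to write
    deletes: list[str]        # keys to remove
    unchanged: list[str]      # keys that need no write


def reconcile(previous: dict[str, dict], declared: dict[str, dict]) -> Plan:
    """Diff the rows declared on this run against the rows declared last time."""
    upserts = {
        key: row
        for key, row in declared.items()
        if key not in previous or previous[key] != row
    }
    deletes = sorted(key for key in previous if key not in declared)
    unchanged = sorted(
        key for key, row in declared.items() if key in previous and previous[key] == row
    )
    return Plan(upserts=upserts, deletes=deletes, unchanged=unchanged)


previous = {"A1": {"price": 9.5}, "B2": {"price": 3.0}, "C3": {"price": 1.0}}
declared = {"A1": {"price": 9.5}, "B2": {"price": 3.5}, "D4": {"price": 7.0}}
plan = reconcile(previous, declared)
print(plan)
```

Here `B2` (changed) and `D4` (new) are written, `C3` (no longer declared) is removed, and `A1` costs nothing.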

## Define the main function

`app_main` wires the source to the targets. It mounts both LanceDB tables, subscribes the Kafka consumer, and mounts one component per message:

```python title="main.py"
@coco.fn
async def app_main() -> None:
    products_table = await lancedb.mount_table_target(
        LANCE_DB,
        table_name="products",
        table_schema=await lancedb.TableSchema.from_class(Product, primary_key=["sku"]),
    )

    employees_table = await lancedb.mount_table_target(
        LANCE_DB,
        table_name="employees",
        table_schema=await lancedb.TableSchema.from_class(
            Employee, primary_key=["emp_id"]
        ),
    )

    config: dict[str, str] = {
        "bootstrap.servers": KAFKA_BOOTSTRAP_SERVERS,
        "group.id": KAFKA_GROUP_ID,
        "enable.auto.commit": "false",
        "auto.offset.reset": "earliest",
    }

    consumer = AIOConsumer(config)
    items = kafka.topic_as_map(consumer, [KAFKA_TOPIC])
    await coco.mount_each(process_message, items, products_table, employees_table)


app = coco.App(coco.AppConfig(name="KafkaToLanceDB"), app_main)
```

Three things to notice:

1. `mount_table_target(...)` resolves the connection from the context key and creates the table from the dataclass schema — `products` keyed by `sku`, `employees` keyed by `emp_id`. The handle it returns is what you call `declare_row` on.
2. `enable.auto.commit` is **off** on purpose. CocoIndex commits each offset *after* the row is durably written, so the consumer group always resumes from the last message it actually persisted — at-least-once delivery without `__consumer_offsets` drifting ahead of your data. `auto.offset.reset="earliest"` means a fresh group reads the topic from the start.
3. [`topic_as_map`](https://cocoindex.io/docs/connectors/kafka/) treats the topic as a [live keyed map](https://cocoindex.io/docs/programming_guide/live_mode/): each message becomes an item keyed by its Kafka key, and a tombstone (null value) deletes that key's row. [`mount_each`](https://cocoindex.io/docs/programming_guide/app/) runs one `process_message` component per message.

That's the whole pipeline — one file.

## Run the pipeline

Copy `.env.example` to `.env`, point `KAFKA_TOPIC` at the topic csv-to-kafka produced (and fill in SASL creds if your broker needs them), then run the [`cocoindex` CLI](https://cocoindex.io/docs/cli/). Choose catch-up (drain what's there, then exit) or live (catch up, then keep consuming):

```sh
# Catch-up run: consume everything up to now, write the rows, then exit
cocoindex update main.py

# Live run: catch up, then keep consuming new messages as they arrive
cocoindex update -L main.py
```

Live mode differs from catch-up by **one flag**: `-L` on the CLI. `process_message` and the LanceDB tables don't change: the same reconciliation logic runs either way, and the flag only controls whether the app drains the current backlog and exits or keeps consuming. There's no separate "streaming" code path to maintain.

## Looking at the tables

After a run, the tables are just files under `./lancedb_data`. Open them with the LanceDB client to confirm the dispatch landed:

```python
import lancedb

db = lancedb.connect("./lancedb_data")

print("=== Products ===")
for row in db.open_table("products").to_arrow().to_pylist():
    print(row)

print("\n=== Employees ===")
for row in db.open_table("employees").to_arrow().to_pylist():
    print(row)
```

Every `sku` message is a row in `products`, every `emp_id` message a row in `employees` — keyed exactly as it was on the topic, so re-consuming the same key updates the row in place rather than duplicating it.

## Incremental updates

This is where the declarative model pays for itself. You never write upsert logic or track which messages you've already handled — CocoIndex consumes each message, writes the row, and commits the offset, in that order. It keeps an [internal state store](https://cocoindex.io/docs/advanced_topics/internal_storage/) plus the committed Kafka offsets, and both survive restarts, so stopping and restarting resumes cleanly from the last durably-written row.

- **A new message** — one upsert into the matching table.
- **A re-keyed message (same key, changed value)** — the existing row is updated in place.
- **A tombstone (null value) for a key** — that key's row is removed.
- **A message matching neither shape** — skipped; no row, no error.
- **Crash mid-flight** — replays from the last committed offset; rows already written aren't re-applied wastefully.

A catch-up run (`cocoindex update main.py`) drains the backlog once and exits; live mode (`cocoindex update -L main.py`) keeps consuming and applies each new message with sub-second latency.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/kafka_to_lancedb](https://github.com/cocoindex-io/cocoindex/tree/main/examples/kafka_to_lancedb). It pairs with the producer side — [csv-to-kafka](https://github.com/cocoindex-io/cocoindex/tree/main/examples/csv_to_kafka) turns a folder of CSV files into the very topic this example consumes, so you can run both and watch a row edited on disk land in the right LanceDB table.

Got a topic full of mixed events you want sorted into typed tables? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: Search Your AI Coding Sessions

Source: https://cocoindex.io/docs/examples/entire-session-search/

![Search your AI coding sessions with CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/entire-session-search/cover.png)

[Entire](https://entire.io) captures every AI coding session you run — the full conversation transcript, the prompt you started from, an AI-written context summary, and metadata like token counts and files touched — as checkpoints on disk. We'll take that folder of checkpoints and turn it into a [vector index](https://github.com/pgvector/pgvector) you can search in plain English: "how did I fix the auth bug" finds the right session even when it shares no keywords with what you typed.

The whole pipeline is ordinary `async` Python and your own types. The heavy lifting — [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/), change tracking, managed targets — runs in a Rust engine underneath, so each new session you capture only embeds what changed, and every kind of checkpoint file is parsed by the same `process_file` component.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/entire_session_search)

## Flow overview

![CocoIndex Entire session search flow: walk a folder of checkpoints, route each file by name through process_file, embed transcripts, prompts, and context summaries, and store the vectors in Postgres with pgvector alongside a metadata table](https://cocoindex.io/blobs/docs-v1/img/examples/entire-session-search/flow-v1.png)

From a high level, these are the steps:

1. Read Entire checkpoint files from a local directory (live).
2. Route each file by name: parse `full.jsonl` into per-turn transcript chunks, take `prompt.txt` whole, [split](https://cocoindex.io/docs/ops/text/) `context.md` into overlapping chunks, then [embed](https://cocoindex.io/docs/ops/sentence_transformers/) the text — while `metadata.json` becomes a structured row.
3. Store the embeddings and metadata in two Postgres tables (as [target states](https://cocoindex.io/docs/programming_guide/target_state/)).

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

> **New to embeddings?** An [*embedding*](https://cocoindex.io/docs/ops/sentence_transformers/) is a list of numbers (a vector) that captures the *meaning* of a piece of text, so passages with similar meaning land close together in vector space. A [*vector index*](https://cocoindex.io/docs/common_resources/vector_schema/) stores those vectors and finds the nearest ones to your query fast. That's what lets search match by meaning instead of exact words.

## Setup

- A running Postgres with the [pgvector](https://github.com/pgvector/pgvector) extension. The repo ships a compose file:

  ```sh
  docker compose -f dev/postgres.yaml up -d
  export POSTGRES_URL="postgres://cocoindex:cocoindex@localhost/cocoindex"
  ```

- Install CocoIndex and the dependencies this example uses:

  ```sh
  pip install -U "cocoindex[postgres,sentence_transformers]" asyncpg pgvector numpy python-dotenv
  ```

- Some Entire checkpoints to index. From any repo where [Entire](https://entire.io) is capturing sessions, check the checkpoint data out next to the example:

  ```sh
  git worktree add entire_checkpoints entire/checkpoints/v1
  ```

  Each session is laid out as `<checkpoint_id[:2]>/<checkpoint_id[2:]>/<session_idx>/` with `full.jsonl` (transcript), `prompt.txt` (initial prompt), `context.md` (AI-written summary), and `metadata.json` (token counts, files touched).
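
The checkpoint id and session index can be read back out of that layout. A minimal sketch, assuming paths relative to the checkpoint directory with `/` separators: `session_info_from_path` is a hypothetical helper written for this illustration, and the id `a1b2c3d4e5f6` is a made-up value.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class SessionInfo:
    checkpoint_id: str
    session_index: str


def session_info_from_path(relative_path: str) -> SessionInfo:
    """Parse '<id[:2]>/<id[2:]>/<session_idx>/<file>' into its parts."""
    parts = PurePosixPath(relative_path).parts
    if len(parts) < 4:
        raise ValueError(f"unexpected checkpoint path: {relative_path!r}")
    # Count from the end so a leading directory prefix does not matter.
    prefix, rest, session_index = parts[-4], parts[-3], parts[-2]
    return SessionInfo(checkpoint_id=prefix + rest, session_index=session_index)


info = session_info_from_path("a1/b2c3d4e5f6/0/full.jsonl")
print(info)
```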

## Define the data and shared resources

Each row of the embeddings table is one searchable piece of text — a transcript turn, a prompt, or a context chunk — tagged with its `content_type`, `role`, and the session it came from. The metadata table keeps one row per session for the structured fields. `coco_lifespan` provides the [shared resources](https://cocoindex.io/docs/programming_guide/context/) every step needs — the Postgres connection pool and the embedding model — once at startup.

```python title="main.py"
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
PG_DB = coco.ContextKey[asyncpg.Pool]("entire_session_db")
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder", detect_change=True)

_splitter = RecursiveSplitter()


@dataclass
class SessionEmbeddingRow:
    id: int
    checkpoint_id: str
    session_index: str
    content_type: str  # "transcript", "prompt", or "context"
    role: str  # "user", "assistant", or "" for non-transcript
    text: str
    embedding: Annotated[NDArray, EMBEDDER]


@dataclass
class SessionMetadataRow:
    checkpoint_id: str
    session_index: str
    prompt_summary: str
    total_tokens: int
    files_touched: str  # JSON array
    agent_percentage: float | None


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    async with asyncpg.create_pool(DATABASE_URL) as pool:
        builder.provide(PG_DB, pool)
        builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
        yield
```

`embedding: Annotated[NDArray, EMBEDDER]` ties the vector column to the embedder, so its dimensions are inferred automatically — and if you swap the model later, CocoIndex notices (`detect_change=True`) and re-embeds.

## Process a file

![One processing component per checkpoint file: process_file routes by filename, embeds transcript, prompt, and context text into the embeddings table, and writes one metadata row](https://cocoindex.io/blobs/docs-v1/img/examples/entire-session-search/stage-file-process.png)

`process_file` runs once per checkpoint file and routes on its name. The checkpoint id and session index come straight from the file's path, and a fresh `IdGenerator` numbers the rows this file produces.

```python title="main.py"
@coco.fn(memo=True)
async def process_file(
    file: FileLike,
    emb_table: postgres.TableTarget[SessionEmbeddingRow],
    meta_table: postgres.TableTarget[SessionMetadataRow],
) -> None:
    info = extract_session_info(file)
    filename = file.file_path.path.name
    id_gen = IdGenerator()

    if filename == "full.jsonl":
        content = await file.read_text()
        chunks = parse_transcript(content)
        await coco.map(
            process_chunk,
            [
                ChunkInput(text=c.text, content_type="transcript", role=c.role)
                for c in chunks
            ],
            info, id_gen, emb_table,
        )

    elif filename == "prompt.txt":
        text = (await file.read_text()).strip()
        if text:
            emb_table.declare_row(
                row=SessionEmbeddingRow(
                    id=await id_gen.next_id(text),
                    checkpoint_id=info.checkpoint_id,
                    session_index=info.session_index,
                    content_type="prompt",
                    role="user",
                    text=text,
                    embedding=await coco.use_context(EMBEDDER).embed(text),
                ),
            )

    elif filename == "context.md":
        text = (await file.read_text()).strip()
        if text:
            chunks = _splitter.split(
                text, chunk_size=2000, chunk_overlap=500, language="markdown"
            )
            await coco.map(
                process_chunk,
                [
                    ChunkInput(text=c.text, content_type="context", role="")
                    for c in chunks
                ],
                info, id_gen, emb_table,
            )

    elif filename == "metadata.json":
        meta = json.loads(await file.read_text())
        usage = meta.get("token_usage", {})
        agent_pct = meta.get("initial_attribution", {}).get("agent_percentage")
        meta_table.declare_row(
            row=SessionMetadataRow(
                checkpoint_id=info.checkpoint_id,
                session_index=info.session_index,
                prompt_summary=meta.get("summary", {}).get("intent", ""),
                total_tokens=(usage.get("input_tokens") or 0) + (usage.get("output_tokens") or 0),
                files_touched=json.dumps(meta.get("files_touched", [])),
                agent_percentage=float(agent_pct) if agent_pct is not None else None,
            ),
        )
```

The transcript and the context summary each fan out to many rows, so they map to `process_chunk`; the prompt is a single short string, so it's embedded inline; and the metadata file declares one row directly into the *other* table — three content types and a structured record, all from one component.

[`@coco.fn`](https://cocoindex.io/docs/programming_guide/function/) with [`memo=True`](https://cocoindex.io/docs/advanced_topics/memoization_keys/) is what makes this incremental: if a file's content and this function's code are both unchanged, it's skipped on the next run, so finished sessions are never re-embedded. `coco.map` fans out to one `process_chunk` call per chunk.
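
The skip rule ("content and code both unchanged") can be pictured as a fingerprint comparison. The sketch below is an in-process illustration only: `fingerprint` and `Memo` are hypothetical names, and hashing a function's bytecode is a crude stand-in assumed here, not a description of how CocoIndex fingerprints code.

```python
import hashlib
from typing import Callable


def fingerprint(fn: Callable[[str], object], content: str) -> str:
    """Hash a function's compiled code together with its input."""
    code = fn.__code__
    h = hashlib.sha256()
    h.update(code.co_code)  # bytecode
    h.update(repr((code.co_consts, code.co_names)).encode())  # literals and names used
    h.update(content.encode())
    return h.hexdigest()


class Memo:
    def __init__(self) -> None:
        self._seen: dict[str, str] = {}  # path -> fingerprint from the last run
        self.calls = 0

    def run(self, fn: Callable[[str], object], path: str, content: str) -> bool:
        """Call fn(content) unless neither fn nor content changed. True if it ran."""
        fp = fingerprint(fn, content)
        if self._seen.get(path) == fp:
            return False  # skipped: same code, same input
        fn(content)
        self.calls += 1
        self._seen[path] = fp
        return True


def embed_v1(text: str) -> str:
    return text.lower()


def embed_v2(text: str) -> str:
    return text.upper()


memo = Memo()
ran = [
    memo.run(embed_v1, "a1/b2/0/full.jsonl", "hello"),        # first run
    memo.run(embed_v1, "a1/b2/0/full.jsonl", "hello"),        # nothing changed
    memo.run(embed_v1, "a1/b2/0/full.jsonl", "hello again"),  # content changed
    memo.run(embed_v2, "a1/b2/0/full.jsonl", "hello again"),  # code changed
]
print(ran, memo.calls)
```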

## Process a chunk

`process_chunk` embeds one piece of text with the shared embedder and declares the target row. Both the transcript and the context paths funnel through it, carrying their own `content_type` and `role`.

```python title="main.py"
@coco.fn
async def process_chunk(
    chunk: ChunkInput,
    info: SessionInfo,
    id_gen: IdGenerator,
    emb_table: postgres.TableTarget[SessionEmbeddingRow],
) -> None:
    emb_table.declare_row(
        row=SessionEmbeddingRow(
            id=await id_gen.next_id(chunk.text),
            checkpoint_id=info.checkpoint_id,
            session_index=info.session_index,
            content_type=chunk.content_type,
            role=chunk.role,
            text=chunk.text,
            embedding=await coco.use_context(EMBEDDER).embed(chunk.text),
        ),
    )
```

We use [`SentenceTransformerEmbedder`](https://cocoindex.io/docs/ops/sentence_transformers/) with `all-MiniLM-L6-v2` — a small, fast model that runs locally with no API key. There are 12k+ sentence-transformer models on [Hugging Face](https://huggingface.co/models?other=sentence-transformers), so swap in whichever you prefer. `emb_table.declare_row` declares the row as a target state; CocoIndex handles inserting, updating, or deleting it to match. Each row's `id` is derived from the chunk text, so a turn that survives a re-parse keeps its row.
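
To see why text-derived ids let unchanged turns keep their rows, consider this sketch. `StableIdGenerator` is a hypothetical stand-in, not CocoIndex's `IdGenerator`; the hashing scheme and the occurrence counter for repeated text are assumptions made for illustration.

```python
import hashlib
from collections import Counter


class StableIdGenerator:
    """Derive an integer id from text, disambiguating repeated text by occurrence."""

    def __init__(self) -> None:
        self._seen: Counter[str] = Counter()

    def next_id(self, text: str) -> int:
        occurrence = self._seen[text]  # 0 the first time this text appears
        self._seen[text] += 1
        digest = hashlib.sha256(f"{occurrence}:{text}".encode()).digest()
        # Keep 63 bits so the id fits a signed 64-bit integer column.
        return int.from_bytes(digest[:8], "big") >> 1


first_parse = StableIdGenerator()
ids_before = [first_parse.next_id(t) for t in ["hi", "fix auth", "hi"]]

# Re-parse after the middle turn was edited; use a fresh generator per file.
second_parse = StableIdGenerator()
ids_after = [second_parse.next_id(t) for t in ["hi", "fix the auth bug", "hi"]]
print(ids_before, ids_after)
```

Only the edited turn gets a new id, so only that row needs a new embedding; the two identical "hi" turns stay distinct because the occurrence count is part of the hash.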

## Define the main function

![mount_each fans out one process_file component per checkpoint file, from the Entire filesystem source to the two Postgres tables](https://cocoindex.io/blobs/docs-v1/img/examples/entire-session-search/stage-main-function.png)

`app_main` wires the source to the targets. It mounts both Postgres tables, walks the checkpoint directory for the four file types, and mounts one [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) per file.

```python title="main.py"
@coco.fn
async def app_main(checkpoints_dir: pathlib.Path) -> None:
    emb_table = await postgres.mount_table_target(
        PG_DB,
        table_name=TABLE_EMBEDDINGS,
        table_schema=await postgres.TableSchema.from_class(
            SessionEmbeddingRow, primary_key=["id"],
        ),
        pg_schema_name=PG_SCHEMA_NAME,   # "entire"
    )

    meta_table = await postgres.mount_table_target(
        PG_DB,
        table_name=TABLE_METADATA,
        table_schema=await postgres.TableSchema.from_class(
            SessionMetadataRow, primary_key=["checkpoint_id", "session_index"],
        ),
        pg_schema_name=PG_SCHEMA_NAME,
    )

    files = localfs.walk_dir(
        checkpoints_dir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(
            included_patterns=[
                "**/full.jsonl", "**/prompt.txt",
                "**/context.md", "**/metadata.json",
            ],
        ),
        live=True,  # watch for changes; pass -L to `cocoindex update` to run live
    )
    await coco.mount_each(process_file, files.items(), emb_table, meta_table)


app = coco.App(
    coco.AppConfig(name="EntireSessionSearch"),
    app_main,
    checkpoints_dir=pathlib.Path("./entire_checkpoints"),
)
```

`mount_table_target` creates and manages each Postgres table for you: schema, idempotent upserts, and orphan cleanup when a session disappears. The `included_patterns` are what make one component handle four different files: every match flows through the same `process_file`, which routes on the name. `live=True` makes the [filesystem source](https://cocoindex.io/docs/connectors/localfs/) [watch for changes](https://cocoindex.io/docs/programming_guide/live_mode/), and `mount_each` runs one component per file so the engine can track and update them independently.

> **No vector index here.** To keep the example minimal, this flow doesn't declare a vector index, so queries do a sequential scan — fine for a personal session history. For a larger corpus, add one line — `emb_table.declare_vector_index(column="embedding")` — exactly as the [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/) example does, and pgvector serves approximate-nearest-neighbor queries instead.

## Run the pipeline

Run the [`cocoindex` CLI](https://cocoindex.io/docs/cli/) to build and update the index. Choose catch-up (scan, sync, exit) or live (catch up, then keep watching):

```sh
# Catch-up run
cocoindex update main

# Live run: keep watching for new sessions
cocoindex update -L main
```

## Query the index

Match user text against the index with a plain SQL query, reusing the *same* embedder from the indexing flow so indexing and querying stay consistent.

```python title="main.py"
async def query_once(pool, embedder, query: str, *, top_k: int = 5) -> None:
    query_vec = await embedder.embed(query)
    async with pool.acquire() as conn:
        rows = await conn.fetch(
            f"""
            SELECT checkpoint_id, session_index, content_type, role, text,
                   embedding <=> $1 AS distance
            FROM "{PG_SCHEMA_NAME}"."{TABLE_EMBEDDINGS}"
            ORDER BY distance ASC
            LIMIT $2
            """,
            query_vec, top_k,
        )
    for r in rows:
        score = 1.0 - float(r["distance"])
        tag = r["content_type"] + (f"/{r['role']}" if r["role"] else "")
        print(f"[{score:.3f}] {r['checkpoint_id']}/{r['session_index']} ({tag})")
        print(f"    {r['text'][:200]}")
        print("---")
```

The `<=>` operator is pgvector's cosine distance. We turn it into a similarity score and print which session and content type matched, so a transcript turn, a prompt, and a context chunk are all distinguishable in the results. Run a search straight from the command line:

```sh
python main.py "how did I fix the auth bug"
```

The most semantically similar sessions come back ranked — even when they share none of the words in your query. That's the whole point of a vector index.
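
For intuition about the numbers `query_once` prints, here is the arithmetic behind cosine distance and the `1.0 - distance` score, on toy 2-d vectors instead of real embeddings. The vectors and labels are made up for illustration; in the example the distance is computed by pgvector inside Postgres, not in Python.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity, the quantity pgvector's <=> operator computes."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)  # undefined for zero vectors


query = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Mirror ORDER BY distance ASC, then convert to the similarity score.
ranked = sorted(rows, key=lambda name: cosine_distance(query, rows[name]))
scores = {name: 1.0 - cosine_distance(query, vec) for name, vec in rows.items()}
print(ranked)
print(scores)
```

Distance ranges from 0 (same direction) to 2 (opposite), so the printed score ranges from 1 down to -1, and vector length does not affect it.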

## Incremental updates

CocoIndex keeps the index in sync with your sessions and does the **minimum work** to get there. You never compute a diff or write update logic. Two pieces make this work. `@coco.fn(memo=True)` decides what to *recompute* — a file is skipped when its content and the function's code are both unchanged, so a finished session is never re-embedded. `mount_table_target` decides what to *write* — each embedding row's [`id`](https://cocoindex.io/docs/common_resources/id_generation/) is derived from its text, so it upserts only the rows that actually changed and deletes rows whose source is gone.

- **A new session is captured** — only its files are parsed, chunked, and embedded; their rows are inserted. Everything already indexed is untouched.
- **A session is updated** — its files are re-routed and re-chunked; turns whose text is unchanged keep their `id` and embedding, genuinely new turns are embedded and inserted, and turns that no longer exist are deleted.
- **A session is removed** — its embedding and metadata rows are removed from both tables automatically.

The same machinery covers **logic** changes too: tune the chunk size or swap the embedding model, and CocoIndex compares the new output against what's already in Postgres and applies only the difference. A catch-up run (`cocoindex update main`) does this once and exits; live mode (`cocoindex update -L main`) keeps watching and applies each new session with low latency.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/entire_session_search](https://github.com/cocoindex-io/cocoindex/tree/main/examples/entire_session_search). If your inputs are plain text or Markdown rather than session checkpoints, [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/) is the same flow without the per-file routing; to search a folder of PDFs, see [Semantic Search over PDFs](https://github.com/cocoindex-io/cocoindex/tree/main/examples/pdf_embedding).

Want to search your own AI coding history by meaning? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: Image Search with ColPali

Source: https://cocoindex.io/docs/examples/image-search-colpali/

![Image search with ColPali multi-vector embeddings and CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/image-search-colpali/cover.png)

This is the multi-vector cousin of the [CLIP image search example](https://github.com/cocoindex-io/cocoindex/tree/main/examples/image_search). Same idea — type *"long neck"*, get the giraffe back, no tags or captions — but instead of squeezing each image into a *single* vector, [ColPali](https://huggingface.co/vidore/colpali-v1.2) emits a *bag* of vectors, one per image patch, and matches a query the same way it reads a document: token against patch. The cost is more vectors per image; the payoff is finer-grained retrieval that holds up on dense, text-heavy, or busy images where a single embedding blurs everything together.

The store does the heavy lifting on the query side. We give [Qdrant](https://qdrant.tech/) a **multivector** collection configured for **MaxSim**, so a query's bag of vectors and an image's bag of patch vectors are scored late-interaction style — each query vector finds its best-matching patch, summed across the query. The whole pipeline is ordinary `async` Python and your own types; [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/), change tracking, and the managed Qdrant collection run in a Rust engine underneath, in [live mode](https://cocoindex.io/docs/programming_guide/live_mode/) inside the API server, so a new photo in the folder is searchable within a second.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/image_search_colpali)

## Flow overview

![CocoIndex ColPali image search indexing flow: walk a folder of images, embed each into a multi-vector bag of patch vectors with ColPali, and declare a point into a Qdrant MaxSim multivector collection](https://cocoindex.io/blobs/docs-v1/img/examples/image-search-colpali/flow-v1.png)

The indexing path is short — there's no text to chunk, just one multi-vector embedding per image:

1. Read image files from a local directory (live).
2. Embed each image with [ColPali](https://huggingface.co/vidore/colpali-v1.2) into a *multi-vector* — a list of 128-d patch vectors, not one fixed vector.
3. Store it in Qdrant (as a [point](https://cocoindex.io/docs/connectors/qdrant/) in a MaxSim multivector collection, keyed by a stable id, with the filename in the payload).

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

## Multi-vector embeddings: a bag of vectors per image

This is what sets the example apart from its CLIP sibling. CLIP gives you *one* vector per image; ColPali gives you *many*, one vector per visual patch, and embeds a text query into the same space as one vector per token. Indexing and querying use the same model through two different processor entry points: `process_images` for the index side, `process_queries` for the query side.

```python title="pipeline.py"
@functools.cache
def get_colpali() -> tuple[ColPali, ColPaliProcessor, str]:
    model = ColPali.from_pretrained(COLPALI_MODEL_NAME)       # vidore/colpali-v1.2
    processor = ColPaliProcessor.from_pretrained(COLPALI_MODEL_NAME)
    device = get_torch_device("auto")
    model = model.to(device)
    model.eval()
    return model, processor, device


def embed_image_bytes(img_bytes: bytes) -> list[list[float]]:    # indexing side
    model, processor, device = get_colpali()
    image = Image.open(io.BytesIO(img_bytes)).convert("RGB")
    batch = processor.process_images([image]).to(device)
    with torch.no_grad():
        embeddings = model(**batch)
    return _postprocess_embeddings(embeddings, processor)


def embed_query(text: str) -> list[list[float]]:                 # query side
    model, processor, device = get_colpali()
    batch = processor.process_queries(texts=[text]).to(device)
    with torch.no_grad():
        embeddings = model(**batch)
    return _postprocess_embeddings(embeddings, processor)
```

Note the return type: `list[list[float]]`, not `list[float]`. Each image becomes a list of 128-d patch vectors, and each query becomes a list of 128-d token vectors. `_postprocess_embeddings` strips the model's padding so only real patches/tokens survive, and `@functools.cache` loads the (large) ColPali model once and reuses it for every image and every query.
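The load-once behaviour of `@functools.cache` can be seen in isolation with a stand-in loader. This is a minimal sketch: `load_model` and `LOAD_COUNT` are hypothetical names, and the dictionary only stands in for the real ColPali model and processor.

```python
import functools

LOAD_COUNT = 0  # hypothetical counter, present only to make the caching visible


@functools.cache
def load_model() -> dict:
    """Stand-in for an expensive model load; the body runs once per process."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {"name": "stand-in-model", "dim": 128}


first = load_model()
second = load_model()  # served from the cache; the function body does not run again
```

Both calls return the very same object, which is why every image and every query in the pipeline can share one loaded model.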

## Setup

- A running [Qdrant](https://qdrant.tech/):

  ```sh
  docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
  export QDRANT_URL="http://localhost:6334/"
  ```

- Install CocoIndex with the ColPali and Qdrant extras, plus the dependencies this example uses:

  ```sh
  pip install -U "cocoindex[colpali,qdrant]" torch transformers pillow fastapi "uvicorn[standard]" python-dotenv
  ```

- A few images. The example ships an `img/` folder (a cat, a dog, an elephant, a giraffe) — or drop your own `.jpg` / `.png` files in.

## Shared resources: the Qdrant client

The [lifespan](https://cocoindex.io/docs/programming_guide/context/) provides the Qdrant client once at startup, via a [context key](https://cocoindex.io/docs/programming_guide/context/):

```python title="pipeline.py"
QDRANT_DB = coco.ContextKey[QdrantClient]("image_search_colpali")


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    client = qdrant.create_client(qdrant_url(), prefer_grpc=True)
    builder.provide(QDRANT_DB, client)
    yield
```

## Process an image

![One process_file component per image, fanned out with mount_each: each image is ColPali-embedded into a bag of patch vectors and declared as a Qdrant multivector point](https://cocoindex.io/blobs/docs-v1/img/examples/image-search-colpali/stage-file-process.png)

`process_file` runs once per image: read the bytes, embed with ColPali into a multi-vector, and declare a Qdrant point keyed by a stable id derived from the path, with the filename in the payload. The only difference from the CLIP version is the shape of `embedding` — a list of patch vectors rather than one vector.

```python title="pipeline.py"
@coco.fn(memo=True)
async def process_file(file: FileLike, target: qdrant.CollectionTarget) -> None:
    content = await file.read()
    embedding = embed_image_bytes(content)                  # list[list[float]] — multi-vector
    point = qdrant.PointStruct(
        id=_image_id(file.file_path.path),                  # uuid5 of the path — stable
        vector=embedding,
        payload={"filename": str(file.file_path.path)},
    )
    target.declare_point(point)
```

[`@coco.fn(memo=True)`](https://cocoindex.io/docs/programming_guide/function/) makes it [incremental](https://cocoindex.io/docs/advanced_topics/memoization_keys/): an unchanged image is never re-embedded. Each image runs as its own [processing component](https://cocoindex.io/docs/programming_guide/processing_component/), so the engine tracks them independently — delete an image and its point is removed from Qdrant automatically. `declare_point` declares the point as a [target state](https://cocoindex.io/docs/programming_guide/target_state/); CocoIndex upserts or deletes to match.
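The stable id matters because a replaced image must land on the same Qdrant point. A minimal sketch of a path-derived `uuid5` id follows; `image_id` and the choice of `uuid.NAMESPACE_URL` are assumptions for illustration, and the example's own `_image_id` helper may use a different namespace.

```python
import uuid

# Assumption: any fixed namespace gives stable ids; the real helper may pick another.
IMAGE_ID_NAMESPACE = uuid.NAMESPACE_URL


def image_id(path: str) -> str:
    """Derive a deterministic UUID string from a file path."""
    return str(uuid.uuid5(IMAGE_ID_NAMESPACE, path))


giraffe_first_run = image_id("img/giraffe.jpg")
giraffe_second_run = image_id("img/giraffe.jpg")  # same path, same id
cat_id = image_id("img/cat.jpg")                  # different path, different id
```

Because the id depends only on the path, re-indexing the same file updates its existing point instead of creating a duplicate.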

## Define the main function

`app_main` mounts the Qdrant collection, and this is where the multi-vector setup lives. The vector schema is wrapped in a `MultiVectorSchema`, and the collection is configured with `multivector_comparator="max_sim"` so Qdrant scores points with late interaction. The per-vector dimension comes straight from the model (`model.dim`, falling back to 128, ColPali's vector size). After mounting the collection, `app_main` walks the image folder and mounts one component per file:

```python title="pipeline.py"
@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    model, _, _ = get_colpali()
    dim = int(getattr(model, "dim", 128))   # 128 per patch/token vector

    target_collection = await qdrant.mount_collection_target(
        QDRANT_DB,
        collection_name=QDRANT_COLLECTION,   # "ImageSearchColpali"
        schema=await qdrant.CollectionSchema.create(
            vectors=qdrant.QdrantVectorDef(
                schema=MultiVectorSchema(
                    vector_schema=VectorSchema(dtype=np.dtype(np.float32), size=dim)
                ),
                distance="cosine",
                multivector_comparator="max_sim",   # late-interaction MaxSim
            )
        ),
    )

    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(
            included_patterns=["**/*.jpg", "**/*.jpeg", "**/*.png"]
        ),
        live=True,   # api.py runs the app with live=True
    )
    await coco.mount_each(process_file, files.items(), target_collection)


app = coco.App(
    coco.AppConfig(name="ImageSearchColpaliV1"),
    app_main,
    sourcedir=pathlib.Path("./img"),
)
```

`mount_collection_target` creates and manages the Qdrant collection for you: multivector schema, idempotent upserts, and cleanup when an image disappears. Because the per-vector size is read from the model rather than hardcoded, switching to a ColPali variant with a different vector size needs no schema edit in this file.
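To make the `max_sim` comparator concrete, here is a small pure-Python sketch of late-interaction scoring with cosine similarity: each query vector takes its best-matching patch vector, and the best matches are summed. The names `cosine` and `max_sim` and the two-dimensional toy vectors are illustrative assumptions; Qdrant computes this internally, and this is not its implementation.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def max_sim(query_vecs: list[list[float]], patch_vecs: list[list[float]]) -> float:
    """For each query vector take its best patch match, then sum over the query."""
    return sum(max(cosine(q, p) for p in patch_vecs) for q in query_vecs)


query = [[1.0, 0.0], [0.0, 1.0]]                 # two query-token vectors
image_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # has a strong patch for each token
image_b = [[1.0, 0.0]]                           # matches only the first token

score_a = max_sim(query, image_a)
score_b = max_sim(query, image_b)
```

`image_a` outscores `image_b` because every query token finds a strong patch in it, which is the behaviour a single pooled vector cannot express.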

## Run it as a service

Like the CLIP example, image search runs as a server. `api.py` is a FastAPI app whose [lifespan](https://fastapi.tiangolo.com/advanced/events/) starts the CocoIndex flow in **live mode** in the background — it blocks startup until the initial sweep finishes (so the collection is queryable), then keeps watching `img/` while it serves requests. There's no separate "build the index" step.

```python title="api.py"
@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncIterator[None]:
    global _client
    async with coco.runtime():
        _client = qdrant.create_client(pipeline.qdrant_url(), prefer_grpc=True)

        # Start a live update; block until the initial sweep is READY, then run on.
        update_handle = pipeline.app.update(live=True)
        async for snap in update_handle.watch():
            if snap.status is coco.UpdateStatus.READY:
                break
        update_task = asyncio.create_task(update_handle.result())
        try:
            yield
        finally:
            update_task.cancel()


@app.get("/search")
async def search(q: str, limit: int = 5) -> dict:
    query_embedding = pipeline.embed_query(q)               # text → ColPali multi-vector
    results = pipeline._qdrant_search(_client, pipeline.QDRANT_COLLECTION, query_embedding, limit)
    return {"results": [{"filename": (r.payload or {}).get("filename"), "score": r.score} for r in results]}
```

`_qdrant_search` calls Qdrant's `query_points` with the query's *bag* of vectors — Qdrant handles the MaxSim scoring against each point's patch vectors. Start the server, then the frontend:

```sh
python -m uvicorn api:app --reload --host 0.0.0.0 --port 8000

cd frontend && npm install && npm run dev   # http://localhost:5173
```

The React app sends your query to the `GET /search` endpoint, which embeds the text into ColPali's per-token space and runs a MaxSim search in Qdrant. The match is by *meaning*, patch by patch, never by metadata.
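The startup pattern in the `api.py` lifespan (wait until the first sweep is done, keep the update running in the background, cancel it on shutdown) can be sketched with plain `asyncio`. `FakeLiveUpdate` is a hypothetical stand-in, not the CocoIndex update handle, and it signals readiness with an `asyncio.Event` rather than a stream of status snapshots.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator


class FakeLiveUpdate:
    """Hypothetical stand-in for a live update; not the CocoIndex API."""

    def __init__(self) -> None:
        self.ready = asyncio.Event()
        self.cancelled = False

    async def run(self) -> None:
        await asyncio.sleep(0.01)          # pretend to do the initial sweep
        self.ready.set()                   # the index is now queryable
        try:
            await asyncio.Event().wait()   # keep "watching" until cancelled
        except asyncio.CancelledError:
            self.cancelled = True
            raise


@asynccontextmanager
async def lifespan(update: FakeLiveUpdate) -> AsyncIterator[None]:
    task = asyncio.create_task(update.run())
    await update.ready.wait()              # block startup until the sweep is done
    try:
        yield                              # the server handles requests here
    finally:
        task.cancel()                      # stop watching on shutdown
        try:
            await task
        except asyncio.CancelledError:
            pass


async def demo() -> tuple[bool, bool]:
    update = FakeLiveUpdate()
    async with lifespan(update):
        ready_while_serving = update.ready.is_set()
    return ready_while_serving, update.cancelled
```

Running `asyncio.run(demo())` shows that requests are only served once the update reports ready, and that the background task is cancelled when the server shuts down.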

## Incremental updates

Because the flow runs live inside the server, the index tracks the folder with no extra work from you:

- **Add an image** — `process_file` runs once for it, embeds it into a multi-vector, and upserts one Qdrant point. It's searchable within a second.
- **Replace an image** — same id (derived from the path), new bag of vectors; the point is updated in place.
- **Delete an image** — its component disappears and the point is removed from Qdrant.
- **Restart the server** — the initial sweep reconciles against what's already in Qdrant and re-embeds nothing that's unchanged.

Swap the ColPali model and CocoIndex re-embeds everything against the new space; leave it alone and a restart is nearly free.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/image_search_colpali](https://github.com/cocoindex-io/cocoindex/tree/main/examples/image_search_colpali). For the lighter, single-vector version that fits more images in memory and indexes faster, see the [CLIP image search example](https://github.com/cocoindex-io/cocoindex/tree/main/examples/image_search); for the text equivalent, see [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/).

Got a document-image archive, a product catalog, or a screenshot pile you want to search by meaning? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: Semantic Search with Qdrant

Source: https://cocoindex.io/docs/examples/text-embedding-qdrant/

![Semantic Search with Qdrant on CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/text-embedding-qdrant/cover.png)

This is the [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/) example with one thing swapped: instead of Postgres + pgvector, the vectors land in a [Qdrant](https://qdrant.tech/) collection. Everything else — walk Markdown, chunk, embed locally with `all-MiniLM-L6-v2` — is identical, so this post stays short and spends its words on the connector, the collection setup, and how to run it.

If you want the full chunk-and-embed walkthrough, read the [base example](https://cocoindex.io/docs/examples/text-embedding/) first; the only difference here is the target.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding_qdrant)

## Flow overview

![CocoIndex text embedding flow with Qdrant: read Markdown, split into chunks, embed each chunk, and upsert the vectors into a Qdrant collection](https://cocoindex.io/blobs/docs-v1/img/examples/text-embedding-qdrant/flow-v1.png)

From a high level, these are the steps:

1. Read Markdown files from a local directory.
2. [Split each file into overlapping chunks](https://cocoindex.io/docs/ops/text/), then [embed](https://cocoindex.io/docs/ops/sentence_transformers/) every chunk.
3. Upsert each chunk's embedding (with its text and metadata) as a point in a [Qdrant](https://cocoindex.io/docs/connectors/qdrant/) collection (as [target states](https://cocoindex.io/docs/programming_guide/target_state/)).

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

## Setup

- A running Qdrant. The local container exposes HTTP on `6333` and gRPC on `6334`:

  ```sh
  docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
  ```

- Install CocoIndex and the dependencies this example uses:

  ```sh
  pip install -U "cocoindex[qdrant,sentence_transformers]" qdrant-client numpy python-dotenv
  ```

- A few `.md` files to index. Grab the [sample file](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding_qdrant/markdown_files) from the repo, or drop your own notes into a `markdown_files/` directory.

## Connect to Qdrant

The Qdrant client is a [shared resource](https://cocoindex.io/docs/programming_guide/context/): provide it once in the [lifespan](https://cocoindex.io/docs/programming_guide/app/) and every step reuses it. We connect over gRPC (`prefer_grpc=True`) for fast point upserts, and provide the same embedder the base example uses.

```python title="main.py"
QDRANT_URL = "http://localhost:6334"
QDRANT_COLLECTION = "TextEmbedding"
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

QDRANT_DB = coco.ContextKey[QdrantClient]("text_embedding_qdrant")
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder", detect_change=True)


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    client = qdrant.create_client(QDRANT_URL, prefer_grpc=True)
    builder.provide(QDRANT_DB, client)
    builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
    yield
```

## Mount the collection

`app_main` wires the source to the target. The one Qdrant-specific call is [`mount_collection_target`](https://cocoindex.io/docs/connectors/qdrant/): it creates and manages the collection, deriving the vector dimensions straight from the embedder via `QdrantVectorDef(schema=EMBEDDER)` — no hardcoded `384`. The rest is the same `walk_dir` → `mount_each` shape as the base example.

```python title="main.py"
@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    target_collection = await qdrant.mount_collection_target(
        QDRANT_DB,
        collection_name=QDRANT_COLLECTION,
        schema=await qdrant.CollectionSchema.create(
            vectors=qdrant.QdrantVectorDef(schema=EMBEDDER)
        ),
    )
    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]),
        live=True,  # watch for changes; pass -L to `cocoindex update` to run live
    )
    await coco.mount_each(process_file, files.items(), target_collection)
```

`mount_collection_target` handles collection creation, idempotent point upserts, and orphan cleanup when a file disappears — the same managed-target guarantees pgvector gets in the base example.

## Declare a point

`process_file` chunks the text and maps each chunk to `process_chunk` (identical to the base walkthrough). The only difference is the target state: instead of a typed table row, each chunk becomes a Qdrant [`PointStruct`](https://cocoindex.io/docs/connectors/qdrant/). The chunk text and offsets go in the `payload`, the embedding is the `vector`, and `id_gen` derives a stable point id from the chunk text so re-runs upsert in place.

```python title="main.py"
@coco.fn
async def process_chunk(
    chunk: Chunk,
    filename: pathlib.PurePath,
    id_gen: IdGenerator,
    target: qdrant.CollectionTarget,
) -> None:
    embedding_vec = await coco.use_context(EMBEDDER).embed(chunk.text)

    point = qdrant.PointStruct(
        id=await id_gen.next_id(chunk.text),
        vector=embedding_vec.tolist(),
        payload={
            "filename": str(filename),
            "chunk_start": chunk.start.char_offset,
            "chunk_end": chunk.end.char_offset,
            "text": chunk.text,
        },
    )
    target.declare_point(point)
```

`target.declare_point` declares the point as a target state; CocoIndex inserts, updates, or deletes it to match — you never write upsert calls yourself.
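Conceptually, a declared target state is compared with what already exists, and only the difference is applied. The sketch below illustrates that idea with plain dictionaries; `reconcile` and the sample data are hypothetical, and this is not how the CocoIndex engine is implemented.

```python
def reconcile(
    current: dict[str, dict], declared: dict[str, dict]
) -> dict[str, list[str]]:
    """Compute the actions that move `current` to match `declared`."""
    inserts = sorted(k for k in declared if k not in current)
    updates = sorted(
        k for k in declared if k in current and current[k] != declared[k]
    )
    deletes = sorted(k for k in current if k not in declared)
    return {"insert": inserts, "update": updates, "delete": deletes}


# Points already in the store, keyed by id
current = {"a": {"text": "old"}, "b": {"text": "same"}, "c": {"text": "gone"}}
# Points declared by the latest run
declared = {"a": {"text": "new"}, "b": {"text": "same"}, "d": {"text": "added"}}

plan = reconcile(current, declared)
```

The unchanged point `b` appears in no action list, which is the "minimum work" property the declarative model is after.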

## Run the pipeline

Run the [`cocoindex` CLI](https://cocoindex.io/docs/cli/) to build and update the index. Choose catch-up (scan, sync, exit) or live (catch up, then keep watching):

```sh
# Catch-up run
cocoindex update main

# Live run: keep watching for file changes
cocoindex update -L main
```

Then run a search from the command line — it embeds your query with the *same* model and asks Qdrant for the nearest points:

```bash
python main.py "what is self-attention?"
```

You can also browse the collection in the Qdrant dashboard at <http://localhost:6333/dashboard>.

## Incremental updates

CocoIndex keeps the Qdrant collection in sync and does the **minimum work** to get there — exactly as in the [base example](https://cocoindex.io/docs/examples/text-embedding/), just against Qdrant. `@coco.fn(memo=True)` on `process_file` decides what to *recompute* (a file is skipped when its content and code are unchanged), and each point's [`id`](https://cocoindex.io/docs/common_resources/id_generation/) is derived from its chunk's text, so `mount_collection_target` upserts only the points that changed and deletes points whose source is gone. Add a file and only it is embedded; edit one and unchanged chunks keep their id while new chunks are upserted and vanished chunks deleted; delete a file and its points are removed automatically. Swap the embedding model and `detect_change=True` re-embeds everything. A catch-up run applies the difference once and exits; live mode keeps watching.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/text_embedding_qdrant](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding_qdrant). It's the [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/) flow with Qdrant as the store; start there for the chunk-and-embed details and to compare against the Postgres target.

Already running Qdrant and want your docs searchable by meaning? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: Semantic Search with LanceDB

Source: https://cocoindex.io/docs/examples/text-embedding-lancedb/

![Semantic Search with LanceDB and CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/text-embedding-lancedb/cover.png)

This is the [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/) example with one thing changed: the vectors land in [LanceDB](https://lancedb.github.io/lancedb/) instead of Postgres. LanceDB is an embedded, file-based vector store — no server to stand up, no `POSTGRES_URL`, just a directory on disk you can copy to move. Everything else — read Markdown, chunk, embed each chunk — is identical, so this post focuses on the one part that differs: the connector.

The whole pipeline is ordinary `async` Python and your own types. The heavy lifting — [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/), change tracking, managed targets — runs in a Rust engine underneath, so only what changed gets re-embedded and re-upserted.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding_lancedb)

## Flow overview

![CocoIndex text embedding flow with LanceDB: read Markdown, split into chunks, embed each chunk, and store the vectors in an embedded LanceDB table](https://cocoindex.io/blobs/docs-v1/img/examples/text-embedding-lancedb/flow-v1.png)

From a high level, these are the steps:

1. Read Markdown files from a local directory.
2. [Split each file into overlapping chunks](https://cocoindex.io/docs/ops/text/), then [embed](https://cocoindex.io/docs/ops/sentence_transformers/) every chunk.
3. Store the chunks and their embeddings in a LanceDB table (as [target states](https://cocoindex.io/docs/programming_guide/target_state/)).

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

The chunk-and-embed half is unchanged from the base example — `RecursiveSplitter` cuts each file into overlapping Markdown chunks, and a local [`SentenceTransformerEmbedder`](https://cocoindex.io/docs/ops/sentence_transformers/) (`all-MiniLM-L6-v2`, no API key) turns each chunk into a 384-d vector. See the [base walkthrough](https://cocoindex.io/docs/examples/text-embedding/) for the chunk/embed details. What changes here is the target.
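For readers who skip the base walkthrough, the idea of overlapping chunks can be sketched with a fixed-size window. This is an illustration only: `TextChunk` and `split_with_overlap` are hypothetical names, and `RecursiveSplitter` is Markdown-aware rather than a plain character window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextChunk:
    start: int  # character offset where the chunk begins
    end: int    # character offset one past the chunk's last character
    text: str


def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[TextChunk]:
    """Slide a window of `chunk_size` characters, stepping by chunk_size - overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: list[TextChunk] = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        chunks.append(TextChunk(start, end, text[start:end]))
        if end == len(text):
            break  # the window reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, overlap=2)
```

Each chunk shares its last two characters with the start of the next one, so a sentence that straddles a boundary still appears whole in at least one chunk.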

## Connect to LanceDB

LanceDB is embedded, so the "connection" is just a path on disk — the directory is created on first run. The [shared resources](https://cocoindex.io/docs/programming_guide/context/) the rest of the code builds on are the LanceDB connection and the embedding model, both provided once at startup in [`coco.lifespan`](https://cocoindex.io/docs/programming_guide/context/). `DocEmbedding` defines one output row — each chunk becomes one row.

```python title="main.py"
import cocoindex as coco
from cocoindex.connectors import lancedb, localfs
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder

LANCEDB_URI = "./lancedb_data"
TABLE_NAME = "doc_embeddings"
EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

LANCE_DB = coco.ContextKey[lancedb.LanceAsyncConnection]("text_embedding_db")
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder", detect_change=True)


@dataclass
class DocEmbedding:
    id: int
    filename: str
    chunk_start: int
    chunk_end: int
    text: str
    embedding: Annotated[NDArray, EMBEDDER]


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    conn = await lancedb.connect_async(LANCEDB_URI)
    builder.provide(LANCE_DB, conn)
    builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
    yield
```

Compared to the Postgres version, the only difference is the resource: `lancedb.connect_async(LANCEDB_URI)` instead of an `asyncpg` pool, and a `LanceAsyncConnection` context key instead of `asyncpg.Pool`. `embedding: Annotated[NDArray, EMBEDDER]` still ties the vector column to the embedder, so its dimensions are inferred automatically — and if you swap the model later, CocoIndex notices (`detect_change=True`) and re-embeds.

## Mount the LanceDB table

`app_main` wires the source to the target. It mounts the LanceDB table, walks the source directory, and mounts one [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) per file.

```python title="main.py"
@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    target_table = await lancedb.mount_table_target(
        LANCE_DB,
        table_name=TABLE_NAME,
        table_schema=await lancedb.TableSchema.from_class(
            DocEmbedding, primary_key=["id"]
        ),
    )

    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]),
        live=True,  # watch for changes; pass -L to `cocoindex update` to run live
    )
    await coco.mount_each(process_file, files.items(), target_table)
```

`lancedb.mount_table_target` is the LanceDB counterpart to the Postgres `mount_table_target` — same call shape, same managed-table guarantees: it creates and manages the table for you, handles idempotent upserts keyed on the primary key, and cleans up orphan rows when a file disappears. `process_file` and `process_chunk` take a `lancedb.TableTarget[DocEmbedding]` and call `table.declare_row(...)` exactly as before; only the import changed. `live=True` makes the [filesystem source](https://cocoindex.io/docs/connectors/localfs/) [watch for changes](https://cocoindex.io/docs/programming_guide/live_mode/), and `mount_each` runs one component per file.

The App binds it all together and points at the Markdown folder:

```python title="main.py"
app = coco.App(
    coco.AppConfig(name="TextEmbeddingLanceDBV1"),
    app_main,
    sourcedir=pathlib.Path("./markdown_files"),
)
```

## Setup and run

No database to install — LanceDB writes to `./lancedb_data/`, created on first run. Install the example's dependencies and grab a few `.md` files (the repo ships [sample files](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding_lancedb/markdown_files), or drop your own into `markdown_files/`):

```sh
pip install -U "cocoindex[sentence_transformers,lancedb]" python-dotenv
```

Then build and update the index with the [`cocoindex` CLI](https://cocoindex.io/docs/cli/) — catch-up (scan, sync, exit) or live (catch up, then keep watching):

```sh
# Catch-up run
cocoindex update main

# Live run: keep watching for file changes
cocoindex update -L main
```

## Query the index

Query with LanceDB's async search API, reusing the *same* embedder from the indexing flow so indexing and querying stay consistent.

```python title="main.py"
async def query_once(conn, embedder, query_text: str, *, top_k: int = TOP_K) -> None:
    query_vec = await embedder.embed(query_text)
    table = await conn.open_table(TABLE_NAME)
    search = await table.search(query_vec, vector_column_name="embedding")
    results = await search.limit(top_k).to_list()
    for r in results:
        score = 1.0 - r["_distance"]
        print(f"[{score:.3f}] {r['filename']}")
        print(f"    {r['text']}")
        print("---")
```

`table.search(...).limit(top_k)` returns the nearest vectors; `_distance` is the distance LanceDB reports for each hit (smaller is closer), which `1.0 - _distance` turns into a score where larger is better. Run a search straight from the command line:

```sh
python main.py "what is self-attention?"
```

The most semantically similar chunks come back ranked — even when they share none of the words in your query.
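The `1.0 - r["_distance"]` conversion in `query_once` reads as a cosine similarity when the reported distance is a cosine distance. That metric is an assumption here, since the walkthrough does not state which metric the table uses; other metrics would need a different conversion. A small sketch of the relationship, with hypothetical helper names:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance is defined as one minus cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


def score_from_distance(distance: float) -> float:
    """Invert a cosine distance back into a similarity score."""
    return 1.0 - distance


same_direction = score_from_distance(cosine_distance([1.0, 0.0], [2.0, 0.0]))
orthogonal = score_from_distance(cosine_distance([1.0, 0.0], [0.0, 3.0]))
```

Under this assumption, identical directions score near 1 and unrelated directions score near 0.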

## Incremental updates

The incremental story is identical to the [base example](https://cocoindex.io/docs/examples/text-embedding/): `@coco.fn(memo=True)` decides what to *recompute* (a file is skipped when its content and the function's code are both unchanged), and `lancedb.mount_table_target` decides what to *write* — each row's [`id`](https://cocoindex.io/docs/common_resources/id_generation/) is derived from its chunk's text, so it upserts only the rows that actually changed and deletes rows whose source is gone.

- **A file is added** — only that file is chunked and embedded, and its rows are inserted.
- **A file is edited** — it is re-chunked; unchanged chunks keep their `id` and embedding, genuinely new chunks are embedded and inserted, and chunks that no longer exist are deleted.
- **A file is deleted** — its rows are removed from the LanceDB table automatically.

A catch-up run (`cocoindex update main`) does this once and exits; live mode (`cocoindex update -L main`) keeps watching and applies each change with low latency.
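The "skip when content and code are unchanged" rule can be illustrated with a toy memo table keyed by a hash of both. `MemoCache`, `fake_embed`, and the explicit `code_version` string are hypothetical; CocoIndex tracks code changes itself, and this sketch does not reflect its internals.

```python
import hashlib
from typing import Callable


class MemoCache:
    """Toy memo table keyed by a hash of (code version, input bytes)."""

    def __init__(self, func: Callable[[bytes], str], code_version: str) -> None:
        self.func = func
        self.code_version = code_version
        self.results: dict[str, str] = {}
        self.calls = 0  # how many times the wrapped function really ran

    def __call__(self, content: bytes) -> str:
        key = hashlib.sha256(
            self.code_version.encode() + b"\0" + content
        ).hexdigest()
        if key not in self.results:
            self.calls += 1
            self.results[key] = self.func(content)
        return self.results[key]


def fake_embed(content: bytes) -> str:
    """Stand-in for an expensive embedding call."""
    return f"embedding-of-{len(content)}-bytes"


memo = MemoCache(fake_embed, code_version="v1")
memo(b"hello")    # first time: computed
memo(b"hello")    # unchanged content and code: served from the memo table
memo(b"hello!")   # edited content: recomputed
```

Changing `code_version` changes every key, which mirrors how a code change invalidates previously memoized results.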

## Run it

The full, runnable example is in the CocoIndex repo: [examples/text_embedding_lancedb](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding_lancedb). For the chunk-and-embed walkthrough, see [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/) — same flow, Postgres as the target.

If this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex) and come say hi in our [Discord community](https://discord.com/invite/zpA9S2DR7s).

---

# Example: Semantic Search with Turbopuffer

Source: https://cocoindex.io/docs/examples/text-embedding-turbopuffer/

![Semantic Search with Turbopuffer using CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/text-embedding-turbopuffer/cover.png)

This is the [Semantic Search 101 example](https://cocoindex.io/docs/examples/text-embedding/) with one thing swapped: instead of storing the vectors in Postgres with pgvector, we write them to a [Turbopuffer](https://turbopuffer.com/) namespace — a managed, serverless vector store, so there's no database to run yourself. The chunking and embedding are identical; only the target changes.

Everything else stays the same: ordinary `async` Python and your own types, with [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/), change tracking, and managed targets running in the Rust engine underneath — so when a file changes, only the affected chunks get re-embedded and re-upserted.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding_turbopuffer)

## Flow overview

![CocoIndex flow: read Markdown, split into chunks, embed each chunk, and upsert the vectors into a Turbopuffer namespace](https://cocoindex.io/blobs/docs-v1/img/examples/text-embedding-turbopuffer/flow-v1.png)

From a high level, these are the steps:

1. Read Markdown files from a local directory.
2. [Split each file into overlapping chunks](https://cocoindex.io/docs/ops/text/), then [embed](https://cocoindex.io/docs/ops/sentence_transformers/) every chunk.
3. Upsert each chunk and its embedding as a row in a Turbopuffer namespace (as [target states](https://cocoindex.io/docs/programming_guide/target_state/)).

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

For the chunk-and-embed half of the pipeline — `process_file`, `RecursiveSplitter`, and `SentenceTransformerEmbedder` — read the [base walkthrough](https://cocoindex.io/docs/examples/text-embedding/). Here we focus on the one part that differs: the Turbopuffer target.

## Set up the Turbopuffer client

Turbopuffer is a cloud service, so the [shared resource](https://cocoindex.io/docs/programming_guide/context/) the pipeline needs is an API client rather than a database pool. We provide an `AsyncTurbopuffer` client in the lifespan, alongside the embedder, keyed off the `TURBOPUFFER_API_KEY` env var.

```python title="main.py"
TPUF_REGION = os.environ.get("TURBOPUFFER_REGION", "gcp-us-central1")
TPUF_NAMESPACE = "TextEmbedding"

EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
TPUF_DB = coco.ContextKey[turbopuffer.AsyncTurbopuffer]("text_embedding_turbopuffer")
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder", detect_change=True)


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    api_key = os.environ.get("TURBOPUFFER_API_KEY")
    if not api_key:
        raise RuntimeError("TURBOPUFFER_API_KEY is not set")
    client = turbopuffer.AsyncTurbopuffer(region=TPUF_REGION, api_key=api_key)
    builder.provide(TPUF_DB, client)
    builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
    yield
```

## Declare a row in the namespace

A Turbopuffer row is an `id`, a `vector`, and an open bag of `attributes`. Instead of a typed table column per field, the filename, text, and offsets ride along as attributes — the embedding is the indexed vector.

```python title="main.py"
@coco.fn
async def process_chunk(
    chunk: Chunk,
    filename: pathlib.PurePath,
    id_gen: IdGenerator,
    target: turbopuffer.NamespaceTarget,
) -> None:
    embedding_vec = await coco.use_context(EMBEDDER).embed(chunk.text)
    target.declare_row(
        turbopuffer.Row(
            id=str(await id_gen.next_id(chunk.text)),
            vector=embedding_vec,
            attributes={
                "filename": str(filename),
                "chunk_start": chunk.start.char_offset,
                "chunk_end": chunk.end.char_offset,
                "text": chunk.text,
            },
        )
    )
```

`target.declare_row` declares the row as a target state; CocoIndex handles upserting and deleting it to match. The `id` is [derived from the chunk's text](https://cocoindex.io/docs/common_resources/id_generation/), so unchanged chunks keep their id and embedding across runs.
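Why text-derived ids keep unchanged chunks stable across edits can be shown with a short sketch. `stable_ids` is a hypothetical helper that hashes the chunk text and adds an ordinal so repeated text still gets distinct ids; it is not how `IdGenerator` is implemented.

```python
import hashlib
from collections import Counter


def stable_ids(chunk_texts: list[str]) -> list[str]:
    """Assign each chunk an id derived from its text; repeats get an ordinal suffix."""
    seen: Counter[str] = Counter()
    ids: list[str] = []
    for text in chunk_texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        ids.append(f"{digest}-{seen[digest]}")
        seen[digest] += 1
    return ids


before = stable_ids(["intro", "attention", "outro"])
after = stable_ids(["intro", "attention (revised)", "outro"])  # middle chunk edited
```

Only the edited chunk gets a new id, so a managed target would only need to write that row and remove the old one.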

## Mount the namespace target

`app_main` mounts the namespace, then fans out one [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) per file. The vector schema comes straight from the embedder, so the namespace's dimension matches what we write.

```python title="main.py"
@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    target_namespace = await turbopuffer.mount_namespace_target(
        TPUF_DB,
        namespace_name=TPUF_NAMESPACE,
        schema=await turbopuffer.NamespaceSchema.create(
            vectors=turbopuffer.VectorDef(schema=EMBEDDER),
        ),
    )
    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]),
        live=True,  # watch for changes; pass -L to `cocoindex update` to run live
    )
    await coco.mount_each(process_file, files.items(), target_namespace)
```

`mount_namespace_target` creates and manages the Turbopuffer namespace for you: schema, idempotent upserts, and orphan cleanup when a file disappears. `live=True` makes the [filesystem source](https://cocoindex.io/docs/connectors/localfs/) [watch for changes](https://cocoindex.io/docs/programming_guide/live_mode/), and `mount_each` runs one component per file so the engine can track and update them independently.

## Setup and run

- A free [Turbopuffer](https://turbopuffer.com/) API key. Copy `.env.example` to `.env` and fill it in, or export the key in your shell (`TURBOPUFFER_REGION` defaults to `gcp-us-central1`):

  ```sh
  export TURBOPUFFER_API_KEY="tpuf_..."
  ```

- Install CocoIndex with the Turbopuffer and embedding extras:

  ```sh
  pip install -U "cocoindex[turbopuffer,sentence_transformers]" numpy python-dotenv
  ```

- A few `.md` files in a `markdown_files/` directory — grab the [sample file](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding_turbopuffer/markdown_files) from the repo or drop in your own.

Run the [`cocoindex` CLI](https://cocoindex.io/docs/cli/) to build and update the index, then query it from the command line:

```sh
# Catch-up run: scan, sync, exit
cocoindex update main

# Live run: keep watching for file changes
cocoindex update -L main

# Search the namespace
python main.py "what is self-attention?"
```

The query embeds your text with the *same* model and asks Turbopuffer for the nearest vectors with `rank_by=("vector", "ANN", ...)`, so indexing and querying stay consistent. The most semantically similar chunks come back ranked — even when they share none of the words in your query.
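
Conceptually, a nearest-vector query ranks rows by their distance to the query vector. The sketch below does this by brute force in plain Python. It is an illustration, not the Turbopuffer client: the `top_k` helper, the toy 2-d vectors, and the choice of cosine distance are all assumptions (the walkthrough does not show which metric the namespace uses), and an ANN index returns approximately this ranking without scanning every row.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: list[float], rows: list[dict], k: int = 2) -> list[str]:
    """Hypothetical helper: ids of the k rows closest to the query."""
    ranked = sorted(rows, key=lambda r: cosine_distance(query, r["vector"]))
    return [r["id"] for r in ranked[:k]]


rows = [
    {"id": "a", "vector": [1.0, 0.0]},
    {"id": "b", "vector": [0.9, 0.1]},
    {"id": "c", "vector": [0.0, 1.0]},
]
print(top_k([1.0, 0.0], rows))  # ['a', 'b']
```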

## Incremental updates

CocoIndex keeps the namespace in sync with your files and does the **minimum work** to get there — you never compute a diff. `@coco.fn(memo=True)` on `process_file` decides what to *recompute* (a file is skipped when its content and the function's code are unchanged), and `mount_namespace_target` decides what to *write* (each row's `id` is derived from its chunk's text, so only changed rows are upserted and rows whose source is gone are deleted).

- **A file is added** — only that file is chunked and embedded, and its rows are upserted. The rest is untouched.
- **A file is edited** — it is re-chunked; unchanged chunks keep their `id` and embedding, new chunks are embedded and upserted, and chunks that no longer exist are deleted.
- **A file is deleted** — its rows are removed from the namespace automatically.

The same machinery covers **logic** changes: tune the chunk size or swap the embedding model, and CocoIndex applies only the difference. A catch-up run (`cocoindex update main`) does this once and exits; live mode (`cocoindex update -L main`) keeps watching.
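
The write side can be pictured as a diff between the rows that existed after the previous run and the rows declared in this run. The following is a minimal sketch of that idea, not the engine's implementation; `reconcile` and the row shapes are hypothetical.

```python
def reconcile(
    previous: dict[str, dict], declared: dict[str, dict]
) -> tuple[dict[str, dict], list[str]]:
    """Hypothetical helper: compute the writes needed to reach `declared`."""
    # Upsert rows that are new or whose content differs from last time.
    upserts = {k: v for k, v in declared.items() if previous.get(k) != v}
    # Delete rows that were written before but are no longer declared.
    deletes = sorted(set(previous) - set(declared))
    return upserts, deletes


previous = {"id1": {"text": "intro"}, "id2": {"text": "old section"}}
declared = {"id1": {"text": "intro"}, "id3": {"text": "new section"}}

upserts, deletes = reconcile(previous, declared)
print(upserts)  # {'id3': {'text': 'new section'}}
print(deletes)  # ['id2']
```

The unchanged row `id1` produces no write at all, which is the "minimum work" property described above.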

## Run it

The full, runnable example is in the CocoIndex repo: [examples/text_embedding_turbopuffer](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding_turbopuffer). To swap the store for self-hosted Postgres instead, see the [base Semantic Search 101 example](https://cocoindex.io/docs/examples/text-embedding/). Questions? Come say hi in our [Discord](https://discord.com/invite/zpA9S2DR7s), and if this helped, a [star on GitHub](https://github.com/cocoindex-io/cocoindex) goes a long way.

---

# Example: Embed Markdown from Amazon S3

Source: https://cocoindex.io/docs/examples/amazon-s3-embedding/

![Embed Markdown from Amazon S3 with CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/amazon-s3-embedding/cover.png)

This is the [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/) example with one thing swapped: the source is an **Amazon S3 bucket** instead of a local directory. Everything downstream — chunking, embedding, the Postgres/pgvector target, and the query — is identical, so this post spends its words on the part that differs: the S3 connector, its env vars, and the catch-up run.

If you haven't read the base example yet, start there — it walks through the chunk-and-embed flow line by line. Here we'll move fast.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/amazon_s3_embedding)

## Flow overview

![CocoIndex Amazon S3 embedding flow: list Markdown objects from a bucket, split into chunks, embed each chunk, and store the vectors in Postgres with pgvector](https://cocoindex.io/blobs/docs-v1/img/examples/amazon-s3-embedding/flow-v1.png)

From a high level, these are the steps:

1. List Markdown objects from an [Amazon S3](https://cocoindex.io/docs/connectors/amazon_s3/) bucket (filtered by prefix and glob).
2. [Split each file into overlapping chunks](https://cocoindex.io/docs/ops/text/), then [embed](https://cocoindex.io/docs/ops/sentence_transformers/) every chunk.
3. Store the chunks and their embeddings in Postgres (as [target states](https://cocoindex.io/docs/programming_guide/target_state/)).

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

The chunk-and-embed half of this pipeline is unchanged from [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/) — same `RecursiveSplitter`, same [`SentenceTransformerEmbedder`](https://cocoindex.io/docs/ops/sentence_transformers/) with `all-MiniLM-L6-v2`, same `DocEmbedding` row written to pgvector. The only new piece is the connector.

## Provide an S3 client

The S3 connector needs an [aiobotocore](https://github.com/aio-libs/aiobotocore) client. We open it once in the [lifespan](https://cocoindex.io/docs/programming_guide/context/) alongside the Postgres pool and embedder, and share it through a `ContextKey`.

```python title="main.py"
DATABASE_URL = os.getenv("POSTGRES_URL", "postgres://cocoindex:cocoindex@localhost/cocoindex")
TABLE_NAME = "amazon_s3_doc_embeddings"
S3_BUCKET = os.environ["S3_BUCKET"]
S3_PREFIX = os.getenv("S3_PREFIX", "")

PG_DB = coco.ContextKey[asyncpg.Pool]("s3_embedding_db")
S3_CLIENT = coco.ContextKey[AioBaseClient]("s3_client")
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder", detect_change=True)


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    async with asyncpg.create_pool(DATABASE_URL) as pool:
        builder.provide(PG_DB, pool)
        builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))

        session = aiobotocore.session.get_session()
        async with session.create_client("s3") as s3_client:
            builder.provide(S3_CLIENT, s3_client)
            yield
```

`create_client("s3")` picks up standard AWS credentials — env vars, `~/.aws/credentials`, or an IAM role. Set `AWS_ENDPOINT_URL` to point at an S3-compatible service like MinIO.

## List objects from the bucket

`app_main` mounts the Postgres table exactly as in the base example, then swaps `localfs.walk_dir` for [`amazon_s3.list_objects`](https://cocoindex.io/docs/connectors/amazon_s3/) — same `path_matcher` glob, same `mount_each` fan-out.

```python title="main.py"
@coco.fn
async def app_main() -> None:
    target_table = await postgres.mount_table_target(
        PG_DB,
        table_name=TABLE_NAME,
        table_schema=await postgres.TableSchema.from_class(
            DocEmbedding, primary_key=["id"],
        ),
        pg_schema_name=PG_SCHEMA_NAME,
    )

    client = coco.use_context(S3_CLIENT)
    files = amazon_s3.list_objects(
        client,
        S3_BUCKET,
        prefix=S3_PREFIX,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]),
    )
    await coco.mount_each(process_file, files.items(), target_table)
```

`list_objects` yields one [`S3File`](https://cocoindex.io/docs/connectors/amazon_s3/) per matching object; `prefix` scopes the listing server-side, and the glob filters the rest. `process_file` then reads, chunks, and embeds each one — that code is identical to the base example, so [see it there](https://cocoindex.io/docs/examples/text-embedding/#process-a-file).
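
The two-stage filtering (prefix first, glob second) can be sketched with the standard library. This is an illustration rather than the connector's code: `select_keys` is a hypothetical helper, the prefix check stands in for what S3 does server-side, and `fnmatchcase` with `*.md` is only an approximation of the `**/*.md` matcher (its `*` also crosses `/`).

```python
from fnmatch import fnmatchcase


def select_keys(keys: list[str], prefix: str = "", pattern: str = "*.md") -> list[str]:
    """Hypothetical helper: scope by prefix, then filter by glob."""
    scoped = [k for k in keys if k.startswith(prefix)]      # stands in for the server-side prefix
    return [k for k in scoped if fnmatchcase(k, pattern)]   # client-side glob filter


keys = [
    "markdown_files/a.md",
    "markdown_files/sub/b.md",
    "markdown_files/c.txt",
    "other/d.md",
]
print(select_keys(keys, prefix="markdown_files/"))
# ['markdown_files/a.md', 'markdown_files/sub/b.md']
```

Scoping by prefix keeps the listing small; the glob then drops anything under that prefix that is not Markdown.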

## Setup

- A running Postgres with the [pgvector](https://github.com/pgvector/pgvector) extension.

  ```sh
  export POSTGRES_URL="postgres://cocoindex:cocoindex@localhost/cocoindex"
  ```

- An S3 bucket with some `.md` files, plus AWS credentials and the bucket name:

  ```sh
  export S3_BUCKET="my-bucket"
  export S3_PREFIX="markdown_files/"   # optional: scope the listing
  ```

- Install CocoIndex with the `amazon_s3` extra and the example's dependencies:

  ```sh
  pip install -U "cocoindex[amazon_s3,postgres,sentence_transformers]" asyncpg pgvector numpy python-dotenv
  ```

## Run the pipeline

The [`amazon_s3` source](https://cocoindex.io/docs/connectors/amazon_s3/) does not support live mode, so this is a one-shot catch-up run — scan the bucket, sync, exit:

```sh
cocoindex update main
```

Then search straight from the command line, reusing the same embedder so indexing and querying stay consistent:

```sh
python main.py "what is self-attention?"
```

The most semantically similar chunks come back ranked — even when they share none of the words in your query.

## Incremental updates

CocoIndex keeps the index in sync with the bucket and does the **minimum work** to get there. `@coco.fn(memo=True)` decides what to *recompute* — a file is skipped when its content and the function's code are both unchanged — and `mount_table_target` decides what to *write*, deriving each row's [`id`](https://cocoindex.io/docs/common_resources/id_generation/) from its chunk text so it upserts only the rows that actually changed and deletes rows whose source object is gone.

- **An object is added** — only that file is chunked and embedded; the rest is untouched.
- **An object is edited** — it is re-chunked; unchanged chunks keep their `id` and embedding, new chunks are embedded and inserted, and stale ones are deleted.
- **An object is deleted** — its rows are removed from the target automatically.

Because S3 is catch-up only, you re-run `cocoindex update main` to pick up bucket changes; the engine still applies just the difference.
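
The recompute decision can be pictured as a fingerprint check over the object's content and the function's code. The sketch below is a simplified model, not how `memo=True` is implemented; `process_if_changed`, the in-memory `_memo` dict, and the `code_version` string are hypothetical stand-ins.

```python
import hashlib

_memo: dict[str, str] = {}  # object key -> fingerprint seen on the last run


def fingerprint(content: bytes, code_version: str) -> str:
    """Combine the code version and the content into one digest."""
    return hashlib.sha256(code_version.encode() + b"\0" + content).hexdigest()


def process_if_changed(key: str, content: bytes, code_version: str = "v1") -> bool:
    """Hypothetical helper: return True when the object must be reprocessed."""
    fp = fingerprint(content, code_version)
    if _memo.get(key) == fp:
        return False  # same content and same code: skip
    _memo[key] = fp
    return True


results = [
    process_if_changed("a.md", b"hello"),                       # first run
    process_if_changed("a.md", b"hello"),                       # nothing changed
    process_if_changed("a.md", b"hello!"),                      # content edited
    process_if_changed("a.md", b"hello!", code_version="v2"),   # code changed
]
print(results)  # [True, False, True, True]
```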

## Run it

The full, runnable example is in the CocoIndex repo: [examples/amazon_s3_embedding](https://github.com/cocoindex-io/cocoindex/tree/main/examples/amazon_s3_embedding). Have a question or want to show what you built? Join us on [Discord](https://discord.com/invite/zpA9S2DR7s), and if CocoIndex saves you time, a [star on GitHub](https://github.com/cocoindex-io/cocoindex) helps others find it.

---

# Example: Semantic Search over Google Drive

Source: https://cocoindex.io/docs/examples/google-drive-embedding/

![Semantic Search over Google Drive with CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/google-drive-embedding/cover.png)

This is the [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/) example with exactly one thing swapped: instead of reading Markdown off your local disk, it reads documents straight from a **Google Drive** folder. Everything downstream — chunking, embedding, and storing the vectors in Postgres with pgvector — is identical, so this post spends its prose on the one piece that differs: the Drive connector.

The chunk-and-embed half is explained in full in the [base walkthrough](https://cocoindex.io/docs/examples/text-embedding/); read that first if you want the line-by-line tour. Here we focus on wiring [Google Drive](https://cocoindex.io/docs/connectors/google_drive/) as the source.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/gdrive_text_embedding)

## Flow overview

![CocoIndex Google Drive embedding flow: read files from a Drive folder, split into chunks, embed each chunk, and store the vectors in Postgres with pgvector](https://cocoindex.io/blobs/docs-v1/img/examples/google-drive-embedding/flow-v1.png)

From a high level, these are the steps:

1. List documents under one or more Google Drive folders (recursively), exporting Docs/Sheets/Slides to text.
2. [Split each file into overlapping chunks](https://cocoindex.io/docs/ops/text/), then [embed](https://cocoindex.io/docs/ops/sentence_transformers/) every chunk.
3. Store the chunks and their embeddings in Postgres (as [target states](https://cocoindex.io/docs/programming_guide/target_state/)).

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

## Connect to Google Drive

The Drive source needs two things: a Google Cloud **service-account JSON key** with Drive access, and the **folder id(s)** to index. CocoIndex reads both from environment variables, so nothing is hardcoded.

```python title="main.py"
from cocoindex.connectors import google_drive, postgres

credential_path = os.environ["GOOGLE_SERVICE_ACCOUNT_CREDENTIAL"]
root_folder_ids = [
    folder.strip()
    for folder in os.environ["GOOGLE_DRIVE_ROOT_FOLDER_IDS"].split(",")
    if folder.strip()
]

source = google_drive.GoogleDriveSource(
    service_account_credential_path=credential_path,
    root_folder_ids=root_folder_ids,
)
```

`GoogleDriveSource` walks each root folder recursively and yields one `DriveFile` per document. Native Google Docs, Sheets, and Slides are exported to text (Markdown, CSV, and plain text respectively); any other file is downloaded as-is. That's the whole connector — from here on, a `DriveFile` behaves like any other [`FileLike`](https://cocoindex.io/docs/connectors/localfs/), so `await file.read_text()` works just as it does for a local file.
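
The export rule described above amounts to a small lookup from a file's type to a text format. The sketch below models it with MIME types; it is not the connector's code, `export_mime_type` is a hypothetical helper, and the exact MIME strings are assumptions about how the rule maps onto the Drive API.

```python
from typing import Optional

# Assumed mapping: native Google formats and the text formats they export to.
EXPORT_MIME_TYPES = {
    "application/vnd.google-apps.document": "text/markdown",     # Docs -> Markdown
    "application/vnd.google-apps.spreadsheet": "text/csv",       # Sheets -> CSV
    "application/vnd.google-apps.presentation": "text/plain",    # Slides -> plain text
}


def export_mime_type(source_mime_type: str) -> Optional[str]:
    """Hypothetical helper: the export format, or None to download as-is."""
    return EXPORT_MIME_TYPES.get(source_mime_type)


print(export_mime_type("application/vnd.google-apps.document"))  # text/markdown
print(export_mime_type("application/pdf"))                       # None
```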

## Define the main function

`app_main` mounts the Postgres table, then fans out one [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) per Drive file with `mount_each`. The processing component is the same `process_file` from the base example — read the file, chunk it, embed each chunk, and `declare_row` a `DocEmbedding` per chunk.

```python title="main.py"
@coco.fn
async def app_main() -> None:
    table = await postgres.mount_table_target(
        PG_DB,
        table_name=TABLE_NAME,
        table_schema=await postgres.TableSchema.from_class(
            DocEmbedding,
            primary_key=["id"],
        ),
        pg_schema_name=PG_SCHEMA_NAME,
    )

    source = google_drive.GoogleDriveSource(
        service_account_credential_path=credential_path,
        root_folder_ids=root_folder_ids,
    )

    await coco.mount_each(process_file, source.items(), table)
```

`source.items()` yields `(key, file)` pairs keyed by the file's name path, which is exactly the shape [`mount_each`](https://cocoindex.io/docs/programming_guide/processing_component/) expects — so the engine tracks each Drive file as its own component and updates them independently. `mount_table_target` creates and manages the Postgres table: schema, idempotent upserts, and orphan cleanup when a file disappears from the folder.

## Setup

- A running Postgres with the [pgvector](https://github.com/pgvector/pgvector) extension:

  ```sh
  export POSTGRES_URL="postgres://cocoindex:cocoindex@localhost/cocoindex"
  ```

- A Google Cloud **service account** with Drive access, and the folder id(s) you want to index. Share the folders with the service account's email, then:

  ```sh
  export GOOGLE_SERVICE_ACCOUNT_CREDENTIAL="/path/to/service-account.json"
  export GOOGLE_DRIVE_ROOT_FOLDER_IDS="folder_id_1,folder_id_2"
  ```

- Install CocoIndex with the Google Drive extra and the dependencies this example uses:

  ```sh
  pip install -U "cocoindex[postgres,sentence_transformers,google_drive]" asyncpg pgvector numpy python-dotenv
  ```

## Run the pipeline

The Google Drive source does a one-shot catch-up (live mode isn't supported), so build the index with a single `cocoindex update`:

```sh
cocoindex update main
```

Then search straight from the command line, reusing the *same* embedder from the indexing flow so indexing and querying stay consistent:

```sh
python main.py "what is self-attention?"
```

The most semantically similar chunks come back ranked — even when they share none of the words in your query.

## Incremental updates

Just like the [base example](https://cocoindex.io/docs/examples/text-embedding/), CocoIndex does the **minimum work** to keep the index in sync. `@coco.fn(memo=True)` on `process_file` skips any Drive file whose content and the function's code are both unchanged, and `mount_table_target` derives each row's [`id`](https://cocoindex.io/docs/common_resources/id_generation/) from its chunk text, so only the rows that actually changed are upserted and rows whose source is gone are deleted.

- **A document is added to the folder** — only that file is chunked and embedded, and its rows are inserted.
- **A document is edited** — it is re-chunked; unchanged chunks keep their `id` and embedding, genuinely new chunks are embedded and inserted, and chunks that no longer exist are deleted.
- **A document is removed from the folder** — its rows are removed from the target automatically.

Because the Drive source is catch-up only, each `cocoindex update main` rescans the folders and applies exactly the difference.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/gdrive_text_embedding](https://github.com/cocoindex-io/cocoindex/tree/main/examples/gdrive_text_embedding). If you haven't yet, read [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/) for the line-by-line tour of chunking and embedding.

Found this useful? [Star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex) and come say hi in our [Discord](https://discord.com/invite/zpA9S2DR7s).

---

# Example: Embed OCI Object Storage

Source: https://cocoindex.io/docs/examples/oci-object-storage-embedding/

![Embed OCI Object Storage with CocoIndex V1](https://cocoindex.io/blobs/docs-v1/img/examples/oci-object-storage-embedding/cover.png)

This is the [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/) example with exactly one thing changed: instead of reading Markdown from a local folder, it lists Markdown objects from an [Oracle Cloud (OCI) Object Storage](https://cocoindex.io/docs/connectors/oci_object_storage/) bucket. The chunk → embed → store-in-pgvector half is identical, so the [base walkthrough](https://cocoindex.io/docs/examples/text-embedding/) covers the embedding model, the `DocEmbedding` row, and the query. Here we focus on the part that differs: the OCI client, the source call, and how the same flow goes **live** off OCI Streaming.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/oci_object_storage_embedding)

## Flow overview

![CocoIndex OCI Object Storage flow: list Markdown objects from a bucket, split into chunks, embed each chunk, and store the vectors in Postgres with pgvector](https://cocoindex.io/blobs/docs-v1/img/examples/oci-object-storage-embedding/flow-v1.png)

From a high level, these are the steps:

1. List Markdown objects from an OCI Object Storage bucket (optionally under a prefix).
2. [Split each object into overlapping chunks](https://cocoindex.io/docs/ops/text/), then [embed](https://cocoindex.io/docs/ops/sentence_transformers/) every chunk.
3. Store the chunks and their embeddings in Postgres (as [target states](https://cocoindex.io/docs/programming_guide/target_state/)).

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python, without worrying about how updates propagate. Think: **target_state = transformation(source_state)**.

## Provide the OCI client

The OCI SDK is synchronous and you create the client yourself, so we build one from a config-file profile and hand it to the [context](https://cocoindex.io/docs/programming_guide/context/) alongside the Postgres pool and the embedder. The connector wraps every SDK call in `asyncio.to_thread`, so passing the sync client through is fine.

```python title="main.py"
def _build_oci_client() -> ObjectStorageClient:
    config = oci.config.from_file(
        file_location=os.path.expanduser(OCI_CONFIG_FILE),
        profile_name=OCI_PROFILE,
    )
    return ObjectStorageClient(config)


OCI_CLIENT = coco.ContextKey[ObjectStorageClient]("oci_object_storage_client")


@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    async with asyncpg.create_pool(DATABASE_URL) as pool:
        builder.provide(PG_DB, pool)
        builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
        builder.provide(OCI_CLIENT, _build_oci_client())
        yield
```

Everything downstream of this — `DocEmbedding`, `process_file`, `process_chunk` — is the same chunk-and-embed code as the [base example](https://cocoindex.io/docs/examples/text-embedding/), so we won't repeat it here. The one small difference is the source type: `process_file` takes an `oci_object_storage.OCIFile` and reads its text with `await file.read_text()`, just like the local `FileLike`.

## List objects from the bucket

`app_main` mounts the Postgres table, then walks the bucket with [`list_objects`](https://cocoindex.io/docs/connectors/oci_object_storage/). It is the OCI analogue of `localfs.walk_dir` — give it the client, the namespace, the bucket, an optional prefix, and a path matcher, and it yields one `OCIFile` per matching object.

```python title="main.py"
@coco.fn
async def app_main() -> None:
    target_table = await postgres.mount_table_target(
        PG_DB,
        table_name=TABLE_NAME,
        table_schema=await postgres.TableSchema.from_class(
            DocEmbedding, primary_key=["id"],
        ),
        pg_schema_name=PG_SCHEMA_NAME,
    )

    client = coco.use_context(OCI_CLIENT)

    # Live mode is opt-in: build a LiveStream[bytes] from OCI Streaming if configured.
    consumer = _build_streaming_consumer()
    live_stream = None
    if consumer is not None and OCI_STREAMING_TOPIC is not None:
        live_stream = kafka.topic_as_stream(consumer, [OCI_STREAMING_TOPIC]).payloads()

    files = oci_object_storage.list_objects(
        client,
        OCI_NAMESPACE,
        OCI_BUCKET,
        prefix=OCI_PREFIX,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]),
        live_stream=live_stream,
    )
    await coco.mount_each(process_file, files.items(), target_table)
```

`mount_each` runs one [processing component](https://cocoindex.io/docs/programming_guide/processing_component/) per object so the engine tracks and updates each independently. With `live_stream=None` (the default), `list_objects` does a one-shot catch-up scan. Pass a stream and it keeps watching — that's the next section.

## Live mode via OCI Streaming

OCI Streaming is Kafka-protocol-compatible, so live updates ride the [Kafka connector](https://cocoindex.io/docs/connectors/kafka/). When the four required `OCI_STREAMING_*` env vars are set (bootstrap servers, topic, username, and auth token), we build a `confluent_kafka.aio.AIOConsumer` with `SASL_SSL` + `PLAIN` auth, wrap it with `kafka.topic_as_stream(...).payloads()` to get a `LiveStream[bytes]`, and pass it to `list_objects`. The connector snapshots a cutoff before the scan, runs the scan and stream concurrently, and for each post-cutoff event re-reads the object to apply an authoritative update or delete. See the [OCI Object Storage connector docs](https://cocoindex.io/docs/connectors/oci_object_storage/) for the details.

```python title="main.py"
def _build_streaming_consumer() -> AIOConsumer | None:
    if not (
        OCI_STREAMING_BOOTSTRAP_SERVERS and OCI_STREAMING_TOPIC
        and OCI_STREAMING_USERNAME and OCI_STREAMING_AUTH_TOKEN
    ):
        return None
    return AIOConsumer({
        "bootstrap.servers": OCI_STREAMING_BOOTSTRAP_SERVERS,
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        "sasl.username": OCI_STREAMING_USERNAME,
        "sasl.password": OCI_STREAMING_AUTH_TOKEN,
        "group.id": OCI_STREAMING_GROUP_ID,
        "auto.offset.reset": "earliest",
        "enable.auto.commit": False,
    })
```
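
The cutoff-and-re-read behavior can be modeled in a few lines. This is a simplified, sequential sketch under stated assumptions, not the connector's implementation: the real connector runs the scan and the stream concurrently, and `apply_events`, the integer timestamps, and the dict-based bucket are hypothetical.

```python
def apply_events(
    index: dict[str, str],
    bucket: dict[str, str],
    events: list[tuple[int, str]],
    cutoff: int,
) -> dict[str, str]:
    """Hypothetical helper: apply change events newer than the scan cutoff."""
    for timestamp, key in events:
        if timestamp <= cutoff:
            continue  # the catch-up scan already reflects this change
        if key in bucket:
            index[key] = bucket[key]  # re-read the object: current state wins
        else:
            index.pop(key, None)      # object is gone: delete its entry
    return index


bucket = {"a.md": "v1", "b.md": "v1"}
cutoff = 10
index = dict(bucket)  # catch-up scan taken at the cutoff

# Changes that happen after the scan.
bucket["a.md"] = "v2"
del bucket["b.md"]
events = [(8, "a.md"), (11, "a.md"), (12, "b.md")]

print(apply_events(index, bucket, events, cutoff))  # {'a.md': 'v2'}
```

Re-reading the object instead of trusting the event payload means a late or duplicated event cannot push stale data into the index.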

## Setup and run

- A running Postgres with the [pgvector](https://github.com/pgvector/pgvector) extension:

  ```sh
  export POSTGRES_URL="postgres://cocoindex:cocoindex@localhost/cocoindex"
  ```

- An OCI Object Storage bucket with Markdown objects, and an [OCI config file](https://docs.oracle.com/en-us/iaas/Content/API/Concepts/sdkconfig.htm) (default `~/.oci/config`) with API-key auth. Copy `.env.example` to `.env` and fill in `OCI_NAMESPACE`, `OCI_BUCKET`, and an optional `OCI_PREFIX`.

- Install CocoIndex with the OCI, Kafka, Postgres, and embedding extras:

  ```sh
  pip install -U "cocoindex[oci,kafka,postgres,sentence_transformers]" asyncpg pgvector numpy python-dotenv
  ```

Build the index — catch-up (scan, sync, exit) or live (catch up, then keep watching the OCI Streaming topic):

```sh
# Catch-up run
cocoindex update main

# Live run: keep watching OCI Streaming for change events
cocoindex update -L main
```

Then search straight from the command line:

```sh
python main.py "what is self-attention?"
```

This example keeps it minimal and doesn't declare a vector index, so queries do a sequential scan — fine for a few objects. For a larger corpus, add `target_table.declare_vector_index(column="embedding")` exactly as [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/) does.

## Incremental updates

Incrementality works the same as the [base example](https://cocoindex.io/docs/examples/text-embedding/): `@coco.fn(memo=True)` skips an object whose content and code are unchanged, and `mount_table_target` upserts only the rows that actually changed and deletes rows whose source is gone.

- **An object is added** — only that object is chunked and embedded; its rows are inserted.
- **An object is updated** — it is re-chunked; unchanged chunks keep their `id` and embedding, new chunks are embedded and inserted, and vanished chunks are deleted.
- **An object is deleted** — its rows are removed from the target automatically.

In catch-up mode CocoIndex discovers these by re-scanning the bucket; in live mode the OCI Streaming events drive the same updates with low latency, no full re-scan needed.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/oci_object_storage_embedding](https://github.com/cocoindex-io/cocoindex/tree/main/examples/oci_object_storage_embedding). If you're starting from a local folder instead of a bucket, [Semantic Search 101](https://cocoindex.io/docs/examples/text-embedding/) is the same flow on the local filesystem.

If this helped, [give CocoIndex a star on GitHub](https://github.com/cocoindex-io/cocoindex) and come say hi on [Discord](https://discord.com/invite/zpA9S2DR7s).

---

# Example: Build Your Own Face Search

Source: https://cocoindex.io/docs/examples/face-recognition/

![Index faces for visual search: build your own Google Photos face search with CocoIndex](https://cocoindex.io/blobs/docs-v1/img/examples/face-recognition/cover.png)

We'll take a folder of photos and make them searchable *by face* — the core of "find every photo of this person." For each image we detect every face, crop it, embed it into a 128-d vector with [`face_recognition`](https://github.com/ageitgey/face_recognition) (dlib), and index the faces in [Qdrant](https://qdrant.tech/). Then a query face finds the nearest indexed faces — the same person across different photos lands close together.

The whole pipeline is ordinary `async` Python and your own types. The heavy lifting ([incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/), change tracking, the managed Qdrant collection) runs in a Rust engine underneath, so adding a photo only re-detects that photo. The slow detection and embedding steps run on a [GPU runner](https://cocoindex.io/docs/programming_guide/function/) instead of blocking the event loop.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/face_recognition)

## Flow overview

![CocoIndex face recognition flow: walk a folder of images, detect every face, embed each into a 128-d vector, and index one Qdrant point per face](https://cocoindex.io/blobs/docs-v1/img/examples/face-recognition/flow-v1.png)

Unlike a one-embedding-per-image index, an image here fans out to **many** faces — so the shape is *image → N faces → N points*:

1. Read image files from a local directory (live).
2. Detect every face in each image and crop it.
3. Embed each face into a 128-d vector and store one Qdrant [point](https://cocoindex.io/docs/connectors/qdrant/) per face, with the source filename and bounding box.

You [declare the transformation logic](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python; CocoIndex works out what to insert, update, and delete. Think: **target_state = transformation(source_state)**.

## Detect and embed faces

Face detection and embedding are synchronous, CPU/GPU-heavy dlib calls, so each is wrapped with [`@coco.fn.as_async(runner=coco.GPU)`](https://cocoindex.io/docs/programming_guide/function/) to run on a dedicated [GPU runner](https://cocoindex.io/docs/programming_guide/function/) without blocking the async loop. `extract_faces` returns one `Face` (bounding box + cropped PNG) per detected face:

```python title="main.py"
@dataclass
class Face:
    rect: ImageRect       # bounding box in the original image
    image: bytes          # the cropped face, as PNG


@coco.fn.as_async(runner=coco.GPU)
def extract_faces(content: bytes) -> list[Face]:
    orig = Image.open(io.BytesIO(content)).convert("RGB")
    # The CNN detector is slow on large images, so downscale, then map boxes back.
    img, ratio = _downscale(orig, MAX_IMAGE_WIDTH)
    faces = []
    for top, right, bottom, left in face_recognition.face_locations(np.array(img), model="cnn"):
        rect = ImageRect(int(left*ratio), int(top*ratio), int(right*ratio), int(bottom*ratio))
        buf = io.BytesIO()
        orig.crop((rect.min_x, rect.min_y, rect.max_x, rect.max_y)).save(buf, format="PNG")
        faces.append(Face(rect=rect, image=buf.getvalue()))
    return faces


@coco.fn.as_async(runner=coco.GPU)
def embed_face(face_png: bytes) -> list[float]:
    img = Image.open(io.BytesIO(face_png)).convert("RGB")
    return face_recognition.face_encodings(
        np.array(img), known_face_locations=[(0, img.width - 1, img.height - 1, 0)]
    )[0].tolist()
```

`face_recognition.face_encodings` returns a **128-d** vector. Faces of the same person sit close in this space — dlib's own rule of thumb is that a Euclidean distance under ~0.6 means "same person," which is why we index with Euclidean distance below.
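
As a small illustration of that rule of thumb, the sketch below compares vectors with `math.dist` and the 0.6 threshold. The `is_same_person` helper is hypothetical, and the 3-d vectors are toy values standing in for real 128-d encodings.

```python
import math

SAME_PERSON_THRESHOLD = 0.6  # rule-of-thumb cutoff from the text above


def is_same_person(a: list[float], b: list[float],
                   threshold: float = SAME_PERSON_THRESHOLD) -> bool:
    """Hypothetical helper: True when two encodings are closer than the threshold."""
    return math.dist(a, b) < threshold


# Toy 3-d vectors; real face encodings have 128 dimensions.
alice_1 = [0.10, 0.20, 0.30]
alice_2 = [0.12, 0.18, 0.33]
bob = [0.90, 0.10, 0.70]

print(is_same_person(alice_1, alice_2))  # True
print(is_same_person(alice_1, bob))      # False
```

The threshold is a starting point; a stricter value trades missed matches for fewer false ones.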

## Process an image → fan out to faces

`process_file` runs once per image: detect its faces, then map each face through `process_face`, which embeds it and declares one Qdrant point. The point id is a stable hash of the file path plus the bounding box, so re-running never duplicates and an edited photo reconciles cleanly:

```python title="main.py"
@coco.fn
async def process_face(face: Face, filename: str, target: qdrant.CollectionTarget) -> None:
    embedding = await embed_face(face.image)
    target.declare_point(
        qdrant.PointStruct(
            id=_face_id(filename, face.rect),     # uuid5 of (filename, box) — stable
            vector=embedding,
            payload={"filename": filename, "min_x": face.rect.min_x, "min_y": face.rect.min_y,
                     "max_x": face.rect.max_x, "max_y": face.rect.max_y},
        )
    )


@coco.fn(memo=True)
async def process_file(file: FileLike, target: qdrant.CollectionTarget) -> None:
    faces = await extract_faces(await file.read())
    await coco.map(process_face, faces, str(file.file_path.path), target)
```

![One process_file component per image, fanned out with mount_each: each image is detected, every face embedded, and one Qdrant point declared per face](https://cocoindex.io/blobs/docs-v1/img/examples/face-recognition/stage-file-process.png)

[`@coco.fn(memo=True)`](https://cocoindex.io/docs/programming_guide/function/) makes it [incremental](https://cocoindex.io/docs/advanced_topics/memoization_keys/): an unchanged photo is never re-detected. Each image is its own [processing component](https://cocoindex.io/docs/programming_guide/processing_component/), so deleting a photo removes all its faces from Qdrant automatically. [`coco.map`](https://cocoindex.io/docs/programming_guide/app/) fans out one `process_face` per detected face — the multi-face equivalent of chunking a document.
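
The `_face_id` helper is not shown above. A minimal sketch of one way to build such an id follows, assuming a fixed UUID namespace and a simple `filename:box` key string; both are illustrative choices, not necessarily what the example repo uses.

```python
import uuid

# Assumption: any fixed namespace works, as long as it never changes between runs.
FACE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "face-recognition-example")


def face_id(filename: str, box: tuple[int, int, int, int]) -> str:
    """Derive a deterministic point id from the file path and bounding box."""
    min_x, min_y, max_x, max_y = box
    key = f"{filename}:{min_x},{min_y},{max_x},{max_y}"
    # uuid5 hashes (namespace, key), so equal inputs always give the same id.
    return str(uuid.uuid5(FACE_NAMESPACE, key))


first = face_id("images/group.jpg", (10, 20, 110, 140))
again = face_id("images/group.jpg", (10, 20, 110, 140))
other = face_id("images/group.jpg", (200, 20, 300, 140))
print(first == again, first == other)
```

A deterministic id is what lets a re-run overwrite the same point rather than append a new one; a random `uuid4` would create duplicates on every run.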

## Define the main function

`app_main` mounts the Qdrant collection sized to the 128-d face vector with **Euclidean** distance, then walks the image folder and mounts one component per file:

```python title="main.py"
@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    target_collection = await qdrant.mount_collection_target(
        QDRANT_DB,
        collection_name=QDRANT_COLLECTION,   # "face_embeddings"
        schema=await qdrant.CollectionSchema.create(
            vectors=qdrant.QdrantVectorDef(
                schema=VectorSchema(dtype=np.dtype(np.float32), size=128),
                distance="euclid",            # dlib encodings compare by L2 distance
            )
        ),
    )
    files = localfs.walk_dir(
        sourcedir, recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.jpg", "**/*.jpeg", "**/*.png"]),
        live=True,
    )
    await coco.mount_each(process_file, files.items(), target_collection)


app = coco.App(coco.AppConfig(name="FaceRecognitionV1"), app_main, sourcedir=pathlib.Path("./images"))
```

## Run the pipeline

You'll need [Qdrant](https://qdrant.tech/) and the `face_recognition` library (it depends on dlib).

```sh
docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
export QDRANT_URL="http://localhost:6334/"
pip install -e .   # cocoindex[qdrant], face-recognition, numpy, pillow
```

The example ships a handful of famous group photos in `images/` (the 1927 Solvay physics conference, Steve Jobs & Bill Gates, …). Build the index:

```sh
cocoindex update main        # or: cocoindex update -L main   (keep watching the folder)
```

On the sample set this indexes **36 faces** — 29 from the Solvay conference alone — each as a Qdrant point keyed by `(filename, bounding box)`.

## Search by face

Embed a query face the same way and search Qdrant for the nearest indexed faces:

```python title="main.py"
def query(image_path: str, *, top_k: int = 5) -> None:
    arr = np.array(Image.open(image_path).convert("RGB"))
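    locs = face_recognition.face_locations(arr, model="cnn")
    if not locs:
        print("No face detected in the query image.")
        return
    # Encode only the first detected face and use it as the query vector.
    query_vec = face_recognition.face_encodings(arr, known_face_locations=locs[:1])[0].tolist()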
    client = qdrant.create_client(qdrant_url(), prefer_grpc=True)
    for r in _qdrant_search(client, query_vec, top_k):
        print(f"[{r.score:.3f}] {(r.payload or {}).get('filename')}")
```

```sh
python main.py query images/einplanck3.jpg
```

Because Einstein appears in *both* the Einstein–Planck photo and the Solvay conference, the query pulls his Solvay face back as a close match — a Euclidean distance around `0.46`, comfortably under dlib's ~0.6 same-person threshold. That's face recognition across photos, with no labels or tags: the bounding box in the payload even tells you *where* in the source image the match is.

## Incremental updates

- **Add a photo** — only that image is detected and embedded; its faces are upserted.
- **Replace a photo** — faces whose box is unchanged keep their point; new faces are added, vanished faces are deleted.
- **Delete a photo** — every face from it is removed from Qdrant.
- **Re-run with nothing changed** — zero detection, zero embedding.

The expensive part (CNN detection + embedding) is fully memoized, so iterating on the downstream schema or query never re-runs the models on unchanged photos.
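
To make the "nothing changed, nothing re-runs" behaviour concrete, here is a toy, in-memory model of content-keyed memoization. It is only a sketch: the `ContentMemo` class is invented for illustration, and CocoIndex's real memoization is handled by its engine and covers more than this (for example, it persists across runs).

```python
import hashlib
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class ContentMemo(Generic[T]):
    """Toy cache: results are keyed by a hash of the input bytes."""

    def __init__(self, fn: Callable[[bytes], T]) -> None:
        self._fn = fn
        self._cache: dict[str, T] = {}
        self.calls = 0  # how many times the wrapped function really ran

    def __call__(self, content: bytes) -> T:
        key = hashlib.sha256(content).hexdigest()
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._fn(content)
        return self._cache[key]


def expensive_detect(content: bytes) -> int:
    """Stand-in for CNN detection; here it just returns the byte count."""
    return len(content)


detect = ContentMemo(expensive_detect)
detect(b"photo-a")
detect(b"photo-a")  # same bytes: served from the cache
detect(b"photo-b")  # new bytes: runs the function again
print(detect.calls)
```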

## Run it

The full, runnable example is in the CocoIndex repo: [examples/face_recognition](https://github.com/cocoindex-io/cocoindex/tree/main/examples/face_recognition). For text-driven image search instead of face matching, see [Search Images by Text](https://github.com/cocoindex-io/cocoindex/tree/main/examples/image_search) (CLIP) — same Qdrant target, a different encoder.

Got a photo library you want to make searchable by face? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: Product Recommendation Graph

Source: https://cocoindex.io/docs/examples/product-recommendation/

![Build a product recommendation graph with LLM taxonomy extraction and CocoIndex](https://cocoindex.io/blobs/docs-v1/img/examples/product-recommendation/cover.png)

A pile of product listings has the recommendations hiding in plain sight — a *pen* pairs with *ink refills* and a *notebook*; a *monitor* pairs with a *stand* and an *HDMI cable*. But that knowledge is locked in prose. In this tutorial we'll build a [CocoIndex](https://github.com/cocoindex-io/cocoindex) pipeline that turns a folder of product JSON into a [Neo4j](https://neo4j.com/) graph: an LLM tags each product with what it *is* and what *complements* it, and the shared taxonomy nodes turn into a recommendation engine you can query in Cypher.

The whole pipeline is ordinary `async` Python and your own types. The heavy lifting — [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/), change tracking, managed graph targets — runs in a Rust engine underneath, so editing one product re-extracts only that product and the graph reconciles itself.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/product_recommendation)

## What we're building

Two node types, two relationship types:

- **`Product`** nodes — one per listing (title, price).
- **`Taxonomy`** nodes — one per distinct label (`gel pen`, `notebook`, `ink refill`), keyed by value and *shared* across products.
- **`PRODUCT_TAXONOMY`** edges — `Product → Taxonomy`: what the product is.
- **`PRODUCT_COMPLEMENTARY_TAXONOMY`** edges — `Product → Taxonomy`: what a buyer might also need.

The recommendation falls out of the graph: products whose *complementary* taxonomy matches another product's *is-a* taxonomy are things to recommend together.
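
The matching rule can be sketched in plain Python before any graph is involved. The product ids and labels below are hypothetical sample data, and `recommend` is an illustrative helper, not part of the example code.

```python
# Hypothetical sample data: what each product is, and what it pairs with.
is_a: dict[str, set[str]] = {
    "gel-pen-01": {"gel pen"},
    "notebook-01": {"notebook"},
    "refill-01": {"ink refill"},
    "monitor-01": {"monitor"},
}
complementary: dict[str, set[str]] = {
    "gel-pen-01": {"notebook", "ink refill"},
    "notebook-01": {"gel pen"},
    "refill-01": {"gel pen"},
    "monitor-01": {"hdmi cable"},
}


def recommend(product_id: str) -> list[str]:
    """Products whose is-a labels overlap this product's complementary labels."""
    needs = complementary.get(product_id, set())
    return sorted(
        other
        for other, labels in is_a.items()
        if other != product_id and labels & needs
    )


print(recommend("gel-pen-01"))
print(recommend("monitor-01"))
```

The monitor gets no recommendation here because no sample product is tagged `hdmi cable`; a complementary label only produces a match once some product carries it as an is-a label.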

## Pipeline overview

![CocoIndex flow: per-product LLM taxonomy extraction declaring Product nodes, then a single graph pass declaring the shared Taxonomy nodes and the two relationship types into Neo4j](https://cocoindex.io/blobs/docs-v1/img/examples/product-recommendation/flow-v1.png)

The taxonomy labels are shared across products, so — like the [docs knowledge graph](https://cocoindex.io/docs/examples/docs-to-knowledge-graph/) — the pipeline runs in two phases:

1. **Per-product extraction.** For each product, render its details to Markdown, LLM-extract the taxonomies and complementary taxonomies, declare the `Product` node, and carry the labels forward.
2. **Graph building.** One pass declares the deduplicated `Taxonomy` nodes and the two relationship types across all products.

You [declare the transformation](https://cocoindex.io/docs/programming_guide/core_concepts/) with native Python; CocoIndex works out what to insert, update, and delete. Think: **target_state = transformation(source_state)**.
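
One way to picture "works out what to insert, update, and delete" is a diff between the previously stored target state and the newly declared one. The `reconcile` function below is a simplified illustration over plain dictionaries, not CocoIndex's actual engine logic.

```python
from typing import Any


def reconcile(
    current: dict[str, Any], desired: dict[str, Any]
) -> tuple[dict[str, Any], dict[str, Any], list[str]]:
    """Return (inserts, updates, deletes) needed to turn `current` into `desired`."""
    inserts = {k: v for k, v in desired.items() if k not in current}
    updates = {k: v for k, v in desired.items() if k in current and current[k] != v}
    deletes = sorted(k for k in current if k not in desired)
    return inserts, updates, deletes


# Keys are record ids; values are the record contents.
current_state = {"a": 1, "b": 2, "c": 3}
desired_state = {"a": 1, "b": 20, "d": 4}

inserts, updates, deletes = reconcile(current_state, desired_state)
print(inserts, updates, deletes)
```

Record `a` appears in neither result set because it is unchanged, which is the property that keeps incremental runs cheap.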

## LLM taxonomy extraction

The extraction schema is two lists, and the field descriptions do the prompting — "what it is" vs. "what pairs with it":

```python title="main.py"
class ProductTaxonomy(pydantic.BaseModel):
    name: str = pydantic.Field(
        description="A concise noun for the product's core functionality — lowercase, "
        "specific ('pen', 'printer'), not broad ('office supplies')."
    )


class ProductTaxonomyInfo(pydantic.BaseModel):
    taxonomies: list[ProductTaxonomy] = pydantic.Field(description="What this product is.")
    complementary_taxonomies: list[ProductTaxonomy] = pydantic.Field(
        description="Taxonomies for complementary products a buyer might also need."
    )


@coco.fn(memo=True)
async def extract_taxonomy(detail: str) -> ProductTaxonomyInfo:
    client = instructor.from_litellm(litellm.acompletion, mode=instructor.Mode.JSON)
    result = await client.chat.completions.create(
        model=coco.use_context(LLM_MODEL),
        response_model=ProductTaxonomyInfo,
        messages=[{"role": "system", "content": TAXONOMY_PROMPT}, {"role": "user", "content": detail}],
    )
    return ProductTaxonomyInfo.model_validate(result.model_dump())
```

Extraction is [instructor](https://github.com/instructor-ai/instructor) over [LiteLLM](https://docs.litellm.ai/) — swap `LLM_MODEL` for any provider. [`@coco.fn(memo=True)`](https://cocoindex.io/docs/programming_guide/function/) caches each extraction by content, so re-running re-tags only changed products.

## Phase 1: per-product extraction

`process_file` renders the product JSON to Markdown (a Jinja template), declares the `Product` node, extracts the taxonomies, and returns the labels for phase 2:

```python title="main.py"
@coco.fn(memo=True)
async def process_file(file: FileLike, product_table: neo4j.TableTarget[Product]) -> ProductTaxonomies:
    raw = json.loads(await file.read_text())
    product_id = file.file_path.path.name.removesuffix(".json")
    price = float(str(raw["price"]).lstrip("$").replace(",", ""))
    product_table.declare_record(row=Product(id=product_id, title=raw["title"], price=price))

    info = await extract_taxonomy(PRODUCT_TEMPLATE.render(**raw))
    return ProductTaxonomies(
        product_id=product_id,
        taxonomies=[t.name for t in info.taxonomies],
        complementary=[t.name for t in info.complementary_taxonomies],
    )
```
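
The snippet relies on two record types and an inline price cleanup that are not shown. A hedged sketch follows: the field names come from the calls above, while the dataclass form, the field types, and the `parse_price` helper name are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    """Assumed shape of the Product node record."""
    id: str
    title: str
    price: float


@dataclass
class ProductTaxonomies:
    """Assumed shape of the labels carried into phase 2."""
    product_id: str
    taxonomies: list[str] = field(default_factory=list)
    complementary: list[str] = field(default_factory=list)


def parse_price(raw_price: object) -> float:
    """Same cleanup as in process_file: drop a leading '$' and thousands commas."""
    return float(str(raw_price).lstrip("$").replace(",", ""))


product_id = "gel-pen-01.json".removesuffix(".json")
product = Product(id=product_id, title="Gel Pen", price=parse_price("$1,299.99"))
print(product)
```

Converting through `str` first means the same helper accepts prices that arrive as numbers as well as formatted strings.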

## Phase 2: build the graph

`Taxonomy` nodes are shared, so they're owned by one graph pass — it declares the deduplicated node set and the two relationship types:

```python title="main.py"
@coco.fn
async def build_graph(products, taxonomy_table, product_taxonomy_rel, complementary_rel) -> None:
    labels = {t for p in products for t in (*p.taxonomies, *p.complementary)}
    for value in labels:
        taxonomy_table.declare_record(row=Taxonomy(value=value))

    for p in products:
        for t in set(p.taxonomies):
            product_taxonomy_rel.declare_relation(from_id=p.product_id, to_id=t)
        for t in set(p.complementary):
            complementary_rel.declare_relation(from_id=p.product_id, to_id=t)
```

Both relationship types carry no payload, so the [Neo4j connector](https://cocoindex.io/docs/connectors/neo4j/) derives each edge's identity from its `(Product, Taxonomy)` endpoints — one edge per pair, no duplicates.
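
The effect of keying edges by their endpoints can be mimicked with sets of tuples. This is a plain-Python stand-in with made-up sample data; the connector's own identity handling is not shown here.

```python
def collect_graph(
    products: list[dict],
) -> tuple[set[str], set[tuple[str, str]], set[tuple[str, str]]]:
    """Collect distinct taxonomy labels and distinct (product, label) edges."""
    labels: set[str] = set()
    is_a_edges: set[tuple[str, str]] = set()
    complementary_edges: set[tuple[str, str]] = set()
    for p in products:
        for t in p["taxonomies"]:
            labels.add(t)
            is_a_edges.add((p["product_id"], t))
        for t in p["complementary"]:
            labels.add(t)
            complementary_edges.add((p["product_id"], t))
    return labels, is_a_edges, complementary_edges


# Hypothetical input: the LLM repeated "gel pen" for p1.
sample = [
    {"product_id": "p1", "taxonomies": ["gel pen", "gel pen"], "complementary": ["notebook"]},
    {"product_id": "p2", "taxonomies": ["notebook"], "complementary": ["gel pen"]},
]

labels, is_a_edges, complementary_edges = collect_graph(sample)
print(len(labels), len(is_a_edges), len(complementary_edges))
```

The repeated label collapses into a single edge because the tuple `("p1", "gel pen")` can only appear once in a set, which mirrors the one-edge-per-pair behaviour described above.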

## Run the pipeline

```sh
docker run -d -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/cocoindex --name cocoindex-neo4j neo4j:5.26-community
cp .env.example .env   # set OPENAI_API_KEY (or LLM_MODEL=ollama/llama3.2)
pip install -e .
cocoindex update main
```

The example ships a `products/` folder of sample listings (pens, notebooks, monitors, …). Running it builds the graph — on the 9 sample products that's **9 `Product` nodes, ~40 `Taxonomy` nodes**, and the two edge types wired up.

## Explore the recommendations

Open [Neo4j Browser](http://localhost:7474) (`neo4j` / `cocoindex`) and ask the graph for recommendations:

```cypher
// Products tagged "gel pen", and the taxonomies that pair with each
MATCH (p:Product)-[:PRODUCT_TAXONOMY]->(:Taxonomy {value: "gel pen"})
MATCH (p)-[:PRODUCT_COMPLEMENTARY_TAXONOMY]->(c:Taxonomy)
RETURN p.title, collect(c.value) AS also_needs;

// Recommend products to pair with anything that is a "gel pen":
// find products whose *is-a* taxonomy matches a gel pen's *complementary* taxonomy
MATCH (:Taxonomy {value: "gel pen"})<-[:PRODUCT_TAXONOMY]-(:Product)
      -[:PRODUCT_COMPLEMENTARY_TAXONOMY]->(need:Taxonomy)
MATCH (rec:Product)-[:PRODUCT_TAXONOMY]->(need)
RETURN DISTINCT rec.title
```

On the sample data, recommending for a pen surfaces the notepad and the multipurpose paper — exactly the cross-sell you'd want.

## Incremental updates

- **Edit a product** — only that product re-extracts; the graph pass re-runs and diffs, adding new taxonomy nodes/edges and removing ones no longer supported anywhere.
- **Add a product** — one extraction plus the graph diff.
- **Delete a product** — its `Product` node and edges are cleaned up; taxonomies only it introduced disappear on the next pass.
- **Swap the LLM** — `LLM_MODEL` has `detect_change=True`, so changing it re-extracts everything against the new model with no cache to clear.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/product_recommendation](https://github.com/cocoindex-io/cocoindex/tree/main/examples/product_recommendation). For a concept graph over prose docs instead of products, see [Turn Docs into a Knowledge Graph](https://cocoindex.io/docs/examples/docs-to-knowledge-graph/).

Got a product catalog you want to turn into a recommendation graph? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: Manuals to Structured Data

Source: https://cocoindex.io/docs/examples/manuals-llm-extraction/

![Extract structured data from PDF manuals with docling + an LLM and CocoIndex](https://cocoindex.io/blobs/docs-v1/img/examples/manuals-llm-extraction/cover.png)

Manuals, datasheets, and reference docs are full of structure — classes, functions, parameters, defaults — laid out for humans, not machines. In this tutorial we'll build a [CocoIndex](https://github.com/cocoindex-io/cocoindex) pipeline that pulls that structure out: convert each PDF manual to Markdown with [docling](https://github.com/docling-project/docling), LLM-extract a typed summary of the module it documents, and store the result in Postgres. The sample manuals are the reference docs for a few Python standard-library modules.

The whole pipeline is ordinary `async` Python and your own types. The heavy PDF parse runs on a [GPU runner](https://cocoindex.io/docs/programming_guide/function/), and the Rust engine handles [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/) — edit one manual and only that one is re-parsed and re-extracted.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/manuals_llm_extraction)

## Flow overview

![CocoIndex flow: walk a folder of PDF manuals, convert each to Markdown with docling, LLM-extract a typed ModuleInfo, and store a row per manual in Postgres](https://cocoindex.io/blobs/docs-v1/img/examples/manuals-llm-extraction/flow-v1.png)

Per manual, two transforms and a row:

1. Convert the PDF to Markdown with docling.
2. LLM-extract a `ModuleInfo` — title, description, classes (with their methods), and module-level functions (with their arguments).
3. Store a row in Postgres with the summary counts and the full structured info as JSON.

## The extraction schema is the prompt

The output type is nested Pydantic, and the structure itself tells the model what to pull out — a module has classes, a class has methods, a method has arguments:

```python title="main.py"
# ArgInfo (one entry per argument) is defined earlier in main.py; omitted here.
class MethodInfo(pydantic.BaseModel):
    name: str
    args: list[ArgInfo] = pydantic.Field(default_factory=list)
    description: str = ""


class ClassInfo(pydantic.BaseModel):
    name: str
    description: str = ""
    methods: list[MethodInfo] = pydantic.Field(default_factory=list)


class ModuleInfo(pydantic.BaseModel):
    title: str
    description: str
    classes: list[ClassInfo] = pydantic.Field(default_factory=list)
    methods: list[MethodInfo] = pydantic.Field(default_factory=list)
```

Extraction is [instructor](https://github.com/instructor-ai/instructor) over [LiteLLM](https://docs.litellm.ai/), so changing `LLM_MODEL` switches to any supported provider (OpenAI, Gemini, a local Ollama model). [`@coco.fn(memo=True)`](https://cocoindex.io/docs/programming_guide/function/) caches both the PDF parse and the extraction by content.

## Convert, extract, and store

`process_file` runs once per manual — docling to Markdown, LLM to `ModuleInfo`, then declare one Postgres row with the summary counts plus the full structure as JSON:

```python title="main.py"
@coco.fn(memo=True)
async def process_file(file: FileLike, table: postgres.TableTarget[ModuleRecord]) -> None:
    markdown = await pdf_to_markdown(await file.read())
    info = await extract_module(markdown)
    table.declare_row(
        row=ModuleRecord(
            filename=file.file_path.path.name,
            title=info.title,
            description=info.description,
            num_classes=len(info.classes),
            num_methods=len(info.methods),
            module_info=json.dumps(info.model_dump()),
        )
    )
```
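
The row-building step can be tried without an LLM or a database. The sketch below uses standard-library dataclasses as stand-ins for the Pydantic models (arguments are left out for brevity), and the sample `ModuleInfo` is hand-written for illustration, not actual pipeline output.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MethodInfo:
    name: str
    description: str = ""


@dataclass
class ClassInfo:
    name: str
    description: str = ""
    methods: list[MethodInfo] = field(default_factory=list)


@dataclass
class ModuleInfo:
    title: str
    description: str
    classes: list[ClassInfo] = field(default_factory=list)
    methods: list[MethodInfo] = field(default_factory=list)


def to_row(filename: str, info: ModuleInfo) -> dict:
    """Flatten a ModuleInfo into summary counts plus the full structure as JSON."""
    return {
        "filename": filename,
        "title": info.title,
        "description": info.description,
        "num_classes": len(info.classes),
        "num_methods": len(info.methods),
        "module_info": json.dumps(asdict(info)),
    }


# Hand-written sample, standing in for what an extraction might return.
sample = ModuleInfo(
    title="copy",
    description="Shallow and deep copy operations",
    classes=[ClassInfo(name="Error")],
    methods=[MethodInfo(name="copy"), MethodInfo(name="deepcopy"), MethodInfo(name="replace")],
)
row = to_row("copy.pdf", sample)
print(row["num_classes"], row["num_methods"])
```

Keeping the counts as separate columns makes simple filters cheap, while the JSON column preserves the nested detail for deeper queries.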

> **docling vs. marker.** The original v0 example used `marker-pdf` for the PDF→Markdown step; this v1 port uses [docling](https://github.com/docling-project/docling) — the parser the other CocoIndex PDF examples use — but the shape is identical: bytes in, Markdown out, on a GPU runner.

## Run the pipeline

```sh
docker compose -f dev/postgres.yaml up -d
export POSTGRES_URL="postgres://cocoindex:cocoindex@localhost/cocoindex"
cp .env.example .env   # set OPENAI_API_KEY (or LLM_MODEL=gemini/gemini-2.0-flash, ollama/llama3.2, …)
pip install -e .
cocoindex update main
```

The example ships a `manuals/` folder of Python module reference PDFs. Running it produces one row per manual — and the extraction is faithful to each module's shape:

| manual | title | classes | functions |
|---|---|---|---|
| `array.pdf` | array — efficient arrays of numeric values | 1 (`array`) | 0 |
| `base64.pdf` | base64 — Base16/32/64/85 data encodings | 0 | 22 |
| `copy.pdf` | copy — shallow and deep copy operations | 1 | 3 |

`base64` is correctly recognized as function-based (22 module functions, no classes), while `array` is a single class — exactly the distinction you'd want from the structured output.

## Explore the results

```sql
SELECT filename, title, num_classes, num_methods FROM coco_examples.modules_info;

-- pull the full nested structure for one module
SELECT module_info::jsonb -> 'classes' -> 0 -> 'methods'
FROM coco_examples.modules_info WHERE filename = 'copy.pdf';
```

## Incremental updates

- **Add a manual** — only it is parsed and extracted; one new row.
- **Edit a manual** — re-parsed and re-extracted; the row is updated in place.
- **Delete a manual** — its row is removed.
- **Swap the LLM** — `LLM_MODEL` has `detect_change=True`, so everything re-extracts against the new model with no cache to clear.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/manuals_llm_extraction](https://github.com/cocoindex-io/cocoindex/tree/main/examples/manuals_llm_extraction). For extracting metadata from research papers instead, see [Index Academic Papers](https://cocoindex.io/docs/examples/paper-metadata/); for extraction into typed JSON files, see the patient-intake examples.

Got a pile of manuals or datasheets to structure? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: Multi-format Visual Search

Source: https://cocoindex.io/docs/examples/multi-format-indexing/

![Index PDFs and images together with ColPali and CocoIndex](https://cocoindex.io/blobs/docs-v1/img/examples/multi-format-indexing/cover.png)

Real document sets are a mix — scanned reports, slide exports, screenshots, and PDFs all jumbled together. Parsing each format into clean text is brittle and loses the layout (tables, charts, figures) that often *is* the answer. In this tutorial we'll build a [CocoIndex](https://github.com/cocoindex-io/cocoindex) pipeline that sidesteps parsing entirely: render every PDF page to an image, embed pages and standalone images alike with the multi-vector [ColPali](https://huggingface.co/vidore/colpali-v1.2) model, and store them in one [Qdrant](https://qdrant.tech/) collection. A text query then retrieves the most relevant *page*, no matter what format it started as.

The whole pipeline is ordinary `async` Python. The slow per-page model inference runs on a [GPU runner](https://cocoindex.io/docs/programming_guide/function/), and the Rust engine handles [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/) — add a document and only its pages get embedded.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/multi_format_indexing)

## Why ColPali (and multi-vector search)

A normal embedding squashes a whole page into one vector — fine for a paragraph, lossy for a dense report page with tables and figures. [ColPali](https://github.com/illuin-tech/colpali) instead emits a *bag* of vectors (one per image patch) and matches a query token-against-patch with **MaxSim**. The cost is more vectors per page; the payoff is retrieval that holds up on visually dense, text-heavy pages — exactly the documents that defeat plain OCR-and-embed.
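
MaxSim itself is short enough to write out. The sketch below uses plain Python lists and tiny 2-d vectors invented for the example, and it assumes the vectors are already normalized so that a dot product acts as the similarity; real ColPali output has far more vectors per page and more dimensions.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def max_sim(query_vectors: list[list[float]], page_vectors: list[list[float]]) -> float:
    """For each query vector take its best-matching page vector, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


# Toy data: two query-token vectors, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a strong match for each token
page_b = [[0.6, 0.0], [0.0, 0.2]]              # weaker matches

print(max_sim(query, page_a))
print(max_sim(query, page_b))
```

Because each query token picks its own best patch, a page can score well when different regions answer different parts of the query, which a single pooled vector cannot express.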

## Flow overview

![CocoIndex flow: walk a folder of PDFs and images, render each PDF to per-page images, embed every page with ColPali, and store one multi-vector Qdrant point per page](https://cocoindex.io/blobs/docs-v1/img/examples/multi-format-indexing/flow-v1.png)

A file fans out to **pages**, so the shape is *file → N pages → N points*:

1. Walk a folder of PDFs and images (live).
2. Render each PDF to one image per page; an image is a single page.
3. Embed every page with ColPali and store one multi-vector Qdrant point per page, tagged with filename and page number.

## Split any file into pages

One function handles every format: PDFs go through [`pdf2image`](https://github.com/Belval/pdf2image), images pass through as a single page, anything else is skipped.

```python title="main.py"
@coco.fn.as_async(runner=coco.GPU)
def file_to_pages(filename: str, content: bytes) -> list[Page]:
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type == "application/pdf":
        return [
            Page(page_number=i + 1, image=_to_png(image))
            for i, image in enumerate(convert_from_bytes(content, dpi=PDF_RENDER_DPI))
        ]
    if mime_type and mime_type.startswith("image/"):
        return [Page(page_number=None, image=content)]
    return []
```
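
The format dispatch depends only on `mimetypes.guess_type`, so it can be checked in isolation. The `classify` helper below is a hypothetical stand-in that mirrors the three branches without rendering anything.

```python
import mimetypes


def classify(filename: str) -> str:
    """Mirror the branching in file_to_pages: 'pdf', 'image', or 'skip'."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type == "application/pdf":
        return "pdf"
    if mime_type and mime_type.startswith("image/"):
        return "image"
    return "skip"


for name in ["paper.pdf", "report-page.png", "photo.jpg", "notes.txt", "README"]:
    print(name, classify(name))
```

Note that the guess is based on the file extension alone, so a mislabeled file would take the wrong branch; the file contents are never inspected at this step.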

## Embed pages and fan out

`process_file` splits a file into pages, then maps each page through `process_page`, which embeds it with ColPali and declares one multi-vector Qdrant point:

```python title="main.py"
@coco.fn
async def process_page(page: Page, filename: str, target: qdrant.CollectionTarget) -> None:
    embedding = await embed_page(page.image)          # list[list[float]] — multi-vector
    target.declare_point(
        qdrant.PointStruct(
            id=_page_id(filename, page.page_number),
            vector=embedding,
            payload={"filename": filename, "page": page.page_number},
        )
    )


@coco.fn(memo=True)
async def process_file(file: FileLike, target: qdrant.CollectionTarget) -> None:
    pages = await file_to_pages(str(file.file_path.path), await file.read())
    await coco.map(process_page, pages, str(file.file_path.path), target)
```

`embed_page` runs the ColPali model (loaded once via `@functools.cache`) and returns a *list of* vectors — the multi-vector representation. [`coco.map`](https://cocoindex.io/docs/programming_guide/app/) fans out one `process_page` per page, and [`@coco.fn(memo=True)`](https://cocoindex.io/docs/programming_guide/function/) skips files that haven't changed.
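
The load-once pattern can be sketched with a stand-in model. Everything here is fake and only illustrates the caching: `get_model` and the toy `embed_page` are assumptions for the example, not the repo's ColPali code.

```python
import functools

load_count = 0  # counts how many times the (fake) model is actually loaded


@functools.cache
def get_model() -> dict[str, str]:
    """Stand-in for an expensive model load; the body runs once per process."""
    global load_count
    load_count += 1
    return {"name": "stand-in-model"}


def embed_page(image: bytes) -> list[list[float]]:
    """Fake multi-vector embedding: one 2-d vector for each of the first 3 bytes."""
    get_model()  # served from the cache after the first call
    return [[float(b), float(len(image))] for b in image[:3]]


first = embed_page(b"page-one")
second = embed_page(b"page-two")
print(load_count, len(first))
```

The return type is the important part: a list of vectors per page rather than a single vector, which is what the multi-vector collection expects.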

## The multi-vector Qdrant collection

The collection is declared with a [`MultiVectorSchema`](https://cocoindex.io/docs/connectors/qdrant/) and a MaxSim comparator. With MaxSim, Qdrant scores each query-token vector against its *best-matching patch* on a page and sums those maxima into the page's score:

```python title="main.py"
target_collection = await qdrant.mount_collection_target(
    QDRANT_DB,
    collection_name=QDRANT_COLLECTION,
    schema=await qdrant.CollectionSchema.create(
        vectors=qdrant.QdrantVectorDef(
            schema=MultiVectorSchema(
                vector_schema=VectorSchema(dtype=np.dtype(np.float32), size=dim)
            ),
            distance="cosine",
            multivector_comparator="max_sim",
        )
    ),
)
```

## Run the pipeline

```sh
docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
export QDRANT_URL="http://localhost:6334/"
pip install -e .          # cocoindex[colpali,qdrant], pdf2image, torch, … (needs poppler for PDFs)
cocoindex update main
```

The example ships a `source_files/` folder mixing PDFs (papers) and images (financial report pages). A PDF expands to one point per page — the sample BERT paper alone is 16 pages.

## Search across formats

Embed a text query with ColPali and search Qdrant; the same query reaches pages from PDFs and standalone images alike:

```sh
python main.py "revenue growth"
```

On the sample set, *"revenue growth"* ranks the two financial-report images at the top (Sweetgreen, then Restaurant Brands), above an unrelated healthcare page — MaxSim matching the query against the most relevant patches of each page, with zero text extraction.

## Incremental updates

- **Add a file** — only its pages are rendered and embedded; existing points are untouched.
- **Edit a file** — pages reconcile against what's in Qdrant; unchanged pages keep their points.
- **Delete a file** — every page from it is removed.

## Run it

The full, runnable example is in the CocoIndex repo: [examples/multi_format_indexing](https://github.com/cocoindex-io/cocoindex/tree/main/examples/multi_format_indexing). For the image-only version with a web UI, see [Search Images by Text · ColPali](https://github.com/cocoindex-io/cocoindex/tree/main/examples/image_search_colpali); for a text-extraction pipeline over PDFs instead, see [Semantic Search over PDFs](https://cocoindex.io/docs/examples/pdf-embedding/).

Got a pile of mixed-format documents to make searchable? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: Slides to Narrated Search

Source: https://cocoindex.io/docs/examples/slides-to-speech/

![Turn slide decks into narrated, searchable audio with a vision LLM and Piper TTS](https://cocoindex.io/blobs/docs-v1/img/examples/slides-to-speech/cover.png)

A slide deck is a great outline and a terrible thing to *listen to* or *search*. In this tutorial we'll build a [CocoIndex](https://github.com/cocoindex-io/cocoindex) pipeline that fixes both: for each slide, a vision LLM writes natural speaker notes, [Piper](https://github.com/OHF-Voice/piper1-gpl) synthesizes them to audio locally, and the notes are embedded into [LanceDB](https://lancedb.com/) so you can search the deck by meaning and play back the narration for any hit.

The whole pipeline is ordinary `async` Python. The vision and TTS steps run on a [GPU runner](https://cocoindex.io/docs/programming_guide/function/), and the Rust engine handles [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/) — add a deck and only its slides get processed.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/slides_to_speech)

## Flow overview

![CocoIndex flow: render each slide to an image, a vision LLM writes speaker notes, Piper TTS narrates them, the notes are embedded, and everything is stored per-slide in LanceDB](https://cocoindex.io/blobs/docs-v1/img/examples/slides-to-speech/flow-v1.png)

A deck fans out to **slides**, and each slide produces text, audio, and a vector:

1. Render each slide of the PDF to an image (pymupdf).
2. A vision LLM writes speaker notes for the slide.
3. Piper synthesizes the notes to MP3 audio; a sentence-transformer embeds the notes.
4. Store one LanceDB row per slide — page, notes, audio, and embedding.

## Speaker notes from a slide image

The vision LLM reads the rendered slide and writes presenter narration. Extraction is [instructor](https://github.com/instructor-ai/instructor) over [LiteLLM](https://docs.litellm.ai/), so the image goes in as a data URL and a typed `SlideTranscript` comes back:

```python title="main.py"
class SlideTranscript(pydantic.BaseModel):
    speaker_notes: str = pydantic.Field(
        description="Natural spoken narration for this slide, as a presenter would say it."
    )


@coco.fn(memo=True)
async def extract_speaker_notes(image: bytes) -> SlideTranscript:
    client = instructor.from_litellm(litellm.acompletion, mode=instructor.Mode.JSON)
    data_url = "data:image/png;base64," + base64.b64encode(image).decode()
    result = await client.chat.completions.create(
        model=coco.use_context(LLM_MODEL),          # e.g. gemini/gemini-2.5-flash
        response_model=SlideTranscript,
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Write speaker notes for this slide."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ]}],
    )
    return SlideTranscript.model_validate(result.model_dump())
```

> **A note on the port.** The v0 example pulled slides from Google Drive and used BAML for the vision call; this v1 port reads slides from a local folder and uses instructor + LiteLLM (any vision model — Gemini, GPT-4o, …). Point the source at a [Google Drive folder](https://cocoindex.io/docs/connectors/google_drive/) to reproduce the original.

## Narrate locally with Piper

Piper is a fast, fully local neural TTS — no API, no per-character billing. The voice model loads once and synthesizes the notes to MP3:

```python title="main.py"
@coco.fn.as_async(runner=coco.GPU)
def text_to_speech(text: str) -> bytes:
    voice = get_piper_voice()                       # cached PiperVoice
    chunks = list(voice.synthesize(text))
    pcm = b"".join(c.audio_int16_bytes for c in chunks)
    audio = AudioSegment(data=pcm, sample_width=chunks[0].sample_width,
                         frame_rate=chunks[0].sample_rate, channels=chunks[0].sample_channels)
    out = io.BytesIO()
    audio.export(out, format="mp3", bitrate="64k")
    return out.getvalue()
```
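
The key step above is wrapping raw int16 PCM with its sample width, rate, and channel count. The sketch below shows the same idea using only the standard library, deliberately swapping in WAV via the `wave` module instead of pydub and MP3; the synthetic sine tone and the 22050 Hz rate are assumptions for the demo, not Piper output.

```python
import io
import math
import struct
import wave


def pcm_to_wav(pcm: bytes, *, sample_rate: int, sample_width: int = 2, channels: int = 1) -> bytes:
    """Wrap raw PCM frames in a WAV container and return the file bytes."""
    out = io.BytesIO()
    with wave.open(out, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # 2 bytes per sample = int16
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return out.getvalue()


# 0.1 s of a 440 Hz sine tone as little-endian int16 samples.
sample_rate = 22050
samples = [
    int(12000 * math.sin(2 * math.pi * 440 * n / sample_rate))
    for n in range(sample_rate // 10)
]
pcm = struct.pack(f"<{len(samples)}h", *samples)
wav_bytes = pcm_to_wav(pcm, sample_rate=sample_rate)
print(len(pcm), wav_bytes[:4])
```

WAV needs no external encoder, which makes it handy for a quick check; MP3 export through pydub additionally requires ffmpeg but produces much smaller files.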

## Fan out per slide and store

`process_file` renders the deck to slides, then maps each through `process_slide`, which runs the vision LLM, then synthesizes audio *and* embeds the notes concurrently before declaring the row:

```python title="main.py"
@coco.fn
async def process_slide(slide, filename, table) -> None:
    notes = (await extract_speaker_notes(slide.image)).speaker_notes
    voice, embedding = await asyncio.gather(
        text_to_speech(notes),
        coco.use_context(EMBEDDER).embed(notes),
    )
    table.declare_row(row=SlideRecord(
        id=f"{filename}#{slide.page_number}", filename=filename, page=slide.page_number,
        speaker_notes=notes, voice=voice, embedding=embedding,
    ))
```
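
To see what `asyncio.gather` buys here, the following self-contained sketch replaces the real steps with two stand-in coroutines (`fake_tts` and `fake_embed` are hypothetical). It records the order of events to show that both steps start before either one finishes.

```python
import asyncio

events: list[str] = []  # order in which the stand-in steps start and finish


async def fake_tts(notes: str) -> bytes:
    events.append("tts:start")
    await asyncio.sleep(0.05)  # pretend synthesis takes a while
    events.append("tts:done")
    return notes.encode("utf-8")


async def fake_embed(notes: str) -> list[float]:
    events.append("embed:start")
    await asyncio.sleep(0.05)  # pretend embedding takes a while
    events.append("embed:done")
    return [float(len(notes))]


async def narrate_and_embed(notes: str) -> tuple[bytes, list[float]]:
    # gather runs both awaitables concurrently and returns results in argument order.
    voice, embedding = await asyncio.gather(fake_tts(notes), fake_embed(notes))
    return voice, embedding


voice, embedding = asyncio.run(narrate_and_embed("Welcome to the roadmap."))
```

Because the two steps only depend on `notes` and not on each other, the wall-clock time per slide is roughly the slower of the two rather than their sum.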

The MP3 audio is stored right in the LanceDB row (a binary column), so a search hit comes with playable narration attached.

## Run the pipeline

```sh
python3 -m piper.download_voices en_US-lessac-medium   # ~60 MB local voice
cp .env.example .env                                    # set GEMINI_API_KEY (or OPENAI_API_KEY)
pip install -e .                                        # needs ffmpeg for MP3 export
cocoindex update main
```

Put a slide-deck PDF in `slides/` before running the update. On a 3-slide sample deck, this produces three LanceDB rows, each with Gemini-written speaker notes and ~170–280 KB of Piper MP3 audio.

## Search the deck

Embed a query the same way and search LanceDB:

```sh
python main.py "reducing latency and reliability"
```

On the sample deck, that query ranks the **Engineering Priorities** slide first — above the roadmap and go-to-market slides — matching the spoken notes by meaning, not keywords. Each hit carries the slide's MP3 narration, ready to play.

## Incremental updates

- **Add a deck** — only its slides are rendered, narrated, and embedded.
- **Edit a deck** — slides reconcile against LanceDB; unchanged slides keep their notes and audio.
- **Swap the voice or LLM** — change `PIPER_MODEL_NAME` or `LLM_MODEL`; the affected steps re-run, the rest is served from cache.
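
As a mental model for the last bullet, think of each step's result as cached under a fingerprint of its input plus the configuration that affects it. The sketch below is a conceptual illustration only, not CocoIndex's internal mechanism; `fingerprint`, `narrate`, and the model names are hypothetical.

```python
import hashlib

cache: dict[str, str] = {}
calls: list[str] = []  # fingerprints for which real work was done


def fingerprint(slide_bytes: bytes, model_name: str) -> str:
    # The key covers the input and the configuration that influences the output.
    h = hashlib.sha256()
    h.update(model_name.encode("utf-8"))
    h.update(b"\x00")  # separator so the two fields cannot run together
    h.update(slide_bytes)
    return h.hexdigest()


def narrate(slide_bytes: bytes, model_name: str) -> str:
    key = fingerprint(slide_bytes, model_name)
    if key not in cache:
        calls.append(key)  # record that the expensive step actually ran
        cache[key] = f"notes from {model_name} for {len(slide_bytes)} bytes"
    return cache[key]


narrate(b"slide-1", "model-a")  # computed
narrate(b"slide-1", "model-a")  # same input and config: cache hit
narrate(b"slide-1", "model-b")  # config changed: recomputed
narrate(b"slide-2", "model-b")  # new slide: computed
```

Four calls trigger three computations: only the repeated call with an unchanged slide and model is served from the cache.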

## Run it

The full, runnable example is in the CocoIndex repo: [examples/slides_to_speech](https://github.com/cocoindex-io/cocoindex/tree/main/examples/slides_to_speech). For transcribing existing audio instead of generating it, see [Audio → Text](https://cocoindex.io/docs/examples/audio-to-text/).

Got a deck library you want to narrate and search? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).

---

# Example: SEC Filing Hybrid Search

Source: https://cocoindex.io/docs/examples/sec-edgar-analytics/

![Index multi-format SEC filings into Apache Doris with vector + full-text indexes for hybrid search](https://cocoindex.io/blobs/docs-v1/img/examples/sec-edgar-analytics/cover.png)

SEC filings come in many shapes — narrative 10-K risk factors as text, structured financials as XBRL JSON, exhibits as PDF. In this tutorial we'll build a [CocoIndex](https://github.com/cocoindex-io/cocoindex) pipeline that pulls these formats into a *single* searchable index in [Apache Doris](https://doris.apache.org/), with both a **vector index** for semantic search and a **full-text index** for keyword search — the foundation for hybrid retrieval. Along the way each document is scrubbed of PII, chunked, embedded, and tagged with risk/topic labels.

The whole pipeline is ordinary `async` Python. Embedding runs on a [GPU runner](https://cocoindex.io/docs/programming_guide/function/), and the Rust engine handles [incremental processing](https://cocoindex.io/docs/programming_guide/core_concepts/) — add a filing and only its chunks are embedded and loaded.

[→ View on GitHub](https://github.com/cocoindex-io/cocoindex/tree/main/examples/sec_edgar_analytics)

## Flow overview

![CocoIndex flow: walk text filings and JSON facts, scrub PII, chunk, embed, tag topics, and load one row per chunk into Apache Doris with a vector index and a full-text index](https://cocoindex.io/blobs/docs-v1/img/examples/sec-edgar-analytics/flow-v1.png)

Two source formats fan into one chunk table:

1. **Sources** — `*.txt` 10-K filings and `*.json` XBRL company facts (the JSON is rendered to searchable text first).
2. **Scrub & chunk** — strip SSNs / phones / emails *before* indexing, then split into overlapping chunks.
3. **Embed & tag** — a sentence-transformer embeds each chunk; a keyword pass tags `RISK:*` / `TOPIC:*` labels.
4. **Load into Doris** — one row per chunk, into a table with a vector (ANN) index and a full-text (inverted) index.
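
Step 2's scrub can be approximated with a few regular expressions. This is a minimal sketch, not the example's actual `_scrub_pii`: the patterns and the `[SSN]` / `[EMAIL]` / `[PHONE]` placeholders are assumptions, and real PII detection needs broader coverage.

```python
import re

# Illustrative patterns only (US-style SSNs and phone numbers).
_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
_EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# Separators are required, so bare 10-digit identifiers such as CIKs are left alone.
_PHONE = re.compile(r"(?<!\d)\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}(?!\d)")


def scrub_pii(text: str) -> str:
    """Replace SSNs, emails, and phone numbers with placeholder tokens."""
    text = _SSN.sub("[SSN]", text)
    text = _EMAIL.sub("[EMAIL]", text)
    text = _PHONE.sub("[PHONE]", text)
    return text


sample = "Contact jane.doe@example.com or 415-555-0123; SSN 123-45-6789."
scrubbed = scrub_pii(sample)
untouched = scrub_pii("CIK 0000320193 filed a 10-K.")
```

Running the scrub on the whole document before splitting means a value that would straddle a chunk boundary is still caught in one piece.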

## One table, two index types

The row type is a plain dataclass. The magic is in `mount_table_target`: the same table gets a **vector index** (for `l2_distance` semantic search) and an **inverted index** (for `MATCH_ANY` keyword search):

```python title="main.py"
@dataclass
class FilingChunk:
    chunk_id: str          # primary key
    source_type: str       # "filing" | "facts"
    doc_filename: str
    cik: str
    filing_date: str
    form_type: str
    text: str
    topics: list[str]
    embedding: Annotated[NDArray, EMBEDDER]


table = await doris.mount_table_target(
    DORIS_DB, TABLE,
    await doris.TableSchema.from_class(FilingChunk, primary_key=["chunk_id"]),
    vector_indexes=[doris.VectorIndexDef(field_name="embedding", metric_type="l2_distance")],
    inverted_indexes=[doris.InvertedIndexDef(field_name="text", parser="unicode")],
)
```

## Scrub PII, then chunk, embed, and tag

PII is redacted *before* chunking, so it never enters the index. Each format gets a thin per-file entry point (`process_filing`, `process_facts`) that funnels into one shared path — scrub, chunk, embed, tag, declare a row per chunk:

```python title="main.py"
async def _index_text(text, source_type, filename, cik, filing_date, form_type, table):
    embedder = coco.use_context(EMBEDDER)
    for chunk in _splitter.split(_scrub_pii(text), chunk_size=1000, chunk_overlap=200,
                                 language="markdown"):
        table.declare_row(row=FilingChunk(
            chunk_id=_chunk_id(filename, chunk.start.char_offset, chunk.end.char_offset),
            source_type=source_type, doc_filename=filename, cik=cik,
            filing_date=filing_date, form_type=form_type,
            text=chunk.text, topics=_extract_topics(chunk.text),
            embedding=await embedder.embed(chunk.text),
        ))
```

Both sources `declare_row` into the *same* Doris table — `chunk_id` is a stable `uuid5` of the file and chunk offsets, so re-running reconciles cleanly instead of duplicating.
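
A stable ID of this kind can be built with the standard library's `uuid.uuid5`. The sketch below is one plausible shape, not the example's actual `_chunk_id`: the namespace URL, the `filename:start:end` format, and the sample filename are all assumptions.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/filing-chunks")


def chunk_id(filename: str, start: int, end: int) -> str:
    # uuid5 hashes (namespace, name) with SHA-1, so the same inputs
    # always produce the same ID across runs and machines.
    return str(uuid.uuid5(_NAMESPACE, f"{filename}:{start}:{end}"))


a = chunk_id("aapl-10k.txt", 0, 1000)
b = chunk_id("aapl-10k.txt", 0, 1000)    # same file and offsets: same ID
c = chunk_id("aapl-10k.txt", 800, 1800)  # different offsets: different ID
```

A random `uuid4` would give every run fresh keys, and each re-run would then look like a full delete plus insert; a deterministic ID lets unchanged chunks match their existing rows.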

> **A note on the port.** The original v0 example also ingested PDF exhibits via docling; this v1 port focuses on the text and XBRL-JSON sources (the PDF path is identical to the [Manuals to Structured Data](https://cocoindex.io/docs/examples/manuals-llm-extraction/) example — `docling` bytes → Markdown, then the same `_index_text`). It needs **Apache Doris 4.0+** for vector index support; a ready `docker-compose.yml` is included.

## Run the pipeline

```sh
docker compose up -d fe be       # Apache Doris 4.0 (FE + BE)
python download.py               # synthetic 10-K filings + XBRL company facts
cp .env.example .env             # Doris host/ports
pip install -e .
cocoindex update main
```

On the sample data this loads 4 chunks (2 filings + 2 company-facts) into Doris, creating both `idx_vec_embedding` (ANN) and `idx_inv_text` (INVERTED). Topic tags come out as you'd expect — Apple's filing tagged `RISK:CYBER, RISK:CLIMATE, RISK:SUPPLY, RISK:REGULATORY, TOPIC:AI`, Microsoft's `RISK:CYBER, RISK:REGULATORY, TOPIC:AI, TOPIC:CLOUD`.
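
Labels like these can come from a simple keyword pass. The sketch below shows the idea under stated assumptions: the keyword map is hypothetical and the example's real `_extract_topics` may use different lists or matching rules.

```python
# Hypothetical keyword map; the example's real lists may differ.
_KEYWORDS: dict[str, tuple[str, ...]] = {
    "RISK:CYBER": ("cybersecurity", "data breach", "ransomware"),
    "RISK:CLIMATE": ("climate", "extreme weather"),
    "TOPIC:AI": ("artificial intelligence", "machine learning"),
    "TOPIC:CLOUD": ("cloud",),
}


def extract_topics(text: str) -> list[str]:
    lowered = text.lower()
    # Keep labels in the map's declaration order so the output is stable.
    return [
        label
        for label, words in _KEYWORDS.items()
        if any(word in lowered for word in words)
    ]


tags = extract_topics(
    "Our cloud services rely on machine learning; a data breach could harm results."
)
```

Plain substring matching is cheap and deterministic, but it also fires on partial words, so a production tagger would likely want word boundaries or a classifier.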

## Hybrid search with RRF

The payoff is hybrid retrieval — fuse the vector ranking and the keyword ranking with [Reciprocal Rank Fusion](https://en.wikipedia.org/wiki/Learning_to_rank). `search.py` does both in one SQL query:

```sql title="search.py (shape)"
WITH semantic AS (
    SELECT chunk_id, doc_filename, ROW_NUMBER() OVER (ORDER BY l2_distance(embedding, {q})) AS rk
    FROM filing_chunks
),
lexical AS (
    SELECT chunk_id, ROW_NUMBER() OVER (
        ORDER BY CASE WHEN text MATCH_ANY '{keywords}' THEN 0 ELSE 1 END) AS rk
    FROM filing_chunks
)
SELECT s.doc_filename, 1.0/(60 + s.rk) + 1.0/(60 + l.rk) AS rrf
FROM semantic s JOIN lexical l USING (chunk_id) ORDER BY rrf DESC
```

```sh
python search.py "cloud computing and AI risk"
```

On the sample data that ranks **Microsoft's** cloud-and-AI filing first (it carries both `TOPIC:CLOUD` and `TOPIC:AI`), Apple's second, and the company-facts rows below — semantic relevance and keyword presence combined, not either alone.
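
The fusion itself is small enough to write out in plain Python. This sketch computes the same `1 / (60 + rank)` sum as the query above over two ranked lists; the document IDs `a`, `b`, `c` are placeholders, not the sample data.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


semantic = ["a", "b", "c"]  # best vector match first
lexical = ["b", "c", "a"]   # best keyword match first
fused = rrf([semantic, lexical])
order = [doc_id for doc_id, _ in fused]
```

Here `b` comes out on top because it ranks well in both lists, while `a` leads only the semantic list. Unlike the SQL join, this version also tolerates a document that appears in just one list: the missing list simply contributes nothing to its score.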

## Incremental updates

- **Add a filing** — only its chunks are scrubbed, embedded, tagged, and stream-loaded into Doris.
- **Edit a filing** — chunks reconcile by `chunk_id`; unchanged chunks are untouched.
- **Delete a filing** — its chunks are removed from the table.
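
Conceptually, the three cases above reduce to diffing the rows the pipeline declares against the rows already in the table, keyed by `chunk_id`. The sketch below illustrates that diff only; it is not CocoIndex's engine, and `reconcile` and the sample rows are hypothetical.

```python
def reconcile(
    existing: dict[str, str], desired: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Return the rows to upsert and the keys to delete."""
    # New keys and keys whose content changed need to be written.
    upserts = {key: row for key, row in desired.items() if existing.get(key) != row}
    # Keys that are no longer declared need to be removed.
    deletes = sorted(existing.keys() - desired.keys())
    return upserts, deletes


existing = {"c1": "old intro", "c2": "risk factors", "c3": "removed section"}
desired = {"c1": "new intro", "c2": "risk factors", "c4": "added section"}
upserts, deletes = reconcile(existing, desired)
```

The unchanged row `c2` appears in neither result, which is why editing one part of a filing does not re-embed or reload the rest.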

## Run it

The full, runnable example is in the CocoIndex repo: [examples/sec_edgar_analytics](https://github.com/cocoindex-io/cocoindex/tree/main/examples/sec_edgar_analytics). For the PDF-extraction side, see [Manuals to Structured Data](https://cocoindex.io/docs/examples/manuals-llm-extraction/); for a pure-vector setup, see [Text Embedding](https://cocoindex.io/docs/examples/text-embedding/).

Indexing your own filing archive? Come tell us on [Discord](https://discord.com/invite/zpA9S2DR7s) — and if this was useful, [star CocoIndex on GitHub](https://github.com/cocoindex-io/cocoindex).