CocoIndex is an incremental engine for long-horizon agents. It continuously transforms data from any source — codebases, meeting notes, and more — and builds the context that agents deployed in production depend on.
CocoIndex V1 is now live. It is a fundamental redesign of how you write incremental data pipelines — built from a year of watching what people actually wanted to do with CocoIndex. CocoIndex V1 is built for AI engineers and agent builders — people building the context, RAG, memory, and knowledge graphs that live agents depend on. The design choice that matters most to that audience is deliberately small: the framework meets you in the language you already write, with the data types you already use. If Claude or Cursor can already fluently write it, it works here.
The headline change is simple: there is no DSL anymore. Your whole pipeline is now regular async Python, growing organically as functions call each other. The engine still does what CocoIndex has always done — managed targets, incremental processing, change tracking in Rust — but it now lives behind a Python-native surface, not behind a DSL with its own type system and its own rules about what you can and can’t do.
The mental model is unchanged, and it’s the clearest way to describe what CocoIndex does: you declare what the target should look like as a function of the source, and the engine figures out the transitions. CocoIndex calls this state-driven programming — the same shape as React, spreadsheets, or SQL materialized views. Declare the state; CocoIndex handles the rest.
Why CocoIndex: the incremental engine for long-horizon agents
Agents run roughly 50× faster than humans, but the tools they rely on were built for human speed.
At GTC 2026, Jeff Dean and Bill Dally named a bottleneck that’s about to reshape every piece of infrastructure around AI: Amdahl’s law takes over — no matter how fast the model gets, end-to-end throughput is capped by the slowest thing in the loop. Jeff’s specific emphasis was on ultra-low-latency inference for agents “operating in the background,” running tasks that “take hours or perhaps even days, independently doing a bunch of things, correcting themselves, doing some more things.”
Data infrastructure is one of those tools, and it matters beyond inference. An agent reasoning over a codebase, a conversation graph, a document corpus, or a stream of events needs that data fresh, organized, and cheap to query — not just on the first call, but throughout the run, because:
- Agents write code. The artifacts they produce become source data for the next reasoning step.
- Agents make decisions and record them. Traces, plans, audit logs — written by agents, read by other agents.
- Agents update data while they run. A long-horizon agent isn’t a one-shot query; it’s a process that edits its own context as it proceeds.
- Source data arrives faster, too. Codebases, logs, Slack, PRs, tickets, sensor feeds — all updating in the background, all expected to be visible to whatever the agent is doing next.
The default pattern for indexing that data — “batch-rebuild overnight” — was built for humans who check a dashboard every morning. For agents running 50x faster, that cadence is the bottleneck. What agents need is an engine that treats derived data as a function of source data and keeps it in sync incrementally: only the chunks that actually changed get re-embedded, only the rows that changed get re-upserted, only the messages that changed get re-published. No full rebuilds. No stale snapshots. No waiting for the next batch window.
That has always been CocoIndex’s job. V1 makes it the right shape for agent-era workloads: the same incremental, state-driven guarantees, but now expressive enough to cover the pipeline shapes agents actually produce — entity resolution, clustering, multi-phase reduction, per-tenant topologies, conditional targets — instead of just “chunk → embed → upsert.” Every pattern in the examples gallery is something a long-horizon agent might want to run itself, and have its outputs become fresh source data for the next agent — without a human babysitting the refresh job.
What is an incremental engine, and why it’s hard to build for production
An incremental engine does one job: keep derived data in sync with source data without redoing work that’s already been done. Both the source data and the processing code change over time.
To do it right, you need:
- reliable change detection across heterogeneous sources (filesystems, databases, queues, APIs);
- content fingerprints stable enough to diff but cheap enough to compute every run;
- persistent state that records what the last run produced, so this run can compare against it;
- memoization keyed on both data and code fingerprints, so that editing a helper function invalidates only the callers that actually depended on it;
- managed target lifecycles (schema evolution, orphan cleanup when a source disappears, idempotent upserts);
- transactional behavior when writing to multiple targets at once;
- recovery logic for every partial failure in between — and more.
Teams that take this on seriously typically allocate 10–20 engineers for at least six months to land the first production-worthy version — and then keep paying for maintenance indefinitely as sources, targets, and schemas evolve. CocoIndex ships all of this in the engine, so the code you write is the pipeline itself, not the scaffolding around it.
Notion recently published engineering work that shows the tip of this iceberg: maintaining an in-house incremental data pipeline for live agents, even for simpler transformations (without clustering and the like).
The V1 mental model: declare states, not steps
The mental model didn’t change. The execution model did.
CocoIndex has always been a state-driven framework. You don’t write code that says “compute this delta and apply it”; you describe what the target should look like as a function of the source, and the engine works out the transitions. If you’ve used React, spreadsheets, or SQL materialized views, you already know the shape of this:
- React: declare UI as a function of state → React re-renders what changed.
- Spreadsheets: declare formulas → cells recompute when inputs change.
- CocoIndex: declare target states as a function of source → engine syncs what changed.
What’s new in V1 is how you express that declaration. The whole pipeline is now plain async Python — no FlowBuilder, no DataScope, no two-phase “define, then run” lifecycle. Functions call other functions. Loops are loops. if statements are if statements. Behind the scenes, CocoIndex tracks every target state you declare and every memoized intermediate value, so the next run only does what changed.
V1 is deliberately not intrusive. You can start with a single @coco.fn on a plain Python function and opt into more of the engine as the pipeline grows: flip on memo=True when re-running starts to cost real time, add declare_* targets when you want the engine to own the output lifecycle, move to live mode when you want the pipeline to keep watching for changes. There’s no “set up the framework first” step and no ceremony you have to write before you get value — you turn the incremental knobs on when you actually need them.
That model rests on a small vocabulary: an App is the top-level entity you run; inside it, each processing component groups one item’s work with its target states and runs independently; incremental processing means only what changed gets reapplied to the target; and function memoization skips any function whose input and code haven’t changed since the last run.
What that means in practice is code you don’t have to write: no bookkeeping table tracking which file was last processed, no insert / update / delete branches, no “did I already embed this chunk?” checks, no migration scripts when your schema changes. The engine owns the target state, diffs against the previous run, and only does what changed. V0 made that tractable. V1 makes it feel like writing native Python.
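The diff at the heart of that contract can be sketched in a few lines. The helper name is hypothetical — this is an illustration of the idea, not the real engine:

```python
# Conceptual sketch: diff what this run declared against what the last
# run declared, and apply only the difference.
def diff_target_states(previous: dict[str, str], current: dict[str, str]):
    upserts = {k: v for k, v in current.items()
               if previous.get(k) != v}                  # new or changed
    deletes = [k for k in previous if k not in current]  # no longer declared
    return upserts, deletes

prev = {"a.md": "v1", "b.md": "v1"}   # what the last run declared
curr = {"a.md": "v2", "c.md": "v1"}   # what this run declares
upserts, deletes = diff_target_states(prev, curr)
assert upserts == {"a.md": "v2", "c.md": "v1"}
assert deletes == ["b.md"]
```

Because the engine owns both sides of the diff, “stop declaring something” and “delete it from the target” become the same operation.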
A complete V1 app — a PDF-to-Markdown converter — is short enough to read at a glance:
```python
import pathlib

import cocoindex as coco
from cocoindex.connectors import localfs
from cocoindex.resources.file import PatternFilePathMatcher
from docling.document_converter import DocumentConverter

_converter = DocumentConverter()

@coco.fn(memo=True)
def process_file(file: localfs.File, outdir: pathlib.Path) -> None:
    markdown = _converter.convert(file.file_path.resolve()).document.export_to_markdown()
    outname = file.file_path.path.stem + ".md"
    localfs.declare_file(outdir / outname, markdown, create_parent_dirs=True)

@coco.fn
async def app_main(sourcedir: pathlib.Path, outdir: pathlib.Path) -> None:
    files = localfs.walk_dir(
        sourcedir, path_matcher=PatternFilePathMatcher(included_patterns=["**/*.pdf"]),
    )
    await coco.mount_each(process_file, files.items(), outdir)

app = coco.App("PdfToMarkdown", app_main,
               sourcedir=pathlib.Path("./pdf_files"),
               outdir=pathlib.Path("./out"))
app.update_blocking()
```
Three things to notice:
1. `process_file` is just a function — you can set a breakpoint in it, call it directly, run it under pytest, whatever. The `@coco.fn(memo=True)` decorator hooks it into the engine’s change detection and memoization, but doesn’t change its calling convention.
2. `localfs.declare_file(...)` is a declaration, not a write. It tells CocoIndex “the target state for this path is this content.” If the target state hasn’t changed since the last run, nothing happens on disk. If the source file disappears, the declaration disappears too — and CocoIndex deletes the corresponding output file automatically.
3. `coco.mount_each(...)` mounts one processing component per file. Each component is its own unit of incremental change detection: when one file’s content changes, only that component re-runs.
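The “declaration, not a write” semantics can be sketched in plain Python. This is a toy stand-in, not the real `localfs` connector — it only shows the idempotent-write half of the contract (the real engine also deletes files that stop being declared):

```python
import pathlib
import tempfile

# Hypothetical toy: touch disk only when the declared state differs.
def declare_file(path: pathlib.Path, content: str) -> bool:
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.exists() and path.read_text() == content:
        return False  # target already matches the declaration — no-op
    path.write_text(content)
    return True       # state changed, so the write actually happened

outdir = pathlib.Path(tempfile.mkdtemp())
assert declare_file(outdir / "doc.md", "# hello") is True
assert declare_file(outdir / "doc.md", "# hello") is False  # unchanged → no-op
assert declare_file(outdir / "doc.md", "# edited") is True
```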
That’s the whole programming model. Everything below is consequences.
Why V1: the V0 ceiling
V0 worked — developers shipped real pipelines in production — but four constraints kept coming back, all consequences of the DSL-first design:
- Two worlds, one program. `FlowBuilder`/`DataScope`/`DataSlice` for topology, regular Python for leaves. The DSL world couldn’t `if` its way into a different shape, couldn’t pass values into normal Python and back. Clustering, entity resolution, reduction across items, conditional sources — every one of them hit the seam.
- A separate type system. Like a database, the engine had its own types distinct from Python’s, so every Python type you wanted to use needed a bi-directional conversion wired in. `dataclass` and `NDArray` were covered; a PIL `Image` or a pyarrow array was not — if no conversion existed, you couldn’t hand it to a function, and what did cross the boundary got marshaled on the way in and out.
- Postgres as a hard dependency. The engine used Postgres for its own bookkeeping, so `pip install cocoindex && python script.py` wasn’t a real path — you stood up a database first, even for a tiny local pipeline.
- Static topology. `add_source()` and `.export()` were declared before execution, so multi-tenancy, config-driven topologies, and runtime target selection all required scaffolding around CocoIndex rather than being expressible in it.
Fixing any one of these meant changing the design.
Four shifts that change what you can build
1. No more DSL — the flow grows during execution
In V0, the flow had to be fully described before any data moved through it. FlowBuilder returned a frozen graph; the engine then interpreted that graph at execution time. In V1, the flow is the execution. app_main runs as the root processing component. Every await coco.mount(...) it does adds a child component. Every declare_* call records a target state. The component tree is built by running your code, not by parsing a graph beforehand.
This unlocks shapes that were awkward or impossible in V0:
- Reduction patterns — “for each project, look at all its files, then aggregate.” The multi-codebase summarization example does exactly this:

  ```python
  @coco.fn(memo=True)
  async def process_project(project_name, files, output_dir):
      file_infos = await coco.map(extract_file_info, files)                    # fan out
      project_info = await aggregate_project_info(project_name, file_infos)    # reduce
      markdown = generate_markdown(project_name, project_info, file_infos)
      localfs.declare_file(output_dir / f"{project_name}.md", markdown)
  ```

- Multi-phase pipelines with cross-item logic — entity resolution across sessions, clustering, deduplication. The conversation-to-knowledge example uses three phases (per-session extraction, cross-session entity resolution, knowledge-base assembly), all expressed as function calls in a single `app_main`.
- Conditional topology — `if config.enable_kafka: await mount_kafka_target(...)`. No flag pattern, no separate flow definitions.
And because everything is real Python code, debugging is real Python debugging. Set a breakpoint. Step through. Print things. The engine’s not interpreting an opaque graph — it’s calling your functions.
2. Native Python types — no separate type system
V1 doesn’t have a separate type system — it uses Python’s directly. The types V0 supported through bindings (`dataclass`, Pydantic, `NDArray`, `dict`, `list`, `tuple`) still work with nothing to wire up, and types V0 couldn’t express — a PIL `Image`, a pyarrow array, an arbitrary class from a library you just pulled in — can be passed as function arguments the same way:
```python
from PIL import Image

@coco.fn(memo=True)
async def caption_image(img: Image) -> str:
    return await vlm.describe(img)
```
That `Image` is just `PIL.Image` — no wrapper, no conversion. Pass the function a pyarrow `Table`, a `torch.Tensor`, or any class from a library you happen to have imported; the signature is whatever you’d write in plain Python.
Targets, function arguments, return types, memoization keys — everything is described in the type hints you’d write anyway. For memoized return values that persist between runs, the same type hint drives serialization; we wrote a long post on how the type-guided serializer works underneath.
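As a rough illustration of the idea — not the actual serializer, and using a hypothetical record type — a dataclass’s own type hints can drive serialization recursively:

```python
import dataclasses
from typing import get_type_hints

# Hypothetical record type, for illustration only.
@dataclasses.dataclass
class ChunkRecord:
    start: int
    end: int
    text: str

def serialize(value: object, hint: type) -> object:
    if dataclasses.is_dataclass(hint):
        # Recurse field by field, guided by the dataclass's type hints.
        return {name: serialize(getattr(value, name), t)
                for name, t in get_type_hints(hint).items()}
    return value  # primitives pass through unchanged

row = serialize(ChunkRecord(0, 5, "hello"), ChunkRecord)
assert row == {"start": 0, "end": 5, "text": "hello"}
```

The real serializer covers far more (NumPy arrays, nested containers, third-party types), but the shape is the same: the hint you already wrote is the schema.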
3. Embedded LMDB — pip install and go
V0’s “Postgres for engine bookkeeping” requirement is gone. V1 stores its internal state in LMDB, an embedded key-value store that lives in a single local file. There’s no server process, no schema to migrate, no port to expose.
```shell
pip install --pre cocoindex
cocoindex update main.py
```
That’s the whole setup. The LMDB file holds the target-state ledger, the memoization cache, and the component-path tree across runs. On a 64-bit system the default 4 GiB map size is virtual address space, not physical memory — you can bump it to 8 GiB or more without paying the cost upfront.
Postgres is still a first-class target (with pgvector support); it’s just no longer required for the engine to function.
4. Dynamic sources and targets
In V0, sources and targets were declared in the flow definition, before execution. In V1, they’re created during execution — by regular function calls — and the component-path tree tracks their identity across runs. So you can:
- Mount a target conditionally based on a config row, an environment variable, or a feature flag.
- Create different targets per item — e.g., one Postgres schema per tenant, one Kafka topic per category, one S3 prefix per dataset.
- Build the topology from a database — query the list of active tenants at startup and mount one component per tenant, each with its own set of targets. When a tenant disappears from the list, CocoIndex automatically cleans up their target states (because the path is no longer mounted).
The conversation-to-knowledge example does this for entity types: it iterates over a list of entity configurations and mounts one SurrealDB target per type, all from inside a normal for loop in app_main.
```python
entity_tables = {
    cfg.name: await surrealdb.mount_table_target(SURREAL_DB, cfg.name, entity_schema)
    for cfg in ENTITY_TYPES
}
```
Add a new entity type to ENTITY_TYPES, restart, and a new table appears — managed by CocoIndex from then on. Remove a type, and its table goes away. No migration, no orphaned data.
The real leap over V0 is when that list isn’t a constant at all — a changing file, a database row, a live tenant registry. V0 could already generate topology from static inputs at flow-creation time, but it couldn’t react when those inputs changed.
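The registry-driven lifecycle can be sketched with a toy reconciler. The function and names here are hypothetical — this mirrors the mount/cleanup behavior without CocoIndex’s engine:

```python
# Hypothetical sketch: a registry drives which per-entry targets exist;
# entries that disappear get their targets dropped, mirroring CocoIndex's
# path-based cleanup.
def sync_mounts(registry: set[str], mounted: dict[str, str]) -> dict[str, str]:
    for name in registry - mounted.keys():
        mounted[name] = f"table_{name}"   # new registry entry → mount a target
    for name in set(mounted) - registry:
        del mounted[name]                 # entry removed → drop its target
    return mounted

mounted = sync_mounts({"person", "place"}, {})
assert mounted == {"person": "table_person", "place": "table_place"}
mounted = sync_mounts({"person"}, mounted)  # "place" leaves the registry
assert mounted == {"person": "table_person"}
```

In CocoIndex the second half is automatic: a component path that’s no longer mounted takes its target states with it.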
What we kept: the core promise
The redesign was the means, not the goal. The reason V1 exists is to make the things V0 was already good at more available. Those things are still here, and now they apply across a much wider set of pipeline shapes.
Fully managed targets
You declare what your target should look like; CocoIndex makes sure it’s in that state — on any change, whether to source data or to your code. This is the lifecycle, top to bottom:
| Layer | Upsert — first or changed declaration | Delete — no longer declared |
|---|---|---|
| Container — schema (tables · directories · other containers) | Create or alter | Drop |
| Leaf — data (rows · files · embeddings · other leaves) | Insert or update | Delete |
No manual migrations. No orphaned data. No “did I already write this row?” code paths. The target is fully owned by CocoIndex — if you stop declaring something, it goes away. This is the same contract whether the target is a Postgres table, a LanceDB collection, a Neo4j graph node, a Kafka topic, an S3 prefix, or a directory of files on disk.
Incremental processing, two levels deep
Incremental processing is still the core of the engine, and V1 exposes it through two concepts:
- Components own target states. When a component finishes, CocoIndex diffs its declared target states against the last run and applies only the changes — create / alter / drop for containers, insert / update / delete for contents — including recursive cleanup of sub-paths no longer mounted. When a component’s inputs and code haven’t changed, the whole component is skipped.
- `@coco.fn` functions participate in change detection. Every decorated function’s code is fingerprinted, and that fingerprint propagates up the call chain. Add `memo=True` and the return value is also cached, keyed on the function’s arguments plus its code fingerprint, so the body is skipped when neither has changed.
Edit a helper function and only the components and memoized callers that actually depend on it are invalidated.
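That propagation can be sketched as a fingerprint that folds in the fingerprints of everything a function calls. Hypothetical helper, not the real engine:

```python
import hashlib

# Conceptual sketch: a function's effective fingerprint combines its own
# code hash with the fingerprints of its callees, so editing a helper
# changes the fingerprint of exactly the callers that depend on it.
def combined_fingerprint(own_code: str, callee_fps: list[str]) -> str:
    h = hashlib.sha256(own_code.encode())
    for fp in sorted(callee_fps):   # order-independent over callees
        h.update(fp.encode())
    return h.hexdigest()

helper_v1 = combined_fingerprint("def helper(): return 1", [])
helper_v2 = combined_fingerprint("def helper(): return 2", [])
caller_v1 = combined_fingerprint("def caller(): return helper()", [helper_v1])
caller_v2 = combined_fingerprint("def caller(): return helper()", [helper_v2])
assert caller_v1 != caller_v2   # the helper edit propagates to the caller
```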
Rust-powered engine
The change-detection, fingerprinting, target-state diffing, and persistence are all written in Rust on top of Tokio. That’s still true — Python is the user-facing API, but the hot paths run in Rust. You get pythonic ergonomics without giving up the performance characteristics that made V0 feel snappy.
What it looks like end-to-end
Here’s a complete text-embedding pipeline that walks markdown files, chunks them, embeds the chunks with a SentenceTransformer model, and writes everything (with a vector index) to Postgres. It’s the text_embedding example trimmed down — every line of pipeline logic is below.
```python
import pathlib
from dataclasses import dataclass
from typing import Annotated, AsyncIterator

import asyncpg
from numpy.typing import NDArray

import cocoindex as coco
from cocoindex.connectors import localfs, postgres
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
from cocoindex.ops.text import RecursiveSplitter
from cocoindex.resources.chunk import Chunk
from cocoindex.resources.file import FileLike, PatternFilePathMatcher
from cocoindex.resources.id import IdGenerator

PG_DB = coco.ContextKey[asyncpg.Pool]("text_embedding_db")
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder", detect_change=True)
_splitter = RecursiveSplitter()

@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    async with await postgres.create_pool("postgres://...") as pool:
        builder.provide(PG_DB, pool)
        builder.provide(EMBEDDER, SentenceTransformerEmbedder(
            "sentence-transformers/all-MiniLM-L6-v2"))
        yield

@dataclass
class DocEmbedding:
    id: int
    filename: str
    chunk_start: int
    chunk_end: int
    text: str
    embedding: Annotated[NDArray, EMBEDDER]

@coco.fn
async def process_chunk(chunk: Chunk, filename, id_gen, table):
    table.declare_row(row=DocEmbedding(
        id=await id_gen.next_id(chunk.text),
        filename=str(filename),
        chunk_start=chunk.start.char_offset,
        chunk_end=chunk.end.char_offset,
        text=chunk.text,
        embedding=await coco.use_context(EMBEDDER).embed(chunk.text),
    ))

@coco.fn(memo=True)
async def process_file(file: FileLike, table) -> None:
    text = await file.read_text()
    chunks = _splitter.split(text, chunk_size=2000, chunk_overlap=500, language="markdown")
    id_gen = IdGenerator()
    await coco.map(process_chunk, chunks, file.file_path.path, id_gen, table)

@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    table = await postgres.mount_table_target(
        PG_DB, table_name="doc_embeddings",
        table_schema=await postgres.TableSchema.from_class(DocEmbedding, primary_key=["id"]),
    )
    table.declare_vector_index(column="embedding")
    files = localfs.walk_dir(
        sourcedir, recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]),
    )
    await coco.mount_each(process_file, files.items(), table)

app = coco.App(coco.AppConfig(name="TextEmbedding"), app_main,
               sourcedir=pathlib.Path("./markdown_files"))
```
Worth pausing on what’s not in there:
- No flow definition phase. `app_main` runs and the topology emerges.
- No custom column types. `embedding: Annotated[NDArray, EMBEDDER]` is the schema; `from_class` reads it.
- No insert / update / delete branches. `declare_row(...)` describes the desired state; the engine diffs.
- No “did this chunk already get embedded?” check. Memoization handles it.
- No bookkeeping table for which file was last processed. LMDB has the previous run’s fingerprints.
- No manual migration if you change `DocEmbedding`. Schema evolution is part of the managed-target contract.
Add a markdown file → its chunks get embedded and inserted. Edit one paragraph → only the affected chunks re-embed; the others reuse cached embeddings. Delete a file → its rows vanish from Postgres. Change the splitter’s `chunk_size` → every file re-chunks, but `embed` calls whose chunk text didn’t change still hit the memo cache. Change the embedder model → because `EMBEDDER` is `detect_change=True`, every embedding gets recomputed and the column re-syncs.
That’s the whole thing. None of those scenarios need a separate code path.
Living shapes V0 couldn’t reach
A short tour of pipeline shapes the V1 model makes natural, drawn from the example gallery:
- Multi-codebase summarization — fan out per project, fan out per file with `coco.map`, reduce back up to a project summary, write a markdown file per project. Pure Python control flow; the engine handles which projects need re-summarizing.
- Conversation-to-knowledge — three phases in one `app_main`: per-session extraction (`mount` per YouTube ID), cross-session entity resolution (`mount` per entity type), and knowledge-base assembly (one final `mount`). Each phase reads the outputs of the prior one through normal function returns.
- CSV → Kafka live — a live filesystem watcher feeding a Kafka topic target with at-most-once-per-change semantics, in about 60 lines. File-level source, row-level change events — one Kafka message per changed CSV row.
- HN trending topics, image search, PDF embedding, paper metadata extraction, Amazon S3 + embeddings, Google Drive + embeddings, code embedding to LanceDB — the connector ecosystem that landed during V0 came over to V1, and a few new ones (SurrealDB, SQLite, Kafka, S3) joined it.
Try it
```shell
pip install --pre cocoindex
```
Then either follow the Quickstart (PDF → Markdown in about thirty lines), or clone the repo and run any of the examples directly:
```shell
git clone https://github.com/cocoindex-io/cocoindex
cd cocoindex/examples/text_embedding
pip install -e .
cocoindex update main.py
```
The docs for V1 live at cocoindex.io/docs-v1. The pieces most worth reading first:
- Core Concepts — state-driven processing, target states, processing components, memoization
- App — top-level entry point, lifespan, CLI
- Processing Component — `mount`, `use_mount`, `mount_each`, component paths
- Function — `@coco.fn`, memoization, change detection
- Live Mode — pipelines that keep watching for changes
Bug reports, feature requests, and “this section of the docs is unclear” notes are all very welcome on GitHub issues, and we hang out in Discord. ⭐ us on GitHub if you want to follow where this goes — there’s a lot more shipping over the next few months.
Support us
If you found this useful, starring CocoIndex on GitHub is the most direct way to help. It’s how other developers find the project :)