CocoIndex core concepts

The mental model behind CocoIndex — declarative state-driven sync, target states, Apps, processing components, and how incremental execution skips unchanged work across both logic and input changes.

Version: v1.0.0-alpha48
Last reviewed: Apr 19, 2026

Incremental processing

When processing data and storing results in targets (e.g., a database) for knowledge retrieval by AI agents or search systems, both your data and your code evolve over time. Reprocessing everything after every change is expensive, slow, and disruptive. Incremental processing solves this by processing only what has changed and applying those changes to the target.

Implementing incremental processing by hand is complicated because:

  • You need to figure out what has changed and what has not.

  • You need to think in the time dimension and carefully compute the “delta”, e.g., what needs to be inserted, updated, or deleted in your target.

  • You need to track and preserve intermediate states to avoid full recomputation when possible.

  • You need to evolve the target schema and backfill data when the code logic changes.

With so many moving parts, when something goes wrong, it is difficult to debug.
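To make the delta problem concrete, here is what even a minimal hand-rolled version looks like in plain Python (illustrative only, not CocoIndex code). Even this toy case needs three separate change sets, and it still says nothing about intermediate-state tracking or schema evolution:

```python
# Hand-rolled delta computation between two snapshots of keyed state.
# Values stand in for content versions of each source item.

def compute_delta(old: dict[str, str], new: dict[str, str]):
    inserts = {k: v for k, v in new.items() if k not in old}
    updates = {k: v for k, v in new.items() if k in old and old[k] != v}
    deletes = {k for k in old if k not in new}
    return inserts, updates, deletes

old = {"a.md": "v1", "b.md": "v1"}
new = {"a.md": "v1", "b.md": "v2", "c.md": "v1"}
print(compute_delta(old, new))
# ({'c.md': 'v1'}, {'b.md': 'v2'}, set())
```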

State-driven programming

CocoIndex uses a declarative programming model — instead of programming how to incrementally process your data and apply changes to your target, you declaratively specify what your target should look like, based on the current state of your data source.

Info

If you’ve used React, spreadsheets, or materialized views, this mental model will feel familiar:

  • Spreadsheets: You declare formulas in cells. When any upstream cell changes, downstream cells automatically recompute to reflect the new state.

  • React: You declare your UI as a function of state. When state changes, React automatically re-renders the UI to match.

  • Materialized Views: You declare a query (e.g., SQL) that runs on source tables. When input data changes, the view automatically refreshes to match.

CocoIndex uses the above ideas to formulate a state-driven paradigm for long-running, side-effectful data processing pipelines with the following key concepts:

  • Data transformations: You read the current state from your source and perform a series of transformations. For example, converting PDFs to markdown files, extracting features or structures, or mapping data to fit a particular schema.

  • Target states: You output the results to a target such as a relational database, vector database, or file system. The target state is a pure function of the source state (i.e., it has no other side effects): TargetState = Transform(SourceState).

  • Incremental processing: Under the hood, when the source state changes, CocoIndex incrementally processes only the changes needed to update the target, so you don’t have to manage it yourself.
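The state-driven contract can be sketched in a few lines of plain Python (illustrative only, not CocoIndex API; upper-casing stands in for a real transformation such as PDF-to-markdown conversion):

```python
# Minimal illustration of TargetState = Transform(SourceState):
# the target state is a pure function of the source state.

def transform(source_state: dict[str, str]) -> dict[str, str]:
    return {name.replace(".pdf", ".md"): text.upper()
            for name, text in source_state.items()}

source = {"a.pdf": "hello", "b.pdf": "world"}
target = transform(source)           # TargetState = Transform(SourceState)
assert transform(source) == target   # pure: same input, same output
assert target == {"a.md": "HELLO", "b.md": "WORLD"}
```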

App

An app is the top-level executable entity in CocoIndex. In an app, you write code to:

  • Read state from sources

  • Transform the data

  • Declare target states: i.e., what the output should look like

CocoIndex then syncs these target states to external systems (Postgres, vector databases, etc.).

[Diagram] CocoIndex App: an App reads state from a source, transforms it, declares target state, and syncs to the target system (Source System → State → Transform F(x) → Target State → Target System).

For example, here’s an app that reads PDFs from a drive, converts them to markdown, and outputs to a folder:

[Diagram] Example App, PDF → Markdown: the app reads a list of PDFs from a drive folder, runs "Convert all files to Markdown", and outputs a list of Markdown (filename, content) to the target system.

Processing Component

In practice, your source often contains many items — files, rows, or entities — each of which can be processed independently. A processing component groups an item’s processing together with its target states. Each processing component runs independently and applies its target states to the external system as soon as it completes, without waiting for the rest of the app.

For example, if you have many files in a drive and want to process them file by file, your processing component would operate at the file level:

[Diagram] One processing component per file: each file gets its own processing component that converts it to markdown (a.pdf → a.md, b.pdf → b.md).

Taking this further, suppose you want to split each file into chunks and create embedding vectors for indexing. The processing component can still operate at the file level, but each component now produces multiple target states (one per chunk). CocoIndex applies all target state changes (inserts, updates, and deletes rows in the target database) as a unit for each file — all writes happen after processing completes, and each target backend applies its batch atomically when supported (e.g., within a database transaction):
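A sketch of that per-file unit in plain Python (not CocoIndex API; the real system manages this for you): a hypothetical processing component turns one file into per-chunk target states, and all deletes and inserts for that file are applied together.

```python
# One processing component per file: split into chunks, embed each
# chunk, and apply the whole batch of target-state changes as a unit.

def embed(chunk: str) -> list[float]:
    return [float(len(chunk))]       # stand-in for a real embedding model

def process_file(name: str, content: str) -> dict[str, list[float]]:
    chunks = content.split("\n")     # "split into chunks"
    return {f"{name}#{i}": embed(c) for i, c in enumerate(chunks)}

def apply_as_unit(target: dict, old_keys: set[str], new_states: dict):
    # delete stale rows and upsert new ones together, per file
    for key in old_keys - new_states.keys():
        del target[key]
    target.update(new_states)

target_db: dict[str, list[float]] = {}
apply_as_unit(target_db, set(), process_file("b.md", "chunk3\nchunk4"))
# later, b.md shrinks to one chunk: one delete + one upsert, as a unit
apply_as_unit(target_db, {"b.md#0", "b.md#1"}, process_file("b.md", "chunk5"))
```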

[Diagram] Processing component with chunks: each file is split into chunks; the processing component produces one target state per chunk and syncs them as a unit (a.md → chunk1/chunk2 → Embed → vector1/vector2; b.md → chunk3/chunk4 → Embed → vector3/vector4, stored in the vector database).

Let’s see what happens when the source state changes in different ways:

[Diagram] A new file c.md is added: a new processing component splits it into chunk5 and chunk6 and embeds them into vector5 and vector6.

When a new file (c.md) is added, a new processing component is created for it. Once execution completes, CocoIndex applies the new target states — inserting vector5 and vector6 into the vector store.

[Diagram] File b.md is updated: its processing component reruns; chunk3/chunk4 and their vectors vector3/vector4 are deleted, and the new chunk5 is embedded into vector5.

When file b.md is updated — say its content is reduced to just one chunk instead of two — the processing component’s target state changes from vector3 and vector4 to just vector5. CocoIndex deletes vector3 and vector4, then inserts vector5 into the vector database, all within a single transaction.

[Diagram] File b.md is deleted: its processing component goes away, and its target states vector3 and vector4 are deleted from the vector database.

When file b.md is deleted from the source folder, CocoIndex deletes its associated target states (vector3 and vector4) from the vector database in a single transaction.

Function memoization: skip unchanged computations

Function memoization is a technique that skips re-running a function when both its input and its code are unchanged from a previous run. It is essential for incremental processing — without it, every run would require full recomputation.

In CocoIndex, both processing components and transforms are expressed as functions, so function memoization can be enabled at either level. Using the chunk-embed example:

  • Processing component level: If a file hasn’t changed and the processing logic hasn’t changed, the entire processing component is skipped.

  • Transform level: If the input to the “embed” transform (the chunk text) hasn’t changed and the transform logic (e.g., the model) hasn’t changed, that specific embedding computation is skipped.

See Function Memoization for more details.
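As a sketch of the idea (not the actual Function Memoization implementation), a cache key derived from both the input and a logic version captures the "input unchanged and code unchanged" condition:

```python
# Memoization keyed on both the input and a logic version: a transform
# is skipped only when neither its input nor its code has changed.
import hashlib

_cache: dict[str, list[float]] = {}
calls = 0

def embed(chunk: str, logic_version: str = "model-v1") -> list[float]:
    global calls
    key = hashlib.sha256(f"{logic_version}:{chunk}".encode()).hexdigest()
    if key in _cache:
        return _cache[key]           # memoized: skip recomputation
    calls += 1
    result = [float(len(chunk))]     # stand-in for a real model call
    _cache[key] = result
    return result

embed("chunk3"); embed("chunk3")           # second call is a cache hit
embed("chunk3", logic_version="model-v2")  # logic changed: recompute
assert calls == 2
```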

[Diagram] Memoization baseline: both processing components and all Embed calls are memoized from a previous run.

Here’s how memoization behaves in different scenarios:

[Diagram] Input change with memoization: a.md's component is reused from cache; b.md's component reruns, reusing the memoized Embed(chunk3), deleting chunk4/vector4, and computing vector5 for the new chunk5.

When input file b.md changes:

  • The input state a.md is unchanged, so the 1st processing component is entirely reused without reprocessing.

  • The input state b.md changed, so the 2nd processing component must be reprocessed. After splitting, suppose we get two chunks: chunk3 (identical to before) and chunk5 (new).

    • Embed(chunk3) was memoized previously, so its cached result is reused.

    • Embed(chunk5) is new and must be computed.

[Diagram] Logic change with memoization: both components rerun under the new "Split into chunks" logic; a.md's chunks are unchanged, so its Embed results are reused, while b.md now yields chunk5 and chunk6, which must be embedded into vector5 and vector6.

When the “Split into chunks” logic changes:

  • All processing components must be reprocessed since the logic changed.

  • For the 1st processing component, the new logic produces the same chunks as before. The memoized Embed results are reused without recomputation.

  • For the 2nd processing component, the new logic produces different chunks (chunk5 and chunk6), so Embed must be invoked on them.

As these examples show, memoization can save expensive computations even when logic changes — as long as the intermediate results remain the same.

Next steps

Now that you understand the mental model, dive into the Programming Guide to learn how to use these concepts in code:

  • App — creating and running pipelines
  • Target State — declaring what should exist in external systems
  • Processing Component — structuring work and mounting components
  • Function — the @coco.fn decorator, memoization, and change detection