Core Concepts
Incremental processing
When processing data and storing results in targets (e.g., a database) for knowledge retrieval by AI agents or search systems, both your data and code evolve over time. Reprocessing everything after every change is expensive, slow, and disruptive. Incremental processing solves this by only processing what's changed, and applying those changes to the target.
Implementing incremental processing by hand is complicated because:
- You need to figure out what has changed and what has not.
- You need to think in the time dimension and carefully compute the “delta”, i.e., what needs to be inserted, updated, or deleted in your target.
- You need to track and preserve intermediate states to avoid full recomputation when possible.
- You need to evolve the target schema and backfill data when the code logic changes.
With so many moving parts, when something goes wrong, it is difficult to debug.
State-driven programming
CocoIndex uses a declarative programming model — instead of programming how to incrementally process your data and apply changes to your target, you declaratively specify what your target should look like, based on the current state of your data source.
If you've used React, spreadsheets, or materialized views, this mental model will feel familiar:
- Spreadsheets: You declare formulas in cells. When any upstream cell changes, downstream cells automatically recompute to reflect the new state.
- React: You declare your UI as a function of state. When state changes, React automatically re-renders the UI to match.
- Materialized Views: You declare a query (e.g., SQL) that runs on source tables. When input data changes, the view automatically refreshes to match.
CocoIndex uses the above ideas to formulate a state-driven paradigm for long-running, side-effectful data processing pipelines with the following key concepts:
- Data transformations: You read the current state from your source and perform a series of transformations. For example, converting PDFs to markdown files, extracting features or structures, or mapping data to fit a particular schema.
- Target states: You output the results to a target such as a relational database, vector database, or file system. Note that the target state is a pure function of the source state (i.e., it has no other side effects): TargetState = Transform(SourceState).
- Incremental processing: Under the hood, when the source state changes, CocoIndex incrementally processes only the changes needed to update the target, so you don't have to manage it yourself.
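The TargetState = Transform(SourceState) relationship can be illustrated with a minimal sketch in plain Python (this is a conceptual stand-in, not the CocoIndex API): because the target is a pure function of the source, re-running the transform on an unchanged source yields an identical target, and every target change is fully determined by a source change.

```python
def transform(source_state: dict[str, str]) -> dict[str, str]:
    """A pure transformation: the target depends only on the source state."""
    return {name: text.upper() for name, text in source_state.items()}

source = {"a.md": "hello", "b.md": "world"}

# No side effects: the same source state always yields the same target state.
assert transform(source) == transform(source)

# Any change in the target is fully determined by a change in the source.
source["b.md"] = "world!"
print(transform(source))  # {'a.md': 'HELLO', 'b.md': 'WORLD!'}
```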
App
An app is the top-level executable entity in CocoIndex. In an app, you write code to:
- Read state from sources
- Transform the data
- Declare target states, i.e., what the output should look like
CocoIndex then syncs these target states to external systems (Postgres, vector databases, etc.).
For example, here's an app that reads PDFs from a drive, converts them to markdown, and outputs to a folder:
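As a conceptual sketch of those three steps in plain Python (the helper names and dict-based target below are hypothetical stand-ins, not the actual CocoIndex API):

```python
def pdf_to_markdown(pdf_bytes: bytes) -> str:
    """Stand-in transform: a real app would run a PDF-to-markdown converter."""
    return "# " + pdf_bytes.decode("utf-8")

def run_app(source: dict[str, bytes]) -> dict[str, str]:
    """An app: read state from the source, transform it, declare target states."""
    target_states: dict[str, str] = {}
    for name, pdf in source.items():                           # 1. read
        markdown = pdf_to_markdown(pdf)                        # 2. transform
        target_states[name.replace(".pdf", ".md")] = markdown  # 3. declare
    return target_states  # CocoIndex then syncs these states to the target

drive = {"report.pdf": b"Quarterly Report"}
print(run_app(drive))  # {'report.md': '# Quarterly Report'}
```

In the real library, the sync step (and its incremental updates) is handled by CocoIndex rather than written by hand.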
Processing Component
In practice, your source often contains many items — files, rows, or entities — each of which can be processed independently. A processing component groups an item's processing together with its target states. Each processing component runs independently and applies its target states to the external system as soon as it completes, without waiting for the rest of the app.
For example, if you have many files in a drive and want to process them file by file, your processing component would operate at the file level:
Taking this further, suppose you want to split each file into chunks and create embedding vectors for indexing. The processing component can still operate at the file level, but each component now produces multiple target states (one per chunk). CocoIndex applies all target state changes (inserts, updates, and deletes rows in the target database) atomically for each file:
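A per-file processing component that splits and embeds can be sketched as a function from one file to its full set of target rows. The `split` and `embed` helpers below are simplified stand-ins for real chunkers and embedding models:

```python
def split(text: str, size: int = 10) -> list[str]:
    """Stand-in chunker: fixed-size character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunk: str) -> list[float]:
    """Stand-in embedder: a real app would call an embedding model."""
    return [float(len(chunk)), float(sum(map(ord, chunk)) % 100)]

def process_file(filename: str, text: str) -> dict[tuple[str, int], list[float]]:
    """One processing component: all target states (one row per chunk) for one file.
    CocoIndex applies this whole set of rows to the target atomically."""
    return {(filename, i): embed(chunk) for i, chunk in enumerate(split(text))}

rows = process_file("a.md", "hello world, hello again")
print(len(rows))  # 3 chunks -> 3 target rows
```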
Let's see what happens when the source state changes in different ways:
- On New File Added
- On Existing File Updated
- On Existing File Deleted
When a new file (c.md) is added, a new processing component is created for it. Once execution completes, CocoIndex applies the new target states — inserting vector5 and vector6 into the vector store.
When file b.md is updated — say its content is reduced to just one chunk instead of two — the processing component's target state changes from vector3 and vector4 to just vector5. CocoIndex deletes vector3 and vector4, then inserts vector5 into the vector database, all within a single transaction.
When file b.md is deleted from the source folder, CocoIndex deletes its associated target states (vector3 and vector4) from the vector database in a single transaction.
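All three scenarios above reduce to diffing a component's previous target states against its new ones. A minimal sketch of that delta computation (a hypothetical helper, not CocoIndex internals):

```python
def compute_delta(previous: dict[str, list[float]], current: dict[str, list[float]]):
    """Split the difference between two target states into inserts, updates, deletes."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    updates = {k: v for k, v in current.items() if k in previous and previous[k] != v}
    deletes = [k for k in previous if k not in current]
    return inserts, updates, deletes

# b.md previously produced vector3 and vector4; after the update, only vector5.
previous = {"vector3": [0.1], "vector4": [0.2]}
current = {"vector5": [0.3]}
inserts, updates, deletes = compute_delta(previous, current)
print(inserts, deletes)  # {'vector5': [0.3]} ['vector3', 'vector4']
```

CocoIndex applies the resulting inserts, updates, and deletes to the external system in a single transaction per component.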
Function memoization: skip unchanged computations
Function memoization is a technique that allows skipping a function when its input and code are unchanged from a previous run. It is essential for incremental processing — without it, every run would require full recomputation.
In CocoIndex, both processing components and transforms are expressed as functions, so function memoization can be enabled at either level. Using the chunk-embed example:
- Processing component level: If a file hasn't changed and the processing logic hasn't changed, the entire processing component is skipped.
- Transform level: If the input to the "embed" transform (the chunk text) hasn't changed and the transform logic (e.g., the model) hasn't changed, that specific embedding computation is skipped.
See Function Memoization for more details.
Here's how memoization behaves in different scenarios:
- On Input Data Change
- On Code Change
When input file b.md changes:
- The input state a.md is unchanged, so the 1st processing component is entirely reused without reprocessing.
- The input state b.md changed, so the 2nd processing component must be reprocessed. After splitting, suppose we get two chunks: chunk3 (identical to before) and chunk5 (new).
  - Embed(chunk3) was memoized previously, so its cached result is reused.
  - Embed(chunk5) is new and must be computed.
When the “Split into chunks” logic changes:
- All processing components must be reprocessed since the logic changed.
- For the 1st processing component, the new logic produces the same chunks as before. The memoized Embed results are reused without recomputation.
- For the 2nd processing component, the new logic produces different chunks (chunk5 and chunk6), so Embed must be invoked on them.
As these examples show, memoization can save expensive computations even when code changes — as long as the intermediate results remain the same.
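The behavior above can be sketched with a cache keyed on both the transform's code version and its input, a simplified stand-in for CocoIndex's memoization (the version string and counter are illustrative, not real API):

```python
cache: dict[tuple[str, str], list[float]] = {}
computations = 0

EMBED_VERSION = "v1"  # bump this when the transform logic (e.g., the model) changes

def embed(chunk: str) -> list[float]:
    """Memoized transform: recompute only if the code version or input changed."""
    global computations
    key = (EMBED_VERSION, chunk)
    if key not in cache:
        computations += 1  # count actual (non-cached) computations
        cache[key] = [float(len(chunk))]  # stand-in for a real embedding
    return cache[key]

embed("chunk3"); embed("chunk5")  # first run: two real computations
embed("chunk3"); embed("chunk5")  # unchanged input and code: both reused
print(computations)  # 2
```

Keying on the code version is what lets a logic change invalidate old results while still reusing any memoized computations whose inputs are unchanged.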