System updates and automatic schema inference

Q: What are the two dimensions of change an indexing system must handle?

An indexing system has to absorb change along two axes: data evolution (source records being added, updated, or deleted) and logic evolution (the processing rules changing, e.g. new fields to index, transformation logic changes, or new analysis requirements). This is analogous to a spreadsheet, where a change in either the source data or the formulas triggers updates to the derived data.See The two dimensions of change.

When building data processing and indexing systems, one of the key challenges is handling system updates gracefully. These systems maintain state across multiple components (like Pinecone, PostgreSQL, etc.) and need to evolve over time. Let’s explore the challenges and potential solutions.

The two dimensions of change

1. Data evolution

Source data is constantly changing - new records are added, existing ones are updated or deleted.

2. Logic evolution

The business logic and processing rules also evolve, for example,

New fields need to be indexed
Transformation logic changes
New analysis requirements emerge

This is similar to how spreadsheets work - changes in either source data or formulas trigger updates to the target data.

Two dimensions of change feeding one engine: data evolution (rows added, updated, deleted) and logic evolution (transformation code edited) both flow into CocoIndex, which infers storage and schema from the flow and reprocesses only the affected target rows, leaving the rest untouched

Infrastructure and schema management challenges

When setting up a new indexing flow, there are multiple moving parts to configure:

Internal data storage
Target storage systems (PostgreSQL, Pinecone, Milvus, etc.)
Pipeline logic needs to match with the component setup. For example, the fields that need to be carried into the index need to be carefully managed.

Currently, this often requires manual setup and careful coordination. Small mismatches in schema or field definitions can cause subtle bugs that are hard to debug.

CocoIndex Approach: reduce manual setup and infer from indexing flow

CocoIndex aims to simplify this by making infrastructure setup and schema management automatic and inference-based:

Flow-driven setup

Users define their indexing flow logic
CocoIndex automatically infers required storage and schema configurations
Internal and target storage is provisioned with correct schemas automatically

Benefits of inference

Like modern programming languages that use type inference, for example, when using Java/TS to write code, developers don’t need to define data types at every single step. CocoIndex can derive the necessary infrastructure setup from the flow definition. This:

Reduces manual configuration
Prevents schema mismatches
Makes updates more reliable
Allows the system to evolve more easily

How updates are actually applied

The inference above is what makes updates safe to run. Because CocoIndex knows the schema up front and persists what it created, an update is a reconciliation rather than a rebuild.

Declared state vs. previous state

A target state is what you declare should exist in an external system — a table, a row, a file, an embedding. CocoIndex treats your declarations as the source of truth and records them in its internal storage (an LMDB database that tracks target states and memoization results from previous runs). On the next run it compares what you now declare against what it stored last time and applies only the minimal changes needed:

Target state	On first declaration	When declared differently	When no longer declared
A database table	Create the table	Alter the table	Drop the table
A row in a table	Insert the row	Update the row	Delete the row
A file in a directory	Create the file	Update the file	Delete the file

This is the same mechanism for both dimensions of change. A new or deleted source record shows up as a row that is now (or no longer) declared. An edit to your logic — adding a field, changing a transformation — shows up as a row or table that is declared differently. Either way, CocoIndex computes the delta instead of reprocessing everything. Memoized components are skipped when their inputs haven’t changed, so unchanged work isn’t redone.

What setup and drop do

When a container target state changes — for instance you add a column or change a primary key — CocoIndex detects it and does its best to alter the target in place. If the change is too large to alter (changing primary keys is the canonical example), the target is dropped and recreated. Crucially, when that happens CocoIndex automatically reprocesses the affected components to backfill the data; you don’t have to manually trigger a full reprocess. This is driven by the target connector’s child-invalidation mechanism, which tells the engine whether a change is destructive (all children lost) or merely lossy (some data may be lost).

You can also reach for these transitions explicitly through the CLI. cocoindex update runs the app in catch-up mode and applies the reconciliation above. Passing --reset drops the existing setup before updating (equivalent to running cocoindex drop first), while --full-reprocess reprocesses everything and invalidates existing caches. The standalone cocoindex drop command reverts all target states an app created — dropping tables, deleting rows — and clears the app’s internal state database.

Keeping the index fresh

Catch-up mode is already incremental, but each update() call still has to scan sources to discover what changed, and changes are only picked up when you trigger a run. For near-real-time indexes, live mode keeps the app running after the initial catch-up and lets change-aware sources (a filesystem watcher, a database change feed, a Kafka consumer) stream updates continuously into the same target-state reconciliation — new or modified items re-mount the affected component, deletions remove it and its target states.

Looking forward

The future of data processing systems lies in smart automation that can:

Infer infrastructure needs from processing logic
Handle schema evolution gracefully
Maintain consistency across distributed storage
Make updates and changes reliable and predictable

By building these capabilities into CocoIndex, we can significantly reduce the operational burden on users while making systems more reliable and maintainable.

CocoIndex

An incremental engine for long-horizon agents — always-fresh, explainable data, one Python file.

Learn the concept → View on GitHub

Frequently asked questions.

What are the two dimensions of change an indexing system must handle?

An indexing system has to absorb change along two axes: data evolution (source records being added, updated, or deleted) and logic evolution (the processing rules changing, e.g. new fields to index, transformation logic changes, or new analysis requirements). This is analogous to a spreadsheet, where a change in either the source data or the formulas triggers updates to the derived data.

See The two dimensions of change.

How does CocoIndex infer storage and schema from an indexing flow?

CocoIndex makes setup inference-based: users define their indexing flow logic, and CocoIndex automatically derives the required storage and schema configurations, provisioning both internal and target storage with the correct schemas. The post compares this to type inference in languages like Java or TypeScript, where developers don't restate data types at every step.

See Flow-driven setup.

Why is manual schema setup for indexing pipelines error-prone?

Setting up an indexing flow involves multiple moving parts that must be configured and kept consistent: internal data storage, target storage systems (PostgreSQL, Pinecone, Milvus, etc.), and pipeline logic whose carried-through fields have to match the component setup. Done manually, small mismatches in schema or field definitions cause subtle bugs that are hard to debug.

See Infrastructure and schema management challenges.

What are the benefits of schema inference in CocoIndex?

Deriving infrastructure from the flow definition reduces manual configuration, prevents schema mismatches, makes updates more reliable, and lets the system evolve more easily. The goal is to cut the operational burden on users while keeping systems reliable and maintainable as both data and logic change.

See Benefits of inference.