When building data processing and indexing systems, one of the key challenges is handling system updates gracefully. These systems maintain state across multiple components (like Pinecone, PostgreSQL, etc.) and need to evolve over time. Let’s explore the challenges and potential solutions.
The two dimensions of change
1. Data evolution
Source data is constantly changing - new records are added, existing ones are updated or deleted.
2. Logic evolution
The business logic and processing rules also evolve, for example,
- New fields need to be indexed
- Transformation logic changes
- New analysis requirements emerge
This is similar to how spreadsheets work - changes in either source data or formulas trigger updates to the target data.
Infrastructure and schema management challenges
When setting up a new indexing flow, there are multiple moving parts to configure:
- Internal data storage
- Target storage systems (PostgreSQL, Pinecone, Milvus, etc.)
- Pipeline logic needs to match with the component setup. For example, the fields that need to be carried into the index need to be carefully managed.
Currently, this often requires manual setup and careful coordination. Small mismatches in schema or field definitions can cause subtle bugs that are hard to debug.
CocoIndex Approach: reduce manual setup and infer from indexing flow
CocoIndex aims to simplify this by making infrastructure setup and schema management automatic and inference-based:
Flow-driven setup
- Users define their indexing flow logic
- CocoIndex automatically infers required storage and schema configurations
- Internal and target storage is provisioned with correct schemas automatically
Benefits of inference
Like modern programming languages that use type inference, for example, when using Java/TS to write code, developers don’t need to define data types at every single step. CocoIndex can derive the necessary infrastructure setup from the flow definition. This:
- Reduces manual configuration
- Prevents schema mismatches
- Makes updates more reliable
- Allows the system to evolve more easily
How updates are actually applied
The inference above is what makes updates safe to run. Because CocoIndex knows the schema up front and persists what it created, an update is a reconciliation rather than a rebuild.
Declared state vs. previous state
A target state is what you declare should exist in an external system — a table, a row, a file, an embedding. CocoIndex treats your declarations as the source of truth and records them in its internal storage (an LMDB database that tracks target states and memoization results from previous runs). On the next run it compares what you now declare against what it stored last time and applies only the minimal changes needed:
| Target state | On first declaration | When declared differently | When no longer declared |
|---|---|---|---|
| A database table | Create the table | Alter the table | Drop the table |
| A row in a table | Insert the row | Update the row | Delete the row |
| A file in a directory | Create the file | Update the file | Delete the file |
This is the same mechanism for both dimensions of change. A new or deleted source record shows up as a row that is now (or no longer) declared. An edit to your logic — adding a field, changing a transformation — shows up as a row or table that is declared differently. Either way, CocoIndex computes the delta instead of reprocessing everything. Memoized components are skipped when their inputs haven’t changed, so unchanged work isn’t redone.
What setup and drop do
When a container target state changes — for instance you add a column or change a primary key — CocoIndex detects it and does its best to alter the target in place. If the change is too large to alter (changing primary keys is the canonical example), the target is dropped and recreated. Crucially, when that happens CocoIndex automatically reprocesses the affected components to backfill the data; you don’t have to manually trigger a full reprocess. This is driven by the target connector’s child-invalidation mechanism, which tells the engine whether a change is destructive (all children lost) or merely lossy (some data may be lost).
You can also reach for these transitions explicitly through the CLI. cocoindex update runs the app in catch-up mode and applies the reconciliation above. Passing --reset drops the existing setup before updating (equivalent to running cocoindex drop first), while --full-reprocess reprocesses everything and invalidates existing caches. The standalone cocoindex drop command reverts all target states an app created — dropping tables, deleting rows — and clears the app’s internal state database.
Keeping the index fresh
Catch-up mode is already incremental, but each update() call still has to scan sources to discover what changed, and changes are only picked up when you trigger a run. For near-real-time indexes, live mode keeps the app running after the initial catch-up and lets change-aware sources (a filesystem watcher, a database change feed, a Kafka consumer) stream updates continuously into the same target-state reconciliation — new or modified items re-mount the affected component, deletions remove it and its target states.
Looking forward
The future of data processing systems lies in smart automation that can:
- Infer infrastructure needs from processing logic
- Handle schema evolution gracefully
- Maintain consistency across distributed storage
- Make updates and changes reliable and predictable
By building these capabilities into CocoIndex, we can significantly reduce the operational burden on users while making systems more reliable and maintainable.
CocoIndex
An incremental engine for long-horizon agents — always-fresh, explainable data, one Python file.
Frequently asked questions.
What are the two dimensions of change an indexing system must handle?
An indexing system has to absorb change along two axes: data evolution (source records being added, updated, or deleted) and logic evolution (the processing rules changing, e.g. new fields to index, transformation logic changes, or new analysis requirements). This is analogous to a spreadsheet, where a change in either the source data or the formulas triggers updates to the derived data.
How does CocoIndex infer storage and schema from an indexing flow?
CocoIndex makes setup inference-based: users define their indexing flow logic, and CocoIndex automatically derives the required storage and schema configurations, provisioning both internal and target storage with the correct schemas. The post compares this to type inference in languages like Java or TypeScript, where developers don't restate data types at every step.
See Flow-driven setup.
Why is manual schema setup for indexing pipelines error-prone?
Setting up an indexing flow involves multiple moving parts that must be configured and kept consistent: internal data storage, target storage systems (PostgreSQL, Pinecone, Milvus, etc.), and pipeline logic whose carried-through fields have to match the component setup. Done manually, small mismatches in schema or field definitions cause subtle bugs that are hard to debug.
What are the benefits of schema inference in CocoIndex?
Deriving infrastructure from the flow definition reduces manual configuration, prevents schema mismatches, makes updates more reliable, and lets the system evolve more easily. The goal is to cut the operational burden on users while keeping systems reliable and maintainable as both data and logic change.