
System updates and automatic schema inference

How CocoIndex handles system updates in indexing flows: automatic schema inference and managing data + logic evolution without downtime.

Figure: Define the flow, and the schema follows. Data changes and logic changes feed CocoIndex schema inference, which auto-provisions the target schema and keeps it in sync.

When building data processing and indexing systems, one of the key challenges is handling system updates gracefully. These systems maintain state across multiple components (such as Pinecone and PostgreSQL) and need to evolve over time. Let’s explore the challenges and potential solutions.

The two dimensions of change

1. Data evolution

Source data is constantly changing: new records are added, and existing ones are updated or deleted.

2. Logic evolution

The business logic and processing rules also evolve. For example:

  • New fields need to be indexed
  • Transformation logic changes
  • New analysis requirements emerge

This is similar to how spreadsheets work: a change in either the source data or the formulas triggers updates to the target data.

Figure: Two dimensions of change feeding one engine. Data evolution (rows added, updated, deleted) and logic evolution (transformation code edited) both flow into CocoIndex, which infers storage and schema from the flow and reprocesses only the affected target rows, leaving the rest untouched.

Infrastructure and schema management challenges

When setting up a new indexing flow, there are multiple moving parts to configure:

  1. Internal data storage
  2. Target storage systems (PostgreSQL, Pinecone, Milvus, etc.)
  3. Pipeline logic, which must match the component setup. For example, the fields carried into the index must be defined consistently in the pipeline and in the target storage.

Currently, this often requires manual setup and careful coordination. Small mismatches in schema or field definitions can cause subtle bugs that are hard to debug.
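
To make the failure mode concrete, here is a minimal sketch of the kind of check a team ends up writing by hand when pipeline fields and target columns are maintained separately. The field names, column names, and the `find_mismatches` helper are all hypothetical and chosen only for illustration; none of them come from CocoIndex.

```python
def find_mismatches(pipeline_fields, target_columns):
    """Compare what the pipeline emits with what the target table expects."""
    emitted, expected = set(pipeline_fields), set(target_columns)
    not_stored = sorted(emitted - expected)    # emitted, but no column holds it
    never_filled = sorted(expected - emitted)  # column exists, nothing writes it
    return not_stored, never_filled


# Hypothetical drift: the pipeline renamed "embedding" to "embeddings",
# but the hand-written table definition was never updated.
pipeline_fields = ["doc_id", "text", "embeddings"]
target_columns = ["doc_id", "text", "embedding"]

not_stored, never_filled = find_mismatches(pipeline_fields, target_columns)
print("emitted but not stored:", not_stored)
print("declared but never filled:", never_filled)
```

A one-character rename like this does not necessarily raise an error at write time, which is why such mismatches tend to surface late.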

CocoIndex Approach: reduce manual setup and infer from indexing flow

CocoIndex aims to simplify this by making infrastructure setup and schema management automatic and inference-based:

Flow-driven setup

  • Users define their indexing flow logic
  • CocoIndex automatically derives the required storage and schema configurations
  • CocoIndex provisions both internal and target storage with the correct schemas

Benefits of inference

Modern programming languages use type inference: when writing Java or TypeScript, developers don’t need to declare data types at every single step. In the same way, CocoIndex can derive the necessary infrastructure setup from the flow definition. This:

  • Reduces manual configuration
  • Prevents schema mismatches
  • Makes updates more reliable
  • Allows the system to evolve more easily
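
As a rough illustration of the idea, the sketch below derives a table definition from the type of record a flow produces, so the schema is never written by hand. This is not CocoIndex's API: the `DocChunk` record, the `SQL_TYPES` mapping, and the `infer_create_table` helper are assumptions made up for this example, using only the Python standard library.

```python
from dataclasses import dataclass
from typing import get_type_hints

# Hypothetical mapping from Python types to SQL column types.
SQL_TYPES = {str: "TEXT", int: "BIGINT", float: "DOUBLE PRECISION", bool: "BOOLEAN"}


@dataclass
class DocChunk:
    """Hypothetical record type produced by the last step of a flow."""
    doc_id: str
    position: int
    text: str
    score: float


def infer_create_table(record_type, table, primary_key):
    """Derive a CREATE TABLE statement from the record type's field types."""
    hints = get_type_hints(record_type)  # field name -> Python type, in order
    columns = [f"{name} {SQL_TYPES[tp]}" for name, tp in hints.items()]
    columns.append(f"PRIMARY KEY ({', '.join(primary_key)})")
    return f"CREATE TABLE {table} ({', '.join(columns)})"


ddl = infer_create_table(DocChunk, "doc_chunks", ["doc_id", "position"])
print(ddl)
```

Adding a field to `DocChunk` changes the derived statement automatically, which is the property that keeps the pipeline and the target schema from drifting apart.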

How updates are actually applied

The inference above is what makes updates safe to run. Because CocoIndex knows the schema up front and persists what it created, an update is a reconciliation rather than a rebuild.

Declared state vs. previous state

A target state is what you declare should exist in an external system — a table, a row, a file, an embedding. CocoIndex treats your declarations as the source of truth and records them in its internal storage (an LMDB database that tracks target states and memoization results from previous runs). On the next run it compares what you now declare against what it stored last time and applies only the minimal changes needed:

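| Target state          | On first declaration | When declared differently | When no longer declared |
|-----------------------|----------------------|---------------------------|-------------------------|
| A database table      | Create the table     | Alter the table           | Drop the table          |
| A row in a table      | Insert the row       | Update the row            | Delete the row          |
| A file in a directory | Create the file      | Update the file           | Delete the file         |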
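
The comparison between declared and previously stored target states can be sketched as a plain dictionary diff. This is a simplified model under stated assumptions, not CocoIndex's implementation: the `reconcile` function, the string keys, and the action names are hypothetical.

```python
def reconcile(previous, declared):
    """Diff last run's target states against the current declarations.

    Both arguments map a target-state key to its declared value.
    Returns the minimal list of (action, key) pairs to apply.
    """
    actions = []
    for key, value in declared.items():
        if key not in previous:
            actions.append(("create", key))   # first declaration
        elif previous[key] != value:
            actions.append(("update", key))   # declared differently
    for key in previous:
        if key not in declared:
            actions.append(("delete", key))   # no longer declared
    return actions


previous = {"row:1": {"text": "hello"}, "row:2": {"text": "world"}}
declared = {"row:2": {"text": "world!"}, "row:3": {"text": "new"}}

plan = reconcile(previous, declared)
print(plan)
```

Unchanged keys produce no action at all, which is what makes the update proportional to the size of the change rather than the size of the index.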

This is the same mechanism for both dimensions of change. A new or deleted source record shows up as a row that is now (or no longer) declared. An edit to your logic — adding a field, changing a transformation — shows up as a row or table that is declared differently. Either way, CocoIndex computes the delta instead of reprocessing everything. Memoized components are skipped when their inputs haven’t changed, so unchanged work isn’t redone.
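
One way to picture memoization that reacts to both dimensions is to key each cached result on a fingerprint of the input record combined with a version of the logic. The sketch below assumes a hypothetical `logic_version` string standing in for a fingerprint of the transformation code; the `fingerprint` and `run` helpers are illustrative and not part of CocoIndex.

```python
import hashlib
import json


def fingerprint(record, logic_version):
    """Hash the input record together with the logic version."""
    payload = json.dumps(record, sort_keys=True) + "|" + logic_version
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run(records, memo, transform, logic_version):
    """Process records, reusing results whose fingerprint is unchanged."""
    new_memo, recomputed = {}, []
    for key, record in records.items():
        fp = fingerprint(record, logic_version)
        cached = memo.get(key)
        if cached is not None and cached[0] == fp:
            result = cached[1]            # input and logic unchanged: skip
        else:
            result = transform(record)    # data or logic changed: recompute
            recomputed.append(key)
        new_memo[key] = (fp, result)
    return new_memo, recomputed


records = {"a": {"text": "alpha"}, "b": {"text": "beta"}}

memo, first = run(records, {}, lambda r: r["text"].upper(), "v1")

records["b"] = {"text": "beta2"}          # data evolution: one record edited
memo, second = run(records, memo, lambda r: r["text"].upper(), "v1")

# Logic evolution: the transformation changes, so the version changes too.
memo, third = run(records, memo, lambda r: r["text"].title(), "v2")

print(first, second, third)
```

In this model a data edit recomputes only the edited record, while a logic edit invalidates every record that passes through the changed transformation.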

What setup and drop do

When a container target state changes — for instance you add a column or change a primary key — CocoIndex detects it and does its best to alter the target in place. If the change is too large to alter (changing primary keys is the canonical example), the target is dropped and recreated. Crucially, when that happens CocoIndex automatically reprocesses the affected components to backfill the data; you don’t have to manually trigger a full reprocess. This is driven by the target connector’s child-invalidation mechanism, which tells the engine whether a change is destructive (all children lost) or merely lossy (some data may be lost).
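
The decision between altering in place and dropping and recreating can be sketched as follows. This is a simplified model that assumes only two kinds of change (column additions or removals, and primary key changes); the `TableSpec` type, the `plan_table_change` function, and the step descriptions are hypothetical and do not reflect the actual connector interface.

```python
from dataclasses import dataclass


@dataclass
class TableSpec:
    """Hypothetical declared shape of a target table."""
    columns: dict        # column name -> column type
    primary_key: tuple   # names of the primary key columns


def plan_table_change(old, new):
    """Return (strategy, steps) for moving a table from `old` to `new`."""
    if old == new:
        return "noop", []
    if old.primary_key != new.primary_key:
        # Too large to alter in place: every existing row is lost,
        # so the affected components must be reprocessed to backfill.
        return "recreate", ["drop table", "create table", "reprocess components"]
    steps = [f"add column {c}" for c in new.columns if c not in old.columns]
    steps += [f"drop column {c}" for c in old.columns if c not in new.columns]
    return "alter", steps


old = TableSpec({"id": "TEXT", "text": "TEXT"}, ("id",))
added = TableSpec({"id": "TEXT", "text": "TEXT", "embedding": "VECTOR"}, ("id",))
rekeyed = TableSpec({"id": "TEXT", "text": "TEXT"}, ("id", "text"))

print(plan_table_change(old, added))
print(plan_table_change(old, rekeyed))
```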

You can also reach for these transitions explicitly through the CLI. cocoindex update runs the app in catch-up mode and applies the reconciliation above. Passing --reset drops the existing setup before updating (equivalent to running cocoindex drop first), while --full-reprocess reprocesses everything and invalidates existing caches. The standalone cocoindex drop command reverts all target states an app created — dropping tables, deleting rows — and clears the app’s internal state database.

Keeping the index fresh

Catch-up mode is already incremental, but each update() call still has to scan sources to discover what changed, and changes are only picked up when you trigger a run. For near-real-time indexes, live mode keeps the app running after the initial catch-up and lets change-aware sources (a filesystem watcher, a database change feed, a Kafka consumer) stream updates continuously into the same target-state reconciliation — new or modified items re-mount the affected component, deletions remove it and its target states.
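
The difference between the two modes can be sketched with an in-memory model: catch-up scans the whole source to find what changed, while live mode applies change events as they arrive. The `catch_up` and `live` functions and the `(kind, key, content)` event shape are assumptions for illustration only; a real change-aware source would supply the events.

```python
def catch_up(source, target, transform):
    """Scan the entire source once and bring the target in line with it."""
    for key, content in source.items():
        target[key] = transform(content)
    for key in list(target):
        if key not in source:
            del target[key]               # item disappeared from the source


def live(changes, target, transform):
    """Apply streamed change events without rescanning the source."""
    for kind, key, content in changes:
        if kind == "upsert":
            target[key] = transform(content)   # re-run the affected item
        elif kind == "delete":
            target.pop(key, None)              # remove its target state


source = {"a.md": "Alpha", "b.md": "Beta"}
target = {}

catch_up(source, target, str.lower)       # initial catch-up

# Hypothetical events, e.g. from a filesystem watcher.
events = [("upsert", "c.md", "Gamma"), ("delete", "a.md", None)]
live(events, target, str.lower)

print(target)
```

Both paths end in the same kind of target update; live mode only changes how the engine learns that something changed.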

Looking forward

The future of data processing systems lies in smart automation that can:

  • Infer infrastructure needs from processing logic
  • Handle schema evolution gracefully
  • Maintain consistency across distributed storage
  • Make updates and changes reliable and predictable

By building these capabilities into CocoIndex, we can significantly reduce the operational burden on users while making systems more reliable and maintainable.

Frequently asked questions

What are the two dimensions of change an indexing system must handle?

An indexing system has to absorb change along two axes: data evolution (source records being added, updated, or deleted) and logic evolution (the processing rules changing, e.g. new fields to index, transformation logic changes, or new analysis requirements). This is analogous to a spreadsheet, where a change in either the source data or the formulas triggers updates to the derived data.

See The two dimensions of change.

How does CocoIndex infer storage and schema from an indexing flow?

CocoIndex makes setup inference-based: users define their indexing flow logic, and CocoIndex automatically derives the required storage and schema configurations, provisioning both internal and target storage with the correct schemas. The post compares this to type inference in languages like Java or TypeScript, where developers don't restate data types at every step.

See Flow-driven setup.

Why is manual schema setup for indexing pipelines error-prone?

Setting up an indexing flow involves multiple moving parts that must be configured and kept consistent: internal data storage, target storage systems (PostgreSQL, Pinecone, Milvus, etc.), and pipeline logic whose carried-through fields have to match the component setup. Done manually, small mismatches in schema or field definitions cause subtle bugs that are hard to debug.

See Infrastructure and schema management challenges.

What are the benefits of schema inference in CocoIndex?

Deriving infrastructure from the flow definition reduces manual configuration, prevents schema mismatches, makes updates more reliable, and lets the system evolve more easily. The goal is to cut the operational burden on users while keeping systems reliable and maintainable as both data and logic change.

See Benefits of inference.