CocoIndex overview
CocoIndex is a real-time data transformation framework for AI with incremental processing: only the data that actually changed is reprocessed, lineage is observable end to end, and the schema is known up front.
It is a framework for building data processing pipelines for AI workloads, executed on a Rust engine with incremental processing built in.
Programming model
CocoIndex uses a declarative, state-driven programming model. You specify what your target should look like as a function of your source data — not how to incrementally update it. CocoIndex handles change detection and applies only the necessary updates automatically.
If you’ve used React, spreadsheets, or materialized views, this will feel familiar:
- React: declare UI as a function of state → React re-renders what changed
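- Spreadsheets: declare formulas → cells recompute when inputs change
- Materialized views: declare a query → the database keeps the stored result in sync with the base tables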
- CocoIndex: declare target states as a function of source → CocoIndex syncs what changed
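To make the analogy concrete, here is a minimal sketch of the state-driven idea in plain Python. It does not use the CocoIndex API: `reconcile` and `declare` are hypothetical names, and plain dictionaries stand in for a real source and target. The user supplies only `declare`, which says what a target entry should look like for a given source value, and the reconciler works out the inserts, updates, and deletes.

```python
from typing import Callable, Dict, List


def reconcile(
    source: Dict[str, str],
    target: Dict[str, str],
    declare: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Bring `target` in line with {key: declare(value)} and report the changes.

    Hypothetical helper for illustration only; not part of CocoIndex.
    """
    # The desired state is a pure function of the source.
    desired = {key: declare(value) for key, value in source.items()}
    changes: Dict[str, List[str]] = {"inserted": [], "updated": [], "deleted": []}

    # Write only entries that are missing or differ from the desired state.
    for key, value in desired.items():
        if key not in target:
            changes["inserted"].append(key)
            target[key] = value
        elif target[key] != value:
            changes["updated"].append(key)
            target[key] = value

    # Remove entries whose source item no longer exists.
    for key in list(target):
        if key not in desired:
            del target[key]
            changes["deleted"].append(key)

    return changes
```

For simplicity this sketch recomputes `declare` for every source item on each call and only minimizes writes to the target. Calling `reconcile` again after the source changes touches only the keys whose desired value differs.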
CocoIndex features
High-performance Rust 🦀 engine
CocoIndex executes pipelines on a high-performance Rust engine, delivering resilient and scalable data processing.
Easy to code
- Write simple transformations in Python without learning new DSLs
- Write batch-style code without worrying about deltas — CocoIndex runs it incrementally in both batch and live mode, continuously updating results. No separate DAGs, operators, or orchestration logic required.
Incremental & low-latency
CocoIndex tracks fine-grained dependencies and recomputes only what is affected by a change in the input data or the code. End-to-end update latency drops from hours or days to seconds, with no loss of correctness.
Full lineage & explainability
Every processing step, intermediate result, and execution path is inspectable. This supports the transparency obligations of regulations such as the EU AI Act and helps meet enterprise auditability and traceability requirements.
Open integration model
Sources and targets plug in through a standard, open interface (no vendor lock-in). Leverage the full Python ecosystem for models, functions, and libraries.
High throughput + controlled concurrency
Pipelines automatically parallelize with managed concurrency and request batching — reducing GPU cost, RPC fanout, and end-to-end latency.
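As a rough illustration of what managed concurrency with request batching means, the sketch below groups items into fixed-size batches and runs them through a bounded thread pool from the Python standard library. It is not how the CocoIndex engine is implemented: `run_batched`, `batch_fn`, `batch_size`, and `max_workers` are hypothetical names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(items: Sequence[T], size: int) -> List[List[T]]:
    """Split `items` into consecutive batches of at most `size` elements."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def run_batched(
    items: Sequence[T],
    batch_fn: Callable[[List[T]], List[R]],
    batch_size: int = 4,
    max_workers: int = 2,
) -> List[R]:
    """Call `batch_fn` once per batch, with at most `max_workers` calls in flight.

    Hypothetical helper for illustration only; not part of CocoIndex.
    """
    batches = batched(items, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order the batches were submitted.
        per_batch = pool.map(batch_fn, batches)
        return [result for batch in per_batch for result in batch]
```

Batching reduces the number of calls to an expensive backend, such as a model server, while the worker limit caps how many of those calls run at once.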
Fault-tolerant runtime
The engine gracefully retries transient failures and resumes from previous progress after interruptions — eliminating manual backfills and replays.
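The two behaviors described here, retrying transient failures and resuming from recorded progress, can be sketched in a few lines of plain Python. This is an assumption-laden illustration rather than the engine's actual mechanism: `TransientError`, `with_retries`, `process_all`, and the `done` set are hypothetical, and a real runtime would persist progress durably instead of keeping it in memory.

```python
import time
from typing import Callable, Dict, Set, TypeVar

R = TypeVar("R")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def with_retries(fn: Callable[[], R], attempts: int = 3, base_delay: float = 0.01) -> R:
    """Call `fn`, retrying on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


def process_all(
    items: Dict[str, int],
    handle: Callable[[str, int], None],
    done: Set[str],
) -> None:
    """Process every item not yet in `done`, recording each key on success."""
    for key, value in items.items():
        if key in done:
            continue  # finished in an earlier run: skip on resume
        with_retries(lambda: handle(key, value))
        done.add(key)
```

If a run is interrupted partway through, calling `process_all` again with the same `done` set picks up at the first unfinished item instead of starting over.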
Low operational overhead
CocoIndex removes the need for elaborate plumbing: refreshing datasets, maintaining state, handling backfills, ensuring correctness, coordinating GPUs, scaling workers, and managing infra are all handled by the engine.
Incremental data processing
CocoIndex continuously maintains and tracks state while processing only new or changed data. It is designed to support incremental processing from day zero.
What incremental processing means:
- Avoid unnecessarily recomputing work, based on multi-level change detection:
  - Component level: only reprocess source items with changes
  - Function level: within an item’s processing, memoize expensive function calls and reuse when possible
  - Target level: apply minimum necessary changes (insertions, updates, deletions) to the target
- Support multiple mechanisms to capture source changes (CDC, poll-based) out of the box
You write simple batch-style code — no delta logic, no state handling. CocoIndex automatically runs your pipeline incrementally and keeps the output up to date for serving, training, or feature computation.
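The three levels of change detection listed above can be sketched in plain Python. This is a toy model under stated assumptions, not CocoIndex's implementation: `IncrementalRunner` and `fingerprint` are hypothetical names, content hashes stand in for change detection, each line of an item stands in for a unit of expensive work, and all state lives in memory.

```python
import hashlib
from typing import Callable, Dict, List


def fingerprint(text: str) -> str:
    """Content hash used to detect changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class IncrementalRunner:
    """Hypothetical toy runner; not part of CocoIndex."""

    def __init__(self, expensive_fn: Callable[[str], str]) -> None:
        self.expensive_fn = expensive_fn
        self.seen: Dict[str, str] = {}          # item key -> fingerprint last processed
        self.memo: Dict[str, str] = {}          # line fingerprint -> cached result
        self.target: Dict[str, List[str]] = {}  # item key -> rows written to the target
        self.calls = 0                          # number of real expensive_fn calls

    def _cached(self, line: str) -> str:
        # Function level: reuse the result for input that was seen before.
        fp = fingerprint(line)
        if fp not in self.memo:
            self.calls += 1
            self.memo[fp] = self.expensive_fn(line)
        return self.memo[fp]

    def run(self, source: Dict[str, str]) -> Dict[str, List[str]]:
        ops: Dict[str, List[str]] = {"upsert": [], "delete": []}
        for key, content in source.items():
            fp = fingerprint(content)
            if self.seen.get(key) == fp:
                continue  # component level: unchanged item, skip entirely
            self.seen[key] = fp
            rows = [self._cached(line) for line in content.splitlines()]
            # Target level: write only when the rows actually differ.
            if self.target.get(key) != rows:
                self.target[key] = rows
                ops["upsert"].append(key)
        # Target level: delete rows whose source item is gone.
        for key in list(self.target):
            if key not in source:
                del self.target[key]
                self.seen.pop(key, None)
                ops["delete"].append(key)
        return ops
```

The code passed to the runner is ordinary batch-style code: `expensive_fn` knows nothing about deltas, and the bookkeeping around it decides what to skip, what to reuse, and what to write.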
Next steps
- Install CocoIndex and follow the Quickstart to build your first pipeline in 5 minutes
- Read Core Concepts for the mental model behind CocoIndex