React — for data engineering.
A persistent-state-driven model. You declare the desired state of your target. The engine keeps it in sync with the latest source data and code, across long time horizons, with low latency and low cost.
React re-renders only the nodes that changed. CocoIndex re-syncs only the rows that changed. Same reactive loop — one drives pixels, the other drives data.
Your code is as simple as the one-off version.
Write the transformation. Declare the sink. That’s the job. CocoIndex figures out what to rerun, what to cache, and what’s already fresh.
Transforms the input data — you write Python, not a DAG configuration.
In classical pipelines you describe how to move data: stages, operators, schedules, retries. In CocoIndex you describe what the target should be — a pure function of the source — and the engine derives the graph from your code.
The function you write looks like any other Python function. You set breakpoints in it, call it in tests, import it into a notebook. The decorator (@coco.fn) is opt-in metadata that lets the engine memoize, fingerprint, and re-run it incrementally — but the semantics stay the same: input in, output out.
Declares desired state for the target — the engine computes the minimum work to reach it.
A target state is a declaration: “this row should exist in this table with these values,” “this vector should live under this id,” “this Kafka topic should carry this message.” You describe the end state once. You never write insert / update / delete branches.
When inputs change, the engine diffs declared target state against what’s already in the store and applies the smallest set of mutations that reconciles the two. New rows get upserted, stale rows get removed, unchanged rows are skipped. The same pattern React uses to patch the DOM, applied to your data store.
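A minimal sketch of that diff-and-reconcile step, assuming the target is a simple key-value table (the function and field names here are illustrative, not the CocoIndex API):

```python
def reconcile(desired: dict, store: dict) -> dict:
    """Apply the smallest mutation set that makes `store` match `desired`."""
    upserts = {k: v for k, v in desired.items()
               if store.get(k) != v}                  # new or changed rows
    deletes = [k for k in store if k not in desired]  # stale rows
    for k in deletes:
        del store[k]
    store.update(upserts)                             # unchanged rows untouched
    return {"upserted": sorted(upserts), "deleted": sorted(deletes)}

store = {"a": 1, "b": 2, "c": 3}
plan = reconcile({"a": 1, "b": 20, "d": 4}, store)
# "a" is skipped, "b" is updated, "d" is inserted, "c" is removed
```

The caller never wrote an insert, update, or delete branch — only the desired end state. The diff decides which of the three each row needs, exactly as React's reconciler decides which DOM nodes to patch.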
Tracks lineage end-to-end — every byte in the target can be traced to a source.
Every declared target state is tagged with the source item(s) and function version(s) that produced it. When you ask “where did this chunk come from,” you get the file, the byte range, the code commit, and the run timestamp — without adding audit columns yourself.
Lineage is the same mechanism that powers incrementality: because the engine knows which source fingerprints produced which outputs, it knows exactly what to invalidate when any of those fingerprints change.
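The shared mechanism is just an index from source fingerprint to the outputs it produced. A hypothetical sketch (these names are not the CocoIndex API):

```python
import hashlib

def fp(data: bytes) -> str:
    # Content-addressed fingerprint of a source item.
    return hashlib.sha256(data).hexdigest()[:12]

lineage = {}  # source fingerprint -> output ids it produced

def record(source: bytes, output_ids: list[str]) -> None:
    lineage[fp(source)] = output_ids

def invalidate(old: bytes, new: bytes) -> list[str]:
    # Same fingerprint: nothing to do. Changed: exactly its outputs are stale.
    if fp(old) == fp(new):
        return []
    return lineage.get(fp(old), [])

record(b"doc v1", ["chunk:0", "chunk:1"])
```

Read the index forward and you get lineage ("which source made this chunk?"); read it backward on a change and you get incrementality ("which chunks must be redone?"). One data structure, both features.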
Runs incrementally at any scale — only the delta, never the full recompute.
The same code runs on a laptop against a toy repo, and on a shared daemon against a petabyte corpus. You do not choose between “batch” and “streaming” — the engine runs continuously and only touches what changed.
Memoization at the function level, component-path identity across runs, and content-addressed fingerprints mean unchanged work is always skipped — whether a developer edits one file on their laptop or a CI pipeline swaps out a helper function across a million files. Your bill scales with delta, not with corpus size.
The full analogy, row by row.
If React taught the frontend to stop thinking about DOM mutations, CocoIndex is teaching data engineers to stop thinking about ETL steps. The mechanics map almost 1:1.
Declare the target. Skip the plumbing.
The easiest way to feel the model is to write a tiny flow against a local folder. Five minutes end-to-end.
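To make the shape of such a flow concrete without pulling in the library, here is a self-contained toy that mimics it with plain Python (not the CocoIndex API — the transform is a trivial uppercase): scan a folder, fingerprint each file, recompute only the delta, and reconcile the target table.

```python
import hashlib
import pathlib
import tempfile

def sync(folder: pathlib.Path, target: dict, seen: dict) -> list[str]:
    """One reconcile pass: returns the names of files actually reprocessed."""
    touched, current = [], {}
    for f in sorted(folder.glob("*.txt")):
        data = f.read_bytes()
        h = hashlib.sha256(data).hexdigest()
        current[f.name] = h
        if seen.get(f.name) != h:                    # only the delta runs
            target[f.name] = data.decode().upper()   # the "transform"
            touched.append(f.name)
    for name in list(target):                        # drop rows for deleted files
        if name not in current:
            del target[name]
    seen.clear(); seen.update(current)
    return touched

with tempfile.TemporaryDirectory() as d:
    root = pathlib.Path(d)
    (root / "a.txt").write_text("hello")
    target, seen = {}, {}
    first = sync(root, target, seen)    # processes a.txt
    second = sync(root, target, seen)   # nothing changed: no work done
```

Running the pass twice does the transform once; the second pass is a no-op because every fingerprint already matches. That loop, run continuously against real sources and real stores, is the whole model.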