CocoIndex Basics
An index is a collection of data stored in a way that is easy for retrieval.
CocoIndex is an ETL framework for building indexes from specified data sources, a process known as indexing. It also offers utilities for users to retrieve data from the indexes.
Indexing flow
An indexing flow extracts data from specified data sources, applies specified transformations, and puts the transformed data into specified storage for later retrieval.
An indexing flow has two aspects: data and operations on data.
Data
An indexing flow involves source data and transformed data (either as an intermediate result or the final result to be put into storage). All data within the indexing flow has a schema determined at flow definition time.
Each piece of data has a data type, falling into one of the following categories:
- Basic type.
- Struct type: a collection of fields, each with a name and a type.
- Collection type: a collection of rows, each of which is a struct with a specified schema. A collection type can be a table (which has a key field) or a list (ordered but without a key field).
An indexing flow always has a top-level struct, containing all data within and managed by the flow.
See Data Types for more details about data types.
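The three categories above can be pictured with a small model in plain Python. This is only a sketch to make the nesting concrete: the class names (`BasicType`, `StructType`, `CollectionType`) and the basic type names used as strings are hypothetical and are not CocoIndex's actual API.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class BasicType:
    name: str  # placeholder type name, e.g. "str"


@dataclass
class StructType:
    # Field name -> data type of that field.
    fields: Dict[str, "DataType"]


@dataclass
class CollectionType:
    row: StructType                  # every row is a struct with this schema
    key_field: Optional[str] = None  # set for a table, None for a list

    @property
    def kind(self) -> str:
        return "table" if self.key_field is not None else "list"


DataType = Union[BasicType, StructType, CollectionType]

# A schema shaped like the one described later on this page:
# a top-level struct holding a table of documents, each holding a table of chunks.
chunk_row = StructType({"location": BasicType("range"), "text": BasicType("str")})
doc_row = StructType({
    "filename": BasicType("str"),
    "content": BasicType("str"),
    "chunks": CollectionType(chunk_row, key_field="location"),
})
top_level = StructType({"documents": CollectionType(doc_row, key_field="filename")})

print(top_level.fields["documents"].kind)  # table
```

Because the schema is an ordinary nested value, it can be built and checked before any data is processed, which mirrors the idea that schemas are fixed at flow definition time.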
Operations
An operation in an indexing flow defines a step in the flow. An operation is defined by:
- Action, which defines the behavior of the operation, e.g. import, transform, for each, collect and export. See Flow Definition for more details for each action.
- Some actions (i.e. "import", "transform" and "export") require an Operation Spec, which describes the specific behavior of the operation, e.g. a source to import from, a function describing the transformation behavior, a target storage to export to (as an index).
  - Each operation spec has an operation type, e.g. LocalFile (data source), SplitRecursively (function), SentenceTransformerEmbed (function), Postgres (storage).
  - The CocoIndex framework maintains a set of supported operation types. Users can also implement their own.
"import" and "transform" operations produce output data, whose data type is determined based on the operation spec and data types of input data (for "transform" operation only).
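As a rough illustration of how an action and an operation spec fit together, the sketch below models an operation as a plain Python value. The `Operation` and `OperationSpec` classes and the `params` contents are hypothetical and do not reflect how CocoIndex represents operations internally; only the action names and operation type names come from the text above.

```python
from dataclasses import dataclass, field
from typing import Optional

# Actions that need an operation spec, per the description above.
ACTIONS_REQUIRING_SPEC = {"import", "transform", "export"}
ALL_ACTIONS = ACTIONS_REQUIRING_SPEC | {"for each", "collect"}


@dataclass
class OperationSpec:
    operation_type: str                         # e.g. "LocalFile", "Postgres"
    params: dict = field(default_factory=dict)  # hypothetical settings


@dataclass
class Operation:
    action: str
    spec: Optional[OperationSpec] = None

    def __post_init__(self):
        if self.action not in ALL_ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")
        if self.action in ACTIONS_REQUIRING_SPEC and self.spec is None:
            raise ValueError(f"action {self.action!r} requires an operation spec")


# The steps of a small flow, written as data.
steps = [
    Operation("import", OperationSpec("LocalFile", {"path": "docs"})),
    Operation("for each"),
    Operation("transform", OperationSpec("SplitRecursively")),
    Operation("transform", OperationSpec("SentenceTransformerEmbed")),
    Operation("collect"),
    Operation("export", OperationSpec("Postgres")),
]

try:
    Operation("export")  # missing spec
except ValueError as err:
    print(err)
```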
Example
For the example shown in the Quickstart section, the indexing flow is as follows:
This creates the following data for the indexing flow:
- The LocalFile source creates a documents field at the top level, with filename (key) and content sub fields.
- A "for each" action works on each document, with the following transformations:
  - The SplitRecursively function splits content into chunks and adds a chunks field into the current scope (each document), with location (key) and text sub fields.
  - A "collect" action works on each chunk, with the following transformations:
    - The SentenceTransformerEmbed function embeds the chunk into a vector space, adding an embedding field into the current scope (each chunk).
This shows schema and example data for the indexing flow:
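As a rough plain-Python rendering of that nested data, the sketch below builds the documents, chunks and embedding fields by hand. It does not use CocoIndex: `split_text` and `toy_embed` are hypothetical stand-ins for SplitRecursively and SentenceTransformerEmbed, and the file name and content are made-up sample values.

```python
import hashlib


def split_text(content, chunk_size=20):
    """Stand-in for SplitRecursively: fixed-size chunks keyed by location."""
    return [
        {
            "location": (start, min(start + chunk_size, len(content))),
            "text": content[start:start + chunk_size],
        }
        for start in range(0, len(content), chunk_size)
    ]


def toy_embed(text, dims=4):
    """Stand-in for SentenceTransformerEmbed: a deterministic fake vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]


# The source creates a top-level documents table keyed by filename.
flow_data = {
    "documents": [
        {"filename": "a.md", "content": "CocoIndex builds indexes from data sources."},
    ]
}

# Work on each document: add a chunks field to the document scope.
for doc in flow_data["documents"]:
    doc["chunks"] = split_text(doc["content"])
    # Work on each chunk: add an embedding field to the chunk scope.
    for chunk in doc["chunks"]:
        chunk["embedding"] = toy_embed(chunk["text"])

for chunk in flow_data["documents"][0]["chunks"]:
    print(chunk["location"], repr(chunk["text"]), len(chunk["embedding"]))
```

Each new field lands in the scope where it was produced: chunks belongs to a document, and embedding belongs to a chunk.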
Life cycle of an indexing flow
An indexing flow, once set up, maintains a long-lived relationship between data source and data in target storage. This means:
- The target storage created by the flow remains available for querying at any time.
- As source data changes (new data added, existing data updated or deleted), data in the target storage is updated to reflect those changes, at a pace determined by the update mode:
  - One time update: Once triggered, CocoIndex updates the target data to reflect the version of source data up to the current moment.
  - Live update: CocoIndex continuously reacts to changes of source data and updates the target data accordingly, based on various change capture mechanisms for the source.
  See more details in the build / update target data section.
- CocoIndex intelligently reprocesses to propagate source changes to target by:
  - Determining which parts of the target data need to be recomputed
  - Reusing existing computations where possible
  - Only reprocessing the minimum necessary data
  This is known as incremental processing.
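One common way to get this behavior is to fingerprint each source row and skip rows whose fingerprint has not changed. The sketch below shows that idea for a one time update. It is an illustration under assumptions, not CocoIndex's actual mechanism: the `IncrementalIndexer` class is hypothetical, and it keeps its state in memory rather than in persistent storage.

```python
import hashlib


class IncrementalIndexer:
    """Toy one-time updater that recomputes only rows whose content changed."""

    def __init__(self, transform):
        self.transform = transform
        self.fingerprints = {}    # key -> hash of the content last processed
        self.target = {}          # key -> transformed output
        self.transform_calls = 0  # how often the transform actually ran

    def update(self, source):
        """Bring the target in line with the current source snapshot."""
        stats = {"added": 0, "updated": 0, "deleted": 0, "reused": 0}
        for key, content in source.items():
            fingerprint = hashlib.sha256(content.encode("utf-8")).hexdigest()
            if self.fingerprints.get(key) == fingerprint:
                stats["reused"] += 1  # unchanged: keep the existing output
                continue
            stats["updated" if key in self.target else "added"] += 1
            self.target[key] = self.transform(content)
            self.transform_calls += 1
            self.fingerprints[key] = fingerprint
        # Remove target rows whose source row no longer exists.
        for key in list(self.target):
            if key not in source:
                del self.target[key]
                del self.fingerprints[key]
                stats["deleted"] += 1
        return stats


indexer = IncrementalIndexer(str.upper)
first = indexer.update({"a.md": "alpha", "b.md": "beta"})
second = indexer.update({"a.md": "alpha", "b.md": "beta v2", "c.md": "gamma"})
print(first)
print(second)
print(indexer.transform_calls)
```

In the second call only b.md and c.md go through the transform, while the output for a.md is reused.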
You can think of an indexing flow as similar to formulas in a spreadsheet:
- In a spreadsheet, you define formulas that transform input cells into output cells
- When input values change, the spreadsheet recalculates affected outputs
- You focus on defining the transformation logic, not managing updates
CocoIndex works the same way, but with more powerful capabilities:
- Instead of flat tables, CocoIndex models data in nested data structures, making it more natural to model complex data
- Instead of simple cell-level formulas, you have operations like "for each" to apply the same formula across rows without repeating yourself
This means when writing your flow operations, you can treat source data as if it were static - focusing purely on defining the transformation logic. CocoIndex takes care of maintaining the dynamic relationship between sources and target data behind the scenes.
Internal storage
As an indexing flow is long-lived, it needs to store intermediate data to keep track of the states. CocoIndex uses internal storage for this purpose.
Currently, CocoIndex uses a Postgres database as the internal storage.
See Initialization for configuring its location. The cocoindex setup CLI command (see CocoIndex CLI) creates tables for the internal storage.
Retrieval
There are two ways to retrieve data from target storage built by an indexing flow:
- Query the underlying target storage directly for maximum flexibility.
- Use CocoIndex query handlers for a more convenient experience with built-in tooling support (e.g. CocoInsight) to understand query performance against the target data.
Query handlers are tied to specific indexing flows. They accept query inputs, transform them by defined operations, and retrieve matching data from the target storage that was created by the flow.
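The shape of a query handler can be sketched in plain Python: transform the query with the same operation used at indexing time, then rank rows in the target by similarity. Everything below is hypothetical and stands in for real components: `toy_embed` is a letter-frequency vector rather than a real embedding model, `target_rows` is an in-memory list rather than a target storage, and `query_handler` is not CocoIndex's API.

```python
import math


def toy_embed(text):
    """Stand-in embedding: a 26-dimensional letter-frequency vector."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1
    return counts


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Rows as an indexing flow might have exported them (sample values).
texts = [
    ("a.md", "postgres stores the index"),
    ("b.md", "chunks are embedded as vectors"),
    ("c.md", "source files are split into chunks"),
]
target_rows = [
    {"filename": name, "text": text, "embedding": toy_embed(text)}
    for name, text in texts
]


def query_handler(query, top_k=2):
    """Embed the query like the indexed text, then return the closest rows."""
    query_vector = toy_embed(query)
    scored = [(cosine(query_vector, row["embedding"]), row) for row in target_rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"filename": row["filename"], "text": row["text"], "score": score}
        for score, row in scored[:top_k]
    ]


for hit in query_handler("postgres stores the index"):
    print(hit["filename"], round(hit["score"], 3))
```

Reusing the indexing-time transformation for the query is what keeps query vectors and stored vectors comparable.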