
CocoIndex Flow Definition

In CocoIndex, an indexing flow is defined by a function that imports data from sources, transforms it, and puts it into target storage (sinks). You connect the inputs and outputs of these operations through fields of data scopes.

Entry Point

A CocoIndex flow is defined by a function:

The easiest way is to use the @cocoindex.flow_def decorator:

@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    ...

The @cocoindex.flow_def decorator declares the function as a CocoIndex flow definition.

It takes two arguments:

  • flow_builder: a FlowBuilder object to help build the flow.
  • data_scope: a DataScope object, representing the top-level data scope. Any data created by the flow should be added to it.

Alternatively, for more flexibility (e.g. if you want to define the flow conditionally or generate its name dynamically), you can explicitly call the cocoindex.flow.add_flow_def() function:

def demo_flow_def(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    ...

# Add the flow definition to the flow registry.
demo_flow = cocoindex.flow.add_flow_def("DemoFlow", demo_flow_def)

In both cases, demo_flow will be an instance of the cocoindex.Flow class. See Flow Running for more details.

Data Scope

A data scope represents data for a certain unit, e.g. the top-level scope (covering all data for a flow), a document, or a chunk. A data scope holds a set of fields and collectors, and users can add new fields and collectors to it.

Get or Add a Field

You can get or add a field of a data scope; each field is a data slice.

note

You cannot override an existing field.

Getting and setting a field of a data scope is done by the [] operator with a field name:

@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add "documents" to the top-level data scope.
    data_scope["documents"] = flow_builder.add_source(DemoSourceSpec(...))

    # Each row of "documents" is a child scope.
    with data_scope["documents"].row() as document:
        # Get "content" from the document scope, transform it, and add "summary" to the scope.
        document["summary"] = document["content"].transform(DemoFunctionSpec(...))

Add a collector

See Data Collector below for more details.

Data Slice

A data slice references a subset of data belonging to a data scope, e.g. a specific field from a data scope. A data slice has a certain data type, and it's the input for most operations.

Import from source

To obtain the initial data slice, start by importing data from a source. FlowBuilder provides an add_source() method to import data from external sources. A source spec must be provided for any import operation, describing the source and its parameters. Imports must happen at the top level of a flow, and the field created by an import must be in the top-level struct.

@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(DemoSourceSpec(...))
    ...
note

The actual data values are not available when the flow is defined; they are only available at runtime. In a flow definition, you can use a data representation as the input for operations, but you cannot access its actual value.

Refresh interval

You can provide a refresh_interval argument. When present, in live update mode the data source will be refreshed at the specified interval.

The refresh_interval argument is of type datetime.timedelta. For example, this refreshes the data source every 1 minute:

import datetime

@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    data_scope["documents"] = flow_builder.add_source(
        DemoSourceSpec(...), refresh_interval=datetime.timedelta(minutes=1))
    ...
info

In live update mode, on each refresh CocoIndex traverses the data source to detect changes, and only performs transformations on changed source keys.

Transform

The transform() method applies a function to a data slice, producing another data slice. A function spec must be provided for any transform operation, describing the function and its parameters.

The function takes one or multiple data arguments. The first argument is the data slice to be transformed, on which the transform() method is called. The other arguments can be passed in as positional or keyword arguments after the function spec.

@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    ...
    data_scope["field2"] = data_scope["field1"].transform(
        DemoFunctionSpec(...),
        arg1, arg2, ..., key0=kwarg0, key1=kwarg1, ...)
    ...

For each row

If the data slice has a Table type, you can call the row() method to obtain a child scope representing each row, and apply operations within it.

@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    ...
    with data_scope["table1"].row() as table1_row:
        # Operations applied to each row
        table1_row["field2"] = table1_row["field1"].transform(DemoFunctionSpec(...))

Get a sub field

If the data slice has a Struct type, you can obtain a data slice on a specific sub field of it, similar to getting a field of a data scope.
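
As a sketch (the metadata and title field names here are hypothetical, and DemoFunctionSpec is a placeholder spec like elsewhere on this page), assuming each row of documents has a Struct field metadata with a sub field title:

```python
@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    ...
    with data_scope["documents"].row() as document:
        # "metadata" is assumed to be a Struct-typed field; indexing into it
        # yields a data slice for its "title" sub field, which can then be
        # used as the input of a transform like any other data slice.
        document["title_summary"] = document["metadata"]["title"].transform(
            DemoFunctionSpec(...))
```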

Data Collector

A data collector is added to a specific data scope, and it collects multiple entries of data from that scope or its child scopes.

Collect

Call its collect() method to collect a specific entry, which can have multiple fields. Each field's name is given by the keyword argument name, and its value takes one of the following forms:

  • A DataSlice.

  • An enum value cocoindex.GeneratedField.UUID, indicating that its value is a UUID automatically generated by the engine. The UUID remains stable as long as the other collected input values are unchanged.

    note

    An automatically generated UUID field is allowed to appear at most once.

For example,

@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    ...
    demo_collector = data_scope.add_collector()
    with data_scope["documents"].row() as document:
        ...
        demo_collector.collect(id=cocoindex.GeneratedField.UUID,
                               filename=document["filename"],
                               summary=document["summary"])
    ...

Here the collector is in the top-level data scope. It collects the filename and summary fields from each row of documents, and generates an id field with a UUID that remains stable as long as filename and summary are unchanged.

Export

The export() method exports the collected data to an external storage.

A storage spec needs to be provided for any export operation, to describe the storage and parameters related to the storage.

Export must happen at the top level of a flow, i.e. not within any child scopes created by "for each row". It takes the following arguments:

  • name: the name to identify the export target.
  • target_spec: the storage spec as the export target.
  • setup_by_user (optional): whether the export target is set up by the user. By default, CocoIndex manages the target setup (via the cocoindex setup CLI subcommand), e.g. creating related tables/collections/etc. with a compatible schema and updating them upon change. If True, the export target is managed by the user, who is responsible for creating the target and updating it upon change.
  • Fields to configure storage indexes. primary_key_fields is required, and all others are optional.
@cocoindex.flow_def(name="DemoFlow")
def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    ...
    demo_collector = data_scope.add_collector()
    ...
    demo_collector.export(
        "demo_storage", DemoStorageSpec(...),
        primary_key_fields=["field1"],
        vector_indexes=[cocoindex.VectorIndexDef("field2", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])

The target storage is managed by CocoIndex: it'll be created by the CocoIndex CLI when you run cocoindex setup, and the data will be automatically updated (including stale-data removal) when the index is updated. The name for the same storage should remain stable across runs. If it changes, CocoIndex treats it as the old storage being removed and a new one created, and performs setup changes and reindexing accordingly.

Storage Indexes

Many storages support indexes to make data retrieval more efficient. CocoIndex provides a common way to configure indexes for various storages.

  • Primary key. primary_key_fields (Sequence[str]): the fields to use as the primary key. The types of these fields must be supported as key fields. See Key Types for more details.
  • Vector index. vector_indexes (Sequence[VectorIndexDef]): the fields to create vector indexes on. VectorIndexDef has the following fields:
    • field_name: the field to create a vector index on.
    • metric: the similarity metric to use. See Vector Type for more details about supported similarity metrics.

Miscellaneous

Target Declarations

Most of the time, a target storage is created by calling the export() method on a collector, and that export() call carries the configurations needed for the target storage, e.g. options for storage indexes. Occasionally, you may need to specify configurations for a target storage outside the context of any specific data collector.

For example, for graph database targets like Neo4j, a data collector may export data to Neo4j relationships, which in turn creates the nodes referenced by those relationships. These nodes don't directly come from any specific data collector (relationships from different data collectors may share the same nodes). To specify configurations for these nodes, you can declare a spec for the related node labels.

FlowBuilder provides declare() method for this purpose, which takes the spec to declare, as provided by various target types.

flow_builder.declare(
    cocoindex.storages.Neo4jDeclarations(...)
)

Auth Registry

CocoIndex manages an auth registry. It's an in-memory key-value store, mainly to store authentication information for a backend.

An operation spec is the default way to configure a backend, but it has the following limitations:

  • A spec isn't supposed to contain secret information, and it's frequently shown in various places, e.g. by cocoindex show.
  • Once an operation is removed after a flow definition code change, its spec is gone too, but we still need to be able to drop the backend (e.g. a table) via cocoindex setup or cocoindex drop.

The auth registry is introduced to solve these problems. It works as follows:

  • You can create a new auth entry with a key and a value.
  • You can reference the entry by its key, and pass it as part of the spec for certain operations, e.g. Neo4j takes a connection field in the form of an auth entry reference.

You can add an auth entry with the cocoindex.add_auth_entry() function, which returns a cocoindex.AuthEntryReference:

my_graph_conn = cocoindex.add_auth_entry(
    "my_graph_conn",
    cocoindex.storages.Neo4jConnectionSpec(
        uri="bolt://localhost:7687",
        user="neo4j",
        password="cocoindex",
    ))

Then reference it when building a spec that takes an auth entry:

  • You can reference the entry by the AuthEntryReference object directly:

    demo_collector.export(
        "MyGraph",
        cocoindex.storages.Neo4jRelationship(connection=my_graph_conn, ...))

  • You can also reference it by the key string, using the cocoindex.ref_auth_entry() function:

    demo_collector.export(
        "MyGraph",
        cocoindex.storages.Neo4jRelationship(connection=cocoindex.ref_auth_entry("my_graph_conn"), ...))

Note that CocoIndex backends use the key of an auth entry to identify the backend.

  • Keep the key stable. As long as the key doesn't change, the entry is considered to refer to the same backend (even if the underlying connection or authentication details change).

  • If a key is no longer referenced in any operation spec, keep it around until the next cocoindex setup or cocoindex drop, so that CocoIndex can still perform the necessary cleanups.