Data transformation for AI

Ultra-performant data transformation framework for AI, with incremental processing and data lineage out of the box. Exceptional developer velocity. Production-ready from day 0.

CocoIndex

Open source ETL framework designed for AI workloads.

CocoIndex is an ultra-performant data transformation framework with a core engine written in Rust. It makes it easy to transform data for AI workloads and keeps source data and targets in sync effortlessly.

Use it for creating embeddings, building knowledge graphs, or any data transformation beyond traditional SQL.

CocoIndex is the world's first open source ETL framework / compute engine that supports 1) incremental processing, 2) custom logic, and 3) compute-heavy transformations beyond SQL, for example LLM inference, structured extraction, and vector embedding.

Exceptional Velocity

Just declare transformations as a data flow; minimal code is needed. Get started with ~100 lines of Python.

# import
data['content'] = flow_builder.add_source(...)

# transform
data['out'] = (
    data['content']
    .transform(...)
    .transform(...)
)

# collect data
collector.collect(...)

# export to db, vector db, graph db ...
collector.export(...)

Data flow diagram showing CocoIndex's transformation pipeline

Build like LEGO

Native built-ins for different sources, targets, and transformations. A standardized interface makes switching between components a one-line code change.

CocoIndex native components: 1) ingestion: local files, ingestion API, web pages, cloud storage. 2) indexing: parse, extract structure, chunk, embed, map, dedup, cluster, reconcile. 3) export: relational db, vector db, graph db, object db.
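
To illustrate what a standardized interface buys, here is a minimal sketch in plain Python. It is not the CocoIndex API: `Target`, `InMemoryTarget`, `JsonLinesTarget`, and `run_pipeline` are hypothetical names invented for this example. Because both targets expose the same `write` method, switching between them is a one-line change at the call site.

```python
import io
import json
from typing import Iterable, Protocol


class Target(Protocol):
    """Hypothetical common interface that every export target implements."""

    def write(self, rows: Iterable[dict]) -> int: ...


class InMemoryTarget:
    """Stores rows in a list, standing in for a database table."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def write(self, rows: Iterable[dict]) -> int:
        batch = list(rows)
        self.rows.extend(batch)
        return len(batch)


class JsonLinesTarget:
    """Writes one JSON object per line to any text stream."""

    def __init__(self, stream: io.TextIOBase) -> None:
        self.stream = stream

    def write(self, rows: Iterable[dict]) -> int:
        count = 0
        for row in rows:
            self.stream.write(json.dumps(row, sort_keys=True) + "\n")
            count += 1
        return count


def run_pipeline(texts: Iterable[str], target: Target) -> int:
    """Apply a trivial transformation and export through the shared interface."""
    rows = ({"text": t, "length": len(t)} for t in texts)
    return target.write(rows)


memory_target = InMemoryTarget()
written_to_memory = run_pipeline(["hello", "coco"], memory_target)

# Switching targets is a one-line change: only the `target` argument differs.
buffer = io.StringIO()
written_to_jsonl = run_pipeline(["hello", "coco"], JsonLinesTarget(buffer))
```

The pipeline function never inspects the concrete target type, which is what keeps the swap down to a single argument.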

Incremental Processing

Out-of-the-box support for incremental indexing

Minimal recomputation on source or logic changes.
(Re-)process only the necessary portions; reuse the cache when possible.

Incremental processing diagram showing CocoIndex's efficient data flow
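
The idea behind incremental processing can be sketched in a few lines of plain Python. This illustrates the general technique (content fingerprinting plus a result cache), not CocoIndex's actual engine; `IncrementalRunner`, `logic_version`, and the fingerprinting scheme are assumptions made for the example.

```python
import hashlib
from typing import Callable


class IncrementalRunner:
    """Hypothetical runner that recomputes only rows whose input or logic changed."""

    def __init__(self, transform: Callable[[str], str], logic_version: str) -> None:
        self.transform = transform
        self.logic_version = logic_version
        self.cache: dict[str, tuple[str, str]] = {}  # key -> (fingerprint, output)
        self.recomputed: list[str] = []  # keys recomputed by the latest run

    def _fingerprint(self, content: str) -> str:
        # The fingerprint covers both the source content and the logic version,
        # so a change to either one invalidates the cached result.
        payload = f"{self.logic_version}\x00{content}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def run(self, source: dict[str, str]) -> dict[str, str]:
        self.recomputed = []
        # Drop cached results for rows that no longer exist in the source.
        for key in list(self.cache):
            if key not in source:
                del self.cache[key]
        for key, content in source.items():
            fingerprint = self._fingerprint(content)
            cached = self.cache.get(key)
            if cached is None or cached[0] != fingerprint:
                self.cache[key] = (fingerprint, self.transform(content))
                self.recomputed.append(key)
        return {key: output for key, (_, output) in self.cache.items()}


runner = IncrementalRunner(str.upper, logic_version="v1")
first = runner.run({"a.md": "alpha", "b.md": "beta"})
first_recomputed = list(runner.recomputed)

# Only b.md changed, so only b.md is reprocessed; a.md is served from cache.
second = runner.run({"a.md": "alpha", "b.md": "beta v2"})
second_recomputed = list(runner.recomputed)

# Changing the logic version invalidates every cached row.
runner.logic_version = "v2"
third = runner.run({"a.md": "alpha", "b.md": "beta v2"})
third_recomputed = list(runner.recomputed)
```

Folding the logic version into the fingerprint is one simple way to cover both triggers named above, a source change and a logic change, with a single cache check.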

Flow is Single Source of Truth

Define once, run in multiple modes

batch
long-running job (live update that watches your source)
sample-based fast preview run (CocoInsight)

Automatic schema setup based on logic and data

data processing and schema management

The CocoIndex flow is the single source of truth. Once a flow is defined, it can be run in multiple modes, and schemas in target stores are managed automatically.
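
As a rough illustration of "define once, run in multiple modes", the plain-Python sketch below reuses one flow function for a batch run and a sample-based preview, and derives a target schema from the flow's output. It does not use the CocoIndex API; `word_count_flow`, `run_batch`, `run_preview`, and `infer_schema` are hypothetical helpers. A long-running live mode would call the same flow whenever the source changes and is left out to keep the sketch short.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator

Row = dict
Flow = Callable[[Iterable[Row]], Iterator[Row]]


def word_count_flow(rows: Iterable[Row]) -> Iterator[Row]:
    """The single flow definition; every run mode below reuses it unchanged."""
    for row in rows:
        yield {"filename": row["filename"], "words": len(row["content"].split())}


def run_batch(flow: Flow, source: Iterable[Row]) -> list[Row]:
    """Process every source row once."""
    return list(flow(source))


def run_preview(flow: Flow, source: Iterable[Row], sample_size: int = 1) -> list[Row]:
    """Process only a small sample for a fast look at the output."""
    return list(flow(islice(source, sample_size)))


def infer_schema(rows: list[Row]) -> dict[str, str]:
    """Derive column types from the flow's output instead of declaring them by hand."""
    schema: dict[str, str] = {}
    for row in rows:
        for name, value in row.items():
            schema.setdefault(name, type(value).__name__)
    return schema


source = [
    {"filename": "a.md", "content": "hello world"},
    {"filename": "b.md", "content": "one two three"},
]

batch_rows = run_batch(word_count_flow, source)
preview_rows = run_preview(word_count_flow, source)
schema = infer_schema(batch_rows)
```

Because the schema is computed from what the flow emits, changing the flow changes the schema with it, which is the sense in which the flow acts as the single source of truth here.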

CocoInsight

Your Data Lineage and Observability Companion

You don't need to be a data expert. CocoInsight provides best-in-class tools to understand your pipeline step by step, explains the process, and helps you choose the best indexing strategy.

One of the most loved features among users building ETL with CocoIndex: it significantly boosts developer velocity and lowers the barrier to entry for data engineering.

CocoInsight's data lineage and observability tool

Start free and scale as you grow

Open Source

Self-hosted
Free
Apache 2.0 license

CocoInsight

Free for personal use
Join our Discord group to get started!

Team/Enterprise

VPC / On Premise Deployments
Guaranteed customer support and SLA
Enterprise source connectors
Data governance - PII
Cost and usage optimization
CocoInsight
Support and control plane for distributed computing