Extract, Transform, Index Data. Easy and Fresh.

CocoIndex is the world's first open-source engine that supports both custom transformation logic and incremental updates specialized for data indexing.
Heavy Transformation
Parse, Chunk, Embedding...
Incremental indexing on data or logic update
Custom transformation logic

Describe the Data Flow

Spin up your pipeline in a few lines of code.
import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding(flow: cocoindex.FlowBuilder, data: cocoindex.DataScope):
  # Add data source.
  data["documents"] = flow.add_source(
    cocoindex.sources.LocalFile(path="sourceFiles"))
  doc_embeddings = data.add_collector()

  with data["documents"].row() as doc:
    # Split into chunks.
    doc["chunks"] = doc["content"].transform(
      cocoindex.functions.SplitRecursively(), chunk_size=512)

    with doc["chunks"].row() as chunk:
      # Embed each chunk.
      chunk["embedding"] = chunk["text"].transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"))

      # Collect embedding and metadata for indexing.
      doc_embeddings.collect(
        filename=doc["filename"], location=chunk["location"],
        text=chunk["text"], embedding=chunk["embedding"])

  # Export to vector store.
  doc_embeddings.export(
      "doc_embeddings", cocoindex.storages.Postgres(),
      primary_key_fields=["filename", "location"],
      vector_index=["embedding"])

As simple as data and formulas in a spreadsheet:

  • You declare data, define transformations, and we execute

  • No side effects

  • Built-in lineage and observability

  • Super easy to understand and troubleshoot

Coco Does the Rest

Compared with conventional indexing pipelines, which are often error-prone and tricky to handle, CocoIndex saves you this work:

Set up table schemas for indexing and maintain them when logic changes

(Re-)process only the necessary portions; reuse the cache when possible

Keep data fresh; clear stale derived data and versions

Re-index data based on tracking data/logic changes or data TTL settings

Resume from terminated execution without recomputing everything
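
The bookkeeping listed above can be pictured with a small sketch. This is not CocoIndex's implementation; it is a minimal illustration that assumes a hypothetical in-memory index keyed by a fingerprint of the source content plus a logic version string. The names `IncrementalIndexer`, `logic_version`, and `recomputed` are made up for this example.

```python
import hashlib
from typing import Callable, Dict, Tuple


class IncrementalIndexer:
    """Hypothetical sketch: reprocess a row only when its content or the logic changes."""

    def __init__(self, transform: Callable[[str], str], logic_version: str):
        self.transform = transform
        self.logic_version = logic_version
        # key -> (fingerprint, derived value)
        self.index: Dict[str, Tuple[str, str]] = {}
        self.recomputed = 0  # counts transform calls, for illustration only

    def _fingerprint(self, content: str) -> str:
        # The fingerprint covers both the data and the logic version,
        # so changing either one invalidates the cached result.
        payload = f"{self.logic_version}\0{content}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def update(self, source: Dict[str, str]) -> None:
        # Clear stale derived data for rows that no longer exist in the source.
        for key in list(self.index):
            if key not in source:
                del self.index[key]
        # Recompute only rows whose fingerprint changed; reuse the rest.
        for key, content in source.items():
            fp = self._fingerprint(content)
            cached = self.index.get(key)
            if cached is not None and cached[0] == fp:
                continue
            self.index[key] = (fp, self.transform(content))
            self.recomputed += 1


indexer = IncrementalIndexer(str.upper, logic_version="v1")
indexer.update({"a.md": "hello", "b.md": "world"})   # both rows are processed
indexer.update({"a.md": "hello", "b.md": "world!"})  # only b.md is reprocessed
```

Because the fingerprint mixes in the logic version, bumping `logic_version` forces every row to be recomputed on the next `update`, which mirrors re-indexing on a logic change.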

Defined once, run in different scenarios

CocoIndex handles the scalability you need and makes your pipeline robust. Once you are ready to deploy to production, CocoIndex saves you time and cost.

Sample Preview

Fast, sample-based preview runs for understanding and debugging at development time

Batch Processing

Batch runs over the entire data source, at large scale

Continuous Updates

Continuously apply incremental source changes to keep the index up to date, with low latency
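
To illustrate the "defined once, run in different scenarios" idea, here is a toy sketch in plain Python. It does not use the CocoIndex API; `word_count`, `sample_preview`, `batch_run`, and `apply_changes` are hypothetical names, and the point is only that a single transformation definition can back all three modes.

```python
from itertools import islice
from typing import Dict, Iterable, Optional, Tuple


def word_count(text: str) -> int:
    """Stand-in transformation, defined once and reused by every mode below."""
    return len(text.split())


def sample_preview(rows: Iterable[Tuple[str, str]], n: int = 2) -> Dict[str, int]:
    # Look at only the first n rows for fast feedback during development.
    return {key: word_count(text) for key, text in islice(rows, n)}


def batch_run(rows: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    # Process the entire source in one pass.
    return {key: word_count(text) for key, text in rows}


def apply_changes(index: Dict[str, int],
                  changes: Iterable[Tuple[str, Optional[str]]]) -> None:
    # Apply a stream of (key, new_text) changes; None means the row was deleted.
    for key, text in changes:
        if text is None:
            index.pop(key, None)
        else:
            index[key] = word_count(text)
```

A typical session would call `sample_preview` while iterating on the logic, `batch_run` to build the full index, and `apply_changes` afterwards to keep that index current as the source changes.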

CocoIndex Components

CocoIndex helps you connect to all your data sources, identify the best indexing strategy, and set up a robust pipeline: chunking, embedding models, deduping and reconciling, vector stores, knowledge graphs, and more. It then provides a standard API to access the index.

Your Input

Web Pages
Documents
Databases

Applications

Search
RAG
Analytics

Indexing Stack

Ingestion Connectors

Web
Cloud Storage
Ingestion API

Parse, Convert

PDF
Markdown
HTML
XML
Slides
Google Doc
Docs
JSON

Extract, Split

Flat chunks
Hierarchical chunks
Knowledge Graph triple extraction

Align, Reconcile

Dedupe
Cross-doc reference
Entity alignment

Query Stack

Query API

Query Understanding

Planning

Rerank

Related Discover

Related Retrieval

Retrieve

Index

Graph Store
Relational Store
Object Store
Vector Store

CocoInsight

You don't need to be a data expert. CocoInsight provides best-in-class tools to help you understand your pipeline step by step, and it explains and helps you choose the best indexing strategy.

Chunking

Observe, understand, and compare different chunking configurations to quickly iterate on your strategy.

Pricing

CocoIndex Open Source

  • Self-hosted
  • Free
  • Apache 2.0 license

CocoInsight

Free during beta. Join our Discord group to get started!

Enterprise

  • VPC / On-Prem Deployments
  • Guaranteed customer support and SLA
  • Enterprise source connectors
  • Data governance - PII
  • Cost and usage optimization
  • CocoInsight
  • Support and control plane for distributed computing