CocoIndex quickstart

Point CocoIndex at a folder of markdown files, chunk and embed each doc with a sentence transformer, and land the vectors in Postgres with pgvector, end to end.

Time: ~10 minutes
Language: Python 3.11+
Requires: Postgres + pgvector
Version: v0.3.37
Last reviewed: Jan 6, 2026

In this tutorial, we’ll build an index with text embeddings, keeping it minimal and focused on the core indexing flow.

Flow Overview

  1. Read text files from the local filesystem
  2. Chunk each document
  3. For each chunk, embed it with a text embedding model
  4. Store the embeddings in a vector database for retrieval

Setup

  1. Install CocoIndex:

    pip install -U 'cocoindex[embeddings]'
  2. Install Postgres.

  3. Create a new directory for your project:

    mkdir cocoindex-quickstart
    cd cocoindex-quickstart
  4. Place your input files in a directory named markdown_files, or download the sample files from markdown_files.zip.
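
If you prefer not to download the archive, you can generate a couple of sample documents yourself. A minimal sketch — the filenames and contents here are placeholders, not the files from markdown_files.zip:

```python
from pathlib import Path

# Create the input directory the flow will read from.
docs_dir = Path("markdown_files")
docs_dir.mkdir(exist_ok=True)

# Placeholder documents; any markdown content works.
samples = {
    "hello.md": "# Hello\n\nCocoIndex turns documents into vectors.\n",
    "notes.md": "# Notes\n\nChunks are embedded and stored in Postgres.\n",
}
for name, content in samples.items():
    (docs_dir / name).write_text(content)
```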

Define a flow

Create a new file main.py and define a flow.

import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # ... See subsections below for function body

Add Source and Collector

# add source
data_scope["documents"] = flow_builder.add_source(
    cocoindex.sources.LocalFile(path="markdown_files"))

# add data collector
doc_embeddings = data_scope.add_collector()

flow_builder.add_source creates a table whose rows have the subfields filename and content.
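
Conceptually, each row of the documents table pairs a filename with that file's content. A rough sketch, with plain dicts standing in for CocoIndex's internal table and row types:

```python
# Illustrative only: CocoIndex manages these rows internally;
# plain dicts stand in for its table/row types here.
documents = [
    {"filename": "hello.md", "content": "# Hello\n\nSome markdown text."},
    {"filename": "notes.md", "content": "# Notes\n\nMore markdown text."},
]

# Downstream steps iterate row by row, which is what
# data_scope["documents"].row() expresses in the flow.
field_names = {name for doc in documents for name in doc}
```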

Process each document

With CocoIndex, it is easy to process nested data structures.

with data_scope["documents"].row() as doc:
    # ... See subsections below for function body

Chunk each document

doc["chunks"] = doc["content"].transform(
    cocoindex.functions.SplitRecursively(),
    language="markdown", chunk_size=2000, chunk_overlap=500)

We add a new field chunks to each row by transforming the content field with SplitRecursively. The output of SplitRecursively is a KTable in which each row represents one chunk of the document.
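
To build intuition for what recursive splitting does, here is a toy sketch — not CocoIndex's actual implementation (which is language-aware and also handles chunk_overlap, omitted here): try coarse separators first, and recurse with finer ones only when a piece is still too large.

```python
def split_recursively(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Toy sketch of recursive splitting (not CocoIndex's implementation)."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separators left: hard-split at the size limit.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Piece is still too big: retry with the next, finer separator.
            chunks.extend(split_recursively(piece, chunk_size, rest))
    return [c for c in chunks if c]
```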

Embed each chunk and collect the embeddings

with doc["chunks"].row() as chunk:
    # embed
    chunk["embedding"] = chunk["text"].transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2"
        )
    )

    # collect
    doc_embeddings.collect(
        filename=doc["filename"],
        location=chunk["location"],
        text=chunk["text"],
        embedding=chunk["embedding"],
    )

This code embeds each chunk using the SentenceTransformer library and collects the results.
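
The model maps each chunk to a fixed-size vector (384 dimensions for all-MiniLM-L6-v2); semantically similar texts end up close under cosine similarity, the metric configured for the index when exporting. A plain-Python sketch of that metric, with tiny made-up vectors standing in for real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors standing in for real 384-d embeddings:
# identical directions score 1.0, orthogonal directions score 0.0.
same = cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```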

Export the embeddings to Postgres

doc_embeddings.export(
    "doc_embeddings",
    cocoindex.storages.Postgres(),
    primary_key_fields=["filename", "location"],
    vector_indexes=[
        cocoindex.VectorIndexDef(
            field_name="embedding",
            metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
        )
    ],
)
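
Once exported, the table can be queried directly with pgvector's cosine-distance operator. The sketch below only builds the SQL string; the table name is an assumption about CocoIndex's naming scheme, so check your database for the actual name before running it.

```python
# pgvector's <=> operator computes cosine distance (1 - cosine similarity),
# so ORDER BY distance returns the most similar chunks first.
# The table name below is an assumption; verify it in your database.
table = "textembedding__doc_embeddings"
query = (
    f"SELECT filename, text, embedding <=> %s::vector AS distance "
    f"FROM {table} ORDER BY distance LIMIT 5"
)
# Execute with any Postgres driver (e.g. psycopg), passing the query
# embedding as the %s parameter, formatted as a pgvector literal.
```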

CocoIndex supports other vector databases as well; switching is a one-line change.

Run the indexing pipeline

  • Specify the database URL by environment variable:

    export COCOINDEX_DATABASE_URL="postgresql://cocoindex:cocoindex@localhost:5432/cocoindex"
Prerequisite

Make sure your Postgres server is running before proceeding. See how to launch CocoIndex for details.

  • Build the index:

    cocoindex update main

CocoIndex will run for a few seconds and populate the target table with data as declared by the flow, printing statistics like the following:

documents: 3 added, 0 removed, 0 updated

That’s it for the main indexing flow.

End to end: Query the index (Optional)

If you want to build an end-to-end flow that also queries the index, follow the simple_vector_index example.

Next Steps

From here, you can explore the rest of the CocoIndex documentation for more examples and customization options.
