Quickstart


In this tutorial, we’ll build an index with text embeddings, keeping it minimal and focused on the core indexing flow.

Flow Overview

  1. Read text files from the local filesystem
  2. Chunk each document
  3. For each chunk, embed it with a text embedding model
  4. Store the embeddings in a vector database for retrieval

Setup

  1. Install CocoIndex:

    pip install -U 'cocoindex[embeddings]'
  2. Install Postgres with the pgvector extension. If you don't have an instance running, see the Docker sketch after this list.

  3. Create a new directory for your project:

    mkdir cocoindex-quickstart
    cd cocoindex-quickstart
  4. Place input files in a directory named markdown_files. You can download sample files from markdown_files.zip.
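
If you don't have Postgres available, one convenient option is to start it with the pgvector Docker image. This is a sketch under the assumption that you use Docker; any Postgres instance with pgvector works. The user, password, and database name below match the COCOINDEX_DATABASE_URL used in the last step of this tutorial:

    docker run -d --name cocoindex-postgres \
      -e POSTGRES_USER=cocoindex \
      -e POSTGRES_PASSWORD=cocoindex \
      -e POSTGRES_DB=cocoindex \
      -p 5432:5432 pgvector/pgvector:pg16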

Define a flow

Create a new file main.py and define a flow.

main.py
import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # ... See subsections below for function body

Add Source and Collector

main.py
    # add source
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files"))

    # add data collector
    doc_embeddings = data_scope.add_collector()

flow_builder.add_source creates a table of documents, where each row has the sub fields (filename, content).

Process each document

With CocoIndex, it is easy to process nested data structures. Calling row() opens a scope over each row of the documents table, so everything inside the with block runs once per document.

main.py
    with data_scope["documents"].row() as doc:
        # ... See subsections below for the body of this block

Chunk each document

main.py
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

We add a new field chunks to each row by transforming the content field with SplitRecursively. The output of SplitRecursively is a KTable, with one row per chunk of the document.

Embed each chunk and collect the embeddings

main.py
        with doc["chunks"].row() as chunk:
            # embed
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"
                )
            )

            # collect
            doc_embeddings.collect(
                filename=doc["filename"],
                location=chunk["location"],
                text=chunk["text"],
                embedding=chunk["embedding"],
            )

This code embeds each chunk with the Sentence Transformers library and collects each chunk's filename, location, text, and embedding.
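
For intuition, here is what that embedding step computes, as a standalone sketch outside the flow (assuming the sentence-transformers package, which the cocoindex[embeddings] extra installs):

import sentence_transformers

# Load the same model used by SentenceTransformerEmbed in the flow above.
model = sentence_transformers.SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vector = model.encode("CocoIndex chunks and embeds documents.")
print(vector.shape)  # (384,) -- all-MiniLM-L6-v2 produces 384-dimensional vectors

Whichever model you pick, the same one must be used at query time, so that query and document embeddings live in the same vector space.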

Export the embeddings to Postgres

main.py
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
            )
        ],
    )

CocoIndex supports other vector databases as well; switching targets is a one-line change.
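
For example, exporting to Qdrant would only swap the target argument. This sketch is illustrative: the parameter name below is an assumption, so check the Targets documentation for the exact API.

    doc_embeddings.export(
        "doc_embeddings",
        # Illustrative only -- collection_name is an assumed parameter; see the Targets doc.
        cocoindex.storages.Qdrant(collection_name="doc_embeddings"),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
            )
        ],
    )

Everything else in the flow, including the chunking and embedding steps, stays the same.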

Run the indexing pipeline

  • Set the database URL via an environment variable:

    export COCOINDEX_DATABASE_URL="postgresql://cocoindex:cocoindex@localhost:5432/cocoindex"
  • Build the index:

    cocoindex update --setup main.py

CocoIndex will run for a few seconds and populate the target table with data as declared by the flow. It will output statistics like the following:

    documents: 3 added, 0 removed, 0 updated

That's it for the main indexing flow.

End to end: Query the index (Optional)

If you want to build an end-to-end query flow that also searches the index, you can follow the simple_vector_index example.
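
If you'd rather poke at the index directly, here is a minimal query sketch against Postgres with pgvector. Assumptions: the psycopg and pgvector Python packages are installed, and the target table is named textembedding__doc_embeddings (flow name plus target name); verify the actual table name in your database.

from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer
import psycopg

# Embed the query with the same model used in the indexing flow.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query_vector = model.encode("How does CocoIndex work?")

with psycopg.connect("postgresql://cocoindex:cocoindex@localhost:5432/cocoindex") as conn:
    register_vector(conn)  # lets us pass numpy vectors as query parameters
    rows = conn.execute(
        # <=> is pgvector's cosine distance operator, matching COSINE_SIMILARITY above.
        "SELECT filename, text, embedding <=> %s AS distance "
        "FROM textembedding__doc_embeddings ORDER BY distance LIMIT 5",
        (query_vector,),
    ).fetchall()
    for filename, text, distance in rows:
        print(f"{distance:.3f} {filename}: {text[:60]}")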

Next Steps

Next, you may want to: