Sentence Transformers Embedding

The cocoindex.ops.sentence_transformers module provides integration with the sentence-transformers library for text embeddings.

Overview

The SentenceTransformerEmbedder class is a wrapper around SentenceTransformer models that:

  • Implements VectorSchemaProvider for seamless integration with CocoIndex connectors
  • Handles model caching and thread-safe GPU access automatically
  • Provides a simple embed() method
  • Returns properly typed numpy arrays

Installation

To use sentence transformers with CocoIndex, install with the sentence_transformers extra:

pip install 'cocoindex[sentence_transformers]'

Or with uv:

uv pip install 'cocoindex[sentence_transformers]'

Basic usage

Creating an embedder

from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder

# Initialize embedder with a pre-trained model
embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")

Embedding text

The embed() method converts text into a numpy.ndarray of float32. It supports both sync and async usage:

# In a CocoIndex function
embedding = await embedder.embed("Hello, world!")

# Use the embedding in a dataclass row, store in a vector database, etc.
table.declare_row(row=CodeEmbedding(code="Hello, world!", embedding=embedding))
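Because embed() is awaitable, many texts can be embedded concurrently with asyncio.gather (the full pipeline example further down uses the same pattern). The sketch below substitutes a hypothetical stub for SentenceTransformerEmbedder so it runs without downloading a model; in real code you would use the embedder itself:

```python
import asyncio

import numpy as np


class StubEmbedder:
    """Hypothetical stand-in for SentenceTransformerEmbedder (384-dim float32)."""

    async def embed(self, text: str) -> np.ndarray:
        # A real embedder would run the model here; we return deterministic noise.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.random(384, dtype=np.float32)


async def main() -> list[np.ndarray]:
    embedder = StubEmbedder()
    texts = ["first chunk", "second chunk", "third chunk"]
    # Embed all texts concurrently rather than one at a time.
    return await asyncio.gather(*(embedder.embed(t) for t in texts))


embeddings = asyncio.run(main())
print(len(embeddings), embeddings[0].dtype, embeddings[0].shape)
```

The same gather pattern applies unchanged to the real embedder, since its embed() is awaitable too.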

Using as a type annotation

The SentenceTransformerEmbedder implements VectorSchemaProvider, which means it can be used directly as metadata in Annotated type annotations. This is the recommended way to declare vector columns — CocoIndex connectors automatically extract the vector dimension and dtype from the annotation when creating tables.

from dataclasses import dataclass
from typing import Annotated
from numpy.typing import NDArray

embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")

@dataclass
class CodeEmbedding:
    id: int
    filename: str
    code: str
    embedding: Annotated[NDArray, embedder]  # vector(384) with float32
    start_line: int
    end_line: int

When you pass this dataclass to a connector's TableSchema.from_class(), the connector automatically reads the embedder annotation to determine the vector column's dimension and dtype. For example, with Postgres:

from cocoindex.connectors import postgres

table_schema = await postgres.TableSchema.from_class(
    CodeEmbedding,
    primary_key=["id"],
)
target_table = await postgres.mount_table_target(
    PG_DB,
    "code_embeddings",
    table_schema,
    pg_schema_name="my_schema",
)

The connector automatically creates the appropriate vector(384) column. See the Connectors docs for other supported backends (LanceDB, Qdrant, SQLite).

Example: text embedding pipeline

Here's a complete example of a text embedding pipeline (based on the text_embedding example):

import asyncio
import pathlib
from dataclasses import dataclass
from typing import Annotated, AsyncIterator

import asyncpg
from numpy.typing import NDArray

import cocoindex as coco
from cocoindex.connectors import localfs, postgres
from cocoindex.ops.text import RecursiveSplitter
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
from cocoindex.resources.chunk import Chunk
from cocoindex.resources.file import FileLike, PatternFilePathMatcher
from cocoindex.resources.id import IdGenerator

PG_DB = coco.ContextKey[asyncpg.Pool]("pg_db", tracked=False)

_embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")
_splitter = RecursiveSplitter()

@dataclass
class DocEmbedding:
    id: int
    filename: str
    chunk_start: int
    chunk_end: int
    text: str
    embedding: Annotated[NDArray, _embedder]

@coco.fn
async def process_chunk(
    filename: pathlib.PurePath,
    chunk: Chunk,
    id_gen: IdGenerator,
    table: postgres.TableTarget[DocEmbedding],
) -> None:
    table.declare_row(
        row=DocEmbedding(
            id=await id_gen.next_id(chunk.text),
            filename=str(filename),
            chunk_start=chunk.start.char_offset,
            chunk_end=chunk.end.char_offset,
            text=chunk.text,
            embedding=await _embedder.embed(chunk.text),
        ),
    )

@coco.fn(memo=True)
async def process_file(
    file: FileLike,
    table: postgres.TableTarget[DocEmbedding],
) -> None:
    text = await file.read_text()
    chunks = _splitter.split(
        text, chunk_size=2000, chunk_overlap=500, language="markdown"
    )
    id_gen = IdGenerator()
    await asyncio.gather(
        *(process_chunk(file.file_path.path, chunk, id_gen, table) for chunk in chunks)
    )

@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    target_table = await postgres.mount_table_target(
        PG_DB,
        "doc_embeddings",
        await postgres.TableSchema.from_class(
            DocEmbedding,
            primary_key=["id"],
        ),
        pg_schema_name="public",
    )

    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]),
    )
    await coco.mount_each(process_file, files.items(), target_table)
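On the query side, the search text should be embedded with the same model that produced the stored vectors. The sketch below shows one way to format a numpy vector as a pgvector input literal; the commented-out SELECT using pgvector's <=> cosine-distance operator is an assumption about the Postgres backend (table and pool names are hypothetical), so only the formatting helper runs here:

```python
import numpy as np


def to_pgvector_literal(vec: np.ndarray) -> str:
    """Format a numpy vector as a pgvector input literal, e.g. '[0.25,-0.5,1.0]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


# In practice this would come from: await _embedder.embed("search text")
query_vec = np.array([0.25, -0.5, 1.0], dtype=np.float32)
literal = to_pgvector_literal(query_vec)
print(literal)  # [0.25,-0.5,1.0]

# Hypothetical query against the mounted table (assumes pgvector is installed):
#
# rows = await pool.fetch(
#     """
#     SELECT filename, text, embedding <=> $1::vector AS distance
#     FROM public.doc_embeddings
#     ORDER BY distance
#     LIMIT 5
#     """,
#     literal,
# )
```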

Configuration options

Model selection

You can use any model from the sentence-transformers library:

# Small, fast model (384 dimensions)
embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")

# Larger, more accurate model (768 dimensions)
embedder = SentenceTransformerEmbedder("sentence-transformers/all-mpnet-base-v2")

# Multilingual model
embedder = SentenceTransformerEmbedder("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Local model
embedder = SentenceTransformerEmbedder("/path/to/local/model")

Normalization

By default, embeddings are normalized to unit length (suitable for cosine similarity):

# Default: normalized embeddings
embedder = SentenceTransformerEmbedder(
    "sentence-transformers/all-MiniLM-L6-v2",
    normalize_embeddings=True,  # default
)

# Disable normalization if needed
embedder = SentenceTransformerEmbedder(
    "sentence-transformers/all-MiniLM-L6-v2",
    normalize_embeddings=False,
)