Sentence Transformers embeddings

Run sentence-transformers models locally for text embeddings with automatic model caching, thread-safe GPU access, and optional normalization — plus VectorSchemaProvider integration for connectors.

Version: v1.0.0-alpha48
Last reviewed: Apr 19, 2026

The cocoindex.ops.sentence_transformers module provides integration with the sentence-transformers library for text embeddings.

```python
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
```

Dependencies

This module requires additional dependencies. Install with:

```bash
pip install cocoindex[sentence_transformers]
```

Overview

The SentenceTransformerEmbedder class is a wrapper around SentenceTransformer models that:

  • Implements VectorSchemaProvider for seamless integration with CocoIndex connectors
  • Handles model caching and thread-safe GPU access automatically
  • Provides a simple embed() method
  • Returns properly typed numpy arrays

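Conceptually, the embed() contract can be pictured with a minimal stand-in. The `DummyEmbedder` below is hypothetical (the real class wraps an actual SentenceTransformer model); it only mimics the observable behavior: a fixed output dimension, a float32 numpy array, and optional unit-length normalization.

```python
import zlib

import numpy as np
from numpy.typing import NDArray


class DummyEmbedder:
    """Hypothetical stand-in mimicking the embed() contract:
    fixed dimension, float32 dtype, optional normalization."""

    def __init__(self, dim: int = 384, normalize: bool = True):
        self.dim = dim
        self.normalize = normalize

    def embed(self, text: str) -> NDArray[np.float32]:
        # Deterministic fake embedding seeded from a stable text hash.
        rng = np.random.default_rng(zlib.crc32(text.encode()))
        vec = rng.standard_normal(self.dim).astype(np.float32)
        if self.normalize:
            vec /= np.linalg.norm(vec)
        return vec


vec = DummyEmbedder().embed("Hello, world!")
assert vec.dtype == np.float32 and vec.shape == (384,)
```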
Basic usage

Creating an embedder

```python
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder

# Initialize embedder with a pre-trained model
embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")
```

Embedding text

The embed() method converts text into a numpy.ndarray of float32. It supports both sync and async usage:

```python
# In a CocoIndex function
embedding = await embedder.embed("Hello, world!")

# Use the embedding in a dataclass row, store in a vector database, etc.
table.declare_row(row=CodeEmbedding(code="Hello, world!", embedding=embedding))
```

Using as a type annotation

The SentenceTransformerEmbedder implements VectorSchemaProvider, which means it can be used directly as metadata in Annotated type annotations. This is the recommended way to declare vector columns — CocoIndex connectors automatically extract the vector dimension and dtype from the annotation when creating tables.

```python
from dataclasses import dataclass
from typing import Annotated

from numpy.typing import NDArray

from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder

embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")

@dataclass
class CodeEmbedding:
    id: int
    filename: str
    code: str
    embedding: Annotated[NDArray, embedder]  # vector(384) with float32
    start_line: int
    end_line: int
```

When you pass this dataclass to a connector’s TableSchema.from_class(), the connector automatically reads the embedder annotation to determine the vector column’s dimension and dtype. For example, with Postgres:

```python
from cocoindex.connectors import postgres

table_schema = await postgres.TableSchema.from_class(
    CodeEmbedding,
    primary_key=["id"],
)
target_table = await postgres.mount_table_target(
    PG_DB,
    "code_embeddings",
    table_schema,
    pg_schema_name="my_schema",
)
```

The connector automatically creates the appropriate vector(384) column. See the Connectors docs for other supported backends (LanceDB, Qdrant, SQLite).
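For intuition, the metadata lookup a connector performs can be sketched with the standard typing APIs. The `FakeProvider` class below is a hypothetical stand-in for a schema provider (not the real cocoindex interface); the point is that an object attached as Annotated metadata is recoverable from the class at schema-creation time.

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_origin, get_type_hints

import numpy as np
from numpy.typing import NDArray


class FakeProvider:
    """Hypothetical stand-in for a vector schema provider:
    exposes the dimension and dtype a connector would need."""

    def __init__(self, dim: int = 384, dtype=np.float32):
        self.dim = dim
        self.dtype = dtype


provider = FakeProvider()

@dataclass
class Row:
    id: int
    embedding: Annotated[NDArray, provider]

# What a connector can do: read the Annotated metadata off the class.
hints = get_type_hints(Row, include_extras=True)
ann = hints["embedding"]
assert get_origin(ann) is Annotated
base_type, meta = get_args(ann)
assert meta is provider  # the embedder object itself is the metadata
```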

Example: text embedding pipeline

Here’s a complete example of a text embedding pipeline (based on the text_embedding example):

```python
import pathlib
from dataclasses import dataclass
from typing import Annotated, AsyncIterator

import asyncpg
from numpy.typing import NDArray

import cocoindex as coco
from cocoindex.connectors import localfs, postgres
from cocoindex.ops.text import RecursiveSplitter
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
from cocoindex.resources.chunk import Chunk
from cocoindex.resources.file import FileLike, PatternFilePathMatcher
from cocoindex.resources.id import IdGenerator

PG_DB = coco.ContextKey[asyncpg.Pool]("pg_db")

_embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")
_splitter = RecursiveSplitter()

@dataclass
class DocEmbedding:
    id: int
    filename: str
    chunk_start: int
    chunk_end: int
    text: str
    embedding: Annotated[NDArray, _embedder]

@coco.fn
async def process_chunk(
    chunk: Chunk,
    filename: pathlib.PurePath,
    id_gen: IdGenerator,
    table: postgres.TableTarget[DocEmbedding],
) -> None:
    table.declare_row(
        row=DocEmbedding(
            id=await id_gen.next_id(chunk.text),
            filename=str(filename),
            chunk_start=chunk.start.char_offset,
            chunk_end=chunk.end.char_offset,
            text=chunk.text,
            embedding=await _embedder.embed(chunk.text),
        ),
    )

@coco.fn(memo=True)
async def process_file(
    file: FileLike,
    table: postgres.TableTarget[DocEmbedding],
) -> None:
    text = await file.read_text()
    chunks = _splitter.split(
        text, chunk_size=2000, chunk_overlap=500, language="markdown"
    )
    id_gen = IdGenerator()
    await coco.map(process_chunk, chunks, file.file_path.path, id_gen, table)

@coco.fn
async def app_main(sourcedir: pathlib.Path) -> None:
    target_table = await postgres.mount_table_target(
        PG_DB,
        "doc_embeddings",
        await postgres.TableSchema.from_class(
            DocEmbedding,
            primary_key=["id"],
        ),
        pg_schema_name="public",
    )

    files = localfs.walk_dir(
        sourcedir,
        recursive=True,
        path_matcher=PatternFilePathMatcher(included_patterns=["**/*.md"]),
    )
    await coco.mount_each(process_file, files.items(), target_table)
```

Configuration options

Model selection

You can use any model from the sentence-transformers library:

```python
# Small, fast model (384 dimensions)
embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")

# Larger, more accurate model (768 dimensions)
embedder = SentenceTransformerEmbedder("sentence-transformers/all-mpnet-base-v2")

# Multilingual model
embedder = SentenceTransformerEmbedder("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Local model
embedder = SentenceTransformerEmbedder("/path/to/local/model")
```

Normalization

By default, embeddings are normalized to unit length (suitable for cosine similarity):

```python
# Default: normalized embeddings
embedder = SentenceTransformerEmbedder(
    "sentence-transformers/all-MiniLM-L6-v2",
    normalize_embeddings=True  # Default
)

# Disable normalization if needed
embedder = SentenceTransformerEmbedder(
    "sentence-transformers/all-MiniLM-L6-v2",
    normalize_embeddings=False
)
```