Built-in operations

Overview of CocoIndex's built-in operations: text chunking and splitting, local and hosted embeddings, audio transcription, and entity resolution, with links to each operation module's reference.

Version: v1.0.14
Last reviewed: Jun 29, 2026

CocoIndex ships a set of built-in operations under the `cocoindex.ops` package. Each module is independently importable and composes with the rest of a pipeline.

python

from cocoindex.ops import text, litellm, sentence_transformers, entity_resolution

Operation modules

Module	What it provides
Text operations	Code language detection, regex-based `SeparatorSplitter`, and syntax-aware `RecursiveSplitter` (tree-sitter) returning position-tracked `Chunk`s.
Sentence Transformers	Local text embeddings via `sentence-transformers`, with model caching, thread-safe GPU access, and optional normalization.
LiteLLM	Embeddings and audio transcription through LiteLLM’s unified API across 100+ providers (OpenAI, Azure, Vertex AI, Bedrock, Cohere, and more).
Entity resolution	Deduplicate entity names via FAISS embedding similarity plus a pluggable LLM pair-resolver, with PINNED / PREFERRED canonical policies.

Most common: chunking code and text

The most-used built-in is `RecursiveSplitter`. It splits text along language structure (functions, classes, blocks) using tree-sitter, and falls back to separator-based splitting for unsupported languages.

python

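from cocoindex.ops.text import RecursiveSplitter

# Sample input; in a real pipeline this would be a source file's contents.
python_code = '''
def greet(name):
    return f"Hello, {name}!"
'''

splitter = RecursiveSplitter()

# `language` selects the syntax-aware splitting rules for the input.
chunks = splitter.split(
    python_code,
    chunk_size=1000,      # upper bound for each chunk
    min_chunk_size=300,   # lower bound the splitter aims for
    chunk_overlap=300,    # overlap between neighbouring chunks
    language="python",
)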

It tries to keep each output chunk between `min_chunk_size` and `chunk_size`, splitting at the highest-level boundary that fits and descending to finer boundaries when a piece is still too large. See Text operations for the full parameter reference, supported languages, and custom-language configuration.
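The boundary-descent idea can be illustrated with a small standalone sketch. This is not CocoIndex's implementation: plain string separators (blank lines, newlines, spaces) stand in for tree-sitter syntax boundaries, overlap and the minimum chunk size are not modeled, and the names `split_recursive` and `SEPARATORS` are hypothetical.

```python
# Illustrative only: plain separators stand in for syntax boundaries.
SEPARATORS = ["\n\n", "\n", " "]  # coarsest boundary first


def split_recursive(text, chunk_size, level=0):
    """Split text into pieces of at most chunk_size characters,
    preferring the coarsest separator whose pieces fit."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if level >= len(SEPARATORS):
        # No boundaries left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep = SEPARATORS[level]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)  # separators at chunk edges are dropped
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: descend to a finer boundary.
            chunks.extend(split_recursive(piece, chunk_size, level + 1))
    if current:
        chunks.append(current)
    return chunks


sample = "def a():\n    return 1\n\ndef b():\n    return 2\n\ndef c():\n    return 3"
for chunk in split_recursive(sample, chunk_size=30):
    print(repr(chunk))
```

With `chunk_size=30` each function in `sample` stays in its own chunk. Raising the limit to 50 merges the first two functions, and lowering it to 10 forces the sketch down to line and word boundaries.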

Embeddings

Two interchangeable embedding backends are available. Both provide a `VectorSchemaProvider`, so connectors can configure vector columns automatically:

Sentence Transformers runs locally and needs no API key.
LiteLLM calls hosted providers through a single API.
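As a rough illustration of why a backend that reports its own vector shape makes backends swappable, here is a minimal sketch. Every name in it (`EmbeddingBackend`, `ToyHashEmbedder`, `vector_column_spec`) is hypothetical and is not part of CocoIndex. The toy embedder hashes text instead of running a model, so the example runs without any model download or API key.

```python
import hashlib
from typing import Protocol


class EmbeddingBackend(Protocol):
    """Hypothetical interface; not a CocoIndex class."""

    dimension: int

    def embed(self, text: str) -> list[float]: ...


class ToyHashEmbedder:
    """Deterministic stand-in for a real model (local or hosted)."""

    def __init__(self, dimension: int) -> None:
        self.dimension = dimension

    def embed(self, text: str) -> list[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Cycle through the digest bytes and scale each to [0, 1].
        return [digest[i % len(digest)] / 255 for i in range(self.dimension)]


def vector_column_spec(backend: EmbeddingBackend) -> dict:
    """What a connector might derive from the backend instead of user config."""
    return {"type": "vector", "dimension": backend.dimension}


# Two backends with different output sizes; the column spec follows each one.
local_like = ToyHashEmbedder(dimension=8)
hosted_like = ToyHashEmbedder(dimension=16)
specs = [vector_column_spec(b) for b in (local_like, hosted_like)]
print(specs)
```

Because the column definition is read from the backend, switching from one backend to the other does not require editing the target schema by hand. The real `VectorSchemaProvider` contract is described in each module's reference.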