Built-in operations

Overview of CocoIndex's built-in operations: text chunking and splitting, local and hosted embeddings, audio transcription, and entity resolution, with links to each operation module's reference.

Version: v1.0.14
Last reviewed: Jun 29, 2026

CocoIndex ships a set of built-in operations under the cocoindex.ops package. Each module is independently importable and composes with the rest of a pipeline.

python
from cocoindex.ops import text, litellm, sentence_transformers, entity_resolution

Operation modules

| Module | What it provides |
| --- | --- |
| Text operations | Code language detection, regex-based SeparatorSplitter, and syntax-aware RecursiveSplitter (tree-sitter) returning position-tracked Chunks. |
| Sentence Transformers | Local text embeddings via sentence-transformers, with model caching, thread-safe GPU access, and optional normalization. |
| LiteLLM | Embeddings and audio transcription through LiteLLM’s unified API across 100+ providers (OpenAI, Azure, Vertex AI, Bedrock, Cohere, and more). |
| Entity resolution | Deduplicate entity names via FAISS embedding similarity plus a pluggable LLM pair-resolver, with PINNED / PREFERRED canonical policies. |
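To make the entity-resolution row concrete, here is a minimal, library-independent sketch of the general idea: embedding similarity proposes candidate pairs, and a pluggable pair-resolver confirms them. It is not the cocoindex.ops.entity_resolution API. The names `resolve_entities`, `same_entity`, and `threshold` are hypothetical, a brute-force cosine comparison stands in for FAISS, a plain callable stands in for the LLM, and the PINNED / PREFERRED policies are not modelled.

```python
import math
from itertools import combinations
from typing import Callable, Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def resolve_entities(
    embeddings: Dict[str, Sequence[float]],
    same_entity: Callable[[str, str], bool],  # hypothetical pair-resolver
    threshold: float = 0.8,                   # hypothetical similarity cutoff
) -> Dict[str, str]:
    """Map every name to the earliest-listed name in its cluster."""
    order = {name: i for i, name in enumerate(embeddings)}
    parent = {name: name for name in embeddings}

    def find(name: str) -> str:
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(embeddings, 2):
        if cosine(embeddings[a], embeddings[b]) < threshold:
            continue  # not similar enough to be a candidate pair
        if same_entity(a, b):
            # Merge clusters; the earlier-listed root stays canonical.
            keep, drop = sorted((find(a), find(b)), key=order.__getitem__)
            parent[drop] = keep
    return {name: find(name) for name in embeddings}
```

The similarity filter keeps the number of calls to the (potentially expensive) pair-resolver small, which is the reason for splitting the work into two stages.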

Most common: chunking code and text

The most-used built-in is RecursiveSplitter. It performs syntax-aware splitting that respects language structure (functions, classes, blocks) via tree-sitter, and falls back to separator-based splitting for unsupported languages.

python
from cocoindex.ops.text import RecursiveSplitter

# python_code (used below) is the source text to split, a str defined elsewhere.
splitter = RecursiveSplitter()

chunks = splitter.split(
    python_code,
    chunk_size=1000,
    min_chunk_size=300,
    chunk_overlap=300,
    language="python",
)

The splitter tries to keep each output chunk between min_chunk_size and chunk_size: it splits at the highest-level boundary that fits and descends to finer boundaries when a piece is still too large. See Text operations for the full parameter reference, supported languages, and custom-language configuration.
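The "coarsest boundary first, finer boundaries only when needed" strategy can be illustrated with a small standalone sketch. This is not RecursiveSplitter's implementation: it uses plain string separators instead of tree-sitter syntax nodes, measures size in characters, and leaves out min_chunk_size and chunk_overlap. The function name `recursive_split` and its `separators` parameter are hypothetical.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most chunk_size characters.

    The coarsest separator is tried first; finer separators are used
    only for pieces that are still too large on their own.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No boundaries left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging while the chunk still fits
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Too large even alone: descend to the next finer boundary.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "def a():\n    return 1\n\ndef b():\n    return 2\n\ndef c():\n    return 3"
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

With a 30-character limit, each function in `sample` fits in one chunk, so the split happens only at the blank lines between functions and never inside a function body.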

Embeddings

Two interchangeable embedding backends are available, Sentence Transformers (local) and LiteLLM (hosted providers). Both provide a VectorSchemaProvider, so connectors configure vector columns automatically.
