Built-in operations
Overview of CocoIndex's built-in operations: text chunking and splitting, local and hosted embeddings, audio transcription, and entity resolution, with links to each operation module's reference.
CocoIndex ships a set of built-in operations under the cocoindex.ops package.
Each module is independently importable and composes with the rest of a pipeline.
from cocoindex.ops import text, litellm, sentence_transformers, entity_resolution
Operation modules
| Module | What it provides |
|---|---|
| Text operations | Code language detection, regex-based SeparatorSplitter, and syntax-aware RecursiveSplitter (tree-sitter) returning position-tracked Chunks. |
| Sentence Transformers | Local text embeddings via sentence-transformers, with model caching, thread-safe GPU access, and optional normalization. |
| LiteLLM | Embeddings and audio transcription through LiteLLM’s unified API across 100+ providers (OpenAI, Azure, Vertex AI, Bedrock, Cohere, and more). |
| Entity resolution | Deduplicate entity names via FAISS embedding similarity plus a pluggable LLM pair-resolver, with PINNED / PREFERRED canonical policies. |
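To make "position-tracked" concrete, here is a minimal sketch of regex-based separator splitting that records where each piece sits in the source. It uses only the standard library; `Span` and `split_on_separator` are hypothetical names for illustration and are not the CocoIndex `Chunk` type or `SeparatorSplitter` API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a position-tracked chunk."""
    text: str
    start: int  # character offset of the first character in the source
    end: int    # character offset one past the last character


def split_on_separator(source: str, separator_pattern: str) -> list[Span]:
    """Split `source` on a regex separator, keeping each piece's offsets."""
    spans: list[Span] = []
    cursor = 0
    for match in re.finditer(separator_pattern, source):
        if match.start() > cursor:  # skip empty pieces between adjacent separators
            spans.append(Span(source[cursor:match.start()], cursor, match.start()))
        cursor = match.end()
    if cursor < len(source):  # trailing piece after the last separator
        spans.append(Span(source[cursor:], cursor, len(source)))
    return spans


doc = "First paragraph.\n\nSecond paragraph.\n\n\nThird."
spans = split_on_separator(doc, r"\n{2,}")
for span in spans:
    # Offsets always index back into the original text.
    assert doc[span.start:span.end] == span.text
```

Keeping offsets alongside the text is what lets a downstream step map a chunk back to its location in the original file.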
Most common: chunking code and text
The most-used built-in is RecursiveSplitter. It splits along language structure
(functions, classes, blocks) using tree-sitter, and falls back to separator-based
splitting for unsupported languages.
from cocoindex.ops.text import RecursiveSplitter

# python_code is a str holding the source text to split.
splitter = RecursiveSplitter()
chunks = splitter.split(
    python_code,
    chunk_size=1000,
    min_chunk_size=300,
    chunk_overlap=300,
    language="python",
)
The splitter tries to keep each output chunk between min_chunk_size and chunk_size, splitting
at the highest-level boundary that fits and descending to finer boundaries when a piece
is still too large. See Text operations for the full parameter reference,
supported languages, and custom-language configuration.
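The "coarsest boundary that fits, then descend" behavior can be sketched in plain Python. This is an illustration under simplifying assumptions, not the RecursiveSplitter implementation: boundaries are plain separators rather than tree-sitter nodes, sizes are counted in characters, and `min_chunk_size` and `chunk_overlap` are left out. `LEVELS` and `recursive_split` are hypothetical names.

```python
# Boundary levels from coarsest to finest. A syntax-aware splitter would use
# parse-tree nodes here instead of plain separators (assumption for this sketch).
LEVELS = ["\n\n", "\n", " "]


def recursive_split(text: str, chunk_size: int, level: int = 0) -> list[str]:
    """Split `text` into pieces of at most `chunk_size` characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if level >= len(LEVELS):
        # No boundaries left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator = LEVELS[level]
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces at this level
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: descend to the next finer boundary.
            chunks.extend(recursive_split(piece, chunk_size, level + 1))
    if current:
        chunks.append(current)
    return chunks


source = "def a():\n    return 1\n\ndef b():\n    return 2\n\ndef c():\n    return 3"
coarse = recursive_split(source, chunk_size=50)  # packs the first two functions together
fine = recursive_split(source, chunk_size=15)    # descends to line boundaries
```

With the larger limit, whole functions stay intact and are packed together. With the smaller limit, each function is too big for one chunk, so the sketch drops down to line boundaries.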
Embeddings
Two interchangeable embedding backends are available, both providing a
VectorSchemaProvider so vector columns are configured automatically by connectors:
- Sentence Transformers runs locally and needs no API key.
- LiteLLM reaches hosted providers behind one API.
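
One way to picture why the backends are interchangeable: each one exposes the shape of its vectors, so a storage layer can size the column without knowing which backend produced them. The sketch below is a standard-library illustration of that idea only. `Embedder`, `ToyHashEmbedder`, and `vector_column_spec` are hypothetical names, not the CocoIndex VectorSchemaProvider interface, and the hash-based vectors carry no semantic meaning.

```python
import hashlib
import math
from typing import Protocol


class Embedder(Protocol):
    """Hypothetical interface: any backend that reports its vector size."""
    dimension: int

    def embed(self, text: str) -> list[float]: ...


class ToyHashEmbedder:
    """Deterministic stand-in for a real backend, for illustration only."""

    def __init__(self, dimension: int, normalize: bool = True) -> None:
        self.dimension = dimension
        self.normalize = normalize

    def embed(self, text: str) -> list[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Map bytes to floats in [-1, 1); cycle the digest if dimension > 32.
        vector = [digest[i % len(digest)] / 128.0 - 1.0 for i in range(self.dimension)]
        if self.normalize:
            # Scale to unit length so dot products equal cosine similarity.
            norm = math.sqrt(sum(v * v for v in vector)) or 1.0
            vector = [v / norm for v in vector]
        return vector


def vector_column_spec(embedder: Embedder) -> dict[str, object]:
    """Derive a (hypothetical) column definition from whichever backend is in use."""
    return {"type": "vector", "dimension": embedder.dimension}


local = ToyHashEmbedder(dimension=8)
hosted = ToyHashEmbedder(dimension=16, normalize=False)
specs = [vector_column_spec(e) for e in (local, hosted)]
```

Swapping one backend for the other changes only the object passed in; the code that derives the column definition stays the same.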