Text operations

Built-in text utilities: code language detection, the regex-based SeparatorSplitter, and the syntax-aware RecursiveSplitter, which returns position-tracked Chunks and accepts optional custom, domain-specific splitting rules.

Version: v1.0.2
Last reviewed: May 2, 2026

The cocoindex.ops.text module provides operations for text processing.

python
from cocoindex.ops.text import RecursiveSplitter, SeparatorSplitter

Features include:

  • Code language detection
  • Text chunking and splitting
  • Syntax-aware code splitting

Available functions and classes

detect_code_language()

Detect the programming language from a filename.

Usage:

python
from cocoindex.ops.text import detect_code_language

language = detect_code_language(filename="main.py")
print(language)  # "python"

language = detect_code_language(filename="app.rs")
print(language)  # "rust"

language = detect_code_language(filename="unknown.xyz")
print(language)  # None
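
Because detection works from the filename alone, it can be pictured as an extension lookup. The sketch below is a hypothetical stand-in written with the standard library, not the library's implementation: `EXTENSION_TO_LANGUAGE` and `guess_language` are illustrative names, and the mapping covers only a handful of the extensions listed under "Supported languages".

```python
from pathlib import PurePath
from typing import Optional

# Hypothetical subset of an extension table; the real library covers many more.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".rs": "rust",
    ".md": "markdown",
    ".yaml": "yaml",
    ".yml": "yaml",
}


def guess_language(filename: str) -> Optional[str]:
    """Return a language name for a known extension, or None otherwise."""
    suffix = PurePath(filename).suffix.lower()  # normalize ".YML" to ".yml"
    return EXTENSION_TO_LANGUAGE.get(suffix)


print(guess_language("main.py"))      # python
print(guess_language("unknown.xyz"))  # None
```

Returning None for unknown extensions mirrors the behavior shown above, so callers can decide on a fallback themselves.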

SeparatorSplitter

Split text by regex separators.

Usage:

python
from cocoindex.ops.text import SeparatorSplitter

splitter = SeparatorSplitter()

text = "First sentence. Second sentence. Third sentence."
chunks = splitter.split(
    text,
    chunk_size=100,
    chunk_overlap=20,
    separators=[r"\.\s+"]  # Split on periods followed by whitespace
)

for chunk in chunks:
    print(chunk.text)
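
To see what the separator pattern in the example actually matches, the following standard-library sketch splits the same text with `re.finditer` and records character offsets. It is an illustration only: `split_on_separator` is a hypothetical helper, it ignores `chunk_size` and `chunk_overlap`, and it drops the separator text, which may or may not match what SeparatorSplitter does.

```python
import re
from typing import List, Tuple


def split_on_separator(text: str, separator: str) -> List[Tuple[int, int, str]]:
    """Split text at each regex match, returning (start, end, piece) triples."""
    pieces = []
    cursor = 0
    for match in re.finditer(separator, text):
        if match.start() > cursor:
            pieces.append((cursor, match.start(), text[cursor:match.start()]))
        cursor = match.end()  # skip over the separator itself
    if cursor < len(text):
        pieces.append((cursor, len(text), text[cursor:]))
    return pieces


text = "First sentence. Second sentence. Third sentence."
for start, end, piece in split_on_separator(text, r"\.\s+"):
    print(start, end, repr(piece))
```

The final period stays attached to the last piece because it is not followed by whitespace, so the pattern does not match there.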

RecursiveSplitter

Advanced text chunking with language awareness and syntax-aware splitting for code. Returns Chunk objects with position information.

Features:

  • Supports many programming languages
  • Preserves code structure
  • Customizable chunk sizes and overlap
  • Returns Chunk objects with start/end positions (line, column, byte/char offsets)

Usage:

python
from cocoindex.ops.text import RecursiveSplitter

splitter = RecursiveSplitter()

# Split markdown text
text = "# Title\n\nParagraph 1.\n\nParagraph 2."
chunks = splitter.split(
    text,
    chunk_size=2000,
    chunk_overlap=500,
    language="markdown"
)

for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Start: line {chunk.start.line}, char {chunk.start.char_offset}")
    print(f"End: line {chunk.end.line}, char {chunk.end.char_offset}")

Language-aware code splitting:

python
from cocoindex.ops.text import RecursiveSplitter

splitter = RecursiveSplitter()

# Split Python code
python_code = '''
def hello():
    print("Hello, world!")

def goodbye():
    print("Goodbye!")
'''

chunks = splitter.split(
    python_code,
    chunk_size=1000,
    min_chunk_size=300,
    chunk_overlap=300,
    language="python"
)

Supported languages:

Languages with syntax-aware (tree-sitter) splitting — splits at logical boundaries like functions, classes, and blocks:

Language      language= value   Extensions
C             "c"               .c, .h
C++           "cpp"             .cpp, .cc, .cxx, .c++
C#            "c_sharp"         .cs
CSS           "css"             .css
Fortran       "fortran"         .f, .f90, .f95
Go            "go"              .go
HTML          "html"            .html, .htm
Java          "java"            .java
JavaScript    "javascript"      .js, .mjs, .cjs, .jsx
JSON          "json"            .json, .jsonc
Julia         "julia"           .jl
Kotlin        "kotlin"          .kt, .kts
Markdown      "markdown"        .md
Pascal        "pascal"          .pas
PHP           "php"             .php
Python        "python"          .py
R             "r"               .r, .R
Ruby          "ruby"            .rb
Rust          "rust"            .rs
Scala         "scala"           .scala
Solidity      "solidity"        .sol
SQL           "sql"             .sql
Svelte        "svelte"          .svelte
Swift         "swift"           .swift
TOML          "toml"            .toml
TSX           "tsx"             .tsx
TypeScript    "typescript"      .ts
Vue           "vue"             .vue
XML           "xml"             .xml
YAML          "yaml"            .yaml, .yml

Many additional languages use separator-based splitting (e.g. "bash", "dart", "elixir", "elm", "haskell", "lua", "perl", and more). Pass the language name string to language=, or use detect_code_language() to infer it from a filename.
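
One way to wire detection and splitting together is a small helper that takes both as arguments. This is a sketch under assumptions: `split_file` is a hypothetical name, the default sizes are arbitrary, and it assumes `split()` can be called without `language=` when detection returns None, which the examples above do not confirm for RecursiveSplitter.

```python
from typing import Any, Callable, List, Optional


def split_file(
    filename: str,
    text: str,
    detect: Callable[..., Optional[str]],
    splitter: Any,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
) -> List[Any]:
    """Detect the language from the filename, then split with it if known."""
    language = detect(filename=filename)
    kwargs = {"chunk_size": chunk_size, "chunk_overlap": chunk_overlap}
    if language is not None:
        # Only pass language= when detection succeeded (assumed to be optional).
        kwargs["language"] = language
    return splitter.split(text, **kwargs)
```

With the library installed, the call would look like `split_file("main.py", source, detect_code_language, RecursiveSplitter())`. Passing the detector and splitter in as arguments also makes the helper easy to exercise with stand-ins.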

CustomLanguageConfig

Define custom language splitting rules.

Usage:

python
from cocoindex.ops.text import CustomLanguageConfig, RecursiveSplitter

# Create custom language config for abstracts
abstract_config = CustomLanguageConfig(
    language_name="abstract",
    separators_regex=[
        r"[.?!]+\s+",  # Sentence boundaries
        r"[:;]\s+",     # Clause boundaries
        r",\s+",        # Comma boundaries
        r"\s+",         # Whitespace
    ]
)

splitter = RecursiveSplitter(custom_languages=[abstract_config])

chunks = splitter.split(
    "This is a sample abstract. It has multiple sentences...",
    chunk_size=500,
    chunk_overlap=150,
    language="abstract"
)
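
The comments in the example suggest the separators are ordered from coarsest to finest. A minimal sketch of that idea, assuming finer separators are only tried on pieces that are still too large, is shown below using the standard library. It is not the library's algorithm: `recursive_split` is a hypothetical function, it does not merge small pieces or apply overlap, and a piece that no separator can shrink is returned oversized.

```python
import re
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split with the first separator; recurse with the rest on oversized pieces."""
    if len(text) <= chunk_size or not separators:
        return [text]
    first, rest = separators[0], separators[1:]
    pieces = [p for p in re.split(first, text) if p]  # drop empty fragments
    chunks: List[str] = []
    for piece in pieces:
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


separators = [r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"]
text = "Short one. A longer sentence: with a clause, and more words."
for chunk in recursive_split(text, separators, chunk_size=20):
    print(repr(chunk))
```

With a 20-character limit, the first sentence survives the sentence-level split intact, while the second is broken further at the colon and then at the comma.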

API reference

For detailed API documentation, refer to the module docstrings:

python
from cocoindex.ops import text

help(text.RecursiveSplitter)
help(text.SeparatorSplitter)
help(text.detect_code_language)
help(text.CustomLanguageConfig)