Text operations

Built-in text utilities: code language detection, a regex-based SeparatorSplitter, and a syntax-aware RecursiveSplitter that returns position-tracked Chunks and accepts optional custom, domain-specific splitting rules.

Version: v 1.0.14
Last reviewed: Jul 4, 2026

The cocoindex.ops.text module provides operations for detecting code languages and splitting text into chunks.

python

from cocoindex.ops.text import RecursiveSplitter, SeparatorSplitter

Features include:

Code language detection
Text chunking and splitting
Syntax-aware code splitting

Available functions and classes

`detect_code_language()`

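Detect the programming language from a filename. The result is a language name string such as "python", or None when the filename is not recognized.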

Usage:

python

from cocoindex.ops.text import detect_code_language

language = detect_code_language(filename="main.py")
print(language)  # "python"

language = detect_code_language(filename="app.rs")
print(language)  # "rust"

language = detect_code_language(filename="unknown.xyz")
print(language)  # None
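
To make the filename-based lookup concrete, here is a minimal stand-alone sketch of mapping an extension to a language name. It is not the library's implementation: `guess_language` and `EXTENSION_TO_LANGUAGE` are hypothetical names, the table holds only a handful of the extensions listed under Supported languages, and lowercasing the extension before the lookup is an assumption.

```python
from pathlib import PurePath
from typing import Optional

# Hypothetical subset of an extension table; the entries mirror a few rows of
# the supported-languages table in this document.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".rs": "rust",
    ".md": "markdown",
    ".yaml": "yaml",
    ".yml": "yaml",
}


def guess_language(filename: str) -> Optional[str]:
    """Return a language name for the file's extension, or None if unknown."""
    suffix = PurePath(filename).suffix.lower()  # assumption: case-insensitive match
    return EXTENSION_TO_LANGUAGE.get(suffix)


print(guess_language("main.py"))      # python
print(guess_language("app.rs"))       # rust
print(guess_language("unknown.xyz"))  # None
```

In real code, prefer detect_code_language(), which covers the full set of supported languages.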

`SeparatorSplitter`

Split text at matches of regular-expression separators. Each returned chunk exposes its content as chunk.text.

Usage:

python

from cocoindex.ops.text import SeparatorSplitter

splitter = SeparatorSplitter()

text = "First sentence. Second sentence. Third sentence."
chunks = splitter.split(
    text,
    chunk_size=100,
    chunk_overlap=20,
    separators=[r"\.\s+"]  # Split on periods followed by whitespace
)

for chunk in chunks:
    print(chunk.text)
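
To see what the separator pattern itself matches, independent of the library, the sketch below applies the same regex with the standard-library `re.split`. It only illustrates the pattern: whether SeparatorSplitter keeps or drops the matched separator, and how it applies chunk_size and chunk_overlap, is not stated here, so treat the output as a property of `re.split` rather than of the splitter.

```python
import re

text = "First sentence. Second sentence. Third sentence."

# re.split removes the matched separator, so the period and whitespace that
# matched r"\.\s+" are not part of the returned pieces. The final period is
# not followed by whitespace, so it does not match and stays in place.
pieces = [p for p in re.split(r"\.\s+", text) if p]

print(pieces)  # ['First sentence', 'Second sentence', 'Third sentence.']
```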

`RecursiveSplitter`

Advanced text chunking with language awareness and syntax-aware splitting for code. Returns Chunk objects with position information.

Features:

Supports many programming languages
Preserves code structure
Customizable chunk sizes and overlap
Returns Chunk objects with start/end positions (line, column, byte/char offsets)

Usage:

python

from cocoindex.ops.text import RecursiveSplitter

splitter = RecursiveSplitter()

# Split markdown text
text = "# Title\n\nParagraph 1.\n\nParagraph 2."
chunks = splitter.split(
    text,
    chunk_size=2000,
    chunk_overlap=500,
    language="markdown"
)

for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Start: line {chunk.start.line}, char {chunk.start.char_offset}")
    print(f"End: line {chunk.end.line}, char {chunk.end.char_offset}")
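
The difference between character and byte offsets matters as soon as the text contains non-ASCII characters. The stand-alone sketch below shows one way such positions can be derived from a character offset. `Position` and `position_at` are hypothetical names for illustration, not the library's types, and the 1-based line and column numbering is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """Hypothetical stand-in for a chunk boundary position."""

    line: int         # 1-based line number (assumption)
    column: int       # 1-based column, counted in characters (assumption)
    char_offset: int  # 0-based index into the str
    byte_offset: int  # 0-based index into the UTF-8 encoding


def position_at(text: str, char_offset: int) -> Position:
    """Compute line, column, and UTF-8 byte offset for a character offset."""
    before = text[:char_offset]
    line = before.count("\n") + 1
    # rfind returns -1 when there is no newline, which makes the column
    # equal to char_offset + 1 on the first line.
    column = char_offset - (before.rfind("\n") + 1) + 1
    return Position(line, column, char_offset, len(before.encode("utf-8")))


text = "# Café\n\nParagraph 1."
pos = position_at(text, text.index("Paragraph"))
print(pos)  # Position(line=3, column=1, char_offset=8, byte_offset=9)
```

The "é" occupies one character but two UTF-8 bytes, which is why the byte offset is one larger than the character offset here.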

Language-aware code splitting:

python

# Split Python code (reuses the `splitter` created in the previous example)
python_code = '''
def hello():
    print("Hello, world!")

def goodbye():
    print("Goodbye!")
'''

chunks = splitter.split(
    python_code,
    chunk_size=1000,
    min_chunk_size=300,
    chunk_overlap=300,
    language="python"
)

Reusing one parse across APIs:

When the same code is passed to several parse-consuming APIs (e.g. splitting and structural pattern matching), wrap it in a CodeSource so it is parsed at most once. The CodeSource carries its own language, so the language= argument must be omitted:

python

from cocoindex.ops.code import CodePattern, CodeSource

# `python_code` and `splitter` are defined in the examples above
src = CodeSource(python_code, language="python")
chunks = splitter.split(src, chunk_size=1000)          # parses once
defs = CodePattern(r"def \NAME(\(A*\)):", language="python").match_source(src)  # reuses the parse

Supported languages:

Languages with syntax-aware (tree-sitter) splitting — splits at logical boundaries like functions, classes, and blocks:

Language	`language=` value	Extensions
Astro	`"astro"`	`.astro`
C	`"c"`	`.c`, `.h`
C++	`"cpp"`	`.cpp`, `.cc`, `.cxx`, `.c++`
C#	`"c_sharp"`	`.cs`
CSS	`"css"`	`.css`
Fortran	`"fortran"`	`.f`, `.f90`, `.f95`
Go	`"go"`	`.go`
HTML	`"html"`	`.html`, `.htm`
Java	`"java"`	`.java`
JavaScript	`"javascript"`	`.js`, `.mjs`, `.cjs`, `.jsx`
JSON	`"json"`	`.json`, `.jsonc`
Julia	`"julia"`	`.jl`
Kotlin	`"kotlin"`	`.kt`, `.kts`
Markdown	`"markdown"`	`.md`
Pascal	`"pascal"`	`.pas`
PHP	`"php"`	`.php`
Python	`"python"`	`.py`
R	`"r"`	`.r`, `.R`
Ruby	`"ruby"`	`.rb`
Rust	`"rust"`	`.rs`
Scala	`"scala"`	`.scala`
Solidity	`"solidity"`	`.sol`
SQL	`"sql"`	`.sql`
Svelte	`"svelte"`	`.svelte`
Swift	`"swift"`	`.swift`
TOML	`"toml"`	`.toml`
TSX	`"tsx"`	`.tsx`
TypeScript	`"typescript"`	`.ts`
Vue	`"vue"`	`.vue`
XML	`"xml"`	`.xml`
YAML	`"yaml"`	`.yaml`, `.yml`

Many additional languages use separator-based splitting (e.g. "bash", "dart", "elixir", "elm", "haskell", "lua", "perl", and more). Pass the language name as a string to language=, or use detect_code_language() to infer it from a filename.

`CustomLanguageConfig`

Define custom language splitting rules.

Usage:

python

from cocoindex.ops.text import CustomLanguageConfig, RecursiveSplitter

# Create custom language config for abstracts
abstract_config = CustomLanguageConfig(
    language_name="abstract",
    separators_regex=[
        r"[.?!]+\s+",  # Sentence boundaries
        r"[:;]\s+",     # Clause boundaries
        r",\s+",        # Comma boundaries
        r"\s+",         # Whitespace
    ]
)

splitter = RecursiveSplitter(custom_languages=[abstract_config])

chunks = splitter.split(
    "This is a sample abstract. It has multiple sentences...",
    chunk_size=500,
    chunk_overlap=150,
    language="abstract"
)
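
The order of the separators matters: coarse boundaries come first and finer ones act as fallbacks. The sketch below illustrates that general idea with the standard library only. It is a simplified assumption about how a recursive separator-based splitter can work, not the library's algorithm: `recursive_split` is a hypothetical helper, it drops the matched separators, and it neither merges small pieces nor applies overlap.

```python
import re
from typing import List

# Same separator order as the config above: coarsest boundary first.
SEPARATORS = [
    r"[.?!]+\s+",  # Sentence boundaries
    r"[:;]\s+",    # Clause boundaries
    r",\s+",       # Comma boundaries
    r"\s+",        # Whitespace
]


def recursive_split(text: str, chunk_size: int, level: int = 0) -> List[str]:
    """Split with the coarsest separator first, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if level >= len(SEPARATORS):
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks: List[str] = []
    for piece in re.split(SEPARATORS[level], text):
        # Pieces that already fit are returned as-is; larger ones are split
        # again with the next, finer separator.
        chunks.extend(recursive_split(piece, chunk_size, level + 1))
    return chunks


sample = "This is a sample abstract. It has multiple sentences; some are long, others short."
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

With chunk_size=30, the first sentence fits after the sentence-level split, while the second is too long and is split again at the clause boundary.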

API reference

For detailed API documentation, refer to the module docstrings:

python

from cocoindex.ops import text

help(text.RecursiveSplitter)
help(text.SeparatorSplitter)
help(text.detect_code_language)
help(text.CustomLanguageConfig)