Text operations
Built-in text utilities: code language detection, the regex-based SeparatorSplitter, and the syntax-aware RecursiveSplitter, which returns position-tracked Chunk objects and supports optional custom, domain-specific splitting rules.
The cocoindex.ops.text module provides operations for text processing.
from cocoindex.ops.text import RecursiveSplitter, SeparatorSplitter
Features include:
- Code language detection
- Text chunking and splitting
- Syntax-aware code splitting
Available functions and classes
detect_code_language()
Detect the programming language from a filename.
Usage:
from cocoindex.ops.text import detect_code_language
language = detect_code_language(filename="main.py")
print(language) # "python"
language = detect_code_language(filename="app.rs")
print(language) # "rust"
language = detect_code_language(filename="unknown.xyz")
print(language) # None
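For intuition, filename-based detection can be pictured as an extension lookup. The sketch below is not the library's implementation: `guess_language` and `EXTENSION_TO_LANGUAGE` are hypothetical names, the mapping is a small subset copied from the supported-languages table on this page, and the case-insensitive match is an assumption of the sketch.

```python
from pathlib import PurePath
from typing import Optional

# Hypothetical subset of an extension-to-language map. The values come from
# the supported-languages table; the real library covers many more.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".rs": "rust",
    ".md": "markdown",
    ".yaml": "yaml",
    ".yml": "yaml",
}


def guess_language(filename: str) -> Optional[str]:
    """Return a language name for a filename, or None if the extension is unknown."""
    # Lowercasing the suffix is an assumption made for this sketch.
    suffix = PurePath(filename).suffix.lower()
    return EXTENSION_TO_LANGUAGE.get(suffix)


print(guess_language("main.py"))
print(guess_language("src/app.rs"))
print(guess_language("unknown.xyz"))
```

Returning None for unknown extensions, as detect_code_language() does in the usage above, lets the caller decide on a fallback instead of guessing.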
SeparatorSplitter
Split text by regex separators.
Usage:
from cocoindex.ops.text import SeparatorSplitter
splitter = SeparatorSplitter()
text = "First sentence. Second sentence. Third sentence."
chunks = splitter.split(
    text,
    chunk_size=100,
    chunk_overlap=20,
    separators=[r"\.\s+"],  # Split on periods followed by whitespace
)
for chunk in chunks:
    print(chunk.text)
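To see what the separator pattern in this example matches, the same regex can be applied on its own with the standard library's `re` module. This sketch only illustrates the pattern; it does not reproduce how SeparatorSplitter applies chunk_size or chunk_overlap.

```python
import re

text = "First sentence. Second sentence. Third sentence."
separator = re.compile(r"\.\s+")  # a period followed by whitespace

# re.split drops the matched separators, so the first two pieces lose their period.
pieces = separator.split(text)
print(pieces)
# ['First sentence', 'Second sentence', 'Third sentence.']

# The final period has no trailing whitespace, so it is not a split point.
matches = [(m.start(), m.end()) for m in separator.finditer(text)]
print(matches)
```

Only two separator matches are found, which is why the last piece keeps its trailing period.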
RecursiveSplitter
Advanced text chunking with language awareness and syntax-aware splitting for code. Returns Chunk objects with position information.
Features:
- Supports many programming languages
- Preserves code structure
- Customizable chunk sizes and overlap
- Returns Chunk objects with start/end positions (line, column, byte/char offsets)
Usage:
from cocoindex.ops.text import RecursiveSplitter
splitter = RecursiveSplitter()
# Split markdown text
text = "# Title\n\nParagraph 1.\n\nParagraph 2."
chunks = splitter.split(
    text,
    chunk_size=2000,
    chunk_overlap=500,
    language="markdown"
)
for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Start: line {chunk.start.line}, char {chunk.start.char_offset}")
    print(f"End: line {chunk.end.line}, char {chunk.end.char_offset}")
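The start and end fields tie each chunk back to its place in the source text. As a rough illustration of how a character offset relates to a line and column, here is a standard-library sketch. `position_at` is a hypothetical helper, and the 1-based line and column numbering is an assumption of the sketch, not a statement about the library's convention.

```python
from typing import Tuple


def position_at(text: str, char_offset: int) -> Tuple[int, int]:
    """Map a character offset to (line, column), both 1-based in this sketch."""
    if not 0 <= char_offset <= len(text):
        raise ValueError("char_offset is outside the text")
    # Count the newlines before the offset to find the line number.
    line = text.count("\n", 0, char_offset) + 1
    # The column is the distance from the start of the current line.
    line_start = text.rfind("\n", 0, char_offset) + 1
    column = char_offset - line_start + 1
    return line, column


text = "# Title\n\nParagraph 1.\n\nParagraph 2."
offset = text.index("Paragraph 2.")
print(offset, position_at(text, offset))
```

Character offsets and byte offsets differ for non-ASCII text, which is one reason a chunk position may carry both.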
Language-aware code splitting:
# Split Python code
python_code = '''
def hello():
    print("Hello, world!")
def goodbye():
    print("Goodbye!")
'''
chunks = splitter.split(
    python_code,
    chunk_size=1000,
    min_chunk_size=300,
    chunk_overlap=300,
    language="python"
)
Supported languages:
Languages with syntax-aware (tree-sitter) splitting — splits at logical boundaries like functions, classes, and blocks:
| Language | language= value | Extensions |
|---|---|---|
| C | "c" | .c, .h |
| C++ | "cpp" | .cpp, .cc, .cxx, .c++ |
| C# | "c_sharp" | .cs |
| CSS | "css" | .css |
| Fortran | "fortran" | .f, .f90, .f95 |
| Go | "go" | .go |
| HTML | "html" | .html, .htm |
| Java | "java" | .java |
| JavaScript | "javascript" | .js, .mjs, .cjs, .jsx |
| JSON | "json" | .json, .jsonc |
| Julia | "julia" | .jl |
| Kotlin | "kotlin" | .kt, .kts |
| Markdown | "markdown" | .md |
| Pascal | "pascal" | .pas |
| PHP | "php" | .php |
| Python | "python" | .py |
| R | "r" | .r, .R |
| Ruby | "ruby" | .rb |
| Rust | "rust" | .rs |
| Scala | "scala" | .scala |
| Solidity | "solidity" | .sol |
| SQL | "sql" | .sql |
| Svelte | "svelte" | .svelte |
| Swift | "swift" | .swift |
| TOML | "toml" | .toml |
| TSX | "tsx" | .tsx |
| TypeScript | "typescript" | .ts |
| Vue | "vue" | .vue |
| XML | "xml" | .xml |
| YAML | "yaml" | .yaml, .yml |
Many additional languages use separator-based splitting (e.g. "bash", "dart", "elixir", "elm", "haskell", "lua", "perl", and more). Pass the language name as a string to language=; use detect_code_language() to infer it from a filename.
CustomLanguageConfig
Define custom language splitting rules.
Usage:
from cocoindex.ops.text import CustomLanguageConfig, RecursiveSplitter
# Create custom language config for abstracts
abstract_config = CustomLanguageConfig(
    language_name="abstract",
    separators_regex=[
        r"[.?!]+\s+",  # Sentence boundaries
        r"[:;]\s+",    # Clause boundaries
        r",\s+",       # Comma boundaries
        r"\s+",        # Whitespace
    ]
)
splitter = RecursiveSplitter(custom_languages=[abstract_config])
chunks = splitter.split(
    "This is a sample abstract. It has multiple sentences...",
    chunk_size=500,
    chunk_overlap=150,
    language="abstract"
)
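The ordering of separators_regex in this example suggests a coarse-to-fine hierarchy: sentences first, then clauses, commas, and whitespace. The sketch below shows one way such a list could drive recursive splitting, using only the standard library. It is an illustration under stated assumptions, not the RecursiveSplitter implementation: `recursive_split` is a hypothetical function, sizes are counted in characters, and both the merging of small pieces and chunk_overlap are left out.

```python
import re
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split text with the first separator, recursing with finer ones as needed."""
    # Stop when the text fits, or when no finer separator is left
    # (in which case an oversized piece is returned as-is).
    if len(text) <= chunk_size or not separators:
        return [text] if text else []
    first, rest = separators[0], separators[1:]
    chunks: List[str] = []
    for piece in re.split(first, text):
        if not piece:
            continue
        # Pieces that are still too long fall through to the next separator.
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


separators = [r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"]
sample = "Short one. This sentence is longer, with a clause; and more."
result = recursive_split(sample, separators, chunk_size=25)
print(result)
# ['Short one', 'This sentence is longer', 'with a clause', 'and more.']
```

Finer separators are only tried on pieces that exceed chunk_size, so short sentences stay whole while long ones are broken at clause or comma boundaries.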
API reference
For detailed API documentation, refer to the module docstrings:
from cocoindex.ops import text
help(text.RecursiveSplitter)
help(text.SeparatorSplitter)
help(text.detect_code_language)
help(text.CustomLanguageConfig)