Text Processing Utilities
The cocoindex.ops.text module provides utilities for text processing, including:
- Code language detection
- Text chunking and splitting
- Syntax-aware code splitting
Available functions and classes
detect_code_language()
Detect the programming language from a filename.
Usage:
from cocoindex.ops.text import detect_code_language
language = detect_code_language(filename="main.py")
print(language) # "python"
language = detect_code_language(filename="app.rs")
print(language) # "rust"
language = detect_code_language(filename="unknown.xyz")
print(language) # None
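To make the behaviour above concrete, here is a minimal sketch of extension-based detection using only the standard library. It is not the library's implementation: the function name `guess_language` and the small extension table are hypothetical and exist only for illustration, and the real mapping is presumably much larger.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension table for illustration only; the mapping used by
# detect_code_language() is not shown in this document and may differ.
_EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".rs": "rust",
    ".js": "javascript",
    ".ts": "typescript",
    ".md": "markdown",
}


def guess_language(filename: str) -> Optional[str]:
    """Return a language name for the filename, or None if the extension is unknown."""
    suffix = Path(filename).suffix.lower()
    return _EXTENSION_TO_LANGUAGE.get(suffix)


print(guess_language("main.py"))      # python
print(guess_language("app.rs"))       # rust
print(guess_language("unknown.xyz"))  # None
```

Returning None for unknown extensions, as in the usage above, lets the caller decide on a fallback such as plain-text splitting.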
SeparatorSplitter
Split text by regex separators.
Usage:
from cocoindex.ops.text import SeparatorSplitter
splitter = SeparatorSplitter()
text = "First sentence. Second sentence. Third sentence."
chunks = splitter.split(
    text,
    chunk_size=100,
    chunk_overlap=20,
    separators=[r"\.\s+"],  # Split on periods followed by whitespace
)
for chunk in chunks:
    print(chunk.text)
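The idea of splitting on regex separators can be sketched with the standard `re` module. This is an illustration under stated assumptions, not what SeparatorSplitter does internally: the helper `split_on_separators` is hypothetical, it ignores chunk_size and chunk_overlap, and it drops the separator text, which the library may or may not do.

```python
import re
from typing import List, Tuple


def split_on_separators(text: str, separators: List[str]) -> List[Tuple[int, int, str]]:
    """Split text wherever any separator regex matches.

    Returns (start, end, piece) triples, where start and end are character
    offsets into the original text. Separator text itself is discarded.
    """
    pattern = re.compile("|".join(f"(?:{s})" for s in separators))
    pieces = []
    cursor = 0
    for match in pattern.finditer(text):
        if match.start() > cursor:
            pieces.append((cursor, match.start(), text[cursor:match.start()]))
        cursor = match.end()
    if cursor < len(text):
        pieces.append((cursor, len(text), text[cursor:]))
    return pieces


text = "First sentence. Second sentence. Third sentence."
pieces = split_on_separators(text, [r"\.\s+"])
for start, end, piece in pieces:
    print(start, end, repr(piece))
```

Note that the final period stays attached to the last piece, because `\.\s+` only matches a period that is followed by whitespace.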
RecursiveSplitter
Chunk text with language awareness, including syntax-aware splitting for code.
Features:
- Supports many programming languages
- Preserves code structure
- Customizable chunk sizes and overlap
- Returns chunks with position information
Usage:
from cocoindex.ops.text import RecursiveSplitter
splitter = RecursiveSplitter()
# Split markdown text
text = "# Title\n\nParagraph 1.\n\nParagraph 2."
chunks = splitter.split(
    text,
    chunk_size=2000,
    chunk_overlap=500,
    language="markdown",
)
for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Start: line {chunk.start.line}, char {chunk.start.char_offset}")
    print(f"End: line {chunk.end.line}, char {chunk.end.char_offset}")
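The position fields printed above relate a character offset to a line number. The sketch below shows one way such a mapping can be computed with the standard library. The helpers `build_line_starts` and `locate` are hypothetical, and the 1-based line and column convention is an assumption; this document does not state which convention chunk.start.line uses.

```python
from bisect import bisect_right
from typing import List, Tuple


def build_line_starts(text: str) -> List[int]:
    """Return the character offset at which each line begins."""
    starts = [0]
    for index, char in enumerate(text):
        if char == "\n":
            starts.append(index + 1)
    return starts


def locate(line_starts: List[int], char_offset: int) -> Tuple[int, int]:
    """Map a character offset to (line, column), both 1-based by assumption."""
    line_index = bisect_right(line_starts, char_offset) - 1
    return line_index + 1, char_offset - line_starts[line_index] + 1


text = "# Title\n\nParagraph 1.\n\nParagraph 2."
starts = build_line_starts(text)
offset = text.index("Paragraph 2.")
print(offset, locate(starts, offset))  # 23 (5, 1)
```

Precomputing the line starts once keeps each lookup logarithmic, which matters when many chunks are located in the same document.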
Language-aware code splitting:
# Split Python code
python_code = '''
def hello():
    print("Hello, world!")

def goodbye():
    print("Goodbye!")
'''
chunks = splitter.split(
    python_code,
    chunk_size=1000,
    min_chunk_size=300,
    chunk_overlap=300,
    language="python",
)
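To illustrate how a chunk size and an overlap interact, here is a fixed-window sketch. It is deliberately naive and is not how RecursiveSplitter chooses boundaries, since the library is described as syntax-aware. The function `sliding_windows` is hypothetical, and measuring sizes in characters is an assumption.

```python
from typing import List


def sliding_windows(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return windows


windows = sliding_windows("abcdefghij", chunk_size=4, chunk_overlap=1)
print(windows)  # ['abcd', 'defg', 'ghij']
```

Each window shares its first character with the end of the previous one, so context that straddles a boundary appears in both chunks. A larger overlap increases that shared context at the cost of more total text to store or embed.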
Supported languages:
- Python
- Rust
- JavaScript/TypeScript
- Markdown
- And many more...
CustomLanguageConfig
Define custom language splitting rules.
Usage:
from cocoindex.ops.text import CustomLanguageConfig, RecursiveSplitter
# Create custom language config for abstracts
abstract_config = CustomLanguageConfig(
    language_name="abstract",
    separators_regex=[
        r"[.?!]+\s+",  # Sentence boundaries
        r"[:;]\s+",    # Clause boundaries
        r",\s+",       # Comma boundaries
        r"\s+",        # Whitespace
    ],
)
splitter = RecursiveSplitter(custom_languages=[abstract_config])
chunks = splitter.split(
    "This is a sample abstract. It has multiple sentences...",
    chunk_size=500,
    chunk_overlap=150,
    language="abstract",
)
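The ordered separator list above suggests a coarse-to-fine strategy: try sentence boundaries first, and fall back to finer separators only for pieces that are still too long. The sketch below shows that idea with the standard `re` module. It is an assumption about the general technique, not the library's algorithm: `recursive_split` is a hypothetical helper, it drops separator text, it does not merge small pieces back up to chunk_size, and it has no overlap.

```python
import re
from typing import List

# Same separator order as the abstract_config example: coarsest first.
SEPARATORS = [
    r"[.?!]+\s+",  # Sentence boundaries
    r"[:;]\s+",    # Clause boundaries
    r",\s+",       # Comma boundaries
    r"\s+",        # Whitespace
]


def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split at the coarsest separator, recursing into pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts at chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, rest = separators[0], separators[1:]
    chunks = []
    for piece in re.split(head, text):
        chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks


sample = "Short intro. A longer sentence: with a clause, and more words."
result = recursive_split(sample, chunk_size=20, separators=SEPARATORS)
for chunk in result:
    print(repr(chunk))
```

With chunk_size=20, the first sentence fits as is, while the second is split at the colon and then at the comma, so the whitespace separator is never needed for this input.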
API reference
For detailed API documentation, refer to the module docstrings:
from cocoindex.ops import text
help(text.RecursiveSplitter)
help(text.SeparatorSplitter)
help(text.detect_code_language)
help(text.CustomLanguageConfig)