Common data types
Data types shared across connectors and built-in operations — FileLike, FilePath, FilePathMatcher, Chunk with text positions, and the Embedder protocol for single-text async embedding.
The cocoindex.resources package provides common data models and abstractions shared across connectors and built-in operations. Connectors provide concrete implementations — for example, localfs.File implements FileLike, and localfs.FilePath extends FilePath. See individual connector docs for connector-specific details.
File
The file module (cocoindex.resources.file) defines base classes and utilities for working with file-like objects.
FileLike
FileLike is a base class for file objects with async read methods. Each connector provides its own subclass (e.g., localfs.File, amazon_s3.S3Object).
from cocoindex.resources.file import FileLike
async def process_file(file: FileLike) -> str:
    text = await file.read_text()
    ...
    return text
Properties:
- file_path: A FilePath object representing the file's path. Access the relative path via file_path.path (a PurePath).
Methods:
- async size(): Return the file size in bytes.
- async read(size=-1): Read file content as bytes. Pass size to limit the number of bytes read.
- async read_text(encoding=None, errors="replace"): Read as text. Auto-detects the encoding from a BOM if encoding is not specified.
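The BOM-based detection mentioned for read_text can be pictured with the following minimal sketch. It is not the library's implementation: decode_with_bom is a hypothetical helper, and the UTF-8 fallback when no BOM is present is an assumption.

```python
import codecs
from typing import Optional

# Longer BOMs come first: the UTF-32 LE BOM starts with the UTF-16 LE BOM bytes,
# so checking UTF-16 first would misclassify UTF-32 input.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(
    data: bytes, encoding: Optional[str] = None, errors: str = "replace"
) -> str:
    """Hypothetical helper: decode bytes, sniffing a BOM when no encoding is given."""
    if encoding is not None:
        return data.decode(encoding, errors)
    for bom, name in _BOMS:
        if data.startswith(bom):
            # Strip the BOM so it does not show up in the decoded text.
            return data[len(bom):].decode(name, errors)
    # Assumption: fall back to UTF-8 when no BOM is found.
    return data.decode("utf-8", errors)
```

With errors="replace", undecodable bytes become U+FFFD instead of raising, which matches the default shown in the read_text signature.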
Memoization:
FileLike objects provide a memoization key based on file_path (file identity). When used as arguments to a memoized function, CocoIndex uses a two-level validation: it checks the modification time first (cheap), then computes a content fingerprint only if the modification time has changed. This means touching a file or moving it won’t cause unnecessary recomputation if the content is unchanged.
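The two-level check described above can be sketched as follows. This is an illustration only: CacheEntry, content_fingerprint, and is_still_valid are hypothetical names, and SHA-256 is an assumed choice of fingerprint, not necessarily what CocoIndex uses.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass
class CacheEntry:
    """Hypothetical record of what was seen when the result was memoized."""
    mtime: float
    fingerprint: str


def content_fingerprint(content: bytes) -> str:
    # Assumption: any stable content hash would serve; SHA-256 is used here.
    return hashlib.sha256(content).hexdigest()


def is_still_valid(
    entry: CacheEntry, current_mtime: float, read_content: Callable[[], bytes]
) -> bool:
    # Level 1 (cheap): an unchanged modification time means no read is needed.
    if entry.mtime == current_mtime:
        return True
    # Level 2 (expensive): the file was touched, so compare content fingerprints.
    if content_fingerprint(read_content()) == entry.fingerprint:
        entry.mtime = current_mtime  # remember the new mtime for the next check
        return True
    return False
```

The content is read only when the modification time differs, so a touched but unchanged file costs one read and no recomputation.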
FilePath
FilePath is a base class that combines a base directory (with a stable key) and a relative path. This enables stable memoization even when the entire directory tree is moved to a different location.
from cocoindex.resources.file import FilePath
Each connector provides its own FilePath subclass (e.g., localfs.FilePath). The base class defines the common interface.
Properties:
- base_dir: An object that holds the base directory. Its key is used for stable memoization.
- path: The path relative to the base directory (PurePath).
Methods:
- resolve(): Resolve to the full path (the type depends on the connector, e.g., pathlib.Path for the local filesystem).
Path Operations:
FilePath supports most pathlib.PurePath operations:
# Join paths with /
config_path = source_dir / "config" / "settings.json"
# Access path properties
config_path.name # "settings.json"
config_path.stem # "settings"
config_path.suffix # ".json"
config_path.parts # ("config", "settings.json")
config_path.parent # FilePath pointing to "config/"
# Modify path components
config_path.with_name("other.json")
config_path.with_suffix(".yaml")
config_path.with_stem("config")
# Pattern matching
config_path.match("*.json") # True
# Convert to POSIX string
config_path.as_posix() # "config/settings.json"
Memoization:
FilePath provides a memoization key based on (base_dir.key, path). This means:
- Two FilePath objects with the same base directory key and relative path have the same memo key.
- Moving the entire project directory doesn't invalidate memoization, as long as the same base directory key is used.
For connector-specific usage (e.g., register_base_dir), see the individual connector documentation, such as Local File System.
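To see why a (base_dir.key, path) key survives a move, consider this small stand-in. The classes BaseDir and SketchFilePath and the location field are hypothetical and only mirror the shape described above; they are not the cocoindex classes.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Tuple


@dataclass(frozen=True)
class BaseDir:
    key: str       # stable identifier chosen by the user
    location: str  # where the directory currently lives (hypothetical field)


@dataclass(frozen=True)
class SketchFilePath:
    base_dir: BaseDir
    path: PurePosixPath  # relative to the base directory

    def memo_key(self) -> Tuple[str, PurePosixPath]:
        # The physical location is deliberately left out of the key.
        return (self.base_dir.key, self.path)

    def resolve(self) -> PurePosixPath:
        return PurePosixPath(self.base_dir.location) / self.path


rel = PurePosixPath("config/settings.json")
before = SketchFilePath(BaseDir("docs", "/home/alice/project"), rel)
after = SketchFilePath(BaseDir("docs", "/mnt/backup/project"), rel)

# Same key and relative path give the same memo key, although the resolved paths differ.
same_key = before.memo_key() == after.memo_key()
```

Here same_key is True even though before.resolve() and after.resolve() point to different locations.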
FilePathMatcher
FilePathMatcher is a protocol for filtering files and directories during traversal.
from pathlib import PurePath

from cocoindex.resources.file import FilePathMatcher

class MyMatcher(FilePathMatcher):
    def is_dir_included(self, path: PurePath) -> bool:
        """Return True to traverse this directory."""
        return not path.name.startswith(".")

    def is_file_included(self, path: PurePath) -> bool:
        """Return True to include this file."""
        return path.suffix in (".py", ".md")
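How a traversal might consult the two methods can be sketched with the standard library alone. The walk_matching function below is hypothetical and only illustrates the contract (prune directories first, then filter files); how connectors actually call the matcher is an assumption. SourceMatcher is duck-typed so the sketch runs without cocoindex installed.

```python
import os
import tempfile
from pathlib import Path, PurePath
from typing import List


class SourceMatcher:
    """Same two methods as the example above, without the cocoindex base class."""

    def is_dir_included(self, path: PurePath) -> bool:
        return not path.name.startswith(".")

    def is_file_included(self, path: PurePath) -> bool:
        return path.suffix in (".py", ".md")


def walk_matching(root: Path, matcher: SourceMatcher) -> List[PurePath]:
    """Hypothetical traversal: skip excluded directories, then filter files."""
    found: List[PurePath] = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = Path(dirpath).relative_to(root)
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if matcher.is_dir_included(rel_dir / d)]
        for name in filenames:
            rel = rel_dir / name
            if matcher.is_file_included(rel):
                found.append(rel)
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src").mkdir()
    (root / ".git").mkdir()
    for rel_name in ("README.md", "src/main.py", "src/data.bin", ".git/hook.py"):
        (root / rel_name).write_text("")
    matched = sorted(p.as_posix() for p in walk_matching(root, SourceMatcher()))
```

In this sketch .git/hook.py is never visited because its directory is pruned, and src/data.bin is dropped by the file filter.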
PatternFilePathMatcher
A built-in FilePathMatcher implementation using globset patterns:
from cocoindex.resources.file import PatternFilePathMatcher
# Include only Python and Markdown files, exclude tests and hidden dirs
matcher = PatternFilePathMatcher(
    included_patterns=["**/*.py", "**/*.md"],
    excluded_patterns=["**/test_*", "**/.*"],
)
Parameters:
- included_patterns: Glob patterns (globset syntax) for files to include. Use **/*.ext to match at any depth. If None, all files are included.
- excluded_patterns: Glob patterns (globset syntax) for files/directories to exclude. Excluded directories are not traversed.
Patterns use globset semantics: *.py matches only in the root directory; use **/*.py to match at any depth.
Chunk
The chunk module (cocoindex.resources.chunk) defines types for representing text chunks produced by text splitters.
Chunk
A Chunk is a frozen dataclass representing a piece of text with its position information in the original document.
from cocoindex.resources.chunk import Chunk
Fields:
- text (str): The text content of the chunk.
- start (TextPosition): Start position in the original text.
- end (TextPosition): End position in the original text.
TextPosition
A frozen dataclass representing a position in text.
Fields:
- byte_offset (int): Byte offset from the start of the text.
- char_offset (int): Character offset from the start of the text.
- line (int): 1-based line number.
- column (int): 1-based column number.
Example:
from cocoindex.ops.text import RecursiveSplitter
splitter = RecursiveSplitter()
chunks = splitter.split(text, chunk_size=2000, chunk_overlap=500, language="markdown")
for chunk in chunks:
    print(f"[{chunk.start.line}:{chunk.start.column}] {chunk.text[:50]}...")
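To make the relationship between the four TextPosition fields concrete, the sketch below derives them from a character offset. Position is a local stand-in with the same field names, position_at is a hypothetical helper, and two details are assumptions: byte offsets are measured in UTF-8, and columns count characters rather than bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """Local stand-in mirroring the TextPosition fields listed above."""
    byte_offset: int
    char_offset: int
    line: int
    column: int


def position_at(text: str, char_offset: int) -> Position:
    prefix = text[:char_offset]
    # Assumption: byte offsets refer to the UTF-8 encoding of the text.
    byte_offset = len(prefix.encode("utf-8"))
    line = prefix.count("\n") + 1
    # rfind returns -1 when there is no newline, which yields a 1-based column.
    column = char_offset - prefix.rfind("\n")
    return Position(byte_offset, char_offset, line, column)


# "é" takes two bytes in UTF-8, so the byte and character offsets diverge.
pos = position_at("héllo\nworld", 8)
# pos == Position(byte_offset=9, char_offset=8, line=2, column=3)
```

The divergence between byte_offset and char_offset is why both are carried: byte offsets suit slicing encoded data, while character offsets suit slicing Python strings.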
Embedder
The embedder module (cocoindex.resources.embedder) defines a protocol for single-text async embedding.
Embedder Protocol
from cocoindex.resources.embedder import Embedder
class Embedder(Protocol):
    async def embed(self, text: str) -> NDArray[np.float32]: ...
This is the call-site contract that consumers like resolve_entities rely on. Both LiteLLMEmbedder and SentenceTransformerEmbedder satisfy this protocol — await embedder.embed("some text") returns a single NDArray[np.float32].
The protocol is deliberately narrow: it does not include dimension() or __coco_vector_schema__(), which are concerns of connectors and table-schema creation, not of embedding consumers.
# Any embedder works with resolve_entities:
from cocoindex.ops.entity_resolution import resolve_entities
result = await resolve_entities(
    entities={"Apple Inc.", "Apple"},
    embedder=my_embedder,  # LiteLLMEmbedder, SentenceTransformerEmbedder, or your own
    resolve_pair=my_resolver,
)
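Because the protocol is structural, any class with a matching async embed method qualifies as "your own" embedder. The toy below is a sketch under stated assumptions: EmbedderLike is a local mirror of the protocol shape so the example runs without cocoindex, and HashingEmbedder and cosine are hypothetical names. The hashing scheme carries no semantic meaning; it only demonstrates the call-site contract.

```python
import asyncio
import hashlib
from typing import Protocol

import numpy as np
from numpy.typing import NDArray


class EmbedderLike(Protocol):
    """Local mirror of the protocol shape shown above."""

    async def embed(self, text: str) -> NDArray[np.float32]: ...


class HashingEmbedder:
    """Toy embedder: hash tokens into buckets, then L2-normalize."""

    def __init__(self, dim: int = 16) -> None:
        self._dim = dim

    async def embed(self, text: str) -> NDArray[np.float32]:
        vec = np.zeros(self._dim, dtype=np.float32)
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % self._dim] += 1.0
        norm = float(np.linalg.norm(vec))
        return vec / norm if norm > 0 else vec


async def cosine(embedder: EmbedderLike, a: str, b: str) -> float:
    # Consumers only ever call embed(); dimension and schema are never needed.
    va, vb = await asyncio.gather(embedder.embed(a), embedder.embed(b))
    return float(np.dot(va, vb))


similarity = asyncio.run(cosine(HashingEmbedder(), "Apple Inc.", "apple inc."))
```

HashingEmbedder does not inherit from the protocol class; matching the method signature is enough, which is the point of keeping the protocol narrow.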