Resource Types
The cocoindex.resources package provides common data models and abstractions shared across connectors and utility modules, ensuring a consistent interface for working with data.
File
The file module (cocoindex.resources.file) defines protocols and utilities for working with file-like objects.
FileLike / AsyncFileLike
FileLike is a protocol for file objects with synchronous read access. AsyncFileLike is its async counterpart with the same properties but async read methods.
```python
from cocoindex.resources.file import FileLike

def process_file(file: FileLike) -> str:
    text = file.read_text()
    ...
    return text
```

```python
from cocoindex.resources.file import AsyncFileLike

async def process_file_async(file: AsyncFileLike) -> str:
    text = await file.read_text()
    ...
    return text
```
Properties:
- `file_path` — A `FilePath` object representing the file's path. Access the relative path via `file_path.path` (`PurePath`).
- `size` — File size in bytes.
- `modified_time` — File modification time (`datetime`).
Methods:
- `read(size=-1)` — Read file content as bytes. Pass `size` to limit bytes read.
- `read_text(encoding=None, errors="replace")` — Read as text. Auto-detects encoding via BOM if not specified.
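The BOM-based auto-detection mentioned above can be illustrated with a small stdlib sketch. This is not CocoIndex's implementation, just the general technique of checking a file's leading bytes against well-known byte-order marks:

```python
import codecs

# Map well-known byte-order marks to their encodings. Order matters:
# the UTF-32 BOMs must be checked before UTF-16, since BOM_UTF32_LE
# starts with the same two bytes as BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_encoding(data: bytes, default: str = "utf-8") -> str:
    """Pick an encoding from a leading BOM, falling back to `default`."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding
    return default
```

When no BOM is present, a fallback such as UTF-8 is used, which matches the behavior of passing `encoding=None`.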
Memoization:
FileLike objects provide a memoization key based on file_path and modified_time. When used as arguments to a memoized function, CocoIndex can detect when a file has changed and skip recomputation for unchanged files.
FilePath
FilePath is a base class that combines a base directory (with a stable key) and a relative path. This enables stable memoization even when the entire directory tree is moved to a different location.
```python
from cocoindex.resources.file import FilePath
```
Each connector provides its own FilePath subclass (e.g., localfs.FilePath). The base class defines the common interface.
Properties:
- `base_dir` — A `KeyedConnection` object that holds the base directory. The `base_dir.key` is used for stable memoization.
- `path` — The path relative to the base directory (`PurePath`).
Methods:
- `resolve()` — Resolve to the full path (type depends on the connector, e.g., `pathlib.Path` for local filesystem).
Path Operations:
FilePath supports most pathlib.PurePath operations:
```python
# Join paths with /
config_path = source_dir / "config" / "settings.json"

# Access path properties
config_path.name    # "settings.json"
config_path.stem    # "settings"
config_path.suffix  # ".json"
config_path.parts   # ("config", "settings.json")
config_path.parent  # FilePath pointing to "config/"

# Modify path components
config_path.with_name("other.json")
config_path.with_suffix(".yaml")
config_path.with_stem("config")

# Pattern matching
config_path.match("*.json")  # True

# Convert to POSIX string
config_path.as_posix()  # "config/settings.json"
```
Memoization:
FilePath provides a memoization key based on (base_dir.key, path). This means:
- Two `FilePath` objects with the same base directory key and relative path have the same memo key.
- Moving the entire project directory doesn't invalidate memoization, as long as you re-register with the same key.
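A toy model makes the move-stability concrete. The `BaseDir` and `ToyFilePath` types below are hypothetical stand-ins for `KeyedConnection` and a connector's `FilePath`, used only to show which parts feed the memo key:

```python
from dataclasses import dataclass
from pathlib import Path, PurePath

@dataclass(frozen=True)
class BaseDir:
    # Hypothetical stand-in for a KeyedConnection: a stable key plus
    # the current on-disk location of the directory tree.
    key: str
    root: Path

@dataclass(frozen=True)
class ToyFilePath:
    base_dir: BaseDir
    path: PurePath  # relative to base_dir

    def memo_key(self) -> tuple[str, PurePath]:
        # Only the stable key and the relative path participate in
        # the memo key -- never the absolute on-disk location.
        return (self.base_dir.key, self.path)

before = ToyFilePath(BaseDir("docs", Path("/home/alice/project")), PurePath("a.md"))
after = ToyFilePath(BaseDir("docs", Path("/mnt/backup/project")), PurePath("a.md"))
assert before.memo_key() == after.memo_key()  # the move doesn't change the key
```

Because `root` is excluded from the key, relocating the tree and re-registering under the same key leaves every file's memo key intact.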
For connector-specific usage (e.g., register_base_dir), see the individual connector documentation like Local File System.
FilePathMatcher
FilePathMatcher is a protocol for filtering files and directories during traversal.
```python
from pathlib import PurePath

from cocoindex.resources.file import FilePathMatcher

class MyMatcher(FilePathMatcher):
    def is_dir_included(self, path: PurePath) -> bool:
        """Return True to traverse this directory."""
        return not path.name.startswith(".")

    def is_file_included(self, path: PurePath) -> bool:
        """Return True to include this file."""
        return path.suffix in (".py", ".md")
```
PatternFilePathMatcher
A built-in FilePathMatcher implementation using glob patterns:
```python
from cocoindex.resources.file import PatternFilePathMatcher

# Include only Python and Markdown files, exclude tests and hidden dirs
matcher = PatternFilePathMatcher(
    included_patterns=["*.py", "*.md"],
    excluded_patterns=["**/test_*", "**/.*"],
)
```
Parameters:
- `included_patterns` — Glob patterns for files to include. If `None`, all files are included.
- `excluded_patterns` — Glob patterns for files/directories to exclude. Excluded directories are not traversed.
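To get a feel for how such include/exclude patterns combine, here is a rough stdlib approximation using `fnmatch`. The `included` helper is hypothetical and `PatternFilePathMatcher`'s exact matching rules may differ; this only sketches the "exclusions win, then inclusions apply" shape:

```python
from fnmatch import fnmatch
from pathlib import PurePath

def included(path: PurePath,
             included_patterns: list[str],
             excluded_patterns: list[str]) -> bool:
    """Rough sketch: a file is kept only if no exclusion matches it
    and at least one inclusion matches its name."""
    posix = path.as_posix()
    # Try each exclusion against the full path, and its last segment
    # against the bare filename (so "**/test_*" also catches a
    # top-level "test_x.py").
    if any(fnmatch(posix, pat) or fnmatch(path.name, pat.rsplit("/", 1)[-1])
           for pat in excluded_patterns):
        return False
    return any(fnmatch(path.name, pat) for pat in included_patterns)
```

With the example patterns above, `src/main.py` passes, while `src/test_util.py` is rejected by the exclusion before the inclusion is ever consulted.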
ID Generation
The ID module (cocoindex.resources.id) provides utilities for generating stable unique IDs and UUIDs that persist across incremental updates.
Choosing the Right API
| API | Same `dep` produces... | Use when... |
|---|---|---|
| `generate_id(dep)` | Same ID every time | Each unique input maps to exactly one ID |
| `IdGenerator.next_id(dep)` | Distinct ID each call | You need multiple IDs for potentially non-distinct inputs |
The same distinction applies to `generate_uuid` vs `UuidGenerator`.
generate_id / generate_uuid
Functions that return the same ID/UUID for the same dep value. These are idempotent: calling multiple times with identical dep yields identical results.
```python
from cocoindex.resources.id import generate_id, generate_uuid

def process_item(item: Item) -> Row:
    # Same item.key always gets the same ID
    item_id = generate_id(item.key)
    return Row(id=item_id, data=item.data)

def process_document(doc: Document) -> Row:
    # Same doc.path always gets the same UUID
    doc_uuid = generate_uuid(doc.path)
    return Row(id=doc_uuid, content=doc.content)
```
Parameters:
- `dep` — Dependency value that determines the ID/UUID. The same `dep` always produces the same result within a component. Defaults to `None`.
Returns:
- `generate_id` returns an `int` (IDs start from 1; 0 is reserved).
- `generate_uuid` returns a `uuid.UUID`.
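A minimal mental model of the idempotent behavior (illustrative only, not the real implementation): a per-component registry that assigns the next integer the first time each `dep` is seen, and returns the stored value thereafter:

```python
# Illustrative model of idempotent ID generation: the first time a
# dep is seen it gets the next integer (starting from 1; 0 is
# reserved), and every later call with the same dep returns it again.
_registry: dict[object, int] = {}

def toy_generate_id(dep: object = None) -> int:
    if dep not in _registry:
        _registry[dep] = len(_registry) + 1
    return _registry[dep]
```

This is why the same `item.key` in the earlier example always maps to the same row ID across incremental updates.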
IdGenerator / UuidGenerator
Classes that return a distinct ID/UUID on each call, even when called with the same dep value. The sequence is stable across runs.
Use these when you need multiple IDs for potentially non-distinct inputs, such as splitting text into chunks where chunks may have identical content but still need unique IDs.
```python
from cocoindex.resources.id import IdGenerator, UuidGenerator

def process_document(doc: Document) -> list[Row]:
    id_gen = IdGenerator()
    rows = []
    for chunk in split_into_chunks(doc.content):
        # Each call returns a distinct ID, even if chunks are identical
        chunk_id = id_gen.next_id(chunk.content)
        rows.append(Row(id=chunk_id, content=chunk.content))
    return rows

def process_with_uuids(doc: Document) -> list[Row]:
    uuid_gen = UuidGenerator()
    rows = []
    for chunk in split_into_chunks(doc.content):
        # Each call returns a distinct UUID, even if chunks are identical
        chunk_uuid = uuid_gen.next_uuid(chunk.content)
        rows.append(Row(id=chunk_uuid, content=chunk.content))
    return rows
```
Methods:
- `IdGenerator.next_id(dep=None)` — Generate the next unique integer ID (distinct on each call).
- `UuidGenerator.next_uuid(dep=None)` — Generate the next unique UUID (distinct on each call).
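The contrast with `generate_id` can be sketched as a simple counter. `ToyIdGenerator` is a hypothetical model, not the actual implementation: repeated deps still advance the sequence, and a fresh generator replays the same sequence, which is the sense in which the sequence is stable across runs:

```python
class ToyIdGenerator:
    """Hypothetical model of IdGenerator: every call yields a fresh
    ID, even for a dep that was seen before."""

    def __init__(self) -> None:
        self._next = 1  # IDs start from 1; 0 is reserved

    def next_id(self, dep: object = None) -> int:
        issued = self._next
        self._next += 1
        return issued

gen = ToyIdGenerator()
ids = [gen.next_id("same chunk"), gen.next_id("same chunk"), gen.next_id("same chunk")]
assert ids == [1, 2, 3]                            # identical deps, distinct IDs
assert ToyIdGenerator().next_id("same chunk") == 1  # a fresh run replays the sequence
```

This is exactly the chunking scenario above: three identical chunks still land in three distinct rows.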