Common data types

Data types shared across connectors and built-in operations — FileLike, FilePath, FilePathMatcher, Chunk with text positions, and the Embedder protocol for single-text async embedding.

Version
v 1.0.0-alpha48
Last reviewed
Apr 19, 2026

The cocoindex.resources package provides common data models and abstractions shared across connectors and built-in operations. Connectors provide concrete implementations — for example, localfs.File implements FileLike, and localfs.FilePath extends FilePath. See individual connector docs for connector-specific details.

File

The file module (cocoindex.resources.file) defines base classes and utilities for working with file-like objects.

FileLike

FileLike is a base class for file objects with async read methods. Each connector provides its own subclass (e.g., localfs.File, amazon_s3.S3Object).

python
from cocoindex.resources.file import FileLike

async def process_file(file: FileLike) -> str:
    text = await file.read_text()
    ...
    return text

Properties:

  • file_path — A FilePath object representing the file’s path. Access the relative path via file_path.path (PurePath).

Methods:

  • async size() — Return the file size in bytes.
  • async read(size=-1) — Read file content as bytes. Pass size to limit bytes read.
  • async read_text(encoding=None, errors="replace") — Read as text. Auto-detects encoding via BOM if not specified.

Memoization:

FileLike objects provide a memoization key based on file_path (file identity). When used as arguments to a memoized function, CocoIndex uses a two-level validation: it checks the modification time first (cheap), then computes a content fingerprint only if the modification time has changed. This means touching a file or moving it won’t cause unnecessary recomputation if the content is unchanged.

FilePath

FilePath is a base class that combines a base directory (with a stable key) and a relative path. This enables stable memoization even when the entire directory tree is moved to a different location.

python
from cocoindex.resources.file import FilePath

Each connector provides its own FilePath subclass (e.g., localfs.FilePath). The base class defines the common interface.

Properties:

  • base_dir — An object that holds the base directory. Its key is used for stable memoization.
  • path — The path relative to the base directory (PurePath).

Methods:

  • resolve() — Resolve to the full path (type depends on the connector, e.g., pathlib.Path for local filesystem).

Path Operations:

FilePath supports most pathlib.PurePath operations:

python
# Join paths with /
config_path = source_dir / "config" / "settings.json"

# Access path properties
config_path.name      # "settings.json"
config_path.stem      # "settings"
config_path.suffix    # ".json"
config_path.parts     # ("config", "settings.json")
config_path.parent    # FilePath pointing to "config/"

# Modify path components
config_path.with_name("other.json")
config_path.with_suffix(".yaml")
config_path.with_stem("config")

# Pattern matching
config_path.match("*.json")  # True

# Convert to POSIX string
config_path.as_posix()  # "config/settings.json"

Memoization:

FilePath provides a memoization key based on (base_dir.key, path). This means:

  • Two FilePath objects with the same base directory key and relative path have the same memo key
  • Moving the entire project directory doesn’t invalidate memoization, as long as the same base directory key is used

For connector-specific usage (e.g., register_base_dir), see the individual connector documentation like Local File System.

FilePathMatcher

FilePathMatcher is a protocol for filtering files and directories during traversal.

python
from cocoindex.resources.file import FilePathMatcher

class MyMatcher(FilePathMatcher):
    def is_dir_included(self, path: PurePath) -> bool:
        """Return True to traverse this directory."""
        return not path.name.startswith(".")

    def is_file_included(self, path: PurePath) -> bool:
        """Return True to include this file."""
        return path.suffix in (".py", ".md")

PatternFilePathMatcher

A built-in FilePathMatcher implementation using globset patterns:

python
from cocoindex.resources.file import PatternFilePathMatcher

# Include only Python and Markdown files, exclude tests and hidden dirs
matcher = PatternFilePathMatcher(
    included_patterns=["**/*.py", "**/*.md"],
    excluded_patterns=["**/test_*", "**/.*"],
)

Parameters:

  • included_patterns — Glob patterns (globset syntax) for files to include. Use **/*.ext to match at any depth. If None, all files are included.
  • excluded_patterns — Glob patterns (globset syntax) for files/directories to exclude. Excluded directories are not traversed.
Note

Patterns use globset semantics: *.py matches only in the root directory; use **/*.py to match at any depth.

Chunk

The chunk module (cocoindex.resources.chunk) defines types for representing text chunks produced by text splitters.

Chunk

A Chunk is a frozen dataclass representing a piece of text with its position information in the original document.

python
from cocoindex.resources.chunk import Chunk

Fields:

  • text (str) — The text content of the chunk.
  • start (TextPosition) — Start position in the original text.
  • end (TextPosition) — End position in the original text.

TextPosition

A frozen dataclass representing a position in text.

Fields:

  • byte_offset (int) — Byte offset from the start of the text.
  • char_offset (int) — Character offset from the start of the text.
  • line (int) — 1-based line number.
  • column (int) — 1-based column number.

Example:

python
from cocoindex.ops.text import RecursiveSplitter

splitter = RecursiveSplitter()
chunks = splitter.split(text, chunk_size=2000, chunk_overlap=500, language="markdown")

for chunk in chunks:
    print(f"[{chunk.start.line}:{chunk.start.column}] {chunk.text[:50]}...")

Embedder

The embedder module (cocoindex.resources.embedder) defines a protocol for single-text async embedding.

Embedder Protocol

python
from cocoindex.resources.embedder import Embedder

class Embedder(Protocol):
    async def embed(self, text: str) -> NDArray[np.float32]: ...

This is the call-site contract that consumers like resolve_entities rely on. Both LiteLLMEmbedder and SentenceTransformerEmbedder satisfy this protocol — await embedder.embed("some text") returns a single NDArray[np.float32].

The protocol is deliberately narrow: it does not include dimension() or __coco_vector_schema__(), which are concerns of connectors and table-schema creation, not of embedding consumers.

python
# Any embedder works with resolve_entities:
from cocoindex.ops.entity_resolution import resolve_entities

result = await resolve_entities(
    entities={"Apple Inc.", "Apple"},
    embedder=my_embedder,  # LiteLLMEmbedder, SentenceTransformerEmbedder, or your own
    resolve_pair=my_resolver,
)
CocoIndex Docs Edit this page Report issue