Resource Types

The cocoindex.resources package provides common data models and abstractions shared across connectors and built-in operation modules, ensuring a consistent interface for working with data.

File

The file module (cocoindex.resources.file) defines base classes and utilities for working with file-like objects.

FileLike

FileLike is a base class for file objects with async read methods.

from cocoindex.resources.file import FileLike

async def process_file(file: FileLike) -> str:
    text = await file.read_text()
    ...
    return text

Properties:

  • file_path — A FilePath object representing the file's path. Access the relative path via file_path.path (PurePath).

Methods:

  • async size() — Return the file size in bytes.
  • async read(size=-1) — Read file content as bytes. Pass size to limit bytes read.
  • async read_text(encoding=None, errors="replace") — Read as text. Auto-detects encoding via BOM if not specified.

Memoization:

FileLike objects provide a memoization key based on file_path (file identity). When used as arguments to a memoized function, CocoIndex uses a two-level validation: it checks the modification time first (cheap), then computes a content fingerprint only if the modification time has changed. This means touching a file or moving it won't cause unnecessary recomputation if the content is unchanged.
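The two-level check can be sketched as follows. This is an illustrative stand-in, not CocoIndex's implementation; the cached mtime and fingerprint are assumed to come from the previous run:

```python
# Sketch of two-level cache validation: cheap mtime check first, content
# fingerprint only when the mtime has changed. Names are illustrative.
import hashlib
import os


def content_fingerprint(path: str) -> str:
    """Hash the file content (the expensive check)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def needs_recompute(path: str, cached_mtime: float, cached_fingerprint: str) -> bool:
    """Return True only if the file's content actually changed."""
    mtime = os.stat(path).st_mtime
    if mtime == cached_mtime:
        # Modification time unchanged: trust the cache without reading the file.
        return False
    # Modification time changed (e.g. the file was touched or moved): fall back
    # to the content fingerprint to decide whether recomputation is needed.
    return content_fingerprint(path) != cached_fingerprint
```

Touching the file changes the mtime but not the fingerprint, so the second-level check still reports the cache as valid.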

FilePath

FilePath is a base class that combines a base directory (with a stable key) and a relative path. This enables stable memoization even when the entire directory tree is moved to a different location.

from cocoindex.resources.file import FilePath

Each connector provides its own FilePath subclass (e.g., localfs.FilePath). The base class defines the common interface.

Properties:

  • base_dir — An object that holds the base directory. Its key is used for stable memoization.
  • path — The path relative to the base directory (PurePath).

Methods:

  • resolve() — Resolve to the full path (type depends on the connector, e.g., pathlib.Path for local filesystem).

Path Operations:

FilePath supports most pathlib.PurePath operations:

# Join paths with /
config_path = source_dir / "config" / "settings.json"

# Access path properties
config_path.name # "settings.json"
config_path.stem # "settings"
config_path.suffix # ".json"
config_path.parts # ("config", "settings.json")
config_path.parent # FilePath pointing to "config/"

# Modify path components
config_path.with_name("other.json")
config_path.with_suffix(".yaml")
config_path.with_stem("config")

# Pattern matching
config_path.match("*.json") # True

# Convert to POSIX string
config_path.as_posix() # "config/settings.json"

Memoization:

FilePath provides a memoization key based on (base_dir.key, path). This means:

  • Two FilePath objects with the same base directory key and relative path have the same memo key
  • Moving the entire project directory doesn't invalidate memoization, as long as the same base directory key is used
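The stability property can be illustrated with a small sketch. `StandInBaseDir` is a stand-in, not a CocoIndex class; only its stable key enters the memo key, so the physical location is irrelevant:

```python
# Illustrative sketch of the (base_dir.key, path) memo key described above.
from dataclasses import dataclass
from pathlib import PurePath


@dataclass(frozen=True)
class StandInBaseDir:
    key: str       # stable key registered for this directory
    location: str  # physical location; deliberately NOT part of the memo key


def memo_key(base_dir: StandInBaseDir, path: PurePath) -> tuple[str, str]:
    return (base_dir.key, path.as_posix())


# Same key + same relative path => same memo key, even after moving the tree:
before = memo_key(StandInBaseDir("docs", "/home/alice/project"), PurePath("config/settings.json"))
after = memo_key(StandInBaseDir("docs", "/mnt/backup/project"), PurePath("config/settings.json"))
```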

For connector-specific usage (e.g., register_base_dir), see the individual connector documentation like Local File System.

FilePathMatcher

FilePathMatcher is a protocol for filtering files and directories during traversal.

from pathlib import PurePath

from cocoindex.resources.file import FilePathMatcher

class MyMatcher(FilePathMatcher):
    def is_dir_included(self, path: PurePath) -> bool:
        """Return True to traverse this directory."""
        return not path.name.startswith(".")

    def is_file_included(self, path: PurePath) -> bool:
        """Return True to include this file."""
        return path.suffix in (".py", ".md")
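During traversal, is_dir_included prunes whole subtrees while is_file_included filters individual files. The following sketch (illustrative, not CocoIndex's actual traversal) walks an in-memory tree with a matcher implementing the same rules as the example above:

```python
# Sketch of matcher-driven traversal over an in-memory tree, where a dict maps
# names to sub-dicts (directories) or None (files). Illustrative only.
from pathlib import PurePath


class HiddenAwareMatcher:
    def is_dir_included(self, path: PurePath) -> bool:
        return not path.name.startswith(".")

    def is_file_included(self, path: PurePath) -> bool:
        return path.suffix in (".py", ".md")


def walk(tree: dict, matcher, prefix: PurePath = PurePath()) -> list[PurePath]:
    included: list[PurePath] = []
    for name, child in tree.items():
        path = prefix / name
        if child is None:  # a file: filter it
            if matcher.is_file_included(path):
                included.append(path)
        elif matcher.is_dir_included(path):  # a directory: prune if excluded
            included.extend(walk(child, matcher, path))
    return included
```

Note that an excluded directory is never descended into, so its files are skipped without being checked individually.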

PatternFilePathMatcher

A built-in FilePathMatcher implementation using globset patterns:

from cocoindex.resources.file import PatternFilePathMatcher

# Include only Python and Markdown files, exclude tests and hidden dirs
matcher = PatternFilePathMatcher(
    included_patterns=["**/*.py", "**/*.md"],
    excluded_patterns=["**/test_*", "**/.*"],
)

Parameters:

  • included_patterns — Glob patterns (globset syntax) for files to include. Use **/*.ext to match at any depth. If None, all files are included.
  • excluded_patterns — Glob patterns (globset syntax) for files/directories to exclude. Excluded directories are not traversed.
Note: Patterns use globset semantics: *.py matches only in the root directory; use **/*.py to match at any depth.

Vector Schema

The schema module (cocoindex.resources.schema) defines types that describe vector columns. CocoIndex connectors use these to automatically configure the correct column type (e.g., vector(384) in Postgres, fixed_size_list<float32>(384) in LanceDB).

VectorSchema

A frozen dataclass that describes a vector column's dtype and dimension.

from cocoindex.resources.schema import VectorSchema
import numpy as np

schema = VectorSchema(dtype=np.dtype(np.float32), size=768)

Fields:

  • dtype — NumPy dtype of each element (e.g., np.float32)
  • size — Number of dimensions in the vector (e.g., 384)
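Based on the column types quoted above (vector(384) in Postgres, fixed_size_list<float32>(384) in LanceDB), a connector's mapping from schema to column type can be sketched as follows. The helper names are hypothetical, and the dtype is simplified to a string (the real VectorSchema holds a NumPy dtype):

```python
# Illustrative sketch of mapping a vector schema to backend column types.
# DemoVectorSchema stands in for cocoindex.resources.schema.VectorSchema.
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoVectorSchema:
    dtype: str  # e.g. "float32" (simplified; the real field is a NumPy dtype)
    size: int


def pg_column_type(schema: DemoVectorSchema) -> str:
    # Postgres pgvector columns encode only the dimension.
    return f"vector({schema.size})"


def lancedb_column_type(schema: DemoVectorSchema) -> str:
    # LanceDB fixed-size lists encode element type and dimension.
    return f"fixed_size_list<{schema.dtype}>({schema.size})"
```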

You can construct VectorSchema directly when using a custom embedding model that doesn't implement VectorSchemaProvider:

import numpy as np

from cocoindex.resources.schema import VectorSchema

# For a custom CLIP model with known dimension
schema = VectorSchema(dtype=np.dtype(np.float32), size=768)

# Use it in a Qdrant vector definition
QDRANT_DB = coco.ContextKey[QdrantClient]("my_qdrant_db", tracked=False)
target_collection = await qdrant.mount_collection_target(
    QDRANT_DB,
    collection_name="image_search",
    schema=await qdrant.CollectionSchema.create(
        vectors=qdrant.QdrantVectorDef(schema=schema, distance="cosine")
    ),
)

VectorSchemaProvider

A protocol for objects that can provide vector schema information. The primary use case is as metadata in Annotated type annotations — connectors extract vector column configuration from the annotation automatically.

Any object that implements the __coco_vector_schema__() method satisfies this protocol. The built-in SentenceTransformerEmbedder implements it.
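Because the protocol is satisfied by duck typing, a custom embedder only needs to define that one method. A minimal sketch, where MyClipEmbedder and its fixed 768 dimension are hypothetical and DemoSchema stands in for cocoindex.resources.schema.VectorSchema:

```python
# Illustrative custom VectorSchemaProvider: any object with a
# __coco_vector_schema__() method satisfies the protocol.
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoSchema:  # stand-in for VectorSchema (the real one holds a NumPy dtype)
    dtype: str
    size: int


class MyClipEmbedder:
    """Hypothetical embedder with a known, fixed output dimension."""

    def __coco_vector_schema__(self) -> DemoSchema:
        # Report the column configuration connectors should use for outputs
        # of this embedder.
        return DemoSchema(dtype="float32", size=768)
```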

There are three ways to specify vector schema in annotations:

Using a ContextKey

Define a ContextKey for the embedder and use it as the annotation. The connector resolves the key at schema creation time. This is the recommended approach because the embedder is configured once in the lifespan and shared across all functions via context.

from dataclasses import dataclass
from typing import Annotated, AsyncIterator

import cocoindex as coco
from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
from numpy.typing import NDArray

EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
EMBEDDER = coco.ContextKey[SentenceTransformerEmbedder]("embedder")

@dataclass
class DocEmbedding:
    id: int
    text: str
    embedding: Annotated[NDArray, EMBEDDER]  # dimension resolved from context

# In lifespan, provide the embedder:
@coco.lifespan
async def coco_lifespan(builder: coco.EnvironmentBuilder) -> AsyncIterator[None]:
    builder.provide(EMBEDDER, SentenceTransformerEmbedder(EMBED_MODEL))
    yield

# In coco functions, access the embedder:
embedding = await coco.use_context(EMBEDDER).embed(text)

Using a VectorSchemaProvider instance

Pass an embedder instance directly as the annotation. Simpler for scripts where the embedder is a module-level constant.

from dataclasses import dataclass
from typing import Annotated

from cocoindex.ops.sentence_transformers import SentenceTransformerEmbedder
from numpy.typing import NDArray

embedder = SentenceTransformerEmbedder("sentence-transformers/all-MiniLM-L6-v2")

@dataclass
class DocEmbedding:
    id: int
    text: str
    embedding: Annotated[NDArray, embedder]  # dimension inferred from model (384)

Using a VectorSchema

Specify dimension and dtype explicitly. Useful when using a custom embedding model that doesn't implement VectorSchemaProvider.

from dataclasses import dataclass
from typing import Annotated

import numpy as np
from numpy.typing import NDArray

from cocoindex.resources.schema import VectorSchema

@dataclass
class ImageEmbedding:
    id: int
    embedding: Annotated[NDArray, VectorSchema(dtype=np.dtype(np.float32), size=768)]

When a connector's TableSchema.from_class() encounters an Annotated[NDArray, annotation] field, it resolves the annotation — unwrapping ContextKey if needed — and calls __coco_vector_schema__() to determine the column's dimension and dtype.
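The resolution step can be sketched as follows. This is illustrative, not the actual TableSchema.from_class implementation, and the ContextKey-unwrapping step is omitted; it relies only on standard typing introspection plus the duck-typed protocol method:

```python
# Sketch of resolving vector schemas from Annotated metadata.
from dataclasses import dataclass
from typing import Annotated, get_type_hints


def resolve_vector_fields(cls: type) -> dict[str, object]:
    """Map field name -> resolved vector schema for annotated vector columns."""
    schemas: dict[str, object] = {}
    for name, hint in get_type_hints(cls, include_extras=True).items():
        # Annotated[...] exposes its metadata tuple; plain hints have none.
        for meta in getattr(hint, "__metadata__", ()):
            if hasattr(meta, "__coco_vector_schema__"):
                schemas[name] = meta.__coco_vector_schema__()
    return schemas


# Demo with a stand-in schema object that resolves to itself:
@dataclass(frozen=True)
class DemoSchema:
    size: int

    def __coco_vector_schema__(self) -> "DemoSchema":
        return self


@dataclass
class Doc:
    id: int
    embedding: Annotated[list, DemoSchema(size=384)]
```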

MultiVectorSchema / MultiVectorSchemaProvider

Analogous types for multi-vector columns (e.g., ColBERT-style token-level embeddings). MultiVectorSchema wraps a VectorSchema describing the individual vectors. Used by connectors like Qdrant that support multi-vector storage.

import numpy as np

from cocoindex.resources.schema import MultiVectorSchema, VectorSchema

multi_schema = MultiVectorSchema(
    vector_schema=VectorSchema(dtype=np.dtype(np.float32), size=128)
)

ID Generation

The ID module (cocoindex.resources.id) provides utilities for generating stable unique IDs and UUIDs that persist across incremental updates.

Choosing the Right API

| API | Same dep produces... | Use when... |
| --- | --- | --- |
| generate_id(dep) | Same ID every time | Each unique input maps to exactly one ID |
| IdGenerator.next_id(dep) | Distinct ID each call | You need multiple IDs for potentially non-distinct inputs |

The same distinction applies to generate_uuid vs UuidGenerator.

generate_id / generate_uuid

Async functions that return the same ID/UUID for the same dep value. These are idempotent: calling multiple times with identical dep yields identical results.

from cocoindex.resources.id import generate_id, generate_uuid

async def process_item(item: Item) -> Row:
    # Same item.key always gets the same ID
    item_id = await generate_id(item.key)
    return Row(id=item_id, data=item.data)

async def process_document(doc: Document) -> Row:
    # Same doc.path always gets the same UUID
    doc_uuid = await generate_uuid(doc.path)
    return Row(id=doc_uuid, content=doc.content)

Parameters:

  • dep — Dependency value that determines the ID/UUID. The same dep always produces the same result within a component. Defaults to None.

Returns:

  • generate_id returns an int (IDs start from 1; 0 is reserved)
  • generate_uuid returns a uuid.UUID

IdGenerator / UuidGenerator

Classes that return a distinct ID/UUID on each call, even when called with the same dep value. The sequence is stable across runs.

Use these when you need multiple IDs for potentially non-distinct inputs, such as splitting text into chunks where chunks may have identical content but still need unique IDs.

from cocoindex.resources.id import IdGenerator, UuidGenerator

async def process_document(doc: Document) -> list[Row]:
    # Use doc.path to distinguish generators within the same processing component
    id_gen = IdGenerator(deps=doc.path)
    rows = []
    for chunk in split_into_chunks(doc.content):
        # Each call returns a distinct ID, even if chunks are identical
        chunk_id = await id_gen.next_id(chunk.content)
        rows.append(Row(id=chunk_id, content=chunk.content))
    return rows

async def process_with_uuids(doc: Document) -> list[Row]:
    # Use doc.path to distinguish generators within the same processing component
    uuid_gen = UuidGenerator(deps=doc.path)
    rows = []
    for chunk in split_into_chunks(doc.content):
        # Each call returns a distinct UUID, even if chunks are identical
        chunk_uuid = await uuid_gen.next_uuid(chunk.content)
        rows.append(Row(id=chunk_uuid, content=chunk.content))
    return rows

Constructor:

  • IdGenerator(deps=None) / UuidGenerator(deps=None) — Create a generator. The deps parameter distinguishes generators within the same processing component. Use distinct deps values for different generator instances.

Methods:

  • async IdGenerator.next_id(dep=None) — Generate the next unique integer ID (distinct on each call)
  • async UuidGenerator.next_uuid(dep=None) — Generate the next unique UUID (distinct on each call)
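The "distinct per call, stable across runs" behavior can be understood as each ID being a pure function of the generator's deps, the call index, and the per-call dep. A conceptual sketch (not CocoIndex's actual implementation):

```python
# Conceptual sketch of a stable ID sequence: hashing (deps, call index, dep)
# makes every call within a run distinct, while re-running the same sequence
# of calls reproduces the same IDs. Illustrative only.
import hashlib


class SketchIdGenerator:
    def __init__(self, deps=None):
        self._deps = repr(deps)
        self._index = 0  # incremented on every call, so repeated deps differ

    def next_id(self, dep=None) -> int:
        payload = f"{self._deps}|{self._index}|{dep!r}".encode()
        self._index += 1
        # Derive a 64-bit integer deterministically from the payload.
        return int.from_bytes(hashlib.blake2b(payload, digest_size=8).digest(), "big")
```

Two generators constructed with the same deps produce identical sequences, which is what keeps chunk IDs stable across incremental updates.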