Data Types in CocoIndex
In CocoIndex, all data processed by the flow have a type determined when the flow is defined, before any actual data is processed at runtime.
This makes the schema of the data processed by CocoIndex clear, and makes it easy to determine the schema of your index.
Data Types
CocoIndex is an engine written in Rust, designed to be used from different languages, and its data is always serializable. For these reasons, CocoIndex defines a type system independent of any specific programming language.
CocoIndex automatically infers data types of the output created by CocoIndex sources and functions. You don't need to spell out any data type explicitly when you define the flow. All you need to do is to make sure the data passed to functions and targets are compatible with them.
Each type in the CocoIndex type system is mapped to one or more types in Python. When you define a custom function, you need to annotate the data types of arguments and return values.
- When you pass a Python value to the engine (e.g. return values of a custom function), a specific type annotation is required. The type annotation needs to be specific in describing the target data type, as it provides the ground truth of the data type in the flow.
  This is critical because CocoIndex uses return type annotations to infer data types throughout the flow without processing any actual data. This enables:
  - Creating proper target schemas (e.g., vector indexes with fixed dimensions)
  - Type checking during flow definition
  - Clear documentation of data transformations
- When you use a Python variable to bind to an engine value (e.g. arguments of a custom function), the engine already knows the specific data type, so we don't require a specific type annotation. Type annotations can be omitted, or you can use Any at any level. When a specific type annotation is provided, it's still used as guidance to construct a Python value of a compatible type. Otherwise, we will bind to a default Python type.
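To illustrate how annotations alone can describe a function's types before any data is processed, the sketch below reads them with the Python standard library. It is not CocoIndex's actual inference code; word_count and describe_signature are hypothetical names used only for this illustration.

```python
import typing


def word_count(text: str) -> int:
    """A plain function standing in for a custom function."""
    return len(text.split())


def describe_signature(fn) -> dict[str, typing.Any]:
    """Return the annotated argument and return types of fn without calling it."""
    return typing.get_type_hints(fn)


# The function is never executed here: only its annotations are inspected.
hints = describe_signature(word_count)
print(hints)
```

The return annotation is the only information available at this point, which is why it has to be specific.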
Basic Types
Primitive Types
Primitive types are basic types that are not composed of other types. This is the list of all primitive types supported by CocoIndex:
| CocoIndex Type | Python Types | Convertible to | Explanation |
|---|---|---|---|
| Bytes | bytes | | |
| Str | str | | |
| Bool | bool | | |
| Int64 | cocoindex.Int64, int, numpy.int64 | | |
| Float32 | cocoindex.Float32, numpy.float32 | Float64 | |
| Float64 | cocoindex.Float64, float, numpy.float64 | | |
| Range | cocoindex.Range | | |
| Uuid | uuid.UUID | | |
| Date | datetime.date | | |
| Time | datetime.time | | |
| LocalDatetime | cocoindex.LocalDateTime | OffsetDatetime | without timezone |
| OffsetDatetime | cocoindex.OffsetDateTime, datetime.datetime | | with timezone |
| TimeDelta | datetime.timedelta | | |
Notes:
- For some CocoIndex types, we support multiple Python types. You can annotate with any of these Python types. The first one is the default type, i.e. CocoIndex will create a value with this type when a specific type annotation is not provided (e.g. for arguments of a custom function).
- All Python types starting with cocoindex. are type aliases exported by CocoIndex. They're annotated types based on certain Python types:
  - cocoindex.Int64: int
  - cocoindex.Float64: float
  - cocoindex.Float32: float
  - cocoindex.Range: tuple[int, int], i.e. a start offset (inclusive) and an end offset (exclusive)
  - cocoindex.OffsetDateTime: datetime.datetime
  - cocoindex.LocalDateTime: datetime.datetime

  These aliases provide an unambiguous way to represent a specific type in CocoIndex, given that their base Python types can represent a superset of possible values.
- When we say a CocoIndex type is convertible to another type, it means Python types for the second type can also be used to bind to a value of the first type.
  - For example, Float32 is convertible to Float64, so you can bind a value of Float32 to a Python value of float or numpy.float64 types.
  - For LocalDatetime, when you use cocoindex.OffsetDateTime or datetime.datetime as the annotation to bind its value, the timezone will be set to UTC.
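Two of the notes above can be made concrete with plain Python. The sketch below uses only the standard library and hypothetical sample values: it shows that a Range, being a (start, end) pair with an exclusive end, lines up with Python slicing, and what a timezone-less datetime looks like once UTC is attached. The actual binding is done by the engine; this only mimics the described result.

```python
import datetime

# A Range is a (start, end) pair: start inclusive, end exclusive.
text = "CocoIndex data types"
chunk_range = (4, 9)  # hypothetical offsets for illustration
start, end = chunk_range
chunk = text[start:end]  # slicing uses the same half-open convention
chunk_length = end - start

# A LocalDatetime carries no timezone. When bound with an OffsetDatetime
# annotation, the documented behavior is that the timezone is set to UTC.
local_value = datetime.datetime(2024, 1, 15, 9, 30)
bound_value = local_value.replace(tzinfo=datetime.timezone.utc)

print(chunk, chunk_length)
print(bound_value.isoformat())
```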
Json Type
The Json type can hold any data convertible to JSON by the json package.
In Python, it's represented by cocoindex.Json.
It's useful for holding data without a fixed schema known at flow definition time.
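As a rough check of what "convertible to JSON by the json package" means, the sketch below tries to serialize a value with json.dumps. The helper is_json_convertible and the sample record are hypothetical and not part of CocoIndex.

```python
import json


def is_json_convertible(value) -> bool:
    """Return True if the standard json package can serialize value."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return False
    return True


# Nested dicts, lists, strings, numbers, booleans and None are all fine.
record = {"title": "Intro", "tags": ["a", "b"], "meta": {"pages": 3, "draft": False, "reviewer": None}}
round_tripped = json.loads(json.dumps(record))

print(is_json_convertible(record))
print(is_json_convertible({1, 2, 3}))  # a set has no JSON representation
```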
Vector Types
A vector type is a collection of elements of the same basic type. Optionally, it can have a fixed dimension. It is written as Vector[Type] or Vector[Type, Dim], e.g. Vector[Float32] or Vector[Float32, 384].
When to specify vector dimension:
Specify the dimension in return type annotations if you plan to export the vector to a target, as most targets require a fixed vector dimension for creating vector indexes. For example, use cocoindex.Vector[cocoindex.Float32, typing.Literal[768]] for 768-dimensional embeddings.
It supports the following Python types:
- cocoindex.Vector[T] or cocoindex.Vector[T, typing.Literal[Dim]], e.g. cocoindex.Vector[cocoindex.Float32] or cocoindex.Vector[cocoindex.Float32, typing.Literal[384]]
  - The underlying Python type is numpy.typing.NDArray[T] where T is a numpy numeric type (numpy.int64, numpy.float32 or numpy.float64) or array type (numpy.typing.NDArray[T]), or list[T] otherwise
- numpy.typing.NDArray[T] where T is a numpy numeric type or array type
- list[T]
Example:
from typing import Literal
import cocoindex

# ✅ Good: Specify dimension for vectors that will be exported to targets
@cocoindex.op.function(behavior_version=1)
def embed_text(text: str) -> cocoindex.Vector[cocoindex.Float32, Literal[768]]:
    """Generate 768-dimensional embedding."""
    # ... embedding logic ...
    return embedding  # numpy array or list of 768 floats

# ⚠️ Works but less precise: Vector without dimension
@cocoindex.op.function(behavior_version=1)
def embed_text_no_dim(text: str) -> list[float]:
    """Generate embedding without dimension specification."""
    # ... embedding logic ...
    return embedding
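The dimension in an annotation is carried by typing.Literal, so it can be read without looking at any vector value. The sketch below shows this with the standard library only; EmbeddingDim and has_expected_dimension are hypothetical names, and the tiny dimension is chosen just to keep the example short.

```python
from typing import Literal, get_args

EmbeddingDim = Literal[4]  # hypothetical small dimension for illustration


def has_expected_dimension(vector: list[float], dim_literal) -> bool:
    """Check a vector's length against the dimension stored in a Literal type."""
    (expected,) = get_args(dim_literal)
    return len(vector) == expected


good = has_expected_dimension([0.1, 0.2, 0.3, 0.4], EmbeddingDim)
bad = has_expected_dimension([0.1, 0.2], EmbeddingDim)
print(good, bad)
```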
Union Types
A union type is a type that can represent values in one of multiple basic types. It is written as Type1 | Type2 | ..., e.g. Int64 | Float32 | Float64.
The Python type is T1 | T2 | ..., e.g. cocoindex.Int64 | cocoindex.Float32 | cocoindex.Float64, or int | float (equivalent to cocoindex.Int64 | cocoindex.Float64).
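On the Python side, a value of a union type is simply a value of one of its member types. The sketch below handles an int | float value with plain isinstance checks; the describe function and the labels it returns are hypothetical, and Union[int, float] is used only so the snippet also runs on Python versions older than 3.10.

```python
from typing import Union

Number = Union[int, float]  # same as `int | float` on Python 3.10+


def describe(value: Number) -> str:
    """Report which member of the union a value belongs to."""
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(value, bool):
        raise TypeError("bool is not a member of this union")
    if isinstance(value, int):
        return f"Int64: {value}"
    if isinstance(value, float):
        return f"Float64: {value}"
    raise TypeError(f"unsupported type: {type(value).__name__}")


print(describe(3))
print(describe(2.5))
```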
Struct Types
A Struct has a set of fields, each with a name and a type.
In Python, a Struct type is represented by either a dataclass, a NamedTuple, or a Pydantic model, with all fields annotated with a specific type. These options define a structured type with named fields, but they differ slightly:
- Dataclass: A flexible class-based structure, mutable by default, defined using the @dataclass decorator.
- NamedTuple: An immutable tuple-based structure, defined using typing.NamedTuple.
- Pydantic model: A modern data validation and parsing structure, defined by inheriting from pydantic.BaseModel. Make sure you have installed the pydantic package when using a Pydantic model.
For example:
from dataclasses import dataclass
from typing import NamedTuple
from pydantic import BaseModel  # requires `pydantic` package to be installed
import datetime

# Using dataclass
@dataclass
class Person:
    first_name: str
    last_name: str
    dob: datetime.date

# Using NamedTuple
class PersonTuple(NamedTuple):
    first_name: str
    last_name: str
    dob: datetime.date

# Using Pydantic
class PersonModel(BaseModel):
    first_name: str
    last_name: str
    dob: datetime.date
All three examples (Person, PersonTuple, and PersonModel) are valid Struct types in CocoIndex, with identical schemas (three fields: first_name (Str), last_name (Str), dob (Date)).
Choose dataclass for mutable objects, NamedTuple for immutable lightweight structures, or Pydantic for data validation and serialization features.
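To see that these definitions describe the same fields, the sketch below compares the annotated fields of a dataclass and a NamedTuple using only the standard library. The Pydantic variant is left out so the snippet has no third-party dependency, and field_schema is a hypothetical helper, not a CocoIndex API.

```python
import dataclasses
import datetime
import typing


@dataclasses.dataclass
class Person:
    first_name: str
    last_name: str
    dob: datetime.date


class PersonTuple(typing.NamedTuple):
    first_name: str
    last_name: str
    dob: datetime.date


def field_schema(struct_type) -> list[tuple[str, typing.Any]]:
    """List (field name, annotated type) pairs in declaration order."""
    return list(typing.get_type_hints(struct_type).items())


print(field_schema(Person))
print(field_schema(Person) == field_schema(PersonTuple))
```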
Type annotations for Struct:
- For return values: Must use a specific Struct type (dataclass, NamedTuple, or Pydantic model)
- For arguments: Can use dict[str, Any] or Any instead of a specific Struct type. dict[str, Any] is the default binding if you don't annotate the function argument with a specific type.
Example:
from dataclasses import dataclass
from typing import Any
import datetime
import cocoindex

@dataclass
class Person:
    first_name: str
    last_name: str
    dob: datetime.date

# ✅ Good: Specific return type, relaxed argument type
@cocoindex.op.function(behavior_version=1)
def process_person(person_data: dict[str, Any]) -> Person:
    """Argument can use dict[str, Any], return must be specific Struct."""
    return Person(
        first_name=person_data["first_name"],
        last_name=person_data["last_name"],
        dob=person_data["dob"],
    )

# ❌ Wrong: Return type is not a valid specific CocoIndex type
# @cocoindex.op.function(behavior_version=1)
# def bad_example(person: Person) -> dict[str, str]:
#     return {"name": person.first_name}  # dict[str, str] is not a CocoIndex type
Table Types
A Table type models a collection of rows, each with multiple columns. Each column of a table has a specific type.
There are two kinds of Table types: KTable and LTable.
KTable
KTable is a Table type in which one or more columns together serve as the key. The row order of a KTable is not preserved. Each key column must be a key type. When multiple key columns are present, they form a composite key.
In Python, a KTable type is represented by dict[K, V].
K represents the key and V represents the value for each row:
- K (key type) can be:
  - A primitive key type (e.g., str, int) for single-part keys
  - An immutable Struct type (frozen dataclass or NamedTuple) for multi-part composite keys
- V (value type) must be a Struct type representing the non-key value fields of each row
  - For return values: Must use a specific Struct type (dataclass, NamedTuple, or Pydantic model)
  - For arguments: Can use dict[str, Any] or Any
When a specific type annotation is not provided:
- For composite keys (multiple key parts), the key binds to a Python tuple of the key parts, e.g. tuple[str, str].
- For a single basic key part, the key binds to that basic Python type.
- The value binds to dict[str, Any].
For example, you can use dict[str, Person], dict[str, PersonTuple], or dict[str, PersonModel] to represent a KTable, with 4 columns: key (Str), first_name (Str), last_name (Str), dob (Date).
It's bound to dict[str, dict[str, Any]] if you don't annotate the function argument with a specific type.
Note that when using a Struct as the key, it must be immutable in Python. For a dataclass, annotate it with @dataclass(frozen=True). For NamedTuple, immutability is built-in. For Pydantic models, use frozen=True in the model configuration. For example:
from dataclasses import dataclass
from typing import NamedTuple

@dataclass(frozen=True)
class PersonKey:
    id_kind: str
    id: str

class PersonKeyTuple(NamedTuple):
    id_kind: str
    id: str

# Pydantic frozen model (if available)
try:
    from pydantic import BaseModel

    class PersonKeyModel(BaseModel):
        model_config = {"frozen": True}
        id_kind: str
        id: str
except ImportError:
    pass
Then you can use dict[PersonKey, Person], dict[PersonKeyTuple, PersonTuple], or dict[PersonKeyModel, PersonModel] to represent a KTable keyed by both id_kind and id.
If you don't annotate the function argument with a specific type, it's bound to dict[tuple[str, str], dict[str, Any]].
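The immutability requirement follows from how Python dicts work: keys must be hashable. The sketch below builds a small dict[PersonKey, Person] with hypothetical sample data, then shows that a frozen dataclass can be hashed while a default (mutable) dataclass cannot. MutableKey exists only to demonstrate the failure.

```python
from dataclasses import dataclass
import datetime


@dataclass(frozen=True)
class PersonKey:
    id_kind: str
    id: str


@dataclass
class Person:
    first_name: str
    last_name: str
    dob: datetime.date


# A KTable-shaped value: composite key -> non-key fields of the row.
people: dict[PersonKey, Person] = {
    PersonKey("passport", "X123"): Person("Jane", "Doe", datetime.date(1990, 5, 17)),
}

# An equal key built separately finds the same row.
found = people[PersonKey("passport", "X123")]


@dataclass
class MutableKey:  # not frozen, so Python makes it unhashable
    id_kind: str
    id: str


try:
    hash(MutableKey("passport", "X123"))
    mutable_key_hashable = True
except TypeError:
    mutable_key_hashable = False

print(found.first_name, mutable_key_hashable)
```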
LTable
LTable is a Table type whose row order is preserved. LTable has no key column.
In Python, an LTable type is represented by list[R], where R must be a Struct type representing the value fields of each row:
- For return values: Must use a specific Struct type (e.g., list[Person], list[PersonTuple], or list[PersonModel])
- For arguments: Can use list[dict[str, Any]] or list[Any]. Defaults to list[dict[str, Any]] if you don't annotate the function argument.
For example, list[Person] represents an LTable with 3 columns: first_name (Str), last_name (Str), dob (Date).
Example:
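from dataclasses import dataclass
from typing import Any
import datetime
import cocoindex

@dataclass
class Person:
    first_name: str
    last_name: str
    dob: datetime.date

def _age(dob: datetime.date, today: datetime.date) -> int:
    """Whole years elapsed between dob and today."""
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

# ✅ Good: Return type specifies list of specific Struct
@cocoindex.op.function(behavior_version=1)
def filter_adults(people: list[Any]) -> list[Person]:
    """Filter people - argument relaxed, return type specific."""
    today = datetime.date.today()
    # Person has no age field, so the age is derived from dob.
    # Rows are assumed to arrive as dicts, the default binding for a Struct.
    return [Person(**p) for p in people if _age(p["dob"], today) >= 18]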
Key Types
Currently, the following types are key types:
- Bytes
- Str
- Bool
- Int64
- Range
- Uuid
- Date
- Struct with all fields being key types (using @dataclass(frozen=True) or NamedTuple)
None Values
CocoIndex supports None values. A None value represents the absence of data or an unknown value, distinct from empty strings, zero numbers, or false boolean values.
Optional Type
For any data (e.g. a field of a Struct, an argument or return value of a CocoIndex function), if it is optional, it means its value can be None. We use Optional[T] to indicate an optional type, e.g. Optional[Str], Optional[Person].
In Python, None is represented as None, so an optional type can be represented by T | None or typing.Optional[T].
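A short sketch of an optional field, using only the standard library. Contact and display_name are hypothetical names; the point is that None (no value) is handled differently from an empty string (a known, empty value). Optional[str] is used here so the snippet also runs on Python versions older than 3.10, where str | None is not available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contact:
    name: str
    nickname: Optional[str] = None  # equivalent to `str | None`


def display_name(contact: Contact) -> str:
    """Prefer the nickname when one is known, even if it is empty."""
    if contact.nickname is not None:
        return contact.nickname
    return contact.name


print(display_name(Contact("Jane")))
print(repr(display_name(Contact("Jane", ""))))
```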
None propagating on CocoIndex functions
A function may specify whether each input argument is optional or not. A non-optional argument means the function needs a known value for that argument in order to work. However, this doesn't prevent the argument from being None at runtime. When a non-optional argument receives a None value, the function execution is skipped and the result is None.
For example, for the SplitRecursively function, the text and chunk_size arguments are not optional. If the input value of either of them is None, the function will return None.
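The propagation rule can be imitated in plain Python with a decorator. This is only a sketch of the described behavior, not how the engine implements it; skip_on_none, _accepts_none and shout are hypothetical names.

```python
import functools
import inspect
import typing
from typing import Optional


def _accepts_none(hint) -> bool:
    """True if the annotation is a union that includes None, e.g. Optional[str]."""
    return type(None) in typing.get_args(hint)


def skip_on_none(fn):
    """Return None without calling fn when a non-optional argument is None."""
    hints = typing.get_type_hints(fn)
    signature = inspect.signature(fn)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        for name, value in bound.arguments.items():
            if value is None and not _accepts_none(hints.get(name)):
                return None  # skipped: a required input is missing
        return fn(*args, **kwargs)

    return wrapper


@skip_on_none
def shout(text: str, suffix: Optional[str] = None) -> str:
    return text.upper() + (suffix or "")


print(shout("hi", "!"))
print(shout("hi"))   # suffix is optional, so None is passed through
print(shout(None))   # text is not optional, so the call is skipped
```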