Data Types in CocoIndex
In CocoIndex, all data processed by the flow have a type determined when the flow is defined, before any actual data is processed at runtime.
This makes schema of data processed by CocoIndex clear, and easily determine the schema of your index.
Data Types
As an engine written in Rust, designed to be used in different languages and data are always serializable, CocoIndex defines a type system independent of any specific programming language.
CocoIndex automatically infers data types of the output created by CocoIndex sources and functions. You don't need to spell out any data type explicitly when you define the flow. All you need to do is to make sure the data passed to functions and targets are compatible with them.
Each type in CocoIndex type system is mapped to one or multiple types in Python. When you define a custom function, you need to annotate the data types of arguments and return values.
- For return values, type annotation is required. Because this provides the ground truth to define the type of the output of the custom function.
- For arguments, type annotation is only used to enable the conversion from data values already existing in CocoIndex engine to Python value. Type annotation is optional for basic types. When not specified, CocoIndex will use the default Python type for the argument. Type annotation is required for arguments of struct types and table types.
Basic Types
Primitive Types
Primitive types are basic types that are not composed of other types. This is the list of all primitive types supported by CocoIndex:
CocoIndex Type | Python Types | Convertible to | Explanation |
---|---|---|---|
Bytes | bytes | ||
Str | str | ||
Bool | bool | ||
Int64 | cocoindex.Int64 , int , numpy.int64 | ||
Float32 | cocoindex.Float32 , numpy.float32 | Float64 | |
Float64 | cocoindex.Float64 , float , numpy.float64 | ||
Range | cocoindex.Range | ||
Uuid | uuid.UUId | ||
Date | datetime.date | ||
Time | datetime.time | ||
LocalDatetime | cocoindex.LocalDateTime | OffsetDatetime | without timezone |
OffsetDatetime | cocoindex.OffsetDateTime , datetime.datetime | with timezone | |
TimeDelta | datetime.timedelta |
Notes:
-
For some CocoIndex types, we support multiple Python types. You can annotate with any of these Python types. The first one is the default Python type, which means CocoIndex will create a value with this type when you don't annotate the type in function arguments.
-
All Python types starting with
cocoindex.
are type aliases exported by CocoIndex. They're annotated types based on certain Python types:cocoindex.Int64
:int
cocoindex.Float64
:float
cocoindex.Float32
:float
cocoindex.Range
:tuple[int, int]
, i.e. a start offset (inclusive) and an end offset (exclusive)cocoindex.OffsetDateTime
:datetime.datetime
cocoindex.LocalDateTime
:datetime.datetime
These aliases provide a non-ambiguous way to represent a specific type in CocoIndex, given their base Python types can represent a superset of possible values.
-
When we say a CocoIndex type is convertible to another type, it means Python types for the second type can be also used to bind to a value of the first type. For example, Float32 is convertible to Float64, so you can bind a value of Float32 to a Python value of
float
ornp.float64
types. For LocalDatetime, when you usecocoindex.OffsetDateTime
ordatetime.datetime
as the annotation to bind its value, the timezone will be set to UTC.
Json Type
Json type can hold any data convertible to JSON by json
package.
In Python, it's represented by cocoindex.Json
.
It's useful to hold data without fixed schema known at flow definition time.
Vector Types
A vector type is a collection of elements of the same basic type. Optionally, it can have a fixed dimension. Noted as Vector[Type] or Vector[Type, Dim], e.g. Vector[Float32] or Vector[Float32, 384].
It supports the following Python types:
cocoindex.Vector[T]
orcocoindex.Vector[T, typing.Literal[Dim]]
, e.g.cocoindex.Vector[cocoindex.Float32]
orcocoindex.Vector[cocoindex.Float32, 384]
- The underlying Python type is
numpy.typing.NDArray[T]
whereT
is a numpy numeric type (numpy.int64
,numpy.float32
ornumpy.float64
), orlist[T]
otherwise
- The underlying Python type is
numpy.typing.NDArray[T]
whereT
is a numpy numeric typelist[T]
Union Types
A union type is a type that can represent values in one of multiple basic types. Noted as Type1 | Type2 | ..., e.g. Int64 | Float32 | Float64.
The Python type is T1 | T2 | ...
, e.g. cocoindex.Int64 | cocoindex.Float32 | cocoindex.Float64
, int | float
(equivalent to cocoindex.Int64 | cocoindex.Float64
)
Struct Types
A Struct has a bunch of fields, each with a name and a type.
In Python, a Struct type is represented by either a dataclass or a NamedTuple, with all fields annotated with a specific type. Both options define a structured type with named fields, but they differ slightly:
- Dataclass: A flexible class-based structure, mutable by default, defined using the
@dataclass
decorator. - NamedTuple: An immutable tuple-based structure, defined using
typing.NamedTuple
.
For example:
from dataclasses import dataclass
from typing import NamedTuple
import datetime
# Using dataclass
@dataclass
class Person:
first_name: str
last_name: str
dob: datetime.date
# Using NamedTuple
class PersonTuple(NamedTuple):
first_name: str
last_name: str
dob: datetime.date
Both Person
and PersonTuple
are valid Struct types in CocoIndex, with identical schemas (three fields: first_name
(Str), last_name
(Str), dob
(Date)).
Choose dataclass
for mutable objects or when you need additional methods, and NamedTuple
for immutable, lightweight structures.
Table Types
A Table type models a collection of rows, each with multiple columns. Each column of a table has a specific type.
We have two specific types of Table types: KTable and LTable.
KTable
KTable is a Table type whose first column serves as the key. The row order of a KTable is not preserved. Type of the first column (key column) must be a key type.
In Python, a KTable type is represented by dict[K, V]
.
The V
should be a Struct type, either a dataclass
or NamedTuple
, representing the value fields of each row.
For example, you can use dict[str, Person]
or dict[str, PersonTuple]
to represent a KTable, with 4 columns: key (Str), first_name
(Str), last_name
(Str), dob
(Date).
Note that if you want to use a Struct as the key, you need to ensure its value in Python is immutable. For dataclass
, annotate it with @dataclass(frozen=True)
. For NamedTuple
, immutability is built-in. For example:
For example:
@dataclass(frozen=True)
class PersonKey:
id_kind: str
id: str
class PersonKeyTuple(NamedTuple):
id_kind: str
id: str
Then you can use dict[PersonKey, Person]
or dict[PersonKeyTuple, PersonTuple]
to represent a KTable keyed by PersonKey
or PersonKeyTuple
.
LTable
LTable is a Table type whose row order is preserved. LTable has no key column.
In Python, a LTable type is represented by list[R]
, where R
is a dataclass representing a row.
For example, you can use list[Person]
to represent a LTable with 3 columns: first_name
(Str), last_name
(Str), dob
(Date).
Key Types
Currently, the following types are key types
- Bytes
- Str
- Bool
- Int64
- Range
- Uuid
- Date
- Struct with all fields being key types (using
@dataclass(frozen=True)
orNamedTuple
)