Data Types in CocoIndex
In CocoIndex, all data processed by the flow have a type determined when the flow is defined, before any actual data is processed at runtime.
This makes schema of data processed by CocoIndex clear, and easily determine the schema of your index.
Data Types
You don't need to spell out data types in CocoIndex, when you define the flow using existing operations (source, function, etc). These operations decide data types of fields produced by them based on the spec and input data types. All you need to do is to make sure the data passed to functions and storage targets are accepted by them.
When you define custom functions, you need to specify the data types of arguments and return values.
Basic Types
This is the list of all basic types supported by CocoIndex:
Type | Description | Specific Python Type | Native Python Type |
---|---|---|---|
Bytes | bytes | bytes | |
Str | str | str | |
Bool | bool | bool | |
Int64 | int | int | |
Float32 | cocoindex.Float32 | float | |
Float64 | cocoindex.Float64 | float | |
Range | cocoindex.Range | tuple[int, int] | |
Uuid | uuid.UUId | uuid.UUID | |
Date | datetime.date | datetime.date | |
Time | datetime.time | datetime.time | |
LocalDatetime | Date and time without timezone | cocoindex.LocalDateTime | datetime.datetime |
OffsetDatetime | Date and time with a timezone offset | cocoindex.OffsetDateTime | datetime.datetime |
TimeDelta | A duration of time | datetime.timedelta | datetime.timedelta |
Json | cocoindex.Json | Any data convertible to JSON by json package | |
Vector[T, Dim?] | T can be a basic type or a numeric type. Dim is a positive integer and optional. | cocoindex.Vector[T] or cocoindex.Vector[T, Dim] | numpy.typing.NDArray[T] or list[T] |
Values of all data types can be represented by values in Python's native types (as described under the Native Python Type column). However, the underlying execution engine and some storage system (like Postgres) has finer distinctions for some types, specifically:
- Float32 and Float64 for
float
, with different precision. - LocalDateTime and OffsetDateTime for
datetime.datetime
, with different timezone awareness. - Range and Json provide a clear tag for the type, to clearly distinguish the type in CocoIndex.
- Vector holds elements of type T. If T is numeric (e.g.,
np.float32
ornp.float64
), it's represented asNDArray[T]
; otherwise, aslist[T]
. - Vector also has optional dimension information.
The native Python type is always more permissive and can represent a superset of possible values.
- Only when you annotate the return type of a custom function, you should use the specific type, so that CocoIndex will have information about the precise type to be used in the execution engine and storage system.
- For all other purposes, e.g. to provide annotation for argument types of a custom function, or used internally in your custom function, you can choose whatever to use. The native Python type is usually simpler.
Struct Types
A Struct has a bunch of fields, each with a name and a type.
In Python, a Struct type is represented by either a dataclass or a NamedTuple, with all fields annotated with a specific type. Both options define a structured type with named fields, but they differ slightly:
- Dataclass: A flexible class-based structure, mutable by default, defined using the
@dataclass
decorator. - NamedTuple: An immutable tuple-based structure, defined using
typing.NamedTuple
.
For example:
from dataclasses import dataclass
from typing import NamedTuple
import datetime
# Using dataclass
@dataclass
class Person:
first_name: str
last_name: str
dob: datetime.date
# Using NamedTuple
class PersonTuple(NamedTuple):
first_name: str
last_name: str
dob: datetime.date
Both Person
and PersonTuple
are valid Struct types in CocoIndex, with identical schemas (three fields: first_name
(Str), last_name
(Str), dob
(Date)).
Choose dataclass
for mutable objects or when you need additional methods, and NamedTuple
for immutable, lightweight structures.
Table Types
A Table type models a collection of rows, each with multiple columns. Each column of a table has a specific type.
We have two specific types of Table types: KTable and LTable.
KTable
KTable is a Table type whose first column serves as the key. The row order of a KTable is not preserved. Type of the first column (key column) must be a key type.
In Python, a KTable type is represented by dict[K, V]
.
The V
should be a struct type, either a dataclass
or NamedTuple
, representing the value fields of each row.
For example, you can use dict[str, Person]
or dict[str, PersonTuple]
to represent a KTable, with 4 columns: key (Str), first_name
(Str), last_name
(Str), dob
(Date).
Note that if you want to use a struct as the key, you need to ensure the struct is immutable. For dataclass
, annotate it with @dataclass(frozen=True)
. For NamedTuple
, immutability is built-in.
For example:
@dataclass(frozen=True)
class PersonKey:
id_kind: str
id: str
class PersonKeyTuple(NamedTuple):
id_kind: str
id: str
Then you can use dict[PersonKey, Person]
or dict[PersonKeyTuple, PersonTuple]
to represent a KTable keyed by PersonKey
or PersonKeyTuple
.
LTable
LTable is a Table type whose row order is preserved. LTable has no key column.
In Python, a LTable type is represented by list[R]
, where R
is a dataclass representing a row.
For example, you can use list[Person]
to represent a LTable with 3 columns: first_name
(Str), last_name
(Str), dob
(Date).
Key Types
Currently, the following types are key types
- Bytes
- Str
- Bool
- Int64
- Range
- Uuid
- Date
- Struct with all fields being key types (using
@dataclass(frozen=True)
orNamedTuple
)