Skip to main content

Data Types in CocoIndex

In CocoIndex, all data processed by the flow have a type determined when the flow is defined, before any actual data is processed at runtime.

This makes schema of data processed by CocoIndex clear, and easily determine the schema of your index.

Data Types

You don't need to spell out data types in CocoIndex, when you define the flow using existing operations (source, function, etc). These operations decide data types of fields produced by them based on the spec and input data types. All you need to do is to make sure the data passed to functions and storage targets are accepted by them.

When you define custom functions, you need to specify the data types of arguments and return values.

Basic Types

This is the list of all basic types supported by CocoIndex:

TypeDescriptionSpecific Python TypeNative Python Type
Bytesbytesbytes
Strstrstr
Boolboolbool
Int64intint
Float32cocoindex.Float32float
Float64cocoindex.Float64float
Rangecocoindex.Rangetuple[int, int]
Uuiduuid.UUIduuid.UUID
Datedatetime.datedatetime.date
Timedatetime.timedatetime.time
LocalDatetimeDate and time without timezonecocoindex.LocalDateTimedatetime.datetime
OffsetDatetimeDate and time with a timezone offsetcocoindex.OffsetDateTimedatetime.datetime
TimeDeltaA duration of timedatetime.timedeltadatetime.timedelta
Jsoncocoindex.JsonAny data convertible to JSON by json package
Vector[T, Dim?]T can be a basic type or a numeric type. Dim is a positive integer and optional.cocoindex.Vector[T] or cocoindex.Vector[T, Dim]numpy.typing.NDArray[T] or list[T]

Values of all data types can be represented by values in Python's native types (as described under the Native Python Type column). However, the underlying execution engine and some storage system (like Postgres) has finer distinctions for some types, specifically:

  • Float32 and Float64 for float, with different precision.
  • LocalDateTime and OffsetDateTime for datetime.datetime, with different timezone awareness.
  • Range and Json provide a clear tag for the type, to clearly distinguish the type in CocoIndex.
  • Vector holds elements of type T. If T is numeric (e.g., np.float32 or np.float64), it's represented as NDArray[T]; otherwise, as list[T].
  • Vector also has optional dimension information.

The native Python type is always more permissive and can represent a superset of possible values.

  • Only when you annotate the return type of a custom function, you should use the specific type, so that CocoIndex will have information about the precise type to be used in the execution engine and storage system.
  • For all other purposes, e.g. to provide annotation for argument types of a custom function, or used internally in your custom function, you can choose whatever to use. The native Python type is usually simpler.

Struct Types

A Struct has a bunch of fields, each with a name and a type.

In Python, a Struct type is represented by either a dataclass or a NamedTuple, with all fields annotated with a specific type. Both options define a structured type with named fields, but they differ slightly:

  • Dataclass: A flexible class-based structure, mutable by default, defined using the @dataclass decorator.
  • NamedTuple: An immutable tuple-based structure, defined using typing.NamedTuple.

For example:

from dataclasses import dataclass
from typing import NamedTuple
import datetime

# Using dataclass
@dataclass
class Person:
first_name: str
last_name: str
dob: datetime.date

# Using NamedTuple
class PersonTuple(NamedTuple):
first_name: str
last_name: str
dob: datetime.date

Both Person and PersonTuple are valid Struct types in CocoIndex, with identical schemas (three fields: first_name (Str), last_name (Str), dob (Date)). Choose dataclass for mutable objects or when you need additional methods, and NamedTuple for immutable, lightweight structures.

Table Types

A Table type models a collection of rows, each with multiple columns. Each column of a table has a specific type.

We have two specific types of Table types: KTable and LTable.

KTable

KTable is a Table type whose first column serves as the key. The row order of a KTable is not preserved. Type of the first column (key column) must be a key type.

In Python, a KTable type is represented by dict[K, V]. The V should be a struct type, either a dataclass or NamedTuple, representing the value fields of each row. For example, you can use dict[str, Person] or dict[str, PersonTuple] to represent a KTable, with 4 columns: key (Str), first_name (Str), last_name (Str), dob (Date).

Note that if you want to use a struct as the key, you need to ensure the struct is immutable. For dataclass, annotate it with @dataclass(frozen=True). For NamedTuple, immutability is built-in. For example:

@dataclass(frozen=True)
class PersonKey:
id_kind: str
id: str

class PersonKeyTuple(NamedTuple):
id_kind: str
id: str

Then you can use dict[PersonKey, Person] or dict[PersonKeyTuple, PersonTuple] to represent a KTable keyed by PersonKey or PersonKeyTuple.

LTable

LTable is a Table type whose row order is preserved. LTable has no key column.

In Python, a LTable type is represented by list[R], where R is a dataclass representing a row. For example, you can use list[Person] to represent a LTable with 3 columns: first_name (Str), last_name (Str), dob (Date).

Key Types

Currently, the following types are key types

  • Bytes
  • Str
  • Bool
  • Int64
  • Range
  • Uuid
  • Date
  • Struct with all fields being key types (using @dataclass(frozen=True) or NamedTuple)