From pickle to type-guided deserialization: safer Python serialization

How CocoIndex evolved from pickle to a type-guided serialization system that uses Python type hints to automatically choose the right serializer — no decorators or registration needed.

CocoIndex is a framework for building incremental data pipelines. It has a Rust core for performance and exposes a Python SDK for users to define their pipelines. Under the hood, the engine needs to serialize and deserialize Python objects constantly — caching function results, persisting pipeline state, tracking records for change detection — with the serialized data crossing the Rust/Python boundary and stored by the Rust core.

In CocoIndex v0, we had a closed type system: a fixed set of supported data types, with serialization handled entirely in Rust. Safe and fast, but too rigid. Users couldn’t use their own data types without converting them into our type system first.

When we started building v1, we wanted to natively support any Python data type. Pickle was the obvious first choice.

Pickle: Fast to Prototype, Painful to Live With

Pickle is Python’s built-in serialization module. It handles virtually any Python object out of the box: dataclasses, NamedTuples, Pydantic models, nested containers, custom classes. For early prototyping during the prerelease phase, it was perfect: one line of code, and everything just worked.

But pickle has two fundamental problems that make it unsuitable for production.

Security. Pickle executes arbitrary code during deserialization. A crafted payload can run any Python code on your machine. In CocoIndex’s case, the serialized data lives in internal storage (a database managed by the framework), so an attack requires access to that storage, which is unlikely for most deployments. Still, defense in depth matters: if the storage is ever exposed through a misconfiguration or a compromised database, pickle becomes an escalation path from data access to code execution. We didn’t want that risk in the foundation of our persistence layer.

[Figure: Pickle escalation path. A crafted blob in storage becomes arbitrary code execution when CocoIndex calls pickle.loads.]
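To make the escalation path concrete, here is a minimal and deliberately harmless sketch. The `Payload` class is hypothetical; it uses `__reduce__` to tell pickle which callable to invoke at load time. A real attacker would pick something far less benign than `os.getcwd`.

```python
import os
import pickle


class Payload:
    """Hypothetical attacker-controlled object."""

    def __reduce__(self):
        # pickle stores "call os.getcwd() with no arguments" in the blob.
        # The callable is chosen by whoever wrote the blob, not by the reader.
        return (os.getcwd, ())


blob = pickle.dumps(Payload())

# Loading the blob runs the callable named inside it.
result = pickle.loads(blob)

# We did not get a Payload back; we got the return value of os.getcwd().
print(type(result).__name__, result == os.getcwd())
```

The reader never sees a `Payload` instance. Whatever function the blob names simply runs inside the process that calls `pickle.loads`.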

Overhead. Pickle encodes type information into every value — each value is a sequence of opcodes that tells the deserializer what type to reconstruct and how. This makes pickle self-describing, but at a cost. The integer 1024 takes 15 bytes in pickle, while msgpack encodes it in just 3. Msgpack can be this compact because it relies on the deserializer already knowing the expected type:

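| Value | pickle | msgpack | Ratio |
| --- | --- | --- | --- |
| "" (empty string) | 15 bytes | 1 byte | 15x |
| 1024 | 15 bytes | 3 bytes | 5x |
| "hello" | 20 bytes | 6 bytes | 3.3x |
| {"key": [1,2,3]} | 31 bytes | 9 bytes | 3.4x |
| Point(42, "origin") | 63 bytes | 16 bytes | 3.9x |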

For a pipeline that caches thousands of intermediate results, this overhead adds up.
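The following sketch shows where the pickle bytes go, using only the standard library. It assumes pickle protocol 4. The msgpack bytes are hand-encoded from the msgpack format (a `uint 16` marker byte and a `fixstr` length prefix) rather than produced by a msgpack library, so treat them as an illustration of the layout.

```python
import pickle
import pickletools
import struct

# Pickle: protocol header + frame header + opcode stream + STOP.
blob = pickle.dumps(1024, protocol=4)
print(len(blob))        # total size in bytes
pickletools.dis(blob)   # prints the opcodes: PROTO, FRAME, BININT2, STOP

# msgpack, encoded by hand for illustration:
# 0xCD marks a big-endian unsigned 16-bit integer.
msgpack_1024 = struct.pack(">BH", 0xCD, 1024)
# A short string is one byte (0xA0 | length) followed by the UTF-8 bytes.
msgpack_hello = bytes([0xA0 | 5]) + b"hello"

print(len(msgpack_1024), len(msgpack_hello))
```

Most of the pickle payload for a small value is fixed overhead (protocol and frame headers), which is why the ratio is worst for the smallest values.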

Whitelisted Pickle: Secure but Tedious

Our first attempt to fix the security problem was a restricted unpickler. We maintained an allowlist of types that were safe to deserialize, and rejected everything else. Users could opt their types into the allowlist with a @coco.unpickle_safe decorator:

python
@coco.unpickle_safe
@dataclass
class DocumentChunk:
    text: str
    embedding: list[float]

This solved the security problem. But it introduced a new one: annotation burden.

Every type that appeared anywhere in a serialized object graph needed the decorator — not just the top-level type, but every type nested inside it. Miss one, and you get a runtime error.
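A restricted unpickler of this kind can be sketched with the standard library alone. This is not CocoIndex's implementation: the `unpickle_safe` function below is a hypothetical stand-in for the decorator, and standard-library classes stand in for user-defined types. It shows why a nested type that is missing from the allowlist fails at load time.

```python
import collections
import datetime
import io
import pickle

# Allowlist of (module, qualified name) pairs that may be reconstructed.
_ALLOWED: set[tuple[str, str]] = set()


def unpickle_safe(cls):
    """Hypothetical stand-in for @coco.unpickle_safe: allowlist a class."""
    _ALLOWED.add((cls.__module__, cls.__qualname__))
    return cls


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every global the payload references, nested or not.
        if (module, name) not in _ALLOWED:
            raise pickle.UnpicklingError(f"{module}.{name} is not allowlisted")
        return super().find_class(module, name)


def safe_loads(blob: bytes):
    return RestrictedUnpickler(io.BytesIO(blob)).load()


# Only the outer type is allowlisted; the nested date is not.
unpickle_safe(collections.OrderedDict)
blob = pickle.dumps(collections.OrderedDict(day=datetime.date(2024, 1, 15)))

try:
    safe_loads(blob)
    error = None
except pickle.UnpicklingError as exc:
    error = str(exc)
print(error)

# After allowlisting the nested type as well, the same blob loads.
unpickle_safe(datetime.date)
restored = safe_loads(blob)
```

Each missing nested type surfaces only when a payload containing it is loaded, which is what produces the run-fail-annotate loop.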

We experienced this firsthand when building the conversation_to_knowledge example. It needed 9 @coco.unpickle_safe annotations across the codebase. Getting there took 4 rounds of run-fail-annotate-repeat: run the pipeline, hit an error about an unregistered type, add the decorator, run again, hit the next error. Four times.

For our own example. Imagine what this is like for users building real pipelines with dozens of model types.

Type-Guided Deserialization: Letting Type Hints Do the Work

We stepped back and asked: what do we actually have that describes the shape of users’ data? The answer was already in their code — type hints.

In CocoIndex, there are three places where data gets serialized and deserialized: memoized function return values (the most common case), memo state values for change detection, and target tracking records. For memoized return values, the type hint is right there in the function signature:

python
@coco.fn(memo=True)
async def embed_chunk(chunk: str) -> list[float]:
    return await embedding_model.embed(chunk)

The return type list[float] tells us exactly how to deserialize the cached result. No extra annotation needed — the type hint that users already write for readability and IDE support doubles as the deserialization schema.
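As a rough illustration of how a decorator can pick up that hint, the sketch below uses `typing.get_type_hints`. The `memo` decorator and its `return_type` attribute are hypothetical names for this example, not CocoIndex's API, and the function body is a placeholder for a real embedding call.

```python
import typing


def memo(fn):
    """Hypothetical decorator that records the declared return type."""
    # get_type_hints resolves the annotations, including string annotations.
    fn.return_type = typing.get_type_hints(fn).get("return")
    return fn


@memo
async def embed_chunk(chunk: str) -> list[float]:
    # Placeholder body; a real pipeline would call an embedding model here.
    return [float(len(chunk))]


# The framework now knows how to decode cached results of embed_chunk.
print(embed_chunk.return_type)  # list[float]
```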

This led us to a type-guided serialization system built on msgspec, a fast msgpack-based serialization library for Python. Msgspec natively handles dataclasses, NamedTuples, and standard types (str, int, list, dict, datetime, UUID, etc.) — and it does so by relying on type information at deserialization time.

The routing byte

Different types need different serialization engines. We use a single routing byte at the start of each payload to select the engine:

| Byte | Engine | Used for |
| --- | --- | --- |
| 0x01 | msgspec (msgpack) | Dataclasses, NamedTuples, primitives, collections |
| 0x02 | Pydantic | BaseModel subclasses |
| 0x80 | Pickle | Opted-in types, legacy data |

Serialization checks types in priority order: explicit pickle opt-in first (the user specifically requested it), then Pydantic models, then msgspec as the default. Deserialization reads the routing byte and dispatches accordingly.

For Pydantic models, we serialize via model_dump(mode="json") into msgpack, and deserialize via TypeAdapter.validate_python(). This avoids pickle entirely while preserving Pydantic’s validation semantics.

For types that msgspec can’t handle natively — like numpy arrays or pathlib paths — users can opt in to pickle serialization with @coco.serialize_by_pickle. These types get the pickle routing byte (0x80), still through a restricted unpickler for safety.
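The routing-byte dispatch can be sketched with the standard library. This is a simplified illustration, not CocoIndex's code: JSON stands in for the msgspec engine, the Pydantic engine (0x02) is omitted, `serialize_by_pickle` is a hypothetical stand-in for the decorator, and the pickle branch uses plain `pickle.loads` where the article describes a restricted unpickler.

```python
import json
import pickle
from pathlib import PurePosixPath
from typing import Any

TAG_DEFAULT = 0x01  # msgspec in the article; JSON stands in here
TAG_PICKLE = 0x80   # explicit opt-in types

_PICKLE_OPT_IN: set[type] = set()


def serialize_by_pickle(cls):
    """Hypothetical stand-in for @coco.serialize_by_pickle."""
    _PICKLE_OPT_IN.add(cls)
    return cls


def serialize(value: Any) -> bytes:
    # Priority order: explicit pickle opt-in first, default engine otherwise.
    if type(value) in _PICKLE_OPT_IN:
        return bytes([TAG_PICKLE]) + pickle.dumps(value)
    return bytes([TAG_DEFAULT]) + json.dumps(value).encode("utf-8")


def deserialize(payload: bytes) -> Any:
    # The first byte selects the engine; the rest is that engine's payload.
    tag, body = payload[0], payload[1:]
    if tag == TAG_DEFAULT:
        return json.loads(body)
    if tag == TAG_PICKLE:
        # Simplification: a real system would use a restricted unpickler.
        return pickle.loads(body)
    raise ValueError(f"unknown routing byte: {tag:#04x}")


serialize_by_pickle(PurePosixPath)

path_payload = serialize(PurePosixPath("/data/a.txt"))
list_payload = serialize([1, 2, 3])
print(hex(path_payload[0]), hex(list_payload[0]))
```

Because the tag is written at serialization time, the read path never has to guess which engine produced a payload, and legacy pickle data stays readable alongside newer formats.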

What changed for users

The conversation_to_knowledge example? All 9 @coco.unpickle_safe annotations — gone. Users define their data types with standard Python type hints, and serialization just works:

python
from dataclasses import dataclass

@dataclass
class DocumentChunk:
    text: str
    embedding: list[float]
    metadata: dict[str, str]

No decorators. No registration. No run-fail-annotate-repeat cycle. The framework reads the type hint and picks the right serializer and deserializer automatically.

The Edge Cases: When Round-Trip Isn’t Perfect

We’d be lying if we said this approach is perfect for every type. There are edge cases with union types that users should be aware of — and these aren’t specific to CocoIndex. They’re fundamental to how msgspec and Pydantic handle ambiguous types.

msgspec: loud failure

msgspec rejects ambiguous unions at construction time:

python
import datetime

import msgspec.msgpack

# Each of these raises TypeError at Decoder construction time:
msgspec.msgpack.Decoder(str | datetime.date)     # two str-like types
msgspec.msgpack.Decoder(list[int] | tuple[int, ...])  # two array-like types
msgspec.msgpack.Decoder(bytes | bytearray)        # two bytes-like types

This is actually a feature. msgspec tells you upfront that these types can’t be distinguished on the wire, rather than silently corrupting your data.

Pydantic: silent surprise

Pydantic is more permissive — it accepts date | str — but the behavior after a serialization-deserialization round-trip can be surprising.

Consider a Pydantic model with a date | str field:

python
import datetime
from pydantic import BaseModel

class Event(BaseModel):
    ts: datetime.date | str

e = Event(ts=datetime.date(2024, 1, 15))

If you serialize this to JSON (or msgpack), date becomes the string "2024-01-15". On deserialization, Pydantic sees a string input and needs to decide: is it a str or a date?

With Pydantic v2’s default smart mode, the answer is always str. Smart mode scores candidates by exactness: a string input is an exact match for str but only a lax match (requires coercion) for date. Exact wins. Ordering doesn’t matter — date | str and str | date both produce str.

Left-to-right mode behaves differently. With date | str ordering, Pydantic tries date first, the lax coercion from "2024-01-15" succeeds, and you get your date back:

python
import datetime
from typing import Annotated

from pydantic import BaseModel, Field

class Event(BaseModel):
    # Left-to-right mode: more specific type first
    ts: Annotated[datetime.date | str, Field(union_mode='left_to_right')]

But this requires users to explicitly opt in to left-to-right mode AND put the more specific type first.

The fundamental issue

This isn’t a bug in any library. It’s inherent to the wire format. Msgpack (like JSON) has no date type — dates are serialized as strings. Once the type information is erased on the wire, no amount of clever decoding can reliably recover it when multiple types map to the same wire representation.
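The erasure is easy to see with the standard library. The sketch below uses JSON for readability; per the paragraph above, msgpack has the same limitation for dates.

```python
import datetime
import json

# A date has to be written as a string because the format has no date type.
from_date = json.dumps({"ts": datetime.date(2024, 1, 15).isoformat()})
from_str = json.dumps({"ts": "2024-01-15"})

# The two payloads are byte-for-byte identical.
print(from_date == from_str)

# Decoding alone can only ever give back a string.
decoded = json.loads(from_date)["ts"]
print(type(decoded).__name__)
```

Since the two payloads are indistinguishable, any decision between `date` and `str` has to come from outside the payload: a declared type, a union mode, or a tag field.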

The practical advice: avoid union types where members share a wire representation in data objects that go through serialization. Use unambiguous types, or use Pydantic’s discriminated unions with a literal tag field.

What We Learned

[Figure: Four lessons. Type hints as a source of truth, fail loudly on ambiguity, design with the wire format, and stage simple-to-production with a plan.]

Type hints are an underutilized source of truth. Python developers already annotate their data types. Using those annotations for serialization is a natural extension — no new concepts, no new decorators, no new registration APIs.

Make ambiguity loud, not silent. msgspec’s approach of rejecting ambiguous unions at construction time is better than silently producing wrong results. When building a serialization layer, fail early and clearly.

Don’t fight the wire format. If the wire format can’t distinguish two types, your serialization layer can’t either. Design your data types around what the serialization format can express, not the other way around.

Start simple, but have a plan. Pickle was right for prototyping. The whitelisted pickle was right for the security fix. The type-guided approach was right for production. Each step taught us what the next step needed to be.

Frequently Asked Questions

Is Python's pickle safe to use? What are the security risks?

No — pickle is unsafe for any data you do not fully control. Pickle executes arbitrary code during deserialization: a crafted payload can run any Python code on the host. Even when the serialized data lives in internal storage that is not directly user-facing, defense in depth matters — a misconfiguration or a compromised database can turn data access into code execution.

This is the reason CocoIndex moved off pickle as the default persistence format. See Pickle: Fast to Prototype, Painful to Live With.

What are good alternatives to pickle in Python?

For typed data — dataclasses, NamedTuples, primitives, lists, dicts, datetimes, UUIDs — use msgspec, a fast msgpack-based library that uses type information at deserialization time. For Pydantic BaseModel subclasses, dump via model_dump(mode="json") into msgpack and reload via TypeAdapter.validate_python(). Reserve pickle for types that neither library handles natively (e.g. numpy arrays, pathlib paths) and route them through a restricted unpickler.

This is the layered approach CocoIndex landed on — see The routing byte.

What is type-guided deserialization?

Type-guided deserialization uses the type hints already present in user code as the schema for serialization. Instead of asking developers to register or decorate their classes, the framework reads the type annotation (for example, the return type of a memoized function) and selects a deserializer that matches that shape. The wire format does not need to embed type metadata for every value — the consumer already knows the expected type.

See Type-Guided Deserialization: Letting Type Hints Do the Work.

Can Python type hints be used for serialization?

Yes. Libraries like msgspec use type hints to drive deserialization, so a single annotation (for example, list[float] on a function's return type) doubles as the schema. CocoIndex uses the function signature directly: a memoized async function annotated with -> list[float] tells the framework exactly how to decode the cached result.

See Type-Guided Deserialization: Letting Type Hints Do the Work.

How do I safely deserialize untrusted data in Python?

Pick a format that does not execute code. msgspec (msgpack) and Pydantic's JSON pipeline are both safe by construction — they reconstruct values according to a declared schema, never by running pickle opcodes. If you must accept pickle for a small set of opted-in types, gate it behind a restricted unpickler that allowlists exactly the classes you expect; reject everything else.

CocoIndex's escape-hatch design keeps pickle behind both an explicit opt-in (@coco.serialize_by_pickle) and a restricted unpickler — see The routing byte.

How do I replace pickle with msgpack or msgspec?

Declare your data shape with standard Python type hints — a @dataclass, a NamedTuple, or primitives — and let msgspec serialize and deserialize against that type. msgspec encodes values much more compactly than pickle because the deserializer already knows the type: 1024 takes 3 bytes in msgpack vs. 15 bytes in pickle, an empty string 1 byte vs. 15. For pipelines that cache thousands of intermediate results, that adds up quickly.

See the size comparison and rationale in Pickle: Fast to Prototype, Painful to Live With.

How do I serialize dataclasses and Pydantic models together in one system?

Use a single-byte routing prefix at the start of every payload to select the engine, and dispatch by type at serialization time:

  • 0x01 — msgspec (msgpack) for dataclasses, NamedTuples, primitives, and collections.
  • 0x02 — Pydantic for BaseModel subclasses (dump via model_dump(mode="json"), reload via TypeAdapter.validate_python()).
  • 0x80 — pickle for opted-in types and legacy data, behind a restricted unpickler.

Resolve types in priority order — explicit pickle opt-in first, then Pydantic, then msgspec as the default. On the read path, look at the routing byte and call the matching engine. See The routing byte.

How do I build a restricted (whitelisted) unpickler in Python?

Subclass pickle.Unpickler, override find_class, and refuse any class that is not on a known allowlist. In CocoIndex's first iteration, types opted in by way of a @coco.unpickle_safe decorator and the unpickler raised on anything missing.

The catch: every nested type must also be allowlisted, not just the top-level one. CocoIndex's own conversation_to_knowledge example needed nine annotations spread across the codebase, found over four rounds of run-fail-annotate. That annotation burden is what eventually pushed the framework toward type-guided dispatch — see Whitelisted Pickle: Secure but Tedious.

How do I choose a serializer automatically based on a function's return type?

Inspect the function's return annotation, pick the engine that matches that type, and write a routing byte at the start of the payload so deserialization can reverse the decision. CocoIndex's resolver checks types in priority order: explicit pickle opt-in → Pydantic BaseModel → msgspec for everything else. For memoized return values, the return-type annotation in the function signature is the only metadata required — no decorators, no registration.

See Type-Guided Deserialization: Letting Type Hints Do the Work and What changed for users.

What are the edge cases when using type hints for serialization?

Union types whose members share a wire representation are the main hazard. msgspec rejects ambiguous unions like str | datetime.date or list[int] | tuple[int, ...] at construction time, which is the safer behavior. Pydantic accepts them but, in default smart mode, prefers the exact match — so a round-tripped date field typed as date | str comes back as str, regardless of declaration order. Left-to-right mode with the more specific type first restores the date, but it must be opted into explicitly.

The underlying constraint is the wire format: msgpack and JSON have no native date type, so once a date is on the wire it is a string. Prefer unambiguous types or Pydantic discriminated unions with a literal tag. See The Edge Cases: When Round-Trip Isn't Perfect.