CocoIndex Changelog 2025-04-05

In the past two weeks, we added incremental processing with live update mode, evaluation utilities, support for date/time types, improvements to the Google Drive source, and assorted core and performance improvements.

Incremental processing

Incremental processing is one of the core values provided by CocoIndex. CocoIndex creates and maintains indexes, keeping them up to date with source changes through minimal reprocessing. Users don't need to do anything special: they just define the transformation they need.

CocoIndex automatically tracks the lineage of the data and maintains a cache of computation results. This approach ensures low latency between source and index updates while minimizing computational costs.
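The idea of caching computation results and reprocessing only what changed can be sketched roughly as follows. This is a conceptual illustration, not CocoIndex's implementation: the IncrementalIndex class, the fingerprint helper, and the logic_version tag are all hypothetical names introduced here.

```python
import hashlib
from typing import Callable, Dict, Tuple


def fingerprint(content: str, logic_version: str) -> str:
    """Hash the source content together with a tag for the transformation logic."""
    return hashlib.sha256(f"{logic_version}\x00{content}".encode("utf-8")).hexdigest()


class IncrementalIndex:
    """Toy index (hypothetical) that recomputes a row only when its content or the logic changes."""

    def __init__(self, transform: Callable[[str], str], logic_version: str) -> None:
        self.transform = transform
        self.logic_version = logic_version
        self.cache: Dict[str, Tuple[str, str]] = {}  # key -> (fingerprint, result)
        self.recomputed = 0  # number of times the transform actually ran

    def update(self, source: Dict[str, str]) -> Dict[str, str]:
        # Drop cached rows that no longer exist in the source.
        for key in list(self.cache):
            if key not in source:
                del self.cache[key]
        # Recompute only rows whose fingerprint is new or different.
        for key, content in source.items():
            fp = fingerprint(content, self.logic_version)
            cached = self.cache.get(key)
            if cached is None or cached[0] != fp:
                self.cache[key] = (fp, self.transform(content))
                self.recomputed += 1
        return {key: result for key, (_, result) in self.cache.items()}


index = IncrementalIndex(transform=str.upper, logic_version="v1")
index.update({"a.md": "hello", "b.md": "world"})   # both rows are processed
index.update({"a.md": "hello", "b.md": "world!"})  # only b.md is reprocessed
```

Including the logic version in the fingerprint means a change to the transformation invalidates cached results just as a change to the source data does.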

For more details, please refer to Incremental Processing.

Continuous synchronization between source and index

With the new live update mode, CocoIndex continuously captures changes from the source data and updates the target data accordingly. It is long-running and stops only when explicitly aborted. This ensures that your applications always have access to the most current information without the performance overhead of full reindexing.

CocoIndex supports two main categories of change detection mechanisms:

  1. General mechanism: with a refresh interval configured, CocoIndex periodically checks the source for changes.

  2. Source-specific mechanisms: for example, push change notifications, or a recent-changes poll that lists recently modified entries.

Under the hood, once a change is detected, CocoIndex uses its incremental processing mechanism to update the target data. For more details, please refer to Continuous Sync between Source and Index.
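As a rough illustration of the general mechanism, one refresh pass can be thought of as comparing the previous metadata snapshot with the current one. The sketch below is an assumption made for illustration only: detect_changes and the {key: modified_time} snapshot shape are hypothetical, not part of the CocoIndex API.

```python
from typing import Dict, List, Tuple


def detect_changes(
    previous: Dict[str, float], current: Dict[str, float]
) -> Tuple[List[str], List[str]]:
    """Compare two {key: modified_time} snapshots (hypothetical shape).

    Returns (changed_or_added_keys, deleted_keys), each sorted.
    """
    changed = sorted(k for k, mtime in current.items() if previous.get(k) != mtime)
    deleted = sorted(k for k in previous if k not in current)
    return changed, deleted


snapshot_1 = {"a.md": 100.0, "b.md": 100.0}
snapshot_2 = {"a.md": 100.0, "b.md": 105.0, "c.md": 106.0}
changed, deleted = detect_changes(snapshot_1, snapshot_2)
# Only the keys in `changed` and `deleted` would need to go through incremental processing.
```

A source-specific mechanism such as a recent-changes poll avoids traversing every entry, which is why it can run at a shorter interval than the general refresh.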

It's super simple to get started. You just need to configure change data capture for your source:

data_scope["documents"] = flow_builder.add_source(
    cocoindex.sources.GoogleDrive(
        ...
        recent_changes_poll_interval=datetime.timedelta(seconds=10)),
    refresh_interval=datetime.timedelta(minutes=1))

recent_changes_poll_interval is the interval for polling recent changes from Google Drive. refresh_interval is the interval for general change detection.

Then run:

cocoindex update -L

-L is the flag for live update mode. Alternatively, use cocoindex.FlowLiveUpdater in the Python SDK.

For a complete example, please refer to Live Index from Google Drive Example. For more detailed documentation, please refer to Live Update.

Evaluation utilities

CocoIndex now exposes a simple way to evaluate extraction quality: dump the output to files and compare it with golden files.

To dump the ETL output to YAML files, run:

python3 main.py cocoindex evaluate

This dumps what would be indexed to files under a directory. You can then compare the output with golden files using tools like DirEqual or Meld.
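If you prefer a scripted check over a GUI diff tool, the comparison against golden files can be sketched with Python's standard library. This is a minimal sketch under assumptions: diff_dirs is a hypothetical helper, and the directory layout of the dumped output is not specified here, so both paths are supplied by the caller.

```python
import filecmp
import os
from typing import List


def diff_dirs(output_dir: str, golden_dir: str) -> List[str]:
    """Return a sorted list of files that differ or exist on only one side."""
    problems: List[str] = []

    def walk(cmp: filecmp.dircmp, prefix: str) -> None:
        for name in cmp.left_only:
            problems.append(f"only in output: {os.path.join(prefix, name)}")
        for name in cmp.right_only:
            problems.append(f"only in golden: {os.path.join(prefix, name)}")
        # dircmp compares files shallowly by default, so re-check contents byte by byte.
        _, mismatch, errors = filecmp.cmpfiles(
            cmp.left, cmp.right, cmp.common_files, shallow=False
        )
        for name in mismatch + errors:
            problems.append(f"differs: {os.path.join(prefix, name)}")
        # Recurse into subdirectories present on both sides.
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(output_dir, golden_dir), "")
    return sorted(problems)
```

An empty result means the dumped output matches the goldens, which makes the function easy to call from a test or CI step.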

A full example of evaluating and troubleshooting ETL (LLM data extraction) can be found in this blog.

Support for date/time types

| Type | Description | Type in Python | Original Type in Python |
|---|---|---|---|
| Date | | datetime.date | datetime.date |
| Time | | datetime.time | datetime.time |
| LocalDatetime | Date and time without timezone | cocoindex.typing.LocalDateTime | datetime.datetime |
| OffsetDatetime | Date and time with a timezone offset | cocoindex.typing.OffsetDateTime | datetime.datetime |

For some types, the CocoIndex Python SDK provides annotated types with finer granularity than Python's original type, e.g.

  • LocalDateTime and OffsetDateTime for datetime.datetime, with different timezone awareness.
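The distinction these two annotated types capture already exists in Python's datetime.datetime, where a value either carries a UTC offset or does not. The sketch below uses only the standard library to show that difference; is_offset_aware is a hypothetical helper for illustration, and how the cocoindex.typing annotations are applied in a flow is not shown here.

```python
import datetime


def is_offset_aware(value: datetime.datetime) -> bool:
    """True if the datetime carries a usable UTC offset."""
    return value.tzinfo is not None and value.utcoffset() is not None


# No timezone information: the kind of value LocalDateTime is meant to describe.
local = datetime.datetime(2025, 4, 5, 9, 30)

# Fixed UTC-7 offset: the kind of value OffsetDateTime is meant to describe.
with_offset = datetime.datetime(
    2025, 4, 5, 9, 30,
    tzinfo=datetime.timezone(datetime.timedelta(hours=-7)),
)
```

Both values have the same Python type, which is why a finer-grained annotation is needed to tell the engine which one a field holds.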

Review the full list of core data types.

An example of using date/time types in ETL data extraction can be found in this blog.

New examples and Tutorials

Thanks to the Community 🤗🎉!

  • @Anush008 from Qdrant made their first contribution in #182. We are super excited for the upcoming official integration from our friends at Qdrant!

We're always improving CocoIndex and would love to hear your feedback.

Full Changelog v0.1.13...v0.1.18

v0.1.18 Incremental/Live update mode

  • Support live update mode in CocoIndex engine, exposed by FlowLiveUpdater (Python SDK) and update -L / server -L (CLI).
  • Offer a refresh_interval option on the add_source() API in the Python SDK for periodic change detection based on metadata traversal.
  • Update GoogleDrive data source to support detecting recent changes based on last modified time.
  • Continuously show stats in live update mode.
  • Make @main_fn decorator support async functions.
  • Carry over more metadata for function and class decorators in Python SDK.

v0.1.17 Core, Performance/Incremental updates

  • Skip reprocessing a source row when neither the source data nor the logic has changed.
  • Keep source indexing states in memory to achieve lightweight incremental reprocessing when update is called multiple times.
  • Minor optimization for auto-generated UUIDs used as storage target keys.

v0.1.16 Core Data Types

  • Support date/time types

v0.1.15 Core, LLM structured extraction

  • Add a UUID type, and support automatically generating stable UUIDs.
  • Support non-required fields for LLM extraction.
  • Improve the robustness of storage target setup logic.
  • Allow collect() to take constants as well.
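To illustrate what a "stable" UUID means in general, the sketch below derives the same UUID from the same inputs on every run using the standard library's name-based UUIDs. This is an assumption for illustration only: how CocoIndex generates its UUIDs is not described here, and EXAMPLE_NAMESPACE and stable_uuid are hypothetical names.

```python
import uuid

# Hypothetical namespace used only for this illustration.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid")


def stable_uuid(*parts: str) -> uuid.UUID:
    """Derive a deterministic UUID from the given parts (same inputs, same UUID)."""
    # Join with a separator that is unlikely to appear in the parts themselves.
    return uuid.uuid5(EXAMPLE_NAMESPACE, "\x1f".join(parts))


row_id = stable_uuid("a.md", "chunk-0")
same_row_id = stable_uuid("a.md", "chunk-0")
other_row_id = stable_uuid("a.md", "chunk-1")
```

A deterministic key lets a rerun overwrite the existing target row instead of inserting a duplicate, which a random UUID would not allow.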

v0.1.14 Support Evaluation

  • Support evaluating a flow and dumping its output to files for offline evaluation.

v0.1.13 Error handling, performance

  • Add a field mime_type for GoogleDrive source.
  • Count source rows with ERROR during indexing.
  • Make the location of errors caused by functions clearer.
  • StructSchema takes an optional description and includes it in the JSON schema.
  • Use class docstring as description for struct types in Python SDK.
  • Correctly output unsupported type name.
  • Correctly handle None values for composite types.
  • Improve error message for encoding of field annotations.
  • Bug fix: correctly deal with deleted/trashed files in GoogleDrive source.
  • Make indexing update stats more consistent with what really happened.
  • Bug fix: make sure dangling precommit states are properly handled.