CocoIndex Changelog 2025-10-19

· 17 min read
George He
CocoIndex Maintainer

🚀 Over the last 20+ releases, CocoIndex has continued to advance its ultra-performant AI-native ETL infrastructure with a series of targeted features and enhancements focused on reliable incremental processing, error resilience, and extensibility.

We are constantly improving CocoIndex to make it easier to use, more powerful, and more reliable. Star us on GitHub to stay updated!

Core Capability

Durable execution

CocoIndex is designed for massive, ever-changing datasets. Reliability and efficiency are critical. With durable execution, CocoIndex now automatically captures failures and retries only the affected rows, ensuring progress is never lost and avoiding full reprocessing.

  • Automatic recovery on failure: Failed rows are automatically retried in the next update #995, with optional fine-grained control through configurable retry lists #997. CocoIndex also maintains precise processing order tracking to guarantee consistent reprocessing sequences #1001.
  • Stable transient authentication: Improved handling of short-lived auth keys ensures reliability and prevents drift across retries #1002.
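The retry behavior described above can be pictured with a small sketch. This is not CocoIndex's implementation; `run_update`, `flaky`, and the in-memory failure set are hypothetical names used only for illustration. Each update processes the rows that changed plus the rows that failed last time, in a stable order, and records failures instead of aborting.

```python
from typing import Callable, Iterable


def run_update(
    changed_keys: Iterable[str],
    failed_keys: set[str],
    process_row: Callable[[str], None],
) -> set[str]:
    """Process changed rows plus previously failed rows; return the new failure set."""
    # Retry rows that failed last time together with rows that changed now.
    # Sorting gives a deterministic processing order across runs.
    to_process = sorted(set(changed_keys) | failed_keys)
    still_failing: set[str] = set()
    for key in to_process:
        try:
            process_row(key)
        except Exception:
            # Capture the failure instead of aborting the whole update.
            still_failing.add(key)
    return still_failing


# Hypothetical usage: row "b" fails on its first attempt only.
attempts: dict[str, int] = {}


def flaky(key: str) -> None:
    attempts[key] = attempts.get(key, 0) + 1
    if key == "b" and attempts[key] == 1:
        raise RuntimeError("transient failure")


failed = run_update(["a", "b", "c"], set(), flaky)  # first update
failed_after_retry = run_update([], failed, flaky)  # next update retries only "b"
```

In this sketch, rows "a" and "c" are processed once, and only "b" is touched again in the second update.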

These enhancements make incremental processing more fault-tolerant, efficient, and self-healing, allowing CocoIndex to continuously move forward - no wasted computation, no data left behind.

Avoid Unnecessary Reprocessing

In CocoIndex, reprocessing can be triggered by changes in either source data or flow definitions. We improved our flow change detection mechanism, significantly reducing false positives and avoiding unnecessary reprocessing.

  • Improved change detection logic: Refined how CocoIndex evaluates flow and dependency changes, cutting down false positives and ensuring reprocessing only happens when it’s truly needed. This leads to faster, more predictable runs and reduced compute costs. #1182
  • Stable transient auth keys: Enhanced retry behavior now keeps authentication keys consistent across transient failures, preventing drift and improving reliability for long-running flows. #1002

Together, these updates make CocoIndex’s incremental engine more deterministic, efficient, and resilient, so your flows stay lean even as your data and logic evolve.
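One way to picture change detection with fewer false positives is to fingerprint a canonical form of the flow definition, so that cosmetic differences do not look like logic changes. The sketch below is an assumption for illustration only, not how CocoIndex computes its fingerprints; `flow_fingerprint` and the dictionary-based spec are hypothetical.

```python
import hashlib
import json


def flow_fingerprint(spec: dict) -> str:
    """Hash a canonical JSON encoding of a (hypothetical) flow spec."""
    # sort_keys and fixed separators make the encoding independent of
    # dictionary ordering and whitespace, so only real changes alter the hash.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


spec_v1 = {"op": "split", "params": {"chunk_size": 512, "overlap": 64}}
spec_reordered = {"params": {"overlap": 64, "chunk_size": 512}, "op": "split"}
spec_v2 = {"op": "split", "params": {"chunk_size": 1024, "overlap": 64}}

# Same logic written in a different key order: no reprocessing needed.
needs_reprocess_reordered = flow_fingerprint(spec_v1) != flow_fingerprint(spec_reordered)
# A parameter actually changed: reprocessing is needed.
needs_reprocess_changed = flow_fingerprint(spec_v1) != flow_fingerprint(spec_v2)
```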

Enhanced Build & Update Target Data

We’ve expanded how CocoIndex builds and updates target data, giving users more control over replication and recovery workflows.

Supported Modes

  • One-time update: Builds or updates target data based on the current state of the source.
  • Live update: Starts with a one-time update, then continuously captures source changes to keep targets up to date.

New Option: Reexport Targets with --reexport

We’ve added a new --reexport option that can be applied to both modes. When enabled, CocoIndex ignores the previous state of the target and rebuilds it entirely — even if neither the source data nor the flow definition has changed.

This is especially useful for data recovery, corrections, or full target refreshes after data loss. In live update mode, reexport applies only to the initial build phase.

Example:

cocoindex update --reexport main.py

This enhancement broadens pipeline flexibility and improves data reliability in recovery or replay scenarios.

Read more in Build and update data.

New Option: --reset for Setup, Update, and Server Commands

We’ve added a new --reset flag that makes it easier to start fresh by automatically clearing existing setups before running key commands.

When you use --reset, CocoIndex will drop the existing setup or state before performing the new operation — equivalent to running cocoindex drop first.

Example:

cocoindex setup --reset main.py

This ensures a clean environment for rebuilds, testing, or reinitializing workflows — without needing to manually clear prior state.

Incremental Engine Optimizations with Fast Fingerprint Collapsing

CocoIndex has smart incremental processing out of the box and only processes what has changed. In the latest version, we’ve added more engine-level optimizations to improve efficiency for large datasets:

  • Source content fingerprinting — the engine tracks a lightweight “fingerprint” of each source row or file, allowing it to detect when content hasn’t changed. #892
  • Fast collapse — unchanged data can be skipped without reading the full source, significantly reducing processing time and resource usage. #895

This optimization is particularly useful for sources that provide a content hash, version tag, or fingerprint (e.g., GitHub files, S3 objects, or custom sources). By skipping unchanged items without even reading their content from the data source, CocoIndex can save a large number of external API calls in addition to compute costs, making workflows faster and more cost-efficient.

While primarily designed for versioned sources like GitHub, this approach can benefit any source that exposes change indicators, improving incremental processing across diverse pipelines.
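The idea of fast collapse can be sketched as follows. This is a simplified illustration under stated assumptions, not the engine's actual logic: `SourceEntry`, `collect_changed`, and `fake_read` are hypothetical names, and the source is assumed to report a version tag in its listing so the content only needs to be read when that tag differs from the stored one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SourceEntry:
    key: str
    version: str  # e.g. a content hash or version tag reported by the source


def collect_changed(
    listing: list[SourceEntry],
    stored: dict[str, str],
    read_content: Callable[[str], bytes],
) -> dict[str, bytes]:
    """Read content only for entries whose version differs from the stored one."""
    fetched: dict[str, bytes] = {}
    for entry in listing:
        if stored.get(entry.key) == entry.version:
            # Fast collapse: fingerprint unchanged, skip without reading the source.
            continue
        fetched[entry.key] = read_content(entry.key)
        stored[entry.key] = entry.version
    return fetched


reads: list[str] = []


def fake_read(key: str) -> bytes:
    reads.append(key)  # stands in for an external API call
    return f"content of {key}".encode()


stored_versions = {"a.md": "v1", "b.md": "v1"}
listing = [SourceEntry("a.md", "v1"), SourceEntry("b.md", "v2"), SourceEntry("c.md", "v1")]
changed = collect_changed(listing, stored_versions, fake_read)
```

Here `a.md` is skipped without any read, while the modified `b.md` and the new `c.md` are fetched.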

None Handling

CocoIndex supports None values. A None value represents the absence of data or an unknown value, distinct from empty strings, zero numbers, or false boolean values. None input values passed to required arguments of CocoIndex functions short-circuit the function and produce a None output value. See the documentation for more details.

In this release, None value handling in transformations was improved to better manage required and nullable fields, increasing robustness.
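The short-circuit rule can be illustrated in plain Python. The decorator below is a hypothetical sketch, not part of the CocoIndex API, and for simplicity it treats every argument as required.

```python
import functools
from typing import Callable


def short_circuit_on_none(fn: Callable) -> Callable:
    """Return None without calling fn when any argument is None."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if any(a is None for a in args) or any(v is None for v in kwargs.values()):
            return None
        return fn(*args, **kwargs)

    return wrapper


@short_circuit_on_none
def word_count(text: str) -> int:
    return len(text.split())
```

With this sketch, `word_count(None)` yields None, while `word_count("")` still runs and yields 0, which mirrors the distinction between a missing value and an empty one.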

Robustness

Isolate GPU workloads in subprocess

Added full subprocess support for GPU workloads, improving stability and isolation while removing the need for global locks. These enhancements mark a strong step forward in robustness and performance for GPU-driven workloads within CocoIndex. They reduce the risk of GPU jobs hanging and provide better tolerance of GPU failures. For users leveraging GPU processing, the update is highly recommended and should bring a smoother, more resilient operational experience.

What to expect for users / developers

If you’re using CocoIndex’s GPU-capable workloads, you may notice improved stability. Temporary failures caused by the GPU are isolated in the subprocess, which is restarted gracefully.
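The general pattern of isolating a fragile workload in a child process looks roughly like the sketch below. It uses only the Python standard library and is not CocoIndex's subprocess machinery; `run_isolated` and its parameters are hypothetical.

```python
import subprocess
import sys


def run_isolated(code: str, max_restarts: int = 2, timeout: float = 30.0) -> str:
    """Run Python code in a child process, restarting it if it crashes or hangs."""
    last_error = ""
    for _ in range(max_restarts + 1):
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the hung child before raising; try again.
            last_error = "timed out"
            continue
        if proc.returncode == 0:
            return proc.stdout.strip()
        # The crash stayed inside the child; the parent keeps running.
        last_error = proc.stderr.strip()
    raise RuntimeError(f"worker failed after restarts: {last_error}")
```

For example, `run_isolated("print(21 * 2)")` returns the child's output as a string, while a child that keeps crashing surfaces as a `RuntimeError` in the parent instead of taking the parent down with it.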

Error Tolerance improvements

Fixed live-mode behaviour so that source-level errors are now logged instead of aborting the process. We also extended retry coverage for several external APIs, e.g. OpenAI and Vertex AI. This reduces failures and improves stability.
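Retrying a flaky external call typically follows an exponential backoff pattern. The sketch below is a generic illustration, not CocoIndex's retry code; `call_with_retry`, its defaults, and `flaky_api` are assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff on any exception."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...


# Hypothetical usage: the call succeeds on the third attempt.
calls = {"n": 0}
delays: list[float] = []


def flaky_api() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary outage")
    return "ok"


result = call_with_retry(flaky_api, sleep=delays.append)  # record delays instead of sleeping
```

A production version would normally retry only on error types known to be transient (timeouts, rate limits) rather than on every exception.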

Query handler and CocoInsight support

Query Handler

Query handlers let you expose a simple function that takes a query string and returns structured results. They are discoverable by tools like CocoInsight so you can query your indexes without writing extra glue code.

@my_flow.query_handler(name="run_query")  # Name is optional, use the function name by default
def run_query(query: str) -> cocoindex.QueryOutput:
    # 1) Perform your query against the input `query`
    ...

    # 2) Return structured results
    return cocoindex.QueryOutput(results=[{"filename": "...", "text": "..."}])

CocoInsight

We are launching a major feature in both CocoIndex and CocoInsight to help users iterate quickly on their indexing strategy and trace results all the way back to the data.

🎉 Announcement: Query Support (https://cocoindex.io/blogs/query-support)

Demo:

https://www.youtube.com/watch?v=crV7odEVYTE

Python SDK

Support pydantic with field-level description for structured extraction

ExtractByLlm is one of the native building blocks in CocoIndex; it makes it easier to extract structured information from unstructured documents with an LLM.

This change lets users define descriptions directly at the field level when doing structured extraction.

For example, you could define the following class:

from pydantic import BaseModel, Field


# ProductTaxonomy is assumed to be another model defined elsewhere.
class ProductTaxonomyInfo(BaseModel):
    """
    Taxonomy information for the product.
    """

    taxonomies: list[ProductTaxonomy] = Field(
        ...,
        description="Taxonomies for the current product."
    )

    complementary_taxonomies: list[ProductTaxonomy] = Field(
        ...,
        description="Think about when customers buy this product, what else they might need as complementary products. Put labels for these complementary products."
    )

And you can plug it into any ExtractByLlm transformation.

taxonomy = data["detail"].transform(
    cocoindex.functions.ExtractByLlm(
        llm_spec=cocoindex.LlmSpec(
            api_type=cocoindex.LlmApiType.OPENAI, model="gpt-4.1"
        ),
        output_type=ProductTaxonomyInfo,
    )
)

Support @cocoindex.settings and make init() optional

Besides environment variables, a more flexible way to configure CocoIndex is to provide a settings function that returns a cocoindex.Settings dataclass object. The function can have any name and must be decorated with the @cocoindex.settings decorator, for example:

@cocoindex.settings
def cocoindex_settings() -> cocoindex.Settings:
    return cocoindex.Settings(
        database=cocoindex.DatabaseConnectionSpec(
            url="postgres://cocoindex:cocoindex@localhost/cocoindex"
        )
    )

This settings function is called once when CocoIndex is initialized, whether you're running the CLI or your own main script, so you can provide settings in one place in your code. Once a settings function is provided, environment variables are ignored.

Building blocks

Postgres Source Enhancements

Postgres source allows users to take data from any PostgreSQL table as input. Schema of the imported data is automatically inferred from the source table. It supports change-capture based on LISTEN/NOTIFY.

  • Composite-key Postgres source: Auto-detects schema, supports single or multi-column keys, and maps Postgres → CocoIndex types. #910
  • Faster small-row performance: Refactored Postgres source and list() API for high-volume, many-small-rows workloads. #948
  • Real-time Postgres updates: New LISTEN/NOTIFY change-capture mode with auto channel naming and full source wiring. #952, #953, #954, #955

Postgres Target Enhancements

  • Postgres filtering support: added the ability to apply filters directly on Postgres sources, enabling more efficient data selection and ingestion. #1178
  • Half-precision vector (halfvec) support: added halfvec support for Postgres targets, reducing storage overhead for embeddings while maintaining compatibility with vector operations. #1171
  • Schema support for Postgres targets: users can now specify and write to custom schemas, allowing better organization in multi-schema database setups #1138
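The storage saving from half precision is easy to estimate: a half-precision float takes 2 bytes per dimension instead of 4. The sketch below uses Python's `struct` module to show the raw payload sizes for a hypothetical 1536-dimension embedding. It ignores any per-value overhead in Postgres, and `packed_size` is an illustrative helper, not a CocoIndex or pgvector function.

```python
import struct


def packed_size(dimensions: int, half_precision: bool) -> int:
    """Raw byte size of a vector stored as float16 ("e") or float32 ("f")."""
    fmt = "e" if half_precision else "f"
    return struct.calcsize(f"<{dimensions}{fmt}")


dims = 1536  # hypothetical embedding size
full_bytes = packed_size(dims, half_precision=False)  # 4 bytes per dimension
half_bytes = packed_size(dims, half_precision=True)   # 2 bytes per dimension

# Round-trip one value through half precision to see the rounding involved.
original = 0.123456789
as_half = struct.unpack("<e", struct.pack("<e", original))[0]
```

The payload halves, at the cost of a small rounding error per component, which is often acceptable for similarity search over embeddings.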

Target Enhancements – New & Improved Built-ins

Support for LanceDB Target

We are officially supporting LanceDB as a target. Get started with LanceDB target.

Postgres Target Enhancements

Target Attachment / PostgreSQL SQL Command Attachment — The core engine now supports attachments on targets, including a PostgresSqlCommand type that allows you to execute arbitrary SQL (setup/teardown) on the target side #1131.

Example (create a custom index):

collector.export(
    "doc_embeddings",
    cocoindex.targets.Postgres(table_name="doc_embeddings"),
    primary_key_fields=["id"],
    attachments=[
        cocoindex.targets.PostgresSqlCommand(
            name="fts",
            setup_sql=(
                "CREATE INDEX IF NOT EXISTS doc_embeddings_text_fts "
                "ON doc_embeddings USING GIN (to_tsvector('english', text));"
            ),
            teardown_sql="DROP INDEX IF EXISTS doc_embeddings_text_fts;",
        )
    ],
)

Check the full documentation here.

Neo4j Improvements

  • Added support for vector index methods. #1111
  • Changed the KEY constraint to UNIQUE in Neo4j mappings for compatibility with the Community Edition. #983

New LLM integrations - Add AWS Bedrock LLM Support

CocoIndex provides built-in functions that integrate with various LLM APIs for a range of inference tasks. In this release we’ve added AWS Bedrock LLM support, alongside the major commercial and open-source LLMs and APIs already supported. See full documentation here.

Native Functions

CocoIndex’s native building blocks for codebase indexing received major upgrades this month, driven by extensive community feedback. We’ve refined language detection, splitting strategies, and parsing robustness for diverse codebases.

  • DetectProgrammingLanguage automatically identifies programming languages from file content or extensions, improving indexing precision across mixed-language repositories.
  • SplitRecursively has been improved: its tree-sitter based chunking now offers better Markdown handling, richer punctuation handling for plain text, configurable chunk overlap, and support for Solidity.
  • SplitBySeparators is a lightweight regex-only splitter for simpler, high-performance text segmentation workflows.
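To give a feel for what regex-only splitting means, here is a minimal standalone sketch. It does not reproduce the actual SplitBySeparators signature or options; `split_by_separators` and its parameters are hypothetical.

```python
import re


def split_by_separators(
    text: str, separators: list[str], keep_empty: bool = False
) -> list[str]:
    """Split text on any of the given regex separators and trim each chunk."""
    # Non-capturing groups keep the separators out of the result of re.split.
    pattern = re.compile("|".join(f"(?:{sep})" for sep in separators))
    chunks = [chunk.strip() for chunk in pattern.split(text)]
    return chunks if keep_empty else [c for c in chunks if c]


text = "Intro paragraph.\n\nSecond paragraph.\n---\nThird section."
# Split on blank lines and on horizontal-rule lines made of three or more dashes.
chunks = split_by_separators(text, [r"\n\n+", r"\n-{3,}\n"])
```

Because no parsing is involved, this style of splitter is cheap and predictable, which suits plain text where syntax-aware chunking is unnecessary.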

To see how to build customized codebase indexing, you can get started here.

New Blogs and examples:

  • Incrementally Transform Structured + Unstructured Data from Postgres with AI: This blog introduces the new PostgreSQL source and shows how to take data from a PostgreSQL table as the source, transform it with both AI models and non-AI calculations, and write the results into a new PostgreSQL table for semantic + structured search.

  • Automated invoice processing with AI, Snowflake and CocoIndex - with incremental processing (https://cocoindex.io/blogs/etl-to-snowflake): This blog shows how to automate invoice processing using CocoIndex, an open-source ETL framework built for AI-powered data transformation. It walks through how to extract supplier invoices from Azure Blob Storage, use LLMs to read and structure the data, and then load everything into Snowflake, all with incremental processing for scalability and data freshness.

  • Build a Visual Document Index from multiple formats all at once - PDFs, Images, Slides - with ColPali: This blog walks through how to use ColPali to extract text from multiple formats, and then use CocoIndex to index the text into a visual document index.

Thanks to the Community 🤗🎉

Welcome new contributors to the CocoIndex community! We are so excited to have you!

@MrAnayDongre

Thanks @MrAnayDongre for the work on SplitBySeparators regex-only splitter #1010, enabling simpler and high-performance text segmentation workflows.

@lemorage

Thanks @lemorage for adding explicit type hints in engine value encoding for the Python SDK #807, making value serialization more robust and type-safe.

@banrovegrie

Thanks @banrovegrie for fixing the Gemini embedding configuration and exposing Postgres index tuning #1050, improving integration reliability.

@thisisharsh7

Thanks @thisisharsh7 for the work on:

  • adding the --reset flag for setup, update, and server commands #1106
  • improving Postgres schema support #1138
  • enhancing error messages when no flows are registered #1070
  • making example paths OS-friendly #1066

These updates improve usability, reliability, and cross-platform compatibility, making workflows smoother and more robust for all users.

@Davda-James

Thanks @Davda-James for the work on:

  • binding Pydantic models to CocoIndex Struct #1072
  • reorganizing the Python SDK package structure #1081, #1082
  • generating full CLI docs and adding them to pre-commit hooks #1096
  • collecting additional row processing counters for stats #1105
  • and splitting convert.py into modular components #1120

These updates improve developer productivity, maintainability, and observability, making the Python SDK and CLI easier to use, extend, and monitor for large-scale workflows.

@princyballabh

Thanks @princyballabh for auto-applying safe defaults for missing fields in load_engine_object #1104, improving reliability and reducing errors when loading engine objects.

@shresthashim

Thanks @shresthashim for implementing __str__ and __repr__ methods for Python schema classes #1095, enhancing developer experience when working with schema objects.

@belloibrahv

Thanks @belloibrahv for the work on:

  • Adding AWS Bedrock LLM #1173
  • Neo4j vector index methods #1111
  • Field-level descriptions to FieldSchema #1087

These updates expand integration capabilities, improve graph database support, and enhance schema clarity.

@mensonones

Thanks @mensonones for updating documentation to use cocoindex update --setup main #1093, making setup instructions clearer.

@esther-anierobi

Thanks @esther-anierobi for improving documentation navigation #1136.

@TheVijayVignesh

Thanks @TheVijayVignesh for moving built-in sources to top-level navigation in the docs #1119, improving documentation structure.

@aryasoni98

Thanks @aryasoni98 for adding tests for json_schema.rs #1133, improving code reliability.

@skalwaghe-56

Thanks @skalwaghe-56 for adding the legacy-states-v0 Cargo feature for backward compatibility #1135, making it easier to maintain legacy state logic.

@siddharthbaleja7

Thanks @siddharthbaleja7 for flattening recursion in SplitRecursively to prevent stack overflow #1127, improving stability.

Support us

We are constantly improving CocoIndex, and more features are coming soon! Stay tuned and follow us by starring our GitHub repo.