# CocoIndex Changelog 1.0.1 - 1.0.7

> CocoIndex's first post-v1 releases: stable memoization keys, scheduled live refresh, scoped stats, safer SQL connectors, and more integrations.

Published: 2026-06-01 · Canonical: https://cocoindex.io/blogs/changelog-101-107/

CocoIndex reached [v1.0](https://github.com/cocoindex-io/cocoindex) at the start of this cycle.

This is the first changelog since the v1 launch, covering seven releases (1.0.1–1.0.7). The focus is on making the engine easier to operate in production pipelines, then connecting it to more of the systems where teams already store data.

CocoIndex builds fresh, structured context for AI from sources such as PDFs, codebases, emails, screenshots, and meeting notes, and keeps it up to date with an [incremental engine](https://cocoindex.io/docs/programming_guide/core_concepts/). The source is on [GitHub](https://github.com/cocoindex-io/cocoindex).

## Why this release matters

For developers building retrieval, extraction, and knowledge-graph pipelines, the most significant changes are not the new connectors but the engine work that makes long-running, expensive, and shared pipelines easier to reason about:

- **Per-argument [memoization](https://cocoindex.io/docs/advanced_topics/memoization_keys/) keys.** Ignore clients, loggers, debug flags, and other handles that should not invalidate cached LLM calls or transforms, so recomputation depends only on the inputs that matter.
- **Scheduled refresh as an engine primitive.** `coco.auto_refresh` turns the common "poll this source every few minutes" pattern into a live component with consistent error handling and target-state reconciliation.
- **Per-slice stats.** `coco.stats_group(...)` breaks the `adds` / `reprocesses` / `deletes` counts down by data slice (per tenant, project, or folder) so you can see what each slice is doing (growth, churn, an unexpected reprocess), not just one aggregate per processor.
- **More production backends.** New and upgraded connectors cover graph stores, vector stores, streaming systems, and OCI Object Storage without changing CocoIndex's [declare-target-state model](https://cocoindex.io/docs/programming_guide/target_state/).
- **Correctness and security fixes.** This cycle closed SQL identifier injection paths and fixed concurrency and cancellation edge cases that appear only with real databases and long-running work.

## Engine

The engine work this cycle centers on four areas: predictable recomputation, first-class live refresh, scoped observability, and correctness under real I/O latency.

### Per-argument memoization control

`@coco.fn` now accepts a per-argument **`memo_key`**, giving you fine-grained control over which arguments participate in the memoization cache key.

Why it matters: production functions often take more than data. They also take clients, handles, config objects, loggers, tracing context, or debug flags. Those values can change across runs even when the meaningful input is identical. With `memo_key`, you can keep the cache keyed to the semantic input and avoid rerunning expensive transforms, embeddings, or LLM calls for the wrong reason.

The API maps parameter names to either a callable (transform the value before fingerprinting) or `None` (exclude the parameter entirely):

```python
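import cocoindex as coco

# Key on (name, version) of `entry`; exclude `extra` from the memo key entirely.
@coco.fn(memo=True, memo_key={"entry": lambda e: (e.name, e.version), "extra": None})
def transform(entry: SourceDataEntry, extra: str) -> str:
    ...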
```

- **Callable** → applied to the argument; its return value is fingerprinted in place of the original.
- **`None`** → the parameter is excluded from the memo key; changing it never invalidates the cache.
- **Not listed** → fingerprinted normally.
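
To make the three rules concrete, here is a minimal pure-Python sketch of how a per-argument key could be derived. It is an illustration only, not CocoIndex's implementation: the `build_memo_key` helper and the use of `repr` plus SHA-256 for fingerprinting are assumptions made for this example.

```python
import hashlib
import inspect
from typing import Any, Callable, Mapping, Optional


def build_memo_key(
    fn: Callable[..., Any],
    memo_key: Mapping[str, Optional[Callable[[Any], Any]]],
    *args: Any,
    **kwargs: Any,
) -> str:
    """Hypothetical helper: fingerprint a call according to per-argument rules."""
    bound = inspect.signature(fn).bind(*args, **kwargs)
    bound.apply_defaults()
    parts = []
    for name, value in bound.arguments.items():
        if name in memo_key:
            rule = memo_key[name]
            if rule is None:
                continue  # excluded: never affects the key
            value = rule(value)  # callable: fingerprint the transformed value
        parts.append((name, repr(value)))  # not listed: fingerprinted as-is
    return hashlib.sha256(repr(parts).encode("utf-8")).hexdigest()


def transform(entry: dict, extra: str) -> str:
    return entry["name"]


rules = {"entry": lambda e: (e["name"], e["version"]), "extra": None}
a = build_memo_key(transform, rules, {"name": "doc", "version": 1, "mtime": 10}, "debug")
b = build_memo_key(transform, rules, {"name": "doc", "version": 1, "mtime": 99}, "trace")
c = build_memo_key(transform, rules, {"name": "doc", "version": 2, "mtime": 10}, "debug")
```

Here `a` and `b` are equal because only `mtime` and `extra` differ, and neither participates in the key. `c` differs because `version` changed.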

Read more in [Memoization keys & states → Override at the call site](https://cocoindex.io/docs/advanced_topics/memoization_keys/) ([**#1888**](https://github.com/cocoindex-io/cocoindex/pull/1888), [**#2000**](https://github.com/cocoindex-io/cocoindex/pull/2000)).

### `coco.auto_refresh` for live components

A common live-mode pattern is "run this processor on a fixed schedule." `coco.auto_refresh` now makes that pattern first-class.

Why it matters: without an engine-level primitive, every pipeline has to hand-roll a loop, decide how errors propagate, and remember how to reconcile missing rows. `coco.auto_refresh` wraps any processor function as a [live component](https://cocoindex.io/docs/programming_guide/live_mode/) that re-runs on an interval and routes cycle failures through one unified error-handling channel.

```python
import datetime
import cocoindex as coco

@coco.fn
async def app_main(db, target) -> None:
    await coco.mount(
        coco.auto_refresh(sync_users, interval=datetime.timedelta(minutes=5)),
        db, target,
    )
```

Each cycle reconciles target states against the previous cycle. If `sync_users` stops declaring a row, CocoIndex deletes the corresponding target automatically. See [Live components → periodic refresh](https://cocoindex.io/docs/advanced_topics/live_component/) ([**#1995**](https://github.com/cocoindex-io/cocoindex/pull/1995)).
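
The reconciliation step can be pictured as a comparison between what the previous cycle declared and what the current one declares. The sketch below is a simplified pure-Python model of that idea. The `reconcile` function and the dictionary-based target are hypothetical and do not reflect CocoIndex internals.

```python
from typing import Dict, List


def reconcile(previous: Dict[str, str], declared: Dict[str, str]) -> Dict[str, List[str]]:
    """Hypothetical model: diff two cycles of declared target states."""
    return {
        # Declared now, absent before.
        "adds": sorted(k for k in declared if k not in previous),
        # Declared in both cycles, but with a different value.
        "updates": sorted(
            k for k in declared if k in previous and declared[k] != previous[k]
        ),
        # Declared before, no longer declared: the target row should go away.
        "deletes": sorted(k for k in previous if k not in declared),
    }


cycle_1 = {"user:1": "alice", "user:2": "bob"}
cycle_2 = {"user:1": "alice", "user:3": "carol"}  # user:2 is no longer declared

plan = reconcile(cycle_1, cycle_2)
```

In this model, `user:2` lands in `deletes` purely because the second cycle stopped declaring it. No explicit delete call is needed.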

### Scoped stats reports

By default, CocoIndex already breaks stats down **by processor**: each entry function (`process_doc`, `process_code`, …) reports its own `adds`, `reprocesses`, and `deletes` counts. What it can't do by default is break those counts down across **data slices**. `coco.stats_group(title)` opens a scope where everything mounted inside aggregates into a **separate** report under `title`, split out of the parent (no double counting).

Why it matters: the counts are the point. Seeing that one tenant did 600 reprocesses while another did 1,150 fresh adds tells you who is churning, who is growing, and where an unexpected reprocess storm came from. A single aggregate per processor hides all of that. The natural slice is the data: per tenant, per project, or per source folder. (The same breakdown also surfaces which slice dominates a slow or expensive run, but seeing the work each slice is doing is the primary win.)

```python
@coco.fn
async def app_main(tenants, target):
    for tenant in tenants:
        # same processor, but stats scoped per tenant
        with coco.stats_group(f"tenant:{tenant.id}", report_to_stdout=True):
            files = localfs.walk_dir(tenant.docs_dir, ...)
            await coco.mount_each(process_doc, files.items(), target)
```

The block also yields a handle with the same `stats()` and `watch()` methods as `UpdateHandle`, so you can stream a single slice's counts to a dashboard. See [Progress monitoring → Scoped reports](https://cocoindex.io/docs/advanced_topics/progress_monitoring/) ([**#2042**](https://github.com/cocoindex-io/cocoindex/pull/2042)).
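
The "split out of the parent, no double counting" behavior can be modeled in a few lines of plain Python. This is a conceptual sketch, not the CocoIndex implementation: the module-level `reports` dictionary, the `record` function, and this stand-in `stats_group` are all hypothetical names.

```python
from collections import Counter
from contextlib import contextmanager
from typing import Dict, Iterator, List

reports: Dict[str, Counter] = {"<root>": Counter()}
_scope_stack: List[str] = ["<root>"]


@contextmanager
def stats_group(title: str) -> Iterator[Counter]:
    """Hypothetical stand-in: route counts recorded inside the block to `title`."""
    reports.setdefault(title, Counter())
    _scope_stack.append(title)
    try:
        yield reports[title]
    finally:
        _scope_stack.pop()


def record(event: str, n: int = 1) -> None:
    # Counts go only to the innermost scope, so the parent is never double counted.
    reports[_scope_stack[-1]][event] += n


record("adds", 5)  # recorded outside any group: lands in the root report
with stats_group("tenant:a") as a_stats:
    record("reprocesses", 600)
with stats_group("tenant:b"):
    record("adds", 1150)
```

After this runs, the root report holds only the 5 adds recorded outside any group, and each tenant has its own report. A real implementation would need task-local scoping (for example via `contextvars`) to stay correct under concurrency, which this sketch ignores.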

### Other engine improvements

- **Configurable logging**: set the Python logger level directly via the `COCOINDEX_LOG_LEVEL` environment variable ([**#2035**](https://github.com/cocoindex-io/cocoindex/pull/2035)).
- **Cleaner Rust internals**: the workspace was unified and the core SDK isolated from PyO3, separating the engine from its Python bindings and laying the groundwork for native Rust consumers of the engine ([**#1973**](https://github.com/cocoindex-io/cocoindex/pull/1973)).

## Built-in operations

### LiteLLM speech-to-text

CocoIndex added **speech-to-text (STT)** support through [LiteLLM](https://cocoindex.io/docs/ops/litellm/) ([**#1889**](https://github.com/cocoindex-io/cocoindex/pull/1889)), so audio sources can be transcribed inside a pipeline using any LiteLLM-backed STT provider. This extends CocoIndex's multimodal reach from images and PDFs into audio. A `LiteLLMTranscriber` wraps the transcription API:

```python
from cocoindex.ops.litellm import LiteLLMTranscriber

transcriber = LiteLLMTranscriber("whisper-1")
transcript = await transcriber.transcribe(audio_file)
```

### Code splitter: eight new languages

The [`RecursiveSplitter`](https://cocoindex.io/docs/ops/text/) gained [tree-sitter](https://cocoindex.io/blogs/index-codebase-v1) support for **eight new languages** this cycle, so syntax-aware chunking now covers a much wider slice of real codebases and config:

- **Svelte** and **Vue**: component-aware chunking for frontend code ([**#1937**](https://github.com/cocoindex-io/cocoindex/pull/1937))
- **Julia**: for scientific and numerical codebases ([**#1942**](https://github.com/cocoindex-io/cocoindex/pull/1942))
- **Elm** ([**#1955**](https://github.com/cocoindex-io/cocoindex/pull/1955)) and **Astro** ([**#1984**](https://github.com/cocoindex-io/cocoindex/pull/1984))
- **Bash, CMake, and HCL**: shell scripts, build files, and Terraform/infra configs ([**#1954**](https://github.com/cocoindex-io/cocoindex/pull/1954))

Pass the language name to `language=` (or let [`detect_code_language()`](https://cocoindex.io/docs/ops/text/) infer it from a filename):

```python
from cocoindex.ops.text import RecursiveSplitter

splitter = RecursiveSplitter()
chunks = splitter.split(source_code, chunk_size=2000, chunk_overlap=200, language="svelte")
```

Both new operations, audio transcription and syntax-aware chunking for more languages, run inside a pipeline on the same incremental engine.

### Entity resolution: parallel by default

[Entity resolution](https://cocoindex.io/docs/ops/entity_resolution/) (dedup-and-canonicalize a set of extracted entity names) now resolves independent connected components of the candidate graph in parallel, speeding up LLM-backed deduplication on large entity sets ([**#2006**](https://github.com/cocoindex-io/cocoindex/pull/2006)).
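
The idea of resolving independent components concurrently can be sketched with a union-find pass followed by `asyncio.gather`. This is an illustration under stated assumptions, not the code from the pull request: `resolve_component` is a placeholder that picks the longest name as canonical where the real operation would call an LLM, and all function names here are hypothetical.

```python
import asyncio
from typing import Dict, Iterable, List, Set, Tuple


def connected_components(
    names: Iterable[str], pairs: Iterable[Tuple[str, str]]
) -> List[Set[str]]:
    """Group names into components using union-find over candidate pairs."""
    parent: Dict[str, str] = {n: n for n in names}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for n in parent:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


async def resolve_component(component: Set[str]) -> Dict[str, str]:
    """Placeholder resolver: the longest name in the component becomes canonical."""
    await asyncio.sleep(0)  # stands in for the latency of a model call
    canonical = max(sorted(component), key=len)
    return {name: canonical for name in component}


async def resolve_all(names: List[str], pairs: List[Tuple[str, str]]) -> Dict[str, str]:
    components = connected_components(names, pairs)
    # Components share no candidate edges, so they can be resolved concurrently.
    results = await asyncio.gather(*(resolve_component(c) for c in components))
    merged: Dict[str, str] = {}
    for mapping in results:
        merged.update(mapping)
    return merged


names = ["IBM", "I.B.M.", "International Business Machines", "Acme", "Acme Corp"]
pairs = [("IBM", "I.B.M."), ("IBM", "International Business Machines"), ("Acme", "Acme Corp")]
mapping = asyncio.run(resolve_all(names, pairs))
```

Because no candidate pair links the IBM names to the Acme names, the two groups never need to see each other, which is what makes the parallel split safe.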

## Connectors

This cycle adds new streaming sources and target connectors.

### Source: OCI Object Storage with live bucket watching

A new [**Oracle Cloud Infrastructure (OCI) Object Storage**](https://cocoindex.io/docs/connectors/oci_object_storage/) source mirrors the S3 source's API for scanning a bucket, and adds an optional live mode.

It's a good illustration of the **stream–state duality** in CocoIndex's source model. A bucket is *state*: `list_objects()` scans it and establishes the full set of objects. A change feed is a *stream*: when you supply a `live_stream`, CocoIndex does the initial scan once and then keeps watching, applying `createobject` / `updateobject` / `deleteobject` events incrementally instead of rescanning. For OCI those events flow through OCI Streaming, consumed over the [Kafka](https://cocoindex.io/blogs/csv-to-kafka-live) protocol; an event-time cutoff with a 5-second clock-skew tolerance decides which streamed events predate the scan. Under the hood it's built on a new lower-level `LiveStream` abstraction, a keyless stream of messages with in-memory watermark tracking and ack-on-completion ([**#1905**](https://github.com/cocoindex-io/cocoindex/pull/1905)).
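
As a rough sketch of the cutoff idea, the snippet below filters streamed events against the scan start time. The 5-second tolerance comes from the description above. Everything else is an assumption for illustration: the `ObjectEvent` type, the `events_to_apply` function, and in particular the choice to re-apply events that fall inside the skew window instead of dropping them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

CLOCK_SKEW_TOLERANCE = timedelta(seconds=5)  # value taken from the text above


@dataclass(frozen=True)
class ObjectEvent:
    kind: str  # "createobject", "updateobject", or "deleteobject"
    key: str
    event_time: datetime


def events_to_apply(events: List[ObjectEvent], scan_started_at: datetime) -> List[ObjectEvent]:
    """Assumed rule: skip events that clearly predate the scan, keep the rest."""
    cutoff = scan_started_at - CLOCK_SKEW_TOLERANCE
    return [e for e in events if e.event_time >= cutoff]


scan_started_at = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
events = [
    # 30 s before the scan: already reflected in the scan result, so skipped.
    ObjectEvent("createobject", "a.pdf", scan_started_at - timedelta(seconds=30)),
    # 3 s before the scan: inside the skew window, so applied to be safe.
    ObjectEvent("updateobject", "b.pdf", scan_started_at - timedelta(seconds=3)),
    # After the scan: a genuine incremental change.
    ObjectEvent("deleteobject", "c.pdf", scan_started_at + timedelta(seconds=10)),
]
kept = [e.key for e in events_to_apply(events, scan_started_at)]
```

Re-applying an event inside the window costs at most some redundant work, whereas dropping one that the scan actually missed would leave the target stale, so this sketch errs on the side of re-applying.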

This is the same model CocoIndex already uses for the [local file system](https://cocoindex.io/docs/connectors/localfs/), where the change stream comes from OS file-system events rather than Kafka. More sources, such as S3, will gain live mode over time.

### Source: Apache Iggy

CocoIndex now reads from **Apache Iggy**, the high-throughput, low-latency persistent message-streaming platform, bringing another streaming backbone into the source ecosystem ([**#1969**](https://github.com/cocoindex-io/cocoindex/pull/1969)).

### Target: Turbopuffer

A new [**Turbopuffer**](https://cocoindex.io/docs/connectors/turbopuffer/) target connector brings the serverless, object-storage-backed vector + full-text search database into CocoIndex pipelines, with a text embedding example included out of the box ([**#1934**](https://github.com/cocoindex-io/cocoindex/pull/1934)).

### Target: Neo4j

A native [**Neo4j**](https://cocoindex.io/docs/connectors/neo4j/) property-graph target landed alongside a `meeting_notes_graph_neo4j` example, mapping nodes and relationships into Neo4j without writing Cypher by hand ([**#1932**](https://github.com/cocoindex-io/cocoindex/pull/1932)).

### Target: FalkorDB

The [**FalkorDB**](https://cocoindex.io/docs/connectors/falkordb/) property-graph target, a high-performance graph database built on Redis, was brought forward into the v1 connector set with full support for nodes, relationships, and vector/FTS indexes ([**#1908**](https://github.com/cocoindex-io/cocoindex/pull/1908)).

### Target: LanceDB

[LanceDB](https://cocoindex.io/docs/connectors/lancedb/) got two upgrades for production workloads:

- The v1 target now **optimizes (compacts) tables periodically** in the background, keeping query performance steady as data churns ([**#2008**](https://github.com/cocoindex-io/cocoindex/pull/2008), with a scheduling fix in [**#2013**](https://github.com/cocoindex-io/cocoindex/pull/2013)).
- Table targets can **add new columns in place**, so schema evolution doesn't require a rebuild ([**#1951**](https://github.com/cocoindex-io/cocoindex/pull/1951)).

## Bug fixes

This cycle also includes correctness and security fixes.

### Postgres correctness

- `halfvec` op classes are now used for indexes on half-precision vectors ([**#2029**](https://github.com/cocoindex-io/cocoindex/pull/2029)).
- `U+0000` (NUL) bytes are stripped when writing `text`/`jsonb`, so Postgres no longer rejects otherwise-valid payloads ([**#2032**](https://github.com/cocoindex-io/cocoindex/pull/2032)).
- The `pgvector` extension now installs into the default schema, avoiding `search_path` surprises ([**#1979**](https://github.com/cocoindex-io/cocoindex/pull/1979)).
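
For the NUL fix in particular, the underlying constraint is that Postgres `text` values cannot contain `U+0000` and `jsonb` rejects the `\u0000` escape. A minimal sketch of this kind of sanitization is shown below. The `strip_nul` helper is hypothetical and is not the connector's code.

```python
from typing import Any


def strip_nul(value: Any) -> Any:
    """Recursively remove U+0000 from every string in a JSON-like value."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    if isinstance(value, list):
        return [strip_nul(v) for v in value]
    if isinstance(value, dict):
        return {strip_nul(k): strip_nul(v) for k, v in value.items()}
    return value  # numbers, booleans, and None pass through unchanged


payload = {"title": "report\x00", "tags": ["a\x00b", "c"], "pages": 3}
clean = strip_nul(payload)
```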

### Other fixes

- **SQL identifier validation**: the Postgres and SQLite connectors now validate table and column identifiers before interpolating them into statements, closing a class of SQL-injection vectors at the connector boundary ([**#1947**](https://github.com/cocoindex-io/cocoindex/pull/1947), [**#1965**](https://github.com/cocoindex-io/cocoindex/pull/1965)).
- **Ownership-transfer race**: a pending-state protocol closes a preempt race so target-state ownership transfer stays correct under Postgres I/O latency, not just on microsecond-fast LMDB ([**#1994**](https://github.com/cocoindex-io/cocoindex/pull/1994)).
- **Clean cancellation**: cancellation propagates through task spawn boundaries, tearing work down cleanly instead of leaking orphaned tasks ([**#1902**](https://github.com/cocoindex-io/cocoindex/pull/1902)).
- **`on_error` cascade**: errors cascade through the Build-mode GC sweep ([**#1999**](https://github.com/cocoindex-io/cocoindex/pull/1999)).
- **numpy serde**: `_frombuffer` is registered under both numpy 1.x and 2.x paths ([**#2012**](https://github.com/cocoindex-io/cocoindex/pull/2012)).
- **Responsive progress display**: the PTY reader moved onto a dedicated OS thread, keeping the async runtime responsive under heavy logging ([**#2033**](https://github.com/cocoindex-io/cocoindex/pull/2033), [**#2040**](https://github.com/cocoindex-io/cocoindex/pull/2040)).
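
To illustrate the identifier-validation fix from the list above, here is a minimal allow-list check applied before an identifier is interpolated into SQL. It is a sketch of the general technique only: the exact rules used in the Postgres and SQLite connectors may differ, and `validate_identifier` and `build_select` are hypothetical names.

```python
import re
from typing import List

_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_MAX_LENGTH = 63  # Postgres truncates identifiers longer than 63 bytes by default


def validate_identifier(name: str) -> str:
    """Hypothetical check: accept only plain alphanumeric/underscore identifiers."""
    if len(name) > _MAX_LENGTH or not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"invalid SQL identifier: {name!r}")
    return name


def build_select(table: str, columns: List[str]) -> str:
    # Identifiers cannot be bound as query parameters, so they are validated
    # first and only then interpolated into the statement text.
    cols = ", ".join(f'"{validate_identifier(c)}"' for c in columns)
    return f'SELECT {cols} FROM "{validate_identifier(table)}"'


query = build_select("documents", ["id", "embedding"])

try:
    build_select('docs"; DROP TABLE users; --', ["id"])
    rejected = False
except ValueError:
    rejected = True
```

Values should still go through bound parameters. Validation like this covers only the table and column names that parameters cannot express.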

## Build with CocoIndex

### Meeting notes → knowledge graph, now on Neo4j

The popular **meeting-notes-to-knowledge-graph** example now ships in a Neo4j flavor ([`meeting_notes_graph_neo4j`](https://github.com/cocoindex-io/cocoindex/tree/main/examples/meeting_notes_graph_neo4j)). It watches a folder of meeting notes, uses LLM extraction to pull out structured entities (Meetings, People, Tasks) and the relationships between them, then maps everything into Neo4j as nodes and edges. Change one note and CocoIndex reprocesses only that note, keeping the graph continuously in sync.

### Turbopuffer and OCI text-embedding examples

Two more end-to-end examples landed alongside the new connectors: [`text_embedding_turbopuffer`](https://github.com/cocoindex-io/cocoindex/tree/main/examples/text_embedding_turbopuffer) shows a complete embed-and-query pipeline against Turbopuffer, and [`oci_object_storage_embedding`](https://github.com/cocoindex-io/cocoindex/tree/main/examples/oci_object_storage_embedding) demonstrates ingesting from OCI Object Storage with live bucket watching. Both are ready to clone and run.

### Watch: build it end to end on FalkorDB

Follow along with the [Build with CocoIndex walkthrough](https://www.youtube.com/watch?v=r1eRG8JPMJM), which builds the meeting-notes-to-knowledge-graph pipeline end to end on FalkorDB, then clone the [`meeting_notes_graph_falkordb`](https://github.com/cocoindex-io/cocoindex/tree/main/examples/meeting_notes_graph_falkordb) example to run it yourself.

## Summary

This cycle focused on operating the v1 engine in production: finer control over memoization, scheduled live refresh, and scoped observability, plus correctness and security fixes, including a concurrency race that appeared only under real database latency and SQL-injection hardening in the connectors. It also broadened reach with new sources (OCI Object Storage, Apache Iggy), new targets (Turbopuffer, Neo4j, FalkorDB), LanceDB background optimization and in-place column additions, audio transcription via LiteLLM, and eight new languages for the code splitter.

For the complete list of changes, see the [GitHub releases](https://github.com/cocoindex-io/cocoindex/releases). If CocoIndex is useful to you, consider starring the [repository](https://github.com/cocoindex-io/cocoindex).

## Thanks to the community

Thanks to the contributors who shipped changes this cycle.

### @prrao87

Thanks [@prrao87](https://github.com/prrao87) for [fixing LanceDB optimize scheduling](https://github.com/cocoindex-io/cocoindex/pull/2013), ensuring background compaction runs on the right cadence.

### @zherendong

Thanks [@zherendong](https://github.com/zherendong) for [parallelizing entity resolution via candidate-graph components](https://github.com/cocoindex-io/cocoindex/pull/2006), speeding up LLM-backed deduplication on large entity sets.

### @countradooku

Thanks [@countradooku](https://github.com/countradooku) for adding the [Apache Iggy connector](https://github.com/cocoindex-io/cocoindex/pull/1969), bringing another high-throughput streaming backbone into the source ecosystem.

### @Haleshot

Thanks [@Haleshot](https://github.com/Haleshot) for keeping the docs in shape after the v1 launch: [removing the stale `v1` mention from the install command](https://github.com/cocoindex-io/cocoindex/pull/1924), [fixing a broken URL](https://github.com/cocoindex-io/cocoindex/pull/1958), and [updating stale `v1` branch links to `main`](https://github.com/cocoindex-io/cocoindex/pull/1959).

### @galshubeli

Thanks [@galshubeli](https://github.com/galshubeli) for bringing the [FalkorDB target connector](https://github.com/cocoindex-io/cocoindex/pull/1908) into the v1 connector set, a complete property-graph integration with nodes, relationships, and vector/FTS indexes.

### @Gohlub

Thanks [@Gohlub](https://github.com/Gohlub) for adding [LiteLLM speech-to-text support](https://github.com/cocoindex-io/cocoindex/pull/1889), extending CocoIndex's multimodal reach into audio transcription.

### @MrAnayDongre

Thanks [@MrAnayDongre](https://github.com/MrAnayDongre) for adding [per-argument `memo_key` support](https://github.com/cocoindex-io/cocoindex/pull/1888) to `@coco.fn`, giving fine-grained control over what participates in the memoization cache key.

### @aaronjmars

Thanks [@aaronjmars](https://github.com/aaronjmars) for [validating SQL identifiers in the Postgres and SQLite connectors](https://github.com/cocoindex-io/cocoindex/pull/1947), hardening CocoIndex against SQL injection at the connector boundary.

### @tuanaiseo

Thanks [@tuanaiseo](https://github.com/tuanaiseo) for [reporting and fixing a potential SQL injection in the Postgres connector](https://github.com/cocoindex-io/cocoindex/pull/1965), a responsible-disclosure win for the whole community.

### @nuthalapativarun

Thanks [@nuthalapativarun](https://github.com/nuthalapativarun) for a prolific cycle of splitter work: adding [Elm](https://github.com/cocoindex-io/cocoindex/pull/1955), [Astro](https://github.com/cocoindex-io/cocoindex/pull/1984), and [Bash, CMake, and HCL](https://github.com/cocoindex-io/cocoindex/pull/1954) tree-sitter support, plus [adding context to Rust→Python error messages](https://github.com/cocoindex-io/cocoindex/pull/1986) and [expanding the supported-languages docs](https://github.com/cocoindex-io/cocoindex/pull/1953).

### @qWaitCrypto

Thanks [@qWaitCrypto](https://github.com/qWaitCrypto) for making the [LanceDB v1 target optimize tables periodically](https://github.com/cocoindex-io/cocoindex/pull/2008) and for [adding columns in place](https://github.com/cocoindex-io/cocoindex/pull/1951), keeping LanceDB performant under churn and enabling schema evolution without rebuilds.

### @shaiar

Thanks [@shaiar](https://github.com/shaiar) for [fixing serde `_frombuffer` registration across numpy 1.x and 2.x](https://github.com/cocoindex-io/cocoindex/pull/2012), keeping deserialization correct across numpy versions.

### @octo-patch

Thanks [@octo-patch](https://github.com/octo-patch) for [adding Python formatting and linting commands to CLAUDE.md](https://github.com/cocoindex-io/cocoindex/pull/1855), smoothing the contributor onboarding path.

### @phuctoan123

Thanks [@phuctoan123](https://github.com/phuctoan123) for [documenting `gws` for Google Drive setup](https://github.com/cocoindex-io/cocoindex/pull/1949), making the Drive source easier to configure.
