July 9, 2025 · 8 min read

Academic Papers Indexing

In this blog, we will walk through a comprehensive example of indexing research papers: extracting metadata beyond full-text chunking and embedding, and building semantic embeddings for indexing and querying.

Introducing CocoInsight

June 24, 2025 · 4 min read

Linghua Jin

From day zero, we envisioned CocoInsight as a fundamental companion to CocoIndex: not just a tool, but a philosophy of making data explainable, auditable, and actionable at every stage of the data pipeline for AI workloads. CocoInsight has been in private beta for a while. It is one of the most loved features among our users building ETL with CocoIndex, because it significantly boosts developer velocity and lowers the barrier to entry for data engineering.

We are officially launching CocoInsight today. It has zero pipeline data retention and connects to your on-premise CocoIndex server for pipeline insights. This makes your data directly visible and makes ETL pipelines easier to develop.

Getting Started

Start using it by running:

cocoindex server -ci main.py

in your CocoIndex project.

Overview

This is an example view for CocoInsight:

- The right panel shows the dataflow.
- The left panel shows a step-by-step data preview. Each field is tied to an input or output of a step in the dataflow transformation.

CocoInsight Panels

Inspect lineage

You can click on any field (in either the dataflow or the data preview), or on any transformation step in the dataflow, to inspect its lineage and understand where the data comes from.

The clicked element turns purple to mark it as the element being inspected.

Visibility:
- Data/ops with a transitive dependency (upstream or downstream) on the selected element stay in view.
- Data/ops unrelated to the currently selected element are dimmed.
Color:
- Direct upstream data dependency (exact fields) will be colored blue.
- Direct downstream data output (exact fields) will be colored green.
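
To make these rules concrete, here is a minimal sketch of how such highlighting could be computed. It is plain Python, not the CocoInsight implementation; the field names, the `EDGES` list, and the `highlight` function are hypothetical, and the "visible" and "dimmed" labels are our own shorthand for the visibility rules above.

```python
from collections import defaultdict

# Hypothetical dataflow: each edge points from an input field
# to a field that is derived directly from it.
EDGES = [
    ("filename", "extension"),
    ("content", "chunks"),
    ("extension", "chunks"),
    ("chunks", "embedding"),
]


def build_adjacency(edges):
    """Return (upstream, downstream) maps of direct dependencies."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return upstream, downstream


def reachable(start, adjacency):
    """All fields reachable from `start` by following `adjacency` transitively."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def highlight(selected, edges):
    """Assign a display state to every field for the selected element."""
    upstream, downstream = build_adjacency(edges)
    all_fields = {field for edge in edges for field in edge}
    related = reachable(selected, upstream) | reachable(selected, downstream)
    states = {}
    for field in all_fields:
        if field == selected:
            states[field] = "purple"   # the element being inspected
        elif field in upstream.get(selected, ()):
            states[field] = "blue"     # direct upstream dependency
        elif field in downstream.get(selected, ()):
            states[field] = "green"    # direct downstream output
        elif field in related:
            states[field] = "visible"  # transitive dependency, stays in view
        else:
            states[field] = "dimmed"   # unrelated to the selection
    return states
```

Selecting `extension` in this toy graph would color `filename` blue and `chunks` green, keep `embedding` in view, and dim `content`.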

Let's walk through some simple examples of how these AI pipelines work. You don't need to know how to write code; you just need to be comfortable reading a spreadsheet 😊.

Codebase Indexing Example

Ingest files, which outputs file names and contents.
Take the filename and extract the extension.
Take the content (source code) and the extension (which indicates the language, e.g., .py) and split the code along syntax boundaries with Tree-sitter.

You can also click on each chunk of a document to expand its details.
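
For readers who do want to see code, the three steps can be sketched in plain Python. This is only an illustration under stated assumptions, not the CocoIndex API: `extract_extension`, `split_code`, and `index_files` are hypothetical helpers, and the splitter is a naive stand-in that breaks Python files at top-level `def`/`class` lines instead of using Tree-sitter.

```python
import os


def extract_extension(filename: str) -> str:
    """Step 2: derive the extension (e.g. ".py") from the filename."""
    return os.path.splitext(filename)[1]


def split_code(content: str, extension: str) -> list[str]:
    """Step 3 (naive stand-in for Tree-sitter): split code into chunks.

    For ".py" files a new chunk starts at every top-level `def` or `class`
    line; any other extension falls back to splitting on blank lines.
    """
    if extension == ".py":
        chunks, current = [], []
        for line in content.splitlines():
            if line.startswith(("def ", "class ")) and current:
                chunks.append("\n".join(current))
                current = []
            current.append(line)
        if current:
            chunks.append("\n".join(current))
    else:
        chunks = content.split("\n\n")
    # Drop surrounding whitespace and empty chunks.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def index_files(files: dict[str, str]) -> list[dict]:
    """Step 1 stand-in: `files` maps each filename to its content."""
    rows = []
    for filename, content in files.items():
        extension = extract_extension(filename)
        rows.append(
            {
                "filename": filename,
                "extension": extension,
                "chunks": split_code(content, extension),
            }
        )
    return rows
```

Each returned row corresponds to one line of the data preview: the filename, the derived extension, and the list of chunks you can expand.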

Knowledge Graph Example

In this example, we process a list of files and generate a knowledge graph with documents and entities as nodes, and document-to-entity and entity-to-entity relationships as edges.

Some key steps:

Use an LLM to summarize a document.
Use an LLM to extract entities and the relationships between them.

Click on any relationship row to drill into its child table.
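
The shape of the resulting graph can be sketched as follows. This is an assumption-laden illustration rather than the actual flow: the LLM extraction step is replaced by a hand-written list of relationships, and the `Relationship` class, the `build_graph` function, and the `MENTIONS` edge label are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """One entity-to-entity relationship, as an LLM might extract it."""
    subject: str
    predicate: str
    object: str


def build_graph(doc_relationships):
    """Build (nodes, edges) from a mapping of document name -> relationships.

    Nodes are (label, name) pairs. Edges are (source, predicate, target)
    triples covering document-to-entity and entity-to-entity links.
    """
    nodes, edges = set(), set()
    for doc, relationships in doc_relationships.items():
        doc_node = ("Document", doc)
        nodes.add(doc_node)
        for rel in relationships:
            subject_node = ("Entity", rel.subject)
            object_node = ("Entity", rel.object)
            nodes.update([subject_node, object_node])
            # Document-to-entity edges (hypothetical "MENTIONS" label).
            edges.add((doc_node, "MENTIONS", subject_node))
            edges.add((doc_node, "MENTIONS", object_node))
            # Entity-to-entity edge carrying the extracted predicate.
            edges.add((subject_node, rel.predicate, object_node))
    return nodes, edges
```

In the CocoInsight view, each relationship row of a document corresponds to one `Relationship` here, which is why drilling into a row reveals a child table of its fields.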

How it works

At the core of CocoIndex, both data and data operations are first-class citizens.

Because of this pure dataflow foundation, CocoIndex offers full observability by default:

The before and after states of the data are available at every transformation node.
Every output field can be traced back to the exact set of input fields and operations that created it.
Lineage is first-class — not as metadata bolted on afterward, but as a structural property of how data is defined and transformed in the system.

This lineage model is not just useful for debugging — it enables features like incremental processing, intelligent caching, and transformation-level explainability, all out of the box.

While CocoIndex is architecturally a dataflow engine, its user experience is deeply inspired by spreadsheets. Just like in a spreadsheet:

Values of cells are derived from others through clearly visible formulas or expressions.
You can visually inspect how data looks before and after each transformation, cell by cell.
There’s no implicit global state, and every value can be explained in terms of its formula and input values.
Once the value of a source cell changes, derived cell values are updated automatically from their formulas with minimal reprocessing.
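
The spreadsheet analogy can be made concrete with a toy model. The sketch below is not how CocoIndex is implemented; the `Sheet` class and its methods are hypothetical, and it assumes every formula is defined after its inputs, so definition order doubles as a valid evaluation order.

```python
_MISSING = object()  # sentinel for "cell has no value yet"


class Sheet:
    """Toy spreadsheet: source cells plus derived cells with explicit formulas."""

    def __init__(self):
        self.values = {}      # cell name -> current value
        self.formulas = {}    # derived cell -> (function, input cell names)
        self.recomputed = []  # log of derived cells that were re-evaluated

    def define(self, name, func, inputs):
        """Define a derived cell. Its inputs must already have values."""
        self.formulas[name] = (func, tuple(inputs))
        self._recompute(name)

    def set_source(self, name, value):
        """Set a source cell and update only the cells that depend on it."""
        if name in self.values and self.values[name] == value:
            return  # nothing changed, so nothing to reprocess
        self.values[name] = value
        dirty = {name}
        # Formulas are stored in definition order, which (by assumption)
        # is a topological order of the dependency graph.
        for cell, (_, inputs) in self.formulas.items():
            if dirty.intersection(inputs) and self._recompute(cell):
                dirty.add(cell)

    def _recompute(self, name):
        """Re-evaluate one derived cell; return True if its value changed."""
        func, inputs = self.formulas[name]
        new_value = func(*(self.values[i] for i in inputs))
        self.recomputed.append(name)
        changed = self.values.get(name, _MISSING) != new_value
        self.values[name] = new_value
        return changed


sheet = Sheet()
sheet.set_source("filename", "main.py")
sheet.set_source("content", "print('hi')")
sheet.define("extension", lambda f: f.rsplit(".", 1)[-1], ["filename"])
sheet.define("length", len, ["content"])

sheet.recomputed.clear()
sheet.set_source("content", "print('hello')")
# Only "length" depends on "content", so "extension" is left untouched.
```

Every derived value here can be explained by its formula and inputs, and a change to one source cell re-evaluates only the cells downstream of it.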

This spreadsheet-inspired paradigm is more than a UI choice — it’s a cognitive model. It bridges the gap between low-code users and developers, allowing anyone familiar with spreadsheets to reason about data transformations intuitively.

We have lots of features planned for CocoInsight 😎, including query debugging, stats, and more. Stay tuned and join our Discord for any questions.

Flow-based schema inference for Qdrant

June 8, 2025 · 7 min read

Linghua Jin

CocoIndex + Qdrant Automatic Schema Setup

CocoIndex supports Qdrant natively. The integration features a high-performance Rust stack with end-to-end incremental processing for scale and data freshness. 🎉 We just rolled out a change that automatically sets up the target schema in Qdrant from a CocoIndex indexing flow.

CocoIndex + Kuzu: Real-time knowledge graph with Kuzu

June 3, 2025 · 5 min read

Linghua Jin

CocoIndex now provides native support for Kuzu as a target graph data store. This integration features a high-performance knowledge graph stack with real-time updates.

Real-time data transformation pipeline with Amazon S3 bucket, SQS and CocoIndex

May 29, 2025 · 6 min read

Linghua Jin

CocoIndex now provides native support for Amazon S3 as a data source. Additionally, CocoIndex integrates with Amazon Simple Queue Service (SQS), enabling true real-time incremental processing of your S3 data.

Build image search and query with natural language with vision model CLIP

May 20, 2025 · 8 min read

Linghua Jin

In this project, we will build an image search system that you can query with natural language. You can search for “a cute animal” or “a red car”, and the system returns visually relevant results with no manual tagging needed.

How to build index with text embeddings

May 19, 2025 · 4 min read

Linghua Jin

In this blog, we will build an index with text embeddings and query it with natural language. We keep it minimal and focus on the gist of the indexing flow.

Story of CocoIndex, at 1k stars 🎉

May 8, 2025 · 4 min read

Linghua Jin

CocoIndex got 1k stars

We have been working on CocoIndex, a real-time data framework for AI, for a while, with lots of excitement from the community. We officially crossed 1k stars earlier this week. Huge thanks to everyone who starred, forked, contributed, or shared the love ❤️!

Build Real-Time Product Recommendation Engine with LLM and Graph Database

May 7, 2025 · 8 min read

Linghua Jin

Product Graph

In this blog, we will build a real-time product recommendation engine with an LLM and a graph database. In particular, we will use the LLM to understand the category (taxonomy) of a product. In addition, we will use the LLM to enumerate complementary products that users are likely to buy together with the current product (for example, a pencil and a notebook). We will use the graph to explore relationships between products, which can then be used for product recommendations or labeling.

Build Real-Time Knowledge Graph For Documents with LLM

April 29, 2025 · 7 min read

Linghua Jin

Building Knowledge Graph for Documents with LLM

CocoIndex makes it easy to build and maintain knowledge graphs with continuous source updates. In this blog, we will process a list of documents (using the CocoIndex documentation as an example). We will use an LLM to extract relationships between the concepts in each document.
