Linghua Jin
CocoIndex Maintainer

AI-Native Data Pipeline - Why We Made It

· 7 min read

The need for open data infrastructure for AI is greater than ever.

From data for humans to data for AI

Traditionally, data frameworks were built to prepare data for humans. Over the years, we’ve seen massive progress in analytics-focused data infrastructure. Platforms like Spark and Flink fundamentally changed how the world processes and transforms data at scale.

But with the rise of AI, entirely new needs — and new capabilities — have emerged. A new generation of data transformations is now required to support AI-native workloads.

Index PDF elements - text, images with mixed embedding models and metadata

· 7 min read

PDFs are rich with both text and visual content — from descriptive paragraphs to illustrations and tables. This example builds an end-to-end flow that parses, embeds, and indexes both, with full traceability to the original page.

In this example, we split out both text and images, link them back to page metadata, and enable unified semantic search. We’ll use CocoIndex to define the flow, SentenceTransformers for text embeddings, and CLIP for image embeddings — all stored in Qdrant for retrieval.

Bring your own data: Index any data with Custom Sources

· 7 min read

We’re excited to announce Custom Sources — a new capability in CocoIndex that lets you read data from any system you want. Whether it’s APIs, databases, file systems, cloud storage, or other external services, CocoIndex can now ingest data incrementally, track changes efficiently, and integrate seamlessly into your flows.

With this change, CocoIndex users are no longer limited to built-in connectors, targets, or prebuilt libraries. You can use CocoIndex for anything and rely on its robust incremental computing to build fresh knowledge for AI.

Custom sources are the perfect complement to custom targets, giving you full control over both ends of your data pipelines.

🚀 Get started with custom sources by following the documentation now.

Fast iterate your indexing strategy - trace back from query to data

· 4 min read

We are launching a major feature in both CocoIndex and CocoInsight that helps users iterate quickly on their indexing strategy and trace query results all the way back to the data, so that the transformation experience is more tightly integrated with the end goal.

We care deeply about making the overall experience seamless. With this launch, you can define query handlers so that you can easily run queries in tools like CocoInsight.

Incrementally Transform Structured + Unstructured Data from Postgres with AI

· 7 min read

PostgreSQL Product Indexing Flow

CocoIndex is one framework for building incremental data flows across structured and unstructured sources.

In CocoIndex, AI steps such as generating embeddings are just transforms in the same flow as your other transformations, e.g. data mappings and calculations.

Why One Framework for Structured + Unstructured?

  • One mental model: Treat files, APIs, and databases uniformly; AI steps are ordinary ops.
  • Incremental by default: Use an ordinal column to sync only changes; no fragile glue jobs.
  • Consistency: Embeddings are always derived from the exact transformed row state.
  • Operational simplicity: One deployment, one lineage view, fewer moving parts.

This post introduces the new PostgreSQL source and shows how to read data from a PostgreSQL table, transform it with both AI models and non-AI calculations, and write the results into a new PostgreSQL table for semantic + structured search.
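To make the "incremental by default" point concrete, here is a minimal sketch of ordinal-based change detection in plain Python. It is not CocoIndex's API or implementation: the `Row` type, the `ordinal` field, the `sync_changes` helper, and the dictionary used as a target are all hypothetical names chosen for illustration. The idea is that each source row carries a monotonically increasing marker (for example a last-modified timestamp), and a run only re-derives output for rows whose marker is newer than the last one seen.

```python
from dataclasses import dataclass


@dataclass
class Row:
    id: int
    name: str
    price: float
    ordinal: int  # hypothetical "last modified" marker that only ever increases


def sync_changes(rows, watermark, target):
    """Re-derive target entries only for rows newer than the watermark.

    Returns the new watermark and the number of rows that were processed.
    """
    processed = 0
    new_watermark = watermark
    for row in rows:
        if row.ordinal <= watermark:
            continue  # unchanged since the previous run, skip the transform
        # Stand-in for any transform (a calculation, an embedding call, ...).
        target[row.id] = {"label": f"{row.name} (${row.price:.2f})"}
        new_watermark = max(new_watermark, row.ordinal)
        processed += 1
    return new_watermark, processed


table = [Row(1, "pencil", 1.5, ordinal=10), Row(2, "notebook", 3.0, ordinal=12)]
target = {}

# First run: everything is new, so both rows are processed.
watermark, first_count = sync_changes(table, 0, target)

# One row changes upstream; the second run touches only that row.
table[0] = Row(1, "pencil", 1.75, ordinal=15)
watermark, second_count = sync_changes(table, watermark, target)
```

This sketch ignores deletions and rows that share the same ordinal value; a real pipeline would need to handle both.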

Build a Visual Document Index from multiple formats all at once - PDFs, Images, Slides - with ColPali

· 5 min read

Do you have a messy collection of scanned documents, PDFs, academic papers, presentation slides, and standalone images — all mixed together with charts, tables, and figures — that you want to process into the same vector space for semantic search or to power an AI agent?

In this example, we’ll walk through how to build a visual document indexing pipeline using ColPali for embedding both PDFs and images — and then query the index using natural language.
We’ll skip OCR entirely — ColPali can directly understand document layouts, tables, and figures from images, making it perfect for semantic search across visual-heavy content.

CocoIndex Changelog 2025-08-18

· 13 min read

CocoIndex Changelog 2025-08-15

We’ve shipped 20+ releases — packed with production-ready features, scalability upgrades, and runtime improvements. 🚀 Huge thanks to our amazing users for the feedback and for running CocoIndex at scale!

Index Images with ColPali: Multi-Modal Context Engineering

· 7 min read

We’re excited to announce that CocoIndex now supports native integration with ColPali — enabling multi-vector, patch-level image indexing using cutting-edge multimodal models.

With just a few lines of code, you can now embed and index images with ColPali’s late-interaction architecture, fully integrated into CocoIndex’s composable flow system.
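For readers unfamiliar with late interaction, the scoring idea can be sketched in a few lines of plain Python. This is an illustration of ColBERT-style MaxSim scoring over multi-vector embeddings, not CocoIndex or ColPali code: the function names and the tiny 2-dimensional vectors are made up for the example, and real embeddings have far more dimensions and one vector per query token or image patch.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, doc_vectors):
    """Late-interaction score.

    For each query vector, take the similarity of its best-matching
    document vector, then sum those maxima over the whole query.
    """
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Toy multi-vector embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    "page_b": [[0.1, 0.1], [0.3, 0.2]],
}

scores = {name: maxsim_score(query, vectors) for name, vectors in pages.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because every query vector is matched against every patch vector, an index for this kind of model stores a list of vectors per image rather than a single vector.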

Multi-Dimensional Vector Support in CocoIndex

· 6 min read

CocoIndex now provides robust and flexible support for typed vector data — from simple numeric arrays to deeply nested multi-dimensional vectors. This support is designed for seamless integration with high-performance vector databases such as Qdrant, and enables advanced indexing, embedding, and retrieval workflows across diverse data modalities.

Bring your own building blocks: Export anywhere with Custom Targets

· 8 min read

We’re excited to announce that CocoIndex now officially supports custom targets — giving you the power to export data to any destination, whether it's a local file, cloud storage, a REST API, or your own bespoke system.

This new capability unlocks a whole new level of flexibility for integrating CocoIndex into your pipelines and allows you to bring your own "building blocks" into our flow model.

Indexing Faces for Scalable Visual Search - Build your own Google Photo Search

· 5 min read

Face Detection

CocoIndex supports multi-modal processing natively: it can process both text and images with the same programming model, and you can observe both in the same user flow (in CocoInsight).

In this blog, we’ll walk through a comprehensive example of building a scalable face recognition pipeline using CocoIndex. We’ll show how to extract and embed faces from images, structure the data relationally, and export everything into a vector database for real-time querying.

CocoInsight can now visualize identified sections of an image based on their bounding boxes, which makes it easier to understand and evaluate AI extractions by seamlessly attaching computed features in the context of unstructured visual data.

Introducing CocoInsight

· 4 min read

From day zero, we envisioned CocoInsight as a fundamental companion to CocoIndex: not just a tool, but a philosophy of making data explainable, auditable, and actionable at every stage of the data pipeline with AI workloads. CocoInsight has been in private beta for a while, and it is one of the most loved features among our users building ETL with CocoIndex: it significantly boosts developer velocity and lowers the barrier to entry for data engineering.

We are officially launching CocoInsight today. It has zero pipeline data retention and connects to your on-premise CocoIndex server for pipeline insights. This makes data directly visible and makes ETL pipelines easier to develop.

Flow-based schema inference for Qdrant

· 7 min read

CocoIndex + Qdrant Automatic Schema Setup

CocoIndex supports Qdrant natively. The integration features a high-performance Rust stack with end-to-end incremental processing for scale and data freshness. 🎉 We just rolled out a change that automatically sets up the target schema in Qdrant from the CocoIndex indexing flow.

Build Real-Time Product Recommendation Engine with LLM and Graph Database

· 8 min read

Product Graph

In this blog, we will build a real-time product recommendation engine with an LLM and a graph database. In particular, we will use the LLM to understand the category (taxonomy) of a product. In addition, we will use the LLM to enumerate complementary products, i.e. products that users are likely to buy together with the current one (for example, a pencil and a notebook). We will use a graph to explore the relationships between products, which can be further used for product recommendations or labeling.
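As a rough illustration of the data shape involved, the sketch below builds a small in-memory product graph from LLM-style extraction results and reads recommendations off it. Everything here is assumed for the example: the `extracted` records stand in for what an LLM might return, the relation names `IN_CATEGORY` and `COMPLEMENTS` are hypothetical, and a plain dictionary stands in for the graph database.

```python
from collections import defaultdict

# Hypothetical LLM extraction output: one record per product.
extracted = [
    {"product": "pencil", "taxonomy": ["stationery", "writing"],
     "complementary": ["notebook", "eraser"]},
    {"product": "pen", "taxonomy": ["stationery", "writing"],
     "complementary": ["notebook"]},
    {"product": "notebook", "taxonomy": ["stationery", "paper"],
     "complementary": ["pencil"]},
]


def build_graph(records):
    """Map (relation, source product) to the set of target nodes."""
    edges = defaultdict(set)
    for record in records:
        for category in record["taxonomy"]:
            edges[("IN_CATEGORY", record["product"])].add(category)
        for other in record["complementary"]:
            edges[("COMPLEMENTS", record["product"])].add(other)
    return edges


def complementary(edges, product):
    """Products the extraction marked as bought together with `product`."""
    return sorted(edges.get(("COMPLEMENTS", product), set()))


def same_category(edges, product):
    """Other products that share at least one category with `product`."""
    categories = edges.get(("IN_CATEGORY", product), set())
    return sorted(
        source
        for (relation, source), targets in edges.items()
        if relation == "IN_CATEGORY" and source != product and targets & categories
    )


graph = build_graph(extracted)
bought_together = complementary(graph, "pencil")
similar = same_category(graph, "pencil")
```

In a graph database the same two lookups would typically be expressed as short traversal queries over the product, category, and complementary-product nodes.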