Blog | CocoIndex

Turn Podcasts into a Knowledge Graph with LLM and CocoIndex

Build a pipeline that converts YouTube podcasts into a structured knowledge graph — extracting speakers, statements, and entities with LLM, then resolving duplicates with embeddings.

Examples

March 27, 2026

From Pickle to Type-Guided Deserialization: How We Made Python Serialization Safe and Automatic

How CocoIndex evolved from pickle to a type-guided serialization system that uses Python type hints to automatically choose the right serializer — no decorators or registration needed.

Insight

March 24, 2026

Building an Invisible Daemon: Architecture Patterns for Local Developer Tools

Patterns for building local daemons that start on first use, upgrade transparently, and shut down cleanly — learned from building cocoindex-code's semantic search daemon.

Best Practices

March 10, 2026

CocoIndex Changelog 0.3.27 - 0.3.34

Featuring five new target connectors, filesystem-level change detection, Python 3.14 free-threading, and smarter pipeline lifecycle management.

Changelog

February 17, 2026

A Leap Forward in Security: CocoIndex Joins GitHub Secure Open Source Fund

CocoIndex participated in the GitHub Secure Open Source Fund, strengthening our security practices for the AI data infrastructure that developers depend on. Here's what we learned and what changed.

Announcement, security, open-source, github

February 9, 2026

Building SEC EDGAR Financial Analytics with CocoIndex and Apache Doris

A multi-source pipeline that ingests SEC filings (TXT, JSON, PDF), scrubs PII, extracts topics, and powers hybrid search with CocoIndex + Apache Doris.

Examples, doris, hybrid-search, multi-source

February 5, 2026

Build a Self-Updating Wiki for Your Codebases with LLM

Automatically generates a wiki page for each project in your codebase, and keeps it fresh with incremental processing.

Examples

January 22, 2026

Slides-to-Speech: Turn your presentations into narrated content with CocoIndex and LanceDB

Turn slide decks into a continuously updated, searchable multimodal dataset. CocoIndex watches a Drive folder for new and modified files, extracts structured speaker notes, synthesizes narration, and keeps LanceDB in sync.

Examples

January 18, 2026

CocoIndex Changelog 0.3.11 - 0.3.26

Featuring production-ready resilience, structured error system, expanded integrations, and always-fresh structured context for agents operating in the real world.

Changelog

December 15, 2025

Extracting Structured Data from Patient Intake Forms with DSPy and CocoIndex

Continuously extract clean, typed, Pydantic-validated structured data directly from patient intake forms, using DSPy and CocoIndex. This tutorial demonstrates building scalable, production-grade AI pipelines with typed Pydantic validation, OCR vision models, and fast, incremental data processing.

Examples

December 8, 2025

Building a Knowledge Graph from Meeting Notes that automatically updates

Most companies sit on an ocean of meeting notes. Inside those documents are decisions, tasks, owners, and relationships: an untapped knowledge graph that is constantly changing. A full walkthrough of turning meeting notes in Google Drive into a live-updating Neo4j knowledge graph using CocoIndex and an LLM.

Examples, custom-source

December 2, 2025

Building a Real-Time HackerNews Trending Topics Detector with CocoIndex: A Deep Dive into Custom Sources and AI

Examples, custom-source

November 25, 2025

CocoIndex Changelog 0.2.21 - 0.3.10

Featuring batching support for CocoIndex functions, execution robustness, schema & type system improvements, custom source support, and more.

Changelog

November 25, 2025

Extract structured information from HackerNews with a Custom Source and keep it in sync with Postgres

Build a lightweight, incremental pipeline by treating any API as a data component: a custom incremental connector for HackerNews built with CocoIndex’s Custom Source API. Export the data to Postgres for semantic search and analytics.

Examples, custom-source, Feature

November 21, 2025

Extracting Intake Forms with BAML and CocoIndex

How to use BAML and CocoIndex to continuously extract structured data from patient intake forms in PDF/Word with an LLM, ready for production.

Examples

November 10, 2025

Adaptive Batching - 5x throughput on your data pipelines

Discover how CocoIndex delivers automatic batch processing for GPU workloads and machine learning pipelines. Framework-level batching optimizes performance for text embeddings and other AI operations without configuration.

Feature

October 29, 2025

AI-Native Data Pipeline - Why We Made It

Why the next wave of AI needs open source, scalable, and AI-native data infrastructure, and how CocoIndex is building the foundation for the future of intelligent data pipelines.

Insight

October 27, 2025

Index PDF elements - text, images with mixed embedding models and metadata

Extracts, embeds, and stores multimodal PDF elements (text with SentenceTransformers and images with CLIP) in a vector database for unified semantic search. Includes full metadata traceability, thumbnail generation, and flow definitions for building multimodal retrieval systems.

Examples, Feature

October 21, 2025

Bring your own data: Index any data with Custom Sources

CocoIndex now officially supports custom sources — giving you the power to read data from any system you want. You can use CocoIndex for anything, and enjoy the robust incremental computing to build fresh knowledge for AI.

Feature

October 19, 2025

CocoIndex Changelog 2025-10-19

Featuring significant optimizations for production-ready infrastructure (durable execution, efficient incremental processing over large datasets, GPU isolation, and more) and better support for native building blocks.

Changelog

October 11, 2025

Automated invoice processing with AI, Snowflake and CocoIndex - with incremental processing

Discover how to rapidly optimize your data indexing strategy with CocoIndex and CocoInsight. Learn to define query handlers, trace search results back to source data, and seamlessly integrate AI-powered transformations for efficient, end-to-end retrieval and analytics.

Examples

October 10, 2025

Thinking in Rust: Ownership, Access, and Memory Safety

A mental framework for Rust's memory safety concepts. Think systematically about ownership, references, Send, Sync, and Rc, Arc, RefCell, Mutex, etc.

Rust, Insight

September 21, 2025

Fast iterate your indexing strategy - trace back from query to data

Examples

September 1, 2025

Incrementally Transform Structured + Unstructured Data from Postgres with AI

Comprehensive walkthrough on using CocoIndex to build unified, incrementally updated search and analytics pipelines across both structured and unstructured data sources, using PostgreSQL tables as the main origin and target for indexed data.

Examples

August 20, 2025

Build a Visual Document Index from multiple formats all at once - PDFs, Images, Slides - with ColPali

Build a unified visual document index from multiple file formats, including PDFs, images, and slides, using CocoIndex and ColPali. No OCR needed.

Examples

August 18, 2025

CocoIndex Changelog 2025-08-18

Featuring production readiness, scalability, and reliability. More flexibility with customization and native integrations. Extended features for multi-modal pipelines and more.

Changelog

August 13, 2025

Control Processing Concurrency in CocoIndex

Learn how CocoIndex's layered concurrency control features help you optimize data processing performance, prevent system overload, and ensure stable, efficient pipelines at scale.

Feature

August 12, 2025

Index Images with ColPali: Multi-Modal Context Engineering

CocoIndex now supports native integration with ColPali — enabling multi-vector, patch-level image indexing.

Examples, Feature

August 10, 2025

Multi-Dimensional Vector Support in CocoIndex

CocoIndex natively handles typed multi-dimensional vectors, from simple arrays to multi-vector embeddings, unlocking multimodal AI pipelines at scale.

Feature

August 3, 2025

Bring your own building blocks: Export anywhere with Custom Targets

CocoIndex now officially supports custom targets — giving you the power to export data to any destination, whether it's a local file, cloud storage, a REST API, or your own bespoke system.

Examples, Feature

July 24, 2025

Indexing Faces for Scalable Visual Search - Build your own Google Photo Search

Build a scalable face detection and recognition pipeline using CocoIndex. This tutorial covers extracting and embedding faces from images, structuring data for visual search, and exporting to a vector database for face similarity queries.

Examples

July 9, 2025

Index academic papers and extract metadata for AI agents

How to index academic research papers by extracting metadata (e.g., title, authors, abstract) for AI agents and AI workflows using LLMs and CocoIndex

Examples

July 7, 2025

CocoIndex Changelog 2025-07-07

CocoIndex updates: in-process API, CLI improvements, EmbedText support, codebase indexing enhancements, and more.

Changelog

June 24, 2025

Introducing CocoInsight

CocoInsight is a platform for data lineage and data observability.

Feature

June 8, 2025

Flow-based schema inference for Qdrant

Automatic schema inference for Qdrant.

Feature

June 3, 2025

CocoIndex + Kuzu: Real-time knowledge graph with Kuzu

Integrate Kuzu with CocoIndex.

Examples, Feature

May 31, 2025

CocoIndex Changelog 2025-05-31

CocoIndex updates: Amazon S3 as a data source, updates on query handling, standalone, and more.

Changelog

May 29, 2025

Real-time data transformation pipeline with Amazon S3 bucket, SQS and CocoIndex

Build a real-time data transformation pipeline with S3 and CocoIndex.

Examples

May 20, 2025

Build image search and query with natural language with vision model CLIP

Index images in real time with CocoIndex and a vision model: create multi-modal embeddings and build a vector index for efficient retrieval.

Examples

May 19, 2025

How to build index with text embeddings

Index text with CocoIndex and text embeddings, and query it with natural language.

Examples

May 8, 2025

Story of CocoIndex, at 1k stars 🎉

CocoIndex is the world's first open-source engine that supports both custom transformation logic and incremental processing specialized for data indexing. We just crossed 1k stars, thank you so much!

Announcement, Changelog

May 7, 2025

Build Real-Time Product Recommendation Engine with LLM and Graph Database

Build a real-time product recommendation engine with LLM and graph database, from the aspect of product category (taxonomy) understanding.

Examples

April 30, 2025

CocoIndex Changelog 2025-04-30

CocoIndex updates: Knowledge Graphs, Qdrant, Supabase, KTable/LTable, and more LLM providers.

Changelog

April 29, 2025

Build Real-Time Knowledge Graph For Documents with LLM

CocoIndex now supports knowledge graphs with incremental processing. Building live knowledge for agents is easy with CocoIndex!

Examples

April 7, 2025

CocoIndex Changelog 2025-04-07

CocoIndex updates: Incremental processing with live update mode, evaluation utilities, support for date/time types, Google Drive, and assorted core/performance improvements

Changelog

April 7, 2025

Continuous update derived data on source updates, automatically

CocoIndex continuously watches source changes and keeps derived data in sync, with low latency and minimal performance overhead.

Feature

April 6, 2025

Incremental Processing with CocoIndex

CocoIndex keeps the index up to date with source changes, efficiently and with low latency, through incremental processing.

Feature

March 26, 2025

Structured Extraction from Patient Intake Form with LLM

Extract structured data from patient intake forms in PDF/Word with LLM by CocoIndex.

Examples

March 23, 2025

Build text embeddings from Google Drive for RAG

A tutorial on creating text embeddings from docs on Google Drive and saving them in vector stores for semantic search / RAG, using CocoIndex.

Examples

March 20, 2025

CocoIndex Changelog 2025-03-20

First release of CocoIndex Changelog: LLM support, codebase indexing, custom functions, and assorted core/performance improvements

Changelog

March 18, 2025

Build Real-Time Codebase Indexing for AI Code Generation

Index a codebase for RAG in real time with CocoIndex and Tree-sitter: chunking, embedding, semantic search, and building a vector index for efficient retrieval.

Examples

March 17, 2025

On-premise structured extraction with LLM using Ollama

Learn to use CocoIndex to extract structured data from PDF/Markdown with Ollama's local LLM models, all running on premise without sending data to external APIs.

Examples

March 3, 2025

We are officially open sourced! 🎉

CocoIndex is the world's first open-source engine that supports both custom transformation logic and incremental processing specialized for data indexing. We are now officially open sourced!

Announcement, Changelog

February 20, 2025

Customizable Data Indexing Pipelines

Explain what customizable data indexing pipelines are through comparisons and examples.

data-indexing, Insight

January 30, 2025

What Makes Indexing Pipelines Different?

Understanding the unique characteristics of indexing pipelines compared to other data processing systems. Learn why indexing requires special handling for incremental processing and persistence.

Insight

January 20, 2025

Handling System Updates and CocoIndex Automatic Schema Inference

Explore how CocoIndex handles system updates in indexing flows and our approach to automatic schema inference. Learn about the challenges of managing data and logic evolution, infrastructure setup, and how CocoIndex simplifies these processes through smart automation.

data-indexing, Best Practices, Feature

January 10, 2025

Processing Large Files in Data Indexing Systems

Learn best practices for handling large files in data indexing systems. Understand processing granularity, fan-in/fan-out scenarios, and strategies for efficient processing of large datasets like patent XML files. Discover how CocoIndex helps manage memory pressure and ensures reliable processing.

data-indexing, processing-granularity, Best Practices

January 6, 2025

Data Consistency in Indexing Pipelines

Explore the challenges and solutions for maintaining data consistency in indexing pipelines. Learn about concurrent updates, data exposure risks, and best practices for ensuring reliable, up-to-date indexes using CocoIndex's data-driven approach.

data-indexing, Best Practices, Insight

January 5, 2025

Data Indexing and Common Challenges

Explore the fundamentals of data indexing pipelines for RAG applications. Learn about key characteristics of effective indexing systems, common challenges in production, and how CocoIndex solves them. Discover best practices for building maintainable, cost-effective, and scalable data indexing infrastructure.

data-indexing, Best Practices, Insight

January 4, 2025

CocoIndex - A Data Indexing Platform for AI Applications

Discover how CocoIndex simplifies data indexing for AI applications. Learn about our comprehensive platform that handles data ingestion, processing, and management for RAG, semantic search, and other AI use cases.

data-indexing, ai, rag, semantic-search

January 2, 2025

Welcome to CocoIndex

Welcome to the official CocoIndex blog! We're excited to share our journey in building high-performance indexing infrastructure for AI applications.

Announcement
