Tutorials, deep dives, and notes from the CocoIndex team. Incremental data infrastructure, Rust internals, knowledge graphs, and stories from production.
Build a pipeline that converts YouTube podcasts into a structured knowledge graph: extract speakers, statements, and entities with an LLM, then resolve duplicates with embeddings.
How CocoIndex evolved from pickle to a type-guided serialization system that uses Python type hints to automatically choose the right serializer — no decorators or registration needed.
Patterns for building local daemons that start on first use, upgrade transparently, and shut down cleanly — learned from building cocoindex-code's semantic search daemon.
Featuring five new target connectors, filesystem-level change detection, Python 3.14 free-threading, and smarter pipeline lifecycle management.
CocoIndex joined the GitHub Secure Open Source Fund — strengthening security for the AI data infrastructure developers depend on.
A multi-source pipeline that ingests SEC filings (TXT, JSON, PDF), scrubs PII, extracts topics, and powers hybrid search with CocoIndex + Apache Doris.
Automatically generates a wiki page for each project in your codebase, and keeps it fresh with incremental processing.
Turn slide decks into a continuously updated multimodal dataset — extract speaker notes, synthesize narration, keep LanceDB in sync.
Featuring production-ready resilience, structured error system, expanded integrations, and always-fresh structured context for agents operating in the real world.
Extract Pydantic-typed structured data from patient intake forms using DSPy and CocoIndex — OCR vision models with incremental processing.
Most companies sit on an ocean of meeting notes. Inside those documents are decisions, tasks, owners, and relationships: an untapped knowledge graph that is constantly changing.
Build a real-time HackerNews trending topics detector with CocoIndex — a deep dive into Custom Sources and AI-powered topic extraction.
Featuring batching support for CocoIndex functions, execution robustness, schema & type system improvements, custom source support, and more.
Build a custom incremental HackerNews connector with CocoIndex's Custom Source API and export to Postgres for semantic search and analytics.
How to use BAML and CocoIndex to continuously extract structured data from patient intake forms in PDF/Word with an LLM, ready for production.
CocoIndex now batches GPU and ML workloads automatically — 5x throughput on text embeddings and AI ops, with zero configuration required.
Why the next wave of AI needs open source, scalable, and AI-native data infrastructure, and how CocoIndex is building the foundation for the future of intelligent data pipelines.
Extract, embed, and store multimodal PDF elements — text with SentenceTransformers, images with CLIP — for unified semantic search with traceable metadata.
CocoIndex now supports custom sources — read data from any system and keep it incrementally fresh as knowledge for AI agents.
Production-ready upgrades: durable execution, faster incremental processing over large datasets, GPU isolation, and richer native building blocks.
Build an incremental AI pipeline that extracts invoice fields from PDFs in Azure Blob Storage and loads them into Snowflake — with CocoIndex, OpenAI GPT-4o, and a ~50-line custom Snowflake target. Open-source alternative to Snowflake Openflow and Cortex Document AI for unstructured ETL.
A mental framework for Rust's memory safety concepts. Think systematically about ownership, references, Send, Sync, and Rc, Arc, RefCell, Mutex, etc.
Define query handlers in CocoIndex and trace search results back to source data in CocoInsight — close the loop on indexing strategy.
Build unified, incrementally updated search and analytics over structured + unstructured data in PostgreSQL with CocoIndex.
Build a unified visual document index from multiple file formats (PDFs, images, and slides) using CocoIndex and ColPali. No OCR needed.
Featuring production readiness, scalability, and reliability. More flexibility with customization and native integrations. Extended features for multimodal pipelines and more.
Learn how CocoIndex's layered concurrency control features help you optimize data processing performance, prevent system overload, and ensure stable, efficient pipelines at scale.
CocoIndex now supports native integration with ColPali — enabling multi-vector, patch-level image indexing.
CocoIndex natively handles typed multi-dimensional vectors, from simple arrays to multi-vector embeddings, unlocking multimodal AI pipelines at scale.
CocoIndex now officially supports custom targets — giving you the power to export data to any destination, whether it's a local file, cloud storage, a REST API, or your own bespoke system.
Build a scalable face detection and recognition pipeline with CocoIndex — embed faces, structure for search, and export to a vector DB.
How to index academic research papers by extracting metadata (e.g., title, authors, abstract) for AI agents and AI workflows using LLMs and CocoIndex.
CocoIndex updates: in-process API, CLI improvements, EmbedText support, codebase indexing enhancements, and more.
CocoInsight is a platform for data lineage and data observability.
CocoIndex now sets up Qdrant collections automatically by inferring the target schema from your indexing flow — no manual config.
Build a real-time knowledge graph with Kuzu as a native CocoIndex target — incremental updates, high-performance graph queries.
CocoIndex updates: Amazon S3 as a data source, updates on query handling, standalone, and more.
Build a real-time data transformation pipeline with Amazon S3 and SQS using CocoIndex — incremental indexing on object storage.
Index images in real time with CocoIndex and a vision model: create multimodal embeddings and build a vector index for efficient retrieval.
Index text with CocoIndex and text embeddings, then query it with natural language.
CocoIndex is the world's first open-source engine that supports both custom transformation logic and incremental processing specialized for data indexing. We just crossed 1k stars. Thank you so much!
Build a real-time product recommendation engine with an LLM and a graph database, based on understanding product categories (taxonomy).
CocoIndex updates: Knowledge Graphs, Qdrant, Supabase, KTable/LTable, and more LLM providers.
CocoIndex now supports knowledge graphs with incremental processing. Building live knowledge for agents is super easy with CocoIndex!
CocoIndex updates: Incremental processing with live update mode, evaluation utilities, support for date/time types, Google Drive, and assorted core/performance improvements.
CocoIndex continuously watches source changes and keeps derived data in sync, with low latency and minimal performance overhead.
CocoIndex keeps your index up to date with source changes, efficiently and with low latency, thanks to incremental processing.
Extract structured data from patient intake forms in PDF/Word with an LLM, using CocoIndex.
A tutorial on creating text embeddings from docs on Google Drive and saving them in a vector store for semantic search / RAG, using CocoIndex.
First release of CocoIndex Changelog: LLM support, codebase indexing, custom functions, and assorted core/performance improvements.
Index a codebase for RAG in real time with CocoIndex and Tree-sitter: chunking, embedding, semantic search, and building a vector index for efficient retrieval.
Learn to use CocoIndex to extract structured data from PDF/Markdown with Ollama's local LLMs, all running on-premises without sending data to external APIs.
CocoIndex is the world's first open-source engine that supports both custom transformation logic and incremental processing specialized for data indexing. We are now officially open sourced!
Explains what customizable data indexing pipelines are, through comparisons and examples.
What makes indexing pipelines different from other data systems — and why they need special handling for incremental processing and persistence.
How CocoIndex handles system updates in indexing flows: automatic schema inference and managing data + logic evolution without downtime.
Handle large files in data indexing: processing granularity, fan-in/fan-out, and memory pressure, walked through with a patent XML example in CocoIndex.
Data consistency in indexing pipelines: concurrent updates, exposure risks, and how CocoIndex's data-driven approach keeps indexes converging.
Fundamentals of data indexing pipelines for RAG: what makes a good one, common production pitfalls, and how CocoIndex addresses them.
CocoIndex is a data indexing platform for AI applications — ingestion, processing, and management for RAG and semantic search.
Welcome to the official CocoIndex blog! We're excited to share our journey in building high-performance indexing infrastructure for AI applications.