Introducing CocoInsight

From day zero, we envisioned CocoInsight as a fundamental companion to CocoIndex: not just a tool, but a philosophy of making data explainable, auditable, and actionable at every stage of a data pipeline for AI workloads. CocoInsight has been in private beta for a while. It is one of the most loved features among our users building ETL with CocoIndex, giving a significant boost to developer velocity and lowering the barrier to entry for data engineering.

We are officially launching CocoInsight today. It retains zero pipeline data and connects to your on-premise CocoIndex server for pipeline insights. This makes your data directly visible and makes ETL pipelines easier to develop.

Getting Started

To start using it, run the following command in your CocoIndex project:

cocoindex server -ci main.py

Overview

This is an example view of CocoInsight:

  • The right panel shows the dataflow.
  • The left panel shows a step-by-step data preview. Each field is tied to an input or output of a step in the dataflow transformation.

[Figure: CocoInsight panels]

Inspect lineage

You can click any field (in either the dataflow or the data preview) or any transformation step in the dataflow to inspect its lineage and understand where the data comes from.

[Figure: Inspect lineage]

The clicked element is highlighted in purple to mark it as the element being inspected.

  • Visibility:
    • Data/ops connected to the selected element, directly or through transitive dependencies (upstream or downstream), stay in view.
    • Data/ops unrelated to the currently selected element are dimmed.
  • Color:
    • Direct upstream data dependencies (exact fields) are colored blue.
    • Direct downstream data outputs (exact fields) are colored green.
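The highlighting rules above can be sketched as a small graph traversal. This is only an illustration under stated assumptions, not CocoInsight's actual implementation: the function names (`reachable`, `highlight`), the style labels, and the field names in the usage note are all hypothetical.

```python
from collections import defaultdict


def reachable(start, neighbors):
    """Return every node reachable from `start` by following `neighbors` transitively."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def highlight(selected, nodes, edges):
    """Assign a display style to each node, given (upstream, downstream) edge pairs."""
    upstream = defaultdict(set)    # node -> its direct inputs
    downstream = defaultdict(set)  # node -> its direct outputs
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)

    # Anything transitively connected to the selection stays in view.
    related = reachable(selected, upstream) | reachable(selected, downstream)

    styles = {}
    for node in nodes:
        if node == selected:
            styles[node] = "purple"   # the element being inspected
        elif node in upstream[selected]:
            styles[node] = "blue"     # direct upstream dependency
        elif node in downstream[selected]:
            styles[node] = "green"    # direct downstream output
        elif node in related:
            styles[node] = "visible"  # transitive dependency, kept in view
        else:
            styles[node] = "dimmed"   # unrelated to the selection
    return styles
```

For example, with the hypothetical edges `filename → extension`, `extension → chunks`, `content → chunks`, and `chunks → embedding`, selecting `chunks` would mark `extension` and `content` blue, `embedding` green, keep `filename` in view, and dim any field that is not connected.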

Let's walk through some simple examples of how these AI pipelines work. You don't need to know how to write code; you just need to be able to make sense of a spreadsheet 😊.

Codebase Indexing Example

  1. Ingest files, which outputs file names and contents.

    [Figure: Ingest files for codebase]

  2. Take the filename and extract the extension.

    [Figure: Extract extension]

  3. Take the content (source code) and the extension (which identifies the language, e.g., .py) and split the code along syntax boundaries with Tree-sitter.

    [Figure: Split code]

    You can click each chunk of a document to expand its details.

    [Figure: Code chunk details]
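For readers who do want to see code, the per-file steps above might look roughly like the plain-Python sketch below. It is not the CocoIndex API. In particular, `split_python_source` is a deliberately naive, hypothetical stand-in for Tree-sitter: it only starts a new chunk at each top-level `def` or `class` line, whereas Tree-sitter parses the real syntax tree.

```python
import os


def extract_extension(filename: str) -> str:
    """Return the extension with its leading dot, e.g. '.py' (empty string if none)."""
    return os.path.splitext(filename)[1]


def split_python_source(content: str) -> list[str]:
    """Naive stand-in for syntax-aware splitting (assumption: Python source only).

    Starts a new chunk at every top-level `def` or `class` line.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in content.splitlines(keepends=True):
        if line.startswith(("def ", "class ")) and current:
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks


def process_file(filename: str, content: str) -> dict:
    """Build one output row: the source fields plus the derived fields."""
    extension = extract_extension(filename)
    chunks = split_python_source(content) if extension == ".py" else [content]
    return {
        "filename": filename,
        "content": content,
        "extension": extension,
        "chunks": chunks,
    }
```

Each key in the returned row corresponds to a column you would see in the data preview, with `extension` and `chunks` derived from `filename` and `content`.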

Knowledge Graph Example

In this example, we process a list of files and generate a knowledge graph with documents and entities as nodes, and with relationships between documents and entities and between pairs of entities.

Some key steps:

  1. Use an LLM to summarize each document.

    [Figure: Knowledge Graph LLM Summary]

  2. Use an LLM to extract entities and the relationships between them.

    [Figure: Knowledge Graph LLM Relation Extraction]

    Click any relationship row to drill into its child table.

    [Figure: relationship]
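To make the shape of this data concrete, here is a minimal sketch of how document rows with a nested relationship table could be flattened into graph nodes and edges. Everything in it is an assumption for illustration: the `Relationship` and `Document` classes, the `MENTIONS` edge label, and the `to_knowledge_graph` helper are hypothetical, and the LLM calls are left out, so the summaries and relationships are taken as already extracted.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """One extracted (subject, predicate, object) triple between two entities."""
    subject: str
    predicate: str
    object: str


@dataclass
class Document:
    """One document row; `relationships` plays the role of the child table."""
    filename: str
    summary: str
    relationships: list[Relationship] = field(default_factory=list)


def to_knowledge_graph(documents):
    """Flatten document rows into a set of typed nodes and a set of labeled edges."""
    nodes = set()
    edges = set()
    for doc in documents:
        nodes.add(("Document", doc.filename))
        for rel in doc.relationships:
            for entity in (rel.subject, rel.object):
                nodes.add(("Entity", entity))
                # Document-to-entity edge: the document mentions this entity.
                edges.add((doc.filename, "MENTIONS", entity))
            # Entity-to-entity edge taken directly from the extracted triple.
            edges.add((rel.subject, rel.predicate, rel.object))
    return nodes, edges
```

Using sets means that an entity mentioned in several documents appears as a single node, which is what lets the graph connect documents through shared entities.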

How it works

At the core of CocoIndex, both data and data operations are first-class citizens.

Because of this pure dataflow foundation, CocoIndex offers full observability by default:

  • The before and after states of the data are available at every transformation node.
  • Every output field can be traced back to the exact set of input fields and operations that created it.
  • Lineage is first-class: not metadata bolted on afterward, but a structural property of how data is defined and transformed in the system.

This lineage model is not just useful for debugging; it also enables features like incremental processing, intelligent caching, and transformation-level explainability, all out of the box.
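As a rough illustration of what "before/after at every transformation node" means, the sketch below runs a value through a chain of named steps and records a snapshot at each one. It is a toy model, not how CocoIndex is implemented, and `run_with_snapshots` is a hypothetical name.

```python
from typing import Any, Callable

# A step is a (name, function) pair; the function maps one value to the next.
Step = tuple[str, Callable[[Any], Any]]


def run_with_snapshots(value: Any, steps: list[Step]) -> tuple[Any, list[dict]]:
    """Apply each step in order, recording the data before and after it."""
    snapshots = []
    for name, fn in steps:
        result = fn(value)
        snapshots.append({"step": name, "before": value, "after": result})
        value = result
    return value, snapshots
```

Because each snapshot names the step that produced it, any intermediate value can be explained by pointing at the step and its input, which is the property the data preview relies on.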

While CocoIndex is architecturally a dataflow engine, its user experience is deeply inspired by spreadsheets. Just like in a spreadsheet:

  • Values of cells are derived from others through clearly visible formulas or expressions.
  • You can visually inspect how data looks before and after each transformation, cell by cell.
  • There’s no implicit global state, and every value can be explained in terms of its formula and input values.
  • Once the value of a source cell changes, derived cell values are automatically updated from their formulas with minimal reprocessing.
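The spreadsheet behavior in the list above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not CocoIndex's engine: the `Sheet` class and its methods are hypothetical, and derived cells must be defined after the cells they depend on, so that definition order is also a valid evaluation order.

```python
import os


class Sheet:
    """Toy spreadsheet: source cells hold values, derived cells hold formulas."""

    def __init__(self):
        self.values = {}      # cell name -> current value
        self.formulas = {}    # derived cell -> (function, input cell names)
        self.recomputed = []  # derived cells re-evaluated by set_source, in order

    def define(self, cell, fn, inputs):
        """Define a derived cell; its inputs must already exist."""
        self.formulas[cell] = (fn, tuple(inputs))
        self.values[cell] = fn(*(self.values[i] for i in inputs))

    def set_source(self, cell, value):
        """Set a source cell and update only the derived cells affected by it."""
        if cell in self.values and self.values[cell] == value:
            return  # nothing changed, so nothing to reprocess
        self.values[cell] = value
        dirty = {cell}
        # Dicts keep insertion order, so formulas are visited in definition order.
        for name, (fn, inputs) in self.formulas.items():
            if dirty.intersection(inputs):
                new_value = fn(*(self.values[i] for i in inputs))
                self.recomputed.append(name)
                if new_value != self.values[name]:
                    self.values[name] = new_value
                    dirty.add(name)  # propagate only when the value really changed


sheet = Sheet()
sheet.set_source("filename", "main.py")
sheet.set_source("content", "print('hi')")
sheet.define("extension", lambda f: os.path.splitext(f)[1], ["filename"])
sheet.define("length", len, ["content"])

# Changing the content re-evaluates "length" but leaves "extension" untouched.
sheet.set_source("content", "x = 1")
```

Every derived value is a pure function of its inputs, so there is no hidden state to invalidate. That is what makes this kind of selective recomputation safe.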

This spreadsheet-inspired paradigm is more than a UI choice; it's a cognitive model. It bridges the gap between low-code users and developers, allowing anyone familiar with spreadsheets to reason about data transformations intuitively.

We have lots of features planned for CocoInsight 😎, including query debugging, stats, and more. Stay tuned, and join our Discord if you have any questions.