Multi-Codebase Summarization

Your code is the source of truth. In this tutorial, we'll build a pipeline that automatically generates a one-pager wiki for each project in a list and, thanks to incremental processing, never goes out of date. Think of it as building your own deep wiki that is always fresh.
For example, for each CocoIndex example project, we can generate a one-pager like this:

Overview
This example uses structured LLM outputs to analyze code and generate documentation at scale with LLMs.
- Scan top-level subdirectories, treating each as a separate project
- Extract structured information from each file using an LLM (classes, functions, relationships)
- Aggregate file-level data into project-level summaries
- Generate Markdown documentation with Mermaid diagrams
You declare the transformation logic in native Python, without writing any change-tracking or orchestration code.
Think: target_state = transformation(source_state)
When your source data is updated or your processing logic changes (for example, switching to a different model or updating your LLM extraction prompt), CocoIndex performs smart incremental processing that reprocesses only what is affected, and it keeps your wikis up to date in production.
Setup
- Install CocoIndex and dependencies:

  pip install --pre 'cocoindex>=1.0.0a6' instructor litellm pydantic

- Create a new directory for your project:

  mkdir multi-codebase-summarization
  cd multi-codebase-summarization

- Set up your LLM environment variables:

  export GEMINI_API_KEY="your-api-key"
  export LLM_MODEL="gemini/gemini-2.5-flash"  # Or any LiteLLM-supported model

- Create a .env file to configure the database path:

  echo "COCOINDEX_DB=./cocoindex.db" > .env

- Create a projects/ directory with subdirectories for each Python project:

  mkdir projects

projects/
├── my_project_1/
│ ├── main.py
│ └── utils.py
├── my_project_2/
│ └── app.py
└── ...
Define the app
Define a CocoIndex App — the top-level runnable unit in CocoIndex.
from __future__ import annotations
import os
import asyncio
import pathlib
from typing import Collection
import instructor
from litellm import acompletion
from pydantic import BaseModel, Field
import cocoindex as coco
from cocoindex.connectors import localfs
from cocoindex.resources.file import FileLike, PatternFilePathMatcher
LLM_MODEL = os.environ.get("LLM_MODEL", "gemini/gemini-2.5-flash")
app = coco.App(
"MultiCodebaseSummarization",
app_main,
root_dir=pathlib.Path("./projects"),
output_dir=pathlib.Path("./output"),
)
- The app scans `projects/` and outputs documentation to `output/`
Define the main function
In the main function, we walk through the subdirectories, treating each one as a project, and process it.
The processing granularity is up to you. It can be
- at the directory level, one per project (for example, code_embedding is one project containing multiple files),
- at the file level,
- or at even smaller units (e.g., page level or semantic-unit level).
In this example, the projects folder contains 20+ projects and we want one wiki page per project, so directory-level granularity is the natural choice (a file-level variant is sketched below).
@coco.function
def app_main(
root_dir: pathlib.Path,
output_dir: pathlib.Path,
) -> None:
"""Scan subdirectories and generate documentation for each project."""
for entry in root_dir.resolve().iterdir():
if not entry.is_dir() or entry.name.startswith("."):
continue
project_name = entry.name
files = list(
localfs.walk_dir(
entry,
recursive=True,
path_matcher=PatternFilePathMatcher(
included_patterns=["*.py"],
excluded_patterns=[".*", "__pycache__"],
),
)
)
if files:
coco.mount(
coco.component_subpath("project", project_name),
process_project,
project_name,
files,
output_dir,
)
The main function does two things:
- Find all projects — Loop through each subdirectory in `root_dir`, treating each as a separate project.
- Mount a processing component for each project — For each project with Python files, `coco.mount()` sets up a processing component. CocoIndex handles the execution and tracks dependencies automatically.
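If you wanted file-level granularity instead, you could mount one component per file inside the same loop. A rough sketch (process_single_file is a hypothetical per-file function, not part of this example):
# Sketch only: one processing component per file instead of per project.
for file in files:
    coco.mount(
        coco.component_subpath("file", str(file.file_path.path)),
        process_single_file,  # hypothetical: extracts and writes docs for a single file
        file,
        output_dir,
    )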
To learn more about processing components, see the Processing Component documentation.
Process each project
For each project, we will
- use an LLM to extract structured info from each file,
- aggregate the extractions into a project-level summary,
- render the result as Markdown documentation with a Mermaid diagram.
@coco.function(memo=True)
async def process_project(
project_name: str,
files: Collection[localfs.File],
output_dir: pathlib.Path,
) -> None:
"""Process a project: extract, aggregate, and output markdown."""
# Extract info from each file concurrently using asyncio.gather
file_infos = await asyncio.gather(*[extract_file_info(f) for f in files])
# Aggregate into project-level summary
project_info = await aggregate_project_info(project_name, file_infos)
# Generate and output markdown
markdown = generate_markdown(project_name, project_info, file_infos)
localfs.declare_file(
output_dir / f"{project_name}.md", markdown, create_parent_dirs=True
)
Concurrent processing with async — By using asyncio.gather(), all file extractions run concurrently. This is significantly faster than sequential processing, especially when making LLM API calls.
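If you need to stay under an LLM rate limit, you can bound the concurrency with a semaphore. A minimal sketch, assuming a plain wrapper around extract_file_info works here (the limit of 8 and the extract_with_limit helper are assumptions, not part of the original example; if wrapping the decorated function interferes with dependency tracking, apply the semaphore inside extract_file_info instead):
# Sketch: cap the number of in-flight LLM calls (limit chosen arbitrarily).
_llm_semaphore = asyncio.Semaphore(8)

async def extract_with_limit(file: FileLike) -> CodebaseInfo:
    async with _llm_semaphore:
        return await extract_file_info(file)

# Then, in process_project:
# file_infos = await asyncio.gather(*[extract_with_limit(f) for f in files])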
Extract file information with LLM
Now let's take a look at the details for each transformation. For file extraction, we define a structure using Pydantic and use Instructor to extract with LLMs.
Define the data models
The key to structured LLM outputs is defining clear Pydantic models.
class FunctionInfo(BaseModel):
"""Information about a public function."""
name: str = Field(description="Function name")
signature: str = Field(
description="Function signature, e.g. 'async def foo(x: int) -> str'"
)
is_coco_function: bool = Field(
description="Whether decorated with @coco.function"
)
summary: str = Field(description="Brief summary of what the function does")
class ClassInfo(BaseModel):
"""Information about a public class."""
name: str = Field(description="Class name")
summary: str = Field(description="Brief summary of what the class represents")
class CodebaseInfo(BaseModel):
"""Extracted information from Python code."""
name: str = Field(description="File path or project name")
summary: str = Field(description="Brief summary of purpose and functionality")
public_classes: list[ClassInfo] = Field(default_factory=list)
public_functions: list[FunctionInfo] = Field(default_factory=list)
mermaid_graphs: list[str] = Field(
default_factory=list,
description="Mermaid graphs showing function relationships"
)
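To make the target shape concrete, here is a hand-written CodebaseInfo instance of the kind the LLM is asked to return (all values are illustrative):
# Illustrative only: the structure the LLM fills in for a single file.
example = CodebaseInfo(
    name="my_project_1/main.py",
    summary="Entry point that wires the summarization pipeline together.",
    public_classes=[ClassInfo(name="Config", summary="Runtime configuration for the app.")],
    public_functions=[
        FunctionInfo(
            name="main",
            signature="def main() -> None",
            is_coco_function=True,
            summary="Builds the app and mounts one component per project.",
        )
    ],
    mermaid_graphs=["graph TD; app_main --> process_project --> extract_file_info"],
)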
Extract file info
The core extraction function uses memoization to cache LLM results:
_instructor_client = instructor.from_litellm(acompletion, mode=instructor.Mode.JSON)
@coco.function(memo=True)
async def extract_file_info(file: FileLike) -> CodebaseInfo:
"""Extract structured information from a single Python file using LLM."""
content = file.read_text()
file_path = str(file.file_path.path)
prompt = f"""Analyze the following Python file and extract structured information.
File path: {file_path}
```python
{content}
```
Instructions:
1. Identify all PUBLIC classes (not starting with _) and summarize their purpose
2. Identify all PUBLIC functions (not starting with _) and summarize their purpose
3. If this file contains CocoIndex apps (coco.App), create Mermaid graphs showing the
function call relationships (see the mermaid_graphs field description for format)
4. Provide a brief summary of the file's purpose
"""
result = await _instructor_client.chat.completions.create(
model=LLM_MODEL,
response_model=CodebaseInfo,
messages=[{"role": "user", "content": prompt}],
)
return CodebaseInfo.model_validate(result.model_dump())
Why memo=True matters: LLM calls are expensive. With memoization, CocoIndex caches the result keyed by the file content. If you run the pipeline again without changing a file, the cached result is used—no LLM call needed.
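As a rough mental model (not CocoIndex's actual implementation, which is persistent and tied into dependency tracking), memo=True behaves like a cache keyed by the input content:
# Mental model only: a content-keyed in-memory cache.
import hashlib

_cache: dict[str, CodebaseInfo] = {}

async def extract_memoized(file: FileLike) -> CodebaseInfo:
    key = hashlib.sha256(file.read_text().encode()).hexdigest()
    if key not in _cache:  # only call the LLM on a cache miss
        _cache[key] = await extract_file_info(file)
    return _cache[key]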
Aggregate project information
For projects with multiple files, we aggregate into a unified summary:
@coco.function
async def aggregate_project_info(
project_name: str,
file_infos: list[CodebaseInfo],
) -> CodebaseInfo:
"""Aggregate multiple file extractions into a project-level summary."""
if not file_infos:
return CodebaseInfo(
name=project_name, summary="Empty project with no Python files."
)
# Single file - just update the name
if len(file_infos) == 1:
info = file_infos[0]
return CodebaseInfo(
name=project_name,
summary=info.summary,
public_classes=info.public_classes,
public_functions=info.public_functions,
mermaid_graphs=info.mermaid_graphs,
)
# Multiple files - use LLM to create unified summary
files_text = "\n\n".join(
f"### {info.name}\n"
f"Summary: {info.summary}\n"
f"Classes: {', '.join(c.name for c in info.public_classes) or 'None'}\n"
f"Functions: {', '.join(f.name for f in info.public_functions) or 'None'}"
for info in file_infos
)
# Collect all mermaid graphs from files
all_graphs = [g for info in file_infos for g in info.mermaid_graphs]
prompt = f"""Aggregate the following Python files into a project-level summary.
Project name: {project_name}
Files:
{files_text}
Create a unified CodebaseInfo that:
1. Summarizes the overall project purpose (not individual files)
2. Lists the most important public classes across all files
3. Lists the most important public functions across all files
4. For mermaid_graphs: create a single unified graph showing how the CocoIndex
components connect across the project (if applicable)
"""
result = await _instructor_client.chat.completions.create(
model=LLM_MODEL,
response_model=CodebaseInfo,
messages=[{"role": "user", "content": prompt}],
)
result = CodebaseInfo.model_validate(result.model_dump())
# Keep original file-level graphs if LLM didn't generate a unified one
if not result.mermaid_graphs and all_graphs:
result.mermaid_graphs = all_graphs
return result
This function combines file-level extractions into a single project summary:
- Single file project — Just use that file's info directly (no extra LLM call needed)
- Multi-file project — Ask the LLM to synthesize all file summaries into one cohesive project overview
The result is a unified CodebaseInfo that represents the entire project, not individual files.
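For reference, for a two-file project the files_text block assembled above would read roughly like this (illustrative values):

```
### my_project_1/main.py
Summary: Entry point that wires the pipeline together.
Classes: None
Functions: main, app_main

### my_project_1/utils.py
Summary: Shared helpers for reading and formatting files.
Classes: Config
Functions: load_config
```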
Generate markdown output
Create output markdown for each project.
@coco.function
def generate_markdown(
project_name: str, info: CodebaseInfo, file_infos: list[CodebaseInfo]
) -> str:
"""Generate markdown documentation from project info."""
lines = [
f"# {project_name}",
"",
"## Overview",
"",
info.summary,
"",
]
if info.public_classes or info.public_functions:
lines.extend(["## Components", ""])
if info.public_classes:
lines.append("**Classes:**")
for cls in info.public_classes:
lines.append(f"- `{cls.name}`: {cls.summary}")
lines.append("")
if info.public_functions:
lines.append("**Functions:**")
for fn in info.public_functions:
marker = " ★" if fn.is_coco_function else ""
lines.append(f"- `{fn.signature}`{marker}: {fn.summary}")
lines.append("")
if info.mermaid_graphs:
lines.extend(["## CocoIndex Pipeline", ""])
for graph in info.mermaid_graphs:
graph_content = graph.strip()
if not graph_content.startswith("```"):
lines.append("```mermaid")
lines.append(graph_content)
lines.append("```")
else:
lines.append(graph_content)
lines.append("")
if len(file_infos) > 1:
lines.extend(["## File Details", ""])
for fi in file_infos:
lines.extend([f"### {fi.name}", "", fi.summary, ""])
lines.extend(["---", "", "*★ = CocoIndex function*"])
return "\n".join(lines)
This function converts the structured CodebaseInfo into readable documentation:
- Overview — Project summary at the top
- Components — Lists classes and functions with descriptions (★ marks CocoIndex functions)
- Pipeline diagram — Mermaid graphs showing how functions connect
- File details — For multi-file projects, includes per-file summaries
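Putting it together, a generated page for a small single-file project would look roughly like this (illustrative content; Mermaid and File Details sections are appended the same way when present):

```markdown
# my_project_1

## Overview

Entry point and helpers for a small summarization pipeline.

## Components

**Classes:**
- `Config`: Runtime configuration for the app.

**Functions:**
- `def main() -> None` ★: Builds the app and mounts one component per project.

---

*★ = CocoIndex function*
```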
Run the pipeline
cocoindex update main.py
CocoIndex will:
- Scan each subdirectory in `projects/`
- Extract structured information from Python files using the LLM
- Aggregate file summaries into project summaries
- Generate Markdown files in `output/`
Check the output:
ls output/
# project1.md project2.md ...
Incremental updates
Incremental processing really pays off when you make changes:
Modify a file:
Edit a Python file in one of your projects, then run:
cocoindex update main.py
Only the modified file is re-analyzed by the LLM. Unchanged files use cached results.
Add a new project:
Add a new subdirectory with Python files:
mkdir projects/new_project
# add .py files
cocoindex update main.py
Only the new project is processed.
Key patterns demonstrated
This example showcases several powerful patterns:
- Structured LLM outputs with Instructor + Pydantic models
- Memoized LLM calls to avoid redundant API costs
- Async concurrent processing with `asyncio.gather()`
- Hierarchical aggregation (file → project)
- Incremental processing for efficient updates