Building SEC EDGAR Financial Analytics with CocoIndex and Apache Doris

· 12 min read
Tom Zhang
Solution Architect @ VeloDB, CocoIndex Contributor
Linghua Jin
CocoIndex Maintainer

Figure: SEC EDGAR Financial Analytics Architecture

SEC filings are the backbone of financial transparency. Every public company in the United States files 10-Ks, 10-Qs, proxy statements, and exhibits with the SEC, producing thousands of documents each quarter across text, structured data, and PDF formats.

Searching across all of these effectively requires more than keyword matching. You need semantic understanding, structured metadata filtering, and the ability to combine multiple document formats into a single searchable index.

In this post, we walk through the SEC EDGAR Financial Analytics example, a CocoIndex pipeline that ingests three source types (TXT filings, JSON company facts, and PDF exhibits), scrubs PII, extracts topic tags, and generates embeddings. The pipeline then exports everything into Apache Doris for hybrid search, which combines vector similarity with full-text matching using Reciprocal Rank Fusion (RRF).
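To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It is an illustration of the general technique, not the query this project runs inside Apache Doris. The function name `rrf_fuse`, the document IDs, and the constant `k=60` (a commonly used default) are assumptions chosen for the example. Each document earns `1 / (k + rank)` from every ranked list it appears in, and the contributions are summed.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant (60 is a common default; an assumption here).
    Returns a list of (doc_id, score) pairs, best-first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc_id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from a vector search and a full-text search.
vector_hits = ["10k-aapl", "10q-msft", "ex-tsla"]
text_hits = ["10q-msft", "proxy-amzn", "10k-aapl"]

fused = rrf_fuse([vector_hits, text_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever still receives a score. Because RRF uses only rank positions, it needs no normalization between vector distances and full-text relevance scores.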

The project is open source and can be found here.