We are officially open sourced! 🎉

CocoIndex is now open source

We are thrilled to announce the open-source release of CocoIndex, the world's first engine that supports both custom transformation logic and incremental updates specialized for data indexing.

CocoIndex combines custom transformation logic and incremental updates

CocoIndex is an ETL framework that prepares data for AI applications such as semantic search and retrieval-augmented generation (RAG). It offers a data-driven programming model that simplifies the creation and maintenance of data indexing pipelines, ensuring data freshness and consistency.

CocoIndex is now open source under the Apache License 2.0. This means the core functionality of CocoIndex is freely available for anyone to use, modify, and distribute. We believe that open sourcing CocoIndex will foster innovation, enable broader adoption, and create a vibrant community of contributors who can help shape its future. By choosing the Apache License 2.0, we're ensuring that both individual developers and enterprises can confidently build upon and integrate CocoIndex into their projects while maintaining the flexibility to create proprietary extensions.

🔥 Key Features

  • Data Flow Programming: Build indexing pipelines by composing transformations like Lego blocks, with built-in state management and observability.
  • Support Custom Logic: Plug in your choice of chunking, embedding, and vector stores. Extend with custom transformations like deduplication and reconciliation.
  • Incremental Updates: Smart state management minimizes re-computation by tracking changes at the file level, with future support for chunk-level granularity.
  • Python SDK: Built with a Rust core 🦀 for performance, exposed through an intuitive Python binding 🐍 for ease of use.
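To make the file-level incremental update idea concrete, here is a minimal sketch in plain Python. It is not CocoIndex's implementation: the names `file_fingerprint` and `incremental_update`, the use of SHA-256 content hashes, and the in-memory `state` and `index` dictionaries are all assumptions made for illustration. The point is only that a file whose fingerprint has not changed since the previous run can be skipped.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, List


def file_fingerprint(path: Path) -> str:
    """Return a SHA-256 hex digest of the file's bytes (hypothetical helper)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def incremental_update(
    root: Path,
    state: Dict[str, str],
    index: Dict[str, str],
    transform: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Reprocess only new or changed Markdown files under `root`.

    `state` maps relative path -> fingerprint from the previous run and is
    updated in place. `index` maps relative path -> transformed output.
    Entries for files that no longer exist are removed from both.
    """
    report: Dict[str, List[str]] = {"processed": [], "skipped": [], "removed": []}
    seen = set()
    for path in sorted(root.rglob("*.md")):
        key = path.relative_to(root).as_posix()
        seen.add(key)
        digest = file_fingerprint(path)
        if state.get(key) == digest:
            # Unchanged since the last run: keep the existing index entry.
            report["skipped"].append(key)
            continue
        index[key] = transform(path.read_text(encoding="utf-8"))
        state[key] = digest
        report["processed"].append(key)
    # Anything tracked previously but not seen now has been deleted.
    for key in sorted(set(state) - seen):
        del state[key]
        index.pop(key, None)
        report["removed"].append(key)
    return report
```

Calling `incremental_update` twice on an unchanged directory processes every file on the first run and skips every file on the second. A real engine would persist `state` between runs instead of keeping it in memory.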

We are moving fast, and many more features and improvements are coming soon.

🚀 Getting Started

  1. Installation: Install the CocoIndex Python library:

    pip install cocoindex
  2. Set Up Postgres with pgvector Extension: Ensure Docker Compose is installed, then start a Postgres database:

    docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up -d
  3. Define Your Indexing Flow: Create a flow to index your data. For example:

    import cocoindex

    @cocoindex.flow_def(name="TextEmbedding")
    def text_embedding(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
        # Source: read files from the local "markdown_files" directory.
        data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))
        # Collector that gathers one row per chunk for export.
        doc_embeddings = data_scope.add_collector()

        with data_scope["documents"].row() as doc:
            # Split each document into overlapping chunks.
            doc["chunks"] = doc["content"].transform(
                cocoindex.functions.SplitRecursively(language="markdown", chunk_size=300, chunk_overlap=100))

            with doc["chunks"].row() as chunk:
                # Embed each chunk with a sentence-transformers model.
                chunk["embedding"] = chunk["text"].transform(
                    cocoindex.functions.SentenceTransformerEmbed(model="sentence-transformers/all-MiniLM-L6-v2"))

                doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                       text=chunk["text"], embedding=chunk["embedding"])

        # Export the collected rows to Postgres with a cosine-similarity vector index.
        doc_embeddings.export(
            "doc_embeddings",
            cocoindex.storages.Postgres(),
            primary_key_fields=["filename", "location"],
            vector_index=[("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
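The `chunk_size=300` and `chunk_overlap=100` arguments in the flow above may be easier to picture with a standalone sketch. The function below is a plain sliding-window splitter, not `SplitRecursively`, which splits along the structure of the source language rather than at fixed offsets. It assumes sizes are measured in characters, and `split_with_overlap` is a hypothetical name used only here.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 300, chunk_overlap: int = 100) -> List[str]:
    """Split `text` into windows of up to `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so each new
    window starts `chunk_size - chunk_overlap` characters after the last.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            # This window already reaches the end of the text.
            break
    return chunks
```

With the default arguments, a 700-character text yields three chunks starting at offsets 0, 200, and 400. The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.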

For a detailed walkthrough, refer to our Quickstart Guide.
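The flow exports embeddings with a cosine-similarity vector index. As a rough illustration of what that metric computes, here is a pure-Python sketch that scores a query vector against stored vectors and returns the best matches. The `cosine_similarity` and `top_k` helpers are hypothetical and stand in for what the database does at query time. They are not part of the CocoIndex API.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between vectors `a` and `b`."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    rows: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Rank `(text, embedding)` rows by similarity to `query`, best first."""
    scored = [(cosine_similarity(query, emb), text) for text, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A brute-force scan like this is linear in the number of rows. A vector index exists precisely to avoid that scan on large tables.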

🤗 Community

We welcome community contributions of all kinds, whether code improvements, documentation updates, issue reports and feature requests on GitHub, or discussions in our Discord.

  • GitHub: Please star our repository 🤗.
  • Documentation: Check out our documentation for detailed guides and API reference.
  • Discord: Join discussions, seek support, and share your experiences on our Discord server.
  • Social Media: Follow us on Twitter and LinkedIn for updates.

We are committed to fostering an inclusive, welcoming, and supportive environment. Contributing to CocoIndex should feel collaborative, friendly, and enjoyable for everyone. Together, we can build better AI applications through robust data infrastructure.

Looking forward to seeing what you build with CocoIndex!