
CocoIndex Built-in Storages

For each target storage, data is exported from a data collector, which holds multiple entries, each with multiple fields. How data maps from a data collector to a target storage depends on the data model of the target storage.

Entry-Oriented Targets

Entry-Oriented Storage organizes data into independent entries, such as rows, key-value pairs, or documents. Each entry is self-contained and does not explicitly link to others. There is usually a straightforward mapping from data collector rows to entries.

Postgres

Exports data to a Postgres database (with the pgvector extension).

Data Mapping

Here's how CocoIndex data elements map to Postgres elements during export:

CocoIndex Element | Postgres Element
an export target | a unique table
a collected row | a row
a field | a column

For example, if you have a data collector that collects rows with fields id, title, and embedding, it will be exported to a Postgres table with corresponding columns. It should be a unique table, meaning that no other export target should export to the same table.

Spec

The spec takes the following fields:

  • database (type: auth reference to DatabaseConnectionSpec, optional): The connection to the Postgres database. See DatabaseConnectionSpec for its specific fields. If not provided, the same database as the internal storage is used.

  • table_name (type: str, optional): The name of the table to store to. If unspecified, a new name is generated automatically. We recommend specifying a name explicitly if you want to query the table directly. It can be omitted if you use CocoIndex's query handlers to query the table.

Qdrant

Exports data to a Qdrant collection.

Data Mapping

Here's how CocoIndex data elements map to Qdrant elements during export:

CocoIndex Element | Qdrant Element
an export target | a unique collection
a collected row | a point
a field | a named vector (for fields with vector type); a field within payload (otherwise)
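
To make the field rule concrete, the sketch below splits one collected row into the two parts of a Qdrant point: named vectors and payload. It is plain Python with no Qdrant or CocoIndex dependency. The helper names `is_vector` and `row_to_point_parts` are hypothetical, and the rule "a non-empty list of numbers is a vector" is an assumption made for illustration; CocoIndex itself decides based on the declared field type.

```python
from numbers import Real
from typing import Any


def is_vector(value: Any) -> bool:
    """Assumption for this sketch: a non-empty list of numbers is a vector field."""
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(isinstance(x, Real) and not isinstance(x, bool) for x in value)
    )


def row_to_point_parts(row: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Split a collected row into named vectors and payload fields."""
    vectors = {name: value for name, value in row.items() if is_vector(value)}
    payload = {name: value for name, value in row.items() if name not in vectors}
    return {"vectors": vectors, "payload": payload}


# Illustrative row; the field names are made up for this example.
row = {"id_field": 1, "text": "hello", "embedding": [0.1, 0.5, 0.2]}
point = row_to_point_parts(row)
print(point["vectors"])  # the vector-typed field, keyed by its field name
print(point["payload"])  # every other field
```

The name of each entry in `vectors` is the field name, which is why the Qdrant collection needs a vector with the same name.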

Spec

The spec takes the following fields:

  • collection_name (type: str, required): The name of the collection to export the data to.

  • grpc_url (type: str, optional): The gRPC URL of the Qdrant instance. Defaults to http://localhost:6334/.

  • api_key (type: str, optional): The API key to authenticate requests with.

Before exporting, you must create a collection with a vector name that matches the vector field name in CocoIndex, and set setup_by_user=True during export.

Example:

doc_embeddings.export(
    "doc_embeddings",
    cocoindex.storages.Qdrant(
        collection_name="cocoindex",
        grpc_url="https://xyz-example.cloud-region.cloud-provider.cloud.qdrant.io:6334/",
        api_key="<your-api-key-here>",
    ),
    primary_key_fields=["id_field"],
    setup_by_user=True,
)

You can find an end-to-end example here.

Property Graph Targets

Property graph is a graph data model where both nodes and relationships can have properties.

Data Mapping

In CocoIndex, you can export data to property graph databases. This usually involves more than one collector, each exported to a different type of graph element (nodes or relationships). In particular:

  1. You can export rows from some collectors to nodes in the graph.
  2. You can export rows from some other collectors to relationships in the graph.
  3. Some nodes referenced by relationships exported in 2 may not exist as nodes exported in 1. CocoIndex will automatically create and keep these nodes, as long as they're still referenced by at least one relationship. This guarantees that all relationships exported in 2 are valid.

We provide the common types NodeMapping, RelationshipMapping, and ReferencedNode to configure each of these situations. They are agnostic to the specific graph database.

Nodes

Here's how CocoIndex data elements map to nodes in the graph:

CocoIndex Element | Graph Element
an export target | nodes with a unique label
a collected row | a node
a field | a property of the node

Note that the label used in different NodeMappings should be unique.

cocoindex.storages.NodeMapping describes the mapping to nodes. It has the following fields:

  • label (type: str): The label of the node.

For example, suppose we have collected the following rows:

filename | summary
chapter1.md | At the beginning, ...
chapter2.md | In the second day, ...

We can export them to nodes under label Document like this:

document_collector.export(
    ...
    cocoindex.storages.Neo4j(
        ...
        mapping=cocoindex.storages.NodeMapping(label="Document"),
    ),
    primary_key_fields=["filename"],
)

Each collected row becomes one node labeled Document in the knowledge graph, with filename and summary as its properties.

Relationships

Here's how CocoIndex data elements map to relationships in the graph:

CocoIndex Element | Graph Element
an export target | relationships with a unique type
a collected row | a relationship
a field | a property of the relationship, or a property of the source/target node, based on configuration

Note that the type used in different RelationshipMappings should be unique.

cocoindex.storages.RelationshipMapping describes the mapping to relationships. It has the following fields:

  • rel_type (type: str): The type of the relationship.
  • source/target (type: cocoindex.storages.NodeReferenceMapping): Specify how to extract source/target node information from the collected row. It has the following fields:
    • label (type: str): The label of the node.

    • fields (type: Sequence[cocoindex.storages.TargetFieldMapping]): Specify field mappings from the collected rows to node properties, with the following fields:

      • source (type: str): The name of the field in the collected row.
      • target (type: str, optional): The name of the field to use as the node field. If unspecified, will use the same as source.
      Note: map the necessary fields for the nodes of each relationship:

      • Make sure all primary key fields for the label are mapped.
      • Optionally, you can also map non-key fields. If you do, make sure all value fields are mapped.
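
The two rules above can be checked mechanically before running an export. The following is a minimal sketch in plain Python, not part of CocoIndex: `check_node_reference` is a hypothetical helper, and it reads the second rule as "value fields are either all mapped or not mapped at all", which is an interpretation of the text above.

```python
from typing import Sequence


def check_node_reference(
    mapped_targets: Sequence[str],
    primary_key_fields: Sequence[str],
    value_fields: Sequence[str],
) -> list[str]:
    """Return a list of problems; an empty list means the mapping passes both rules."""
    mapped = set(mapped_targets)
    problems = []

    # Rule 1: every primary key field of the label must be mapped.
    missing_keys = [f for f in primary_key_fields if f not in mapped]
    if missing_keys:
        problems.append(f"missing primary key fields: {missing_keys}")

    # Rule 2 (as interpreted here): value fields are mapped all-or-none.
    mapped_values = [f for f in value_fields if f in mapped]
    missing_values = [f for f in value_fields if f not in mapped]
    if mapped_values and missing_values:
        problems.append(f"value fields partially mapped, missing: {missing_values}")

    return problems


# A node label keyed by "name" with one value field, "embedding".
print(check_node_reference(["name", "embedding"], ["name"], ["embedding"]))
print(check_node_reference(["embedding"], ["name"], ["embedding"]))
```

The arguments are the target property names of the field mappings, so a mapping with `source="place_name", target="name"` contributes `"name"`.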

All fields in the collector that are not used in mappings for source or target node fields will be mapped to relationship properties.

For example, suppose we have collected the following rows, which describe the places mentioned in each file, along with embeddings of the places:

doc_filename | place_name | place_embedding | location
chapter1.md | Crystal Palace | [0.1, 0.5, ...] | 12
chapter2.md | Magic Forest | [0.4, 0.2, ...] | 23
chapter2.md | Crystal Palace | [0.1, 0.5, ...] | 56

We can export them to relationships under type MENTION like this:

doc_place_collector.export(
    ...
    cocoindex.storages.Neo4j(
        ...
        mapping=cocoindex.storages.RelationshipMapping(
            rel_type="MENTION",
            source=cocoindex.storages.NodeReferenceMapping(
                label="Document",
                fields=[cocoindex.storages.TargetFieldMapping(source="doc_filename", target="filename")],
            ),
            target=cocoindex.storages.NodeReferenceMapping(
                label="Place",
                fields=[
                    cocoindex.storages.TargetFieldMapping(source="place_name", target="name"),
                    cocoindex.storages.TargetFieldMapping(source="place_embedding", target="embedding"),
                ],
            ),
        ),
    ),
    ...
)

The doc_filename field is mapped to the Document.filename property of the source node, while place_name and place_embedding are mapped to the Place.name and Place.embedding properties of the target node. The remaining field, location, becomes a property of the relationship. For the data above, we get three MENTION relationships, one per collected row: chapter1.md to Crystal Palace (location 12), chapter2.md to Magic Forest (location 23), and chapter2.md to Crystal Palace (location 56).
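
The way one collected row is divided among the source node, the target node, and the relationship can be sketched in a few lines of plain Python. This is an illustration of the rule only, not CocoIndex's implementation: `FieldMap` is a hypothetical stand-in for TargetFieldMapping, `split_row` is a made-up helper, and the embedding is shortened to two numbers.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass
class FieldMap:
    """Stand-in for TargetFieldMapping: collector field -> node property."""
    source: str
    target: Optional[str] = None  # if unspecified, the same name as `source`

    @property
    def target_name(self) -> str:
        return self.target or self.source


def split_row(
    row: dict[str, Any],
    source_fields: Sequence[FieldMap],
    target_fields: Sequence[FieldMap],
) -> tuple[dict[str, Any], dict[str, Any], dict[str, Any]]:
    """Return (source node properties, target node properties, relationship properties)."""
    src = {m.target_name: row[m.source] for m in source_fields}
    tgt = {m.target_name: row[m.source] for m in target_fields}
    used = {m.source for m in [*source_fields, *target_fields]}
    # Every field not consumed by a node mapping stays on the relationship.
    rel = {name: value for name, value in row.items() if name not in used}
    return src, tgt, rel


row = {
    "doc_filename": "chapter1.md",
    "place_name": "Crystal Palace",
    "place_embedding": [0.1, 0.5],  # shortened for the example
    "location": 12,
}
src, tgt, rel = split_row(
    row,
    source_fields=[FieldMap("doc_filename", "filename")],
    target_fields=[FieldMap("place_name", "name"), FieldMap("place_embedding", "embedding")],
)
print(src)  # properties of the Document node
print(tgt)  # properties of the Place node
print(rel)  # properties of the MENTION relationship
```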

Nodes only referenced by relationships

If a node appears as the source or target of a relationship but is not exported using NodeMapping, CocoIndex automatically creates the node and keeps it until it is no longer referenced by any relationship.
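
One way to picture this rule: the set of nodes in the graph is the union of the explicitly exported nodes and every node that some relationship currently references. The sketch below models only that rule in plain Python. It says nothing about how CocoIndex tracks references internally, and `nodes_in_graph` is a hypothetical helper.

```python
Node = tuple[str, str]            # (label, primary key value)
Relationship = tuple[Node, Node]  # (source node, target node)


def nodes_in_graph(explicit: set[Node], relationships: list[Relationship]) -> set[Node]:
    """Explicitly exported nodes plus every node referenced by a relationship."""
    referenced = {node for rel in relationships for node in rel}
    return explicit | referenced


explicit = {("Document", "chapter1.md"), ("Document", "chapter2.md")}
relationships = [
    (("Document", "chapter1.md"), ("Place", "Crystal Palace")),
    (("Document", "chapter2.md"), ("Place", "Magic Forest")),
    (("Document", "chapter2.md"), ("Place", "Crystal Palace")),
]

before = nodes_in_graph(explicit, relationships)
print(("Place", "Magic Forest") in before)  # True: referenced by one relationship

# Drop the only relationship that mentions Magic Forest.
remaining = [rel for rel in relationships if rel[1] != ("Place", "Magic Forest")]
after = nodes_in_graph(explicit, remaining)
print(("Place", "Magic Forest") in after)    # False: no longer referenced
print(("Place", "Crystal Palace") in after)  # True: still referenced twice
```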

Merge of node values

If the same node (as identified by its primary key values) appears multiple times (e.g. it is referenced by different relationships), CocoIndex uses the value fields provided by an arbitrary one of the appearances. The best practice is to keep the value fields consistent across all appearances of the same node, to avoid non-determinism in the exported graph.
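
A quick consistency check on the collected data can catch this kind of non-determinism before export. The sketch below is plain Python and independent of CocoIndex: `find_inconsistent_nodes` is a hypothetical helper, and the conflicting embedding value is invented to trigger the check.

```python
from collections import defaultdict
from typing import Any, Iterable


def find_inconsistent_nodes(
    appearances: Iterable[tuple[str, str, dict[str, Any]]],
) -> dict[tuple[str, str], list[dict[str, Any]]]:
    """Group appearances by (label, key) and return nodes seen with differing value fields."""
    variants: dict[tuple[str, str], list[dict[str, Any]]] = defaultdict(list)
    for label, key, values in appearances:
        if values not in variants[(label, key)]:
            variants[(label, key)].append(values)
    return {node: seen for node, seen in variants.items() if len(seen) > 1}


# (label, primary key value, value fields) for each appearance of a node.
appearances = [
    ("Place", "Crystal Palace", {"embedding": [0.1, 0.5]}),
    ("Place", "Magic Forest", {"embedding": [0.4, 0.2]}),
    ("Place", "Crystal Palace", {"embedding": [0.1, 0.6]}),  # conflicts with the first row
]
conflicts = find_inconsistent_nodes(appearances)
for node, seen in conflicts.items():
    print(node, "has", len(seen), "different sets of value fields")
```

An empty result means every node has the same value fields wherever it appears, so the arbitrary choice no longer matters.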

If a node's label specified in NodeReferenceMapping doesn't exist in any NodeMapping, you need to declare a ReferencedNode to configure storage indexes for nodes with this label. The following options are supported:

  • primary_key_fields (required)
  • vector_indexes (optional)

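Continuing the same example, combining the exported nodes and relationships gives a knowledge graph with all the information: two Document nodes (chapter1.md and chapter2.md), two Place nodes (Crystal Palace and Magic Forest), and the three MENTION relationships between them.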

Nodes with Place label in the example aren't exported explicitly using NodeMapping, so CocoIndex will automatically create them as long as they're still referenced by any relationship. You need to declare a ReferencedNode:

flow_builder.declare(
    cocoindex.storages.Neo4jDeclarations(
        ...
        referenced_nodes=[
            cocoindex.storages.ReferencedNode(label="Place", primary_key_fields=["name"]),
        ],
    ),
)

Neo4j

If you don't have a Neo4j database, you can start a Neo4j database using our docker compose config:

docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/neo4j.yaml) up -d

Warning: the docker compose config above starts a Neo4j Enterprise instance under the Evaluation License, with a 30-day trial period. Please read and agree to the license before starting the instance.

The Neo4j storage exports each row as a node or a relationship to a Neo4j knowledge graph, depending on the mapping. The spec takes the following fields:

  • connection (type: auth reference to Neo4jConnectionSpec): The connection to the Neo4j database. Neo4jConnectionSpec has the following fields:
    • url (type: str): The URI of the Neo4j database to use as the internal storage, e.g. bolt://localhost:7687.
    • user (type: str): Username for the Neo4j database.
    • password (type: str): Password for the Neo4j database.
    • db (type: str, optional): The name of the Neo4j database to use as the internal storage, e.g. neo4j.
  • mapping (type: NodeMapping | RelationshipMapping): The mapping from collected rows to nodes or relationships of the graph. Two variations are supported, NodeMapping and RelationshipMapping, both described under Property Graph Targets above.

Neo4j also provides a declaration spec, Neo4jDeclarations, to configure indexing options for nodes only referenced by relationships. It has the following fields:

  • connection (type: auth reference to Neo4jConnectionSpec)
  • referenced_nodes (type: Sequence[ReferencedNode])

You can find an end-to-end example here.