cocoindex Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

cocoindex

Incremental engine for long horizon agents

Overview and Core Concepts

Related topics: Architecture and Engine Internals, Connectors, Sources, and Targets, Operations, Live Mode, and Advanced Topics

CocoIndex is a data indexing framework that turns heterogeneous source data (files, PDFs, code, meeting notes) into queryable, incrementally maintained artifacts (embeddings, knowledge graphs, transformed files). The core idea is that a pipeline is declarative: you describe what to compute from where, and CocoIndex takes care of caching, change detection, and writing to targets — re-running only what actually changed.

What "Incremental" Means Here

Every example in the repository shares the same runtime semantics. Sources are scanned, their fingerprints are compared against the last persisted state, and only modified inputs trigger recomputation downstream. This is true for:

Local files walked by localfs and transformed to HTML (examples/files_transform/README.md) or Markdown (examples/pdf_to_markdown/README.md).
S3 buckets ingested in one-shot catch-up runs (examples/amazon_s3_embedding/README.md).
Code corpora chunked and embedded into LanceDB, where unchanged files are skipped automatically (examples/code_embedding_lancedb/README.md).
Knowledge graph flows where re-running after editing one document only re-extracts that document's triples (examples/docs_to_knowledge_graph/README.md).

Because the framework memoizes per-input results, pipeline authors do not write cache invalidation logic themselves — they declare functions and let the runtime decide what to recompute.
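The fingerprint comparison described above can be illustrated with a small, self-contained sketch. It is not CocoIndex's implementation: the function names (fingerprint, detect_changes), the choice of SHA-256, and the in-memory dict standing in for persisted state are all assumptions made for illustration.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a content hash used as the change-detection fingerprint."""
    return hashlib.sha256(content).hexdigest()


def detect_changes(previous: dict[str, str], current_files: dict[str, bytes]):
    """Compare fresh fingerprints against the last persisted state.

    Returns (new_state, changed_paths, removed_paths).
    """
    new_state = {path: fingerprint(data) for path, data in current_files.items()}
    changed = sorted(p for p, fp in new_state.items() if previous.get(p) != fp)
    removed = sorted(p for p in previous if p not in new_state)
    return new_state, changed, removed


# First run: everything is new, so both files need processing.
state, changed, removed = detect_changes({}, {"a.md": b"alpha", "b.md": b"beta"})
print(changed, removed)  # ['a.md', 'b.md'] []

# Second run: only a.md was edited and b.md was deleted.
state, changed, removed = detect_changes(state, {"a.md": b"alpha v2"})
print(changed, removed)  # ['a.md'] ['b.md']
```

Only the paths in the changed list would need to flow through downstream steps; the removed list is what a target would use to clean up stale outputs.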

Core Building Blocks

The same five concepts appear in every example, whether the example is Python or Rust.

1. App

The App is the unit of deployment. It owns a name, a root directory of source data, an output directory, and a set of function references that form the pipeline. In Python, an App is created with coco.App(...) and uses app_main as its entry point (examples/multi_codebase_summarization/README.md); in Rust, it is created with cocoindex::App(...) and the entry point is a proc-macro-annotated function (examples/rust/paper_metadata/README.md). The App is what the cocoindex update main command runs.

2. Function (Memoized Compute)

Functions are the compute nodes of the graph. In Python they are declared with @coco.fn(memo=True) (for example, process_file in examples/text_embedding/README.md). In Rust the equivalent is #[cocoindex::function(memo)] (examples/rust/files_transform/README.md). The memo flag tells the runtime to store the output keyed by input fingerprint — so identical inputs across runs return the cached value without re-execution.
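As a rough illustration of what "keyed by input fingerprint" means, the sketch below builds a tiny memoizing decorator in plain Python. It does not use or reproduce @coco.fn: memo_by_fingerprint, the JSON-based key, and the dict used as a cache are hypothetical, and process_file is borrowed from the text only as a familiar name.

```python
import functools
import hashlib
import json


def memo_by_fingerprint(cache: dict):
    """Hypothetical decorator: cache results keyed by a hash of the inputs."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args):
            # The key covers the function name and its JSON-serializable arguments.
            payload = json.dumps([fn.__name__, list(args)], sort_keys=True)
            key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
            if key not in cache:
                cache[key] = fn(*args)  # executed only on a cache miss
            return cache[key]
        return wrapper
    return decorate


calls = []
store: dict = {}  # stands in for state persisted between runs


@memo_by_fingerprint(store)
def process_file(name: str, text: str) -> str:
    calls.append(name)
    return text.upper()


process_file("a.md", "hello")
process_file("a.md", "hello")   # identical input: served from the cache
process_file("a.md", "hello!")  # changed input: recomputed
print(len(calls))  # 2
```

A real engine would persist the cache across process restarts; the dict here only makes the cache-hit behavior visible within one script.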

3. Source

A source is the entry point for data ingestion. CocoIndex ships with sources for local files (localfs.walk_dir), Amazon S3 (examples/amazon_s3_embedding/README.md), and Google Drive for the meeting-notes flows (examples/meeting_notes_graph_neo4j/README.md). Sources are responsible for enumeration and fingerprinting; the rest of the pipeline receives already-deduplicated, fingerprint-stable records.

4. Processing Component (Scoped Sub-Flow)

Larger pipelines are composed by mounting a function that in turn calls other functions. The mounted function gets its own scope — useful for grouping operations on a single logical unit (e.g. one project, one meeting, one document). In the multi-codebase example, process_project is mounted once per subdirectory and in turn calls extract_file_info, aggregate_project_info, and generate_markdown (examples/multi_codebase_summarization/README.md). The Mermaid diagram in that example makes the scoping explicit: thick arrows (==>) denote mount/use_mount, thin arrows are plain function calls.

5. Target

A target is where computed artifacts land. CocoIndex supports vector and relational backends — Postgres with pgvector (examples/text_embedding/README.md), LanceDB (examples/text_embedding_lancedb/README.md), Qdrant (examples/text_embedding_qdrant/README.md), Turbopuffer (examples/text_embedding_turbopuffer/README.md) — and graph stores: Neo4j (examples/docs_to_knowledge_graph/README.md) and FalkorDB (examples/meeting_notes_graph_falkordb/README.md). For non-tabular outputs, the filesystem itself is a target: a declarative DirTarget writes, updates, and prunes files automatically (examples/rust/files_transform/README.md).

Execution Modes

A single App supports two runtime modes (examples/files_transform/README.md):

Mode	Command	Behavior
Catch-up	`cocoindex update main`	Scans sources, reconciles, exits. Used by S3 (examples/amazon_s3_embedding/README.md) and the docs-to-graph example.
Live	`cocoindex update -L main`	Reconciles, then keeps watching for file changes. Activated when the source declares `live=True`.

Data Flow at a Glance

The diagram below summarizes how a typical embedding pipeline is wired.

flowchart LR
    A[Source<br/>localfs / S3 / Drive] --> B[Mounted function<br/>per-file scope]
    B --> C["Memoized compute<br/>@coco.fn / function(memo)"]
    C --> D[Target<br/>Postgres / LanceDB / Neo4j / Files]
    D -. change detected .-> C

Community Direction

Several open feature requests point at the same underlying model. Shadow run / preview (issue #1890) asks for a dry-run that computes the diff before any write — a natural extension of the same fingerprint/cache substrate that powers incremental updates today. LanceDB commit optimization (issue #1429) is about batching small appends on the target side of the same graph. Ergonomic Rust SDK (issue #1667) and MCP support (issue #160) are about new surfaces over the existing App/Function/Source/Target model — they do not change the core model, only how it is reached.

Architecture and Engine Internals

Related topics: Overview and Core Concepts, Operations, Live Mode, and Advanced Topics

Overview and Design Philosophy

CocoIndex is a framework for building incremental data pipelines that keep derived state in sync with changing sources. The core promise, visible across every example in this repository, is that re-running a flow after editing a single file re-processes only the affected inputs, never the whole dataset. The engine is language-agnostic by design: a Rust core exposes the same incremental semantics to both a Python SDK (decorator-based) and a Rust SDK (proc-macro-based), and the two SDKs are kept feature-parallel through paired example ports. Source: examples/files_transform/README.md:1-25, examples/rust/files_transform/README.md:1-20

The latest release (v1.0.13) added a standalone code_match crate for pattern matching against source code, with bare-keyword support, anchored regex matchers, fragment ranges, and a prefilter pass that prunes the child-run scan. This crate ships next to the main pipeline engine and reuses the same fingerprinting primitives for memoization. Source: rust/code_match/README.md:1-30

Pipeline Anatomy

Every CocoIndex flow follows the same three-stage topology: Source → Functions → Target.

Sources are first-class, addressable streams: a local directory (via localfs.walk_dir or cocoindex::fs::walk), a Google Drive folder of meeting notes, or a list of YouTube URLs. Source: examples/text_embedding/README.md:1-20, examples/meeting_notes_graph_neo4j/README.md:1-25
Functions are pure-ish transformations with optional memoization. They chain via mount (Python) / use_mount (Rust) for sub-pipelines, and stable identifiers come from IdGenerator so that re-asserting the same fact across inputs maps to a single node or row. Source: examples/multi_codebase_summarization/README.md:1-30, examples/docs_to_knowledge_graph/README.md:10-30
Targets are declarative sinks that know how to upsert, skip, and prune their own data. The example catalog covers Postgres (pgvector), LanceDB (embedded, local file), Qdrant, Turbopuffer, Neo4j, FalkorDB, SurrealDB, and the local filesystem. Source: examples/text_embedding/README.md:1-25, examples/code_embedding_lancedb/README.md:1-25, examples/text_embedding_qdrant/README.md:1-30, examples/text_embedding_turbopuffer/README.md:1-20
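The stable-identifier idea mentioned for Functions can be sketched with the standard library. This is not how IdGenerator is implemented; stable_id, the namespace constant, and the normalization rule are hypothetical and only show why a deterministic id lets the same fact map to a single node or row.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
ENTITY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/entities")


def stable_id(kind: str, name: str) -> str:
    """Derive a deterministic id from a normalized (kind, name) pair."""
    normalized = f"{kind.strip().lower()}:{' '.join(name.split()).lower()}"
    return str(uuid.uuid5(ENTITY_NAMESPACE, normalized))


# The same entity mentioned in two documents resolves to one id.
id_from_doc_a = stable_id("Person", "Ada Lovelace")
id_from_doc_b = stable_id("person", "  ada   lovelace ")
print(id_from_doc_a == id_from_doc_b)  # True
```

Because the id depends only on the normalized content, re-running the pipeline produces the same key and an upsert replaces the existing row instead of creating a duplicate.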

flowchart LR
    A[Source<br/>localfs / Drive / URLs] --> B[Function<br/>memo=True]
    B --> C[Function<br/>chunk / embed / LLM]
    C --> D[Target<br/>DB / Graph / Files]
    D -. fingerprint .-> B

The engine tracks fingerprint hashes of every intermediate value, so a downstream function only re-runs when one of its inputs (or the function's own implementation) changes. When a source file is deleted, the engine walks forward through the graph and removes the corresponding rows or files in the targets automatically. Source: examples/rust/pdf_to_markdown/README.md:1-25, examples/rust/files_transform/README.md:1-20
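The deletion behavior described above can be pictured as lineage bookkeeping: remember which target rows came from which source, and drop them when the source disappears. The sketch below is a simplified model under that assumption; prune_deleted_sources and the dict-based lineage and target are hypothetical and are not taken from the engine.

```python
def prune_deleted_sources(lineage: dict[str, set[str]],
                          target_rows: dict[str, dict],
                          live_sources: set[str]) -> list[str]:
    """Remove target rows whose source no longer exists.

    lineage maps a source key to the target row keys derived from it.
    Returns the sorted list of row keys that were removed.
    """
    removed: list[str] = []
    for source_key in list(lineage):
        if source_key not in live_sources:
            for row_key in lineage.pop(source_key):
                if target_rows.pop(row_key, None) is not None:
                    removed.append(row_key)
    return sorted(removed)


lineage = {"a.pdf": {"a#0", "a#1"}, "b.pdf": {"b#0"}}
rows = {"a#0": {"text": "..."}, "a#1": {"text": "..."}, "b#0": {"text": "..."}}

# a.pdf was deleted from the source directory; b.pdf is still present.
print(prune_deleted_sources(lineage, rows, live_sources={"b.pdf"}))  # ['a#0', 'a#1']
print(sorted(rows))  # ['b#0']
```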

Execution Modes

The CLI exposes two run modes, declared in the source and toggled at the command line:

Catch-up — cocoindex update main — scans sources, syncs changes, exits. Source: examples/text_embedding/README.md:10-25
Live — cocoindex update -L main — catches up, then keeps watching for file changes. Source: examples/text_embedding_lancedb/README.md:10-25

In live mode the source declares live=True and the engine re-evaluates the affected subgraph whenever the file watcher reports a change. This is what makes the "edit one doc, only that doc gets re-extracted" property visible in the knowledge-graph examples, where the graph-building pass only re-runs when the set of triples changes. Source: examples/docs_to_knowledge_graph/README.md:10-30

SDK Surface: Python vs Rust

The two SDKs expose the same engine semantics through different idioms. The mapping is documented side-by-side in the Rust example ports:

Concern	Python	Rust
Walk source dir	`localfs.walk_dir`	`cocoindex::fs::walk`
Memoized function	`@coco.fn(memo=True)`	`#[cocoindex::function(memo)]`
Stable ids	`IdGenerator`	`cocoindex::IdGenerator`
Entity resolution	`ops.entity_resolution`	`cocoindex::entity_resolution`
Directory target	`localfs.declare_file`	`DirTarget::declare_file`

Source: examples/rust/paper_metadata/README.md:1-30, examples/rust/conversation_to_knowledge/README.md:1-35

The Rust SDK is currently less ergonomic than its Python counterpart, an asymmetry explicitly raised in issue #1667 ("Ergonomic Rust SDK"), which proposes explicit &Ctx and proc-macro-friendly patterns to match the Python decorator experience. The v1.0.13 release continues to invest in the Rust side, with the new code_match crate leading that effort. Source: rust/code_match/README.md:1-30

Community-Driven Extensions

The architecture leaves clear extension points that the community is actively filling:

MCP support (#160) — an integration target for the Model Context Protocol, allowing CocoIndex flows to be exposed as MCP resources or tools from external clients.
LanceDB commit optimization (#1429) — the engine currently commits per-batch; for workloads with many small appends, Lance tables require compaction every several thousand rows to keep fragment counts manageable and to avoid index build blow-ups.
Shadow run / preview (#1890) — a dry-run mode requested on Discord that computes the new state but does not write to targets, so users can inspect what a chunking or schema change would produce before committing. This is the most-discussed open feature.
Code matching (v1.0.13) — the new code_match engine supports captures, descendant containment, regex matchers, and prefiltering, but rewriting is not yet implemented; planned work includes alternation, quantifiers, node-kind matchers, and a rule DSL. Source: rust/code_match/README.md:1-30

Connectors, Sources, and Targets

Related topics: Overview and Core Concepts, Architecture and Engine Internals, Operations, Live Mode, and Advanced Topics

CocoIndex pipelines move data in two directions: Sources describe where rows come from and how to detect changes, while Targets describe where derived rows are written and how to reconcile them against previously-written state. The connectorkits module under python/cocoindex/connectorkits/ provides the shared infrastructure (target lifecycle, state diffing, fingerprinting, async adapters) that every concrete connector under python/cocoindex/connectors/ builds on.

Conceptual model

A CocoIndex flow is a DAG of @coco.fn steps mounted onto sources and targets. Sources produce (key, value) rows and a per-key fingerprint so unchanged inputs can be skipped on re-run. Targets accept collected rows and reconcile them against previously-stored state, deleting rows whose source disappeared and updating rows whose fingerprint changed. Source: python/cocoindex/connectorkits/fingerprint.py:1-1 and python/cocoindex/connectorkits/target.py:1-1.

flowchart LR
    A[Source: walk / pull] -->|key + fingerprint| B[@coco.fn pipeline]
    B -->|collected rows| C[Target spec]
    C --> D{State diff}
    D -->|new / changed| E[Upsert]
    D -->|missing| F[Delete]
    C --> G[(Postgres / LanceDB / Qdrant / Neo4j / SurrealDB / localfs / S3)]

The dataflow is identical across SDKs: the Rust examples use cocoindex::fs::walk and DirTarget as the analogues of Python's localfs.walk_dir and localfs.declare_file (Source: examples/rust/files_transform/README.md:1-1).

Source connectors

Source connectors have two responsibilities: enumerating the current rows and emitting a stable fingerprint per row so that re-runs can detect inserts, updates, and deletes. The local filesystem source in python/cocoindex/connectors/localfs/_source.py:1-1 walks a directory matching a glob, reads each file, and exposes walk_dir(...), which returns (filename, content) pairs keyed by relative path. Fingerprinting is content-hash-based by default, so a re-run with no file changes performs zero per-file work in memoized steps.

Common source patterns across the examples:

Pattern	Example	Notes
Local files (recursive glob)	`examples/text_embedding/`	Markdown files via `localfs.walk_dir`
Top-level files only	`examples/pdf_to_markdown/`	`localfs.walk_dir` with non-recursive glob
Live watching	`examples/text_embedding_lancedb/`	`live=True` enables `cocoindex update -L` catch-up-then-watch
S3 bucket	`examples/amazon_s3_embedding/`	One-shot catch-up; live mode not supported for S3 sources
Google Drive notes	`examples/meeting_notes_graph_neo4j/`	Markdown files split per meeting
Rust SDK port	`examples/rust/files_transform/`	`cocoindex::fs::walk` mirror with `#[cocoindex::function(memo)]`

Community issue #1890 (shadow run / preview) proposes running the source→target pipeline without persisting, so users can preview diffs before applying them — this would extend the source-fingerprinting machinery in python/cocoindex/connectorkits/fingerprint.py:1-1 with a dry-run mode.

Target connectors

Targets declare a destination schema and reconcile collected rows against previously-stored state. The shared target lifecycle lives in python/cocoindex/connectorkits/target.py:1-1 and the diff engine in python/cocoindex/connectorkits/statediff.py:1-1: given a previous snapshot and a fresh batch, it produces three action sets — to_upsert, to_delete, and to_keep — keyed by a primary key derived from the row.

Concrete targets across the examples:

Postgres (pgvector) — three-table layout in examples/text_embedding/, mounted via postgres.mount_table_target. Sequential scan if no vector index is created. Source: examples/text_embedding/README.md:1-1.
LanceDB — embedded store under ./lancedb_data/; supports vector search plus FTS. Community issue #1429 (LanceDB commit optimization) notes that many small appends need compaction every few thousand rows. Source: examples/text_embedding_lancedb/README.md:1-1.
Qdrant — HTTP/gRPC container, no secrets required by default. Source: examples/text_embedding_qdrant/README.md:1-1.
Turbopuffer — remote namespace keyed by TURBOPUFFER_API_KEY. Source: examples/text_embedding_turbopuffer/README.md:1-1.
Neo4j / FalkorDB — property-graph targets for knowledge graphs (Document, Entity, Meeting, Person, Task nodes plus typed relationships). Source: examples/docs_to_knowledge_graph/README.md:1-1 and examples/meeting_notes_graph_neo4j/README.md:1-1.
Local filesystem directory — localfs.declare_file(...) writes/updates files and removes outputs whose source was deleted (see examples/pdf_to_markdown/ and the Rust DirTarget analogue in examples/rust/files_transform/README.md:1-1).

Targets are reconciled by primary key plus fingerprint: if a row's fingerprint is unchanged, it is skipped; changed rows are upserted; absent rows are deleted. This is what makes the same flow safe to run repeatedly. Source: python/cocoindex/connectorkits/statediff.py:1-1.
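The three-way partition can be sketched in a few lines. This is an illustrative model rather than the contents of statediff.py: the names StateDiff and diff_state and the representation of a snapshot as a dict from primary key to fingerprint are assumptions; only the to_upsert / to_delete / to_keep naming comes from the text.

```python
from typing import NamedTuple


class StateDiff(NamedTuple):
    to_upsert: set[str]
    to_delete: set[str]
    to_keep: set[str]


def diff_state(previous: dict[str, str], fresh: dict[str, str]) -> StateDiff:
    """Partition primary keys by comparing per-row fingerprints."""
    to_upsert = {k for k, fp in fresh.items() if previous.get(k) != fp}
    to_delete = set(previous) - set(fresh)
    to_keep = set(fresh) - to_upsert
    return StateDiff(to_upsert, to_delete, to_keep)


previous = {"doc1": "fp-a", "doc2": "fp-b", "doc3": "fp-c"}
fresh = {"doc1": "fp-a", "doc2": "fp-b2", "doc4": "fp-d"}

diff = diff_state(previous, fresh)
print(sorted(diff.to_upsert))  # ['doc2', 'doc4']
print(sorted(diff.to_delete))  # ['doc3']
print(sorted(diff.to_keep))    # ['doc1']
```

Running the same diff twice with an unchanged batch yields an empty to_upsert and to_delete, which is the property that makes repeated runs safe.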

Connector kit infrastructure

The connectorkits package is what lets new connectors be added with minimal boilerplate:

target.py — base classes for mounted, key-based, and collected targets; defines the contract for setup, apply, and teardown. Source: python/cocoindex/connectorkits/target.py:1-1.
statediff.py — computes the to_upsert / to_delete / to_keep partition used by every reconciling target. Source: python/cocoindex/connectorkits/statediff.py:1-1.
fingerprint.py — produces stable per-row fingerprints so memoized steps and target reconciliation can detect change. Source: python/cocoindex/connectorkits/fingerprint.py:1-1.
async_adapters.py — bridges async SDKs (e.g. asyncpg, async S3 clients) into CocoIndex's sync per-row execution model. Source: python/cocoindex/connectorkits/async_adapters.py:1-1.
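The async-to-sync bridging role described for async_adapters.py can be illustrated with asyncio from the standard library. This sketch does not reproduce that module; to_sync and fetch_row are hypothetical names, and calling asyncio.run per invocation is a simplification that only works when no event loop is already running in the calling thread.

```python
import asyncio
import functools


def to_sync(async_fn):
    """Wrap a coroutine function so synchronous code can call it directly."""
    @functools.wraps(async_fn)
    def wrapper(*args, **kwargs):
        # Runs the coroutine to completion on a fresh event loop.
        return asyncio.run(async_fn(*args, **kwargs))
    return wrapper


async def fetch_row(key: str) -> dict:
    """Stand-in for a call into an async client library."""
    await asyncio.sleep(0)
    return {"key": key, "value": key.upper()}


fetch_row_sync = to_sync(fetch_row)
print(fetch_row_sync("a"))  # {'key': 'a', 'value': 'A'}
```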

Common failure modes

Stale fingerprints after schema changes. Changing a transform changes the emitted rows but not the source fingerprint; without re-running, the target will see unchanged fingerprints and skip updates. Re-running with a forced rebuild clears the snapshot.
No live mode for cloud sources. S3 sources only support one-shot catch-up (examples/amazon_s3_embedding/); long-running pipelines must invoke cocoindex update on a schedule.
LanceDB fragmentation. Many small commits require periodic compaction; issue #1429 tracks optimizing this directly inside CocoIndex targets. Source: examples/text_embedding_lancedb/README.md:1-1.
MCP ingestion not yet supported. Community issue #160 (Support MCP) requests adding the Model Context Protocol as a first-class source so external tools can stream data in.
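For the scheduled catch-up workaround mentioned in the S3 item, one possible shape is a small driver that invokes the documented cocoindex update main command at a fixed interval. The function name, its parameters, and the injectable run and sleep hooks are hypothetical; in practice a system scheduler such as cron may be a better fit than a long-lived Python loop.

```python
import subprocess
import time
from typing import Callable, Sequence

# Command taken from the manual; the app target "main" depends on the project.
UPDATE_COMMAND: Sequence[str] = ("cocoindex", "update", "main")


def run_catch_up_on_schedule(iterations: int,
                             interval_seconds: float,
                             run: Callable = subprocess.run,
                             sleep: Callable[[float], None] = time.sleep) -> list[int]:
    """Invoke the catch-up command repeatedly and collect exit codes."""
    exit_codes: list[int] = []
    for i in range(iterations):
        completed = run(list(UPDATE_COMMAND), check=False)
        exit_codes.append(completed.returncode)
        if i < iterations - 1:
            sleep(interval_seconds)  # wait before the next catch-up run
    return exit_codes
```

Calling run_catch_up_on_schedule(iterations=24, interval_seconds=3600) would attempt one catch-up run per hour for a day. The run and sleep parameters exist so the loop can be exercised in tests without launching the CLI or actually waiting.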

Operations, Live Mode, and Advanced Topics

Related topics: Overview and Core Concepts, Architecture and Engine Internals, Connectors, Sources, and Targets

Overview

CocoIndex provides a layered operations model: a Python CLI orchestrates pipelines, each pipeline runs in either catch-up or live mode, and a target-state engine reconciles source and output stores incrementally. This page consolidates the operational surface area — the cocoindex command, live updates, target state semantics, and the advanced extension points (e.g. exception handlers) — so practitioners can run CocoIndex reliably in production and one-off contexts alike.

The CLI entry point is defined in python/cocoindex/cli.py, with user-facing documentation in docs/src/content/docs/cli.mdx and command references in docs/src/content/docs/cli_commands.mdx. The Rust core that powers both Python and (eventually) native-Rust pipelines is surfaced through rust/py/src/lib.rs, which exposes classes such as PyApp, PyUpdateHandle, and PyDropHandle.

Source: https://github.com/cocoindex-io/cocoindex / Human Manual

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 6 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Capability evidence risk - Capability evidence risk requires verification.

1. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | https://github.com/cocoindex-io/cocoindex

2. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/cocoindex-io/cocoindex

3. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: downstream_validation.risk_items | https://github.com/cocoindex-io/cocoindex

4. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: risks.scoring_risks | https://github.com/cocoindex-io/cocoindex

5. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: issue_or_pr_quality=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/cocoindex-io/cocoindex

6. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: release_recency=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/cocoindex-io/cocoindex

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12 (count of project-level external discussion links exposed on this manual page).

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using cocoindex with real data or production workflows.

[[FEATURE] allow users to specify type hints for component states](https://github.com/cocoindex-io/cocoindex/issues/2198) - github / github_issue
v1.0.13 - github / github_release
v1.0.12 - github / github_release
v1.0.11 - github / github_release
v1.0.10 - github / github_release
v1.0.9 - github / github_release
v1.0.8 - github / github_release
v1.0.7 - github / github_release
v1.0.6 - github / github_release
v1.0.5 - github / github_release
v1.0.4 - github / github_release
Capability evidence risk requires verification - GitHub / issue

Source: Project Pack community evidence and pitfall evidence
