Doramagic Project Pack · Human Manual
cocoindex
Incremental engine for long horizon agents. Star if you like it!
Overview and Core Concepts
Related topics: Architecture and Engine Internals, Connectors, Sources, and Targets, Operations, Live Mode, and Advanced Topics
CocoIndex is a data indexing framework that turns heterogeneous source data (files, PDFs, code, meeting notes) into queryable, incrementally maintained artifacts (embeddings, knowledge graphs, transformed files). The core idea is that a pipeline is declarative: you describe what to compute from where, and CocoIndex takes care of caching, change detection, and writing to targets, re-running only what actually changed.
What "Incremental" Means Here
Every example in the repository shares the same runtime semantics. Sources are scanned, their fingerprints are compared against the last persisted state, and only modified inputs trigger recomputation downstream. This is true for:
- Local files walked by localfs and transformed to HTML (examples/files_transform/README.md) or Markdown (examples/pdf_to_markdown/README.md).
- S3 buckets ingested in one-shot catch-up runs (examples/amazon_s3_embedding/README.md).
- Code corpora chunked and embedded into LanceDB, where unchanged files are skipped automatically (examples/code_embedding_lancedb/README.md).
- Knowledge graph flows where re-running after editing one document only re-extracts that document's triples (examples/docs_to_knowledge_graph/README.md).
Because the framework memoizes per-input results, pipeline authors do not write cache invalidation logic themselves; they declare functions and let the runtime decide what to recompute.
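The scan-and-compare step described above can be sketched in plain Python. This is a minimal illustration, not CocoIndex's implementation: the function names `fingerprint` and `detect_changes`, the use of SHA-256, and the dictionary-based state are all assumptions made for the example.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a stable content hash for one source item (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def detect_changes(previous: dict, current: dict):
    """Compare current inputs against the last persisted fingerprints.

    `previous` maps key -> fingerprint, `current` maps key -> raw bytes.
    Returns (changed_keys, deleted_keys, new_state).
    """
    new_state = {key: fingerprint(data) for key, data in current.items()}
    changed = sorted(k for k, fp in new_state.items() if previous.get(k) != fp)
    deleted = sorted(k for k in previous if k not in new_state)
    return changed, deleted, new_state


# First run: nothing persisted yet, so every input counts as changed.
changed1, deleted1, state1 = detect_changes({}, {"a.md": b"alpha", "b.md": b"beta"})

# Second run: only b.md was edited, so only b.md needs recomputation.
changed2, deleted2, state2 = detect_changes(state1, {"a.md": b"alpha", "b.md": b"beta v2"})

# Third run: b.md was removed from the source.
changed3, deleted3, state3 = detect_changes(state2, {"a.md": b"alpha"})
```

Only the keys in the changed list would be handed to downstream functions; keys in the deleted list would drive cleanup in targets.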
Core Building Blocks
The same five concepts appear in every example, whether the example is Python or Rust.
1. App
The App is the unit of deployment. It owns a name, a root directory of source data, an output directory, and a set of function references that form the pipeline. Python uses coco.App(...) with app_main as the entry point (examples/multi_codebase_summarization/README.md); Rust uses cocoindex::App(...) and the analog is a proc-macro-annotated function (examples/rust/paper_metadata/README.md). The App is what you point cocoindex update main at.
2. Function (Memoized Compute)
Functions are the compute nodes of the graph. In Python they are declared with @coco.fn(memo=True) (for example, process_file in examples/text_embedding/README.md). In Rust the equivalent is #[cocoindex::function(memo)] (examples/rust/files_transform/README.md). The memo flag tells the runtime to store the output keyed by input fingerprint, so identical inputs across runs return the cached value without re-execution.
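To make the idea of fingerprint-keyed memoization concrete, here is a small stand-alone sketch. It does not use or reproduce @coco.fn; the decorator `memo_by_fingerprint` is hypothetical, and its cache lives in memory only, whereas the runtime described above persists results across runs.

```python
import functools
import hashlib
import json


def memo_by_fingerprint(fn):
    """Cache results keyed by a hash of the arguments (hypothetical, in-memory only)."""
    cache = {}

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Serialize the arguments deterministically, then hash them.
        payload = json.dumps([args, kwargs], sort_keys=True, default=repr)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = fn(*args, **kwargs)
        return cache[key]

    wrapper.cache = cache
    return wrapper


calls = []


@memo_by_fingerprint
def process_text(text: str) -> str:
    calls.append(text)  # record real executions so cache hits are observable
    return text.upper()


first = process_text("hello")
second = process_text("hello")  # identical input: served from the cache
third = process_text("world")   # new input: the function body runs
```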
3. Source
A source is the entry point for data ingestion. CocoIndex ships with sources for local files (localfs.walk_dir), Amazon S3 (examples/amazon_s3_embedding/README.md), and Google Drive for the meeting-notes flows (examples/meeting_notes_graph_neo4j/README.md). Sources are responsible for enumeration and fingerprinting; the rest of the pipeline receives already-deduplicated, fingerprint-stable records.
4. Processing Component (Scoped Sub-Flow)
Larger pipelines are composed by mounting a function that in turn calls other functions. The mounted function gets its own scope, which is useful for grouping operations on a single logical unit (e.g. one project, one meeting, one document). In the multi-codebase example, process_project is mounted once per subdirectory and in turn calls extract_file_info, aggregate_project_info, and generate_markdown (examples/multi_codebase_summarization/README.md). The Mermaid diagram in that example makes the scoping explicit: thick arrows (==>) denote mount/use_mount, thin arrows are plain function calls.
5. Target
A target is where computed artifacts land. CocoIndex supports vector and relational backends: Postgres with pgvector (examples/text_embedding/README.md), LanceDB (examples/text_embedding_lancedb/README.md), Qdrant (examples/text_embedding_qdrant/README.md), and Turbopuffer (examples/text_embedding_turbopuffer/README.md). It also supports graph stores: Neo4j (examples/docs_to_knowledge_graph/README.md) and FalkorDB (examples/meeting_notes_graph_falkordb/README.md). For non-tabular outputs, the filesystem itself is a target: a declarative DirTarget writes, updates, and prunes files automatically (examples/rust/files_transform/README.md).
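The write/update/prune behaviour of a declarative directory target can be illustrated with the standard library alone. The sketch below is not DirTarget or localfs.declare_file; `sync_dir` is a hypothetical helper, and unlike a real target it prunes every undeclared file under the output root instead of tracking which files it previously wrote.

```python
import tempfile
from pathlib import Path


def sync_dir(root: Path, desired: dict) -> dict:
    """Make `root` contain exactly the files in `desired` (relative path -> text)."""
    root.mkdir(parents=True, exist_ok=True)
    actions = {"written": [], "skipped": [], "pruned": []}
    for rel, text in sorted(desired.items()):
        path = root / rel
        if path.is_file() and path.read_text(encoding="utf-8") == text:
            actions["skipped"].append(rel)  # unchanged output: no write
            continue
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")
        actions["written"].append(rel)
    # Prune files that are no longer declared.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        if rel not in desired:
            path.unlink()
            actions["pruned"].append(rel)
    return actions


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "out"
    first = sync_dir(out, {"a.html": "<p>a</p>", "b.html": "<p>b</p>"})
    second = sync_dir(out, {"a.html": "<p>a</p>", "b.html": "<p>b2</p>"})
    third = sync_dir(out, {"a.html": "<p>a</p>"})  # b.html no longer declared
    remaining = sorted(p.name for p in out.iterdir())
```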
Execution Modes
A single App supports two runtime modes (examples/files_transform/README.md):
| Mode | Command | Behavior |
|---|---|---|
| Catch-up | cocoindex update main | Scans sources, reconciles, exits. Used by S3 (examples/amazon_s3_embedding/README.md) and the docs-to-graph example. |
| Live | cocoindex update -L main | Reconciles, then keeps watching for file changes. Activated when the source declares live=True. |
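The difference between the two modes is essentially "reconcile once" versus "reconcile, then keep reacting". A rough sketch, assuming a polling loop in place of a real file watcher; the function `run` and its parameters are hypothetical and are not part of the cocoindex CLI or SDK.

```python
import time
from typing import Callable, Optional


def run(scan: Callable[[], dict],
        reconcile: Callable[[dict], None],
        live: bool = False,
        poll_seconds: float = 1.0,
        max_polls: Optional[int] = None) -> int:
    """Catch up once; if `live`, keep polling for changes.

    Returns how many times `reconcile` was called. `max_polls` bounds the
    loop so the sketch can terminate; a real live mode would run until stopped.
    """
    snapshot = scan()
    reconcile(snapshot)  # catch-up pass, always performed
    runs = 1
    polls = 0
    while live and (max_polls is None or polls < max_polls):
        time.sleep(poll_seconds)
        polls += 1
        latest = scan()
        if latest != snapshot:  # reconcile only when something changed
            reconcile(latest)
            snapshot = latest
            runs += 1
    return runs


# Catch-up: one scan, one reconcile, then exit.
catch_up_seen = []
catch_up_runs = run(lambda: {"a.md": "v1"}, catch_up_seen.append)

# Live: the second poll observes an edit to a.md.
states = iter([{"a.md": "v1"}, {"a.md": "v1"}, {"a.md": "v2"}])
live_seen = []
live_runs = run(lambda: next(states), live_seen.append,
                live=True, poll_seconds=0.0, max_polls=2)
```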
Data Flow at a Glance
The diagram below summarizes how a typical embedding pipeline is wired.
flowchart LR
A[Source<br/>localfs / S3 / Drive] --> B[Mounted function<br/>per-file scope]
B --> C[Memoized compute<br/>@coco.fn / function(memo)]
C --> D[Target<br/>Postgres / LanceDB / Neo4j / Files]
D -. change detected .-> C
Community Direction
Several open feature requests point at the same underlying model. Shadow run / preview (issue #1890) asks for a dry-run that computes the diff before any write, a natural extension of the same fingerprint/cache substrate that powers incremental updates today. LanceDB commit optimization (issue #1429) is about batching small appends on the target side of the same graph. Ergonomic Rust SDK (issue #1667) and MCP support (issue #160) are about new surfaces over the existing App/Function/Source/Target model; they do not change the core model, only how it is reached.
See Also
- Examples Gallery: every section above cites one or more runnable examples.
- rust/code_match/README.md: a non-pipeline utility (matching/captures over source code) shipped alongside the main framework.
Source: https://github.com/cocoindex-io/cocoindex / Human Manual
Architecture and Engine Internals
Related topics: Overview and Core Concepts, Operations, Live Mode, and Advanced Topics
Overview and Design Philosophy
CocoIndex is a framework for building incremental data pipelines that keep derived state in sync with changing sources. The core promise, visible across every example in this repository, is that re-running a flow after editing a single file only re-processes the affected inputs, never the whole dataset. The engine is language-agnostic in design: a Rust core exposes the same incremental semantics to both a Python SDK (decorator-based) and a Rust SDK (proc-macro-based), and the two ports are kept feature-parallel through shared example ports. Source: examples/files_transform/README.md:1-25, examples/rust/files_transform/README.md:1-20
The latest release (v1.0.13) added a standalone code_match crate for pattern matching against source code, with bare-keyword support, anchored regex matchers, fragment ranges, and a prefilter pass that prunes the child-run scan. This crate ships next to the main pipeline engine and reuses the same fingerprinting primitives for memoization. Source: rust/code_match/README.md:1-30
Pipeline Anatomy
Every CocoIndex flow follows the same three-stage topology: Source → Functions → Target.
- Sources are first-class, addressable streams: a local directory (via localfs.walk_dir or cocoindex::fs::walk), a Google Drive folder of meeting notes, or a list of YouTube URLs. Source: examples/text_embedding/README.md:1-20, examples/meeting_notes_graph_neo4j/README.md:1-25
- Functions are pure-ish transformations with optional memoization. They chain via mount (Python) / use_mount (Rust) for sub-pipelines, and stable identifiers come from IdGenerator so that re-asserting the same fact across inputs maps to a single node or row. Source: examples/multi_codebase_summarization/README.md:1-30, examples/docs_to_knowledge_graph/README.md:10-30
- Targets are declarative sinks that know how to upsert, skip, and prune their own data. The example catalog covers Postgres (pgvector), LanceDB (embedded, local file), Qdrant, Turbopuffer, Neo4j, FalkorDB, SurrealDB, and the local filesystem. Source: examples/text_embedding/README.md:1-25, examples/code_embedding_lancedb/README.md:1-25, examples/text_embedding_qdrant/README.md:1-30, examples/text_embedding_turbopuffer/README.md:1-20
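The point about stable identifiers is easiest to see with a toy example. The sketch below derives a deterministic ID from a normalized name, so the same entity mentioned in two documents resolves to one key. How IdGenerator actually derives its IDs is not described here; `stable_id`, the namespace URL, and the normalization rule are all assumptions for illustration.

```python
import uuid

# Hypothetical namespace for this example only.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/entities")


def stable_id(kind: str, name: str) -> str:
    """Derive a deterministic ID from an entity kind and a normalized name."""
    normalized = " ".join(name.split()).casefold()  # collapse spaces, ignore case
    return str(uuid.uuid5(NAMESPACE, f"{kind}:{normalized}"))


# The same person asserted from two different documents maps to one ID.
id_from_doc_a = stable_id("Person", "Ada Lovelace")
id_from_doc_b = stable_id("Person", "  ada   lovelace ")

# A different kind with the same name gets a different ID.
id_other_kind = stable_id("Organization", "Ada Lovelace")
```

Because the ID depends only on the asserted fact and not on which input produced it, an upsert keyed by this ID updates a single node or row instead of creating duplicates.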
flowchart LR
A[Source<br/>localfs / Drive / URLs] --> B[Function<br/>memo=True]
B --> C[Function<br/>chunk / embed / LLM]
C --> D[Target<br/>DB / Graph / Files]
D -. fingerprint .-> B
The engine tracks fingerprint hashes of every intermediate value, so a downstream function only re-runs when one of its inputs (or the function's own implementation) changes. When a source file is deleted, the engine walks forward through the graph and removes the corresponding rows or files in the targets automatically. Source: examples/rust/pdf_to_markdown/README.md:1-25, examples/rust/files_transform/README.md:1-20
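One way to picture "re-run when the input or the function's own implementation changes" is a cache key that mixes both. This is a simplified sketch under stated assumptions: it fingerprints a Python function by its bytecode and constants, which is not how the Rust core is documented to work, and the helper names are hypothetical.

```python
import hashlib


def code_fingerprint(fn) -> str:
    """Hash a function's bytecode and constants (a crude implementation fingerprint)."""
    code = fn.__code__
    return hashlib.sha256(code.co_code + repr(code.co_consts).encode()).hexdigest()


def cache_key(fn, value: str) -> str:
    """Combine the implementation fingerprint with the input value."""
    return hashlib.sha256((code_fingerprint(fn) + "|" + value).encode()).hexdigest()


def run_step(fn, value: str, cache: dict):
    """Return (result, executed); `executed` is False on a cache hit."""
    key = cache_key(fn, value)
    if key in cache:
        return cache[key], False
    result = fn(value)
    cache[key] = result
    return result, True


def shout(text):
    return text.upper()


def shout_v2(text):  # stands in for an edited version of the same step
    return text.upper() + "!"


cache = {}
r1 = run_step(shout, "hi", cache)     # first run: executes
r2 = run_step(shout, "hi", cache)     # same input, same code: cache hit
r3 = run_step(shout_v2, "hi", cache)  # same input, changed code: executes again
```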
Execution Modes
The CLI exposes two run modes, declared in the source and toggled at the command line:
- Catch-up (cocoindex update main): scans sources, syncs changes, exits. Source: examples/text_embedding/README.md:10-25
- Live (cocoindex update -L main): catches up, then keeps watching for file changes. Source: examples/text_embedding_lancedb/README.md:10-25
In live mode the source declares live=True and the engine re-evaluates the affected subgraph whenever the file watcher reports a change. This is what makes the "edit one doc, only that doc gets re-extracted" property visible in the knowledge-graph examples, where the graph-building pass only re-runs when the set of triples changes. Source: examples/docs_to_knowledge_graph/README.md:10-30
SDK Surface: Python vs Rust
The two SDKs expose the same engine semantics through different idioms. The mapping is documented side-by-side in the Rust example ports:
| Concern | Python | Rust |
|---|---|---|
| Walk source dir | localfs.walk_dir | cocoindex::fs::walk |
| Memoized function | @coco.fn(memo=True) | #[cocoindex::function(memo)] |
| Stable ids | IdGenerator | cocoindex::IdGenerator |
| Entity resolution | ops.entity_resolution | cocoindex::entity_resolution |
| Directory target | localfs.declare_file | DirTarget::declare_file |
Source: examples/rust/paper_metadata/README.md:1-30, examples/rust/conversation_to_knowledge/README.md:1-35
The Rust SDK is currently less ergonomic than its Python counterpart, an asymmetry explicitly raised in issue #1667 ("Ergonomic Rust SDK"), which proposes explicit &Ctx and proc-macro-friendly patterns to match the Python decorator experience. The v1.0.13 release continues to invest in the Rust side, with the new code_match crate leading that effort. Source: rust/code_match/README.md:1-30
Community-Driven Extensions
The architecture leaves clear extension points that the community is actively filling:
- MCP support (#160): an integration target for the Model Context Protocol, allowing CocoIndex flows to be exposed as MCP resources or tools from external clients.
- LanceDB commit optimization (#1429): the engine currently commits per-batch; for workloads with many small appends, Lance tables require compaction every several thousand rows to keep fragment counts manageable and to avoid index build blow-ups.
- Shadow run / preview (#1890): a dry-run mode requested on Discord that computes the new state but does not write to targets, so users can inspect what a chunking or schema change would produce before committing. This is the most-discussed open feature.
- Code matching (v1.0.13): the new code_match engine supports captures, descendant containment, regex matchers, and prefiltering, but rewriting is not yet implemented; planned work includes alternation, quantifiers, node-kind matchers, and a rule DSL. Source: rust/code_match/README.md:1-30
See Also
- Getting Started
- Python SDK Reference
- Rust SDK Reference
- Targets Overview
- Operations (chunking, embedding, entity resolution)
Connectors, Sources, and Targets
Related topics: Overview and Core Concepts, Architecture and Engine Internals, Operations, Live Mode, and Advanced Topics
CocoIndex pipelines move data in two directions: Sources describe where rows come from and how to detect changes, while Targets describe where derived rows are written and how to reconcile them against previously-written state. The connectorkits module under python/cocoindex/connectorkits/ provides the shared infrastructure (target lifecycle, state diffing, fingerprinting, async adapters) that every concrete connector under python/cocoindex/connectors/ builds on.
Conceptual model
A CocoIndex flow is a DAG of @coco.fn steps mounted onto sources and targets. Sources produce (key, value) rows and a per-key fingerprint so unchanged inputs can be skipped on re-run. Targets accept collected rows and reconcile them against previously-stored state, deleting rows whose source disappeared and updating rows whose fingerprint changed. Source: python/cocoindex/connectorkits/fingerprint.py:1-1 and python/cocoindex/connectorkits/target.py:1-1.
flowchart LR
A[Source: walk / pull] -->|key + fingerprint| B[@coco.fn pipeline]
B -->|collected rows| C[Target spec]
C --> D{State diff}
D -->|new / changed| E[Upsert]
D -->|missing| F[Delete]
C --> G[(Postgres / LanceDB / Qdrant / Neo4j / SurrealDB / localfs / S3)]
The dataflow is identical across SDKs: the Rust examples use cocoindex::fs::walk and DirTarget as the analogues of Python's localfs.walk_dir and localfs.declare_file (Source: examples/rust/files_transform/README.md:1-1).
Source connectors
Source connectors implement two responsibilities: enumerate current rows and emit a stable fingerprint per row so re-runs can detect inserts, updates, and deletes. The local filesystem source in python/cocoindex/connectors/localfs/_source.py:1-1 walks a directory matching a glob, reads each file, and exposes walk_dir(...) returning (filename, content) pairs keyed by relative path. Fingerprinting is content-hash-based by default, so a re-run with no file changes performs zero per-file work in memoized steps.
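The two responsibilities of a source, enumeration and fingerprinting, can be sketched with pathlib and hashlib. This is a stand-in written for illustration, not the code in _source.py and not the signature of localfs.walk_dir; `walk_files`, its glob default, and the SHA-256 fingerprint are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def walk_files(root: Path, pattern: str = "**/*.md"):
    """Yield (relative_path, content, fingerprint) for files matching a glob."""
    for path in sorted(root.glob(pattern)):
        if path.is_file():
            data = path.read_bytes()
            rel = path.relative_to(root).as_posix()  # key rows by relative path
            yield rel, data, hashlib.sha256(data).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs").mkdir()
    (root / "docs" / "a.md").write_bytes(b"# A")
    (root / "b.md").write_bytes(b"# B")
    (root / "notes.txt").write_bytes(b"ignored by the glob")

    rows = list(walk_files(root))
    keys = sorted(rel for rel, _, _ in rows)

    # Walking again without edits yields identical fingerprints,
    # which is what lets memoized steps skip per-file work.
    unchanged = {rel: fp for rel, _, fp in rows} == {rel: fp for rel, _, fp in walk_files(root)}
```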
Common source patterns across the examples:
| Pattern | Example | Notes |
|---|---|---|
| Local files (recursive glob) | examples/text_embedding/ | Markdown files via localfs.walk_dir |
| Top-level files only | examples/pdf_to_markdown/ | localfs.walk_dir with non-recursive glob |
| Live watching | examples/text_embedding_lancedb/ | live=True enables cocoindex update -L catch-up-then-watch |
| S3 bucket | examples/amazon_s3_embedding/ | One-shot catch-up; live mode not supported for S3 sources |
| Google Drive notes | examples/meeting_notes_graph_neo4j/ | Markdown files split per meeting |
| Embedded | examples/rust/files_transform/ | cocoindex::fs::walk mirror with #[cocoindex::function(memo)] |
Community issue #1890 (shadow run / preview) proposes running the source→target pipeline without persisting, so users can preview diffs before applying them; this would extend the source-fingerprinting machinery in python/cocoindex/connectorkits/fingerprint.py:1-1 with a dry-run mode.
Target connectors
Targets declare a destination schema and reconcile collected rows against previously-stored state. The shared target lifecycle lives in python/cocoindex/connectorkits/target.py:1-1 and the diff engine in python/cocoindex/connectorkits/statediff.py:1-1: given a previous snapshot and a fresh batch, it produces three action sets (to_upsert, to_delete, and to_keep) keyed by a primary key derived from the row.
Concrete targets across the examples:
- Postgres (pgvector): three-table layout in examples/text_embedding/, mounted via postgres.mount_table_target. Sequential scan if no vector index is created. Source: examples/text_embedding/README.md:1-1.
- LanceDB: embedded store under ./lancedb_data/; supports vector search plus FTS. Community issue #1429 (LanceDB commit optimization) notes that many small appends need compaction every few thousand rows. Source: examples/text_embedding_lancedb/README.md:1-1.
- Qdrant: HTTP/gRPC container, no secrets required by default. Source: examples/text_embedding_qdrant/README.md:1-1.
- Turbopuffer: remote namespace keyed by TURBOPUFFER_API_KEY. Source: examples/text_embedding_turbopuffer/README.md:1-1.
- Neo4j / FalkorDB: property-graph targets for knowledge graphs (Document, Entity, Meeting, Person, Task nodes plus typed relationships). Source: examples/docs_to_knowledge_graph/README.md:1-1 and examples/meeting_notes_graph_neo4j/README.md:1-1.
- Local filesystem directory: localfs.declare_file(...) writes/updates files and removes outputs whose source was deleted (see examples/pdf_to_markdown/ and the Rust DirTarget analogue in examples/rust/files_transform/README.md:1-1).
Targets are reconciled by primary key plus fingerprint: if a row's fingerprint is unchanged, it is skipped; changed rows are upserted; absent rows are deleted. This is what makes the same flow safe to run repeatedly. Source: python/cocoindex/connectorkits/statediff.py:1-1.
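The reconciliation rule stated above (skip unchanged, upsert changed, delete absent) reduces to a three-way partition over primary keys. The sketch below shows that partition in isolation; it is not the contents of statediff.py, and `diff_state` with its dictionary inputs is a hypothetical shape chosen for the example.

```python
def diff_state(previous: dict, fresh: dict):
    """Partition primary keys into (to_upsert, to_delete, to_keep).

    Both arguments map primary key -> fingerprint. `previous` is the last
    stored snapshot; `fresh` is the batch collected in the current run.
    """
    to_upsert = {k for k, fp in fresh.items() if previous.get(k) != fp}
    to_keep = {k for k, fp in fresh.items() if previous.get(k) == fp}
    to_delete = set(previous) - set(fresh)
    return to_upsert, to_delete, to_keep


previous = {"doc1": "aaa", "doc2": "bbb", "doc3": "ccc"}
fresh = {"doc1": "aaa", "doc2": "bbb-edited", "doc4": "ddd"}

to_upsert, to_delete, to_keep = diff_state(previous, fresh)

# Running the diff again against the new snapshot is a no-op,
# which is why repeated runs of the same flow are safe.
again_upsert, again_delete, again_keep = diff_state(fresh, fresh)
```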
Connector kit infrastructure
The connectorkits package is what lets new connectors be added with minimal boilerplate:
- target.py: base classes for mounted, key-based, and collected targets; defines the contract for setup, apply, and teardown. Source: python/cocoindex/connectorkits/target.py:1-1.
- statediff.py: computes the to_upsert / to_delete / to_keep partition used by every reconciling target. Source: python/cocoindex/connectorkits/statediff.py:1-1.
- fingerprint.py: produces stable per-row fingerprints so memoized steps and target reconciliation can detect change. Source: python/cocoindex/connectorkits/fingerprint.py:1-1.
- async_adapters.py: bridges async SDKs (e.g. asyncpg, async S3 clients) into CocoIndex's sync per-row execution model. Source: python/cocoindex/connectorkits/async_adapters.py:1-1.
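Bridging an async client into synchronous per-row code is a general Python pattern, and a common way to do it is a background event loop driven through asyncio.run_coroutine_threadsafe. The sketch below shows that pattern only; the `AsyncBridge` class and `fetch_row` coroutine are hypothetical and do not reflect what async_adapters.py actually contains.

```python
import asyncio
import threading


class AsyncBridge:
    """Run coroutines on a background event loop from synchronous code."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def call(self, coro, timeout=None):
        """Submit a coroutine to the loop thread and block until it finishes."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout)

    def close(self):
        """Stop the loop, wait for its thread, and release resources."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fetch_row(key: str) -> dict:
    """Stand-in for an async SDK call (e.g. a database or object-store read)."""
    await asyncio.sleep(0)
    return {"key": key, "value": key.upper()}


bridge = AsyncBridge()
try:
    # Synchronous per-row loop; each row is resolved through the async client.
    rows = [bridge.call(fetch_row(k), timeout=5) for k in ["a", "b"]]
finally:
    bridge.close()
```

Keeping one long-lived loop avoids creating and tearing down an event loop per row, which matters when the async client holds a connection pool.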
Common failure modes
- Stale fingerprints after schema changes. Changing a transform changes the emitted rows but not the source fingerprint; without re-running, the target will see unchanged fingerprints and skip updates. Re-running with a forced rebuild clears the snapshot.
- No live mode for cloud sources. S3 sources only support one-shot catch-up (examples/amazon_s3_embedding/); long-running pipelines must invoke cocoindex update on a schedule.
- LanceDB fragmentation. Many small commits require periodic compaction; issue #1429 tracks optimizing this directly inside CocoIndex targets. Source: examples/text_embedding_lancedb/README.md:1-1.
- MCP ingestion not yet supported. Community issue #160 (Support MCP) requests adding the Model Context Protocol as a first-class source so external tools can stream data in.
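For sources without live mode, the scheduled catch-up mentioned above could be driven by a small wrapper like the following. It assumes the cocoindex CLI is on PATH and that the app is addressed as main, as in the commands shown earlier; the helper names are hypothetical, and the demo swaps in a fake runner so nothing is actually executed. In production a cron job or systemd timer would usually be preferable to a Python loop.

```python
import subprocess
import time
from types import SimpleNamespace


def run_catch_up(command=("cocoindex", "update", "main"), runner=subprocess.run) -> int:
    """Run one catch-up pass and return the process exit code."""
    completed = runner(list(command), check=False)
    return completed.returncode


def run_on_schedule(interval_seconds: float, iterations: int,
                    job=run_catch_up, sleep=time.sleep) -> list:
    """Invoke `job` a fixed number of times, sleeping between runs."""
    codes = []
    for i in range(iterations):
        codes.append(job())
        if i < iterations - 1:
            sleep(interval_seconds)
    return codes


# Demo with a fake runner, so the sketch does not require cocoindex to be installed.
recorded = []


def fake_runner(args, check=False):
    recorded.append(args)
    return SimpleNamespace(returncode=0)


codes = run_on_schedule(
    interval_seconds=0.0,
    iterations=3,
    job=lambda: run_catch_up(runner=fake_runner),
    sleep=lambda seconds: None,
)
```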
See Also
- CocoIndex README
- Examples directory: examples/
- Rust SDK mapping: examples/rust/files_transform/README.md
Operations, Live Mode, and Advanced Topics
Related topics: Overview and Core Concepts, Architecture and Engine Internals, Connectors, Sources, and Targets
Overview
CocoIndex provides a layered operations model: a Python CLI orchestrates pipelines, each pipeline runs in either catch-up or live mode, and a target-state engine reconciles source and output stores incrementally. This page consolidates the operational surface area (the cocoindex command, live updates, target state semantics, and advanced extension points such as exception handlers) so practitioners can run CocoIndex reliably in production and one-off contexts alike.
The CLI entry point is defined in python/cocoindex/cli.py, with user-facing documentation in docs/src/content/docs/cli.mdx and command references in docs/src/content/docs/cli_commands.mdx. The Rust core that powers both Python and (eventually) native-Rust pipelines is bound to Python in rust/py/src/lib.rs, which exposes classes such as PyApp, PyUpdateHandle, and PyDropHandle.
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 6 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Capability evidence risk - Capability evidence risk requires verification.
1. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | https://github.com/cocoindex-io/cocoindex
2. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/cocoindex-io/cocoindex
3. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | https://github.com/cocoindex-io/cocoindex
4. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: risks.scoring_risks | https://github.com/cocoindex-io/cocoindex
5. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: issue_or_pr_quality=unknown
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/cocoindex-io/cocoindex
6. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: release_recency=unknown
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/cocoindex-io/cocoindex
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using cocoindex with real data or production workflows.
- [[FEATURE] allow users to specify type hints for component states](https://github.com/cocoindex-io/cocoindex/issues/2198) - github / github_issue
- v1.0.13 - github / github_release
- v1.0.12 - github / github_release
- v1.0.11 - github / github_release
- v1.0.10 - github / github_release
- v1.0.9 - github / github_release
- v1.0.8 - github / github_release
- v1.0.7 - github / github_release
- v1.0.6 - github / github_release
- v1.0.5 - github / github_release
- v1.0.4 - github / github_release
- Capability evidence risk requires verification - GitHub / issue
Source: Project Pack community evidence and pitfall evidence