# https://github.com/cocoindex-io/cocoindex Project Manual

Generated at: 2026-06-23 00:58:43 UTC

## Table of Contents

- [Overview and Core Concepts](#page-1)
- [Architecture and Engine Internals](#page-2)
- [Connectors, Sources, and Targets](#page-3)
- [Operations, Live Mode, and Advanced Topics](#page-4)

<a id='page-1'></a>

## Overview and Core Concepts

### Related Pages

Related topics: [Architecture and Engine Internals](#page-2), [Connectors, Sources, and Targets](#page-3), [Operations, Live Mode, and Advanced Topics](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [examples/text_embedding/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding/README.md)
- [examples/text_embedding_lancedb/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding_lancedb/README.md)
- [examples/text_embedding_qdrant/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding_qdrant/README.md)
- [examples/text_embedding_turbopuffer/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding_turbopuffer/README.md)
- [examples/pdf_to_markdown/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/pdf_to_markdown/README.md)
- [examples/files_transform/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/files_transform/README.md)
- [examples/code_embedding_lancedb/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/code_embedding_lancedb/README.md)
- [examples/amazon_s3_embedding/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/amazon_s3_embedding/README.md)
- [examples/multi_codebase_summarization/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/multi_codebase_summarization/README.md)
- [examples/patient_intake_extraction_dspy/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/patient_intake_extraction_dspy/README.md)
- [examples/docs_to_knowledge_graph/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/docs_to_knowledge_graph/README.md)
- [examples/meeting_notes_graph_neo4j/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/meeting_notes_graph_neo4j/README.md)
- [examples/rust/files_transform/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/rust/files_transform/README.md)
- [examples/rust/pdf_to_markdown/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/rust/pdf_to_markdown/README.md)
- [examples/rust/paper_metadata/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/rust/paper_metadata/README.md)
- [rust/code_match/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/rust/code_match/README.md)
</details>

# Overview and Core Concepts

CocoIndex is a data indexing framework that turns heterogeneous source data (files, PDFs, code, meeting notes) into queryable, incrementally maintained artifacts (embeddings, knowledge graphs, transformed files). The core idea is that a pipeline is **declarative**: you describe what to compute from where, and CocoIndex takes care of caching, change detection, and writing to targets — re-running only what actually changed.

## What "Incremental" Means Here

Every example in the repository shares the same runtime semantics. Sources are scanned, their fingerprints are compared against the last persisted state, and only modified inputs trigger recomputation downstream. This is true for:

- **Local files** walked by `localfs` and transformed to HTML ([examples/files_transform/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/files_transform/README.md)) or Markdown ([examples/pdf_to_markdown/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/pdf_to_markdown/README.md)).
- **S3 buckets** ingested in one-shot catch-up runs ([examples/amazon_s3_embedding/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/amazon_s3_embedding/README.md)).
- **Code corpora** chunked and embedded into LanceDB, where unchanged files are skipped automatically ([examples/code_embedding_lancedb/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/code_embedding_lancedb/README.md)).
- **Knowledge graph** flows where re-running after editing one document only re-extracts that document's triples ([examples/docs_to_knowledge_graph/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/docs_to_knowledge_graph/README.md)).

Because the framework memoizes per-input results, pipeline authors do not write cache invalidation logic themselves — they declare functions and let the runtime decide what to recompute.
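
The scan, compare, and recompute loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not CocoIndex's implementation: `incremental_run`, the in-memory `state` dict, and the choice of SHA-256 as the fingerprint are all hypothetical and exist only for this example.

```python
import hashlib
from typing import Callable


def fingerprint(content: bytes) -> str:
    """Content hash used to decide whether an input changed (assumed: SHA-256)."""
    return hashlib.sha256(content).hexdigest()


def incremental_run(
    sources: dict[str, bytes],
    state: dict[str, str],
    outputs: dict[str, str],
    process: Callable[[bytes], str],
) -> list[str]:
    """Reprocess only new or modified inputs and drop outputs of deleted inputs.

    `state` maps each key to the fingerprint seen on the previous run and is
    updated in place. Returns the keys that were actually recomputed.
    """
    recomputed = []
    for key, content in sources.items():
        fp = fingerprint(content)
        if state.get(key) != fp:  # new or modified input
            outputs[key] = process(content)
            state[key] = fp
            recomputed.append(key)
    for key in list(state):
        if key not in sources:  # input disappeared since the last run
            del state[key]
            outputs.pop(key, None)
    return recomputed


def to_upper(content: bytes) -> str:
    return content.decode().upper()


state: dict[str, str] = {}
outputs: dict[str, str] = {}

first = incremental_run({"a.md": b"hello", "b.md": b"world"}, state, outputs, to_upper)
second = incremental_run({"a.md": b"hello", "b.md": b"world!"}, state, outputs, to_upper)
third = incremental_run({"b.md": b"world!"}, state, outputs, to_upper)
```

The first run processes both files, the second only the edited `b.md`, and the third processes nothing while removing the output that belonged to the deleted `a.md`.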

## Core Building Blocks

The same five concepts appear in every example, whether the example is Python or Rust.

### 1. App

The `App` is the unit of deployment. It owns a name, a root directory of source data, an output directory, and a set of function references that form the pipeline. Python uses `coco.App(...)` with `app_main` as the entry point ([examples/multi_codebase_summarization/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/multi_codebase_summarization/README.md)); Rust uses `cocoindex::App(...)`, where the entry point is a proc-macro-annotated function ([examples/rust/paper_metadata/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/rust/paper_metadata/README.md)). The `App` is what `cocoindex update main` runs, with `main` naming the module that defines it.

### 2. Function (Memoized Compute)

Functions are the compute nodes of the graph. In Python they are declared with `@coco.fn(memo=True)` (for example, `process_file` in [examples/text_embedding/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding/README.md)). In Rust the equivalent is `#[cocoindex::function(memo)]` ([examples/rust/files_transform/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/rust/files_transform/README.md)). The `memo` flag tells the runtime to store the output keyed by input fingerprint — so identical inputs across runs return the cached value without re-execution.
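
To make the "keyed by input fingerprint" idea concrete, here is a small stand-alone memoization decorator. It is a sketch only: `memo` below is a hypothetical helper built on the standard library, not the `@coco.fn(memo=True)` decorator, and its cache lives in memory rather than in persisted state.

```python
import functools
import hashlib
import pickle
from typing import Any, Callable


def memo(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Cache results keyed by a fingerprint of the arguments (hypothetical helper)."""
    cache: dict[str, Any] = {}

    @functools.wraps(fn)
    def wrapper(*args: Any) -> Any:
        # The fingerprint is a hash of the pickled arguments.
        key = hashlib.sha256(pickle.dumps(args)).hexdigest()
        if key not in cache:
            cache[key] = fn(*args)
            wrapper.executions += 1  # count real executions, not cache hits
        return cache[key]

    wrapper.executions = 0
    return wrapper


@memo
def count_words(text: str) -> int:
    return len(text.split())


count_words("one two three")
count_words("one two three")       # identical input: served from the cache
count_words("one two three four")  # different input: executed
```

After these three calls the function body has run twice. A real runtime would also need to persist the cache between runs, which this sketch leaves out.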

### 3. Source

A source is the entry point for data ingestion. CocoIndex ships with sources for local files (`localfs.walk_dir`), Amazon S3 ([examples/amazon_s3_embedding/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/amazon_s3_embedding/README.md)), and Google Drive for the meeting-notes flows ([examples/meeting_notes_graph_neo4j/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/meeting_notes_graph_neo4j/README.md)). Sources are responsible for enumeration and fingerprinting; the rest of the pipeline receives already-deduplicated, fingerprint-stable records.

### 4. Processing Component (Scoped Sub-Flow)

Larger pipelines are composed by mounting a function that in turn calls other functions. The mounted function gets its own scope — useful for grouping operations on a single logical unit (e.g. one project, one meeting, one document). In the multi-codebase example, `process_project` is mounted once per subdirectory and in turn calls `extract_file_info`, `aggregate_project_info`, and `generate_markdown` ([examples/multi_codebase_summarization/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/multi_codebase_summarization/README.md)). The Mermaid diagram in that example makes the scoping explicit: thick arrows (`==>`) denote `mount`/`use_mount`, thin arrows are plain function calls.
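
The "one scope per logical unit" idea can be illustrated without the framework. In the sketch below, `group_by_project` and `summarize_project` are hypothetical stand-ins for the mounting step and the mounted function; the only assumption carried over from the text is that each top-level subdirectory is treated as one project.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_project(paths: list[str]) -> dict[str, list[str]]:
    """Group relative file paths by their top-level directory (one scope each)."""
    projects: dict[str, list[str]] = defaultdict(list)
    for p in paths:
        parts = PurePosixPath(p).parts
        if len(parts) > 1:  # skip files that sit directly in the root
            projects[parts[0]].append(p)
    return dict(projects)


def summarize_project(name: str, files: list[str]) -> str:
    """Stand-in for a mounted function: summarize one project in isolation."""
    return f"# {name}\n\n{len(files)} file(s): {', '.join(sorted(files))}"


paths = ["proj_a/main.py", "proj_a/util.py", "proj_b/app.py", "README.md"]
summaries = {
    name: summarize_project(name, files)
    for name, files in group_by_project(paths).items()
}
```

Because each project is processed from its own inputs only, editing a file in `proj_a` gives no reason to recompute the summary for `proj_b`.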

### 5. Target

A target is where computed artifacts land. CocoIndex supports vector and relational backends — Postgres with pgvector ([examples/text_embedding/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding/README.md)), LanceDB ([examples/text_embedding_lancedb/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding_lancedb/README.md)), Qdrant ([examples/text_embedding_qdrant/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding_qdrant/README.md)), Turbopuffer ([examples/text_embedding_turbopuffer/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding_turbopuffer/README.md)) — and graph stores: Neo4j ([examples/docs_to_knowledge_graph/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/docs_to_knowledge_graph/README.md)) and FalkorDB ([examples/meeting_notes_graph_falkordb/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/meeting_notes_graph_falkordb/README.md)). For non-tabular outputs, the filesystem itself is a target: a declarative `DirTarget` writes, updates, and prunes files automatically ([examples/rust/files_transform/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/rust/files_transform/README.md)).
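
The "writes, updates, and prunes" behavior of a directory target can be sketched with `pathlib`. The function `sync_dir` is hypothetical and handles a flat directory only; it is meant to show the declarative contract (state the desired files, let the target reconcile), not how `DirTarget` is implemented.

```python
import tempfile
from pathlib import Path


def sync_dir(target: Path, desired: dict[str, str]) -> dict[str, list[str]]:
    """Make `target` contain exactly the files in `desired` (name -> text).

    Returns which files were written, left untouched, or pruned.
    """
    target.mkdir(parents=True, exist_ok=True)
    actions: dict[str, list[str]] = {"written": [], "unchanged": [], "pruned": []}
    for name, text in sorted(desired.items()):
        path = target / name
        if path.exists() and path.read_text(encoding="utf-8") == text:
            actions["unchanged"].append(name)
        else:
            path.write_text(text, encoding="utf-8")
            actions["written"].append(name)
    for path in sorted(target.iterdir()):
        if path.is_file() and path.name not in desired:
            path.unlink()  # output whose source no longer exists
            actions["pruned"].append(path.name)
    return actions


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "out"
    run1 = sync_dir(out, {"a.html": "<p>a</p>", "b.html": "<p>b</p>"})
    run2 = sync_dir(out, {"a.html": "<p>a</p>", "c.html": "<p>c</p>"})
    remaining = sorted(p.name for p in out.iterdir())
```

Note that this sketch prunes every file it does not recognize. A production target would more plausibly track which files it owns so that unrelated files in the same directory survive.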

## Execution Modes

A single `App` supports two runtime modes ([examples/files_transform/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/files_transform/README.md)):

| Mode | Command | Behavior |
| --- | --- | --- |
| **Catch-up** | `cocoindex update main` | Scans sources, reconciles, exits. Used by S3 ([examples/amazon_s3_embedding/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/amazon_s3_embedding/README.md)) and the docs-to-graph example. |
| **Live** | `cocoindex update -L main` | Reconciles, then keeps watching for file changes. Activated when the source declares `live=True`. |

## Data Flow at a Glance

The diagram below summarizes how a typical embedding pipeline is wired.

```mermaid
flowchart LR
    A[Source<br/>localfs / S3 / Drive] --> B[Mounted function<br/>per-file scope]
    B --> C["Memoized compute<br/>@coco.fn / function(memo)"]
    C --> D[Target<br/>Postgres / LanceDB / Neo4j / Files]
    D -. change detected .-> C
```

## Community Direction

Several open feature requests point at the same underlying model. **Shadow run / preview** (issue #1890) asks for a dry-run that computes the diff before any write — a natural extension of the same fingerprint/cache substrate that powers incremental updates today. **LanceDB commit optimization** (issue #1429) is about batching small appends on the target side of the same graph. **Ergonomic Rust SDK** (issue #1667) and **MCP support** (issue #160) are about new surfaces over the existing App/Function/Source/Target model — they do not change the core model, only how it is reached.

## See Also

- [Examples Gallery](https://github.com/cocoindex-io/cocoindex/tree/main/examples) — every section above cites one or more runnable examples.
- [rust/code_match/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/rust/code_match/README.md) — a non-pipeline utility (matching/captures over source code) shipped alongside the main framework.

---

<a id='page-2'></a>

## Architecture and Engine Internals

### Related Pages

Related topics: [Overview and Core Concepts](#page-1), [Operations, Live Mode, and Advanced Topics](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [examples/files_transform/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/files_transform/README.md)
- [examples/text_embedding/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding/README.md)
- [examples/text_embedding_lancedb/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding_lancedb/README.md)
- [examples/text_embedding_qdrant/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding_qdrant/README.md)
- [examples/text_embedding_turbopuffer/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding_turbopuffer/README.md)
- [examples/pdf_to_markdown/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/pdf_to_markdown/README.md)
- [examples/multi_codebase_summarization/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/multi_codebase_summarization/README.md)
- [examples/docs_to_knowledge_graph/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/docs_to_knowledge_graph/README.md)
- [examples/meeting_notes_graph_neo4j/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/meeting_notes_graph_neo4j/README.md)
- [examples/rust/files_transform/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/rust/files_transform/README.md)
- [examples/rust/pdf_to_markdown/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/rust/pdf_to_markdown/README.md)
- [examples/rust/paper_metadata/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/rust/paper_metadata/README.md)
- [examples/rust/conversation_to_knowledge/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/rust/conversation_to_knowledge/README.md)
- [rust/code_match/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/rust/code_match/README.md)
</details>

# Architecture and Engine Internals

## Overview and Design Philosophy

CocoIndex is a framework for building **incremental data pipelines** that keep derived state in sync with changing sources. The core promise, visible across every example in this repository, is that re-running a flow after editing a single file re-processes only the affected inputs, never the whole dataset. The engine is language-agnostic by design: a Rust core exposes the same incremental semantics to both a Python SDK (decorator-based) and a Rust SDK (proc-macro-based), and the two SDKs are kept feature-parallel through examples ported to both languages. Source: [examples/files_transform/README.md:1-25](), [examples/rust/files_transform/README.md:1-20]()

The latest release (v1.0.13) added a standalone `code_match` crate for pattern matching against source code, with bare-keyword support, anchored regex matchers, fragment ranges, and a prefilter pass that prunes the child-run scan. This crate ships next to the main pipeline engine and reuses the same fingerprinting primitives for memoization. Source: [rust/code_match/README.md:1-30]()

## Pipeline Anatomy

Every CocoIndex flow follows the same three-stage topology: **Source → Functions → Target**.

- **Sources** are first-class, addressable streams: a local directory (via `localfs.walk_dir` or `cocoindex::fs::walk`), a Google Drive folder of meeting notes, or a list of YouTube URLs. Source: [examples/text_embedding/README.md:1-20](), [examples/meeting_notes_graph_neo4j/README.md:1-25]()
- **Functions** are mostly pure transformations with optional memoization. They compose into sub-pipelines via `mount`/`use_mount`, and stable identifiers come from `IdGenerator` so that re-asserting the same fact across inputs maps to a single node or row. Source: [examples/multi_codebase_summarization/README.md:1-30](), [examples/docs_to_knowledge_graph/README.md:10-30]()
- **Targets** are declarative sinks that know how to upsert, skip, and prune their own data. The example catalog covers Postgres (pgvector), LanceDB (embedded, local file), Qdrant, Turbopuffer, Neo4j, FalkorDB, SurrealDB, and the local filesystem. Source: [examples/text_embedding/README.md:1-25](), [examples/code_embedding_lancedb/README.md:1-25](), [examples/text_embedding_qdrant/README.md:1-30](), [examples/text_embedding_turbopuffer/README.md:1-20]()
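
Why stable identifiers matter for targets is easier to see in code. The sketch below derives deterministic IDs with `uuid.uuid5`; it is an assumption-laden illustration, not the `IdGenerator` API, and the namespace URL, `stable_id`, and the normalization rule are all invented for the example.

```python
import uuid

# Arbitrary namespace chosen for this example; any fixed UUID works.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/entities")


def stable_id(kind: str, name: str) -> str:
    """Derive the same ID for the same (kind, normalized name) on every run."""
    normalized = " ".join(name.lower().split())
    return str(uuid.uuid5(NAMESPACE, f"{kind}:{normalized}"))


# Three mentions across two documents collapse into two nodes.
mentions = [("a.md", "CocoIndex"), ("b.md", "  cocoindex "), ("b.md", "Postgres")]
nodes: dict[str, dict[str, str]] = {}
for source_doc, mention in mentions:
    node_id = stable_id("Entity", mention)
    nodes.setdefault(node_id, {"name": mention.strip(), "first_seen_in": source_doc})
```

Because the ID depends only on the fact being asserted, a second document that mentions the same entity upserts the existing node instead of creating a duplicate.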

```mermaid
flowchart LR
    A[Source<br/>localfs / Drive / URLs] --> B[Function<br/>memo=True]
    B --> C[Function<br/>chunk / embed / LLM]
    C --> D[Target<br/>DB / Graph / Files]
    D -. fingerprint .-> B
```

The engine tracks **fingerprint hashes** of every intermediate value, so a downstream function only re-runs when one of its inputs (or the function's own implementation) changes. When a source file is deleted, the engine walks forward through the graph and removes the corresponding rows or files in the targets automatically. Source: [examples/rust/pdf_to_markdown/README.md:1-25](), [examples/rust/files_transform/README.md:1-20]()
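
The cited READMEs do not describe how the engine fingerprints a function's implementation, so the following is only one possible approach, shown to make the "inputs or implementation" rule tangible. `logic_fingerprint` and `memo_key` are hypothetical names, and hashing bytecode is a CPython-specific approximation.

```python
import hashlib
from typing import Callable


def logic_fingerprint(fn: Callable) -> str:
    """Hash the parts of a function's bytecode that capture its behavior.

    CPython-specific and approximate: closures, globals, and callees are ignored.
    """
    code = fn.__code__
    payload = repr((code.co_code, code.co_consts, code.co_names)).encode()
    return hashlib.sha256(payload).hexdigest()


def memo_key(fn: Callable, value: bytes) -> str:
    """Combine the logic fingerprint with the input fingerprint."""
    h = hashlib.sha256()
    h.update(logic_fingerprint(fn).encode())
    h.update(hashlib.sha256(value).digest())
    return h.hexdigest()


def shout_v1(text: str) -> str:
    return text.upper()


def shout_v2(text: str) -> str:
    return text.upper() + "!"


same_logic_same_input = memo_key(shout_v1, b"hi") == memo_key(shout_v1, b"hi")
new_logic_same_input = memo_key(shout_v1, b"hi") == memo_key(shout_v2, b"hi")
same_logic_new_input = memo_key(shout_v1, b"hi") == memo_key(shout_v1, b"ho")
```

Only the first comparison yields equal keys, so a cached result is reused exactly when neither the input nor the function body has changed.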

## Execution Modes

The CLI exposes two run modes, declared in the source and toggled at the command line:

- **Catch-up** — `cocoindex update main` — scans sources, syncs changes, exits. Source: [examples/text_embedding/README.md:10-25]()
- **Live** — `cocoindex update -L main` — catches up, then keeps watching for file changes. Source: [examples/text_embedding_lancedb/README.md:10-25]()

In live mode the source declares `live=True` and the engine re-evaluates the affected subgraph whenever the file watcher reports a change. This is what makes the "edit one doc, only that doc gets re-extracted" property visible in the knowledge-graph examples, where the graph-building pass only re-runs when the set of triples changes. Source: [examples/docs_to_knowledge_graph/README.md:10-30]()
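
The difference between the two modes comes down to whether the reconciliation loop stops after the first pass. A minimal sketch, assuming the source can be modeled as a sequence of snapshots mapping keys to fingerprints (`update` and `changed_keys` are hypothetical names, and no real file watcher is involved):

```python
from typing import Iterable, Iterator


def changed_keys(prev: dict[str, str], curr: dict[str, str]) -> set[str]:
    """Keys that were added, modified, or removed between two snapshots."""
    return {k for k in prev.keys() | curr.keys() if prev.get(k) != curr.get(k)}


def update(snapshots: Iterable[dict[str, str]], live: bool) -> Iterator[set[str]]:
    """Yield the set of affected keys per reconciliation pass.

    `snapshots` stands in for successive observations of the source
    (key -> fingerprint). Catch-up consumes one observation; live keeps going.
    """
    state: dict[str, str] = {}
    for snapshot in snapshots:
        affected = changed_keys(state, snapshot)
        state = dict(snapshot)
        yield affected
        if not live:
            return


observations = [
    {"a.md": "f1", "b.md": "f2"},  # initial scan
    {"a.md": "f1", "b.md": "f3"},  # b.md edited
    {"a.md": "f1"},                # b.md deleted
]
catch_up_passes = list(update(observations, live=False))
live_passes = list(update(observations, live=True))
```

In the live run, the second and third passes touch only `b.md`, which mirrors the "edit one doc, only that doc gets re-extracted" property.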

## SDK Surface: Python vs Rust

The two SDKs expose the same engine semantics through different idioms. The mapping is documented side-by-side in the Rust example ports:

| Concern | Python | Rust |
|---------|--------|------|
| Walk source dir | `localfs.walk_dir` | `cocoindex::fs::walk` |
| Memoized function | `@coco.fn(memo=True)` | `#[cocoindex::function(memo)]` |
| Stable ids | `IdGenerator` | `cocoindex::IdGenerator` |
| Entity resolution | `ops.entity_resolution` | `cocoindex::entity_resolution` |
| Directory target | `localfs.declare_file` | `DirTarget::declare_file` |

Source: [examples/rust/paper_metadata/README.md:1-30](), [examples/rust/conversation_to_knowledge/README.md:1-35]()

The Rust SDK is currently less ergonomic than its Python counterpart, an asymmetry explicitly raised in issue #1667 ("Ergonomic Rust SDK"), which proposes explicit `&Ctx` and proc-macro-friendly patterns to match the Python decorator experience. The v1.0.13 release continues to invest in the Rust side, with the new `code_match` crate leading that effort. Source: [rust/code_match/README.md:1-30]()

## Community-Driven Extensions

The architecture leaves clear extension points that the community is actively filling:

- **MCP support (#160)** — an integration target for the Model Context Protocol, allowing CocoIndex flows to be exposed as MCP resources or tools from external clients.
- **LanceDB commit optimization (#1429)** — the engine currently commits per-batch; for workloads with many small appends, Lance tables require compaction every several thousand rows to keep fragment counts manageable and to avoid index build blow-ups.
- **Shadow run / preview (#1890)** — a dry-run mode requested on Discord that computes the new state but does not write to targets, so users can inspect what a chunking or schema change would produce before committing. This is the most-discussed open feature.
- **Code matching (v1.0.13)** — the new `code_match` engine supports captures, descendant containment, regex matchers, and prefiltering, but rewriting is not yet implemented; planned work includes alternation, quantifiers, node-kind matchers, and a rule DSL. Source: [rust/code_match/README.md:1-30]()

## See Also

- Getting Started
- Python SDK Reference
- Rust SDK Reference
- Targets Overview
- Operations (chunking, embedding, entity resolution)

---

<a id='page-3'></a>

## Connectors, Sources, and Targets

### Related Pages

Related topics: [Overview and Core Concepts](#page-1), [Architecture and Engine Internals](#page-2), [Operations, Live Mode, and Advanced Topics](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [python/cocoindex/connectorkits/target.py](https://github.com/cocoindex-io/cocoindex/blob/main/python/cocoindex/connectorkits/target.py)
- [python/cocoindex/connectorkits/statediff.py](https://github.com/cocoindex-io/cocoindex/blob/main/python/cocoindex/connectorkits/statediff.py)
- [python/cocoindex/connectorkits/fingerprint.py](https://github.com/cocoindex-io/cocoindex/blob/main/python/cocoindex/connectorkits/fingerprint.py)
- [python/cocoindex/connectorkits/async_adapters.py](https://github.com/cocoindex-io/cocoindex/blob/main/python/cocoindex/connectorkits/async_adapters.py)
- [python/cocoindex/connectors/localfs/_source.py](https://github.com/cocoindex-io/cocoindex/blob/main/python/cocoindex/connectors/localfs/_source.py)
- [python/cocoindex/connectors/localfs/_target.py](https://github.com/cocoindex-io/cocoindex/blob/main/python/cocoindex/connectors/localfs/_target.py)
- [examples/amazon_s3_embedding/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/amazon_s3_embedding/README.md)
- [examples/text_embedding/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding/README.md)
- [examples/text_embedding_lancedb/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding_lancedb/README.md)
- [examples/docs_to_knowledge_graph/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/docs_to_knowledge_graph/README.md)
- [examples/pdf_to_markdown/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/pdf_to_markdown/README.md)
- [examples/rust/files_transform/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/rust/files_transform/README.md)
</details>

# Connectors, Sources, and Targets

CocoIndex pipelines move data in two directions: **Sources** describe where rows come from and how to detect changes, while **Targets** describe where derived rows are written and how to reconcile them against previously-written state. The `connectorkits` module under [python/cocoindex/connectorkits/](https://github.com/cocoindex-io/cocoindex/tree/main/python/cocoindex/connectorkits) provides the shared infrastructure (target lifecycle, state diffing, fingerprinting, async adapters) that every concrete connector under [python/cocoindex/connectors/](https://github.com/cocoindex-io/cocoindex/tree/main/python/cocoindex/connectors) builds on.

## Conceptual model

A CocoIndex flow is a DAG of `@coco.fn` steps mounted onto sources and targets. Sources produce `(key, value)` rows and a per-key fingerprint so unchanged inputs can be skipped on re-run. Targets accept collected rows and reconcile them against previously-stored state, deleting rows whose source disappeared and updating rows whose fingerprint changed. Source: [python/cocoindex/connectorkits/fingerprint.py:1-1]() and [python/cocoindex/connectorkits/target.py:1-1]().

```mermaid
flowchart LR
    A[Source: walk / pull] -->|key + fingerprint| B["@coco.fn pipeline"]
    B -->|collected rows| C[Target spec]
    C --> D{State diff}
    D -->|new / changed| E[Upsert]
    D -->|missing| F[Delete]
    C --> G[(Postgres / LanceDB / Qdrant / Neo4j / SurrealDB / localfs / S3)]
```

The dataflow is identical across SDKs: the Rust examples use `cocoindex::fs::walk` and `DirTarget` as the analogues of Python's `localfs.walk_dir` and `localfs.declare_file` (Source: [examples/rust/files_transform/README.md:1-1]()).

## Source connectors

Source connectors implement two responsibilities: enumerate current rows and emit a stable fingerprint per row so re-runs can detect inserts, updates, and deletes. The local filesystem source in [python/cocoindex/connectors/localfs/_source.py:1-1]() walks a directory matching a glob, reads each file, and exposes `walk_dir(...)` returning `(filename, content)` pairs keyed by relative path. Fingerprinting is content-hash-based by default, so a re-run with no file changes performs zero per-file work in memoized steps.
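
The two responsibilities (enumerate rows, fingerprint each one) fit in a few lines of standard-library Python. The generator below is a sketch that only resembles the behavior described here; `walk_files` is a hypothetical name, and its signature is not that of `localfs.walk_dir`.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Iterator


def walk_files(root: Path, pattern: str = "**/*.md") -> Iterator[tuple[str, bytes, str]]:
    """Yield (relative_path, content, fingerprint) for files matching `pattern`."""
    for path in sorted(root.glob(pattern)):
        if path.is_file():
            content = path.read_bytes()
            rel = path.relative_to(root).as_posix()  # key rows by relative path
            yield rel, content, hashlib.sha256(content).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs").mkdir()
    (root / "intro.md").write_bytes(b"# Intro")
    (root / "docs" / "guide.md").write_bytes(b"# Guide")
    (root / "notes.txt").write_bytes(b"ignored")
    recursive = [rel for rel, _, _ in walk_files(root)]
    top_level = [rel for rel, _, _ in walk_files(root, "*.md")]
```

Switching the glob from `**/*.md` to `*.md` is what separates a recursive walk from a top-level-only walk in this sketch.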

Common source patterns across the examples:

| Pattern | Example | Notes |
|---|---|---|
| Local files (recursive glob) | `examples/text_embedding/` | Markdown files via `localfs.walk_dir` |
| Top-level files only | `examples/pdf_to_markdown/` | `localfs.walk_dir` with non-recursive glob |
| Live watching | `examples/text_embedding_lancedb/` | `live=True` enables `cocoindex update -L` catch-up-then-watch |
| S3 bucket | `examples/amazon_s3_embedding/` | One-shot catch-up; live mode not supported for S3 sources |
| Google Drive notes | `examples/meeting_notes_graph_neo4j/` | Markdown files split per meeting |
| Rust SDK (local files) | `examples/rust/files_transform/` | `cocoindex::fs::walk`, the Rust counterpart of `localfs.walk_dir`, used with `#[cocoindex::function(memo)]` |

Community issue [#1890 (shadow run / preview)](https://github.com/cocoindex-io/cocoindex/issues/1890) proposes running the source→target pipeline without persisting, so users can preview diffs before applying them — this would extend the source-fingerprinting machinery in [python/cocoindex/connectorkits/fingerprint.py:1-1]() with a dry-run mode.

## Target connectors

Targets declare a destination schema and reconcile collected rows against previously-stored state. The shared target lifecycle lives in [python/cocoindex/connectorkits/target.py:1-1]() and the diff engine in [python/cocoindex/connectorkits/statediff.py:1-1](): given a previous snapshot and a fresh batch, it produces three action sets — `to_upsert`, `to_delete`, and `to_keep` — keyed by a primary key derived from the row.
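
The three-way partition can be written down directly. The sketch below borrows the names `to_upsert`, `to_delete`, and `to_keep` from the text, but the `Diff` dataclass, the `diff_state` function, and the shape of its arguments are assumptions made for illustration; they are not taken from `statediff.py`.

```python
from dataclasses import dataclass, field


@dataclass
class Diff:
    to_upsert: dict[str, dict] = field(default_factory=dict)
    to_delete: list[str] = field(default_factory=list)
    to_keep: list[str] = field(default_factory=list)


def diff_state(previous: dict[str, str], fresh: dict[str, tuple[str, dict]]) -> Diff:
    """Partition rows by comparing fingerprints per primary key.

    previous: key -> fingerprint stored by the last run.
    fresh:    key -> (fingerprint, row) produced by this run.
    """
    diff = Diff()
    for key, (fp, row) in fresh.items():
        if previous.get(key) == fp:
            diff.to_keep.append(key)   # unchanged: nothing to write
        else:
            diff.to_upsert[key] = row  # new or changed
    diff.to_delete = [key for key in previous if key not in fresh]
    return diff


previous = {"doc1": "aaa", "doc2": "bbb", "doc3": "ccc"}
fresh = {
    "doc1": ("aaa", {"text": "unchanged"}),
    "doc2": ("xyz", {"text": "edited"}),
    "doc4": ("ddd", {"text": "new"}),
}
result = diff_state(previous, fresh)
```

Running the same diff again after the snapshot has been updated produces an empty `to_upsert` and `to_delete`, which is the property that makes repeated runs safe.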

Concrete targets across the examples:

- **Postgres (pgvector)** — three-table layout in `examples/text_embedding/`, mounted via `postgres.mount_table_target`. Sequential scan if no vector index is created. Source: [examples/text_embedding/README.md:1-1]().
- **LanceDB** — embedded store under `./lancedb_data/`; supports vector search plus FTS. Community issue [#1429 (LanceDB commit optimization)](https://github.com/cocoindex-io/cocoindex/issues/1429) notes that many small appends need compaction every few thousand rows. Source: [examples/text_embedding_lancedb/README.md:1-1]().
- **Qdrant** — HTTP/gRPC container, no secrets required by default. Source: [examples/text_embedding_qdrant/README.md:1-1]().
- **Turbopuffer** — remote namespace keyed by `TURBOPUFFER_API_KEY`. Source: [examples/text_embedding_turbopuffer/README.md:1-1]().
- **Neo4j / FalkorDB** — property-graph targets for knowledge graphs (`Document`, `Entity`, `Meeting`, `Person`, `Task` nodes plus typed relationships). Source: [examples/docs_to_knowledge_graph/README.md:1-1]() and [examples/meeting_notes_graph_neo4j/README.md:1-1]().
- **Local filesystem directory** — `localfs.declare_file(...)` writes/updates files and removes outputs whose source was deleted (see `examples/pdf_to_markdown/` and the Rust `DirTarget` analogue in [examples/rust/files_transform/README.md:1-1]()).

Targets are reconciled by primary key plus fingerprint: if a row's fingerprint is unchanged, it is skipped; changed rows are upserted; absent rows are deleted. This is what makes the same flow safe to run repeatedly. Source: [python/cocoindex/connectorkits/statediff.py:1-1]().

## Connector kit infrastructure

The `connectorkits` package is what lets new connectors be added with minimal boilerplate:

- **target.py** — base classes for mounted, key-based, and collected targets; defines the contract for `setup`, `apply`, and `teardown`. Source: [python/cocoindex/connectorkits/target.py:1-1]().
- **statediff.py** — computes the `to_upsert / to_delete / to_keep` partition used by every reconciling target. Source: [python/cocoindex/connectorkits/statediff.py:1-1]().
- **fingerprint.py** — produces stable per-row fingerprints so memoized steps and target reconciliation can detect change. Source: [python/cocoindex/connectorkits/fingerprint.py:1-1]().
- **async_adapters.py** — bridges async SDKs (e.g. asyncpg, async S3 clients) into CocoIndex's sync per-row execution model. Source: [python/cocoindex/connectorkits/async_adapters.py:1-1]().
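
One common way to bridge async clients into synchronous call sites is to run a dedicated event loop on a background thread and block on each result. The sketch below shows that pattern with `asyncio` only; `SyncBridge` and `fetch_row` are hypothetical, and nothing here is claimed to match what `async_adapters.py` does internally.

```python
import asyncio
import threading
from typing import Any, Callable, Coroutine


class SyncBridge:
    """Run coroutines on a background event loop and block for the result."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def call(self, fn: Callable[..., Coroutine[Any, Any, Any]], *args: Any) -> Any:
        # Schedule the coroutine on the loop thread, then wait for it here.
        future = asyncio.run_coroutine_threadsafe(fn(*args), self._loop)
        return future.result()

    def close(self) -> None:
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fetch_row(key: str) -> dict[str, str]:
    """Stand-in for an async client call (e.g. a database or object-store read)."""
    await asyncio.sleep(0)
    return {"key": key, "value": key.upper()}


bridge = SyncBridge()
rows = [bridge.call(fetch_row, key) for key in ["a", "b"]]
bridge.close()
```

Keeping a single long-lived loop, rather than calling `asyncio.run` per row, lets an async client reuse its connections across calls.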

## Common failure modes

- **Stale fingerprints after schema changes.** Changing a transform changes the emitted rows but not the source fingerprint; without re-running, the target will see unchanged fingerprints and skip updates. Re-running with a forced rebuild clears the snapshot.
- **No live mode for cloud sources.** S3 sources only support one-shot catch-up (`examples/amazon_s3_embedding/`); long-running pipelines must invoke `cocoindex update` on a schedule.
- **LanceDB fragmentation.** Many small commits require periodic compaction; issue [#1429](https://github.com/cocoindex-io/cocoindex/issues/1429) tracks optimizing this directly inside CocoIndex targets. Source: [examples/text_embedding_lancedb/README.md:1-1]().
- **MCP ingestion not yet supported.** Community issue [#160 (Support MCP)](https://github.com/cocoindex-io/cocoindex/issues/160) requests adding the Model Context Protocol as a first-class source so external tools can stream data in.

## See Also

- [CocoIndex README](https://github.com/cocoindex-io/cocoindex/blob/main/README.md)
- Examples directory: [examples/](https://github.com/cocoindex-io/cocoindex/tree/main/examples)
- Rust SDK mapping: [examples/rust/files_transform/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/rust/files_transform/README.md)

---

<a id='page-4'></a>

## Operations, Live Mode, and Advanced Topics

### Related Pages

Related topics: [Overview and Core Concepts](#page-1), [Architecture and Engine Internals](#page-2), [Connectors, Sources, and Targets](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [python/cocoindex/cli.py](https://github.com/cocoindex-io/cocoindex/blob/main/python/cocoindex/cli.py)
- [docs/src/content/docs/cli.mdx](https://github.com/cocoindex-io/cocoindex/blob/main/docs/src/content/docs/cli.mdx)
- [docs/src/content/docs/cli_commands.mdx](https://github.com/cocoindex-io/cocoindex/blob/main/docs/src/content/docs/cli_commands.mdx)
- [docs/src/content/docs/programming_guide/live_mode.mdx](https://github.com/cocoindex-io/cocoindex/blob/main/docs/src/content/docs/programming_guide/live_mode.mdx)
- [docs/src/content/docs/programming_guide/target_state.mdx](https://github.com/cocoindex-io/cocoindex/blob/main/docs/src/content/docs/programming_guide/target_state.mdx)
- [docs/src/content/docs/advanced_topics/exception_handlers.mdx](https://github.com/cocoindex-io/cocoindex/blob/main/docs/src/content/docs/advanced_topics/exception_handlers.mdx)
- [examples/text_embedding_lancedb/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding_lancedb/README.md)
- [examples/pdf_to_markdown/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/pdf_to_markdown/README.md)
- [examples/files_transform/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/files_transform/README.md)
- [examples/docs_to_knowledge_graph/README.md](https://github.com/cocoindex-io/cocoindex/blob/main/examples/docs_to_knowledge_graph/README.md)
- [rust/py/src/lib.rs](https://github.com/cocoindex-io/cocoindex/blob/main/rust/py/src/lib.rs)
</details>

# Operations, Live Mode, and Advanced Topics

## Overview

CocoIndex provides a layered operations model: a Python CLI orchestrates pipelines, each pipeline runs in either catch-up or live mode, and a target-state engine reconciles source and output stores incrementally. This page consolidates the operational surface area — the `cocoindex` command, live updates, target state semantics, and the advanced extension points (e.g. exception handlers) — so practitioners can run CocoIndex reliably in production and one-off contexts alike.

The CLI entry point is defined in [`python/cocoindex/cli.py`](https://github.com/cocoindex-io/cocoindex/blob/main/python/cocoindex/cli.py), with user-facing documentation in [`docs/src/content/docs/cli.mdx`](https://github.com/cocoindex-io/cocoindex/blob/main/docs/src/content/docs/cli.mdx) and command references in [`docs/src/content/docs/cli_commands.mdx`](https://github.com/cocoindex-io/cocoindex/blob/main/docs/src/content/docs/cli_commands.mdx). The Rust core that powers both Python and (eventually) native-Rust pipelines is exposed to Python through [`rust/py/src/lib.rs`](https://github.com/cocoindex-io/cocoindex/blob/main/rust/py/src/lib.rs), which defines classes such as `PyApp`, `PyUpdateHandle`, and `PyDropHandle`.

---

## CLI Operations

The `cocoindex` command is the primary entry point for managing pipelines from a shell. It dispatches to subcommands declared in the Python entry point (`cli.py`).

### Common commands

| Command | Purpose |
| ------- | ------- |
| `cocoindex update <module>` | Catch-up run: scan sources, reconcile state, exit. |
| `cocoindex update -L <module>` | Live run: catch up, then keep watching for source changes. |
| `cocoindex drop <module>` | Drop a previously-built flow and its backing state. |
| `cocoindex show <module>` | Show live progress for running flows. |

Source: [`docs/src/content/docs/cli.mdx`](https://github.com/cocoindex-io/cocoindex/blob/main/docs/src/content/docs/cli.mdx) and [`docs/src/content/docs/cli_commands.mdx`](https://github.com/cocoindex-io/cocoindex/blob/main/docs/src/content/docs/cli_commands.mdx).

Examples such as [`examples/text_embedding_lancedb/README.md`](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding_lancedb/README.md) and [`examples/files_transform/README.md`](https://github.com/cocoindex-io/cocoindex/blob/main/examples/files_transform/README.md) routinely use both the catch-up (`cocoindex update main`) and live (`cocoindex update -L main`) variants to illustrate the two operational modes.
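The two invocations in the table can also be driven from a script. The sketch below is illustrative only: it assembles the argument list for `cocoindex update <module>` and `cocoindex update -L <module>` exactly as listed above, and the helper names `build_update_command` and `run_update` are hypothetical, not part of CocoIndex.

```python
import subprocess


def build_update_command(module: str, live: bool = False) -> list[str]:
    """Return the argv for a catch-up or live `cocoindex update` run."""
    argv = ["cocoindex", "update"]
    if live:
        argv.append("-L")  # live flag, as listed in the command table
    argv.append(module)
    return argv


def run_update(module: str, live: bool = False) -> int:
    """Run the command and return its exit code (needs `cocoindex` on PATH)."""
    return subprocess.run(build_update_command(module, live), check=False).returncode
```

Keeping command construction separate from execution makes the wrapper easy to check without a CocoIndex installation.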

---

## Live Mode

Live mode keeps a pipeline process attached to its sources and re-reconciles whenever the underlying inputs change. It is the recommended mode for long-running services that mirror source-of-truth data into derived stores.

### Enabling live mode

A pipeline is considered live when either:

- the source is declared with `live=True` (e.g. a `localfs` directory walk declared with the live flag), **or**
- the CLI is invoked with `-L` / `--live` so the runtime treats all sources as live.

Source: [`docs/src/content/docs/programming_guide/live_mode.mdx`](https://github.com/cocoindex-io/cocoindex/blob/main/docs/src/content/docs/programming_guide/live_mode.mdx) and example READMEs (e.g. [`examples/text_embedding_lancedb/README.md`](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding_lancedb/README.md)).

### Lifecycle

```mermaid
stateDiagram-v2
    [*] --> CatchUp: cocoindex update [-L] <module>
    CatchUp --> Steady: reconciliation complete
    Steady --> Reconciling: source event (live only)
    Reconciling --> Steady: incremental update done
    Steady --> [*]: catch-up run exits
    Steady --> Steady: live run keeps watching
```

In live mode, the runtime watches file sources, re-evaluates memoized functions only for changed inputs, and applies the resulting diffs to targets. Functions annotated with `@coco.fn(memo=True)` (Python) or `#[cocoindex::function(memo)]` (Rust) cache outputs by content fingerprint, so unchanged inputs skip downstream work entirely. This pattern is demonstrated in [`examples/rust/files_transform/README.md`](https://github.com/cocoindex-io/cocoindex/blob/main/examples/rust/files_transform/README.md) and [`examples/docs_to_knowledge_graph/README.md`](https://github.com/cocoindex-io/cocoindex/blob/main/examples/docs_to_knowledge_graph/README.md) ("CocoIndex reconciles changes incrementally — re-running after editing one doc only re-extracts that doc").
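The idea of caching by content fingerprint can be shown without any CocoIndex code. The following is a minimal sketch, assuming a SHA-256 hash as the fingerprint and an in-memory dictionary as the cache; `ContentMemo` is a hypothetical helper, and the real engine's fingerprinting and storage may differ.

```python
import hashlib
from typing import Callable


def fingerprint(content: bytes) -> str:
    """Hash the input bytes; equal content always yields the same key."""
    return hashlib.sha256(content).hexdigest()


class ContentMemo:
    """Wrap a function so that it runs once per distinct input content."""

    def __init__(self, fn: Callable[[bytes], str]) -> None:
        self._fn = fn
        self._cache: dict[str, str] = {}
        self.calls = 0  # number of times the wrapped function actually ran

    def __call__(self, content: bytes) -> str:
        key = fingerprint(content)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._fn(content)
        return self._cache[key]


to_upper = ContentMemo(lambda data: data.decode("utf-8").upper())
to_upper(b"alpha")  # computed
to_upper(b"alpha")  # same fingerprint: served from the cache
to_upper(b"beta")   # new fingerprint: computed
```

After these three calls the wrapped function has run twice, which is the "unchanged inputs skip downstream work" behavior in miniature.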

### Cancellation and shutdown

The Rust core registers four lifecycle functions on the PyO3 module, `init_runtime`, `shutdown_tokio_runtime`, `py_cancel_all`, and `py_reset_global_cancellation`, to support graceful shutdown of long-running live processes (see [`rust/py/src/lib.rs`](https://github.com/cocoindex-io/cocoindex/blob/main/rust/py/src/lib.rs)). Use Ctrl-C / SIGINT for a graceful shutdown in local development; container orchestrators should send `SIGTERM` and allow a grace period for in-flight work to drain before forcing termination.
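A supervisor that follows the "send `SIGTERM`, then wait" advice might look like the sketch below. It uses only the Python standard library and a stand-in child process; the 30-second default grace period and the helper names are assumptions, not CocoIndex settings.

```python
import subprocess
import sys


def stop_gracefully(proc: subprocess.Popen, grace_seconds: float = 30.0) -> int:
    """Send SIGTERM, wait for the grace period, then force-kill if still alive."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()


def demo_stop() -> int:
    """Start a stand-in child (a real supervisor would launch the live run) and stop it."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    return stop_gracefully(child, grace_seconds=5.0)
```

The force-kill branch is a last resort: a process that drains within the grace period never reaches it.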

---

## Target State and Reconciliation

Targets declare the desired state of an external system (a database, a directory, a property graph). CocoIndex reconciles toward that desired state on every update.

### Target categories

| Target family | Examples | Notes |
| ------------- | -------- | ----- |
| Vector stores | LanceDB, Qdrant, Postgres (pgvector), Turbopuffer | Embedding-bearing tables; support upserts keyed by stable IDs. |
| Graph stores | Neo4j, FalkorDB, SurrealDB | Property-graph nodes and relationships, keyed by stable IDs. |
| Filesystem | `localfs` (Python), `DirTarget` (Rust) | Declarative directory mirroring: writes/updates, skips unchanged files, removes outputs whose sources were deleted. |

Source: [`docs/src/content/docs/programming_guide/target_state.mdx`](https://github.com/cocoindex-io/cocoindex/blob/main/docs/src/content/docs/programming_guide/target_state.mdx), [`examples/files_transform/README.md`](https://github.com/cocoindex-io/cocoindex/blob/main/examples/files_transform/README.md), [`examples/pdf_to_markdown/README.md`](https://github.com/cocoindex-io/cocoindex/blob/main/examples/pdf_to_markdown/README.md).
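The filesystem row describes three outcomes per output: write or update, skip, and remove. A minimal sketch of that decision, assuming each output is represented by a path mapped to a content hash, is shown below; `plan_reconcile` and `ReconcilePlan` are hypothetical names and do not reflect the engine's internal data structures.

```python
from dataclasses import dataclass, field


@dataclass
class ReconcilePlan:
    writes: list[str] = field(default_factory=list)   # new or changed outputs
    skips: list[str] = field(default_factory=list)    # unchanged outputs
    deletes: list[str] = field(default_factory=list)  # outputs with no source left


def plan_reconcile(desired: dict[str, str], current: dict[str, str]) -> ReconcilePlan:
    """Compare desired and current {path: content_hash} maps."""
    plan = ReconcilePlan()
    for path in sorted(desired):
        if current.get(path) == desired[path]:
            plan.skips.append(path)
        else:
            plan.writes.append(path)
    plan.deletes = sorted(set(current) - set(desired))
    return plan


plan = plan_reconcile(
    desired={"a.md": "h1", "b.md": "h2-new", "d.md": "h4"},
    current={"b.md": "h2-old", "c.md": "h3", "d.md": "h4"},
)
```

Here `a.md` is new and `b.md` changed, so both are written; `d.md` is skipped; `c.md` is removed because nothing declares it any more.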

### Stable identifiers

Because targets are reconciled by key, every record needs a deterministic ID derived from its identifying fields. CocoIndex supplies an `IdGenerator` primitive (used in Python examples and exposed as `cocoindex::IdGenerator` in Rust, per [`examples/rust/conversation_to_knowledge/README.md`](https://github.com/cocoindex-io/cocoindex/blob/main/examples/rust/conversation_to_knowledge/README.md)). Hashing a canonical tuple of those fields is the usual pattern: it ensures that re-asserting the same fact in another document maps to the same row/edge (see [`examples/docs_to_knowledge_graph/README.md`](https://github.com/cocoindex-io/cocoindex/blob/main/examples/docs_to_knowledge_graph/README.md)).
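The hashing pattern can be sketched in plain Python. This is not the `IdGenerator` API; `stable_id` is a hypothetical helper, and the JSON encoding and 16-character truncation are assumptions made for illustration.

```python
import hashlib
import json


def stable_id(*fields: str) -> str:
    """Derive a deterministic ID from an ordered tuple of identifying fields."""
    # JSON-encoding the list keeps field boundaries unambiguous,
    # so ("ab", "c") and ("a", "bc") never collide by concatenation.
    canonical = json.dumps(list(fields), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# The same fact asserted in two documents maps to the same key.
edge_from_doc_a = stable_id("CocoIndex", "WRITES_TO", "LanceDB")
edge_from_doc_b = stable_id("CocoIndex", "WRITES_TO", "LanceDB")
```

Field order is part of the identity, so callers need one fixed order per record type.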

### Operational caveats

- **LanceDB compaction** — Lance tables accumulate fragments across many small appends. The community has tracked this in issue #1429 ("Optimize for LanceDB commit"): CocoIndex currently relies on user-driven compaction, and operators should periodically compact + optimize indices per Lance's guidance.
- **Graph entity resolution** — Without a normalization pass, "CocoIndex" and "Cocoindex" become separate `Entity` nodes. The [`examples/meeting_notes_graph_neo4j`](https://github.com/cocoindex-io/cocoindex/tree/main/examples/meeting_notes_graph_neo4j) example demonstrates an embedding + LLM entity-resolution pass to collapse near-duplicates.
- **Shadow / preview runs** — Issue #1890 tracks a long-standing community request for a shadow-run mode that diffs the new desired state against the live state without writing. This is not yet shipped; today, the safest preview is to point a flow at a scratch target/database.
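For the entity-resolution caveat, the cheapest normalization pass is a case- and whitespace-insensitive key. The sketch below is a much simpler stand-in for the embedding + LLM pass in the Neo4j example, and `entity_key` and `group_entities` are hypothetical helpers.

```python
import re
import unicodedata


def entity_key(name: str) -> str:
    """Normalize Unicode form, case, and whitespace so trivial variants match."""
    text = unicodedata.normalize("NFKC", name).casefold()
    return re.sub(r"\s+", " ", text).strip()


def group_entities(names: list[str]) -> dict[str, list[str]]:
    """Group raw entity names under their normalized key."""
    groups: dict[str, list[str]] = {}
    for name in names:
        groups.setdefault(entity_key(name), []).append(name)
    return groups


groups = group_entities(["CocoIndex", "Cocoindex", " cocoindex "])
```

This collapses spelling variants that differ only in case or spacing; true near-duplicates such as abbreviations or aliases still need a similarity-based pass.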

---

## Advanced Topics

### Exception handlers

CocoIndex lets you attach handlers to flows so transient failures (rate limits, network blips, LLM provider 5xx) can be retried or escalated without aborting an entire batch. Handlers are documented in [`docs/src/content/docs/advanced_topics/exception_handlers.mdx`](https://github.com/cocoindex-io/cocoindex/blob/main/docs/src/content/docs/advanced_topics/exception_handlers.mdx). Common patterns include:

- **Rate-limit backoff** — Surface `429`s, sleep with exponential backoff, and retry the failing operation.
- **Dead-letter routing** — Capture per-record failures and write them to a sidecar target so the main reconciliation can finish.
- **Idempotent retries** — Pair handlers with memoized functions to guarantee that retried work produces identical outputs (and therefore no extra writes).
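The backoff and dead-letter patterns above can be combined in a few lines. The sketch below is generic Python and does not use CocoIndex's handler API; `TransientError`, `retry_with_backoff`, and `process_all` are hypothetical names, and the delay values are arbitrary.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. HTTP 429 or 5xx)."""


def retry_with_backoff(
    fn: Callable[[], R],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> R:
    """Retry transient failures, doubling the delay after each attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: let the caller decide
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_attempts must be at least 1")


def process_all(
    records: Iterable[T],
    handler: Callable[[T], R],
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[list[R], list[tuple[T, str]]]:
    """Process every record; failures go to a dead-letter list instead of aborting."""
    results: list[R] = []
    dead_letters: list[tuple[T, str]] = []
    for record in records:
        try:
            results.append(retry_with_backoff(lambda: handler(record), sleep=sleep))
        except Exception as exc:
            dead_letters.append((record, repr(exc)))
    return results, dead_letters
```

Passing `sleep` as a parameter keeps the retry logic testable without real waiting. The dead-letter list would be written to a sidecar target in a real pipeline.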

### Memory and rate limiting

The Rust core exposes `ops::ratelimit` and `batching` modules (see the module list in [`rust/py/src/lib.rs`](https://github.com/cocoindex-io/cocoindex/blob/main/rust/py/src/lib.rs)) so pipelines can throttle outbound calls to upstream APIs (LLM providers, embedding services, source APIs). This is especially important in live mode, where a single source event can trigger a burst of downstream work.
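To make the throttling idea concrete, here is a token-bucket sketch in Python. It illustrates the general technique only and is not the `ops::ratelimit` implementation; the `TokenBucket` class and its parameters are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate_per_sec` tokens per second."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False when the caller should wait."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller would check `try_acquire()` before each outbound request and delay or queue the call when it returns `False`. The injectable `clock` exists so the behavior can be verified deterministically.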

### Future-facing work

Several community-tracked items shape the near-term operational roadmap:

- **MCP support** (#160) — Native Model Context Protocol integration would let agents drive CocoIndex flows as tools.
- **Ergonomic Rust SDK** (#1667) — Proc-macro- and `&Ctx`-based idiomatic Rust bindings, currently landing as parallel examples alongside the Python SDK (see [`examples/rust/files_transform/README.md`](https://github.com/cocoindex-io/cocoindex/blob/main/examples/rust/files_transform/README.md) and [`examples/rust/pdf_to_markdown/README.md`](https://github.com/cocoindex-io/cocoindex/blob/main/examples/rust/pdf_to_markdown/README.md)).
- **LanceDB commit optimization** (#1429) — Automatic compaction hooks for high-frequency append workloads.
- **Shadow-run / preview** (#1890) — Read-only reconciliation diffs so operators can review the impact of a pipeline change before committing it.

These items are not part of the current `v1.0.13` release but are good signals for what "operations" will mean in upcoming versions.

---

## See Also

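- [CLI Reference](https://github.com/cocoindex-io/cocoindex/blob/main/docs/src/content/docs/cli.mdx)
- [Live Mode Programming Guide](https://github.com/cocoindex-io/cocoindex/blob/main/docs/src/content/docs/programming_guide/live_mode.mdx)
- [Target State Programming Guide](https://github.com/cocoindex-io/cocoindex/blob/main/docs/src/content/docs/programming_guide/target_state.mdx)
- [Exception Handlers](https://github.com/cocoindex-io/cocoindex/blob/main/docs/src/content/docs/advanced_topics/exception_handlers.mdx)
- [Examples: Text Embedding with LanceDB](https://github.com/cocoindex-io/cocoindex/blob/main/examples/text_embedding_lancedb/README.md)
- [Examples: Files Transform](https://github.com/cocoindex-io/cocoindex/blob/main/examples/files_transform/README.md)
- [Examples: Docs to Knowledge Graph (Neo4j)](https://github.com/cocoindex-io/cocoindex/blob/main/examples/docs_to_knowledge_graph/README.md)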

---


## Pitfall Log

Project: cocoindex-io/cocoindex

Summary: Found 6 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Capability evidence risk - Capability evidence risk requires verification.

## 1. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/cocoindex-io/cocoindex

## 2. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/cocoindex-io/cocoindex

## 3. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/cocoindex-io/cocoindex

## 4. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/cocoindex-io/cocoindex

## 5. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/cocoindex-io/cocoindex

## 6. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/cocoindex-io/cocoindex

