# https://github.com/chunkhound/chunkhound Project Manual

Generated at: 2026-07-02 17:28:00 UTC

## Table of Contents

- [Architecture and System Overview](#page-1)
- [Search, Research, and Code Mapping](#page-2)
- [MCP Integration and Deployment](#page-3)
- [Parsers, Providers, and Extensibility](#page-4)

<a id='page-1'></a>

## Architecture and System Overview

### Related Pages

Related topics: [Search, Research, and Code Mapping](#page-2), [MCP Integration and Deployment](#page-3), [Parsers, Providers, and Extensibility](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/chunkhound/chunkhound/blob/main/README.md)
- [chunkhound/core/models/chunk.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/chunk.py)
- [chunkhound/core/models/file.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/file.py)
- [chunkhound/core/models/embedding.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/embedding.py)
- [chunkhound/core/models/__init__.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/__init__.py)
- [chunkhound/api/cli/commands/research.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/research.py)
- [chunkhound/api/cli/commands/code_mapper.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/code_mapper.py)
- [chunkhound/api/cli/commands/autodoc.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/autodoc.py)
- [chunkhound/api/cli/parsers/research_parser.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/parsers/research_parser.py)
- [chunkhound/api/cli/parsers/code_mapper_parser.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/parsers/code_mapper_parser.py)
- [chunkhound/api/cli/utils/rich_output.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/utils/rich_output.py)
- [chunkhound/api/cli/utils/tree_progress.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/utils/tree_progress.py)
- [operations/README.md](https://github.com/chunkhound/chunkhound/blob/main/operations/README.md)
</details>

# Architecture and System Overview

ChunkHound is positioned as a "code research" tool that augments AI assistants with semantic understanding of a codebase. The README frames the project this way: *"Your AI assistant searches code but doesn't understand it. ChunkHound researches your codebase—extracting architecture, patterns, and institutional knowledge at any scale."* Source: [README.md:1-3](). The same file enumerates the primary capabilities: the [cAST chunking algorithm](https://arxiv.org/pdf/2506.15655), Multi-Hop Semantic Search, regex search (no API key required), support for 32 languages via Tree-sitter plus custom text-based parsers, MCP integration, and real-time indexing with explicit backend selection (`watchdog`, `watchman`, or `polling`). Source: [README.md:11-29]().

This page describes the moving parts that compose those capabilities and how they connect at a system level.

## High-Level System Architecture

The repository follows a layered structure: a thin CLI/API surface, a domain-model core, infrastructure adapters for parsing/embedding/storage, and integration endpoints (MCP and editors). The CLI commands `research`, `code_mapper`, and `autodoc` each act as orchestrators that wire these layers together.

```mermaid
flowchart LR
    subgraph Clients["Clients"]
        MCP["MCP Servers<br/>(Claude, Cursor, Zed, …)"]
        Editor["Editor configs<br/>(.chunkhound.json)"]
    end

    subgraph CLI["ChunkHound CLI"]
        Research["research command"]
        Mapper["code_mapper command"]
        Autodoc["autodoc command"]
    end

    subgraph Core["Domain Core"]
        File["File model"]
        Chunk["Chunk model"]
        Emb["Embedding model"]
    end

    subgraph Infra["Infrastructure"]
        Parse["Tree-sitter parsers"]
        Embed["Embedding providers<br/>(VoyageAI / OpenAI / Ollama)"]
        Store["Persistence layer"]
    end

    MCP --> Research
    Editor --> CLI
    Research --> Chunk
    Research --> Embed
    Mapper --> Chunk
    Mapper --> Embed
    Autodoc --> Mapper
    Chunk --- File
    Embed --- Chunk
    Parse --> Chunk
    Embed --> Store
```

The CLI's progress and rendering utilities (`RichOutputFormatter`, `TreeProgressDisplay`) are shared across commands, keeping terminal behavior consistent. Source: [rich_output.py:1-19](), [tree_progress.py:9-19]().

## Core Domain Models

The heart of the system is three immutable dataclasses, `File`, `Chunk`, and `Embedding`, which [chunkhound/core/models/__init__.py:1-26]() re-exports together with `EmbeddingResult`.

| Model | Purpose | Key Fields | Defined In |
|---|---|---|---|
| `File` | Metadata for an indexed source file | `path`, `mtime`, `language`, `size_bytes`, `id`, `content_hash` | [file.py:1-50]() |
| `Chunk` | A semantically meaningful slice of a file | `symbol`, `start_line`, `end_line`, `code`, `chunk_type`, `file_id`, `language`, `start_byte`, `end_byte`, `metadata` | [chunk.py:1-79]() |
| `Embedding` | A vector embedding of a chunk | `chunk_id`, `provider`, `model`, `dims`, `vector`, `created_at` | [embedding.py:1-44]() |

`Chunk` distinguishes code from documentation via `is_code_chunk()` / `is_documentation_chunk()` helpers and exposes structural predicates such as `contains_line()`, `overlaps_with()`, and size checks (`is_small_chunk`, `is_large_chunk`). Source: [chunk.py:50-110](). The models validate themselves in `__post_init__`, raising `ValidationError` for invariants like positive `start_line` / `end_line`. Source: [chunk.py:60-78](), [embedding.py:38-44](). This validation-in-construction pattern means callers can rely on the core objects being well-formed the moment they are created.
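The validation-in-construction pattern can be illustrated with a minimal sketch. `ChunkSketch` and the local `ValidationError` are simplified, hypothetical stand-ins rather than the project's real classes, and the rule that `end_line` may not precede `start_line` is an assumption of this example.

```python
from dataclasses import dataclass


class ValidationError(ValueError):
    """Stand-in for the project's validation error type (assumed here)."""


@dataclass(frozen=True)
class ChunkSketch:
    """Simplified, hypothetical stand-in for the real Chunk model."""

    symbol: str
    start_line: int
    end_line: int
    code: str

    def __post_init__(self) -> None:
        # Reject malformed ranges at construction time.
        if self.start_line < 1 or self.end_line < 1:
            raise ValidationError("line numbers must be positive")
        # Assumed invariant for this sketch: ranges are not inverted.
        if self.end_line < self.start_line:
            raise ValidationError("end_line must not precede start_line")

    def contains_line(self, line: int) -> bool:
        """True when the 1-based line falls inside the chunk's range."""
        return self.start_line <= line <= self.end_line

    def overlaps_with(self, other: "ChunkSketch") -> bool:
        """True when the two inclusive line ranges share at least one line."""
        return self.start_line <= other.end_line and other.start_line <= self.end_line
```

Because the dataclass is frozen and validated in `__post_init__`, any instance a caller holds has already passed its invariants and cannot be mutated into an invalid state afterwards.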

## CLI, Research, and Code-Mapping Pipelines

The CLI exposes three top-level operations, each a long-running pipeline that streams progress through the tree-based display.

- **Research** — wires `EmbeddingManager` and an optional `LLMManager` to the deep-research implementation; missing or invalid embedding configuration is surfaced early with explicit remediation messages. Source: [research.py:21-59](). Its parser accepts git diff/commit-range arguments plus common and config-specific argument groups (database, embedding, llm, research). Source: [research_parser.py:1-19]().
- **Code Mapper** — runs a two-phase pipeline: a shallow research call that plans Points of Interest (the count keyed off a comprehensiveness setting), then a dedicated deep-research pass per PoI that is assembled into a single document. Source: [code_mapper.py:1-15](). The orchestrator supports concurrency via a `-j/--jobs` argument with env-var fallback to `CH_CODE_MAPPER_*`. Source: [code_mapper_parser.py:22-78](). Output directory handling warns when targeting a git-tracked path. Source: [autodoc.py:31-38]().
- **AutoDoc** — generates an Astro documentation site from Code Mapper outputs, with prompts/overrides for non-interactive runs (e.g. `--force`, `--assets-only`). Source: [autodoc.py:13-38](), [operations/README.md:1-19]().

Both `research` and `code_mapper` rely on `verify_database_exists` before proceeding, mirroring the database-as-prerequisite architecture used by the indexer. Source: [research.py:1-19](), [code_mapper.py:1-15]().

## Integration Points and Known Constraints

ChunkHound is editor- and assistant-agnostic at the integration boundary. The README lists MCP integrations with Claude, VS Code, Cursor, Windsurf, and Zed, and the hero terminal demo on the marketing site previews a `chunkhound research "authentication architecture"` invocation that emits a synthesized report. Source: [README.md:25-27](), [hero-terminal.ts:1-13](). Embedding and LLM providers are pluggable: VoyageAI (recommended), OpenAI, and local Ollama for embeddings; Claude Code CLI, Codex CLI, Anthropic, OpenAI, and Grok for LLMs. Source: [README.md:33-39]().

Several community discussions highlight the boundaries of this plug-in architecture and inform the design choices visible in the source:

- **Concurrent MCP instances** — Issue #53 reports that running multiple Claude Code windows with stdio-mode chunkhound MCP crashes, while HTTP-mode semantic search returns empty. The parser and command layers currently assume single-tenant process state, which constrains concurrent editors to a single configurable transport.
- **Embedding model compatibility** — Issue #41 (Ollama `dengcao/Qwen3-Embedding-8B:Q5_K_M`) flows through the `EmbeddingProviderFactory` used by `setup_embedding_llm`. Source: [research.py:26-32](). Custom or quantized embedding models require provider-level support; the abstraction is in place, but coverage is defined per-provider.
- **Language coverage** — Ruby support is requested in #35, consistent with the language list bundled with the parser registry; new languages require both a parser hookup and embedding-model mapping, since the `Chunk` model carries the resolved `language` enum into storage. Source: [chunk.py:42-50]().
- **New LLM providers** — OpenCode (Issue #113) has been requested as a CLI-flavored provider alongside Claude Code; the LLM manager pattern is symmetric to the embedding-manager pattern in `research.py`, so adapter additions slot in at that seam. Source: [research.py:33-58]().
- **Worktree efficiency** — Issue #83 reports successful use inside monorepo worktrees; the architecture supports this via real-time indexers with explicit `watchdog`/`watchman`/`polling` backend selection. Source: [README.md:25-29]().

According to the v5.1.0 release notes, MCP `search` responses now return lean markdown with similarity percentages instead of JSON, which reduces token overhead for downstream models. This is a refinement to the same integration layer.

## See Also

- [Configuration Guide](https://chunkhound.ai/docs/configuration/)
- [MCP Protocol Specification](https://spec.modelcontextprotocol.io/)
- Issue [#53 — Concurrent MCP instances](https://github.com/chunkhound/chunkhound/issues/53)
- Issue [#41 — Embedding model compatibility](https://github.com/chunkhound/chunkhound/issues/41)
- Issue [#83 — Worktree support](https://github.com/chunkhound/chunkhound/issues/83)

---

<a id='page-2'></a>

## Search, Research, and Code Mapping

### Related Pages

Related topics: [Architecture and System Overview](#page-1), [MCP Integration and Deployment](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [chunkhound/core/models/chunk.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/chunk.py)
- [chunkhound/core/models/file.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/file.py)
- [chunkhound/core/models/embedding.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/embedding.py)
- [chunkhound/core/models/__init__.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/__init__.py)
- [chunkhound/api/cli/parsers/research_parser.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/parsers/research_parser.py)
- [chunkhound/api/cli/parsers/code_mapper_parser.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/parsers/code_mapper_parser.py)
- [chunkhound/api/cli/commands/research.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/research.py)
- [chunkhound/api/cli/commands/code_mapper.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/code_mapper.py)
- [chunkhound/api/cli/commands/autodoc.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/autodoc.py)
- [chunkhound/api/cli/utils/tree_progress.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/utils/tree_progress.py)
- [chunkhound/api/cli/utils/rich_output.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/utils/rich_output.py)
- [README.md](https://github.com/chunkhound/chunkhound/blob/main/README.md)
- [operations/README.md](https://github.com/chunkhound/chunkhound/blob/main/operations/README.md)
</details>

# Search, Research, and Code Mapping

ChunkHound exposes three progressively richer ways of querying an indexed codebase: lightweight **search**, multi-step **research**, and architectural **code mapping**. All three share the same domain models (`File`, `Chunk`, `Embedding`) and the same database backing, but they differ in depth, output shape, and the infrastructure they depend on.

## Domain Models Backing Every Query

Every query path ultimately operates on the immutable domain models declared in `chunkhound/core/models/`. A `File` records path, mtime, language, size, and a `content_hash` used for change detection [Source: [chunkhound/core/models/file.py:18-39](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/file.py)]. A `Chunk` carries `symbol`, `start_line`, `end_line`, `code`, `chunk_type`, `language`, optional byte offsets, and language-specific `metadata`; helpers such as `contains_line()`, `overlaps_with()`, and `is_code` make range checks trivial [Source: [chunkhound/core/models/chunk.py:39-110](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/chunk.py)]. `Embedding` couples a chunk to a provider, model, dimension count, and vector, with validation enforcing non-empty provider and model strings [Source: [chunkhound/core/models/embedding.py:25-52](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/embedding.py)]. These three models are re-exported from `chunkhound.core.models` for typed use across the system [Source: [chunkhound/core/models/__init__.py:23-30](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/__init__.py)].
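As a rough illustration of how a `content_hash` can drive change detection, the sketch below compares a stored digest with a freshly computed one. The function names are hypothetical, and the choice of SHA-256 is an assumption of this example; the hashing scheme ChunkHound actually uses is not stated in the sources cited here.

```python
import hashlib
from typing import Optional


def content_hash(data: bytes) -> str:
    """Hex digest of file content (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(data).hexdigest()


def needs_reindex(stored_hash: Optional[str], current: bytes) -> bool:
    """True when no hash is stored yet or the content no longer matches it."""
    return stored_hash is None or stored_hash != content_hash(current)
```

A hash comparison of this kind catches real content changes even when `mtime` is unreliable, for example after a checkout that rewrites timestamps without touching content.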

## Search

Search is the entry point advertised in the README: regex lookup works without any API key, while semantic lookup requires an embedding provider such as VoyageAI, OpenAI, or local Ollama [Source: [README.md:31-39](https://github.com/chunkhound/chunkhound/blob/main/README.md)]. As of v5.1.0, MCP `search` responses are returned as lean markdown—syntax-highlighted code fences with similarity percentages—instead of verbose JSON, which significantly reduces token usage on the client side [Source: [README.md](https://github.com/chunkhound/chunkhound/blob/main/README.md)]. Search relies on the `File`/`Chunk` model pair and on `Embedding` when a semantic query is requested; community report #53 notes that running multiple MCP clients against the same database can produce empty semantic results, a known limitation around concurrent access rather than a defect in the `Embedding` model itself.
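To make the "lean markdown" idea concrete, here is a small sketch that renders one search hit as a header line plus a fenced code block. The function name and the header layout are hypothetical; the exact format of real MCP `search` responses is not specified in the sources cited here.

```python
def render_hit(path: str, language: str, code: str, similarity: float) -> str:
    """Render one hit as a header line followed by a fenced code block.

    `similarity` is expected in the range 0.0 to 1.0 and is shown as a percentage.
    """
    fence = "`" * 3  # built at runtime so this example nests cleanly in markdown
    header = f"{path} ({similarity:.0%} similar)"
    return f"{header}\n{fence}{language}\n{code}\n{fence}"
```

Compared with a JSON payload, a rendering like this drops key names, quoting, and escape sequences, which is where most of the token savings would come from.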

## Research

Research elevates search from "find code matching X" to "explain how the codebase does X." The CLI subparser is registered by `add_research_subparser`, which attaches git diff/commit-range arguments, common arguments, and database/embedding/LLM/research configuration groups [Source: [chunkhound/api/cli/parsers/research_parser.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/parsers/research_parser.py)]. At runtime, `research_command` constructs an `EmbeddingManager`, registers a provider via `EmbeddingProviderFactory`, and conditionally builds an `LLMManager` before delegating to `deep_research_impl` [Source: [chunkhound/api/cli/commands/research.py:26-58](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/research.py)].

Long-running research sessions are visualized with a hierarchical `TreeProgressDisplay` that streams `ProgressEvent` records (node_start, search_semantic, llm_call, node_complete, etc.) using box-drawing prefixes and relative timestamps [Source: [chunkhound/api/cli/utils/tree_progress.py:24-87](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/utils/tree_progress.py)]. Pretty output is rendered by `RichOutputFormatter`, which detects terminal capability and supports a quiet mode (`CHUNKHOUND_QUICKRESEARCH_QUIET`) that redirects console output to stderr so captured payloads stay clean [Source: [chunkhound/api/cli/utils/rich_output.py:20-42](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/utils/rich_output.py)]. Community issue #113 requests OpenCode as a new LLM provider—currently the system supports Claude Code CLI, Codex CLI, Anthropic, OpenAI, and xAI Grok [Source: [README.md:36-39](https://github.com/chunkhound/chunkhound/blob/main/README.md)]—and the embedding provider list has known gaps (e.g., Ollama's `dengcao/Qwen3-Embedding-8B:Q5_K_M` per issue #41).
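The tree-style progress output can be sketched as follows. `ProgressEventSketch` and `render_event` are hypothetical simplifications; the real `ProgressEvent` fields and the exact prefix characters used by `TreeProgressDisplay` may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProgressEventSketch:
    """Hypothetical, reduced form of a progress event."""

    kind: str         # e.g. "node_start", "search_semantic", "llm_call"
    message: str
    depth: int        # nesting level in the research tree
    timestamp: float  # seconds, on the same clock as the session start


def render_event(event: ProgressEventSketch, started_at: float) -> str:
    """Format one event with a relative timestamp and a box-drawing prefix."""
    elapsed = event.timestamp - started_at
    prefix = "│  " * event.depth + "├─ "
    return f"[+{elapsed:5.1f}s] {prefix}{event.kind}: {event.message}"
```

Relative timestamps keep the log readable across long sessions, and the depth-based prefix lets a flat stream of events be read as a hierarchy.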

## Code Mapping

Code mapping produces scoped architectural/operational documentation for a directory. The `add_map_subparser` defines a mandatory `--out` directory, an optional `--combined` markdown flag (falling back to `CH_CODE_MAPPER_WRITE_COMBINED=1`), a `-j/--jobs` concurrency cap, and prompt-only/plan-only modes [Source: [chunkhound/api/cli/parsers/code_mapper_parser.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/parsers/code_mapper_parser.py)]. The `code_mapper_command` runs a two-phase pipeline: a shallow deep-research pass to identify points of interest, then a dedicated deep-research pass per POI, with results assembled into per-topic markdown files plus an optional combined document [Source: [chunkhound/api/cli/commands/code_mapper.py:11-19](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/code_mapper.py)]. Coverage statistics—referenced vs. unreferenced files—are computed via `compute_unreferenced_scope_files` and embedded in the generation stats [Source: [chunkhound/api/cli/commands/code_mapper.py:23-30](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/code_mapper.py)].
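The two-phase shape described above, with a concurrency cap on the second phase, can be sketched with `asyncio`. The names `map_scope`, `plan`, and `research` are hypothetical; this shows the general technique under stated assumptions and is not the project's orchestrator.

```python
import asyncio
from typing import Awaitable, Callable, List


async def map_scope(
    plan: Callable[[], Awaitable[List[str]]],
    research: Callable[[str], Awaitable[str]],
    jobs: int,
) -> List[str]:
    """Phase 1 plans points of interest; phase 2 researches each one.

    At most `jobs` research calls run at the same time.
    """
    pois = await plan()                     # phase 1: shallow planning pass
    gate = asyncio.Semaphore(max(1, jobs))  # cap on concurrent phase-2 calls

    async def run_one(poi: str) -> str:
        async with gate:
            return await research(poi)      # phase 2: one deep pass per POI

    # gather() returns results in the order the POIs were planned.
    return list(await asyncio.gather(*(run_one(p) for p in pois)))
```

A semaphore bounds the number of in-flight calls without serializing them, which matters when each research pass issues rate-limited LLM or embedding requests.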

Configuration flows from `.chunkhound.json` and `CHUNKHOUND_*` environment variables; workspace overrides let multiple projects share a single index under `/workspaces/.chunkhound.json` [Source: [operations/README.md](https://github.com/chunkhound/chunkhound/blob/main/operations/README.md)]. A dedicated HyDE planning provider/model/effort can be set via `map_hyde_*` keys or `CHUNKHOUND_LLM_MAP_HYDE_*` env vars; if unset, Code Mapper falls back to the synthesis provider [Source: [operations/README.md](https://github.com/chunkhound/chunkhound/blob/main/operations/README.md)]. Output is then handed to the `autodoc` command, which writes Astro docs assets to the configured output directory and can be re-run with `--assets-only` to refresh site assets without regenerating content [Source: [chunkhound/api/cli/commands/autodoc.py:24-48](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/autodoc.py)].

## End-to-End Flow

```mermaid
flowchart LR
    Repo[Source Files] --> Index[chunkhound index]
    Index --> DB[(ChunkHound DB<br/>File / Chunk / Embedding)]
    DB --> Search[search<br/>regex + semantic]
    DB --> Research[research<br/>deep_research_impl]
    DB --> Mapper[map<br/>2-phase POI pipeline]
    Research --> Out1[Markdown report + tree progress]
    Mapper --> Out2[Per-topic + combined markdown]
    Out2 --> AutoDoc[autodoc → Astro site]
```

## See Also

- [README.md](https://github.com/chunkhound/chunkhound/blob/main/README.md) — overview, feature list, installation
- [operations/README.md](https://github.com/chunkhound/chunkhound/blob/main/operations/README.md) — operational CLI reference (research, map, autodoc)
- ChunkHound documentation site: [chunkhound.ai](https://chunkhound.ai)
- Community discussion on concurrent MCP access: issue #53
- Community discussion on embedding provider coverage: issue #41
- Community request for OpenCode LLM provider: issue #113

---

<a id='page-3'></a>

## MCP Integration and Deployment

### Related Pages

Related topics: [Architecture and System Overview](#page-1), [Search, Research, and Code Mapping](#page-2), [Parsers, Providers, and Extensibility](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [chunkhound/api/cli/commands/mcp.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/mcp.py)
- [chunkhound/api/cli/parsers/mcp_parser.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/parsers/mcp_parser.py)
- [chunkhound/api/cli/utils/config_factory.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/utils/config_factory.py)
- [chunkhound/api/cli/utils/rich_output.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/utils/rich_output.py)
- [chunkhound/core/config/mcp_config.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/config/mcp_config.py)
- [chunkhound/core/models/__init__.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/__init__.py)
- [chunkhound/core/models/chunk.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/chunk.py)
- [chunkhound/core/models/file.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/file.py)
- [chunkhound/core/models/embedding.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/embedding.py)
- [README.md](https://github.com/chunkhound/chunkhound/blob/main/README.md)
- [site/src/components/configurator-data.ts](https://github.com/chunkhound/chunkhound/blob/main/site/src/components/configurator-data.ts)
</details>

# MCP Integration and Deployment

## Overview

ChunkHound exposes its codebase intelligence (regex search, semantic search, file/chunk/embedding queries, and LLM-driven research) through the [Model Context Protocol (MCP)](https://spec.modelcontextprotocol.io/), allowing AI editors such as Claude Code, VS Code, Cursor, Windsurf, Zed, and Roo to call ChunkHound as a tool provider ([README.md](https://github.com/chunkhound/chunkhound/blob/main/README.md)).

The MCP surface is launched via the `chunkhound mcp` subcommand. Source: [chunkhound/api/cli/commands/mcp.py:1-10](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/mcp.py). The implementation is configurable across three orthogonal axes:

- **Transport**: stdio (default) vs. daemon-coordinated IPC.
- **Coordination**: single-process vs. multi-client daemon.
- **Write mode**: read-write vs. read-only (`--read-only` / `database.read_only`).

These axes are decoded in [`chunkhound/api/cli/commands/mcp.py`](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/mcp.py) where the entry point sets `CHUNKHOUND_MCP_MODE=1`, pre-imports `numpy` for DuckDB threading safety, and decides whether to route through the `ClientProxy` daemon or stay in-process.

## Runtime Modes and Transport Selection

The MCP CLI's run mode is selected by CLI flags, JSON configuration, and environment variables. The table below lists each setting and its effect:

| Flag / Setting | Behavior |
|---|---|
| `chunkhound mcp` (default) | Spawns the daemon or attaches to a running one, allowing multiple MCP clients to share one indexer |
| `--stdio` | Forces legacy single-process stdio mode (backwards compatible) |
| `--no-daemon` | Disables daemonization; runs in-process |
| `--read-only` (or `database.read_only=true`) | Forces single-process stdio, disables watcher/indexing (DuckDB only) |
| `CHUNKHOUND_DAEMON_MODE=false` | Environment override equivalent to `--no-daemon` |

Source: [chunkhound/api/cli/commands/mcp.py:30-60](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/mcp.py).

The relationship between these flags can be visualized as:

```mermaid
flowchart TD
    A["chunkhound mcp invoked"] --> B{"read_only or --stdio?"}
    B -- yes --> S["Single-process stdio server<br/>(StdioMCPServer)"]
    B -- no --> C{"--no-daemon or<br/>CHUNKHOUND_DAEMON_MODE=false?"}
    C -- yes --> S
    C -- no --> D["Route via ClientProxy<br/>to daemon"]
    D --> E["Shared daemon process<br/>(Unix socket / TCP)"]
    E --> F["Multiple MCP clients<br/>share one indexer"]
```

When read-only mode is requested, the CLI warns that it forces the single-process path even though daemon coordination would otherwise apply: "read-only mode forces single-process stdio (daemon coordination is for writers)." Source: [chunkhound/api/cli/commands/mcp.py:45-55](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/mcp.py). The `--show-setup` flag short-circuits the start-up path: it displays per-editor configuration snippets and exits. Source: [chunkhound/api/cli/commands/mcp.py:18-22](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/mcp.py).
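The decision flow in the diagram can be written as a small pure function. `RunMode` and `select_run_mode` are hypothetical names, and the way the `CHUNKHOUND_DAEMON_MODE` value is parsed here (a case-insensitive comparison with `false`) is an assumption of this sketch.

```python
from enum import Enum
from typing import Optional


class RunMode(Enum):
    STDIO = "single-process stdio"
    DAEMON = "daemon via ClientProxy"


def select_run_mode(
    read_only: bool,
    stdio: bool,
    no_daemon: bool,
    daemon_env: Optional[str],
) -> RunMode:
    """Mirror the flowchart: read-only and --stdio win, then the daemon opt-outs."""
    if read_only or stdio:
        return RunMode.STDIO
    # Assumed parsing: only the literal value "false" disables the daemon.
    env_opt_out = daemon_env is not None and daemon_env.strip().lower() == "false"
    if no_daemon or env_opt_out:
        return RunMode.STDIO
    return RunMode.DAEMON
```

Keeping the decision in one side-effect-free function makes each combination of flags easy to test without starting a server.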

## Configuration Surface

MCP behavior is composed from three layers: JSON config (`.chunkhound.json`), environment variables, and CLI flags. The unified factory in [`chunkhound/api/cli/utils/config_factory.py`](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/utils/config_factory.py) builds and validates a `Config` instance for the `mcp` command, returning any validation errors. When validation fails, `_fallback_config` constructs a minimal `Config` so the server can still start. Source: [chunkhound/api/cli/utils/config_factory.py:18-50](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/utils/config_factory.py).
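The "validate, report errors, fall back" behavior can be sketched as below. `ConfigSketch`, `build_config`, `fallback_config`, and the two fields are hypothetical; the real `Config` class has a much larger surface, and its defaults are not taken from the sources cited here.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class ConfigSketch:
    """Hypothetical two-field stand-in for the real Config object."""

    database_path: str
    read_only: bool = False


def fallback_config() -> ConfigSketch:
    """Minimal config so the server can still start (default path is invented)."""
    return ConfigSketch(database_path=".chunkhound/db")


def build_config(raw: Dict[str, Any]) -> Tuple[ConfigSketch, List[str]]:
    """Return a validated config plus any validation errors.

    When validation fails, the fallback config is returned alongside the errors.
    """
    errors: List[str] = []
    path = raw.get("database_path")
    if not isinstance(path, str) or not path:
        errors.append("database_path must be a non-empty string")
    read_only = raw.get("read_only", False)
    if not isinstance(read_only, bool):
        errors.append("read_only must be a boolean")
    if errors:
        return fallback_config(), errors
    return ConfigSketch(database_path=path, read_only=read_only), errors
```

Returning the errors instead of raising lets the caller decide whether to log them and continue with the fallback, which matches the "server can still start" behavior described above.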

The MCP command parser exposes the `path` positional argument (default: `.`), the daemon/read-only toggles, and delegates standard `--database`, `--embedding`, `--indexing`, `--llm`, and `--mcp` argument groups. Source: [chunkhound/api/cli/parsers/mcp_parser.py:14-60](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/parsers/mcp_parser.py).

Key runtime configuration values (defined under `database`, `embedding`, `llm`, and `mcp` blocks) propagate into the MCP server. The `--read-only` flag sets `database.read_only`, and the daemon-mode decision also reads `CHUNKHOUND_DAEMON_MODE`. The shared MCP mode flag `CHUNKHOUND_MCP_MODE=1` is set as the very first action of `mcp_command`. Source: [chunkhound/api/cli/commands/mcp.py:23-25](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/mcp.py).

## Domain Models Exposed Over MCP

The MCP tools return the same immutable domain models used throughout ChunkHound. Three models back nearly every response payload:

- **File** — path, mtime, language, size, content hash, and timestamps. Source: [chunkhound/core/models/file.py:30-50](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/file.py).
- **Chunk** — symbol, line/byte ranges, code, chunk_type, language, optional `parent_header`, and metadata. Source: [chunkhound/core/models/chunk.py:30-60](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/chunk.py).
- **Embedding** — chunk_id, provider, model, dimensions, vector, and creation timestamp. Source: [chunkhound/core/models/embedding.py:30-50](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/embedding.py).

All three are exported via `chunkhound/core/models/__init__.py` and are constructed as frozen dataclasses. Source: [chunkhound/core/models/__init__.py:20-30](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/__init__.py). Starting in v5.1.0, `search` MCP responses are emitted as lean, syntax-highlighted Markdown fences with similarity percentages instead of verbose JSON, reducing token cost for LLM clients.

## Deployment: Editor Integration

The marketing site ships a "configurator" that materializes per-editor MCP snippets. Source: [site/src/components/configurator-data.ts:1-50](https://github.com/chunkhound/chunkhound/blob/main/site/src/components/configurator-data.ts). Each editor expects a different JSON file location and shape, but all invoke `chunkhound mcp` (optionally with `--stdio` for single-client setups). Project-local files (`.mcp.json`, `.vscode/mcp.json`) do not need a path argument; global configs (`~/.claude/`) require an absolute path. Source: [chunkhound/api/cli/commands/mcp.py:90-120](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/mcp.py).
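As an illustration of the project-local versus global distinction, the sketch below builds one server entry in the `mcpServers` shape commonly used by `.mcp.json` files. The helper name and the server key `chunkhound` are hypothetical, and other editors may expect a different top-level key or file layout, so treat this as an example rather than the configurator's actual output.

```python
import json
from pathlib import PurePosixPath
from typing import Any, Dict, Optional


def mcp_server_entry(project_path: Optional[str] = None, stdio: bool = False) -> Dict[str, Any]:
    """Build a config fragment that launches `chunkhound mcp`.

    Project-local configs can omit the path; global configs pass an absolute one.
    """
    args = ["mcp"]
    if project_path is not None:
        if not PurePosixPath(project_path).is_absolute():
            raise ValueError("global configs need an absolute project path")
        args.append(project_path)
    if stdio:
        args.append("--stdio")  # single-client setups
    return {"mcpServers": {"chunkhound": {"command": "chunkhound", "args": args}}}


if __name__ == "__main__":
    # Project-local variant: no path argument is needed.
    print(json.dumps(mcp_server_entry(), indent=2))
```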

The CLI can also print these instructions interactively via `chunkhound mcp --show-setup` and copy the Claude Code variant to the clipboard through `pyperclip` if installed. Source: [chunkhound/api/cli/commands/mcp.py:100-130](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/mcp.py). For Python consumers, ChunkHound is installed as a `uv tool`:

```bash
uv tool install chunkhound
chunkhound mcp
```

Source: [README.md:55-65](https://github.com/chunkhound/chunkhound/blob/main/README.md).

## Common Failure Modes

Three patterns are worth noting during deployment; the first two were surfaced by the community:

1. **Multiple concurrent stdio clients on the same project.** Running several Claude Code windows each with `chunkhound mcp` in stdio mode against a shared directory causes crashes, because each stdio instance tries to acquire exclusive DuckDB locks. The daemon mode is the supported mitigation, but HTTP semantic search across instances has historically returned empty results. Source: community issue #53. The implementation distinguishes these cases by routing through `ClientProxy` whenever the daemon path is enabled. Source: [chunkhound/api/cli/commands/mcp.py:35-55](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/mcp.py).

2. **Embedding provider incompatibility.** Embedding features silently degrade if the configured provider/model pair is not supported. For example, `dengcao/Qwen3-Embedding-8B:Q5_K_M` on Ollama has been reported as non-functional (community issue #41). Because `chunkhound mcp` still starts in this case, users should check `embeddings_disabled` propagation from `--no-embeddings` and the validated `Config` returned by the factory. Source: [chunkhound/api/cli/utils/config_factory.py:45-55](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/utils/config_factory.py).

3. **Read-only mode requires single-process.** Setting `database.read_only=true` (or `--read-only`) silently disables the daemon; users expecting to share an index across clients will find that no indexing happens. The CLI prints a warning to make this explicit. Source: [chunkhound/api/cli/commands/mcp.py:45-55](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/mcp.py).

## See Also

- [Configuration Guide](https://chunkhound.ai/docs/configuration/)
- [Model Context Protocol specification](https://spec.modelcontextprotocol.io/)
- Operations notes: [operations/README.md](https://github.com/chunkhound/chunkhound/blob/main/operations/README.md)
- CLI research command surface: [chunkhound/api/cli/parsers/research_parser.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/parsers/research_parser.py)
- AutoDoc site generator: [chunkhound/api/cli/commands/autodoc.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/autodoc.py)

---

<a id='page-4'></a>

## Parsers, Providers, and Extensibility

### Related Pages

Related topics: [Architecture and System Overview](#page-1), [MCP Integration and Deployment](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [chunkhound/parsers/universal_parser.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/parsers/universal_parser.py)
- [chunkhound/parsers/parser_factory.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/parsers/parser_factory.py)
- [chunkhound/providers/embeddings/openai_provider.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/providers/embeddings/openai_provider.py)
- [chunkhound/providers/llm/claude_code_cli_provider.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/providers/llm/claude_code_cli_provider.py)
- [chunkhound/providers/llm/opencode_cli_provider.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/providers/llm/opencode_cli_provider.py)
- [chunkhound/api/cli/parsers/main_parser.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/parsers/main_parser.py)
- [chunkhound/api/cli/parsers/research_parser.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/parsers/research_parser.py)
- [chunkhound/api/cli/parsers/code_mapper_parser.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/parsers/code_mapper_parser.py)
- [chunkhound/api/cli/commands/research.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/research.py)
- [chunkhound/api/cli/utils/rich_output.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/utils/rich_output.py)
- [chunkhound/api/cli/utils/tree_progress.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/utils/tree_progress.py)
- [chunkhound/core/models/__init__.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/__init__.py)
- [chunkhound/core/models/chunk.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/chunk.py)
- [chunkhound/core/models/embedding.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/embedding.py)
</details>

ChunkHound is built around three orthogonal extension points: **parsers** (turning source files into semantic `Chunk` records), **embedding providers** (vectorizing chunks for semantic search), and **LLM providers** (driving deep-research synthesis). Each follows a factory-driven, plug-in style so the system can grow without rewiring the CLI, MCP server, or persistence layer. This page documents the contracts you implement when adding a new parser, embedding backend, or LLM backend, and links each extension point to the surrounding wiring.

## 1. The Parser Subsystem

Parsers convert raw file bytes into typed `Chunk` objects that the indexer persists. The canonical entry point is `chunkhound/parsers/universal_parser.py`, which dispatches to language-specific implementations based on the `File.language` attribute resolved by `chunkhound/core/detection.py`. Selection is centralized in `chunkhound/parsers/parser_factory.py`, so adding a language means registering a new parser class with the factory rather than touching call sites. Source: [chunkhound/parsers/parser_factory.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/parsers/parser_factory.py).

The output contract is the frozen `Chunk` dataclass in `chunkhound/core/models/chunk.py`, which carries `symbol`, `start_line`, `end_line`, `code`, `chunk_type`, `file_id`, `language`, byte offsets, timestamps, and a free-form `metadata` dict for language-specific properties such as visibility or mutability. Validators run in `__post_init__` and raise `ValidationError` for invalid line numbers or empty symbols, which the indexer surfaces as recoverable errors. Source: [chunkhound/core/models/chunk.py:1-120](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/chunk.py).
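
The validation behavior described above can be illustrated with a minimal sketch. `ChunkSketch` and the local `ValidationError` are hypothetical stand-ins, not the project's actual classes; the field subset and the exact rules (line numbers starting at 1, `end_line >= start_line`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


class ValidationError(ValueError):
    """Hypothetical stand-in for the project's validation error type."""


@dataclass(frozen=True)
class ChunkSketch:
    """Simplified, hypothetical analogue of a frozen chunk record."""

    symbol: str
    start_line: int
    end_line: int
    code: str
    chunk_type: str
    file_id: int
    language: str
    # Free-form escape hatch for language-specific properties.
    metadata: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Frozen dataclasses still run __post_init__, so invariants can be
        # checked once, at construction time.
        if not self.symbol.strip():
            raise ValidationError("symbol must not be empty")
        if self.start_line < 1:
            raise ValidationError("start_line must be >= 1")
        if self.end_line < self.start_line:
            raise ValidationError("end_line must be >= start_line")


chunk = ChunkSketch(
    symbol="parse_file",
    start_line=10,
    end_line=42,
    code="def parse_file(): ...",
    chunk_type="function",
    file_id=1,
    language="python",
    metadata={"visibility": "public"},
)
```

Because the record is frozen, a parser cannot patch a chunk after the fact; it has to construct a valid one up front, which is what lets downstream code trust the invariants.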

Community requests frequently target this layer: issue #35 ("Ruby Support") asks for a new language parser, and issue #83 ("Efficient Worktree Support?") leans on parser change-detection to skip unchanged content quickly. Adding a parser therefore typically means (a) implementing the parser class, (b) extending `parser_factory`, and (c) ensuring the `File.language` detector recognizes the new extension.
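
The three steps above might look roughly like the following sketch. Everything here is hypothetical: `ParserRegistrySketch`, `RubyParserSketch`, and the `register` signature are illustrative names, the regex-based "parser" is a toy, and the real factory and detector may be organized quite differently.

```python
import re
from pathlib import PurePosixPath
from typing import Callable, Protocol


class ParserSketch(Protocol):
    def parse(self, text: str) -> list[dict]: ...


class ParserRegistrySketch:
    """Hypothetical registry mapping extensions to languages to parsers."""

    def __init__(self) -> None:
        self._language_by_extension: dict[str, str] = {}
        self._factory_by_language: dict[str, Callable[[], ParserSketch]] = {}

    def register(
        self,
        language: str,
        extensions: list[str],
        factory: Callable[[], ParserSketch],
    ) -> None:
        # Steps (b) and (c): one call wires up both the parser and detection.
        self._factory_by_language[language] = factory
        for extension in extensions:
            self._language_by_extension[extension.lower()] = language

    def parser_for(self, path: str) -> ParserSketch:
        extension = PurePosixPath(path).suffix.lower()
        language = self._language_by_extension.get(extension)
        if language is None:
            raise ValueError(f"no parser registered for {extension!r}")
        return self._factory_by_language[language]()


class RubyParserSketch:
    """Step (a): a toy parser that only finds `def` lines."""

    _DEF = re.compile(r"^[ \t]*def[ \t]+([A-Za-z_]\w*[?!=]?)", re.MULTILINE)

    def parse(self, text: str) -> list[dict]:
        return [
            {
                "symbol": match.group(1),
                # 1-based line number of the method name.
                "start_line": text.count("\n", 0, match.start(1)) + 1,
            }
            for match in self._DEF.finditer(text)
        ]


registry = ParserRegistrySketch()
registry.register("ruby", [".rb"], RubyParserSketch)

source = "class User\n  def name\n  end\n\n  def admin?\n  end\nend\n"
symbols = registry.parser_for("app/models/user.rb").parse(source)
```

Call sites only ever ask the registry for a parser by path, so adding a language does not change them.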

## 2. Embedding Providers

Embedding providers implement the contract satisfied by `chunkhound/providers/embeddings/openai_provider.py` and its peers. They are wired through `EmbeddingProviderFactory` and registered with `EmbeddingManager` at CLI startup. The CLI research command in `chunkhound/api/cli/commands/research.py` illustrates the pattern:

```python
provider = EmbeddingProviderFactory.create_provider(config.embedding)
embedding_manager.register_provider(provider, set_default=True)
```

Source: [chunkhound/api/cli/commands/research.py:21-45](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/research.py). Failures during provider construction are translated into friendly stderr messages via `RichOutputFormatter`, so a misconfigured provider does not silently disable semantic search.

The persistence-side counterpart is the `Embedding` model in `chunkhound/core/models/embedding.py`, which stores `chunk_id`, `provider`, `model`, `dims`, and the raw `vector`. Validators reject empty provider names and dimension/vector mismatches, so providers must keep `dims` consistent with the returned vector length. Source: [chunkhound/core/models/embedding.py:1-60](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/embedding.py).

Issue #41 ("dengcao/Qwen3-Embedding-8B:Q5_K_M model doesn't support embedding") demonstrates a recurring integration problem: local Ollama models sometimes return payloads in a shape the OpenAI-compatible client cannot parse, so providers typically need defensive response parsing or a thin shim that normalizes Ollama output before it reaches the vector store.
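
One possible form of such a shim is sketched below. `extract_vectors` is a hypothetical helper, not part of ChunkHound, and the three response shapes it accepts are assumptions about what OpenAI-compatible and Ollama-style endpoints may return; the shape that actually triggered issue #41 is not documented here.

```python
from typing import Any


def extract_vectors(payload: dict[str, Any]) -> list[list[float]]:
    """Return embedding vectors from several plausible response shapes."""
    data = payload.get("data")
    if isinstance(data, list) and data:
        # OpenAI-style: {"data": [{"index": 0, "embedding": [...]}, ...]}
        items = [item for item in data if isinstance(item, dict)]
        items.sort(key=lambda item: item.get("index", 0))
        vectors = [item.get("embedding") for item in items]
    elif isinstance(payload.get("embeddings"), list):
        # Batch style: {"embeddings": [[...], [...]]}
        vectors = payload["embeddings"]
    elif isinstance(payload.get("embedding"), list):
        # Single-vector style: {"embedding": [...]}
        vectors = [payload["embedding"]]
    else:
        raise ValueError(f"no embeddings in response (keys: {sorted(payload)})")

    if not vectors:
        raise ValueError("response contained zero vectors")
    for vector in vectors:
        if not isinstance(vector, list) or not vector:
            raise ValueError("empty or malformed embedding vector")
        if not all(isinstance(value, (int, float)) for value in vector):
            raise ValueError("embedding vector contains non-numeric values")

    # All vectors in one response should share a dimension, which keeps the
    # stored `dims` consistent with the vector length.
    dims = {len(vector) for vector in vectors}
    if len(dims) != 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")
    return [[float(value) for value in vector] for vector in vectors]


normalized = extract_vectors(
    {"data": [{"index": 1, "embedding": [3, 4]}, {"index": 0, "embedding": [1, 2]}]}
)
```

Failing loudly on an error payload or an empty vector keeps an unusable model from writing meaningless rows into the vector store.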

## 3. LLM Providers and the CLI Contract

LLM providers power deep-research synthesis and the Code Mapper/AutoDoc cleanup passes. The Claude Code CLI provider (`chunkhound/providers/llm/claude_code_cli_provider.py`) and the OpenCode CLI provider (`chunkhound/providers/llm/opencode_cli_provider.py`) both shell out to a local binary rather than calling an HTTP API, which means auth is delegated to the user's local CLI session. This design is what issue #113 ("OpenCode LLM provider") requests: an OpenCode-backed provider following the same shape as the Claude Code CLI provider.
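
The shell-out approach can be sketched as follows. `CliLLMProviderSketch` is a hypothetical class; passing the prompt on stdin, the timeout value, and the error handling are assumptions rather than the behavior of the real providers. The Python interpreter stands in for the CLI binary so the example runs anywhere.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CliLLMProviderSketch:
    """Hypothetical provider that delegates completion to a local CLI."""

    command: list[str]
    timeout_seconds: float = 120.0

    def complete(self, prompt: str) -> str:
        try:
            result = subprocess.run(
                self.command,
                input=prompt,          # assumption: prompt is sent on stdin
                capture_output=True,
                text=True,
                timeout=self.timeout_seconds,
                check=False,
            )
        except FileNotFoundError as exc:
            raise RuntimeError(f"CLI binary not found: {self.command[0]}") from exc
        except subprocess.TimeoutExpired as exc:
            raise RuntimeError("CLI call timed out") from exc
        if result.returncode != 0:
            raise RuntimeError(
                f"CLI exited with {result.returncode}: {result.stderr.strip()}"
            )
        return result.stdout.strip()


# Stand-in "binary": a Python one-liner that upper-cases its stdin.
echo_provider = CliLLMProviderSketch(
    [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]
)
reply = echo_provider.complete("summarize the parser layer")
```

No API key appears anywhere in the provider; whatever credentials the local CLI already holds are used implicitly.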

Providers are surfaced through `LLMManager`, which the research command resolves after embeddings:

```python
if config.llm is not None:
    llm_manager = LLMManager()
    llm_manager.register_provider(...)
```

Source: [chunkhound/api/cli/commands/research.py:46-65](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/commands/research.py). Provider, model, and reasoning-effort overrides are exposed through environment variables (e.g., `CHUNKHOUND_LLM_MAP_HYDE_PROVIDER`, `CHUNKHOUND_LLM_MAP_HYDE_MODEL`, `CHUNKHOUND_LLM_MAP_HYDE_REASONING_EFFORT`) so AutoDoc and Code Mapper can use a cheaper planning model than the synthesis model.
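
A minimal sketch of how such overrides could be resolved is shown below. The environment variable names come from the text above; `LLMSettings`, `resolve_hyde_settings`, the fall-back-to-synthesis behavior, and the model names are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMSettings:
    """Hypothetical container for one LLM role's settings."""

    provider: str
    model: str
    reasoning_effort: Optional[str] = None


def resolve_hyde_settings(
    synthesis: LLMSettings,
    env: Optional[Mapping[str, str]] = None,
) -> LLMSettings:
    """Apply per-variable overrides, falling back to the synthesis settings."""
    env = os.environ if env is None else env
    prefix = "CHUNKHOUND_LLM_MAP_HYDE_"
    return LLMSettings(
        # Empty strings are treated as "not set" (an assumption).
        provider=env.get(prefix + "PROVIDER") or synthesis.provider,
        model=env.get(prefix + "MODEL") or synthesis.model,
        reasoning_effort=(
            env.get(prefix + "REASONING_EFFORT") or synthesis.reasoning_effort
        ),
    )


synthesis = LLMSettings("claude-code-cli", "large-model", "high")
planning = resolve_hyde_settings(
    synthesis,
    env={
        "CHUNKHOUND_LLM_MAP_HYDE_PROVIDER": "opencode-cli",
        "CHUNKHOUND_LLM_MAP_HYDE_MODEL": "small-model",
    },
)
```

Resolving each variable independently means a user can swap only the model while keeping the provider and reasoning effort of the synthesis role.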

The CLI surface for these providers is built compositionally in `chunkhound/api/cli/parsers/`. The `research_parser` registers database, embedding, LLM, and research options together via `add_config_arguments`, while `code_mapper_parser` and `autodoc_parser` add their own flags (such as `--combined`, `--map-comprehensiveness`, and `--audience`) on top of the shared `add_common_arguments`. Sources: [chunkhound/api/cli/parsers/research_parser.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/parsers/research_parser.py), [chunkhound/api/cli/parsers/code_mapper_parser.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/parsers/code_mapper_parser.py), [chunkhound/api/cli/parsers/main_parser.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/parsers/main_parser.py).
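
The compositional style can be sketched with plain `argparse`. This is not the project's parser code: the helper names carry a `_sketch` suffix to mark them as hypothetical, `--verbose` is a placeholder for the real shared options, and the assignment of `--combined`, `--map-comprehensiveness`, and `--audience` to particular subcommands, along with their types and defaults, is an assumption.

```python
import argparse


def add_common_arguments_sketch(parser: argparse.ArgumentParser) -> None:
    # Placeholder shared flag; the real shared options are not shown here.
    parser.add_argument("--verbose", action="store_true")


def build_parser_sketch() -> argparse.ArgumentParser:
    root = argparse.ArgumentParser(prog="chunkhound-sketch")
    subparsers = root.add_subparsers(dest="command", required=True)

    # Each subcommand starts from the shared options, then layers its own.
    map_parser = subparsers.add_parser("map")
    add_common_arguments_sketch(map_parser)
    map_parser.add_argument("--combined", action="store_true")
    map_parser.add_argument("--map-comprehensiveness", default=None)

    autodoc_parser = subparsers.add_parser("autodoc")
    add_common_arguments_sketch(autodoc_parser)
    autodoc_parser.add_argument("--audience", default=None)

    return root


map_args = build_parser_sketch().parse_args(["map", "--combined", "--verbose"])
autodoc_args = build_parser_sketch().parse_args(["autodoc", "--audience", "end-user"])
```

Shared options are defined once and command-specific flags stay local to the subcommand that owns them, so a flag such as `--audience` never leaks into `map`.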

## 4. Extensibility Patterns at a Glance

```mermaid
flowchart LR
    CLI[CLI / MCP entrypoint] --> CMDS[commands/*.py]
    CMDS --> Factory[Parser / Provider factories]
    Factory --> Parsers[Parsers<br/>universal_parser + lang plugins]
    Factory --> Emb[Embedding providers<br/>openai, voyageai, ollama]
    Factory --> LLM[LLM providers<br/>claude-code-cli, opencode-cli]
    Parsers --> Models[core/models<br/>File, Chunk, Embedding]
    Emb --> Models
    LLM --> Output[RichOutputFormatter<br/>TreeProgressDisplay]
```

Three patterns repeat across all three extension points:

1. **Factory + registry.** `ParserFactory`, `EmbeddingProviderFactory`, and `LLMManager` all keep an internal map keyed by name; you register once and the rest of the system resolves by string.
2. **Frozen dataclass contracts.** Adding a parser or provider never requires schema migration because the output conforms to existing `Chunk` / `Embedding` models with `metadata` as an escape hatch. Source: [chunkhound/core/models/__init__.py](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/core/models/__init__.py).
3. **Rich, terminal-aware output.** Long-running operations emit structured events through `TreeProgressDisplay` and `RichOutputFormatter`, both of which detect terminal capability and redirect to stderr when stdout is captured (e.g., for MCP stdio). Sources: [chunkhound/api/cli/utils/rich_output.py:1-60](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/utils/rich_output.py), [chunkhound/api/cli/utils/tree_progress.py:1-80](https://github.com/chunkhound/chunkhound/blob/main/chunkhound/api/cli/utils/tree_progress.py).
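
The third pattern can be illustrated with a small sketch. `ProgressEmitterSketch` is hypothetical and far simpler than `TreeProgressDisplay` or `RichOutputFormatter`; the rule it applies (use stdout only when it is a terminal, otherwise fall back to stderr and plain prefixes) is an assumption about how such detection might work.

```python
import io
import sys
from typing import Optional, TextIO


def _is_tty(stream: TextIO) -> bool:
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())


class ProgressEmitterSketch:
    """Hypothetical emitter that keeps progress output off a captured stdout."""

    def __init__(
        self,
        stdout: Optional[TextIO] = None,
        stderr: Optional[TextIO] = None,
    ) -> None:
        stdout = sys.stdout if stdout is None else stdout
        stderr = sys.stderr if stderr is None else stderr
        # A captured stdout may be carrying protocol traffic (for example
        # MCP stdio), so progress is redirected to stderr in that case.
        self.stream = stdout if _is_tty(stdout) else stderr
        self.fancy = _is_tty(self.stream)

    def emit(self, message: str) -> None:
        prefix = "* " if self.fancy else "[progress] "
        print(prefix + message, file=self.stream, flush=True)


# Simulate a captured stdout: StringIO reports isatty() == False.
captured_stdout, fake_stderr = io.StringIO(), io.StringIO()
ProgressEmitterSketch(captured_stdout, fake_stderr).emit("indexed 10 files")
```

In this run nothing is written to the captured stdout, so a client reading structured messages from it would not see stray progress text.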

| Extension point | Where to register | Output contract | Community signal |
|---|---|---|---|
| New language | `parsers/parser_factory.py` | `Chunk` list | #35 Ruby, #83 worktrees |
| New embedding backend | `EmbeddingProviderFactory` | `Embedding` records | #41 Qwen3/Ollama |
| New LLM backend | `LLMManager` | Plain text / structured completion | #113 OpenCode |

## See Also

- [Configuration Guide](https://chunkhound.ai/docs/configuration/)
- [Model Context Protocol specification](https://spec.modelcontextprotocol.io/)
- Code Mapper and AutoDoc user guides (via `chunkhound map --help` and `chunkhound autodoc --help`)

---


## Pitfall Log

Project: chunkhound/chunkhound

Summary: Found 10 structured pitfall items, none of them high or blocking. Top priority: Installation risk (requires verification).

## 1. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/chunkhound/chunkhound/issues/352

## 2. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.host_targets | https://github.com/chunkhound/chunkhound

## 3. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/chunkhound/chunkhound

## 4. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/chunkhound/chunkhound

## 5. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/chunkhound/chunkhound

## 6. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/chunkhound/chunkhound

## 7. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/chunkhound/chunkhound/issues/349

## 8. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/chunkhound/chunkhound/issues/315

## 9. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/chunkhound/chunkhound

## 10. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/chunkhound/chunkhound

<!-- canonical_name: chunkhound/chunkhound; human_manual_source: deepwiki_human_wiki -->
