# Wiki Documentation for https://github.com/chroma-core/chroma

Generated on: 2026-05-10 14:46:12 UTC

## Table of Contents

- [Chroma Project Overview](#page-overview)
- [Installation and Configuration](#page-installation)
- [System Architecture](#page-architecture)
- [Data Flow and Processing Pipeline](#page-data-flow)
- [Vector Index System](#page-vector-index)
- [Query Execution Engine](#page-query-execution)
- [Python Client](#page-python-client)
- [JavaScript/TypeScript Client](#page-javascript-client)
- [Embedding Function Integration](#page-embedding-functions)
- [Storage System](#page-storage)

<a id='page-overview'></a>

## Chroma Project Overview

### Related Pages

Related topics: [System Architecture](#page-architecture), [Installation and Configuration](#page-installation)

<details>
<summary>Relevant source files</summary>

The following files were used as context for generating this wiki page:

- [README.md](https://github.com/chroma-core/chroma/blob/main/README.md)
- [rust/types/src/collection_schema.rs](https://github.com/chroma-core/chroma/blob/main/rust/types/src/collection_schema.rs)
- [rust/types/src/metadata.rs](https://github.com/chroma-core/chroma/blob/main/rust/types/src/metadata.rs)
- [rust/blockstore/src/arrow/root.rs](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/root.rs)
- [rust/worker/src/execution/operators/execute_task.rs](https://github.com/chroma-core/chroma/blob/main/rust/worker/src/execution/operators/execute_task.rs)
- [rust/worker/src/execution/operators/materialize_logs.rs](https://github.com/chroma-core/chroma/blob/main/rust/worker/src/execution/operators/materialize_logs.rs)
- [clients/new-js/packages/chromadb/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/chromadb/README.md)
- [examples/xai/README.md](https://github.com/chroma-core/chroma/blob/main/examples/xai/README.md)
- [schemas/embedding_functions/README.md](https://github.com/chroma-core/chroma/blob/main/schemas/embedding_functions/README.md)
- [chromadb/utils/embedding_functions/schemas/README.md](https://github.com/chroma-core/chroma/blob/main/chromadb/utils/embedding_functions/schemas/README.md)
</details>

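# Chroma Project Overview

## Introduction

Chroma is open-source AI data infrastructure built for vector search and embedding-based AI applications. It provides a full set of vector database features, including document storage, metadata filtering, and similarity search. Its core goal is to simplify AI application development so that developers can quickly build applications on top of vector retrieval.

The project is released under the Apache 2.0 license and ships Python and JavaScript/TypeScript clients, along with a client-server deployment option. Chroma Cloud is the hosted service, offering serverless vector search, hybrid search, and full-text search. Sources: [README.md](https://github.com/chroma-core/chroma/blob/main/README.md)

## System Architecture

### Overall Architecture

Chroma uses a hybrid architecture: the core storage layer is implemented in Rust for performance and data safety. The system supports several deployment modes, including local embedded deployment and client-server mode.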

```mermaid
graph TD
    A[Python Client / JS Client] --> B[Chroma Server]
    A --> C[Embedded Mode]
    B --> D[Rust Worker]
    C --> D
    D --> E[Blockstore]
    D --> F[Record Segment]
    E --> G[Arrow Format Storage]
    F --> H[Log Materialization]
    G --> I[Persistent Storage]
    H --> I
```

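### Rust Core Components

Chroma's Rust backend consists of several core modules that work together to store and retrieve vector data.

| Component | Path | Description |
|---------|---------|---------|
| blockstore | rust/blockstore/ | Block storage management, using the Apache Arrow format |
| types | rust/types/ | Type definitions and data models |
| worker | rust/worker/src/execution/ | Task execution and log materialization |
| record_segment | - | Record segment reads and writes |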

Sources: [rust/blockstore/src/arrow/root.rs](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/root.rs), [rust/types/src/collection_schema.rs](https://github.com/chroma-core/chroma/blob/main/rust/types/src/collection_schema.rs)

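## Data Model

### Collection Schema

Chroma uses a Schema to define a Collection's structure and index configuration. Each Schema can contain multiple field definitions and index configurations.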

```rust
Schema::default()
    .create_index(None, VectorIndexConfig {
        space: Some(Space::Cosine),
        embedding_function: None,
        source_key: None,
        hnsw: None,
        spann: None,
    }.into())?
    .create_index(Some("category"), StringInvertedIndexConfig {}.into())?
```

Sources: [rust/types/src/collection_schema.rs:1-50](https://github.com/chroma-core/chroma/blob/main/rust/types/src/collection_schema.rs)

### Metadata System

Chroma supports rich metadata filtering, including a range of comparison operators and ways to combine expressions.

```mermaid
graph LR
    A[MetadataComparison] --> B[Primitive Comparisons]
    A --> C[Set Comparisons]
    B --> D[= / != / > / >= / < / <=]
    C --> E[$in / $nin]
    
    F[DocumentOperator] --> G[Contains]
    F --> H[NotContains]
    F --> I[Regex]
    F --> J[NotRegex]
```

The supported metadata comparison operators are `=`, `!=`, `>`, `>=`, `<`, and `<=`, plus the set operators `$in` and `$nin`. The document operators are `Contains`, `NotContains`, `Regex`, and `NotRegex`. Sources: [rust/types/src/metadata.rs:1-40](https://github.com/chroma-core/chroma/blob/main/rust/types/src/metadata.rs)
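
The four document operators can be illustrated with a minimal Python sketch. This is not Chroma's implementation: `matches_document` is a hypothetical helper, and it assumes Python's `re` dialect, which may differ from the regex engine Chroma actually uses.

```python
import re


def matches_document(document: str, operator: str, pattern: str) -> bool:
    """Evaluate a single document operator against a document string.

    Hypothetical helper for illustration; operator names mirror the list above.
    """
    if operator == "Contains":
        return pattern in document
    if operator == "NotContains":
        return pattern not in document
    if operator == "Regex":
        return re.search(pattern, document) is not None
    if operator == "NotRegex":
        return re.search(pattern, document) is None
    raise ValueError(f"unknown document operator: {operator}")


docs = ["vector search with HNSW", "full-text search", "release notes v1.2"]
hits = [d for d in docs if matches_document(d, "Contains", "search")]
```

Here `hits` keeps only the documents containing the substring `search`; swapping in `"Regex"` with a pattern such as `r"v\d+\.\d+"` would select the release-notes entry instead.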

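### Index Types

| Index type | Description | Scope |
|---------|------|---------|
| VectorIndex | Vector index; supports HNSW and SPANN | Global vector field |
| StringInvertedIndex | String inverted index | String-typed fields |
| Spann | SPANN vector index | Optional for vector fields |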

Sources: [rust/types/src/collection_schema.rs:1-60](https://github.com/chroma-core/chroma/blob/main/rust/types/src/collection_schema.rs)

## Client SDKs

### Python Client

The Python client is the most mature client implementation and supports the full Chroma feature set.

```python
import chromadb

# Create a client
chroma = chromadb.Client()

# Create a collection
collection = chroma.create_collection(
    name="my-collection",
    metadata={"description": "My first collection"}
)

# Add documents
collection.add(
    documents=["Document 1", "Document 2"],
    metadatas=[{"source": "doc1"}, {"source": "doc2"}],
    ids=["id1", "id2"]
)

# Query
results = collection.query(
    query_texts=["This is a query document"],
    n_results=2,
    where={"metadata_field": "is_equal_to_this"}
)
```

Sources: [README.md](https://github.com/chroma-core/chroma/blob/main/README.md), [chromadb/utils/embedding_functions/schemas/README.md](https://github.com/chroma-core/chroma/blob/main/chromadb/utils/embedding_functions/schemas/README.md)

### JavaScript/TypeScript Client

The new JavaScript client has a modular design and is split into several independent packages.

```typescript
import { ChromaClient } from "chromadb";

const chroma = new ChromaClient();
const collection = await chroma.createCollection({ name: "test-from-js" });

for (let i = 0; i < 20; i++) {
  await collection.add({
    ids: ["test-id-" + i.toString()],
    embeddings: [[1, 2, 3, 4, 5]],
    documents: ["test"],
  });
}
```

Sources: [clients/new-js/packages/chromadb/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/chromadb/README.md)

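## Embedding Function System

### Architecture

Chroma's embedding function system uses a plugin architecture that supports multiple embedding providers. Each embedding function has a corresponding JSON Schema used to validate its configuration.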

```mermaid
graph TD
    A[Embedding Function] --> B[Schema Validation]
    B --> C[Config Validation]
    C --> D[API Client]
    D --> E[Embedding Service]
    
    F[OpenAI Schema] --> B
    G[Jina Schema] --> B
    H[TogetherAI Schema] --> B
    I[Qwen Schema] --> B
```

Sources: [schemas/embedding_functions/README.md](https://github.com/chroma-core/chroma/blob/main/schemas/embedding_functions/README.md)

### Supported Embedding Providers

| Provider | NPM package | Supported models |
|-------|---------|---------|
| OpenAI | Built in | text-embedding-ada-002 and others |
| Jina | @chroma-core/jina | jina-embeddings-v2-base-en |
| Together AI | @chroma-core/together-ai | togethercomputer/m2-bert-80M-8k-retrieval |
| Qwen | @chroma-core/chroma-cloud-qwen | Qwen3-Embedding-0.6B |
| xAI | Example | xAI SDK integration |

Sources: [clients/new-js/packages/ai-embeddings/jina/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/jina/README.md), [clients/new-js/packages/ai-embeddings/together-ai/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/together-ai/README.md), [clients/new-js/packages/ai-embeddings/chroma-cloud-qwen/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/chroma-cloud-qwen/README.md)

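### Configuration Validation

Each embedding function validates its configuration against a JSON Schema (Draft-07), which keeps configurations compatible across languages.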

```python
from chromadb.utils.embedding_functions.schemas import validate_config

config = {
    "api_key_env_var": "CHROMA_OPENAI_API_KEY",
    "model_name": "text-embedding-ada-002"
}
validate_config(config, "openai")
```

Sources: [chromadb/utils/embedding_functions/schemas/README.md](https://github.com/chroma-core/chroma/blob/main/chromadb/utils/embedding_functions/schemas/README.md)

## Execution Flow

### Log Materialization

Chroma uses log materialization to process data changes, ensuring consistency and durability.

```mermaid
graph LR
    A[LogRecord] --> B[MaterializeLogInput]
    B --> C[MaterializeLogOperator]
    C --> D[PartitionedMaterializeLogsResult]
    D --> E[Persistent Storage]
    
    F[RecordSegmentReader] -.-> C
    G[Offset IDs] -.-> C
    H[Plan Options] -.-> C
```

Sources: [rust/worker/src/execution/operators/materialize_logs.rs:1-80](https://github.com/chroma-core/chroma/blob/main/rust/worker/src/execution/operators/materialize_logs.rs)
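
As a rough illustration of the idea behind log materialization, the sketch below replays an ordered change log into a final key-value state. It is a simplified model, not the `MaterializeLogOperator` itself: the record layout, the operation names (`add`, `update`, `upsert`, `delete`), and their exact semantics are assumptions made for this example.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical record layout: (log_offset, operation, record_id, document)
LogRecord = Tuple[int, str, str, Optional[str]]


def materialize(log: Iterable[LogRecord],
                state: Optional[Dict[str, Optional[str]]] = None) -> Dict[str, Optional[str]]:
    """Replay log records in offset order and return the resulting state."""
    result = dict(state or {})
    for _offset, operation, record_id, document in sorted(log, key=lambda r: r[0]):
        if operation == "add":
            # Assumed semantics: adding an existing id is a no-op.
            result.setdefault(record_id, document)
        elif operation == "update":
            # Assumed semantics: updating a missing id is a no-op.
            if record_id in result:
                result[record_id] = document
        elif operation == "upsert":
            result[record_id] = document
        elif operation == "delete":
            result.pop(record_id, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return result


log = [
    (3, "delete", "b", None),
    (1, "add", "a", "v1"),
    (2, "add", "b", "v1"),
    (4, "update", "a", "v2"),
    (5, "add", "a", "ignored"),
]
final_state = materialize(log)
```

Sorting by offset before replaying is what makes the outcome deterministic regardless of the order in which records arrive.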

### Task Execution Flow

Executing an Attached Function involves reading record segments, handling log offsets, and managing shards.

```mermaid
graph TD
    A[ExecuteAttachedFunctionInput] --> B[Record Segment Reader]
    B --> C{Is Rebuild / Backfill?}
    C -->|Yes| D[Output Record Segment Reader]
    C -->|No| E[Standard Processing]
    D --> F[Process Materialized Logs]
    E --> F
    F --> G[Log Offset Handling]
    G --> H[Segment Shard Processing]
    H --> I[ExecuteAttachedFunctionOutput]
```

Sources: [rust/worker/src/execution/operators/execute_task.rs:1-100](https://github.com/chroma-core/chroma/blob/main/rust/worker/src/execution/operators/execute_task.rs)

## Deployment Options

### Local Deployment

Install the Python client with pip to start developing locally:

```bash
pip install chromadb
```

Sources: [README.md](https://github.com/chroma-core/chroma/blob/main/README.md)

### Client-Server Mode

```bash
chroma run --path /chroma_db_path
```

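### Cloud Deployment

Chroma Cloud is the hosted service, offering serverless vector search, hybrid search, and full-text search.

Sources: [README.md](https://github.com/chroma-core/chroma/blob/main/README.md)

### Terraform Deployment Example

Terraform can be used to automate the deployment of a Chroma instance on DigitalOcean: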

```bash
export TF_VAR_chroma_release="0.4.12"
export TF_VAR_region="ams2"
export TF_VAR_public_access="true"
export TF_VAR_enable_auth="true"
export TF_VAR_auth_type="token"
terraform apply -auto-approve
```

Sources: [examples/deployments/do-terraform/README.md](https://github.com/chroma-core/chroma/blob/main/examples/deployments/do-terraform/README.md)

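## Error Handling

### Error Codes

Chroma uses a unified set of error codes to classify errors:

| Error code | Meaning | Typical triggers |
|-------|------|---------|
| InvalidArgument | Invalid argument | Failed parameter validation, mismatched IDs |
| Internal | Internal error | Arrow format errors, missing data |
| NotFound | Not found | Record does not exist |
| AlreadyExists | Already exists | Duplicate creation |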

Sources: [rust/blockstore/src/arrow/root.rs:1-30](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/root.rs)

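## Related Resources

- Official documentation: https://docs.trychroma.com/
- Community Discord: https://discord.gg/MMeYNTmh3x
- GitHub repository: https://github.com/chroma-core/chroma
- Homepage: https://www.trychroma.com/

### Example Applications

The Chroma repository includes several example applications that show how to integrate different AI services:

- **xAI integration**: document question answering with the xAI SDK
- **Movies example**: storing and retrieving movie data with Chroma in a web application
- **Deployment examples**: Terraform configuration for cloud deployment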

Sources: [examples/xai/README.md](https://github.com/chroma-core/chroma/blob/main/examples/xai/README.md), [sample_apps/movies/README.md](https://github.com/chroma-core/chroma/blob/main/sample_apps/movies/README.md)

---

<a id='page-installation'></a>

## Installation and Configuration

### Related Pages

Related topics: [Chroma Project Overview](#page-overview)

<details>
<summary>Relevant source files</summary>

The following files were used as context for generating this wiki page:

- [clients/python/pyproject.toml](https://github.com/chroma-core/chroma/blob/main/clients/python/pyproject.toml)
- [clients/js/package.json](https://github.com/chroma-core/chroma/blob/main/clients/js/package.json)
- [Dockerfile](https://github.com/chroma-core/chroma/blob/main/Dockerfile)
- [docker-compose.yml](https://github.com/chroma-core/chroma/blob/main/docker-compose.yml)
</details>

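# Installation and Configuration

## Overview

Chroma is open-source AI vector data infrastructure that provides vector storage, retrieval, and hybrid search. This page describes the available installation methods and configuration options so that developers can choose a deployment that fits their use case.

Chroma supports three main deployment modes:

| Deployment mode | Use case | Data persistence |
|---------|---------|-----------|
| Embedded Python client | Local development, prototyping | Local file system |
| Client-server mode | Production, multi-client access | Configurable storage backend |
| Docker container | Microservice architectures, DevOps | Docker volume |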

Sources: [README.md](README.md)

---

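## Installing the Python Client

### Requirements

The Chroma Python client supports Python 3.8 and later. Installing into a virtual environment is recommended to avoid dependency conflicts.

### Installation Methods

#### Install with pip (recommended)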

```bash
pip install chromadb
```

Sources: [README.md](README.md)

#### Install from source

```bash
git clone https://github.com/chroma-core/chroma.git
cd chroma
pip install -e .
```

#### Verify the installation

```python
import chromadb
print(chromadb.__version__)
```

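### Core Dependencies

Dependencies for the Chroma Python client are managed in `pyproject.toml`. The main ones are:

| Package | Purpose |
|-------|------|
| `onnxruntime` | ONNX inference runtime used to compute embeddings |
| `tokenizers` | Text tokenization |
| `hnswlib` | HNSW approximate nearest neighbor index |
| `clickhouse-connect` | Optional: ClickHouse backend support |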

Sources: [clients/python/pyproject.toml](clients/python/pyproject.toml)

---

## Installing the JavaScript/TypeScript Client

### Requirements

- Node.js 18.0 or later
- npm, yarn, or pnpm as the package manager

### Installation Methods

```bash
# Using npm
npm install chromadb

# Using yarn
yarn add chromadb

# Using pnpm
pnpm add chromadb
```

Sources: [README.md](README.md)

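### Client Package Structure

The new JavaScript client (`new-js`) is made up of several sub-packages:

| Package | Description |
|-----|---------|
| `@chroma-core/ai-embeddings-common` | Shared utilities for embedding functions |
| `@chroma-core/ai-embeddings` | AI embedding implementations |
| `@chroma-core/chromadb` | Core client implementation |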

Sources: [clients/new-js/packages/ai-embeddings/common/README.md](clients/new-js/packages/ai-embeddings/common/README.md)

### Basic Usage Example

```javascript
import { ChromaClient } from "chromadb";

const chroma = new ChromaClient();
const collection = await chroma.createCollection({ name: "test-from-js" });

await collection.add({
  ids: ["test-id-1"],
  embeddings: [[1, 2, 3, 4, 5]],
  documents: ["test document"],
});
```

Sources: [clients/new-js/packages/chromadb/README.md](clients/new-js/packages/chromadb/README.md)

---

## Docker Deployment

### Prerequisites

- Docker Engine 20.10 or later
- Docker Compose 2.0 or later (optional)

### Building from the Dockerfile

Chroma provides an official Dockerfile for building custom images:

```dockerfile
FROM python:3.9-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Copy the application code
COPY . /app

# Install Python dependencies
RUN pip install --no-cache-dir -e .

# Expose the server port
EXPOSE 8000

# Startup command
CMD ["chroma", "run", "--host", "0.0.0.0", "--port", "8000"]
```

Sources: [Dockerfile](Dockerfile)

### Orchestrating with Docker Compose

Docker Compose provides a configuration for quickly starting a complete Chroma service:

```yaml
version: '3.8'

services:
  chroma:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - chroma_data:/chroma/chroma_db_path
    environment:
      - CHROMA_SERVER_AUTH_CREDENTIALS=admin:password
      - CHROMA_SERVER_AUTH_PROVIDER=basic

volumes:
  chroma_data:
```

Sources: [docker-compose.yml](docker-compose.yml)

### Starting the Service

```bash
# Start only the Chroma service, detached
docker-compose up -d

# Build and start
docker-compose up --build

# Follow the logs
docker-compose logs -f chroma

# Stop the service
docker-compose down
```

---

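## Client-Server Mode Configuration

Use Chroma's client-server mode when multiple clients need access or when deploying to production.

### Starting the Server

```bash
# Basic startup
chroma run --path /chroma_db_path

# Specify the host and port
chroma run --host 0.0.0.0 --port 8000 --path /chroma_db_path
```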

Sources: [README.md](README.md)

### Server Authentication Configuration

Chroma supports two authentication methods, token-based and Basic Auth (v0.4.7+):

| Setting | Environment variable | Description |
|---------|---------|------|
| Token authentication | `CHROMA_SERVER_AUTH_CREDENTIALS` | Sets the token secret |
| Basic Auth | `CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER` | Set to `chromadb.auth.basic.BasicAuthenticationServerProvider` |
| CORS configuration | `CHROMA_SERVER_CORS_ALLOWED_ORIGINS` | Allowed cross-origin sources |

Sources: [examples/deployments/do-terraform/README.md](examples/deployments/do-terraform/README.md)
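
On the client side, both methods come down to sending an `Authorization` header. The sketch below builds that header with the standard `Bearer` and `Basic` HTTP schemes; the helper names are hypothetical, and whether a given Chroma server expects exactly these schemes depends on how its authentication provider is configured.

```python
import base64
from typing import Dict


def token_auth_header(token: str) -> Dict[str, str]:
    """Build a Bearer-style Authorization header (assumed scheme for token auth)."""
    return {"Authorization": f"Bearer {token}"}


def basic_auth_header(username: str, password: str) -> Dict[str, str]:
    """Build an HTTP Basic Authorization header: base64 of 'username:password'."""
    raw = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


headers = basic_auth_header("admin", "password")
```

The resulting dictionary can be passed as the `headers` argument of an HTTP client. Basic credentials are only base64-encoded, not encrypted, so they should be sent over TLS.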

### API Key Configuration

Chroma Cloud requires an API key:

```bash
export CHROMA_API_KEY=your-api-key
```

Sources: [clients/new-js/packages/ai-embeddings/chroma-cloud-qwen/README.md](clients/new-js/packages/ai-embeddings/chroma-cloud-qwen/README.md)

### Connecting Clients

#### Connecting from Python

```python
import chromadb

# Connect to a remote server (host and port are passed separately)
client = chromadb.HttpClient(
    host="localhost",
    port=8000,
    headers={"Authorization": "Bearer your-token"}
)

# Connect to Chroma Cloud
client = chromadb.CloudClient(
    tenant="your-tenant",
    database="your-database",
    api_key="your-api-key"
)
```

#### Connecting from JavaScript

```javascript
import { ChromaClient } from "chromadb";

// Connect to a remote server
const chroma = new ChromaClient({
  path: "http://localhost:8000"
});
```

---

## Environment Variables

### Full List of Environment Variables

| Environment variable | Default | Description |
|---------|-------|------|
| `CHROMA_DB_IMPL` | `duckdb+parquet` | Database implementation |
| `CHROMA_SERVER_HOST` | `localhost` | Server host address |
| `CHROMA_SERVER_PORT` | `8000` | Server port |
| `CHROMA_SERVER_CORS_ALLOWED_ORIGINS` | `*` | Allowed CORS origins |
| `CHROMA_SERVER_AUTH_CREDENTIALS` | - | Authentication credentials |
| `CHROMA_SERVER_AUTH_PROVIDER` | - | Authentication provider |
| `CHROMA_API_KEY` | - | Cloud API key |

Sources: [README.md](README.md), [sample_apps/movies/README.md](sample_apps/movies/README.md)
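
Application code that wraps a Chroma deployment often needs to read the same variables. The sketch below is a hypothetical helper, not part of Chroma: `ServerSettings` and `load_settings` are names invented here, and the fallback values are simply the defaults listed in the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServerSettings:
    """Subset of the server settings from the table above (hypothetical type)."""
    host: str
    port: int
    cors_allowed_origins: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> ServerSettings:
    """Read settings from the environment, falling back to the documented defaults."""
    source = os.environ if env is None else env
    return ServerSettings(
        host=source.get("CHROMA_SERVER_HOST", "localhost"),
        port=int(source.get("CHROMA_SERVER_PORT", "8000")),
        cors_allowed_origins=source.get("CHROMA_SERVER_CORS_ALLOWED_ORIGINS", "*"),
    )


defaults = load_settings({})
custom = load_settings({"CHROMA_SERVER_HOST": "0.0.0.0", "CHROMA_SERVER_PORT": "9000"})
```

Accepting an explicit mapping keeps the helper easy to test without mutating the real process environment.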

### Example: Production Configuration

```bash
# Example environment variable configuration
export CHROMA_SERVER_HOST=0.0.0.0
export CHROMA_SERVER_PORT=8000
export CHROMA_SERVER_CORS_ALLOWED_ORIGINS="https://your-app.com"
export CHROMA_SERVER_AUTH_CREDENTIALS="your-secure-password"
export CHROMA_SERVER_AUTH_PROVIDER="chromadb.auth.basic.BasicAuthenticationServerProvider"
```

---

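## Embedding Function Configuration

### Built-in Embedding Functions

Chroma provides several built-in embedding functions that support different AI providers:

| Provider | Model | Configuration parameters |
|-------|-----|---------|
| OpenAI | `text-embedding-ada-002` | `api_key_env_var`, `model_name` |
| Google Docs | - | `credentials`, `token` |
| Chroma Cloud Qwen | Qwen3-Embedding-0.6B | `api_key_env_var`, `model`, `task` |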

Sources: [chromadb/utils/embedding_functions/schemas/README.md](chromadb/utils/embedding_functions/schemas/README.md), [clients/new-js/packages/ai-embeddings/chroma-cloud-qwen/README.md](clients/new-js/packages/ai-embeddings/chroma-cloud-qwen/README.md)

### Validating Embedding Function Configuration

Chroma validates embedding function configurations against JSON Schemas, which keeps them compatible across languages:

```python
from chromadb.utils.embedding_functions.schemas import validate_config

# Validate an OpenAI configuration
config = {
    "api_key_env_var": "CHROMA_OPENAI_API_KEY",
    "model_name": "text-embedding-ada-002"
}
validate_config(config, "openai")
```

Sources: [chromadb/utils/embedding_functions/schemas/README.md](chromadb/utils/embedding_functions/schemas/README.md)
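
To make the idea of schema-based validation concrete, the following standard-library sketch checks a configuration against a small schema. It covers only a tiny subset of JSON Schema (`required`, `properties` with `type`, and `additionalProperties`), and `HYPOTHETICAL_SCHEMA` is invented for this example; it is not the schema file shipped in the repository.

```python
from typing import Any, Dict, List

_JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

# Hypothetical schema for illustration only.
HYPOTHETICAL_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "api_key_env_var": {"type": "string"},
        "model_name": {"type": "string"},
        "dimensions": {"type": "integer"},
    },
    "required": ["api_key_env_var", "model_name"],
    "additionalProperties": False,
}


def check_config(config: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable validation errors (empty if valid)."""
    errors: List[str] = []
    for key in schema.get("required", []):
        if key not in config:
            errors.append(f"missing required key: {key}")
    properties = schema.get("properties", {})
    for key, value in config.items():
        if key not in properties:
            if schema.get("additionalProperties", True) is False:
                errors.append(f"unexpected key: {key}")
            continue
        type_name = properties[key]["type"]
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        is_bool_mismatch = isinstance(value, bool) and type_name != "boolean"
        if is_bool_mismatch or not isinstance(value, _JSON_TYPES[type_name]):
            errors.append(f"{key}: expected {type_name}")
    return errors


valid = check_config(
    {"api_key_env_var": "CHROMA_OPENAI_API_KEY", "model_name": "text-embedding-ada-002"},
    HYPOTHETICAL_SCHEMA,
)
invalid = check_config({"model_name": 3}, HYPOTHETICAL_SCHEMA)
```

A full validator handles many more keywords, so real projects would normally rely on a complete JSON Schema library rather than a hand-rolled check like this one.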

---

## Quick Start Examples

### Local Mode (Python)

```python
import chromadb

# Create a client (local mode)
client = chromadb.PersistentClient(path="./chroma_db")

# Create a collection
collection = client.create_collection(name="my_collection")

# Add data
collection.add(
    documents=["First document", "Second document", "Third document"],
    ids=["id1", "id2", "id3"],
    metadatas=[{"source": "docs"}, {"source": "docs"}, {"source": "api"}]
)

# Query the data
results = collection.query(
    query_texts=["query text"],
    n_results=2
)

print(results)
```

### Server Mode (Python)

```python
import chromadb

# Connect to the server (host and port are passed separately)
client = chromadb.HttpClient(host="localhost", port=8000)

# Subsequent operations are the same as in local mode
collection = client.get_collection(name="my_collection")
results = collection.query(
    query_texts=["query text"],
    n_results=2,
    where={"source": "docs"}  # metadata filter
)
```

Sources: [README.md](README.md)

---

## Deployment Architecture

### Single-Node Deployment Architecture

```mermaid
graph TD
    A[Client Application] -->|HTTP API| B[Chroma Server]
    B --> C[DuckDB + Parquet]
    B --> D[HNSW Index]
    C --> E[Persistent Storage]
    D --> E
    
    F[Embedding Function] -->|Compute Embeddings| B
    G[OpenAI API] -->|Optional| F
```

### Docker Deployment Architecture

```mermaid
graph TD
    A[External Clients] -->|HTTP| B[Docker Container]
    B --> C[Chroma Server]
    C --> D[(DuckDB)]
    C --> E[HNSW Index]
    D --> F[Volume: chroma_data]
    E --> F
    
    G[Embedding Service] -->|Compute| C
```

### Production Deployment Architecture

```mermaid
graph LR
    A[Web App] -->|API Requests| B[Load Balancer]
    B --> C1[Chroma Node 1]
    B --> C2[Chroma Node 2]
    B --> C3[Chroma Node N]
    
    C1 --> D[(Shared Storage)]
    C2 --> D
    C3 --> D
    
    D --> E[S3 / NFS]
```

---

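## Troubleshooting

### Common Issues

| Problem | Likely cause | Resolution |
|-----|---------|---------|
| Connection timeout | Service not running or wrong port | Confirm the service is running and check the port configuration |
| Authentication failure | Token or Basic Auth misconfigured | Check the environment variable configuration |
| Embedding dimension mismatch | Different embedding functions produce vectors of different dimensions | Use the same embedding function for every write and query on a collection |
| Out of disk space | Parquet files too large | Configure data sharding or clean up old data |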
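
For the dimension mismatch case, a quick client-side check before writing can surface the problem early. The helper below is a hypothetical sketch, not a Chroma API; it only verifies that every vector in a batch has the same length, optionally against a known collection dimension.

```python
from typing import Optional, Sequence


def check_dimensions(embeddings: Sequence[Sequence[float]],
                     expected_dimension: Optional[int] = None) -> int:
    """Return the common dimension of a batch, or raise ValueError on a mismatch."""
    if not embeddings:
        raise ValueError("no embeddings provided")
    dimension = len(embeddings[0]) if expected_dimension is None else expected_dimension
    for index, vector in enumerate(embeddings):
        if len(vector) != dimension:
            raise ValueError(
                f"embedding {index} has dimension {len(vector)}, expected {dimension}"
            )
    return dimension


batch_dimension = check_dimensions([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
```

Calling `check_dimensions(batch, expected_dimension=...)` with the dimension of the existing collection catches the case where a different embedding model was used for the new batch.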

### Viewing Logs

```bash
# View logs in a Docker environment
docker-compose logs chroma

# Run directly and watch the live logs
chroma run --path /chroma_db_path --verbose
```

---

## Related Resources

- [Official documentation](https://docs.trychroma.com/)
- [GitHub repository](https://github.com/chroma-core/chroma)
- [Discord community](https://discord.gg/MMeYNTmh3x)
- [Chroma Cloud](https://trychroma.com/)

---

<a id='page-architecture'></a>

## System Architecture

### Related Pages

Related topics: [Data Flow and Processing Pipeline](#page-data-flow), [Vector Index System](#page-vector-index), [Storage System](#page-storage)

<details>
<summary>Relevant source files</summary>

The following files were used as context for generating this wiki page:

- [rust/frontend/src/lib.rs](https://github.com/chroma-core/chroma/blob/main/rust/frontend/src/lib.rs)
- [rust/worker/src/lib.rs](https://github.com/chroma-core/chroma/blob/main/rust/worker/src/lib.rs)
- [go/pkg/sysdb/coordinator/coordinator.go](https://github.com/chroma-core/chroma/blob/main/go/pkg/sysdb/coordinator/coordinator.go)
- [idl/chromadb/proto/chroma.proto](https://github.com/chroma-core/chroma/blob/main/idl/chromadb/proto/chroma.proto)
- [rust/system/src/lib.rs](https://github.com/chroma-core/chroma/blob/main/rust/system/src/lib.rs)
</details>

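# System Architecture

## Overview

Chroma is an open-source AI vector database designed for storing and retrieving high-dimensional vector data. The system has a distributed architecture in which several core components cooperate to provide efficient vector storage, indexing, and querying.

The architecture follows microservice design principles: functional modules are separated into independent service components that communicate across processes over gRPC. The system is multi-tenant and can serve multiple tenants and databases from a single deployment.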

## Architecture Diagram

```mermaid
graph TD
    subgraph Client["Client Layer"]
        JS[JavaScript Client]
        Python[Python Client]
    end

    subgraph Frontend["Frontend Service"]
        API[REST/gRPC API]
        Auth[Authentication Module]
    end

    subgraph Worker["Worker Service"]
        QueryEngine[Query Engine]
        Indexing[Indexing Service]
        Blockstore[Block Storage]
    end

    subgraph Coordinator["Coordinator"]
        SysDB[System Database]
        TenantMgr[Tenant Management]
        CollectionMgr[Collection Management]
    end

    subgraph Storage["Storage Layer"]
        Arrow[Arrow Blockfiles]
        HNSW[HNSW Index]
        Log[Write-Ahead Log]
    end

    JS --> API
    Python --> API
    API --> Auth
    Auth --> QueryEngine
    QueryEngine --> Indexing
    Indexing --> Blockstore
    Coordinator --> SysDB
    QueryEngine --> Coordinator
    Blockstore --> Arrow
    Indexing --> HNSW
```

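## Core Components

### Frontend Service

The frontend service is the entry point to the Chroma system. It handles client requests and performs initial validation. The source shows that the frontend module is implemented in Rust and exposes the core API.

**Main responsibilities:**
- Receive and validate client requests
- Handle authentication and authorization
- Route requests and balance load
- Return responses in a standardized format

**Module location:** `rust/frontend/src/lib.rs`

### Worker Service

Workers are the core compute nodes in Chroma and carry out the actual vector operations. Each worker node can process query and indexing tasks independently.

**Main responsibilities:**
- Run vector similarity search
- Manage HNSW indexes
- Apply metadata filters
- Store and retrieve block data

**Core submodules:**

| Module | Function | Source location |
|------|------|----------|
| Query engine | Handles vector query requests | `rust/worker/src/lib.rs` |
| Indexing service | Manages HNSW/SPANN indexes | `rust/index/src/` |
| Block storage | Persists vector data | `rust/blockstore/src/` |

**Block storage architecture:**

```mermaid
graph LR
    subgraph Blockfile["Blockfile Structure"]
        Header[File Header]
        Body[Data Body]
        Footer[File Footer]
    end

    subgraph ArrowFormat["Arrow IPC Format"]
        RecordBatch[Record Batch]
        Schema[Schema Metadata]
    end

    Header --> RecordBatch
    RecordBatch --> Body
    Body --> Footer
```

Chroma uses the Apache Arrow format as its underlying storage format, which supports efficient columnar storage and reads. The blockfile is the basic unit of data storage and consists of three parts: a file header, a data body, and a file footer.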

Sources: [rust/blockstore/src/arrow/root.rs:1-30](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/root.rs)

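### Coordinator

The coordinator is the system's central control component and is implemented in Go. It manages system metadata and global state.

**Main responsibilities:**
- Manage tenant and database metadata
- Coordinate collection creation and deletion
- Maintain system topology information
- Handle distributed transactions

**Coordinator structure:**

| Component | Function | Description |
|------|------|------|
| SysDB | System database | Stores metadata for tenants, databases, and collections |
| TenantManager | Tenant manager | Manages the tenant lifecycle |
| CollectionManager | Collection manager | Manages collection creation, deletion, and sharding |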

Sources: [go/pkg/sysdb/coordinator/coordinator.go:1-50](https://github.com/chroma-core/chroma/blob/main/go/pkg/sysdb/coordinator/coordinator.go)

## Communication Protocol

### gRPC Interface Definition

Chroma defines its service interfaces with Protocol Buffers, which enables type-safe communication across languages.

**Core service definition:**

```protobuf
// Source: idl/chromadb/proto/chroma.proto
service Chroma {
    rpc CreateCollection(CreateCollectionRequest) returns (CreateCollectionResponse);
    rpc GetCollection(GetCollectionRequest) returns (GetCollectionResponse);
    rpc DeleteCollection(DeleteCollectionRequest) returns (DeleteCollectionResponse);
    rpc Add(AddRequest) returns (AddResponse);
    rpc Get(GetRequest) returns (GetResponse);
    rpc Query(QueryRequest) returns (QueryResponse);
}
```

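**Query request parameters:**

| Parameter | Type | Description |
|--------|------|------|
| `query_embeddings` | Vec\<Vec\<float\>\> | Query vectors |
| `n_results` | int | Number of results to return |
| `where` | Where | Metadata filter conditions |
| `include` | Vec\<string\> | Fields to include in the response |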

Sources: [idl/chromadb/proto/chroma.proto:1-100](https://github.com/chroma-core/chroma/blob/main/idl/chromadb/proto/chroma.proto)

### Message Types

The system defines a rich set of message types for exchanging data between clients and the server.

```mermaid
classDiagram
    class GetRequest {
        +collection_id: String
        +ids: Vec~String~
        +where: Option~Where~
        +include: Vec~String~
    }
    class QueryRequest {
        +collection_id: String
        +query_embeddings: Vec~Vec~float~~
        +n_results: int
        +where: Option~Where~
        +query_texts: Option~Vec~String~~
    }
    class GetResponse {
        +ids: Vec~String~
        +embeddings: Option~Vec~Vec~float~~~
        +documents: Option~Vec~String~~
        +metadatas: Option~Vec~Metadata~~
    }
```

## Data Model

### Collection

A collection is the basic unit of data organization in Chroma. Each collection holds a set of related vector data.

```rust
// Collection attributes
struct Collection {
    id: CollectionUuid,           // unique identifier
    name: String,                // collection name
    tenant_id: String,           // tenant identifier
    database_name: String,       // database name
    dimension: Option<i32>,      // vector dimension
    get_index: Option<Index>,    // index configuration
    metadata: Option<CollectionMetadata>,
    created_at: Timestamp,
    version: i32,
}
```

Sources: [rust/types/src/collection_schema.rs:1-80](https://github.com/chroma-core/chroma/blob/main/rust/types/src/collection_schema.rs)

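### Index Configuration

Chroma supports several index types to optimize query performance.

| Index type | Description | Use case |
|----------|------|----------|
| HNSW | Hierarchical Navigable Small World graph | High-accuracy vector search |
| SPANN | Disk-based vector index | Very large vector datasets |
| String Inverted Index | Inverted index over strings | Metadata filtering |
| Vector Index | Vector index | Similarity search |

**Example vector index configuration:**

```rust
VectorIndexConfig {
    space: Some(Space::Cosine),      // distance metric
    embedding_function: None,        // embedding function
    source_key: None,               // source key
    hnsw: None,                     // HNSW parameters
    spann: None,                    // SPANN parameters
}
```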

Sources: [rust/types/src/collection_schema.rs:100-150](https://github.com/chroma-core/chroma/blob/main/rust/types/src/collection_schema.rs)
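
To show what a cosine distance metric computes, the sketch below implements it in plain Python together with an exhaustive (brute-force) nearest neighbor search. It assumes the common definition of cosine distance as one minus cosine similarity. An HNSW or SPANN index approximates this exact ranking far more efficiently, and the function names here are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_query(query: Sequence[float],
                      embeddings: Dict[str, Sequence[float]],
                      n_results: int) -> List[Tuple[str, float]]:
    """Score every stored vector and return the n_results closest (id, distance) pairs."""
    scored = sorted((cosine_distance(query, vec), rid) for rid, vec in embeddings.items())
    return [(rid, dist) for dist, rid in scored[:n_results]]


store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nearest = brute_force_query([1.0, 0.0], store, n_results=2)
```

Because the brute-force scan touches every vector, its cost grows linearly with collection size, which is the motivation for approximate indexes.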

### Filter Conditions

Chroma supports rich metadata filtering.

```mermaid
graph TD
    A[Where filter] --> B[MetadataComparison]
    A --> C[WhereDocumentOperator]
    B --> D[Primitive]
    B --> E[Set Operation]
    D --> F[Equal / NotEqual]
    D --> G[Greater / Less]
    D --> H[Contains / StartsWith]
    C --> I[Contains / NotContains]
    C --> J[Regex / NotRegex]
```

**Supported comparison operations:**

| Operator | Description | Example |
|----------|------|------|
| `$eq` | Equal to | `{"key": {"$eq": "value"}}` |
| `$ne` | Not equal to | `{"key": {"$ne": 10}}` |
| `$gt` | Greater than | `{"key": {"$gt": 5}}` |
| `$lt` | Less than | `{"key": {"$lt": 100}}` |
| `$gte` | Greater than or equal to | `{"key": {"$gte": 1}}` |
| `$lte` | Less than or equal to | `{"key": {"$lte": 10}}` |

Sources: [rust/types/src/metadata.rs:1-80](https://github.com/chroma-core/chroma/blob/main/rust/types/src/metadata.rs)
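
The operator semantics in the table can be sketched as a small matcher over a metadata dictionary. This is an illustrative model rather than Chroma's filter engine: `matches_where` is a hypothetical function, a bare value is treated as shorthand for `$eq`, and a record that lacks the key never matches, which may differ from Chroma's actual behavior.

```python
import operator
from typing import Any, Dict

_COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches_where(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Return True if the metadata satisfies every condition in the where clause."""
    for key, condition in where.items():
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # bare value: shorthand for equality
        if key not in metadata:
            return False  # assumption: missing keys never match
        for op_name, operand in condition.items():
            if op_name not in _COMPARATORS:
                raise ValueError(f"unsupported operator: {op_name}")
            if not _COMPARATORS[op_name](metadata[key], operand):
                return False
    return True


records = [
    {"id": "id1", "year": 2019, "source": "docs"},
    {"id": "id2", "year": 2021, "source": "api"},
    {"id": "id3", "year": 2023, "source": "docs"},
]
selected = [r["id"] for r in records if matches_where(r, {"year": {"$gte": 2020}, "source": "docs"})]
```

Multiple keys in the same clause are combined with a logical AND in this sketch, as are multiple operators under one key.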

## System Topology

Chroma supports multi-region deployment, allowing worker nodes to be deployed in regions of different cloud providers.

```mermaid
graph TD
    subgraph Topology["System Topology"]
        subgraph Region1["Region 1"]
            WR1[Worker Region 1]
        end
        subgraph Region2["Region 2"]
            WR2[Worker Region 2]
        end
        subgraph RegionN["Region N"]
            WR3[Worker Region N]
        end
    end

    Coordinator --> WR1
    Coordinator --> WR2
    Coordinator --> WR3
```

**Topology configuration structures:**

```rust
pub struct Topology<T: Clone + Debug> {
    pub name: TopologyName,           // topology name
    pub regions: Vec<RegionName>,    // regions included in the topology
    pub config: T,                   // topology configuration
}

pub struct ProviderRegion<T> {
    pub name: RegionName,            // region name
    pub provider: String,            // cloud provider (aws/gcp/azure)
    pub region: String,              // region identifier
    pub config: T,                   // region configuration
}
```

Sources: [rust/types/src/topology.rs:1-100](https://github.com/chroma-core/chroma/blob/main/rust/types/src/topology.rs)

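## Request Processing Flow

### Client Request Flow

```mermaid
sequenceDiagram
    participant Client
    participant Frontend
    participant Coordinator
    participant Worker
    participant Storage

    Client->>Frontend: Send request
    Frontend->>Frontend: Authenticate and validate
    Frontend->>Coordinator: Fetch metadata
    Coordinator->>Frontend: Return collection info
    Frontend->>Worker: Route query request
    Worker->>Worker: Run vector search
    Worker->>Storage: Read block data
    Storage->>Worker: Return data
    Worker->>Frontend: Return results
    Frontend->>Client: Send response
```

### Query Request Processing

1. **Request validation**: the frontend receives the query request and validates its parameters
2. **Metadata lookup**: the collection's configuration is fetched from the coordinator
3. **Index selection**: the indexes to scan are determined from the query conditions
4. **Vector search**: the HNSW (or other) index search runs on the worker nodes
5. **Result aggregation**: results from multiple workers are merged and sorted
6. **Metadata filtering**: the `where` conditions are applied to filter the final results
7. **Response construction**: the response is built according to the `include` parameter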

Sources: [rust/system/src/lib.rs:1-50](https://github.com/chroma-core/chroma/blob/main/rust/system/src/lib.rs)
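
The result aggregation step can be sketched as a k-way merge of per-worker result lists. The code below is an illustration under stated assumptions, not Chroma's orchestration code: each worker is assumed to return `(distance, id)` pairs already sorted by ascending distance, and `merge_worker_results` is a hypothetical name.

```python
import heapq
from typing import List, Tuple

Scored = Tuple[float, str]  # (distance, record id)


def merge_worker_results(per_worker: List[List[Scored]], n_results: int) -> List[Scored]:
    """Merge sorted per-worker results and keep the n_results closest unique ids."""
    merged: List[Scored] = []
    seen = set()
    # heapq.merge lazily yields items from the sorted inputs in global order.
    for distance, record_id in heapq.merge(*per_worker):
        if record_id in seen:
            continue  # the same record may be reported by more than one worker
        seen.add(record_id)
        merged.append((distance, record_id))
        if len(merged) == n_results:
            break
    return merged


worker_a = [(0.1, "a"), (0.4, "d")]
worker_b = [(0.2, "b"), (0.3, "c")]
top = merge_worker_results([worker_a, worker_b], n_results=3)
```

Because the inputs are already sorted, the merge stops as soon as `n_results` items are collected and never needs to sort the full combined list.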

## Deployment Architecture

### Single-Node Deployment

In development and test environments, Chroma can run in single-node mode, with all components running in the same process.

```yaml
# Example single-node configuration
chroma:
  server:
    host: "0.0.0.0"
    port: 8000
  storage:
    type: "local"
    path: "/chroma_db"
```

### Distributed Deployment

A distributed architecture is recommended for production:

| Component | Replicas | Responsibility |
|------|--------|------|
| Frontend | 2-3 | Highly available API service |
| Coordinator | 3 | Highly available metadata management |
| Worker | N | Elastically scalable compute nodes |

```mermaid
graph TB
    subgraph LoadBalancer["Load Balancer"]
        LB[Load Balancer]
    end

    subgraph FrontendCluster["Frontend Cluster"]
        FE1[Frontend 1]
        FE2[Frontend 2]
    end

    subgraph CoordinatorCluster["Coordinator Cluster"]
        C1[Coordinator 1]
        C2[Coordinator 2]
        C3[Coordinator 3]
    end

    subgraph WorkerCluster["Worker Cluster"]
        W1[Worker 1]
        W2[Worker 2]
        W3[Worker N]
    end

    LB --> FE1
    LB --> FE2
    FE1 --> C1
    FE2 --> C1
    C1 <--> C2
    C2 <--> C3
    FE1 --> W1
    FE1 --> W2
    FE2 --> W2
    FE2 --> W3
```

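## Key Technical Features

### Arrow Storage Format

Chroma uses Apache Arrow as its underlying storage format, which brings the following advantages:

- **Columnar storage**: supports efficient column projection
- **Zero-copy reads**: avoids unnecessary data copies
- **Uniform data interface**: usable from many programming languages
- **Compression friendly**: easy to apply compression algorithms

```rust
// Arrow blockfile read path (simplified excerpt; error handling elided)
let mut reader = arrow::ipc::reader::FileReader::try_new(&mut cursor, None)?;
let record_batch = reader.next();  // read the next Record Batch
let block_ids = Self::block_ids_from_record_batch(&record_batch, version);
```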

Sources: [rust/blockstore/src/arrow/root.rs:20-40](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/root.rs)

### Multi-Language Client Support

Chroma provides SDKs for several languages:

| Language | Package | Source location |
|------|------|----------|
| Python | `chromadb` | `clients/python/` |
| JavaScript | `chromadb` | `clients/js/` |
| Go | `chromadb-go` | `clients/go/` |

**Python client example:**

```python
import chromadb

# The Python package exposes Client(), not a ChromaClient class
client = chromadb.Client()
collection = client.create_collection("my_collection")
collection.add(
    ids=["1", "2"],
    embeddings=[[1.0, 2.0], [3.0, 4.0]],
    documents=["doc1", "doc2"]
)
results = collection.query(
    query_embeddings=[[1.0, 2.0]],
    n_results=1
)
```

Sources: [clients/new-js/packages/chromadb/README.md:1-50](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/chromadb/README.md)

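## Summary

Chroma's architecture follows the design ideas of modern distributed systems. Separating the frontend service, worker nodes, and coordinator gives the system scalability and high availability. The Arrow storage format makes data operations more efficient, while gRPC and Protocol Buffers provide interoperability across languages.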

---

<a id='page-data-flow'></a>

## Data Flow and Processing Pipeline

### Related Pages

Related topics: [System Architecture](#page-architecture), [Query Execution Engine](#page-query-execution)

<details>
<summary>Relevant source files</summary>

The following files were used as context for generating this wiki page:

- [rust/blockstore/src/arrow/root.rs](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/root.rs)
- [rust/types/src/collection_schema.rs](https://github.com/chroma-core/chroma/blob/main/rust/types/src/collection_schema.rs)
- [rust/types/src/execution/operator.rs](https://github.com/chroma-core/chroma/blob/main/rust/types/src/execution/operator.rs)
- [rust/types/src/topology.rs](https://github.com/chroma-core/chroma/blob/main/rust/types/src/topology.rs)
- [rust/blockstore/src/arrow/block/types.rs](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/block/types.rs)
- [rust/types/src/metadata.rs](https://github.com/chroma-core/chroma/blob/main/rust/types/src/metadata.rs)
- [rust/types/src/sparse_posting_block.rs](https://github.com/chroma-core/chroma/blob/main/rust/types/src/sparse_posting_block.rs)
- [rust/types/src/api_types.rs](https://github.com/chroma-core/chroma/blob/main/rust/types/src/api_types.rs)
- [rust/worker/src/compactor/scheduler.rs](https://github.com/chroma-core/chroma/blob/main/rust/worker/src/compactor/scheduler.rs)
- [rust/blockstore/src/arrow/provider.rs](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/provider.rs)
- [rust/index/src/spann/types.rs](https://github.com/chroma-core/chroma/blob/main/rust/index/src/spann/types.rs)
- [README.md](https://github.com/chroma-core/chroma/blob/main/README.md)
- [rust/chroma/README.md](https://github.com/chroma-core/chroma/blob/main/rust/chroma/README.md)
- [rust/wal3/README.md](https://github.com/chroma-core/chroma/blob/main/rust/wal3/README.md)
</details>

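# Data Flow and Processing Pipeline

## Overview

Chroma is an open-source vector database for AI applications. Its core architecture is organized around three stages: data storage, index construction, and query processing. The data flow covers the full lifecycle from ingestion to query response, including data encoding, block storage, index management, execution plan generation, and multi-stage query processing.

The data flow follows distributed-systems principles: it supports a multi-tenant, multi-database layout and uses a topology to manage how data is distributed across regions. Sources: [rust/types/src/topology.rs:1-100](https://github.com/chroma-core/chroma/blob/main/rust/types/src/topology.rs)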

## Core Architecture Components

### System Layers

```mermaid
graph TD
    A[API Layer] --> B[Worker Service]
    B --> C[Execution Engine]
    C --> D[Index Layer]
    D --> E[Blockstore]
    E --> F[Arrow Storage]
    
    G[SysDB] --> B
    H[Log Service] --> B
    I[Compactor Scheduler] --> B
```

Processing involves several cooperating subsystems:

| Component | Responsibility | Main file |
|------|------|----------|
| API Layer | Receives requests, validates parameters | rust/types/src/api_types.rs |
| Worker Service | Task orchestration, execution scheduling | rust/worker/src/execution/orchestration/mod.rs |
| Execution Engine | Execution plans, operator processing | rust/types/src/execution/operator.rs |
| Index Layer | Vector indexes (HNSW/SPANN) | rust/index/src/spann/types.rs |
| Blockstore | Block storage, Arrow IPC | rust/blockstore/src/arrow/provider.rs |
| Compactor | Compaction, garbage collection | rust/worker/src/compactor/scheduler.rs |

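## Data Ingestion

### Write Request Handling

Clients submit write requests through the API. The data is validated and preprocessed before it enters the write path. Chroma supports batched writes, and `BlockfileWriter` can write data in either ordered or unordered mode. Sources: [rust/blockstore/src/arrow/provider.rs:1-80](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/provider.rs)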

```mermaid
graph LR
    A[Add Request] --> B[Validation]
    B --> C[Embedding Generation]
    C --> D[Schema Validation]
    D --> E[BlockfileWriter]
    E --> F[Arrow IPC Format]
    F --> G[Block Storage]
```

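### Blockfile Write Modes

`BlockfileWriter` supports two write-ordering modes:

| Mode | Description | Use case |
|------|------|----------|
| Ordered | Ordered writes that preserve data order | Queries that require exact ordering |
| Unordered | Unordered writes with higher throughput | Bulk imports, high-throughput writes |

When a blockfile is created, the `fork_from` parameter forks it from an existing blockfile, which is the basis for snapshot and branching operations. Sources: [rust/blockstore/src/arrow/provider.rs:50-100](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/provider.rs)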

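## Index Construction and Storage

### Schema and Index Configuration

Chroma uses a Schema to define a collection's data structure and supports creating several kinds of indexes: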

```rust
// Index configuration example
Schema::default()
    .create_index(None, VectorIndexConfig {
        space: Some(Space::Cosine),
        embedding_function: None,
        source_key: None,
        hnsw: None,
        spann: None,
    }.into())?
    .create_index(Some("category"), StringInvertedIndexConfig {}.into())?;
```

Sources: [rust/types/src/collection_schema.rs:1-80](https://github.com/chroma-core/chroma/blob/main/rust/types/src/collection_schema.rs)

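### Arrow IPC Storage Format

Chroma stores data blocks in the Apache Arrow IPC format, which offers:

- **Memory-mapping support**: allows efficient access to large datasets
- **Type safety**: strongly typed columnar storage
- **Cross-language compatibility**: interoperable with other tools in the Arrow ecosystem

Block storage includes layout and metadata checks to protect data integrity: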

```rust
pub enum ArrowLayoutVerificationError {
    BufferLengthNotAligned,
    NoRecordBatches,
    MultipleRecordBatches,
    InvalidMessageType,
    RecordBatchDecodeError,
}
```

Sources: [rust/blockstore/src/arrow/block/types.rs:1-80](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/block/types.rs)

### Root Management and Versioning

The Root is a core storage unit in Chroma. Each Root carries version information and a list of block IDs:

```rust
pub(super) fn get_all_block_ids_from_bytes(
    bytes: &[u8],
    id: Uuid,
) -> Result<Vec<Uuid>, FromBytesError> {
    let arrow_reader = arrow::ipc::reader::FileReader::try_new(&mut cursor, None);
    // Version validation and block ID extraction (elided in this excerpt)
}
```

Sources: [rust/blockstore/src/arrow/root.rs:1-60](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/root.rs)

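## Execution Engine Architecture

### Execution Plan Structure

Query execution follows an operator-graph model: each query is compiled into an execution plan: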

```mermaid
graph TD
    A[Query Input] --> B[Plan Generation]
    B --> C[Operator Tree]
    C --> D[Execution]
    D --> E[Result Assembly]
    E --> F[Response]
```

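An execution plan defines the full execution logic of a query, including filtering, projection, and ordering. Sources: [rust/types/src/execution/plan.rs:1-50](https://github.com/chroma-core/chroma/blob/main/rust/types/src/execution/plan.rs)

### Search Operators and Result Handling

A search returns a `SearchPayloadResult` containing the list of matching records: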

```rust
#[derive(Clone, Debug, Default)]
pub struct SearchPayloadResult {
    pub records: Vec<SearchRecord>,
}
```

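Each batch of search results also carries a pulled-log-bytes metric, used for internal performance monitoring. Sources: [rust/types/src/execution/operator.rs:1-80](https://github.com/chroma-core/chroma/blob/main/rust/types/src/execution/operator.rs)

## Query Processing

### Query Request Structure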

```mermaid
sequenceDiagram
    Client->>API: Query Request
    API->>Validation: Validate Parameters
    Validation->>Execution: Create Execution Plan
    Execution->>Index: Search Index
    Index->>Execution: Merge Results
    Execution->>API: SearchPayloadResult
    API->>Client: Query Response
```

### Include List and Response Fields

The `include` parameter selects which fields a query returns:

```rust
pub enum Include {
    Distance,
    Document,
    Embedding,
    Metadata,
    Uri,
}

pub struct IncludeList(pub Vec<Include>);

impl IncludeList {
    pub fn default_query() -> Self {
        Self(vec![
            Include::Document,
            Include::Metadata,
            Include::Distance,
        ])
    }
}
```

Sources: [rust/types/src/api_types.rs:1-100](https://github.com/chroma-core/chroma/blob/main/rust/types/src/api_types.rs)

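### Metadata Filtering

Alongside metadata filters, Chroma supports operators that match on document content:

| Operator | Description | Example |
|-----------|------|------|
| `Contains` | Document contains the string | `{"$contains": "keyword"}` |
| `NotContains` | Document does not contain the string | `{"$not_contains": "spam"}` |
| `Regex` | Document matches the regular expression | `{"$regex": "^[A-Z].*"}` |
| `NotRegex` | Document does not match the regular expression | `{"$not_regex": "^test"}` |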

Sources: [rust/types/src/metadata.rs:1-80](https://github.com/chroma-core/chroma/blob/main/rust/types/src/metadata.rs)
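
The semantics of the four document operators in the table can be sketched in a few lines of Python. This is a minimal illustration, not Chroma's implementation: `matches_document` is a hypothetical helper, and it uses Python's `re` module, whose regex syntax may differ from the engine Chroma uses on the server side.

```python
import re


def matches_document(document: str, condition: dict) -> bool:
    """Evaluate a single-operator document condition (hypothetical helper)."""
    if len(condition) != 1:
        raise ValueError("expected exactly one operator")
    (operator, operand), = condition.items()
    if operator == "$contains":
        return operand in document
    if operator == "$not_contains":
        return operand not in document
    if operator == "$regex":
        # re.search finds a match anywhere unless the pattern is anchored
        return re.search(operand, document) is not None
    if operator == "$not_regex":
        return re.search(operand, document) is None
    raise ValueError(f"unsupported operator: {operator}")


docs = ["Alpha release notes", "test plan for beta", "spam filter design"]
capitalized = [d for d in docs if matches_document(d, {"$regex": "^[A-Z].*"})]
```

Each condition carries exactly one operator, which mirrors the single-key dictionaries in the example column above.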

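## Vector Index System

### SPANN Index

SPANN is a partition-based approximate nearest neighbor (ANN) index. As the structures below show, Chroma's implementation stores dense embeddings (`Vec<f32>`) in posting lists and pairs them with an HNSW index: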

```rust
pub struct SpannPosting {
    pub doc_offset_id: u32,
    pub doc_embedding: Vec<f32>,
}

pub struct SpannIndexReader<'me> {
    pub posting_lists: BlockfileReader<'me, u32, SpannPostingList<'me>>,
    pub hnsw_index: HnswIndexRef,
    pub versions_map: BlockfileReader<'me, u32, u32>,
    pub dimensionality: usize,
    pub adaptive_search_nprobe: bool,
}
```

Sources: [rust/index/src/spann/types.rs:1-80](https://github.com/chroma-core/chroma/blob/main/rust/index/src/spann/types.rs)

### Sparse Posting Block

The sparse index uses a `DirectoryBlock` to hold directory information for its posting lists:

```text
body = [ max_offset: u32 LE, max_weight: f32 LE ] × num_entries
```

The `max_weight` values in the directory block store dimension-level maximum weights, which are used for early term pruning. Sources: [rust/types/src/sparse_posting_block.rs:1-80](https://github.com/chroma-core/chroma/blob/main/rust/types/src/sparse_posting_block.rs)
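
The body layout above, a sequence of `(u32, f32)` little-endian pairs, can be reproduced with Python's `struct` module. The sketch below covers only the body (the header is omitted), and `blocks_to_scan` is a hypothetical illustration of how per-entry maximum weights could support pruning; it is not taken from the Chroma source.

```python
import struct

# One directory entry: max_offset (u32) then max_weight (f32), little-endian
ENTRY = struct.Struct("<If")


def pack_directory_body(max_offsets, max_weights):
    """Serialize parallel offset/weight lists into the body byte layout."""
    if len(max_offsets) != len(max_weights):
        raise ValueError("mismatched lengths")
    return b"".join(ENTRY.pack(o, w) for o, w in zip(max_offsets, max_weights))


def unpack_directory_body(body):
    """Parse the body back into (max_offset, max_weight) tuples."""
    return [ENTRY.unpack_from(body, i) for i in range(0, len(body), ENTRY.size)]


def blocks_to_scan(entries, threshold):
    """Indices of posting blocks whose max_weight can still reach threshold."""
    return [i for i, (_, weight) in enumerate(entries) if weight >= threshold]


body = pack_directory_body([127, 255, 511], [0.5, 2.0, 0.25])
entries = unpack_directory_body(body)
candidates = blocks_to_scan(entries, threshold=1.0)
```

Each entry occupies 8 bytes, so a reader can locate entry `i` at byte offset `8 * i` without parsing the preceding entries.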

## Distributed Topology Management

### ProviderRegion Structure

Chroma uses a topology to manage data distribution across regions:

```rust
pub struct ProviderRegion<T: Clone + Debug> {
    pub name: RegionName,
    pub provider: String,    // e.g. "aws", "gcp"
    pub region: String,     // e.g. "us-east-1"
    pub config: T,
}
```

Each `ProviderRegion` has a unique name and carries region-level configuration for a cloud provider. Sources: [rust/types/src/topology.rs:1-80](https://github.com/chroma-core/chroma/blob/main/rust/types/src/topology.rs)

### Topology Structure

```rust
pub struct Topology<T: Clone + Debug> {
    pub name: TopologyName,
    regions: Vec<RegionName>,
    pub config: T,
}
```

A `Topology` groups multiple `ProviderRegion`s, giving a global view of data distribution. Sources: [rust/types/src/topology.rs:80-150](https://github.com/chroma-core/chroma/blob/main/rust/types/src/topology.rs)

## Compaction and Background Processing

### Compactor Scheduler

The compactor handles compaction and garbage collection:

```mermaid
graph TD
    A[Scheduler] --> B[Job Queue]
    A --> C[Assignment Policy]
    B --> D[Collections]
    D --> E[Compaction Jobs]
    E --> F[Log Truncation]
    F --> G[Garbage Collection]
```

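Core scheduling parameters:

| Parameter | Description |
|------|------|
| `max_concurrent_jobs` | Maximum number of concurrent jobs |
| `min_compaction_size` | Minimum compaction size |
| `job_expiry_seconds` | Job expiry time, in seconds |
| `max_failure_count` | Maximum number of failed attempts |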

Sources: [rust/worker/src/compactor/scheduler.rs:1-100](https://github.com/chroma-core/chroma/blob/main/rust/worker/src/compactor/scheduler.rs)

### WAL3 Log System

Chroma uses WAL3 (Write-Ahead Log 3) to manage write logs:

```text
wal3/
├── log/Bucket=XXXXX/FragmentSeqNo=XXXXX.parquet
├── manifest/MANIFEST
├── snapshot/SNAPSHOT.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
└── garbage/GARBAGE
```

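The WAL3 write path:

1. The writer calls `push_work` to submit work to the fragment manager
2. Once the batched work reaches a threshold, the fragment manager allocates a fragment
3. The data is flushed to object storage, and `assign_timestamp` is called to update the manifest
4. A manifest change record is created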

Sources: [rust/wal3/README.md:1-100](https://github.com/chroma-core/chroma/blob/main/rust/wal3/README.md)

## Error Handling

### ChromaError trait

Chroma error types implement the `ChromaError` trait, which maps each error to one of the `ErrorCodes`:

```rust
pub enum ErrorCodes {
    Internal,        // Internal error
    InvalidArgument, // Invalid argument
    NotFound,        // Resource not found
    AlreadyExists,   // Resource already exists
    // ... other error codes
}
```

Example error mapping:

```rust
match self {
    BlockLoadError::IOError(_) => ErrorCodes::Internal,
    BlockLoadError::ArrowError(_) => ErrorCodes::Internal,
    BlockLoadError::NoRecordBatches => ErrorCodes::Internal,
    BlockLoadError::CacheError(_) => ErrorCodes::Internal,
}
```

Sources: [rust/blockstore/src/arrow/block/types.rs:80-120](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/block/types.rs)

## End-to-End Data Flow Diagram

```mermaid
graph TD
    subgraph Client
        A[HTTP/SDK Request]
    end
    
    subgraph API
        B[Request Validation]
        C[Parameter Parsing]
    end
    
    subgraph Processing
        D[Execution Orchestration]
        E[Plan Generation]
        F[Operator Execution]
    end
    
    subgraph Storage
        G[Blockfile Provider]
        H[Arrow IPC Storage]
        I[WAL3 Log]
    end
    
    subgraph Index
        J[Vector Index HNSW/SPANN]
        K[Inverted Index]
    end
    
    subgraph Background
        L[Compactor Scheduler]
        M[Garbage Collection]
    end
    
    A --> B --> C --> D
    D --> E --> F
    F --> J
    F --> K
    G --> H
    G --> I
    L --> M
    M --> H
    
    F --> N[Query Response]
```

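## Summary

Chroma's data flow spans the full path from client request to persistent storage. Its main characteristics are:

1. **Layered architecture**: clear separation of the API, execution engine, storage, and index layers
2. **Arrow IPC format**: efficient, cross-language columnar storage
3. **Multiple index types**: HNSW and SPANN vector indexes, sparse indexes, and inverted indexes
4. **Distributed topology**: multi-region, multi-tenant data distribution
5. **Background processing**: compaction scheduling and garbage collection keep the system healthy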

Sources: [README.md:1-50](https://github.com/chroma-core/chroma/blob/main/README.md), [rust/chroma/README.md:1-80](https://github.com/chroma-core/chroma/blob/main/rust/chroma/README.md)

---

<a id='page-vector-index'></a>

## Vector Index System

### Related Pages

Related topics: [Query Execution Engine](#page-query-execution), [Storage System](#page-storage)

<details>
<summary>Relevant source files</summary>

The following files were used as context for generating this wiki page:

- [rust/index/src/hnsw.rs](https://github.com/chroma-core/chroma/blob/main/rust/index/src/hnsw.rs)
- [rust/index/src/spann.rs](https://github.com/chroma-core/chroma/blob/main/rust/index/src/spann.rs)
- [rust/index/src/quantization/mod.rs](https://github.com/chroma-core/chroma/blob/main/rust/index/src/quantization/mod.rs)
- [rust/index/src/sparse/mod.rs](https://github.com/chroma-core/chroma/blob/main/rust/index/src/sparse/mod.rs)
- [rust/segment/src/local_hnsw.rs](https://github.com/chroma-core/chroma/blob/main/rust/segment/src/local_hnsw.rs)
- [rust/index/src/spann/types.rs](https://github.com/chroma-core/chroma/blob/main/rust/index/src/spann/types.rs)
- [rust/blockstore/src/arrow/provider.rs](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/provider.rs)
</details>

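# Vector Index System

## Overview

Chroma's vector index system is a layered approximate nearest neighbor (ANN) search infrastructure for storing, indexing, and retrieving large volumes of vector data. It combines several index strategies: HNSW (Hierarchical Navigable Small World graphs), SPANN (a partition-based index built on posting lists), quantization, and a sparse index module. Each offers a different trade-off between performance and accuracy.

The vector index system sits at the core of the Chroma architecture. It organizes high-dimensional vectors into index structures that can be queried efficiently, and it supports distance functions such as cosine distance, inner product, and Euclidean distance.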

Sources: [rust/index/src/hnsw.rs:1-50](https://github.com/chroma-core/chroma/blob/main/rust/index/src/hnsw.rs)

## System Architecture

### Overall Architecture Diagram

```mermaid
graph TD
    subgraph "Index layer"
        HNSW[HNSW index module]
        Spann[SPANN index]
        Quantization[Quantization module]
        SparseIndex[Sparse index module]
    end
    
    subgraph "Storage layer"
        BlockfileProvider[BlockfileProvider]
        BlockfileWriter[BlockfileWriter]
        ArrowOrdered[ArrowOrderedBlockfileWriter]
        ArrowUnordered[ArrowUnorderedBlockfileWriter]
    end
    
    subgraph "Core components"
        HnswIndexProvider[HnswIndexProvider]
        HnswIndexRef[HnswIndexRef]
        SpannIndexReader[SpannIndexReader]
        SpannIndexWriter[SpannIndexWriter]
    end
    
    HNSW --> HnswIndexProvider
    HNSW --> HnswIndexRef
    Spann --> SpannIndexReader
    Spann --> SpannIndexWriter
    BlockfileProvider --> BlockfileWriter
    BlockfileWriter --> ArrowOrdered
    BlockfileWriter --> ArrowUnordered
    HnswIndexProvider --> BlockfileProvider
```

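### Index Type Comparison

| Index type | Purpose | Typical use | Core data structure |
|---------|------|---------|-------------|
| HNSW | Efficient ANN search | Dense vectors, high-recall ranking | Layered graph |
| SPANN | Partitioned ANN search | Large-scale dense vectors | Posting lists + HNSW |
| Quantization | Vector compression | Large datasets, memory-constrained settings | PQ/SQ quantization |
| Sparse Index | Sparse feature indexing | Keyword search, recommender systems | Inverted index |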

Sources: [rust/index/src/spann/types.rs:1-80](https://github.com/chroma-core/chroma/blob/main/rust/index/src/spann/types.rs)

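## HNSW Index Module

### HNSW Algorithm Overview

HNSW (Hierarchical Navigable Small World) is a graph-based approximate nearest neighbor search algorithm that builds a multi-layer navigable small-world graph for efficient search. In Chroma it serves as the default vector index implementation, offering a good balance between search accuracy and performance.

### HnswIndexProvider

`HnswIndexProvider` manages the lifecycle of HNSW indexes: creating, opening, closing, and deleting them.

```rust
// Core signature of the index open operation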
pub async fn open(
    &self,
    id: &IndexUuid,
    cache_key: &CollectionUuid,
    dimensionality: i32,
    distance_function: DistanceFunction,
    ef_search: usize,
    prefix_path: &str,
) -> Result<HnswIndexRef, HnswIndexReaderError>
```

Sources: [rust/index/src/spann/types.rs:45-65](https://github.com/chroma-core/chroma/blob/main/rust/index/src/spann/types.rs)

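### Key Configuration Parameters

| Parameter | Type | Description | Default |
|------|------|------|--------|
| dimensionality | i32 | Vector dimensionality | Required |
| distance_function | DistanceFunction | Distance function | Required |
| ef_search | usize | Candidate set size during search | Configurable |
| prefix_path | &str | Storage path prefix | Required |

### HnswIndexRef Structure

`HnswIndexRef` is the reference type for an HNSW index, used to access index data at query time. It holds an owning reference to the index, which keeps the index valid for the duration of a query.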

Sources: [rust/index/src/spann/types.rs:50](https://github.com/chroma-core/chroma/blob/main/rust/index/src/spann/types.rs)

## SPANN Vector Index

### SPANN Architecture

SPANN is Chroma's hybrid index that combines posting lists with an HNSW graph. It consists of three core components:

```mermaid
graph LR
    subgraph "SpannIndex"
        PostingList[Posting Lists]
        HNSWGraph[HNSW Graph]
        VersionsMap[Versions Map]
    end
    
    subgraph "Data structures"
        SpannPosting[SpannPosting<br/>doc_offset_id<br/>doc_embedding]
        SpannPostingList[SpannPostingList]
    end
    
    PostingList --> SpannPostingList
    HNSWGraph --> HnswIndexRef
```

### SpannPosting Structure

```rust
#[derive(Debug)]
pub struct SpannPosting {
    pub doc_offset_id: u32,      // Document offset ID
    pub doc_embedding: Vec<f32>, // Document embedding vector
}
```

Sources: [rust/index/src/spann/types.rs:38-42](https://github.com/chroma-core/chroma/blob/main/rust/index/src/spann/types.rs)

### SpannIndexReader

`SpannIndexReader` provides read-only access to a SPANN index and has these core fields:

| Field | Type | Description |
|------|------|------|
| posting_lists | `BlockfileReader<u32, SpannPostingList>` | Posting list reader |
| hnsw_index | `HnswIndexRef` | Reference to the HNSW graph index |
| versions_map | `BlockfileReader<u32, u32>` | Versions map |
| dimensionality | `usize` | Vector dimensionality |
| adaptive_search_nprobe | `bool` | Adaptive search switch |
| params | `InternalSpannConfiguration` | Internal configuration parameters |

Sources: [rust/index/src/spann/types.rs:44-54](https://github.com/chroma-core/chroma/blob/main/rust/index/src/spann/types.rs)

### SpannIndexWriter

`SpannIndexWriter` handles writes to a SPANN index. The signature of its core constructor is:

```rust
pub async fn from_id(
    hnsw_provider: &HnswIndexProvider,
    hnsw_id: Option<&IndexUuid>,
    versions_map_id: Option<&Uuid>,
    posting_list_id: Option<&Uuid>,
    max_head_id_bf_id: Option<&Uuid>,
    collection_id: &CollectionUuid,
    prefix_path: &str,
    dimensionality: usize,
    blockfile_provider: &BlockfileProvider,
    params: InternalSpannConfiguration,
    gc_context: GarbageCollectionContext,
    pl_block_size: usize,
    metrics: SpannMetrics,
    cmek: Option<Cmek>,
) -> Result<Self, SpannIndexWriterError>
```

Sources: [rust/index/src/spann/types.rs:80-130](https://github.com/chroma-core/chroma/blob/main/rust/index/src/spann/types.rs)

### SpannPostingList Management

The SPANN index stores posting lists in blockfiles, which handle their serialization and deserialization:

```rust
pub async fn create_postings_list_writer(
    blockfile_provider: &BlockfileProvider,
    prefix_path: &str,
    pl_block_size: usize,
    cmek: Option<Cmek>,
) -> Result<BlockfileWriter, SpannIndexWriterError>
```

The writer accepts a maximum block size (`pl_block_size`) and an optional CMEK (customer-managed encryption key).

Sources: [rust/index/src/spann.rs:1-100](https://github.com/chroma-core/chroma/blob/main/rust/index/src/spann.rs)

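## Quantization Module

### Overview

The quantization module compresses vectors by converting high-precision floating-point vectors into lower-precision representations. This significantly reduces storage and memory use while preserving search accuracy as far as possible.

### Quantization Configuration

| Option | Description |
|--------|------|
| QuantizationType | Quantization type (PQ/SQ, etc.) |
| BucketCount | Number of quantization buckets |
| CodeWidth | Code width |
| DistanceTable | Distance lookup table |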

Sources: [rust/index/src/quantization/mod.rs:1-50](https://github.com/chroma-core/chroma/blob/main/rust/index/src/quantization/mod.rs)

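### Quantization Flow

```mermaid
graph LR
    A[Original vectors] --> B[Train quantizer]
    B --> C[Compute codebook]
    C --> D[Encode vectors]
    D --> E[Store compressed vectors]
    E --> F[Decode at query time]
    F --> G[Compute distances]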
```
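
To make the train, encode, decode, and distance steps concrete, here is a minimal sketch of 8-bit scalar quantization in plain Python. It is a generic textbook technique chosen for illustration, not Chroma's quantizer; all function names are hypothetical, and the "codebook" is simply a per-dimension min/max range.

```python
def train(vectors):
    """Learn per-dimension [low, high] ranges from training vectors."""
    dims = range(len(vectors[0]))
    lows = [min(v[d] for v in vectors) for d in dims]
    highs = [max(v[d] for v in vectors) for d in dims]
    return lows, highs


def encode(vector, lows, highs, levels=256):
    """Map each float to an integer code in [0, levels - 1]; one byte per dimension."""
    codes = []
    for x, lo, hi in zip(vector, lows, highs):
        span = (hi - lo) or 1.0  # avoid division by zero for constant dimensions
        clamped = min(max(x, lo), hi)
        codes.append(round((clamped - lo) / span * (levels - 1)))
    return bytes(codes)


def decode(codes, lows, highs, levels=256):
    """Reconstruct approximate floats from the integer codes."""
    return [
        lo + code / (levels - 1) * ((hi - lo) or 1.0)
        for code, lo, hi in zip(codes, lows, highs)
    ]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


vectors = [[0.0, 10.0], [1.0, 20.0], [0.5, 15.0]]
lows, highs = train(vectors)
stored = [encode(v, lows, highs) for v in vectors]  # 2 bytes per vector
query = [0.4, 14.0]
distances = [squared_l2(query, decode(c, lows, highs)) for c in stored]
```

The reconstruction error per dimension is bounded by half a quantization step, which is the accuracy cost paid for storing one byte instead of a 4-byte float.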

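## Sparse Index Module

### Sparse Index Design

The sparse index handles sparse feature vectors, typically for keyword matching, recommender systems, and hybrid search.

### DirectoryBlock Structure

The sparse index uses a `DirectoryBlock` to organize posting blocks: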

```rust
pub struct DirectoryBlock(SparsePostingBlock);

impl DirectoryBlock {
    pub fn new(
        max_offsets: &[u32],  // Maximum offset of each posting block
        max_weights: &[f32],  // Maximum weight of each posting block
    ) -> Result<Self, SparsePostingBlockError>
}
```

Binary layout of a `DirectoryBlock`:
- **Header**: magic bytes + block count + dimension-level max weight
- **Body**: `[max_offset: u32 LE, max_weight: f32 LE] × num_entries`

Sources: [rust/types/src/sparse_posting_block.rs:1-60](https://github.com/chroma-core/chroma/blob/main/rust/types/src/sparse_posting_block.rs)

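### Sparse Index Error Handling

| Error type | Description |
|---------|------|
| MismatchedLengths | The offset and weight arrays have different lengths |
| TooManyEntries | The number of entries exceeds the `u16::MAX` limit |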

Sources: [rust/types/src/sparse_posting_block.rs:30-45](https://github.com/chroma-core/chroma/blob/main/rust/types/src/sparse_posting_block.rs)

## Blockfile Storage Layer

### BlockfileProvider Architecture

The blockfile is Chroma's low-level storage abstraction, wrapping persistent storage in the Arrow IPC format.

```mermaid
graph TD
    BlockfileProvider --> ArrowOrderedBlockfileWriter
    BlockfileProvider --> ArrowUnorderedBlockfileWriter
    BlockfileProvider --> BlockfileReader
    
    ArrowOrderedBlockfileWriter["ArrowOrderedBlockfileWriter<br/>ordered writes"]
    ArrowUnorderedBlockfileWriter["ArrowUnorderedBlockfileWriter<br/>unordered writes"]
    
    BlockfileWriterOptions --> BlockfileProvider
```

Sources: [rust/blockstore/src/arrow/provider.rs:1-80](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/provider.rs)

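### Write Modes

| Mode | Description | Use case |
|------|------|---------|
| Ordered | Sequential writes that preserve append order | Logs, transactional data |
| Unordered | Unordered writes optimized for batches | Bulk imports, index rebuilds |

### Fork Operation

A blockfile can be forked from an existing blockfile, for example to create an index snapshot or to restore from a checkpoint: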

```rust
pub async fn fork<K: Key>(
    &self,
    fork_from: &Uuid,
    new_id: Uuid,
    prefix_path: &Path,
    max_block_size_bytes: usize,
) -> Result<Root, Box<Error>>
```

Sources: [rust/blockstore/src/arrow/provider.rs:30-60](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/provider.rs)

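### Block Size Configuration

| Option | Description | Default |
|--------|------|--------|
| max_block_size_bytes | Maximum size of a single data block, in bytes | BlockManager default |

## Distance Functions

The system supports several distance functions for measuring vector similarity:

| Distance function | Description | Use case |
|---------|------|---------|
| Cosine | Cosine distance | Similarity by direction, independent of vector length |
| IP | Inner product (dot product) | Similarity ranking, including unnormalized vectors |
| L2 | Euclidean distance | Geometric distance |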

Sources: [rust/index/src/spann/types.rs:60](https://github.com/chroma-core/chroma/blob/main/rust/index/src/spann/types.rs)
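
The three distance functions can be written out directly. The sketch below assumes the conventions commonly used by HNSW libraries, where L2 is the squared Euclidean distance and both inner product and cosine are turned into distances by subtracting from 1; the function names are hypothetical and the exact formulas in Chroma's Rust code should be checked against the source.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance: sum of squared component differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a, b):
    """One minus the dot product, so larger dot products mean smaller distance."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus cosine similarity; ignores vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

For unit-length vectors, cosine distance and inner-product distance coincide, which is why normalizing embeddings up front makes the two interchangeable.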

## Error Handling

### Index Error Types

| Error type | Error code | Description |
|---------|--------|------|
| VersionsMapNotFound | NotFound | The versions map does not exist |
| ScanHnswError | Internal | HNSW scan error |
| DataInconsistencyError | Internal | Data inconsistency |
| RngError | Internal | Random number generator error |

Sources: [rust/index/src/spann/types.rs:20-30](https://github.com/chroma-core/chroma/blob/main/rust/index/src/spann/types.rs)

## Index Lifecycle

```mermaid
stateDiagram-v2
    [*] --> Created: create index
    Created --> Open: open index
    Open --> Query: run query
    Open --> Write: write data
    Write --> Open: write complete
    Query --> Open: query complete
    Open --> Close: close index
    Close --> [*]: release resources
    Open --> Compact: compact
    Compact --> Open: compaction complete
```

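## Performance Tuning

### HNSW Search Parameters

| Parameter | Description | Suggested value |
|------|------|--------|
| ef_search | Number of nearest-neighbor candidates visited during search | 100-500 |
| ef_construction | Number of candidates used during index construction | 100-200 |
| m | Number of connections per node in each layer | 16-64 |

### Adaptive Search

The SPANN index supports the `adaptive_search_nprobe` parameter. When it is enabled, the system adjusts the search scope dynamically based on query results, balancing accuracy and performance automatically.

## Summary

Chroma's vector index system uses a modular design to provide a complete vector search stack:

1. The **HNSW module** provides the basic ANN search capability
2. The **SPANN module** combines posting lists with an HNSW graph
3. The **quantization module** supports compressed storage for large datasets
4. The **sparse index module** handles high-dimensional sparse features
5. The **blockfile storage layer** provides persistence and efficient reads

Together these strategies balance storage efficiency and query performance while preserving search accuracy, covering scenarios from prototyping to production deployment.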

---

<a id='page-query-execution'></a>

## Query Execution Engine

### Related Pages

Related topics: [Vector Index System](#page-vector-index), [Python Client](#page-python-client)

<details>
<summary>Relevant source files</summary>

The following files were used as context for generating this wiki page:

- [rust/system/src/execution/operator.rs](https://github.com/chroma-core/chroma/blob/main/rust/system/src/execution/operator.rs)
- [rust/types/src/execution/operator.rs](https://github.com/chroma-core/chroma/blob/main/rust/types/src/execution/operator.rs)
- [rust/worker/src/execution/operators/knn_hnsw.rs](https://github.com/chroma-core/chroma/blob/main/rust/worker/src/execution/operators/knn_hnsw.rs)
- [rust/worker/src/execution/operators/filter.rs](https://github.com/chroma-core/chroma/blob/main/rust/worker/src/execution/operators/filter.rs)
- [rust/execution/expression/plan.py](https://github.com/chroma-core/chroma/blob/main/rust/execution/expression/plan.py)
</details>

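# Query Execution Engine

## Overview

Chroma's query execution engine is a core component that handles vector search, metadata filtering, and document queries. It is built around an operator pattern that breaks complex query flows into composable execution units, supporting flexible query planning and efficient result handling.

Its main responsibilities are:

- **Vector nearest-neighbor search**: efficient similarity search through indexes such as HNSW
- **Metadata filtering**: conditional filtering on scalar and array metadata values
- **Document full-text search**: contains and not-contains queries on document content
- **Result aggregation**: merging search results from multiple data segments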

Sources: [rust/types/src/execution/operator.rs:1-50]()

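## Architecture

### Core Components

The query execution engine is made up of the following core components:

| Component | Responsibility | Location |
|------|------|------|
| Operator trait | Defines the common behavior and error handling of operators | rust/system/src/execution/operator.rs |
| Type definitions | Define data structures such as search results and records | rust/types/src/execution/operator.rs |
| KNN operator | Implements HNSW-based vector search | rust/worker/src/execution/operators/knn_hnsw.rs |
| Filter operator | Handles metadata and document filter conditions | rust/worker/src/execution/operators/filter.rs |
| Execution plan | Query planning at the Python layer | rust/execution/expression/plan.py |

### Data Flow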

### 数据流架构

```mermaid
graph TD
    A[Query request] --> B[Execution plan parsing]
    B --> C[Filter operator]
    C --> D[KNN operator HNSW]
    D --> E[Result aggregation]
    E --> F[SearchPayloadResult]
    F --> G[Return results]
    
    C -.->|pre-filter| H[Metadata index]
    D -.->|vector search| I[HNSW index]
    
    H --> J[Blockfile]
    I --> J
```

Sources: [rust/execution/expression/plan.py:1-100]()

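## Operator Interface Design

### Operator Trait

All operators in the query execution engine implement a common `Operator` interface, which defines their shared behavior:

```rust
// Pseudocode showing the shape of the trait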
pub trait Operator<Input, Output> {
    type Error: ChromaError;
    
    async fn execute(&self, input: Input) -> Result<Output, Self::Error>;
}
```

Operator error types implement the `ChromaError` trait, which supplies the error-code mapping and the tracing setting:

```rust
impl ChromaError for OperatorError {
    fn code(&self) -> ErrorCodes {
        match self {
            Self::Internal(_) => ErrorCodes::Internal,
            Self::InvalidArgument(_) => ErrorCodes::InvalidArgument,
        }
    }
    
    fn should_trace_error(&self) -> bool {
        true
    }
}
```

Sources: [rust/system/src/execution/operator.rs:1-80]()

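## Data Structures

### Search Result Types

The query execution engine uses the following core data structures:

#### SearchPayloadResult

The result for a single search payload: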

```rust
#[derive(Clone, Debug, Default)]
pub struct SearchPayloadResult {
    pub records: Vec<SearchRecord>,
}
```

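It holds the list of records matched by the search; each record carries embedding, document, and metadata information.

#### SearchResult

The result of a batch search operation: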

```rust
/// Results from a batch search operation.
/// 
/// Contains results for each search payload in the batch, maintaining the same order
/// as the input searches.
pub struct SearchResult {
    pub results: Vec<SearchPayloadResult>,
    pub pulled_log_bytes: u64,  // Internal metric: bytes pulled from the log
}
```

Sources: [rust/types/src/execution/operator.rs:1-60]()

### Where Filter Conditions

The `Where` enum type supports several kinds of filter conditions:

| Variant | Description | Example |
|------|------|------|
| `Direct(value)` | Direct comparison | `id == "xxx"` |
| `Key(metadata)` | Metadata field comparison | `metadata.age > 21` |
| `Document(expr)` | Document content match | `contains("keyword")` |
| `And(left, right)` | Logical AND | `age > 18 AND active == true` |
| `Or(left, right)` | Logical OR | `tag == "A" OR tag == "B"` |

#### Document Operators

```rust
#[derive(Clone, Debug, PartialEq)]
pub enum DocumentOperator {
    Contains,      // Contains
    NotContains,   // Does not contain
    Regex,         // Matches the regular expression
    NotRegex,      // Does not match the regular expression
}
```

Sources: [rust/types/src/metadata.rs:1-50]()
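
The way `And` and `Or` nest around metadata comparisons can be sketched as a small recursive evaluator. The sketch uses the dictionary form of filters from the Python client (`$and`, `$or`, `$gt`, and so on); `matches_where` is a hypothetical helper for illustration, and treating a missing key as a non-match is an assumption rather than documented behavior.

```python
import operator

# Comparison operators mapped to their Python equivalents
COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches_where(metadata: dict, where: dict) -> bool:
    """Recursively evaluate a where filter against one record's metadata."""
    if "$and" in where:
        return all(matches_where(metadata, clause) for clause in where["$and"])
    if "$or" in where:
        return any(matches_where(metadata, clause) for clause in where["$or"])
    (key, condition), = where.items()
    if not isinstance(condition, dict):
        condition = {"$eq": condition}  # shorthand: {"tag": "A"} means equality
    (op, operand), = condition.items()
    if key not in metadata:
        return False  # assumption: missing keys never match
    return COMPARATORS[op](metadata[key], operand)


record = {"age": 30, "active": True, "tag": "A"}
adult_and_active = matches_where(
    record, {"$and": [{"age": {"$gt": 18}}, {"active": True}]}
)
tag_b_or_c = matches_where(record, {"$or": [{"tag": "B"}, {"tag": "C"}]})
```

Logical nodes simply recurse into their children, so arbitrarily deep combinations of `And` and `Or` reduce to the same leaf comparisons.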

## Core Operator Implementations

### KNN HNSW Operator

The KNN (K-Nearest Neighbors) operator uses the HNSW (Hierarchical Navigable Small World) algorithm for efficient vector similarity search.

#### Workflow

```mermaid
graph LR
    A[Query vector] --> B[HNSW graph traversal]
    B --> C[Candidate set]
    C --> D[Distance computation]
    D --> E[Top-K results]
    E --> F[Result wrapping]
```

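#### Configuration Parameters

| Parameter | Type | Description | Default |
|------|------|------|--------|
| `ef_search` | usize | Size of the dynamic candidate list during search | 100 |
| `k` | usize | Number of nearest neighbors to return | 10 |
| `distance_function` | DistanceFunction | Distance metric | L2 |
| `dimensionality` | usize | Vector dimensionality | - |

#### Key Methods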

```rust
impl<'me> SpannIndexReader<'me> {
    async fn hnsw_index_from_id(
        hnsw_provider: &HnswIndexProvider,
        id: &IndexUuid,
        cache_key: &CollectionUuid,
        distance_function: DistanceFunction,
        dimensionality: usize,
        ef_search: usize,
        prefix_path: &str,
    ) -> Result<HnswIndexRef, SpannIndexReaderError>
}
```

Sources: [rust/worker/src/execution/operators/knn_hnsw.rs:1-80]()

### Filter Operator

The filter operator applies conditions on metadata and document content.

#### Filter Types

| Filter type | Target | Supported operators |
|----------|----------|------------|
| MetadataScalar | Scalar metadata | `=`, `!=`, `>`, `<`, `>=`, `<=`, `And`, `Or` |
| MetadataArray | Array metadata | `contains_value`, `not_contains_value` |
| Document | Document full text | `contains`, `not_contains`, `regex`, `not_regex` |

#### Key Builder

The `Key` type provides a fluent API for building filter conditions:

```rust
// Scalar metadata filters
let filter = Key::field("age").gte(21);
let filter = Key::field("category").eq("books");

// Array metadata containment
let filter = Key::field("tags").contains_value("action");

// Document full-text search
let filter = Key::Document.contains("machine learning");
let filter = Key::Document.not_contains("deprecated");
```

Sources: [rust/worker/src/execution/operators/filter.rs:1-60](), [rust/types/src/execution/operator.rs:80-150]()

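## Execution Plan

### Python-Layer Planning

At the Python layer, an execution plan turns a user query into a sequence of executable operators. The class below is an illustrative sketch of that idea rather than a verbatim excerpt: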

```python
# rust/execution/expression/plan.py
class ExecutionPlan:
    def __init__(self, collection):
        self.collection = collection
        self.operators = []
    
    def add_filter(self, where):
        # Add a filter operator
        pass
    
    def add_knn(self, query_embedding, k):
        # Add a KNN search operator
        pass
```

### Query Translation Flow

```mermaid
graph TD
    A[Python API Query] --> B[WhereClause parsing]
    B --> C[ExecutionPlan construction]
    C --> D[Operator sequence generation]
    D --> E[Rust execution]
    E --> F[Protobuf serialization]
    F --> G[Return results]
```

Sources: [rust/execution/expression/plan.py:1-100]()

## Error Handling

### Error Code Mapping

| Error type | ErrorCode | Description |
|----------|-----------|------|
| IOError | Internal | I/O error |
| ArrowError | Internal | Arrow format error |
| ArrowLayoutVerificationError | Internal | Arrow layout verification failed |
| NoRecordBatches | Internal | The IPC file contains no record batches |
| BlockToBytesError | Internal | Block serialization error |
| CacheError | Internal | Cache operation error |

### Error Handling Implementation

```rust
impl ChromaError for BlockLoadError {
    fn code(&self) -> ErrorCodes {
        match self {
            BlockLoadError::IOError(_) => ErrorCodes::Internal,
            BlockLoadError::ArrowError(_) => ErrorCodes::Internal,
            BlockLoadError::ArrowLayoutVerificationError(_) => ErrorCodes::Internal,
            BlockLoadError::NoRecordBatches => ErrorCodes::Internal,
            BlockLoadError::BlockToBytesError(_) => ErrorCodes::Internal,
            BlockLoadError::CacheError(_) => ErrorCodes::Internal,
        }
    }
}
```

Sources: [rust/blockstore/src/arrow/block/types.rs:1-50]()

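## Performance Optimization

### Index Structures

Chroma uses several layers of index structures to speed up queries:

1. **Blockfile**: the underlying storage, supporting efficient block reads and writes
2. **HNSW index**: the core of vector search, with search cost that grows roughly logarithmically with collection size
3. **Sparse index**: a sparse vector index, used for efficient range lookups

### Execution Strategies

| Strategy | When to use | Benefit |
|------|----------|------|
| Pre-filtering | Highly selective filter conditions | Shrinks the search space |
| Post-filtering | Loosely selective filter conditions | Improves search parallelism |
| Hybrid filtering | Complex queries | Balances accuracy and performance |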
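
The difference between pre-filtering and post-filtering is easiest to see on a brute-force example. The sketch below is a generic illustration with hypothetical names (`pre_filter_search`, `post_filter_search`, `overfetch`); it does not reproduce how Chroma's operators choose between the strategies.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn(records, query, k):
    """Brute-force k nearest records by squared L2 distance."""
    return heapq.nsmallest(k, records, key=lambda r: squared_l2(r["embedding"], query))


def pre_filter_search(records, query, k, predicate):
    """Filter first, then search only the surviving records."""
    return knn([r for r in records if predicate(r)], query, k)


def post_filter_search(records, query, k, predicate, overfetch=2):
    """Search first (fetching extra candidates), then drop non-matching records."""
    candidates = knn(records, query, k * overfetch)
    return [r for r in candidates if predicate(r)][:k]


records = [
    {"id": "a", "embedding": [0.0, 0.0], "lang": "en"},
    {"id": "b", "embedding": [0.1, 0.0], "lang": "de"},
    {"id": "c", "embedding": [0.2, 0.0], "lang": "de"},
    {"id": "d", "embedding": [5.0, 5.0], "lang": "en"},
]
is_english = lambda r: r["lang"] == "en"
pre = [r["id"] for r in pre_filter_search(records, [0.0, 0.0], 2, is_english)]
post = [r["id"] for r in post_filter_search(records, [0.0, 0.0], 2, is_english, overfetch=1)]
```

With no over-fetching, the post-filter run returns fewer than `k` results because the nearest candidates are mostly filtered out; this is the usual reason post-filtering suits loose filters and pre-filtering suits selective ones.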

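### Caching

The execution engine uses several layers of caching to improve performance:

- **HNSW index cache**: avoids reloading indexes
- **Block cache**: reduces disk I/O
- **Query result cache**: speeds up repeated queries

## Extensibility

### Custom Operators

Custom operators can be added by implementing the `Operator` trait: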

```rust
pub trait Operator<Input, Output> {
    type Error: ChromaError;
    
    async fn execute(&self, input: Input) -> Result<Output, Self::Error>;
}
```

### Operator Composition

Because operators are composable, they can be chained into complex query pipelines:

```mermaid
graph LR
    A[Query] --> B[Filter1]
    B --> C[Filter2]
    C --> D[KNNSearch]
    D --> E[Rank]
    E --> F[Project]
    F --> G[Result]
```
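
The pipeline in the diagram can be modeled as a chain of functions, each taking and returning a list of records. This is a conceptual sketch in Python with hypothetical stage builders (`filter_stage`, `rank_stage`, `project_stage`); the real operators are asynchronous Rust types, and the `distance` field here stands in for the output of a KNN search.

```python
from functools import reduce


def pipeline(*stages):
    """Compose stages left to right into a single callable."""
    def run(records):
        return reduce(lambda acc, stage: stage(acc), stages, records)
    return run


def filter_stage(predicate):
    """Keep only records that satisfy the predicate."""
    return lambda records: [r for r in records if predicate(r)]


def rank_stage(key, k):
    """Sort ascending by key and keep the first k records."""
    return lambda records: sorted(records, key=key)[:k]


def project_stage(*fields):
    """Keep only the named fields of each record."""
    return lambda records: [{f: r[f] for f in fields} for r in records]


records = [
    {"id": "1", "year": 2021, "distance": 0.9, "lang": "en"},
    {"id": "2", "year": 2019, "distance": 0.1, "lang": "en"},
    {"id": "3", "year": 2022, "distance": 0.4, "lang": "fr"},
    {"id": "4", "year": 2023, "distance": 0.2, "lang": "en"},
]
query = pipeline(
    filter_stage(lambda r: r["lang"] == "en"),
    filter_stage(lambda r: r["year"] >= 2020),
    rank_stage(key=lambda r: r["distance"], k=1),
    project_stage("id"),
)
result = query(records)
```

Because every stage shares the same input and output shape, stages can be reordered or swapped without changing the others, which is the property the operator trait provides in the engine.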

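## Summary

Chroma's query execution engine combines an operator pattern with a layered architecture to process queries efficiently and flexibly. Its key design points are:

- **A common operator interface**: simplifies implementing and composing operators
- **Multi-level index support**: HNSW plus blockfiles provide efficient vector and scalar lookups
- **Flexible result aggregation**: supports batch search and result merging
- **Thorough error handling**: fine-grained error codes and tracing

This architecture lets Chroma handle large volumes of vector data and complex query conditions while remaining extensible.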

---

<a id='page-python-client'></a>

## Python Client

### Related Pages

Related topics: [JavaScript/TypeScript Client](#page-javascript-client), [Embedding Function Integrations](#page-embedding-functions)

<details>
<summary>Relevant source files</summary>

The following files were used as context for generating this wiki page:

- [chromadb/api/client.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/client.py)
- [chromadb/api/async_client.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/async_client.py)
- [chromadb/api/models/Collection.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/models/Collection.py)
- [chromadb/api/models/AsyncCollection.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/models/AsyncCollection.py)
- [chromadb/api/fastapi.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/fastapi.py)
</details>

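# Python Client

## Overview

The Chroma Python client is the official Python SDK and provides a complete interface for working with the Chroma vector database. It supports two modes of operation, embedded mode and client-server mode, to suit different deployment scenarios.

The client's core API consists of only four main functions, a small interface that covers document management, vector embedding, and similarity search. Sources: [README.md]()

## Architecture

### Overall Architecture

The Python client uses a layered design built from the following core components: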

```mermaid
graph TD
    A[Python Client] --> B[Sync Client]
    A --> C[Async Client]
    B --> D[Collection]
    C --> E[AsyncCollection]
    D --> F[Embedding Functions]
    E --> F
    B --> G[FastAPI Client]
    C --> G
    G --> H[Chroma Server]
```

### Client Types

| Client type | File | Purpose | Use case |
|-----------|---------|------|---------|
| `Client` | `chromadb/api/client.py` | Synchronous client | Standard Python applications |
| `AsyncClient` | `chromadb/api/async_client.py` | Asynchronous client | Async web frameworks, high-concurrency applications |
| `FastAPI` | `chromadb/api/fastapi.py` | HTTP implementation of the client API | Connecting to a Chroma server |

Sources: [chromadb/api/client.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/client.py), [chromadb/api/async_client.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/async_client.py), [chromadb/api/fastapi.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/fastapi.py)

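## Synchronous Client

### Initialization and Configuration

The synchronous client is created with `chromadb.Client()` for in-memory embedded mode, or with `chromadb.PersistentClient()` for embedded mode with on-disk persistence.

```python
import chromadb

# Embedded in-memory mode (ephemeral data)
client = chromadb.Client()

# Embedded mode with persistence
client = chromadb.PersistentClient(path="/path/to/chroma_db")
```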

### 核心功能

**Collection管理**

Collection是Chroma中组织文档和向量数据的基本单元。客户端提供了完整的Collection操作接口：

```python
# Create a collection
collection = client.create_collection(
    name="my-documents",
    metadata={"description": "Document collection"}
)

# Get an existing collection
collection = client.get_collection(name="my-documents")

# Get or create a collection
collection = client.get_or_create_collection(name="my-documents")

# List all collections
collections = client.list_collections()

# Delete a collection
client.delete_collection(name="my-documents")
```

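Sources: [chromadb/api/client.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/client.py), [chromadb/api/models/Collection.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/models/Collection.py)

## Asynchronous Client

### Async Support

`AsyncClient` provides a fully asynchronous interface for `asyncio` programs and async web frameworks such as FastAPI. It talks to a Chroma server over HTTP and is created with the awaitable `chromadb.AsyncHttpClient()` factory.

```python
import asyncio

import chromadb


async def main() -> None:
    # Create the async client (assumes a Chroma server on localhost:8000)
    client = await chromadb.AsyncHttpClient(host="localhost", port=8000)

    # Create a collection asynchronously
    collection = await client.create_collection(name="async-collection")


asyncio.run(main())
```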

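### Performance Characteristics

The comparison below is a rule of thumb rather than a benchmark; actual behavior depends on the client version and the HTTP configuration in use.

| Aspect | Synchronous client | Asynchronous client |
|-----|-----------|-----------|
| Concurrency | Each call blocks the calling thread | Many requests can be in flight concurrently |
| Typical frameworks | Django, Flask | FastAPI, aiohttp |
| Connection reuse | New connection per request | Long-lived connections are reused |
| Memory footprint | Lower | Slightly higher |

Sources: [chromadb/api/async_client.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/async_client.py), [chromadb/api/models/AsyncCollection.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/models/AsyncCollection.py)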

## Collection Operations

### Data Model

A collection stores the following kinds of data for each record:

```python
collection.add(
    documents=["Document text 1", "Document text 2"],      # document text
    metadatas=[{"source": "doc1"}, {"source": "doc2"}],  # metadata
    ids=["id1", "id2"],                    # unique identifiers
    embeddings=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # optional: precomputed vectors
)
```

### Insert Operations

| Parameter | Type | Required | Description |
|-----|------|-----|------|
| `ids` | List[str] | Yes | Unique identifier for each record |
| `documents` | List[str] | No | Document text |
| `embeddings` | List[List[float]] | No | Vector embeddings |
| `metadatas` | List[dict] | No | Metadata dictionaries |

Sources: [chromadb/api/models/Collection.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/models/Collection.py)
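
Because every argument of `add()` is a list, large datasets are usually inserted in batches rather than one record at a time. The helper below is a minimal sketch of that idea in plain Python; `batched_records` and the batch size are hypothetical names chosen for illustration and are not part of the Chroma API.

```python
from typing import Iterator, List, Sequence, Tuple


def batched_records(
    ids: Sequence[str],
    documents: Sequence[str],
    batch_size: int,
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (ids, documents) slices of at most batch_size records each."""
    if len(ids) != len(documents):
        raise ValueError("ids and documents must have the same length")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield list(ids[start:end]), list(documents[start:end])


# Hypothetical sample data; each yielded pair could be passed to
# collection.add(ids=batch_ids, documents=batch_docs).
batches = list(
    batched_records(["a", "b", "c"], ["doc a", "doc b", "doc c"], batch_size=2)
)
```

Keeping `ids` and `documents` sliced with the same indices preserves the positional pairing that `add()` relies on.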

### Query Operations

```python
# Similarity query from text
results = collection.query(
    query_texts=["query text"],
    n_results=2,  # return the 2 most similar results
    where={"source": "doc1"},  # metadata filter
    where_document={"$contains": "keyword"}  # document content filter
)

# Similarity query from a vector
results = collection.query(
    query_embeddings=[[1.0, 2.0, 3.0]],
    n_results=2
)
```

### Filtering

Queries can be restricted by metadata (`where`) and by document content (`where_document`):

```python
# Metadata filter
collection.query(
    query_texts=["query"],
    where={"metadata_field": "value"}
)

# Document content filter
collection.query(
    query_texts=["query"],
    where_document={"$contains": "search string"}
)
```

Sources: [chromadb/api/models/Collection.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/models/Collection.py)
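
To make the semantics of the two filters concrete, here is a toy model in plain Python. It is a sketch under stated assumptions, not Chroma's implementation: it only models equality on metadata keys and the `$contains` operator, and the `matches` function and the sample records are hypothetical.

```python
from typing import Any, Dict, Optional


def matches(
    record: Dict[str, Any],
    where: Optional[Dict[str, Any]] = None,
    where_document: Optional[Dict[str, str]] = None,
) -> bool:
    """Return True if the record passes both filters (toy model)."""
    metadata = record.get("metadata", {})
    # Metadata filter: every requested key must equal the requested value.
    if where is not None:
        for key, expected in where.items():
            if metadata.get(key) != expected:
                return False
    # Document filter: only the $contains operator is modelled here.
    if where_document is not None:
        needle = where_document.get("$contains")
        if needle is not None and needle not in record.get("document", ""):
            return False
    return True


# Hypothetical records for illustration.
records = [
    {"id": "id1", "document": "vector databases store embeddings",
     "metadata": {"source": "doc1"}},
    {"id": "id2", "document": "relational databases store rows",
     "metadata": {"source": "doc2"}},
]
hits = [
    r["id"]
    for r in records
    if matches(r, where={"source": "doc1"}, where_document={"$contains": "vector"})
]
```

Both filters narrow the candidate set; similarity ranking then applies only to the records that pass.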

## Embedding Functions

### Embedding Function Architecture

```mermaid
graph LR
    A[Input Text] --> B[Embedding Function]
    B --> C[Vector Embedding]
    C --> D[Storage/Indexing]
```

### Built-in Embedding Functions

The Chroma Python client supports embedding functions from several providers:

| Provider | Module | Default model |
|-------|------|---------|
| OpenAI | `chromadb.utils.embedding_functions` | text-embedding-ada-002 |
| Cohere | `chromadb.utils.embedding_functions` | embed-english-v3.0 |
| Hugging Face | `chromadb.utils.embedding_functions` | sentence-transformers |
| Ollama | `chromadb.utils.embedding_functions` | User-specified model |

### Specifying an Embedding Function

```python
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

embedder = OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-3-small"
)

collection = client.create_collection(
    name="my-collection",
    embedding_function=embedder
)
```

Sources: [chromadb/utils/embedding_functions/schemas/README.md](https://github.com/chroma-core/chroma/blob/main/chromadb/utils/embedding_functions/schemas/README.md)

## Usage Modes

### Embedded Mode

Suited to local development and testing. Data lives in memory or on the local file system:

```python
import chromadb

# In-memory mode
client = chromadb.Client()

# Local persistent mode
client = chromadb.PersistentClient(path="./chroma_db")
```

Sources: [README.md](https://github.com/chroma-core/chroma/blob/main/README.md)

### Client-Server Mode

Suited to production and distributed deployments:

```bash
# Start the Chroma server
chroma run --path /chroma_db_path --host 0.0.0.0 --port 8000
```

```python
import chromadb

# Connect to the remote server
client = chromadb.HttpClient(host="localhost", port=8000)
```

Sources: [README.md](https://github.com/chroma-core/chroma/blob/main/README.md)

## API Reference

### Client Methods

#### `chromadb.Client()`

Creates a Chroma client instance.

#### `client.create_collection(name, metadata, embedding_function)`

Creates a new collection.

| Parameter | Type | Description |
|-----|------|------|
| `name` | str | Collection name |
| `metadata` | dict | Collection metadata |
| `embedding_function` | function | Embedding function |

#### `client.get_collection(name)`

Returns the collection with the given name.

#### `client.list_collections()`

Returns a list of all collections.

#### `client.delete_collection(name)`

Deletes the given collection.

### Collection Methods

| Method | Description |
|-----|------|
| `add()` | Add records |
| `get()` | Retrieve records |
| `update()` | Update records |
| `upsert()` | Insert or update records |
| `delete()` | Delete records |
| `query()` | Similarity query |
| `peek()` | Preview the first N records |
| `count()` | Count records |
| `modify()` | Modify collection metadata |

Sources: [chromadb/api/models/Collection.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/models/Collection.py)

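## Best Practices

### Performance

1. **Batch operations**: prefer adding records in batches over adding them one at a time.
2. **Reuse embedding functions**: specify the embedding function when creating the collection to avoid repeated initialization.
3. **Index tuning**: set the `hnsw:space` parameter to the distance metric that fits your embeddings.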
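
As a rough guide to what the `hnsw:space` choice means, the sketch below computes the three commonly documented distances (`l2` as squared Euclidean distance, `ip` as one minus the inner product, `cosine` as one minus cosine similarity) in plain Python. The function names are hypothetical and the formulas are stated here as an assumption about the metric definitions rather than taken from this page's sources.

```python
import math
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (the "l2" space)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the dot product (the "ip" space)."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus cosine similarity (the "cosine" space)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# The vectors from the add() example earlier on this page.
d_l2 = squared_l2([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
d_cos = cosine_distance([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

Cosine distance ignores vector length, so it is a common choice when embeddings are not normalized; squared L2 and inner product are sensitive to magnitude.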

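### Data Organization

1. **Sensible collection boundaries**: split collections by topic or data type.
2. **Metadata design**: metadata drives filtering, so give it a clear, consistent structure.
3. **ID management**: use meaningful IDs so records are easy to trace.

### Error Handling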

```python
try:
    collection = client.get_collection(name="non-existent")
except Exception as e:
    print(f"Error: {e}")
```

## Summary

The Chroma Python client offers a compact interface for managing vector data. With both synchronous and asynchronous clients, flexible collection management, and a wide range of embedding functions, developers can quickly build AI applications on top of vector retrieval. Sources: [chromadb/api/client.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/client.py), [chromadb/api/async_client.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/async_client.py), [chromadb/api/models/Collection.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/models/Collection.py), [chromadb/api/models/AsyncCollection.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/models/AsyncCollection.py), [chromadb/api/fastapi.py](https://github.com/chroma-core/chroma/blob/main/chromadb/api/fastapi.py)

---

<a id='page-javascript-client'></a>

## JavaScript/TypeScript Client

### Related Pages

Related topics: [Python Client](#page-python-client)

<details>
<summary>Relevant source files</summary>

The following files were used as context for generating this wiki page:

- [clients/js/packages/chromadb-core/src/ChromaClient.ts](https://github.com/chroma-core/chroma/blob/main/clients/js/packages/chromadb-core/src/ChromaClient.ts)
- [clients/new-js/packages/chromadb/src/chroma-client.ts](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/chromadb/src/chroma-client.ts)
- [clients/new-js/packages/chromadb/src/api/client.gen.ts](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/chromadb/src/api/client.gen.ts)
- [clients/js/packages/chromadb-client/src/index.ts](https://github.com/chroma-core/chroma/blob/main/clients/js/packages/chromadb-client/src/index.ts)
- [clients/new-js/packages/chromadb/package.json](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/chromadb/package.json)
- [clients/js/packages/chromadb/package.json](https://github.com/chroma-core/chroma/blob/main/clients/js/packages/chromadb/package.json)
- [clients/js/packages/chromadb-client/package.json](https://github.com/chroma-core/chroma/blob/main/clients/js/packages/chromadb-client/package.json)
- [clients/js/packages/chromadb-core/package.json](https://github.com/chroma-core/chroma/blob/main/clients/js/packages/chromadb-core/package.json)
- [clients/new-js/packages/ai-embeddings/all/package.json](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/all/package.json)
- [clients/new-js/packages/ai-embeddings/common/package.json](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/common/package.json)
</details>

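# JavaScript/TypeScript Client

The ChromaDB JavaScript/TypeScript client is the official JavaScript/TypeScript SDK for the Chroma vector database and runs in Node.js and modern browsers. It provides the full interface for communicating with a Chroma server, including collection management, vector storage, and similarity queries.

## Architecture Overview

The repository contains two main implementations of the JavaScript client, in separate directories:

### Client Version Comparison

| Feature | New client (`new-js`) | Legacy client (`js`) |
|------|---------------------|------------------|
| Version | 3.4.5 | 2.4.7 |
| Module format | ESM first, CJS supported | Both ESM and CJS |
| Embedding functions | Provided as separate packages | Bundled or declared as peerDependencies |
| Build tool | tsup | tsup |
| API style | Modern Promise/async | Backward compatible |

### Code Structure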

```mermaid
graph TD
    subgraph "clients/new-js"
        A[new-js/chromadb] --> B[new-js/ai-embeddings]
        B --> C[Embedding provider packages]
        C --> D[TogetherAI]
        C --> E[GoogleGemini]
        C --> F[Jina]
        C --> G[VoyageAI]
        C --> H[SentenceTransformer]
        C --> I[BM25]
    end
    
    subgraph "clients/js"
        J[chromadb-core] --> K[chromadb-embedded]
        J --> L[chromadb-client]
    end
    
    M[Chroma Server] <--> N[HTTP API]
    A --> N
    J --> N
    K --> M
```

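## Core Components

### 1. ChromaClient (New Client)

The core class of the new client lives in `clients/new-js/packages/chromadb/src/chroma-client.ts` and is the main interface for communicating with a Chroma server.

**Main responsibilities:**

- Creating and managing collections
- Running vector queries
- Managing collection metadata
- Handling authentication (optional)

**Basic usage:**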

```typescript
import { ChromaClient } from 'chromadb';

const client = new ChromaClient({
  path: 'http://localhost:8000',
});

const collection = await client.createCollection({
  name: 'my-collection',
  embeddingFunction: embedder, // optional custom embedding function
  metadata: { 'description': 'My first collection' }
});
```

Sources: [clients/new-js/packages/chromadb/src/chroma-client.ts](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/chromadb/src/chroma-client.ts)

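### 2. ChromaClient (Legacy Client)

The legacy client lives in `clients/js/packages/chromadb-core/src/ChromaClient.ts`. It offers similar functionality with a slightly different API.

**Package comparison:**

| Package | Description | Embedding functions |
|------|------|---------|
| `chromadb` | Full package with all embedding libraries bundled | Included |
| `chromadb-client` | Client only, embedding libraries are peerDependencies | Installed separately |

Sources: [clients/js/packages/chromadb-core/src/ChromaClient.ts](https://github.com/chroma-core/chroma/blob/main/clients/js/packages/chromadb-core/src/ChromaClient.ts)

### 3. Collection Operations

A collection is the basic unit Chroma uses to organize vector data. The main collection operations are: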

```typescript
// Add documents
await collection.add({
  ids: ['1', '2', '3'],
  documents: ['First document', 'Second document', 'Third document'],
  metadatas: [{ source: 'a' }, { source: 'b' }, { source: 'c' }],
});

// Query for similar documents
const results = await collection.query({
  queryTexts: ['query text'],
  nResults: 2,
});

// Get records
const data = await collection.get({
  ids: ['1', '2'],
  where: { source: 'a' },
});

// Delete records
await collection.delete({
  where: { source: 'b' },
});
```

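## Embedding Function System

Chroma uses embedding functions to convert text into vectors. The new client ships each embedding provider as a separate package.

### Supported Embedding Providers

| Provider | Package | Default model |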
|--------|------|---------|
| Together AI | `@chroma-core/together-ai` | togethercomputer/m2-bert-80M-8k-retrieval |
| Google Gemini | `@chroma-core/google-gemini` | text-embedding-004 |
| Jina AI | `@chroma-core/jina` | jina-embeddings-v2-base-en |
| Voyage AI | `@chroma-core/voyageai` | voyage-2 |
| HuggingFace | `@chroma-core/huggingface-server` | - |
| Cohere | `@chroma-core/cohere` | - |
| Sentence Transformer | `@chroma-core/sentence-transformer` | - |
| BM25 (sparse) | `@chroma-core/chroma-bm25` | - |

Sources: [clients/new-js/packages/ai-embeddings/all/package.json](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/all/package.json)

### Embedding Function Configuration Example

```typescript
import { ChromaClient } from 'chromadb';
import { JinaEmbeddingFunction } from '@chroma-core/jina';

// Initialize the Jina embedding function
const embedder = new JinaEmbeddingFunction({
  apiKey: process.env.JINA_API_KEY,
  modelName: 'jina-embeddings-v2-base-en',
  dimensions: 768,
  normalized: true,
});

// Create the client and a collection that uses the embedding function
const client = new ChromaClient({ path: 'http://localhost:8000' });
const collection = await client.createCollection({
  name: 'my-collection',
  embeddingFunction: embedder,
});
```

### Shared Utilities Package

The `@chroma-core/ai-embeddings-common` package contains utilities shared by all embedding providers:

- API request wrappers
- Response format normalization
- Configuration validation
- AJV-based JSON Schema validation

Sources: [clients/new-js/packages/ai-embeddings/common/package.json](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/common/package.json)

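## API Client Generation

The new client generates a type-safe API client from the OpenAPI specification:

```mermaid
graph LR
    A[OpenAPI Spec] --> B[openapi-generator-plus]
    B --> C[client.gen.ts]
    C --> D[Type definitions]
    C --> E[API methods]
```

The generated file is `clients/new-js/packages/chromadb/src/api/client.gen.ts`. It contains:

- Complete TypeScript type definitions
- Request and response models
- An HTTP client wrapper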

Sources: [clients/new-js/packages/chromadb/src/api/client.gen.ts](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/chromadb/src/api/client.gen.ts)

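## Package Configuration and Exports

### Module Export Configuration

The new client uses a conditional `exports` map that supports both ESM and CommonJS: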

```json
{
  "exports": {
    ".": {
      "import": {
        "types": "./dist/chromadb.d.ts",
        "default": "./dist/chromadb.mjs"
      },
      "require": {
        "types": "./dist/cjs/chromadb.d.cts",
        "default": "./dist/cjs/chromadb.cjs"
      }
    }
  }
}
```

Sources: [clients/new-js/packages/chromadb/package.json](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/chromadb/package.json)

### Compatibility Matrix

| Node.js version | Support |
|-------------|---------|
| >= 20 | Full support (new-js) |
| >= 14.17.0 | Basic support (legacy js) |
| >= 10 | Native bindings |

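## Development and Build

### Build Commands

```bash
# Build the new client (path relative to the repository root)
cd clients/new-js/packages/chromadb
pnpm build

# Build the legacy client (path relative to the repository root)
cd clients/js
pnpm build:core
pnpm build:packages

# Install dependencies
pnpm install

# Run tests
pnpm test
```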

### Release Process

```bash
# Publish a stable release
pnpm release

# Publish an alpha release
pnpm release_alpha

# Publish a development release
pnpm release_dev
```

Sources: [clients/js/package.json](https://github.com/chroma-core/chroma/blob/main/clients/js/package.json)

## Usage Examples

### Complete Node.js Example

```typescript
import { ChromaClient } from 'chromadb';
import { JinaEmbeddingFunction } from '@chroma-core/jina';

// 1. Initialize the embedding function
const embedder = new JinaEmbeddingFunction({
  apiKey: process.env.JINA_API_KEY,
});

// 2. Create the client
const client = new ChromaClient({
  path: 'http://localhost:8000',
});

// 3. Create or get the collection
const collection = await client.getOrCreateCollection({
  name: 'documents',
  embeddingFunction: embedder,
});

// 4. Add documents
await collection.add({
  ids: ['doc1', 'doc2', 'doc3'],
  documents: [
    'Machine learning is a branch of artificial intelligence',
    'Deep learning uses neural network models',
    'Natural language processing studies human-computer interaction'
  ],
  metadatas: [
    { category: 'ai', year: 2024 },
    { category: 'ml', year: 2024 },
    { category: 'nlp', year: 2024 }
  ]
});

// 5. Query for similar documents
const results = await collection.query({
  queryTexts: ['What is deep learning?'],
  nResults: 2,
});

console.log('Similar documents:', results.documents);
console.log('Distances:', results.distances);
```

### Legacy Client Usage

```typescript
import { ChromaClient } from 'chromadb';

// Connect with the default configuration
const client = new ChromaClient();
const collection = await client.createCollection({ name: 'test' });

// Connect to a custom server address
const remoteClient = new ChromaClient({
  path: 'https://your-chroma-server.com'
});
```

Sources: [clients/js/packages/chromadb-client/src/index.ts](https://github.com/chroma-core/chroma/blob/main/clients/js/packages/chromadb-client/src/index.ts)

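## Dependency Management

### Core Dependencies

| Dependency | Purpose |
|--------|------|
| isomorphic-fetch | Cross-platform HTTP requests |
| ajv | JSON Schema validation |
| cliui | Command-line interface layout |

### Embedding Function peerDependencies (Legacy Client)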

```json
{
  "peerDependencies": {
    "@google/generative-ai": "^0.1.1",
    "@xenova/transformers": "^2.17.2",
    "chromadb-default-embed": "^2.14.0",
    "cohere-ai": "^7.0.0"
  }
}
```

Sources: [clients/js/packages/chromadb/package.json](https://github.com/chroma-core/chroma/blob/main/clients/js/packages/chromadb/package.json)

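## Testing Strategy

### Test Commands

```bash
# Run all tests
pnpm test

# Run functional tests (excludes auth tests)
pnpm test:functional

# Update snapshot tests
pnpm test:update
```

### Test Environment Requirements

- Node.js >= 14.17.0
- A Chroma server running on localhost:8000
- Optional authentication configuration for the auth tests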

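## Technology Choices

### Why tsup

The Chroma JS clients are built with tsup for the following reasons:

- **Zero configuration**: handles TypeScript and Babel automatically
- **Multiple output formats**: emits ESM, CJS, and type definitions in one build
- **Performance**: built on esbuild, so builds are fast
- **Source maps**: supported for debugging

### Why Embedding Packages Are Separate

Splitting embedding functions into separate packages gives the new client several benefits:

1. **Smaller main package**: users install only the embedding providers they need
2. **Independent versioning**: each embedding provider can be updated on its own schedule
3. **Tree-shaking**: unused embedding code can be removed at build time
4. **Flexibility**: custom embedding function implementations are supported

## Related Resources

- [Official documentation](https://docs.trychroma.com/)
- [GitHub repository](https://github.com/chroma-core/chroma)
- [API reference](./api-reference.md)
- [Embedding functions guide](./embedding-functions.md)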

---

<a id='page-embedding-functions'></a>

## Embedding Function Integration

### Related Pages

Related topics: [Python Client](#page-python-client), [JavaScript/TypeScript Client](#page-javascript-client)

<details>
<summary>Relevant source files</summary>

The following files were used as context for generating this wiki page:

- [clients/new-js/packages/ai-embeddings/common/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/common/README.md)
- [clients/new-js/packages/ai-embeddings/all/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/all/README.md)
- [clients/new-js/packages/ai-embeddings/jina/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/jina/README.md)
- [clients/new-js/packages/ai-embeddings/morph/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/morph/README.md)
- [clients/new-js/packages/ai-embeddings/chroma-cloud-qwen/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/chroma-cloud-qwen/README.md)
- [clients/new-js/src/embedding-function.ts](https://github.com/chroma-core/chroma/blob/main/clients/new-js/src/embedding-function.ts)
- [chromadb/utils/embedding_functions/schemas/README.md](https://github.com/chroma-core/chroma/blob/main/chromadb/utils/embedding_functions/schemas/README.md)
- [schemas/embedding_functions/README.md](https://github.com/chroma-core/chroma/blob/main/schemas/embedding_functions/README.md)
- [rust/types/src/api_types.rs](https://github.com/chroma-core/chroma/blob/main/rust/types/src/api_types.rs)
</details>

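# Embedding Function Integration

## Overview

Embedding function integration is the core mechanism ChromaDB uses to turn text, documents, and other data into vector representations. ChromaDB integrates with many embedding providers, including cloud services such as OpenAI, Cohere, Jina, and HuggingFace, as well as locally hosted models through Ollama and HuggingFace Server.

The embedding function design follows a cross-language compatibility principle, so the Python client and the JavaScript/TypeScript client share the same configuration schemas and validation mechanism. Sources: [clients/new-js/packages/ai-embeddings/common/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/common/README.md)

## Architecture

### Overall Architecture

ChromaDB uses a modular design that abstracts embedding functions behind a pluggable interface. The core components are: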

```mermaid
graph TD
    A[ChromaDB Client] --> B[Embedding Function Abstraction]
    B --> C[Provider-Specific Implementations]
    C --> D[OpenAI]
    C --> E[Cohere]
    C --> F[Jina]
    C --> G[HuggingFace]
    C --> H[Ollama]
    C --> I[Morph]
    C --> J[Google Gemini]
    C --> K[Voyage AI]
    C --> L[Cloudflare Worker AI]
    
    M[Schema Validation] --> B
    M --> N[base_schema.json]
    M --> O[Provider Schemas]
```

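### Embedding Function Configuration Flow

An embedding function is instantiated and configured on the client side and then bound to a collection:

```mermaid
graph LR
    A1[User code] --> B1[Create embedding function instance]
    B1 --> C1[Validate configuration parameters]
    C1 --> D1[Schema validation]
    D1 --> E1[Bind to Collection]
    E1 --> F1[Embed automatically when data is added]
    F1 --> G1[Embed query text automatically at query time]
```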

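## Embedding Function Package Layout

### JavaScript/TypeScript Package Organization

The JavaScript client splits embedding providers into separate packages so they can be installed on demand:

| Package | Purpose | Source path |
|------|------|----------|
| `@chroma-core/ai-embeddings-common` | Shared utility functions | clients/new-js/packages/ai-embeddings/common/ |
| `@chroma-core/all` | Aggregate package for all embedding providers | clients/new-js/packages/ai-embeddings/all/ |
| `@chroma-core/jina` | Jina AI embeddings | clients/new-js/packages/ai-embeddings/jina/ |
| `@chroma-core/morph` | Morph embeddings | clients/new-js/packages/ai-embeddings/morph/ |
| `@chroma-core/chroma-cloud-qwen` | Qwen cloud embeddings | clients/new-js/packages/ai-embeddings/chroma-cloud-qwen/ |

Sources: [clients/new-js/packages/ai-embeddings/all/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/all/README.md), [clients/new-js/packages/ai-embeddings/common/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/common/README.md)

### Shared Utility Functions

The `@chroma-core/ai-embeddings-common` package provides the following core functionality: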

```typescript
// Example import
import { validateConfigSchema, snakeCase, isBrowser } from '@chroma-core/ai-embeddings-common';

// Convert camelCase to snake_case (for API compatibility)
const snakeCaseConfig = snakeCase({ modelName: 'text-embedding-3-small' });
// Result: { model_name: 'text-embedding-3-small' }

// Detect a browser environment
if (isBrowser()) {
  // Browser-specific logic
}

// Validate an embedding function configuration
validateConfigSchema(config, 'openai');
```

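In summary, the package covers:

- **Schema validation**: validates embedding function configurations against JSON Schema Draft-07
- **Name conversion**: converts camelCase JavaScript objects to snake_case to match the API
- **Environment detection**: distinguishes browser from Node.js runtimes

Sources: [clients/new-js/packages/ai-embeddings/common/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/common/README.md)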
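
The name conversion step is easy to picture in isolation. The following Python sketch mirrors the idea of converting camelCase keys to snake_case; it is an illustrative reimplementation under the assumption that keys are simple camelCase identifiers, and `to_snake_case` and `snake_case_keys` are hypothetical names, not functions from the package.

```python
import re
from typing import Any, Dict

# Zero-width match between a lowercase letter or digit and an uppercase letter.
_WORD_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def to_snake_case(name: str) -> str:
    """Convert a camelCase identifier to snake_case."""
    return _WORD_BOUNDARY.sub("_", name).lower()


def snake_case_keys(config: Dict[str, Any]) -> Dict[str, Any]:
    """Convert the keys of a (possibly nested) dict, leaving values untouched."""
    return {
        to_snake_case(key): snake_case_keys(value) if isinstance(value, dict) else value
        for key, value in config.items()
    }


converted = snake_case_keys(
    {"modelName": "text-embedding-3-small", "apiKeyEnvVar": "OPENAI_API_KEY"}
)
```

Only keys are rewritten; values such as model names or environment variable names pass through unchanged.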

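## Schema Validation System

### Cross-Language Compatibility

ChromaDB keeps a single set of JSON Schema definitions in the `schemas/embedding_functions/` directory so that Python and JavaScript client configurations stay consistent: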

```mermaid
graph TD
    A[Schema Definition] --> B[Python Client]
    A --> C[JavaScript Client]
    B --> D[validate_config function]
    C --> E[validateConfig function]
```

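### Schema Structure

Each embedding function schema follows JSON Schema Draft-07:

| Field | Description |
|------|------|
| `version` | Schema version |
| `title` | Schema title |
| `description` | Description of the embedding function |
| `properties` | Configurable properties |
| `required` | Required properties |
| `additionalProperties` | Whether extra properties are allowed (always `false`) |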

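### Validation in Python

```python
from chromadb.utils.embedding_functions.schemas import validate_config

# Validate a configuration
config = {
    "api_key_env_var": "CHROMA_OPENAI_API_KEY",
    "model_name": "text-embedding-ada-002"
}
validate_config(config, "openai")
```

Sources: [chromadb/utils/embedding_functions/schemas/README.md](https://github.com/chroma-core/chroma/blob/main/chromadb/utils/embedding_functions/schemas/README.md), [schemas/embedding_functions/README.md](https://github.com/chroma-core/chroma/blob/main/schemas/embedding_functions/README.md)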
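
To show what the `required` and `additionalProperties: false` rules mean in practice, here is a deliberately small validator written with the standard library only. It is a sketch, not the validator Chroma uses: it checks just those two rules, and `validate_against_schema` and `toy_schema` are hypothetical names with made-up contents.

```python
from typing import Any, Dict, List


def validate_against_schema(config: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors: List[str] = []
    properties = schema.get("properties", {})
    # Rule 1: every key listed under "required" must be present.
    for key in schema.get("required", []):
        if key not in config:
            errors.append(f"missing required property: {key}")
    # Rule 2: with additionalProperties set to false, unknown keys are rejected.
    if schema.get("additionalProperties") is False:
        for key in config:
            if key not in properties:
                errors.append(f"unexpected property: {key}")
    return errors


# Hypothetical schema for illustration only.
toy_schema = {
    "properties": {"api_key_env_var": {}, "model_name": {}},
    "required": ["model_name"],
    "additionalProperties": False,
}
ok = validate_against_schema({"model_name": "text-embedding-ada-002"}, toy_schema)
bad = validate_against_schema({"modelName": "text-embedding-ada-002"}, toy_schema)
```

The second call fails twice: the required snake_case key is missing and the camelCase key is not a declared property, which is why the JavaScript client converts names before validating.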

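## API Type Definitions

### The Include Enum

In the Rust implementation, the `Include` enum controls which fields (distances, documents, embeddings, metadata, URIs) are returned by query and get operations: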

```rust
#[derive(Clone, Debug, Deserialize, Serialize, PartialEq)]
#[cfg_attr(feature = "utoipa", derive(utoipa::ToSchema))]
pub enum Include {
    Distance,
    Document,
    Embedding,
    Metadata,
    Uri,
}
```

### IncludeList Defaults

| Method | Fields included by default |
|------|-------------|
| `default_query()` | Document, Metadata, Distance |
| `default_get()` | Document, Metadata |
| `all()` | Document, Metadata, Distance, Embedding, Uri |

Sources: [rust/types/src/api_types.rs](https://github.com/chroma-core/chroma/blob/main/rust/types/src/api_types.rs)
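
From a client's point of view, the include list decides which fields of a result are populated. The sketch below models that selection in plain Python, assuming the plural field names accepted by the Python client's `include` parameter; `select_fields` and the sample result are hypothetical and only illustrate the defaults in the table above.

```python
from typing import Any, Dict, List

# Defaults mirroring the table above, using the Python client's field names.
DEFAULT_QUERY_INCLUDE = ["documents", "metadatas", "distances"]
DEFAULT_GET_INCLUDE = ["documents", "metadatas"]

OPTIONAL_FIELDS = ["documents", "metadatas", "distances", "embeddings", "uris"]


def select_fields(record_set: Dict[str, Any], include: List[str]) -> Dict[str, Any]:
    """Keep ids plus the requested fields; all other fields become None."""
    result: Dict[str, Any] = {"ids": record_set["ids"]}
    for field in OPTIONAL_FIELDS:
        result[field] = record_set.get(field) if field in include else None
    return result


# Hypothetical full result for a single query.
full = {
    "ids": [["id1"]],
    "documents": [["hello"]],
    "metadatas": [[{"source": "doc1"}]],
    "distances": [[0.12]],
    "embeddings": [[[1.0, 2.0, 3.0]]],
    "uris": [[None]],
}
trimmed = select_fields(full, DEFAULT_GET_INCLUDE)
```

Leaving embeddings out of the defaults keeps responses small, since vectors are usually the largest part of a record.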

## Embedding Function Examples

### Jina Embedding Function

Jina AI is one of the embedding providers ChromaDB supports:

```typescript
import { ChromaClient } from 'chromadb';
import { JinaEmbeddingFunction } from '@chroma-core/jina';

// Initialize the embedding function
const embedder = new JinaEmbeddingFunction({
  apiKey: 'your-api-key', // or set the JINA_API_KEY environment variable
  modelName: 'jina-embeddings-v2-base-en',
  task: 'retrieval.passage',
  dimensions: 768,
  lateChunking: false,
  truncate: true,
  normalized: true,
  embeddingType: 'float'
});

// Create the client and collection
const client = new ChromaClient({ path: 'http://localhost:8000' });
const collection = await client.createCollection({
  name: 'my-collection',
  embeddingFunction: embedder,
});

// Add documents (embedded automatically)
await collection.add({
  ids: ["1", "2", "3"],
  documents: ["Document 1", "Document 2", "Document 3"],
});

// Query (the query text is embedded automatically)
const results = await collection.query({
  queryTexts: ["Sample query"],
  nResults: 2,
});
```

**Configuration options:**

| Option | Type | Default | Description |
|------|------|--------|------|
| `apiKey` | string | JINA_API_KEY env | API key |
| `modelName` | string | jina-embeddings-v2-base-en | Model name |
| `task` | string | retrieval.passage | Task type |
| `dimensions` | number | 768 | Vector dimensions |
| `lateChunking` | boolean | false | Late chunking |
| `truncate` | boolean | true | Truncate over-long text |
| `normalized` | boolean | true | L2 normalization |
| `embeddingType` | string | float | Embedding type |

Sources: [clients/new-js/packages/ai-embeddings/jina/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/jina/README.md)

### Morph Embedding Function

```typescript
import { MorphEmbeddingFunction } from '@chroma-core/morph';

const morphEmbedding = new MorphEmbeddingFunction({
  api_key: 'your-morph-api-key',
  model_name: 'morph-embedding-v2', // default
  api_base: 'https://api.morphllm.com/v1', // default
  encoding_format: 'float' // default
});
```

**Configuration parameters:**

| Parameter | Required | Default | Description |
|------|------|--------|------|
| `api_key` | No | Environment variable | API key |
| `model_name` | No | morph-embedding-v2 | Model name |
| `api_base` | No | https://api.morphllm.com/v1 | API base URL |
| `encoding_format` | No | float | Encoding format (float or base64) |

Sources: [clients/new-js/packages/ai-embeddings/morph/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/morph/README.md)

### Qwen Cloud Embedding Function

```typescript
import { ChromaClient } from 'chromadb';
import { ChromaCloudQwenEmbeddingFunction } from '@chroma-core/chroma-cloud-qwen';

// Initialize the embedding function
const embedder = new ChromaCloudQwenEmbeddingFunction({
  apiKey: 'your-api-key',
  model: 'Qwen/Qwen3-Embedding-0.6B',
});

// Create the client and collection
const client = new ChromaClient({ path: 'http://localhost:8000' });
const collection = await client.createCollection({
  name: 'my-collection',
  embeddingFunction: embedder,
});

// Add and query
await collection.add({
  ids: ["doc1", "doc2", "doc3"],
  documents: ["Document 1", "Document 2", "Document 3"],
});

const results = await collection.query({
  queryTexts: ["Sample query"],
  nResults: 2,
});
```

**Configuration options:**

| Option | Description |
|------|------|
| `model` | Model used for embedding |
| `task` | Task the embeddings are generated for |
| `instruction_dict` | Custom instruction mapping for tasks and targets |
| `apiKeyEnvVar` | Name of the environment variable holding the API key (default `CHROMA_API_KEY`) |

Sources: [clients/new-js/packages/ai-embeddings/chroma-cloud-qwen/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/chroma-cloud-qwen/README.md)

## Using the Aggregate Package

The `@chroma-core/all` package re-exports every embedding function so they can all be imported from one place:

```typescript
import {
  OpenAIEmbeddingFunction,
  CohereEmbeddingFunction,
  JinaEmbeddingFunction,
  GoogleGeminiEmbeddingFunction,
  // ... and all other providers
} from '@chroma-core/all';

// Use any embedding function
const openAIEF = new OpenAIEmbeddingFunction({
  apiKey: 'your-api-key',
  modelName: 'text-embedding-3-small'
});
```

**Included providers:**

- OpenAI
- Cohere
- Jina
- Google Gemini
- HuggingFace Server
- Ollama
- Together AI
- Voyage AI
- Cloudflare Worker AI
- Default Embedding

Sources: [clients/new-js/packages/ai-embeddings/all/README.md](https://github.com/chroma-core/chroma/blob/main/clients/new-js/packages/ai-embeddings/all/README.md)

## Dynamic Loading

The JavaScript client can load embedding functions dynamically, resolving the package to import from a name mapping:

```typescript
export const getEmbeddingFunction = async (
  client: ChromaClient,
  efConfig?: EmbeddingFunctionConfiguration,
) => {
  // Only "known" configurations can be resolved
  if (efConfig?.type !== "known") {
    return undefined;
  }

  // Skip embedding functions that are not supported
  if (unsupportedEmbeddingFunctions.has(efConfig.name)) {
    return undefined;
  }

  // Resolve the package name (Python names are mapped to JS package names)
  const packageName = pythonEmbeddingFunctions[efConfig.name] || efConfig.name;

  // Import the package dynamically if it is not registered yet
  let embeddingFunction = knownEmbeddingFunctions.get(packageName);
  if (!embeddingFunction) {
    try {
      const fullPackageName = `@chroma-core/${packageName}`;
      await import(fullPackageName);
      embeddingFunction = knownEmbeddingFunctions.get(packageName);
    } catch (error) {
      // Dynamic import failed
    }
  }

  // Build the instance through the buildFromConfig factory method.
  // constructorConfig is defined in the source file and omitted from this excerpt.
  if (embeddingFunction?.buildFromConfig) {
    return embeddingFunction.buildFromConfig(constructorConfig, client);
  }
  return undefined;
};
```

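Advantages of dynamic loading:

1. **On-demand loading**: reduces the initial bundle size
2. **Plugin style**: allows third-party embedding functions to be added
3. **Backward compatibility**: supports mapping embedding function names used on the Python side

Sources: [clients/new-js/src/embedding-function.ts](https://github.com/chroma-core/chroma/blob/main/clients/new-js/src/embedding-function.ts)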
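
The same registry-plus-lazy-import pattern can be sketched in Python with `importlib`. This is an illustration of the technique only: the registry, the `register` and `load_factory` functions, and the module name used below are all hypothetical and do not correspond to Chroma code.

```python
import importlib
from typing import Any, Callable, Dict, Optional

Factory = Callable[[Dict[str, Any]], Any]

# Registry that provider modules populate when they are imported.
_known_factories: Dict[str, Factory] = {}


def register(name: str, factory: Factory) -> None:
    """Called by a provider module at import time to register itself."""
    _known_factories[name] = factory


def load_factory(name: str, module_name: Optional[str] = None) -> Optional[Factory]:
    """Return a registered factory, importing its module on first use."""
    factory = _known_factories.get(name)
    if factory is None and module_name is not None:
        try:
            # Importing the module is expected to trigger register().
            importlib.import_module(module_name)
        except ImportError:
            return None  # provider package is not installed
        factory = _known_factories.get(name)
    return factory


# A provider registering itself, then being resolved by name.
register("toy", lambda config: ("toy-embedder", config.get("model_name")))
toy_factory = load_factory("toy")
missing = load_factory("absent", module_name="no_such_provider_module_xyz")
```

Returning `None` instead of raising when the import fails matches the excerpt above, where a failed dynamic import simply leaves the collection without a resolved embedding function.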

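## Best Practices

### Installation Strategy

```bash
# Install only the embedding function you need
npm install @chroma-core/jina

# Or install all embedding functions
npm install @chroma-core/all
```

### Environment Variables

Most embedding functions can read their API key from an environment variable: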

```bash
export OPENAI_API_KEY=your-openai-key
export JINA_API_KEY=your-jina-key
export CHROMA_API_KEY=your-chroma-key
export MORPH_API_KEY=your-morph-key
export XAI_API_KEY=your-xai-key
```
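
A common way to combine explicit keys with environment variables is to prefer the explicit value and fall back to the environment. The helper below sketches that lookup order in Python; `resolve_api_key` is a hypothetical name, and the precedence shown is an assumption for illustration rather than documented behavior of any particular provider package.

```python
import os
from typing import Optional


def resolve_api_key(explicit_key: Optional[str], env_var: str) -> str:
    """Prefer an explicitly passed key, otherwise read the environment variable."""
    if explicit_key:
        return explicit_key
    value = os.environ.get(env_var)
    if not value:
        raise ValueError(
            f"No API key provided and environment variable {env_var} is not set"
        )
    return value
```

Failing fast with a message that names the variable makes a missing key much easier to diagnose than an authentication error from the provider later on.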

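### Code Organization

```mermaid
graph TD
    A[embedding-functions.ts] --> B[Single export of configured functions]
    B --> C[Referenced in createCollection]
    C --> D[Called automatically during data operations]
    
    A1[embedder.ts] --> B
    A2[openai.ts] --> B
    A3[jina.ts] --> B
```

Keeping embedding function configuration in one place makes it easier to maintain and to switch providers.

## Summary

ChromaDB's embedding function integration provides flexible vector embedding across many cloud services and local models. A unified schema validation mechanism and a dynamic loading architecture let developers switch embedding providers easily while keeping behavior consistent across languages. The design is also extensible, which leaves room for supporting additional embedding models.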

---

<a id='page-storage'></a>

## Storage System

### Related Pages

Related topics: [System Architecture](#page-architecture), [Vector Index System](#page-vector-index)

<details>
<summary>Relevant source files</summary>

The following files were used as context for generating this wiki page:

- [rust/blockstore/src/lib.rs](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/lib.rs)
- [rust/blockstore/src/arrow/blockfile.rs](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/blockfile.rs)
- [rust/blockstore/src/arrow/block/types.rs](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/block/types.rs)
- [rust/blockstore/src/arrow/root.rs](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/root.rs)
- [rust/blockstore/src/arrow/provider.rs](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/provider.rs)
- [rust/blockstore/src/provider.rs](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/provider.rs)
- [rust/types/src/sparse_posting_block.rs](https://github.com/chroma-core/chroma/blob/main/rust/types/src/sparse_posting_block.rs)
- [rust/wal3/src/lib.rs](https://github.com/chroma-core/chroma/blob/main/rust/wal3/src/lib.rs)
</details>

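# Storage System

## Overview

Chroma's storage system is a core part of the database infrastructure and is responsible for persisting vector data and metadata. It uses a layered design with the following main modules:

- **Blockstore**: the block storage abstraction layer, with multiple backend implementations
- **Arrow Blockfile**: a columnar storage implementation based on the Apache Arrow IPC format
- **Sparse Posting Block**: block storage for sparse inverted (posting list) indexes
- **WAL3**: the write-ahead log (WAL) implementation in `rust/wal3`

The design goal is a single API abstraction over memory-mapped, persistent, and remote storage backends that preserves data consistency and fast access.

Sources: [rust/blockstore/src/lib.rs:1-50](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/lib.rs)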

## Architecture Layers

The storage system is organized into four layers, each with a clearly defined responsibility:

```mermaid
graph TD
    A[Application layer] --> B[API layer]
    B --> C[Provider layer]
    C --> D[Blockfile layer]
    D --> E[Storage layer]
    
    B1[BlockfileProvider] --> C1[Arrow Blockfile Provider]
    B1 --> B2[HashMap Blockfile Provider]
    
    C1 --> D1[ArrowBlockfileWriter]
    C1 --> D2[ArrowBlockfileReader]
    
    E1[Storage] --> E2[Local file system]
    E1 --> E3[Remote storage]
```

### Layer Responsibilities

| Layer | Component | Responsibility |
|------|------|------|
| API layer | BlockfileProvider | Unified read/write interface |
| Provider layer | Arrow/HashMap Provider | Provider for a specific storage format |
| Blockfile layer | ArrowBlockfile | Concrete block read/write implementation |
| Storage layer | Storage | Underlying persistence mechanism |

Sources: [rust/blockstore/src/provider.rs:1-80](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/provider.rs)

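## Blockstore Core Abstractions

### The BlockfileProvider Interface

`BlockfileProvider` is the central abstraction of the storage system and the single entry point for reads and writes. The excerpt below shows the `ReadKey` and `ReadValue` trait bounds that key and value types must satisfy to be read through it: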

```rust
pub trait ReadKey<'a>:
    Key
    + Into<KeyWrapper>
    + TryFrom<&'a KeyWrapper, Error = InvalidKeyConversion>
    + ArrowReadableKey<'a>
    + Sync
    + 'a
{
}

pub trait ReadValue<'a>: Value + Readable<'a> + ArrowReadableValue<'a> + Sync + 'a {}
```

`BlockfileProvider` has two implementations:
1. **ArrowBlockfileProvider**: persistent storage based on the Arrow IPC format
2. **HashMapBlockfileProvider**: a purely in-memory implementation

Sources: [rust/blockstore/src/provider.rs:60-90](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/provider.rs)
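
The value of having an in-memory and a persistent implementation behind one interface is that calling code does not need to know which one it is using. The Python sketch below illustrates that idea with a deliberately simplified key-value interface; `KeyValueStore`, `InMemoryStore`, `FileStore`, and `round_trip` are hypothetical names and bear no relation to the actual Rust types or their storage format.

```python
from pathlib import Path
from typing import Dict, Optional, Protocol


class KeyValueStore(Protocol):
    """Minimal interface shared by both backends (illustrative only)."""

    def set(self, key: str, value: bytes) -> None: ...

    def get(self, key: str) -> Optional[bytes]: ...


class InMemoryStore:
    """Keeps everything in a dict; contents are lost when the process exits."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


class FileStore:
    """Writes one file per key under a directory so data survives restarts."""

    def __init__(self, directory: str) -> None:
        self._dir = Path(directory)
        self._dir.mkdir(parents=True, exist_ok=True)

    def set(self, key: str, value: bytes) -> None:
        (self._dir / key).write_bytes(value)

    def get(self, key: str) -> Optional[bytes]:
        path = self._dir / key
        return path.read_bytes() if path.exists() else None


def round_trip(store: KeyValueStore, key: str, value: bytes) -> Optional[bytes]:
    """Code written against the interface works with either backend."""
    store.set(key, value)
    return store.get(key)
```

An in-memory backend of this kind is typically useful in tests, where persistence would only slow things down.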

### Core API Methods

| Method | Parameters | Return value | Description |
|------|------|--------|------|
| `read` | `options: BlockfileReaderOptions` | `BlockfileReader` | Open a blockfile for reading |
| `write` | `options: BlockfileWriterOptions` | `BlockfileWriter` | Open a blockfile for writing |
| `clear` | - | `Result<(), Box<dyn ChromaError>>` | Clear the cache |
| `prefetch` | `id`, `prefix_path` | `Result<usize>` | Prefetch data into the cache |
| `storage` | - | `Option<Arc<Storage>>` | Get the underlying storage |

Sources: [rust/blockstore/src/provider.rs:120-180](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/provider.rs)

## Arrow Blockfile Implementation

### Arrow Blockfile Overview

Arrow Blockfile is Chroma's primary storage implementation and is based on the Apache Arrow IPC (Inter-Process Communication) format. This design has the following advantages:

- **Columnar storage**: values of the same type are stored contiguously, which improves compression and scan efficiency
- **Zero-copy reads**: memory mapping allows efficient data access
- **Type safety**: the Arrow format carries its own schema

```mermaid
graph LR
    A[Write request] --> B[BlockDelta]
    B --> C[SparseIndex]
    C --> D[Root]
    D --> E[Block Manager]
    E --> F[Persistent storage]
    
    G[Read request] --> H[Arrow Blockfile Reader]
    H --> I[Footer Parsing]
    I --> J[Record Batch]
```

Sources: [rust/blockstore/src/arrow/blockfile.rs:1-100](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/blockfile.rs)

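### Block Structure

A block is the smallest unit of storage in an Arrow Blockfile. The `ArrowLayoutVerificationError` enum below lists the layout problems that verification can report, such as a buffer length that is not 64-byte aligned or an IPC file that does not contain exactly one record batch: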

```rust
#[derive(Error, Debug)]
pub enum ArrowLayoutVerificationError {
    #[error("Buffer length is not 64 byte aligned")]
    BufferLengthNotAligned,
    #[error(transparent)]
    IOError(#[from] std::io::Error),
    #[error(transparent)]
    ArrowError(#[from] arrow::error::ArrowError),
    #[error(transparent)]
    InvalidFlatbuffer(#[from] flatbuffers::InvalidFlatbuffer),
    #[error("No record batches in footer")]
    NoRecordBatches,
    #[error("More than one record batch in IPC file")]
    MultipleRecordBatches,
    #[error("Invalid message type")]
    InvalidMessageType,
    #[error("Error decoding record batch message as record batch")]
    RecordBatchDecodeError,
}
```

Sources: [rust/blockstore/src/arrow/block/types.rs:1-60](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/block/types.rs)
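
The alignment condition named by `BufferLengthNotAligned` is simple arithmetic on the buffer length. The Python sketch below shows that check and the matching zero-padding step; `is_aligned` and `pad_to_alignment` are hypothetical helpers written for illustration and are not taken from the Rust source.

```python
ALIGNMENT = 64  # bytes, as named in the error message above


def is_aligned(buffer: bytes) -> bool:
    """True when the buffer length is a multiple of 64 bytes."""
    return len(buffer) % ALIGNMENT == 0


def pad_to_alignment(buffer: bytes) -> bytes:
    """Append zero bytes until the length is a multiple of 64."""
    remainder = len(buffer) % ALIGNMENT
    if remainder == 0:
        return buffer
    return buffer + b"\x00" * (ALIGNMENT - remainder)


padded = pad_to_alignment(b"x" * 100)
```

A 100-byte buffer is padded with 28 zero bytes to reach 128, the next multiple of 64.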

### Root Structure and Versioning

The root is the metadata header of an Arrow Blockfile and holds the version information and the list of block IDs:

```rust
impl RootReader {
    pub(super) fn get_all_block_ids_from_bytes(
        bytes: &[u8],
        id: Uuid,
    ) -> Result<Vec<Uuid>, FromBytesError> {
        let mut cursor = std::io::Cursor::new(bytes);
        let arrow_reader = arrow::ipc::reader::FileReader::try_new(&mut cursor, None);

        let record_batch = match arrow_reader {
            Ok(mut reader) => match reader.next() {
                Some(Ok(batch)) => batch,
                Some(Err(e)) => return Err(FromBytesError::ArrowError(e)),
                None => {
                    return Err(FromBytesError::NoDataError);
                }
            },
            Err(e) => return Err(FromBytesError::ArrowError(e)),
        };

        let (version, read_id) = Self::version_and_id_from_record_batch(&record_batch, id)?;
        if read_id != id {
            return Err(FromBytesError::IdMismatch);
        }

        Self::block_ids_from_record_batch(&record_batch, version)
    }
}
```

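The version is read from the record batch before the block IDs are decoded, so the reader can choose the decoding path that matches the format the root was written in. This is how versioning supports both forward and backward compatibility of the data format.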

Sources: [rust/blockstore/src/arrow/root.rs:1-80](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/root.rs)

### BlockfileWriter Implementation

`ArrowUnorderedBlockfileWriter` is the writer that accepts mutations in arbitrary key order, which suits highly concurrent workloads:

```rust
impl ArrowUnorderedBlockfileWriter {
    pub(super) fn new<K: ArrowWriteableKey, V: ArrowWriteableValue>(
        id: Uuid,
        prefix_path: &str,
        block_manager: BlockManager,
        root_manager: RootManager,
        max_block_size_bytes: usize,
        cmek: Option<Cmek>,
    ) -> Self {
        let initial_block = block_manager.create::<K, V, UnorderedBlockDelta>();
        let sparse_index = SparseIndexWriter::new(initial_block.id);
        // ...
    }
}
```

Sources: [rust/blockstore/src/arrow/blockfile.rs:100-150](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/blockfile.rs)
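
The constructor above seeds a `SparseIndexWriter` with the ID of the first block. As a rough illustration of the idea behind a sparse index (not Chroma's actual implementation), the sketch below keeps one entry per block, keyed by the smallest key in that block, and resolves a lookup to the block with the greatest start key that does not exceed the target. The `SparseIndexSketch` type, its `u64` keys, and its `u32` block IDs are assumptions made for this example.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified sparse index: one entry per block,
/// keyed by the smallest key stored in that block.
pub struct SparseIndexSketch {
    // start key -> block ID (u32 stands in for the real Uuid)
    starts: BTreeMap<u64, u32>,
}

impl SparseIndexSketch {
    /// Create an index whose initial block covers the whole keyspace.
    pub fn new(initial_block_id: u32) -> Self {
        let mut starts = BTreeMap::new();
        // Key 0 is the smallest possible key, so this entry matches any lookup.
        starts.insert(0, initial_block_id);
        Self { starts }
    }

    /// Register a block that begins at `start_key`, for example after a split.
    pub fn add_block(&mut self, start_key: u64, block_id: u32) {
        self.starts.insert(start_key, block_id);
    }

    /// Find the block that may contain `key`: the entry with the
    /// greatest start key that is less than or equal to `key`.
    pub fn block_for(&self, key: u64) -> u32 {
        self.starts
            .range(..=key)
            .next_back()
            .map(|(_, id)| *id)
            .expect("the entry at key 0 always matches")
    }
}
```

Because only block boundaries are indexed, the index stays small even when blocks hold many keys; the price is a second search inside the selected block.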

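## Provider Layer Implementation

### Arrow Blockfile Provider

`ArrowBlockfileProvider` is the main storage provider for production use. When forking, it picks the ordered or unordered writer according to `options.mutation_ordering`: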

```rust
// Fork: build a writer of the requested ordering on top of the new root
match options.mutation_ordering {
    BlockfileWriterMutationOrdering::Ordered => {
        let file = ArrowOrderedBlockfileWriter::from_root(
            new_id,
            self.block_manager.clone(),
            self.root_manager.clone(),
            new_root,
            options.cmek,
        );
        Ok(BlockfileWriter::ArrowOrderedBlockfileWriter(file))
    }
    BlockfileWriterMutationOrdering::Unordered => {
        let file = ArrowUnorderedBlockfileWriter::from_root(
            new_id,
            self.block_manager.clone(),
            self.root_manager.clone(),
            new_root,
            options.cmek,
        );
        Ok(BlockfileWriter::ArrowUnorderedBlockfileWriter(file))
    }
}
```

Sources: [rust/blockstore/src/arrow/provider.rs:1-100](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/provider.rs)

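### Writer Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `prefix_path` | `String` | - | Path prefix for the data |
| `max_block_size_bytes` | `usize` | Engine default | Maximum size of a single block, in bytes |
| `mutation_ordering` | `BlockfileWriterMutationOrdering` | `Ordered` | Write ordering strategy |
| `fork_from` | `Option<Uuid>` | `None` | Fork from an existing blockfile |
| `cmek` | `Option<Cmek>` | `None` | Customer-managed encryption key |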

Sources: [rust/blockstore/src/arrow/provider.rs:80-120](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/provider.rs)

## Sparse Posting Block

Sparse posting blocks are the core data structure of the sparse inverted index behind Chroma's full-text search:

```rust
/// A DirectoryBlock stores the metadata of every posting block.
/// body = [ max_offset: u32 LE, max_weight: f32 LE ] × num_entries
pub struct DirectoryBlock(SparsePostingBlock);

impl DirectoryBlock {
    pub fn new(max_offsets: &[u32], max_weights: &[f32]) -> Result<Self, SparsePostingBlockError> {
        if max_offsets.len() != max_weights.len() {
            return Err(SparsePostingBlockError::MismatchedLengths {
                offsets: max_offsets.len(),
                weights: max_weights.len(),
            });
        }
        // ...
    }
}
```

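The directory keeps a `max_weight` upper bound for every posting block, so a query can skip blocks whose weights are too small to matter. This gives efficient term pruning, with weight filtering at the level of individual dimensions.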

Sources: [rust/types/src/sparse_posting_block.rs:1-80](https://github.com/chroma-core/chroma/blob/main/rust/types/src/sparse_posting_block.rs)
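
To make the body layout in the doc comment concrete, here is a minimal sketch that encodes and decodes `[max_offset: u32 LE, max_weight: f32 LE]` pairs and then uses the decoded `max_weight` values to decide which posting blocks are worth visiting. The function names, the error type, and the pruning rule (`query_weight * max_weight >= threshold`) are assumptions for illustration only and do not come from the Chroma source.

```rust
/// Error type for this sketch; the first variant mirrors the length check above.
#[derive(Debug, PartialEq)]
pub enum DirectorySketchError {
    MismatchedLengths { offsets: usize, weights: usize },
    TruncatedBody,
}

/// Encode entries as [max_offset: u32 LE, max_weight: f32 LE] x num_entries.
pub fn encode_directory(
    max_offsets: &[u32],
    max_weights: &[f32],
) -> Result<Vec<u8>, DirectorySketchError> {
    if max_offsets.len() != max_weights.len() {
        return Err(DirectorySketchError::MismatchedLengths {
            offsets: max_offsets.len(),
            weights: max_weights.len(),
        });
    }
    let mut body = Vec::with_capacity(max_offsets.len() * 8);
    for (offset, weight) in max_offsets.iter().zip(max_weights) {
        body.extend_from_slice(&offset.to_le_bytes());
        body.extend_from_slice(&weight.to_le_bytes());
    }
    Ok(body)
}

/// Decode the body back into (max_offset, max_weight) pairs.
pub fn decode_directory(body: &[u8]) -> Result<Vec<(u32, f32)>, DirectorySketchError> {
    if body.len() % 8 != 0 {
        return Err(DirectorySketchError::TruncatedBody);
    }
    Ok(body
        .chunks_exact(8)
        .map(|e| {
            let offset = u32::from_le_bytes([e[0], e[1], e[2], e[3]]);
            let weight = f32::from_le_bytes([e[4], e[5], e[6], e[7]]);
            (offset, weight)
        })
        .collect())
}

/// Indices of posting blocks whose best possible contribution,
/// `query_weight * max_weight`, can still reach `threshold`.
pub fn blocks_to_visit(entries: &[(u32, f32)], query_weight: f32, threshold: f32) -> Vec<usize> {
    entries
        .iter()
        .enumerate()
        .filter(|(_, entry)| query_weight * entry.1 >= threshold)
        .map(|(i, _)| i)
        .collect()
}
```

Blocks that fail the bound are never decoded, which is where the saving comes from.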

## WAL3 Log System

### WAL3 Overview

WAL3 (Write-Ahead Log version 3) is Chroma's write-ahead log. It ensures that data can be recovered after a system crash:

```mermaid
graph LR
    A[Transaction begins] --> B[Write to WAL]
    B --> C[Write to main storage]
    C --> D[Mark transaction complete]
    D --> E[Periodic checkpoint]
    E --> F[Clean up persisted log entries]
```

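Core design goals of WAL3:

1. **Atomicity**: a transaction either succeeds completely or is rolled back completely
2. **Durability**: committed transactions are never lost
3. **Recoverability**: after a crash, the system can be restored to a consistent state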

Sources: [rust/wal3/src/lib.rs:1-50](https://github.com/chroma-core/chroma/blob/main/rust/wal3/src/lib.rs)
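
The flow in the diagram (log first, then apply, then checkpoint and trim) is the generic write-ahead logging pattern. The sketch below shows that pattern in memory. It is not the wal3 API: `WalSketch`, `LogRecord`, and every method name are hypothetical, and a real log would persist records to durable storage instead of a `Vec`.

```rust
use std::collections::BTreeMap;

/// One logged mutation in this sketch.
#[derive(Clone, Debug)]
pub enum LogRecord {
    Put { key: String, value: String },
    Delete { key: String },
}

/// Hypothetical in-memory store protected by a write-ahead log.
#[derive(Default)]
pub struct WalSketch {
    log: Vec<LogRecord>,                  // records written since the last checkpoint
    checkpoint: BTreeMap<String, String>, // last snapshot of the main storage
    state: BTreeMap<String, String>,      // live state, lost on a simulated crash
}

impl WalSketch {
    /// Append to the log first, then apply to the live state.
    pub fn write(&mut self, record: LogRecord) {
        self.log.push(record.clone());
        Self::apply(&mut self.state, record);
    }

    /// Snapshot the state and drop the log entries it now covers.
    pub fn checkpoint(&mut self) {
        self.checkpoint = self.state.clone();
        self.log.clear();
    }

    /// Simulate a crash: discard the live state, then rebuild it by
    /// replaying the log on top of the last checkpoint.
    pub fn crash_and_recover(&mut self) {
        self.state.clear();
        let mut recovered = self.checkpoint.clone();
        for record in self.log.iter().cloned() {
            Self::apply(&mut recovered, record);
        }
        self.state = recovered;
    }

    pub fn get(&self, key: &str) -> Option<&String> {
        self.state.get(key)
    }

    fn apply(state: &mut BTreeMap<String, String>, record: LogRecord) {
        match record {
            LogRecord::Put { key, value } => {
                state.insert(key, value);
            }
            LogRecord::Delete { key } => {
                state.remove(&key);
            }
        }
    }
}
```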

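### WAL3 Data Format

WAL3 stores transaction log records in a compact binary format:

| Field | Length | Description |
|-------|--------|-------------|
| Magic Number | 4 bytes | Identifies the WAL3 format |
| Version | 4 bytes | Format version number |
| Transaction ID | 8 bytes | Monotonically increasing transaction identifier |
| Payload Length | 4 bytes | Length of the data payload |
| Payload | Variable | The actual data |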

Sources: [rust/wal3/src/lib.rs:50-100](https://github.com/chroma-core/chroma/blob/main/rust/wal3/src/lib.rs)
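
As an illustration of the field layout in the table, the following sketch encodes and decodes one record. Several details are assumptions that the table does not specify: the magic bytes (`WAL3` here is a placeholder), little-endian byte order, and all type and function names.

```rust
/// Placeholder magic bytes; the real value is not stated in the table.
const MAGIC: [u8; 4] = *b"WAL3";
/// Magic (4) + version (4) + transaction ID (8) + payload length (4).
const HEADER_LEN: usize = 20;

#[derive(Debug, PartialEq)]
pub struct RecordSketch {
    pub version: u32,
    pub transaction_id: u64,
    pub payload: Vec<u8>,
}

#[derive(Debug, PartialEq)]
pub enum DecodeError {
    TooShort,
    BadMagic,
    LengthMismatch,
}

/// Serialize the header fields in table order, followed by the payload.
pub fn encode_record(record: &RecordSketch) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + record.payload.len());
    out.extend_from_slice(&MAGIC);
    out.extend_from_slice(&record.version.to_le_bytes());
    out.extend_from_slice(&record.transaction_id.to_le_bytes());
    out.extend_from_slice(&(record.payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&record.payload);
    out
}

/// Parse one record, checking the magic and the declared payload length.
pub fn decode_record(bytes: &[u8]) -> Result<RecordSketch, DecodeError> {
    if bytes.len() < HEADER_LEN {
        return Err(DecodeError::TooShort);
    }
    if bytes[..4] != MAGIC {
        return Err(DecodeError::BadMagic);
    }
    let mut u32_buf = [0u8; 4];
    let mut u64_buf = [0u8; 8];
    u32_buf.copy_from_slice(&bytes[4..8]);
    let version = u32::from_le_bytes(u32_buf);
    u64_buf.copy_from_slice(&bytes[8..16]);
    let transaction_id = u64::from_le_bytes(u64_buf);
    u32_buf.copy_from_slice(&bytes[16..20]);
    let payload_len = u32::from_le_bytes(u32_buf) as usize;
    let payload = &bytes[HEADER_LEN..];
    if payload.len() != payload_len {
        return Err(DecodeError::LengthMismatch);
    }
    Ok(RecordSketch {
        version,
        transaction_id,
        payload: payload.to_vec(),
    })
}
```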

## Error Handling

### Error Code Mapping

The Chroma storage system uses a unified error code scheme:

| Error type | Error code | Description |
|------------|------------|-------------|
| `BlockLoadError::IOError` | Internal | I/O operation failed |
| `BlockLoadError::ArrowError` | Internal | Arrow parsing error |
| `BlockLoadError::CacheError` | Internal | Cache access error |
| `FromBytesError::InvalidArgument` | InvalidArgument | Argument validation failed |
| `FromBytesError::IdMismatch` | InvalidArgument | ID mismatch |
| `ArrowLayoutVerificationError` | Internal | Layout verification failed |

Sources: [rust/blockstore/src/arrow/block/types.rs:10-40](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/block/types.rs)
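
One way to picture the mapping in the table is a trait that every error type implements to report its code. The sketch below is a simplified stand-in: `HasErrorCode`, the `...Sketch` enums, and `is_caller_fault` are hypothetical names, and the variants carry plain strings instead of the real wrapped error types.

```rust
/// Hypothetical subset of the error codes used in the table above.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum ErrorCodes {
    Internal,
    InvalidArgument,
}

/// Minimal stand-in for a trait that lets every error report its code.
pub trait HasErrorCode {
    fn code(&self) -> ErrorCodes;
}

#[derive(Debug)]
pub enum BlockLoadErrorSketch {
    IOError(std::io::Error),
    ArrowError(String),
    CacheError(String),
}

#[derive(Debug)]
pub enum FromBytesErrorSketch {
    InvalidArgument,
    IdMismatch,
}

impl HasErrorCode for BlockLoadErrorSketch {
    fn code(&self) -> ErrorCodes {
        // Every load failure is treated as an internal error.
        match self {
            BlockLoadErrorSketch::IOError(_)
            | BlockLoadErrorSketch::ArrowError(_)
            | BlockLoadErrorSketch::CacheError(_) => ErrorCodes::Internal,
        }
    }
}

impl HasErrorCode for FromBytesErrorSketch {
    fn code(&self) -> ErrorCodes {
        // Both variants point at bad input rather than a server fault.
        match self {
            FromBytesErrorSketch::InvalidArgument | FromBytesErrorSketch::IdMismatch => {
                ErrorCodes::InvalidArgument
            }
        }
    }
}

/// Callers can branch on the code without knowing the concrete error type.
pub fn is_caller_fault(err: &dyn HasErrorCode) -> bool {
    err.code() == ErrorCodes::InvalidArgument
}
```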

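### Arrow Layout Verification

Arrow Blockfile performs strict layout verification on read:

1. **Buffer alignment check**: every buffer length must be a multiple of 64 bytes
2. **Magic check**: validates the Arrow IPC file header
3. **Record batch count check**: the file must contain exactly one record batch
4. **Message type check**: validates that the message is of the expected type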

```rust
#[derive(Error, Debug)]
pub enum ArrowLayoutVerificationError {
    #[error("Buffer length is not 64 byte aligned")]
    BufferLengthNotAligned,
    #[error("No record batches in footer")]
    NoRecordBatches,
    #[error("More than one record batch in IPC file")]
    MultipleRecordBatches,
    // ...
}
```

Sources: [rust/blockstore/src/arrow/block/types.rs:20-45](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/block/types.rs)
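
The first three checks can be sketched without the Arrow crate. The example below is an assumption-laden simplification, not the verification code Chroma runs: the buffer lengths and the record batch count are passed in as arguments, whereas a real reader would obtain them by parsing the IPC footer, and the message type check is left out. The only Arrow-specific fact it relies on is that an IPC file starts and ends with the magic string `ARROW1`.

```rust
/// Magic string at the start and end of an Arrow IPC file.
const ARROW_MAGIC: &[u8] = b"ARROW1";

/// Hypothetical error type covering only the checks in this sketch.
#[derive(Debug, PartialEq)]
pub enum LayoutErrorSketch {
    BufferLengthNotAligned,
    BadMagic,
    NoRecordBatches,
    MultipleRecordBatches,
}

/// Run the alignment, magic, and record batch count checks in order.
/// `buffer_lengths` and `record_batch_count` stand in for values that a
/// real reader would extract from the footer.
pub fn verify_layout_sketch(
    file_bytes: &[u8],
    buffer_lengths: &[usize],
    record_batch_count: usize,
) -> Result<(), LayoutErrorSketch> {
    // 1. Every buffer length must be a multiple of 64 bytes.
    if buffer_lengths.iter().any(|len| len % 64 != 0) {
        return Err(LayoutErrorSketch::BufferLengthNotAligned);
    }
    // 2. The file must begin and end with the Arrow magic string.
    if !file_bytes.starts_with(ARROW_MAGIC) || !file_bytes.ends_with(ARROW_MAGIC) {
        return Err(LayoutErrorSketch::BadMagic);
    }
    // 3. Exactly one record batch is allowed.
    match record_batch_count {
        0 => Err(LayoutErrorSketch::NoRecordBatches),
        1 => Ok(()),
        _ => Err(LayoutErrorSketch::MultipleRecordBatches),
    }
}
```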

## Data Flow and Operations

### Write Path

```mermaid
sequenceDiagram
    participant Client as Client
    participant Provider as BlockfileProvider
    participant Writer as ArrowBlockfileWriter
    participant BlockMgr as BlockManager
    participant Storage as Storage

    Client->>Provider: write(options)
    Provider->>Writer: create Writer
    Writer->>BlockMgr: create BlockDelta
    loop Write data
        Client->>Writer: set(key, value)
        Writer->>BlockMgr: flush if full
    end
    Writer->>Storage: persist blocks
    Storage-->>Writer: confirm
    Writer-->>Provider: Writer ready
    Provider-->>Client: return Writer
```

Sources: [rust/blockstore/src/arrow/blockfile.rs:80-120](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/blockfile.rs)
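
The "flush if full" step in the loop can be sketched as a writer that seals the current block whenever the next entry would push it past a byte budget. This is a simplified illustration under stated assumptions: `BlockWriterSketch` and its methods are hypothetical, entry size is approximated as key length plus value length, and the sorting and block splitting that a real writer would need are ignored.

```rust
/// One sealed block: the key/value entries it holds.
pub type BlockSketch = Vec<(String, Vec<u8>)>;

/// Hypothetical writer that seals a block once its byte budget would be exceeded.
pub struct BlockWriterSketch {
    max_block_size_bytes: usize,
    current: BlockSketch,
    current_bytes: usize,
    sealed: Vec<BlockSketch>,
}

impl BlockWriterSketch {
    pub fn new(max_block_size_bytes: usize) -> Self {
        Self {
            max_block_size_bytes,
            current: Vec::new(),
            current_bytes: 0,
            sealed: Vec::new(),
        }
    }

    /// Add one entry, flushing first if it would not fit in the current block.
    pub fn set(&mut self, key: &str, value: &[u8]) {
        let entry_bytes = key.len() + value.len();
        let would_overflow = self.current_bytes + entry_bytes > self.max_block_size_bytes;
        // An oversized entry still gets a block of its own, so only
        // flush when the current block already holds something.
        if would_overflow && !self.current.is_empty() {
            self.flush();
        }
        self.current.push((key.to_string(), value.to_vec()));
        self.current_bytes += entry_bytes;
    }

    /// Seal the current block and start a new, empty one.
    fn flush(&mut self) {
        let block = std::mem::take(&mut self.current);
        self.sealed.push(block);
        self.current_bytes = 0;
    }

    /// Seal the last partial block and hand back every block,
    /// standing in for the "persist blocks" step.
    pub fn commit(mut self) -> Vec<BlockSketch> {
        if !self.current.is_empty() {
            self.flush();
        }
        self.sealed
    }
}
```

With this shape, a larger `max_block_size_bytes` yields fewer, bigger blocks, and a smaller one yields more blocks that are each cheaper to load.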

### Read Path

```mermaid
sequenceDiagram
    participant Client as Client
    participant Provider as BlockfileProvider
    participant Reader as ArrowBlockfileReader
    participant Storage as Storage

    Client->>Provider: read(options)
    Provider->>Reader: create Reader
    Reader->>Storage: load Footer
    Storage-->>Reader: Footer bytes
    Reader->>Reader: parse Record Batch
    Reader-->>Provider: Reader ready
    Provider-->>Client: return Reader
    Client->>Reader: get(key)
    Reader-->>Client: value
```

Sources: [rust/blockstore/src/arrow/root.rs:30-80](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/root.rs)

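## Performance Tuning

### Block Size Configuration

| Parameter | Default | Description | Tuning advice |
|-----------|---------|-------------|---------------|
| `max_block_size_bytes` | System default | Maximum capacity of a single block | Larger values reduce the number of blocks; smaller values favor random access |
| `pl_block_size` | Implementation-specific value | Posting list block size | Affects inverted index performance |
| `hnsw_ef_search` | Configurable | HNSW search parameter | Trades recall against latency |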

Sources: [rust/index/src/spann/types.rs:50-80](https://github.com/chroma-core/chroma/blob/main/rust/index/src/spann/types.rs)

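### Prefetching

`BlockfileProvider` supports prefetching to improve read performance. Only the Arrow provider implements it; the `HashMapBlockfileProvider` arm is `unimplemented!()` and panics if called: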

```rust
pub async fn prefetch(
    &self,
    id: &uuid::Uuid,
    prefix_path: &str,
) -> Result<usize, Box<dyn ChromaError>> {
    match self {
        BlockfileProvider::HashMapBlockfileProvider(_) => unimplemented!(),
        BlockfileProvider::ArrowBlockfileProvider(provider) => provider
            .prefetch(id, prefix_path)
            .await
            .map_err(|e| Box::new(e) as _),
    }
}
```

Sources: [rust/blockstore/src/provider.rs:160-180](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/provider.rs)

## Security Features

### CMEK Support

The Chroma storage system supports customer-managed encryption keys (CMEK), which are attached through the writer options:

```rust
bf_options = bf_options.with_cmek(cmek);
```

Each blockfile can carry its own encryption configuration, which keeps sensitive data isolated.

Sources: [rust/blockstore/src/arrow/provider.rs:50-70](https://github.com/chroma-core/chroma/blob/main/rust/blockstore/src/arrow/provider.rs)

## Configuration Reference

### BlockfileWriterOptions

```rust
let mut bf_options = BlockfileWriterOptions::new(prefix_path.to_string())
    .max_block_size_bytes(pl_block_size);
bf_options = bf_options.unordered_mutations();
if let Some(cmek) = cmek {
    bf_options = bf_options.with_cmek(cmek);
}
```

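Key options:

- `prefix_path`: path prefix under which data is stored
- `max_block_size_bytes`: upper limit on block size
- `unordered_mutations`: enables unordered write mode
- `fork`: creates the writer as a fork of an existing blockfile
- `cmek`: encryption configuration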

Sources: [rust/index/src/spann/types.rs:100-130](https://github.com/chroma-core/chroma/blob/main/rust/index/src/spann/types.rs)
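
The chained calls in the snippet above follow the consuming builder pattern: each method takes `self` by value and returns the updated options. The stand-in below shows how such a type can be put together. It is a sketch, not the real `BlockfileWriterOptions`: the struct, its field types (a `u128` in place of `Uuid`, a `String` in place of `Cmek`), and the `Ordered` default taken from the options table are all assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MutationOrderingSketch {
    Ordered,
    Unordered,
}

/// Hypothetical stand-in for the writer options.
#[derive(Debug, Clone, PartialEq)]
pub struct WriterOptionsSketch {
    pub prefix_path: String,
    pub max_block_size_bytes: Option<usize>, // None means "use the engine default"
    pub mutation_ordering: MutationOrderingSketch,
    pub fork: Option<u128>,   // stand-in for Option<Uuid>
    pub cmek: Option<String>, // stand-in for Option<Cmek>
}

impl WriterOptionsSketch {
    /// Only the prefix path is required; everything else starts at its default.
    pub fn new(prefix_path: String) -> Self {
        Self {
            prefix_path,
            max_block_size_bytes: None,
            mutation_ordering: MutationOrderingSketch::Ordered,
            fork: None,
            cmek: None,
        }
    }

    pub fn max_block_size_bytes(mut self, bytes: usize) -> Self {
        self.max_block_size_bytes = Some(bytes);
        self
    }

    pub fn unordered_mutations(mut self) -> Self {
        self.mutation_ordering = MutationOrderingSketch::Unordered;
        self
    }

    pub fn fork(mut self, source_id: u128) -> Self {
        self.fork = Some(source_id);
        self
    }

    pub fn with_cmek(mut self, cmek: String) -> Self {
        self.cmek = Some(cmek);
        self
    }
}
```

Because every method consumes and returns the value, optional settings such as the CMEK can be applied conditionally by reassigning the binding, as the original snippet does.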

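## Summary

Chroma's storage system is a layered architecture. Its core characteristics are:

1. **Unified abstraction layer**: the `BlockfileProvider` interface supports multiple backend implementations
2. **Arrow columnar storage**: the Arrow IPC format provides efficient columnar data access
3. **Sparse index support**: `DirectoryBlock` and posting blocks support full-text search
4. **WAL3 log**: ensures atomicity and durability of transactions
5. **Security features**: supports CMEK, customer-managed encryption keys
6. **Flexible configuration**: supports advanced options such as block size, write ordering, and forking

This storage system gives Chroma's vector database features a solid persistence foundation while remaining extensible and performant.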

---

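## Doramagic Pitfall Log

Project: chroma-core/chroma

Summary: 6 potential pitfalls found, 0 of them high/blocking; highest priority: capability pitfall - capability assessment relies on assumptions.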

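## 1. Capability pitfall · Capability assessment relies on assumptions

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- Impact on users: if the assumption does not hold, users will not get the promised capability.
- Suggested check: turn the assumption into a downstream verification checklist.
- Safeguard: assumptions must be converted into verification items; until they are verified, they must not be stated as fact.
- Evidence: capability.assumptions | github_repo:546206616 | https://github.com/chroma-core/chroma | README/documentation is current enough for a first validation pass.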

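## 2. Maintenance pitfall · Maintenance activity unknown

- Severity: medium
- Evidence strength: source_linked
- Finding: `last_activity_observed` was not recorded.
- Impact on users: new, abandoned, and active projects get lumped together, which lowers confidence in the recommendation.
- Suggested check: add signals from recent GitHub commits, releases, and issue/PR responses.
- Safeguard: while maintenance activity is unknown, the recommendation must not be marked as high trust.
- Evidence: evidence.maintainer_signals | github_repo:546206616 | https://github.com/chroma-core/chroma | last_activity_observed missing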

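## 3. Security/permissions pitfall · Downstream validation found risk items

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- Impact on users: downstream has already requested a review, so this must not be played down on the page.
- Suggested check: add it to the security/permissions governance review queue.
- Safeguard: while downstream risk exists, the review/recommendation must stay downgraded.
- Evidence: downstream_validation.risk_items | github_repo:546206616 | https://github.com/chroma-core/chroma | no_demo; severity=medium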

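## 4. Security/permissions pitfall · Scoring risk present

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- Impact on users: the risk affects whether the project is suitable for ordinary users to install.
- Suggested check: record the risk on the boundary card and confirm whether manual review is needed.
- Safeguard: scoring risks must appear on the boundary card, not only as an internal score.
- Evidence: risks.scoring_risks | github_repo:546206616 | https://github.com/chroma-core/chroma | no_demo; severity=medium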

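## 5. Maintenance pitfall · Issue/PR response quality unknown

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- Impact on users: users cannot tell whether anyone will maintain the project when they run into a problem.
- Suggested check: sample recent issues/PRs to see whether they go unhandled for long periods.
- Safeguard: while issue/PR responsiveness is unknown, the maintenance risk must be flagged.
- Evidence: evidence.maintainer_signals | github_repo:546206616 | https://github.com/chroma-core/chroma | issue_or_pr_quality=unknown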

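## 6. Maintenance pitfall · Release cadence unclear

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- Impact on users: install commands and documentation may lag behind the code, which makes pitfalls more likely.
- Suggested check: confirm that the latest release/tag matches the install commands in the README.
- Safeguard: when the release cadence is unknown or stale, install instructions must be marked as possibly out of date.
- Evidence: evidence.maintainer_signals | github_repo:546206616 | https://github.com/chroma-core/chroma | release_recency=unknown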

<!-- canonical_name: chroma-core/chroma; human_manual_source: deepwiki_human_wiki -->