# dsrag - Doramagic AI Context Pack

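> Positioning: a pre-installation asset for trying out and assessing the project. It helps the host AI get off to a good start, but it does not mean the target project has been installed, executed, or verified.

## Sufficiency principle

- **Sufficiency, not compression**: The AI Context Pack should be thorough enough for the host AI to understand the project's value, capability boundaries, entry points, risks, and evidence sources before starting work. It may be organized in layers, but the shortest possible summary is not the goal.
- **Compression strategy**: Compress only noise and repetition, never context that affects judgment or the quality of the initial work.

## How the host AI should use this

You are reading the AI Context Pack that Doramagic compiled for dsrag. Treat it as pre-work context: it helps the user understand who the project suits, what it can do, how to get started, what must be verified after installation, and where the risks lie. Do not claim that you have installed, run, or executed the target project.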

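## Claim consumption rules

- **Source of facts**: Repo Evidence + Claim/Evidence Graph. The Human Wiki only supplies salience, terminology, and narrative structure.
- **Minimum status for facts**: `supported`
- `supported`: May be used as a project fact, but the answer must cite the claim_id and the evidence path.
- `weak`: A low-confidence lead only; the user must be asked to verify it further.
- `inferred`: May only be used for risk hints or open questions; it must not be presented as a project fact.
- `unverified`: Must not be used as a fact; state clearly that the evidence is insufficient.
- `contradicted`: The conflicting sources must be shown; do not pick one version on the user's behalf.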
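
The status rules above can be sketched as a small lookup. This is a minimal illustration, not part of the pack: the `Claim` record, the `render` helper, and the fallback for unknown statuses are all assumptions made for the example.

```python
from dataclasses import dataclass

# How each claim status may be used, paraphrasing the rules above.
USAGE_BY_STATUS = {
    "supported": "fact",
    "weak": "lead",
    "inferred": "risk",
    "unverified": "insufficient",
    "contradicted": "conflict",
}


@dataclass
class Claim:  # hypothetical record shape, not the pack's schema
    claim_id: str
    status: str
    evidence_path: str
    text: str


def render(claim: Claim) -> str:
    """Phrase a claim according to its status."""
    # Unknown statuses fall back to "insufficient" (an assumption of this sketch).
    usage = USAGE_BY_STATUS.get(claim.status, "insufficient")
    cite = f"[{claim.claim_id}, {claim.evidence_path}]"
    if usage == "fact":
        return f"{claim.text} {cite}"
    if usage == "lead":
        return f"Low-confidence lead, please verify: {claim.text} {cite}"
    if usage == "risk":
        return f"Risk or open question: {claim.text}"
    if usage == "conflict":
        return f"Sources conflict, both shown, none chosen: {claim.text} {cite}"
    return f"Insufficient evidence: {claim.text}"
```

Only the `supported` branch emits the text as a plain statement with a citation; every other branch prefixes a label so the reader cannot mistake it for a project fact.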

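## Who it suits best

- **AI researchers or builders of research-oriented agents**: The README is explicitly organized around research, experiment, or paper workflows. Evidence: `README.md` Claim: `clm_0002` supported 0.86
- **Developers already using a host AI such as Claude, Codex, Cursor, or Gemini**: The README or plugin configuration mentions multiple host AIs. Evidence: `README.md` Claim: `clm_0003` supported 0.86

## What it can do

- **Command-line startup or installation flow** (must be verified after installation): The project documentation contains executable commands; real use requires running them locally or in the host environment. Evidence: `README.md` Claim: `clm_0001` supported 0.86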

## How to get started

- `pip install dsrag` Evidence: `README.md` Claim: `clm_0004` supported 0.86, `clm_0005` supported 0.86, `clm_0006` supported 0.86, `clm_0007` supported 0.86, and others
- `pip install dsrag[faiss]` Evidence: `README.md` Claim: `clm_0005` supported 0.86
- `pip install dsrag[chroma]` Evidence: `README.md` Claim: `clm_0006` supported 0.86
- `pip install dsrag[weaviate]` Evidence: `README.md` Claim: `clm_0007` supported 0.86
- `pip install dsrag[qdrant]` Evidence: `README.md` Claim: `clm_0008` supported 0.86
- `pip install dsrag[milvus]` Evidence: `README.md` Claim: `clm_0009` supported 0.86
- `pip install dsrag[pinecone]` Evidence: `README.md` Claim: `clm_0010` supported 0.86
- `pip install dsrag[all-vector-dbs]` Evidence: `README.md` Claim: `clm_0011` supported 0.86
- `pip install dsrag[all]` Evidence: `README.md` Claim: `clm_0011` supported 0.86, `clm_0012` supported 0.86
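
As a hedged sketch of how a host tool might assemble one of the commands above without executing it, the helper below builds the argument list only. The function name `pip_install_args` and the `KNOWN_EXTRAS` set are hypothetical; the extras are copied from the list above, and whether each one resolves must still be checked after installation.

```python
import sys
from typing import List, Optional

# Extras copied from the command list above; not verified against the package.
KNOWN_EXTRAS = {
    "faiss", "chroma", "weaviate", "qdrant",
    "milvus", "pinecone", "all-vector-dbs", "all",
}


def pip_install_args(extra: Optional[str] = None,
                     python: str = sys.executable) -> List[str]:
    """Return the argv for a pip install. Nothing is executed here."""
    if extra is not None and extra not in KNOWN_EXTRAS:
        raise ValueError(f"extra not listed in the context pack: {extra}")
    target = "dsrag" if extra is None else f"dsrag[{extra}]"
    return [python, "-m", "pip", "install", target]


if __name__ == "__main__":
    # Print the command for review instead of running it.
    print(" ".join(pip_install_args("faiss")))
```

Returning a list rather than a shell string keeps the square brackets away from shell globbing (zsh, for example, treats an unquoted `dsrag[faiss]` as a pattern), and it leaves the decision to run the command with the user.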

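## Pre-continuation decision card

- **Current recommendation**: Administrator/security approval required
- **Why**: Continuing may involve keys, accounts, external services, or sensitive context, so obtain administrator or security approval first.

### 30-second decision

- **What to do now**: Obtain administrator/security approval
- **Minimal safe next step**: Run the Prompt Preview first; if credentials or an enterprise environment are involved, get approval before a trial installation
- **Do not trust yet**: Real output quality cannot be trusted before installation.
- **Continuing will touch**: command execution, the local environment or project files, environment variables / API keys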

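### What you can trust now

- **Audience hint: AI researchers or builders of research-oriented agents** (supported): Backed by a supported claim or project evidence, but this still says nothing about real post-installation results. Evidence: `README.md` Claim: `clm_0002` supported 0.86
- **Audience hint: developers already using a host AI such as Claude, Codex, Cursor, or Gemini** (supported): Backed by a supported claim or project evidence, but this still says nothing about real post-installation results. Evidence: `README.md` Claim: `clm_0003` supported 0.86
- **Capability exists: command-line startup or installation flow** (supported): You can trust that the project contains hints of this capability; whether it fits your specific task still requires a trial or post-installation verification. Evidence: `README.md` Claim: `clm_0001` supported 0.86
- **Quick Start / installation command hints exist** (supported): You can trust that the project documentation contains a startup or installation entry point; do not take that as a reason to run it directly in your main environment. Evidence: `README.md` Claim: `clm_0004` supported 0.86, `clm_0005` supported 0.86, `clm_0006` supported 0.86, `clm_0007` supported 0.86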

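### What you cannot trust yet

- **Real output quality cannot be trusted before installation.** (unverified): The Prompt Preview only shows how the project guides the user; it cannot prove result quality in a real project.
- **Host AI version compatibility cannot be trusted before installation.** (unverified): Loading rules and version differences across hosts such as Claude, Cursor, Codex, and Gemini must be verified in a real environment.
- **The assumption that existing host AI behavior will not be polluted cannot be taken on trust.** (inferred): Skill, plugin, and AGENTS/CLAUDE/GEMINI instructions may change the host AI's default behavior.
- **Safe rollback cannot be assumed.** (unverified): Unless the project explicitly provides uninstall and restore instructions, verify in an isolated environment first.
- **After a real installation, is it compatible with the user's current host AI version?** (unverified): Compatibility can only be verified in the actual host environment.
- **Does the project's output quality meet the user's specific task?** (unverified): A pre-installation preview only shows the flow and boundaries; it cannot replace a real evaluation.
- **Does the installation command need network access, elevated permissions, or global writes?** (unverified): This affects installation risk in both enterprise and personal environments. Evidence: `README.md`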

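### What continuing will touch

- **Command execution**: the package manager, network downloads, the local plugin directory, project configuration, or the user's home directory. Reason: running the very first command may already change the environment, so decide first whether it is worth running. Evidence: `README.md`
- **Local environment or project files**: installation results, plugin caches, project configuration, or local dependency directories. Reason: the write scope and rollback method cannot be proven before installation, so isolated verification is needed. Evidence: `README.md`
- **Environment variables / API keys**: the project's entry documentation explicitly mentions API key, token, secret, or account credential configuration. Reason: if a real installation needs credentials, use test credentials first and go through a permission/compliance review. Evidence: `README.md`, `docs/getting-started/installation.md`, `docs/getting-started/quickstart.md`, `dsrag/chat/instructor_get_response.py`, and others
- **Host AI context**: the AI Context Pack, Prompt Preview, Skill routing, risk rules, and project facts. Reason: imported context influences the host AI's later judgments, so unverified items must not be presented as facts.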

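### Minimal safe next steps

- **Run the Prompt Preview first**: Use the interactive pre-installation trial to judge whether the way of working fits; it needs no authorization and changes nothing in the environment. (Applies to: any project, especially when output quality is unknown.)
- **Trial-install only in an isolated directory or with a test account**: This keeps installation commands from polluting the main host AI, real projects, or the user's home directory. (Applies to: cases with hints of command execution, plugin configuration, or local writes.)
- **Do not use real production credentials**: Once an environment variable or API key enters the host or toolchain, it can create account and compliance risk. (Applies to: cases where environment hints such as API, TOKEN, KEY, or SECRET appear.)
- **After installation, verify only one minimal task**: Verify loading, compatibility, output quality, and rollback first, then decide whether to adopt it more deeply. (Applies to: moving from a trial into a real workflow.)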
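
One way to act on the credential item above is to list which credential-like variables are present in the current shell before a trial installation. The sketch below is illustrative only: `sensitive_env_names` is a hypothetical helper, and the name fragments come from the checklist above rather than from dsrag itself.

```python
import os
import re
from typing import List, Mapping, Optional

# Name fragments taken from the checklist above (API, TOKEN, KEY, SECRET).
SENSITIVE = re.compile(r"API|TOKEN|KEY|SECRET", re.IGNORECASE)


def sensitive_env_names(environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names (never the values) of credential-like variables."""
    env = os.environ if environ is None else environ
    return sorted(name for name in env if SENSITIVE.search(name))


if __name__ == "__main__":
    for name in sensitive_env_names():
        # Only names are printed, so the output is safe to paste into a review.
        print(name)
```

The match is deliberately broad, so unrelated names that happen to contain `KEY` will also appear; the list is a prompt for manual review, not a verdict.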

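### Exit strategy

- **Preserve the pre-installation state**: Record the original host configuration and project state; only then can you later judge whether it is recoverable.
- **Record the installation commands and write paths**: When there are no explicit uninstall instructions, you at least need to know which directories or configuration files must be cleaned up manually.
- **Be ready to revoke the test API key or token**: If a test credential leaks or is misused, you can cut losses quickly.
- **If there is no rollback path, stay out of the main environment**: Irreversibility is a blocker before continuing; do not proceed on trust or luck.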
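
Recording the pre-installation state and the write paths can be approximated by hashing a directory before and after a trial installation. This is a minimal sketch under stated assumptions: `snapshot` and `diff` are hypothetical helpers, and they cover only regular files under one root, not system-wide changes.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def snapshot(root: str) -> Dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    base = Path(root)
    state = {}
    for path in sorted(base.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            state[path.relative_to(base).as_posix()] = digest
    return state


def diff(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, List[str]]:
    """List paths added, removed, or changed between two snapshots."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(p for p in shared if before[p] != after[p]),
    }
```

Taking one snapshot of the isolated trial directory before the install command and one after gives a concrete list of paths to clean up if no uninstall instructions exist.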

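## What can only be previewed

- Explaining who the project suits and what it can do
- Demonstrating a typical conversation flow based on the project documentation
- Helping the user judge whether the project is worth installing or researching further

## What must be verified after installation

- Actually installing a Skill, plugin, or CLI
- Executing scripts, modifying local files, or accessing external services
- Verifying real output quality, performance, and compatibility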

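## Boundary and risk card

- **Mistaking the pre-installation preview for a real run**: Users may overestimate how much configuration, permission, and compatibility verification the project has already completed. Handling: clearly separate prompt_preview_can_do from runtime_required. Claim: `clm_0013` inferred 0.45
- **Command execution modifies the local environment**: Installation commands may write to the user's home directory, the host plugin directory, or project configuration. Handling: run them first in an isolated environment or with a test account. Evidence: `README.md` Claim: `clm_0014` supported 0.86
- **To be confirmed**: After a real installation, is it compatible with the user's current host AI version? Reason: compatibility can only be verified in the actual host environment.
- **To be confirmed**: Does the project's output quality meet the user's specific task? Reason: a pre-installation preview only shows the flow and boundaries; it cannot replace a real evaluation.
- **To be confirmed**: Does the installation command need network access, elevated permissions, or global writes? Reason: this affects installation risk in both enterprise and personal environments.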

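## Pre-work working context

### Load order

- Read how_to_use.host_ai_instruction first to establish the boundaries of this pre-installation assessment asset.
- Read claim_graph_summary to confirm that facts come from the Claim/Evidence Graph rather than the Human Wiki narrative.
- Then read intended_users, capabilities, and quick_start_candidates to judge whether the user is a match.
- When a concrete task has to be carried out, consult role_skill_index first, then evidence_index.
- For questions about real installation, file modification, network access, performance, or compatibility, switch to risk_card and boundaries.runtime_required.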
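
The load order above can be sketched as an ordered list of field paths resolved against the pack. The field names are the ones listed above; representing the pack as nested dictionaries, and the helper names `lookup` and `load_in_order`, are assumptions of this example. The sketch also flattens the conditional steps (task execution, runtime questions) into a simple priority order.

```python
from typing import Any, List, Tuple

# Field paths in the order listed above; the nested-dict layout is assumed.
LOAD_ORDER = [
    "how_to_use.host_ai_instruction",
    "claim_graph_summary",
    "intended_users",
    "capabilities",
    "quick_start_candidates",
    "role_skill_index",
    "evidence_index",
    "risk_card",
    "boundaries.runtime_required",
]


def lookup(pack: dict, dotted: str) -> Any:
    """Follow a dotted path through nested dicts; None if any part is missing."""
    node: Any = pack
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def load_in_order(pack: dict) -> List[Tuple[str, Any]]:
    """Return (path, value) pairs in load order, skipping absent fields."""
    return [(p, v) for p in LOAD_ORDER if (v := lookup(pack, p)) is not None]
```

Skipping absent fields rather than raising keeps the loader usable on partial packs; a stricter variant could fail when `how_to_use.host_ai_instruction` is missing, since that field sets the boundaries for everything after it.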

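### Task routing

- **Command-line startup or installation flow**: First explain that this capability must be verified after installation, then provide a pre-installation checklist. Boundary: it can only be verified by actually installing or running the project. Evidence: `README.md` Claim: `clm_0001` supported 0.86

### Context scale

- Total files: 111
- Important file coverage: 40/111
- Evidence index entries: 80
- Role / Skill entries: 18

### Handling insufficient evidence

- **missing_evidence**: State that the evidence is insufficient and ask the user for the target file, README passage, or post-installation verification record; do not fill in facts.
- **out_of_scope_request**: State that the task is beyond the evidence scope of this AI Context Pack, and suggest that the user consult the Human Manual or verify after a real installation.
- **runtime_request**: Provide a pre-installation checklist and the source of each command, but do not run commands for the user or claim to have run them.
- **source_conflict**: Show the conflicting sources side by side and mark them as pending verification; do not force a choice of one version.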

## Prompt Recipes

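### Fit assessment

- Goal: Judge whether this project suits the user's current task.
- Expected output: a fit conclusion, key reasons, evidence citations, what can be previewed before installation, what must be verified after installation, and suggested next steps.

```text
Based on the dsrag AI Context Pack, first ask me 3 necessary questions, then judge whether it suits my task. The answer must cover: who it suits, what it can do, what it cannot do, whether it is worth installing, and where the evidence comes from. Every project fact must cite evidence_refs, source_paths, or a claim_id.
```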

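### Pre-installation experience

- Goal: Let the user get a feel for the core workflow before installation, without presenting the preview as a real capability or a marketing promise.
- Expected output: an experience script with boundary labels, a post-installation verification checklist, and a cautious recommendation; no promises of real execution and no hard-sell wording.

```text
Treat dsrag as a pre-installation experience asset, not as an installed tool or a real runtime environment.

Output exactly four sections:
1. First ask me 3 necessary questions.
2. Give an "experience script": use the three labels [previewable before installation], [must be verified after installation], and [insufficient evidence] to show how it might guide the workflow.
3. Give a post-installation verification checklist: list which capabilities can only be confirmed after a real installation, real host loading, and a real project run.
4. Give a cautious recommendation: you may only say "worth further research / a trial installation", "provide more information before deciding", or "not recommended to continue"; do not endorse the project.

Hard boundaries:
- Do not claim to have installed, run, executed tests, modified files, or produced real results.
- Do not use promissory wording such as "adapts automatically", "guaranteed to pass", "perfect fit", or "strongly recommend installing".
- When describing how it works after installation, you must use a conditional such as "If installation succeeds and the host loads the Skill correctly, it may ...".
- The experience script may only be written as "sample lines / a hypothetical flow": use "may ask / may suggest / may show", never "has written, has generated, has passed, is running, is generating".
- The Prompt Preview does not hand out installation commands; if the user is ready for a trial installation, only advise reading the Quick Start and Risk Card first and verifying in an isolated environment.
- Every project fact must come from a supported claim, evidence_refs, or source_paths; inferred/unverified items may only appear as risks or items to be confirmed.
```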

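### Role / Skill selection

- Goal: Pick the best-matching assets from the project's roles or Skills.
- Expected output: a list of candidate roles or Skills, each with applicable scenarios, evidence paths, risk boundaries, and whether post-installation verification is needed.

```text
Read role_skill_index and recommend the 3-5 roles or Skills most relevant to my target task. For each recommendation, explain the applicable scenario, likely output, risk boundary, and evidence_refs.
```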

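### Risk pre-check

- Goal: Identify environment, permission, rule-conflict, and quality risks before installing or adopting the project.
- Expected output: a checklist covering environment, permissions, dependencies, licensing, host conflicts, quality risks, and unknowns.

```text
Based on risk_card, boundaries, and quick_start_candidates, give me a pre-installation risk checklist. Do not run commands for me; only explain what I should check, why, and what happens if a check fails.
```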

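### Host AI kickoff instruction

- Goal: Turn the project context into an instruction for the host AI at the start of a conversation.
- Expected output: a kickoff instruction with clear boundaries and explicit evidence citations, ready to paste into the host AI.

```text
Based on the dsrag AI Context Pack, generate a kickoff instruction that I can paste into the host AI. The instruction must respect not_runtime=true and must not claim that the project has been installed, run, or produced real results.
```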

## Role / Skill index

- 18 role / Skill / project documentation entries are indexed in total.

- **dsRAG** (project_doc): README badge links: Discord (https://discord.gg/NTUVX9DmQ3), Documentation (https://d-star-ai.github.io/dsRAG/), Ask DeepWiki (https://deepwiki.com/D-Star-AI/dsRAG). Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `README.md`
- **dsParse** (project_doc): dsParse is a sub-module of dsRAG that does multimodal file parsing, semantic sectioning, and chunking. You provide a file path and some config params and receive nice clean chunks. Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `dsrag/dsparse/README.md`
- **Contributing to dsRAG** (project_doc): We welcome contributions from the community! Whether it's fixing bugs, improving documentation, or proposing new features, your contributions make dsRAG better for everyone. Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `docs/community/contributing.md`
- **Welcome to dsRAG** (project_doc): dsRAG is a retrieval engine for unstructured data. It is especially good at handling challenging queries over dense text, like financial reports, legal documents, and academic papers. dsRAG achieves substantially higher accuracy than vanilla RAG baselines on complex open-book question answering tasks. On one especially challenging benchmark, FinanceBench https://arxiv.org/abs/2311.11944 , dsRAG gets accurate answers… Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `docs/index.md`
- **Chat** (project_doc): The Chat module provides functionality for managing chat-based interactions with knowledge bases. It handles chat thread creation, message history, and generating responses using knowledge base search. Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `docs/api/chat.md`
- **KnowledgeBase** (project_doc): The KnowledgeBase class is the main interface for working with dsRAG. It handles document processing, storage, and retrieval. Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `docs/api/knowledge_base.md`
- **Professional Services** (project_doc): The creators of dsRAG, Zach and Nick McCormick, run a specialized applied AI consulting firm. As former startup founders and YC alums, we bring a unique business and product-centric perspective to highly technical engineering projects. Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `docs/community/professional-services.md`
- **Community Support** (project_doc): Join our Discord https://discord.gg/NTUVX9DmQ3 to: Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `docs/community/support.md`
- **Chat** (project_doc): The Chat functionality in dsRAG provides a powerful way to interact with your knowledge bases through a conversational interface. It handles message history, knowledge base searching, and citation tracking automatically. Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `docs/concepts/chat.md`
- **Citations** (project_doc): dsRAG's citation system ensures that responses are grounded in your knowledge base content and provides transparency about information sources. Citations are included automatically in chat responses. Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `docs/concepts/citations.md`
- **Components** (project_doc): There are six key components that define the configuration of a KnowledgeBase. Each component is customizable, with several built-in options available. Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `docs/concepts/components.md`
- **Configuration** (project_doc): dsRAG uses several configuration dictionaries to organize its many parameters. These configs can be passed to different methods of the KnowledgeBase class. Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `docs/concepts/config.md`
- **Knowledge Bases** (project_doc): A knowledge base in dsRAG is a searchable collection of documents that can be queried to find relevant information. The KnowledgeBase class handles document processing, storage, and retrieval. Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `docs/concepts/knowledge-bases.md`
- **Logging Framework** (project_doc): This document provides an overview of the logging framework implemented in dsRAG. Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `docs/concepts/logging.md`
- **Architecture Overview** (project_doc): dsRAG is built around three key methods that improve performance over vanilla RAG systems: Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `docs/concepts/overview.md`
- **VLM File Parsing** (project_doc): dsRAG supports Vision Language Model (VLM) integration for enhanced PDF parsing capabilities. This feature is particularly useful for documents with complex layouts, tables, diagrams, or other visual elements that traditional text extraction might miss. Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `docs/concepts/vlm.md`
- **Installation** (project_doc): If you want to use VLM file parsing, you will need one non-Python dependency: poppler. This is used for converting PDFs to images. On MacOS, you can install it using Homebrew: Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `docs/getting-started/installation.md`
- **Quick Start Guide** (project_doc): This guide will help you get started with dsRAG quickly. We'll cover the basics of creating a knowledge base and querying it. Activation hint: consult when the user needs to understand the project structure, installation method, or boundaries. Evidence: `docs/getting-started/quickstart.md`

## Evidence index

- 80 evidence entries are indexed in total.

- **dsRAG** (documentation): README badge links: Discord (https://discord.gg/NTUVX9DmQ3), Documentation (https://d-star-ai.github.io/dsRAG/), Ask DeepWiki (https://deepwiki.com/D-Star-AI/dsRAG). Evidence: `README.md`
- **dsParse** (documentation): dsParse is a sub-module of dsRAG that does multimodal file parsing, semantic sectioning, and chunking. You provide a file path and some config params and receive nice clean chunks. Evidence: `dsrag/dsparse/README.md`
- **Contributing to dsRAG** (documentation): We welcome contributions from the community! Whether it's fixing bugs, improving documentation, or proposing new features, your contributions make dsRAG better for everyone. Evidence: `docs/community/contributing.md`
- **License** (source_file): Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: Evidence: `LICENSE`
- **Welcome to dsRAG** (documentation): dsRAG is a retrieval engine for unstructured data. It is especially good at handling challenging queries over dense text, like financial reports, legal documents, and academic papers. dsRAG achieves substantially higher accuracy than vanilla RAG baselines on complex open-book question answering tasks. On one especially challenging benchmark, FinanceBench https://arxiv.org/abs/2311.11944 , dsRAG gets accurate answers 96.6% of the time, compared to the vanilla RAG baseline which only gets 32% of questions correct. Evidence: `docs/index.md`
- **Chat** (documentation): The Chat module provides functionality for managing chat-based interactions with knowledge bases. It handles chat thread creation, message history, and generating responses using knowledge base search. Evidence: `docs/api/chat.md`
- **KnowledgeBase** (documentation): The KnowledgeBase class is the main interface for working with dsRAG. It handles document processing, storage, and retrieval. Evidence: `docs/api/knowledge_base.md`
- **Professional Services** (documentation): The creators of dsRAG, Zach and Nick McCormick, run a specialized applied AI consulting firm. As former startup founders and YC alums, we bring a unique business and product-centric perspective to highly technical engineering projects. Evidence: `docs/community/professional-services.md`
- **Community Support** (documentation): Join our Discord https://discord.gg/NTUVX9DmQ3 to: Evidence: `docs/community/support.md`
- **Chat** (documentation): The Chat functionality in dsRAG provides a powerful way to interact with your knowledge bases through a conversational interface. It handles message history, knowledge base searching, and citation tracking automatically. Evidence: `docs/concepts/chat.md`
- **Citations** (documentation): dsRAG's citation system ensures that responses are grounded in your knowledge base content and provides transparency about information sources. Citations are included automatically in chat responses. Evidence: `docs/concepts/citations.md`
- **Components** (documentation): There are six key components that define the configuration of a KnowledgeBase. Each component is customizable, with several built-in options available. Evidence: `docs/concepts/components.md`
- **Configuration** (documentation): dsRAG uses several configuration dictionaries to organize its many parameters. These configs can be passed to different methods of the KnowledgeBase class. Evidence: `docs/concepts/config.md`
- **Knowledge Bases** (documentation): A knowledge base in dsRAG is a searchable collection of documents that can be queried to find relevant information. The KnowledgeBase class handles document processing, storage, and retrieval. Evidence: `docs/concepts/knowledge-bases.md`
- **Logging Framework** (documentation): This document provides an overview of the logging framework implemented in dsRAG. Evidence: `docs/concepts/logging.md`
- **Architecture Overview** (documentation): dsRAG is built around three key methods that improve performance over vanilla RAG systems: Evidence: `docs/concepts/overview.md`
- **VLM File Parsing** (documentation): dsRAG supports Vision Language Model (VLM) integration for enhanced PDF parsing capabilities. This feature is particularly useful for documents with complex layouts, tables, diagrams, or other visual elements that traditional text extraction might miss. Evidence: `docs/concepts/vlm.md`
- **Installation** (documentation): If you want to use VLM file parsing, you will need one non-Python dependency: poppler. This is used for converting PDFs to images. On MacOS, you can install it using Homebrew: Evidence: `docs/getting-started/installation.md`
- **Quick Start Guide** (documentation): This guide will help you get started with dsRAG quickly. We'll cover the basics of creating a knowledge base and querying it. Evidence: `docs/getting-started/quickstart.md`
- **see if we need to add an addendum about non-English responses** (source_file): DOCUMENT TITLE PROMPT = """ DOCUMENT SUMMARIZATION PROMPT = """ TRUNCATION MESSAGE = """ SECTION SUMMARIZATION PROMPT = """ LANGUAGE ADDENDUM = "YOU MUST use the same language as the document for your entire response. If the document is in English, your response MUST BE entirely in English. If the document is in another language, your response MUST BE entirely in that language." def truncate content content: str, max tokens: int ⋮---- TOKEN ENCODER = tiktoken.encoding for model 'gpt-3.5-turbo' tokens = TOKEN ENCODER.encode content, disallowed special= truncated tokens = tokens :max tokens ⋮---- def get document title auto context model: LLM, document text: str, document title guidance: str… Evidence: `dsrag/auto_context.py`
- **Embedding** (source_file): dimensionality = { class Embedding ABC ⋮---- subclasses = {} def init self, dimension: Optional int = None def init subclass cls, kwargs def to dict self ⋮---- @classmethod def from dict cls, config - "Embedding" ⋮---- subclass name = config.pop subclass = cls.subclasses.get subclass name ⋮---- @abstractmethod def get embeddings self, text: list str , input type: Optional str - list Vector class OpenAIEmbedding Embedding ⋮---- def init self, model: str = "text-embedding-3-small", dimension: int = 768 ⋮---- base url = os.environ.get "DSRAG OPENAI BASE URL", None ⋮---- def get embeddings self, text: list str , input type: Optional str = None - list Vector ⋮---- response = self.client.embeddin… Evidence: `dsrag/embedding.py`
- **If the file system does not exist but is provided, use the provided file system** (source_file): class KnowledgeBase ⋮---- created time = int time.time ⋮---- def get metadata path self ⋮---- def save self ⋮---- components = { full data = { self.kb metadata, "components": components} ⋮---- def load self, auto context model=None, reranker=None, file system=None, chunk db=None, vector db=None, vlm client: Optional VLM = None ⋮---- data = self.metadata storage.load self.kb id ⋮---- components = data.get "components", {} ⋮---- base extra = {"kb id": self.kb id} ⋮---- file system dict = components.get "file system", None ⋮---- If the file system does not exist but is provided, use the provided file system ⋮---- If the file system dict exists, use it ⋮---- If the file system does not exist an… Evidence: `dsrag/knowledge_base.py`
- **Pre-initialize the model object** (source_file): class LLM ABC ⋮---- subclasses = {} def init subclass cls, kwargs def to dict self ⋮---- @classmethod def from dict cls, config ⋮---- subclass name = config.pop 'subclass name', None subclass = cls.subclasses.get subclass name ⋮---- @abstractmethod def make llm call self, chat messages: list dict - str ⋮---- """ Takes in chat messages OpenAI format and returns the response from the LLM as a string. """ ⋮---- class OpenAIChatAPI LLM ⋮---- def init self, model: str = "gpt-4o-mini", temperature: float = 0.2, max tokens: int = 1000 def make llm call self, chat messages: list dict - str ⋮---- base url = os.environ.get "DSRAG OPENAI BASE URL", None ⋮---- client = openai.OpenAI api key=os.environ… Evidence: `dsrag/llm.py`
- **Reranker** (source_file): class Reranker ABC ⋮---- subclasses = {} def init subclass cls, kwargs def to dict self ⋮---- @classmethod def from dict cls, config ⋮---- subclass name = config.pop 'subclass name', None subclass = cls.subclasses.get subclass name ⋮---- @abstractmethod def rerank search results self, query: str, search results: list - list class CohereReranker Reranker ⋮---- def init self, model: str = "rerank-english-v3.0" ⋮---- cohere api key = os.environ 'CO API KEY' base url = os.environ.get "DSRAG COHERE BASE URL", None ⋮---- def transform self, x def rerank search results self, query: str, search results: list - list ⋮---- documents = ⋮---- reranked results = self.client.rerank model=self.model, quer… Evidence: `dsrag/reranker.py`
- **Rse** (source_file): def get best segments all relevance values: list list , document splits: list int , max length: int, overall max length: int, minimum value: float ⋮---- best segments = scores = total length = 0 rv index = 0 bad rv indices = ⋮---- relevance values = all relevance values rv index best segment = None best value = -1000 ⋮---- segment value = sum relevance values start:end ⋮---- best value = segment value best segment = start, end ⋮---- def get meta document all ranked results: list list , top k for document selection: int ⋮---- top document ids = ⋮---- unique document ids = list set top document ids document splits = document start points = {} ⋮---- max chunk index = -1 ⋮---- max chunk index =… Evidence: `dsrag/rse.py`
- **Chat Types** (source_file): class ChatThreadParams TypedDict ⋮---- kb ids: Optional list str model: Optional str temperature: Optional float system message: Optional str auto query model: Optional str auto query guidance: Optional str rse params: Optional dict target output length: Optional str max chat history tokens: Optional int class ChatResponseOutput TypedDict ⋮---- response: str metadata: dict class MetadataFilter TypedDict ⋮---- field: str operator: Literal 'equals', 'not equals', 'in', 'not in', 'greater than', 'less than', 'greater than equals', 'less than equals' value: Union str, int, float, list str , list int , list float class ChatResponseInput BaseModel ⋮---- user input: str chat thread params: Optiona… Evidence: `dsrag/chat/chat_types.py`
- **Basic Db** (source_file): class BasicChatThreadDB ChatThreadDB ⋮---- def init self def create chat thread self, chat thread params: dict - dict ⋮---- chat thread = { ⋮---- def list chat threads self - list dict def get chat thread self, thread id: str - dict def update chat thread self, thread id: str, chat thread params: dict - dict def delete chat thread self, thread id: str - dict ⋮---- chat thread = self.chat threads.pop thread id, None ⋮---- def add interaction self, thread id: str, interaction: dict - dict ⋮---- message id = str uuid.uuid4 ⋮---- def update interaction self, thread id: str, message id: str, interaction update: dict - dict def save self def load self Evidence: `dsrag/database/chat_thread/basic_db.py`
- **Db** (source_file): class ChatThreadDB ABC ⋮---- @abstractmethod def create chat thread self, chat thread params: dict - dict ⋮---- @abstractmethod def list chat threads self - list dict ⋮---- @abstractmethod def get chat thread self, thread id: str - dict ⋮---- @abstractmethod def update chat thread self, thread id: str, chat thread params: dict - dict ⋮---- @abstractmethod def delete chat thread self, thread id: str - dict ⋮---- @abstractmethod def add interaction self, thread id: str, interaction: dict - dict ⋮---- @abstractmethod def update interaction self, thread id: str, message id: str, interaction update: dict - dict Evidence: `dsrag/database/chat_thread/db.py`
- **Create the interactions table. The thread id column is a foreign key that references the thread id column in the chat t…** (source_file): class SQLiteChatThreadDB ChatThreadDB ⋮---- def init self, storage directory: str = "~/dsRAG" ⋮---- conn = sqlite3.connect self.db path c = conn.cursor result = c.execute f"SELECT name FROM sqlite master WHERE type='table' AND name='chat threads'" ⋮---- query statement = f"CREATE TABLE chat threads {', '.join f'{column} {column type}' for column, column type in zip self.chat thread columns, self.chat thread column types } " ⋮---- Create the interactions table. The thread id column is a foreign key that references the thread id column in the chat threads table query statement = f"CREATE TABLE interactions {', '.join f'{column} {column type}' for column, column type in zip self.interactions c… Evidence: `dsrag/database/chat_thread/sqlite_db.py`
- **Concatenate the chunks into a single string** (source_file): class BasicChunkDB ChunkDB ⋮---- def init self, kb id: str, storage directory: str = "~/dsRAG" - None def add document self, doc id: str, chunks: dict int, dict str, Any , supp id: str = "", metadata: dict = {} - None def remove document self, doc id: str def get chunk text self, doc id: str, chunk index: int - Optional str def get is visual self, doc id: str, chunk index: int - Optional bool def get chunk page numbers self, doc id: str, chunk index: int - Optional tuple int, int ⋮---- document = self.data doc id title = cast str, document 0 .get "document title", "" full document string = "" ⋮---- Concatenate the chunks into a single string ⋮---- Join each chunk text with a new line charac… Evidence: `dsrag/database/chunk/basic_db.py`
- **Db** (source_file): class ChunkDB ABC ⋮---- subclasses = {} def init subclass cls, kwargs def to dict self ⋮---- @classmethod def from dict cls, config - "ChunkDB" ⋮---- subclass name = config.pop subclass = cls.subclasses.get subclass name ⋮---- @abstractmethod def add document self, doc id: str, chunks: dict int, dict str, Any , supp id: str = "", metadata: dict = {} - None ⋮---- """ Store all chunks for a given document. """ ⋮---- @abstractmethod def remove document self, doc id: str - None ⋮---- """ Remove all chunks and metadata associated with a given document ID. """ ⋮---- @abstractmethod def get chunk text self, doc id: str, chunk index: int - Optional str ⋮---- """ Retrieve a specific chunk from a giv… Evidence: `dsrag/database/chunk/db.py`
- **Initialize DynamoDB resource** (source_file): def get key def process items items ⋮---- def convert decimal obj ⋮---- class DynamoDB ChunkDB ⋮---- def init self, kb id: str, table name: str = None, billing mode: str = "PAY PER REQUEST" - None ⋮---- kb id = kb id.replace " ", " " ⋮---- table name = f"{kb id} chunks" ⋮---- def check table status self ⋮---- dynamodb = self.create dynamo client table = dynamodb.Table self.table name ⋮---- response = table.table status ⋮---- def create dynamo client self ⋮---- dynamodb client = boto3.resource ⋮---- def create db table self, table name: str - None ⋮---- response = dynamodb.create table ⋮---- def add document self, doc id: str, chunks: dict int, dict str, Any , supp id: str = "", metadata: di… Evidence: `dsrag/database/chunk/dynamo_db.py`
- **Create a table for this kb id** (source_file): psycopg2 = LazyLoader "psycopg2", "psycopg2-binary" class PostgresChunkDB ChunkDB ⋮---- def init self, kb id: str, username: str, password: str, database: str, host: str="localhost", port: int = 5432 - None ⋮---- conn = psycopg2.connect cur = conn.cursor ⋮---- exists = cur.fetchone 0 ⋮---- Create a table for this kb id query statement = f"CREATE TABLE {self.table name} " ⋮---- query statement = query statement :-2 + " " ⋮---- Check if we need to add any columns to the table. This happens if the columns have been updated ⋮---- columns = cur.fetchall column names = column 0 for column in columns ⋮---- def add document self, doc id: str, chunks: dict int, dict str, Any , supp id: str = "", met… Evidence: `dsrag/database/chunk/postgres_db.py`
- **Check if we need to add any columns to the table. This happens if the columns have been updated** (source_file): class SQLiteDB ChunkDB ⋮---- def init self, kb id: str, storage directory: str = "~/dsRAG" - None ⋮---- conn = sqlite3.connect os.path.join self.db path, f"{kb id}.db" c = conn.cursor result = c.execute ⋮---- query statement = "CREATE TABLE documents " ⋮---- query statement = query statement :-2 + " " ⋮---- Check if we need to add any columns to the table. This happens if the columns have been updated ⋮---- columns = c.fetchall column names = column 1 for column in columns ⋮---- @contextlib.contextmanager def get connection self - ContextManager sqlite3.Connection ⋮---- conn = sqlite3.connect ⋮---- def execute with retry self, operation: callable, args, kwargs - Any ⋮---- last error = None… Evidence: `dsrag/database/chunk/sqlite_db.py`
- **Types** (source_file): class FormattedDocument TypedDict ⋮---- id: Optional str = None title: Optional str = None content: Optional str = None summary: Optional str = None created on: Optional datetime = None supp id: Optional str = None metadata: Optional dict = {} chunk count: Optional int = 0 Evidence: `dsrag/database/chunk/types.py`
- **Limit top k to the number of vectors we have - Faiss doesn't automatically handle this** (source_file): class BasicVectorDB VectorDB ⋮---- def search self, query vector, top k=10, metadata filter: Optional dict = None - list VectorSearchResult def fallback search self, query vector, top k=10 - list VectorSearchResult ⋮---- """Fallback search method using numpy when faiss is not available.""" similarities = cosine similarity query vector , self.vectors 0 indexed similarities = sorted results: list VectorSearchResult = ⋮---- result = VectorSearchResult ⋮---- def search faiss self, query vector, top k=10 - list VectorSearchResult ⋮---- Limit top k to the number of vectors we have - Faiss doesn't automatically handle this top k = min top k, len self.vectors faiss expects 2D arrays of vectors vect… Evidence: `dsrag/database/vector/basic_db.py`
- **raise ValueError 'No vectors stored in the database.'** (source_file): chromadb = LazyLoader "chromadb" def format metadata filter metadata filter: MetadataFilter - dict ⋮---- field = metadata filter "field" operator = metadata filter "operator" value = metadata filter "value" operator mapping = { formatted operator = operator mapping operator formatted metadata filter = {field: {formatted operator: value}} ⋮---- class ChromaDB VectorDB ⋮---- def init self, kb id: str, storage directory: str = "~/dsRAG" def get num vectors self def add vectors self, vectors: list, metadata: list ⋮---- vectors as lists = vector.tolist if isinstance vector, np.ndarray else vector for vector in vectors ⋮---- ids = f"{meta 'doc id' } {meta 'chunk index' }" for meta in metadata ⋮--… Evidence: `dsrag/database/vector/chroma_db.py`
- **Db** (source_file): class VectorDB ABC ⋮---- subclasses = {} def init subclass cls, kwargs def to dict self ⋮---- @classmethod def from dict cls, config ⋮---- subclass name = config.pop subclass = cls.subclasses.get subclass name ⋮---- """ Store a list of vectors with associated metadata. """ ⋮---- @abstractmethod def remove document self, doc id - None ⋮---- """ Remove all vectors and metadata associated with a given document ID. """ ⋮---- @abstractmethod def search self, query vector, top k: int=10, metadata filter: Optional dict = None - list VectorSearchResult ⋮---- """ Retrieve the top-k closest vectors to a given query vector. - needs to return results as list of dictionaries in this format: { 'metadata'… Evidence: `dsrag/database/vector/db.py`
- **Milvus Db** (source_file): pymilvus = LazyLoader "pymilvus" def convert metadata to expr metadata filter: MetadataFilter - str ⋮---- field = metadata filter "field" operator = metadata filter "operator" value = metadata filter "value" operator mapping = { formatted operator = operator mapping operator ⋮---- value = f'"{value}"' formatted metadata filter = f"metadata '{field}' {formatted operator} {value}" ⋮---- class MilvusDB VectorDB ⋮---- def create collection self, collection name: str, dimension: int = 768 ⋮---- """ Create a collection in Milvus. Args: collection name str : The name of the collection. dimension int, optional : The dimension of the vectors. Defaults to 768. """ schema = { ⋮---- def add vectors sel… Evidence: `dsrag/database/vector/milvus_db.py`
- **convert to format that Pinecone expects - list of dictionaries including "values", "id", and "metadata"** (source_file): pinecone = LazyLoader "pinecone" def format metadata filter metadata filter: MetadataFilter - dict ⋮---- field = metadata filter "field" operator = metadata filter "operator" value = metadata filter "value" operator mapping = { formatted operator = operator mapping.get operator ⋮---- formatted metadata filter = {field: value} ⋮---- formatted metadata filter = {field: {formatted operator: value}} ⋮---- class PineconeDB VectorDB ⋮---- kb id = kb id.replace " ", "-" kb id = kb id.replace " ", "-" ⋮---- existing indexes = self.pc.list indexes existing index names = index "name" for index in existing indexes ⋮---- def add vectors self, vectors: list, metadata: list ⋮---- vectors as lists = vecto… Evidence: `dsrag/database/vector/pinecone_db.py`
- **Handle different types of values** (source_file): psycopg2 = LazyLoader "psycopg2", "psycopg2-binary" pgvector = LazyLoader "pgvector" def format metadata filter metadata filter: MetadataFilter - dict ⋮---- field = metadata filter 'field' operator = metadata filter 'operator' value = metadata filter 'value' operator map = { ⋮---- sql operator = operator map operator Handle different types of values ⋮---- Convert list to a tuple for SQL IN expressions value placeholder = f" {', '.join '%s' len value } " ⋮---- Single value placeholder value placeholder = "%s" ⋮---- filter expression = f"metadata- '{field}' {sql operator} {value placeholder}" ⋮---- class PostgresVectorDB VectorDB ⋮---- def init self, kb id: str, username: str, password: str,… Evidence: `dsrag/database/vector/postgres_db.py`
- **Qdrant Db** (source_file): qdrant client = LazyLoader "qdrant client" def convert id id: str - str class QdrantVectorDB VectorDB ⋮---- def close self ⋮---- points = ⋮---- doc id = meta.get "doc id", "" chunk text = meta.get "chunk text", "" chunk index = meta.get "chunk index", 0 uuid = convert id f"{doc id} {chunk index}" ⋮---- def remove document self, doc id - None ⋮---- query vector = query vector.tolist results: list VectorSearchResult = response = self.client.query points ⋮---- def get num vectors self def delete self def to dict self Evidence: `dsrag/database/vector/qdrant_db.py`
- **Types**（source_file）：class ChunkMetadata TypedDict ⋮---- doc id: str chunk text: str chunk index: int chunk header: str Vector = Union Sequence float , Sequence int class VectorSearchResult TypedDict ⋮---- doc id: Optional str vector: Optional Vector metadata: ChunkMetadata similarity: float class MetadataFilter TypedDict ⋮---- field: str operator: str value: Union str, int, float, list str , list int , list float 证据：`dsrag/database/vector/types.py`
- **Explicitly connect to the client**（source_file）：weaviate = LazyLoader "weaviate" class WeaviateVectorDB VectorDB ⋮---- embedded options = weaviate.embedded.EmbeddedOptions ⋮---- connection params = weaviate.connect.ConnectionParams.from url ⋮---- Explicitly connect to the client ⋮---- Create collection if it doesn't exist ⋮---- Create collection ⋮---- def close self ⋮---- """ Closes the connection to Weaviate. """ ⋮---- """ Adds a list of vectors with associated metadata to Weaviate. Args: vectors: A list of vector embeddings. metadata: A list of dictionaries containing metadata for each vector. Raises: ValueError: If the number of vectors and metadata items do not match. """ ⋮---- Updated to use v4 API ⋮---- doc id = meta.get "doc id",… 证据：`dsrag/database/vector/weaviate_db.py`
- **Element Types** (source_file): def get visual elements as str elements: list ElementType - str ⋮---- visual elements = element "name" for element in elements if element "is visual" ⋮---- last element = visual elements -1 ⋮---- def get non visual elements as str elements: list ElementType - str ⋮---- non visual elements = element "name" for element in elements if not element "is visual" ⋮---- last element = non visual elements -1 ⋮---- def get num visual elements elements: list ElementType - int def get num non visual elements elements: list ElementType - int def get element description block elements: list ElementType - str ⋮---- element blocks = ⋮---- element block = ELEMENT PROMPT.format ⋮---- ELEMENT PROMPT = """ defa… Evidence: `dsrag/dsparse/file_parsing/element_types.py`
- **File System** (source_file): class FileSystem ABC ⋮---- subclasses = {} def init self, base path: str def init subclass cls, kwargs def to dict self ⋮---- @classmethod def from dict cls, config ⋮---- subclass name = config.pop subclass = cls.subclasses.get subclass name ⋮---- @abstractmethod def create directory self, kb id: str, doc id: str - None ⋮---- @abstractmethod def delete directory self, kb id: str, doc id: str - None ⋮---- @abstractmethod def delete kb self, kb id: str - None ⋮---- @abstractmethod def save json self, kb id: str, doc id: str, file name: str, file: dict - None ⋮---- @abstractmethod def save image self, kb id: str, doc id: str, file name: str, file: any - None ⋮---- @abstractmethod def get files… Evidence: `dsrag/dsparse/file_parsing/file_system.py`
- **Loop through all the pages in the PDF file** (source_file): def extract text from pdf file path: str - tuple str, list ⋮---- pdf reader = pypdf.PdfReader file extracted text = "" Loop through all the pages in the PDF file pages = ⋮---- Extract the text from the current page page text = pdf reader.pages page num .extract text ⋮---- Add the extracted text to the final text ⋮---- def extract text from docx file path: str - str def parse file no vlm file path: str - str ⋮---- pdf pages = None ⋮---- text = extract text from docx file path ⋮---- text = file.read Evidence: `dsrag/dsparse/file_parsing/non_vlm_file_parsing.py`
- **Vlm** (source_file): def make llm call gemini image path: str, system message: str, model: str = "gemini-2.0-flash", response schema: dict = None, max tokens: int = 4000, temperature: float = 0.5 - str ⋮---- client = GeminiVLM model=model ⋮---- def make llm call vertex image path: str, system message: str, model: str, project id: str, location: str, response schema: dict = None, max tokens: int = 4000, temperature: float = 0.5 - str ⋮---- client = VertexAIVLM model=model, project id=project id, location=location ⋮---- def compress image image: PIL.Image.Image, max size bytes: int = 1097152, quality: int = 95 - tuple bytes, int ⋮---- output = io.BytesIO ⋮---- image = image.resize int width 0.9 , int height 0.9 ,… Evidence: `dsrag/dsparse/file_parsing/vlm.py`
- **Vlm Clients** (source_file): class VLM ABC ⋮---- subclasses: Dict str, Type "VLM" = {} def init subclass cls, kwargs def to dict self - Dict str, Any ⋮---- data = {k: v for k, v in self. dict .items if not k.startswith " " } ⋮---- @classmethod def from dict cls, config: Dict str, Any - "VLM" ⋮---- subclass name = config.get "subclass name" ⋮---- subclass = cls. subclasses.get subclass name ⋮---- kwargs = {k: v for k, v in config.items if k != "subclass name"} ⋮---- class GeminiVLM VLM ⋮---- def init self, model: str = "gemini-2.0-flash" - None ⋮---- api key = os.environ.get "GEMINI API KEY" ⋮---- client = genai new.Client api key=api key config = genai new.types.GenerateContentConfig ⋮---- image = None ⋮---- image = PI… Evidence: `dsrag/dsparse/file_parsing/vlm_clients.py`
- **Create base logging context with identifiers** (source_file): logger = logging.getLogger "dsrag.dsparse.vlm file parsing" SYSTEM MESSAGE = """ response schema = { def get page count file path: str, kb id: str = "", doc id: str = "" ⋮---- Create base logging context with identifiers base extra = {} ⋮---- pdf reader = PdfReader pdf file ⋮---- def pdf to images pdf path: str, kb id: str, doc id: str, file system: FileSystem, dpi=100, max workers: int=2, max pages: int=10 - list str ⋮---- base extra = {"kb id": kb id, "doc id": doc id} ⋮---- def save single image args page count = get page count pdf path, kb id, doc id all image paths = ⋮---- last page = min i + max pages-1, page count images = convert from path pdf path, dpi=dpi, thread count=max workers… Evidence: `dsrag/dsparse/file_parsing/vlm_file_parsing.py`
- **dump to json for testing** (source_file): logger = logging.getLogger "dsrag.dsparse" ⋮---- base extra = {"kb id": kb id, "doc id": doc id} ⋮---- config extra = { use vlm = file parsing config.get "use vlm", False ⋮---- overall start time = time.perf counter ⋮---- file system = LocalFileSystem base path=os.path.expanduser "~/dsParse" ⋮---- vlm config = file parsing config.get "vlm config", {} start time = time.perf counter ⋮---- duration = time.perf counter - start time ⋮---- overall duration = time.perf counter - overall start time ⋮---- base extra = {"kb id": kb id, "doc id": doc id, "file path": file path} ⋮---- parse start time = time.perf counter elements = parse file parse duration = time.perf counter - parse start time ⋮----… Evidence: `dsrag/dsparse/main.py`
- **Types** (source_file): class ElementType TypedDict ⋮---- name: str instructions: str is visual: bool class Element TypedDict ⋮---- type: str content: str page number: Optional int class Line TypedDict ⋮---- element type: str ⋮---- is visual: Optional bool class Section TypedDict ⋮---- title: str start: int end: int ⋮---- class Chunk TypedDict ⋮---- line start: int line end: int ⋮---- page start: int page end: int section index: int ⋮---- class VLMConfig TypedDict ⋮---- provider: Optional str model: Optional str fallback provider: Optional str fallback model: Optional str project id: Optional str location: Optional str exclude elements: Optional list str element types: Optional list ElementType max workers: Option… Evidence: `dsrag/dsparse/models/types.py`
- **Try to get the attribute from the module** (source_file): class LazyLoader ⋮---- def init self, module name, package name=None def getattr self, name ⋮---- Try to get the attribute from the module ⋮---- If the attribute is not found, it might be a nested module ⋮---- Try to import the nested module nested module = importlib.import module f"{self. module name}.{name}" Cache it on the module for future access ⋮---- If that fails, re-raise the original AttributeError ⋮---- Create lazy loaders for dependencies used in dsparse instructor = LazyLoader "instructor" openai = LazyLoader "openai" anthropic = LazyLoader "anthropic" genai = LazyLoader "google.generativeai", "google-generativeai" genai new = LazyLoader "google.genai", "google-genai" vertexai =… Evidence: `dsrag/dsparse/utils/imports.py`
- **Try to get the attribute from the module** (source_file): class LazyLoader ⋮---- def init self, module name, package name=None def getattr self, name ⋮---- Try to get the attribute from the module ⋮---- If the attribute is not found, it might be a nested module ⋮---- Try to import the nested module nested module = importlib.import module f"{self. module name}.{name}" Cache it on the module for future access ⋮---- If that fails, re-raise the original AttributeError ⋮---- Create lazy loaders for commonly used optional dependencies instructor = LazyLoader "instructor" openai = LazyLoader "openai" cohere = LazyLoader "cohere" voyageai = LazyLoader "voyageai" ollama = LazyLoader "ollama" anthropic = LazyLoader "anthropic" genai = LazyLoader "google.gen… Evidence: `dsrag/utils/imports.py`
- **Return deterministic zero vectors of the configured dimension** (source_file): class DummyEmbedding Embedding ⋮---- def init self, dimension: int = 8 - None def get embeddings self, text: list str , input type: str = "" ⋮---- Return deterministic zero vectors of the configured dimension ⋮---- def to dict self ⋮---- Provide a minimal serializable dict not used for from dict in this test ⋮---- class TestAddDocumentWithSerializedVLM unittest.TestCase ⋮---- @unittest.skipIf 'GEMINI API KEY' not in os.environ, "GEMINI API KEY not found in environment" def test add document with serialized vlm and images already exist self ⋮---- temp dir = tempfile.mkdtemp prefix="dsrag kb add doc " ⋮---- kb id = "kb serialized vlm" doc id = "doc serialized vlm" fs = LocalFileSystem base pa… Evidence: `dsrag/dsparse/tests/integration/test_add_document_with_serialized_vlm.py`
- **Test Vlm File Parsing** (source_file): class TestVLMFileParsing unittest.TestCase ⋮---- @classmethod def setUpClass self def test parse and chunk vlm self ⋮---- vlm config = { semantic sectioning config = { file parsing config = { ⋮---- def test non vlm file parsing self def test parse and chunk vlm with serialized client self ⋮---- @classmethod def tearDownClass self Evidence: `dsrag/dsparse/tests/integration/test_vlm_file_parsing.py`
- **Check if the directory was deleted** (source_file): class TestLocalFileSystem unittest.TestCase ⋮---- @classmethod def setUpClass self def test 001 create directory self def test 002 save json self ⋮---- test json = { file name = "elements.json" ⋮---- def test 003 save image self ⋮---- pdf path = os.path.abspath os.path.join os.path.dirname file , '../../../../tests/data/mck energy first 5 pages.pdf' images = convert from path pdf path, dpi=150 file name = "page 0.jpg" ⋮---- file name = "page 1.jpg" ⋮---- def test 004 get files self ⋮---- files = self.file system.get files self.kb id, self.doc id, page start=0, page end=1 ⋮---- files = self.file system.get files self.kb id, self.doc id, page start=2, page end=2 ⋮---- def test 005 get all fil… Evidence: `dsrag/dsparse/tests/unit/test_file_system.py`
- **Test Vlm** (source_file): class TestVLM unittest.TestCase ⋮---- @unittest.skipIf 'GEMINI API KEY' not in os.environ, "GEMINI API KEY not found in environment" def test gemini 2 0 with simple schema self ⋮---- test image path = os.path.abspath os.path.join os.path.dirname file , "../../../../tests/data/page 7.jpg" test schema = { ⋮---- result = make llm call gemini parsed result = json.loads result ⋮---- @unittest.skipIf 'GEMINI API KEY' not in os.environ, "GEMINI API KEY not found in environment" def test gemini 2 5 with simple schema self ⋮---- @unittest.skipIf 'GEMINI API KEY' not in os.environ, "GEMINI API KEY not found in environment" def test gemini 2 0 with complex schema self ⋮---- complex schema = { system m… Evidence: `dsrag/dsparse/tests/unit/test_vlm.py`
- **Test Vlm Clients** (source_file): class TestVLMClients unittest.TestCase ⋮---- def test registry contains subclasses self def test gemini to from dict roundtrip self ⋮---- client = GeminiVLM as dict = client.to dict ⋮---- rebuilt = VLM.from dict as dict ⋮---- def test vertex to from dict roundtrip self ⋮---- client = VertexAIVLM model="gemini-1.5-flash", project id="proj", location="us-central1" Evidence: `dsrag/dsparse/tests/unit/test_vlm_clients.py`
- **Levels Of Agi** (structured_config): { "title": "", "description": "", "language": "en", "chunk size": 800, "components": { "embedding model": { "subclass name": "OpenAIEmbedding", "dimension": 768, "model": "text-embedding-3-small" }, "reranker": { "subclass name": "CohereReranker", "model": "rerank-english-v3.0" }, "auto context model": { "subclass name": "AnthropicChatAPI", "model": "claude-3-haiku-20240307", "temperature": 0.2, "max tokens": 1000 }, "vector db": { "subclass name": "BasicVectorDB", "kb id": "levels of agi", "storage directory": "example kb data", "use faiss": true }, "chunk db": { "subclass name": "BasicChunkDB", "kb id": "levels of agi", "storage directory": "example kb data" } } } Evidence: `examples/example_kb_data/metadata/levels_of_agi.json`
- The remaining 20 evidence items are in `AI_CONTEXT_PACK.json` or `EVIDENCE_INDEX.json`.
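
Several excerpts above (`VectorDB`, `FileSystem`, `VLM`) show the same shape: a base class holds a `subclasses` registry filled in through `__init_subclass__`, and `to_dict` / `from_dict` round-trip a config keyed by a subclass name, which is also how the components in `levels_of_agi.json` are labelled. The sketch below illustrates that pattern only. The class names `Component` and `InMemoryStore` and the exact key `subclass_name` are assumptions made for this example; it is not dsRAG's code and must be checked against the installed version.

```python
from abc import ABC
from typing import Any, Dict, Type


class Component(ABC):
    """Hypothetical base class; stands in for VectorDB / FileSystem / VLM."""

    subclasses: Dict[str, Type["Component"]] = {}

    def __init_subclass__(cls, **kwargs: Any) -> None:
        # Every subclass registers itself under its class name.
        super().__init_subclass__(**kwargs)
        Component.subclasses[cls.__name__] = cls

    def to_dict(self) -> Dict[str, Any]:
        # Assumed key name; the excerpts only show "subclass name".
        return {"subclass_name": type(self).__name__}

    @classmethod
    def from_dict(cls, config: Dict[str, Any]) -> "Component":
        config = dict(config)  # copy so the caller's dict is not mutated
        name = config.pop("subclass_name")
        subclass = cls.subclasses.get(name)
        if subclass is None:
            raise ValueError(f"Unknown subclass: {name}")
        return subclass(**config)


class InMemoryStore(Component):
    """Hypothetical concrete component used only for this illustration."""

    def __init__(self, kb_id: str, storage_directory: str = "~/dsRAG") -> None:
        self.kb_id = kb_id
        self.storage_directory = storage_directory

    def to_dict(self) -> Dict[str, Any]:
        return {
            **super().to_dict(),
            "kb_id": self.kb_id,
            "storage_directory": self.storage_directory,
        }


# Serialize a component, then rebuild it through the base-class registry.
config = InMemoryStore(kb_id="levels_of_agi").to_dict()
rebuilt = Component.from_dict(config)
```

The point of the pattern is that a stored config can name any registered component without the loader importing concrete classes by hand; whether a given dsRAG component accepts a given key still has to be verified after installation.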

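## Rules the Host AI Must Follow

- **Treat this asset as pre-work context, not as a runtime environment.** The AI Context Pack contains only an evidence-backed understanding of the project, not the target project's executable state. Evidence: `README.md`, `dsrag/dsparse/README.md`, `docs/community/contributing.md`
- **When answering the user, separate what can be previewed from what can only be verified after installation.** The value of a pre-install experience comes from reducing mistaken installs and misjudgments, not from posing as a real run. Evidence: `README.md`, `dsrag/dsparse/README.md`, `docs/community/contributing.md`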

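## Questions the User Should Answer Before Starting

- Which host AI or local environment do you plan to use it in?
- Do you only want to try the workflow first, or are you preparing a real installation?
- What matters most to you: installation cost, output quality, or conflicts with your existing rules?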

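## Acceptance Criteria

- Every capability claim can be traced back to a file path in evidence_refs.
- AI_CONTEXT_PACK.md does not present a preview as a real run.
- A user can understand within 3 minutes who the project suits, what it can do, how to start, and where the risk boundaries are.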

---

## Doramagic Context Augmentation

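The content below reinforces the main Repomix/AI Context Pack body. The Human Manual provides only a reading skeleton; the pitfall log is converted into working constraints that the host AI must follow.

## Human Manual Skeleton

Usage rule: this is only a reading route and salience signal for the project, not an authority on facts. Specific facts must still go back to repo evidence / Claim Graph.

Hard rules for the host AI:
- Do not treat page titles, section order, summaries, or importance as evidence of project facts.
- When explaining the Human Manual skeleton, state explicitly that it is only a reading route / salience signal.
- Judgments about capability, installation, compatibility, runtime status, and risk must cite repo evidence, a source path, or the Claim Graph.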

- **dsRAG overview and three core methods**: importance `high`
  - source_paths: README.md, dsrag/knowledge_base.py, dsrag/rse.py, dsrag/auto_context.py, dsrag/auto_query.py
- **System architecture and six pluggable components**: importance `high`
  - source_paths: dsrag/knowledge_base.py, dsrag/embedding.py, dsrag/llm.py, dsrag/reranker.py, dsrag/metadata.py
- **dsParse document processing and multimodal parsing**: importance `high`
  - source_paths: dsrag/dsparse/README.md, dsrag/dsparse/main.py, dsrag/dsparse/file_parsing/vlm.py, dsrag/dsparse/file_parsing/vlm_file_parsing.py, dsrag/dsparse/file_parsing/non_vlm_file_parsing.py
- **Configuration dictionaries, extension patterns, and community integrations**: importance `high`
  - source_paths: dsrag/create_kb.py, dsrag/add_document.py, dsrag/custom_term_mapping.py, dsrag/chat/chat.py, dsrag/chat/citations.py

## Repo Inspection Evidence

- repo_clone_verified: true
- repo_inspection_verified: true
- repo_commit: `5215e9791133fc1ad383af043fb2fde0e7d4e4d4`
- inspected_files: `README.md`, `pyproject.toml`, `requirements.txt`, `docs/api/chat.md`, `docs/api/knowledge_base.md`, `docs/community/contributing.md`, `docs/community/professional-services.md`, `docs/community/support.md`, `docs/concepts/chat.md`, `docs/concepts/citations.md`, `docs/concepts/components.md`, `docs/concepts/config.md`, `docs/concepts/knowledge-bases.md`, `docs/concepts/logging.md`, `docs/concepts/overview.md`, `docs/concepts/vlm.md`, `docs/getting-started/installation.md`, `docs/getting-started/quickstart.md`, `docs/index.md`, `examples/example_kb_data/metadata/levels_of_agi.json`

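Hard rules for the host AI:
- Without repo_clone_verified=true, do not claim to have read the source code.
- Without repo_inspection_verified=true, do not state judgments based on README/docs/package files as facts.
- Without quick_start_verified=true, do not claim that the Quick Start has run successfully.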
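
The three hard rules above amount to a lookup from verification flags to the statements a host AI may make. A minimal sketch of that gate follows, assuming the flags arrive as a plain dictionary; the claim labels (`read_source_code`, `state_docs_as_fact`, `quick_start_ran`) are hypothetical names chosen for this example, and `quick_start_verified` is treated as false because it is not listed in the evidence block.

```python
from typing import Dict, List

# Flag values copied from the evidence block above; quick_start_verified is
# not listed there, so this sketch assumes it is absent (treated as False).
FLAGS: Dict[str, bool] = {
    "repo_clone_verified": True,
    "repo_inspection_verified": True,
}

# Hypothetical claim labels mapped to the flag each one requires.
REQUIRED_FLAG: Dict[str, str] = {
    "read_source_code": "repo_clone_verified",
    "state_docs_as_fact": "repo_inspection_verified",
    "quick_start_ran": "quick_start_verified",
}


def allowed_claims(flags: Dict[str, bool]) -> List[str]:
    """Return the claim labels whose required flag is explicitly True."""
    return [
        claim
        for claim, flag in REQUIRED_FLAG.items()
        if flags.get(flag, False) is True
    ]


print(allowed_claims(FLAGS))
```

A missing flag is treated the same as a false one, so the gate fails closed: anything not explicitly verified stays off the list of permitted statements.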

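## Doramagic Pitfall Constraints

These rules come from project-specific pitfalls found while Doramagic discovered, verified, or compiled this pack. The host AI must treat them as working constraints, not as ordinary explanatory text.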

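### Constraint 1: Source evidence: About Performance of Semantic Chunk

- Trigger: GitHub community evidence shows an unverified installation-related issue for this project: About Performance of Semantic Chunk
- Why it matters: It may raise the cost of trial use for new users and of production adoption.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/113 | An unverified usage condition surfaced by a source of type github_issue.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.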

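### Constraint 2: Source evidence: raise JSONDecodeError("Extra data", s, end) json.decoder.JSONDecodeError: Extra data: line 1 column 5 (char 4)

- Trigger: GitHub community evidence shows an unverified configuration-related issue for this project: raise JSONDecodeError("Extra data", s, end) json.decoder.JSONDecodeError: Extra data: line 1 column 5 (char 4)
- Why it matters: It may raise the cost of trial use for new users and of production adoption.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/117 | The source discussion mentions Python-related conditions, which need rechecking before installation or trial.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.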
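
For orientation, the message quoted in Constraint 2 is what Python's standard `json` module raises when a valid JSON value is followed by further non-whitespace text. The sketch below only illustrates that error class with a made-up input string; it is not a diagnosis of issue #117, whose actual input is not part of this pack. `json.JSONDecoder.raw_decode` is one standard-library way to read the first value and inspect what trails it.

```python
import json

# Made-up input: a valid JSON number immediately followed by more text.
payload = '1234{"answer": "ok"}'

try:
    json.loads(payload)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = str(exc)

# raw_decode parses the first JSON value and reports the index where it stopped.
first_value, end = json.JSONDecoder().raw_decode(payload)
trailing = payload[end:]

print(error_message)
print(first_value, repr(trailing))
```

Seeing the trailing text is usually the quickest way to tell whether the producer emitted two values, a prefix, or a wrapper around the JSON; which of these applies to the reported issue still needs to be checked against the issue itself.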

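### Constraint 3: Source evidence: Import "dsrag.document_parsing" from the README example couldn't be resolved

- Trigger: GitHub community evidence shows an unverified installation-related issue for this project: Import "dsrag.document_parsing" from the README example couldn't be resolved
- Host AI rule: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Why it matters: It may raise the cost of trial use for new users and of production adoption.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/73 | The source discussion mentions Python-related conditions, which need rechecking before installation or trial.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.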
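
After installation, one way to check whether a module path from a README example exists in the installed version is to ask the import system to locate it. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `module_available` is made up for this example, and the result depends entirely on which dsrag version, if any, is installed in the environment where it runs.

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if the import system can locate `dotted_name`.

    find_spec imports the parent packages of a dotted name, so a missing or
    broken parent surfaces as ImportError; that case counts as "not available".
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        return False


# Module paths named in the constraint above and in the evidence list.
for name in ("dsrag", "dsrag.document_parsing", "dsrag.dsparse"):
    print(f"{name}: {'found' if module_available(name) else 'not found'}")
```

A "not found" result for a path that a README example imports is a signal to compare the example against the installed version before reporting anything to the user as working.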

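### Constraint 4: Source evidence: llm.py directly imports google.generativeai instead of using LazyLoader

- Trigger: GitHub community evidence shows an unverified installation-related issue for this project: llm.py directly imports google.generativeai instead of using LazyLoader
- Why it matters: It may raise the cost of trial use for new users and of production adoption.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/127 | The source discussion mentions Python-related conditions, which need rechecking before installation or trial.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.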
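
To make the distinction in Constraint 4 concrete: a direct `import` fails as soon as the importing module is loaded, while a lazy loader defers the failure until the optional dependency is actually used. The sketch below follows the comments visible in the `imports.py` excerpts (try the attribute, fall back to a nested module, cache it), but it is an illustration written for this pack, not the project's implementation; the install-hint wording and the internal `_load` helper are assumptions.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


class LazyLoader:
    """Defer importing a module until one of its attributes is first used."""

    def __init__(self, module_name: str, package_name: Optional[str] = None) -> None:
        self._module_name = module_name
        # Name shown in the install hint; defaults to the module name.
        self._package_name = package_name or module_name
        self._module: Optional[ModuleType] = None

    def _load(self) -> ModuleType:
        if self._module is None:
            try:
                self._module = importlib.import_module(self._module_name)
            except ImportError as exc:
                # Assumed error wording; a real library's message may differ.
                raise ImportError(
                    f"'{self._module_name}' is required for this feature. "
                    f"Try: pip install {self._package_name}"
                ) from exc
        return self._module

    def __getattr__(self, name: str) -> Any:
        # Called only for names not found on the LazyLoader instance itself.
        module = self._load()
        try:
            return getattr(module, name)
        except AttributeError:
            # The name may be a nested module that has not been imported yet.
            try:
                nested = importlib.import_module(f"{self._module_name}.{name}")
            except ImportError:
                raise AttributeError(
                    f"module '{self._module_name}' has no attribute '{name}'"
                ) from None
            setattr(module, name, nested)  # cache for later lookups
            return nested


# Creating a loader imports nothing, even for a module that does not exist.
json_lazy = LazyLoader("json")
missing_lazy = LazyLoader("no_such_module_xyz", "no-such-package-xyz")
print(json_lazy.dumps({"ok": True}))
```

With this shape, a user who never touches the optional feature never needs the optional package; whether every module in a given dsRAG release follows it is exactly what the constraint asks to be verified.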

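### Constraint 5: Source evidence: sqlite3.OperationalError: no such column: model_response_status

- Trigger: GitHub community evidence shows an unverified configuration-related issue for this project: sqlite3.OperationalError: no such column: model_response_status
- Why it matters: It may raise the cost of trial use for new users and of production adoption.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/116 | An unverified usage condition surfaced by a source of type github_issue.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.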
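
SQLite raises "no such column" when a query names a column that the table on disk does not have. One way to confirm which columns a local database file actually contains is `PRAGMA table_info`. The sketch below runs against an in-memory database with a hypothetical table name and schema (`chat_threads`, `thread_id`, `user_input`); only the column name `model_response_status` comes from the issue title, and whether a schema mismatch is the cause of issue #116 is not established here.

```python
import sqlite3
from typing import Set


def table_columns(conn: sqlite3.Connection, table: str) -> Set[str]:
    """Return the column names SQLite reports for `table`."""
    # PRAGMA arguments cannot be bound as parameters, so pass trusted names only.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {row[1] for row in rows}  # row[1] is the column name


# Hypothetical schema standing in for an existing local database file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chat_threads (thread_id TEXT, user_input TEXT)")

required = {"thread_id", "model_response_status"}
missing = required - table_columns(conn, "chat_threads")
print(sorted(missing))
conn.close()
```

Pointing the same helper at a real database file (with the real table name, once known) shows whether the expected column exists before any query is attempted.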

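### Constraint 6: Capability judgment depends on an assumption

- Trigger: README/documentation is current enough for a first validation pass.
- Host AI rule: Turn the assumption into a downstream verification checklist.
- Why it matters: If the assumption does not hold, the user will not get the promised capability.
- Evidence: capability.assumptions | https://github.com/D-Star-AI/dsRAG | README/documentation is current enough for a first validation pass.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.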

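### Constraint 7: Source evidence: A bug at custom_term_mapping?

- Trigger: GitHub community evidence shows an unverified runtime-related issue for this project: A bug at custom_term_mapping?
- Host AI rule: The source suggests a fix, workaround, or version change may already exist; the manual must state the applicable version.
- Why it matters: It may raise the cost of trial use for new users and of production adoption.
- Evidence: community_evidence:github | https://github.com/D-Star-AI/dsRAG/issues/124 | The source discussion mentions Python-related conditions, which need rechecking before installation or trial.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.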

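### Constraint 8: Maintenance activity unknown

- Trigger: last_activity_observed is not recorded.
- Host AI rule: Add signals from recent GitHub commits, releases, and issue/PR responses.
- Why it matters: New, abandoned, and active projects get lumped together, which lowers confidence in the recommendation.
- Evidence: evidence.maintainer_signals | https://github.com/D-Star-AI/dsRAG | last_activity_observed missing
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.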

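### Constraint 9: Downstream validation risk item

- Trigger: no_demo
- Evidence: downstream_validation.risk_items | https://github.com/D-Star-AI/dsRAG | no_demo; severity=medium
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.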

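### Constraint 10: Scoring risk present

- Trigger: no_demo
- Why it matters: The risk affects whether the project is suitable for ordinary users to install.
- Evidence: risks.scoring_risks | https://github.com/D-Star-AI/dsRAG | no_demo; severity=medium
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.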
