# ragas - Doramagic AI Context Pack

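> Positioning: a pre-install experience and decision asset. It helps the host AI get off to a good start, but it does not mean the target project has been installed, executed, or verified.

## Sufficiency Principle

- **Sufficiency, not compression**: The AI Context Pack should be complete enough for the host AI to understand the project's value, capability boundaries, entry points, risks, and evidence sources before starting work. It may be organized in layers, but the shortest possible summary is not the goal.
- **Compression strategy**: Compress only noise and repetition; never compress context that affects judgment or the quality of the initial work.

## How the Host AI Should Use This

You are reading the AI Context Pack that Doramagic compiled for ragas. Treat it as pre-work context: help the user understand who it suits, what it can do, how to start, what must be verified after installation, and where the risks are. Do not claim that you have installed, run, or executed the target project.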

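## Claim Consumption Rules

- **Fact sources**: Repo Evidence + Claim/Evidence Graph; the Human Wiki provides only salience, terminology, and narrative structure.
- **Minimum status for facts**: `supported`
- `supported`: may be used as a project fact, but the answer must cite the claim_id and evidence path.
- `weak`: only a low-confidence lead; the user must be asked to verify it further.
- `inferred`: only for risk notes or open questions; it must not be presented as a project fact.
- `unverified`: must not be used as a fact; state clearly that the evidence is insufficient.
- `contradicted`: the conflicting sources must be shown; do not pick one version on the user's behalf.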
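
The status rules above can be sketched as a small lookup. This is a minimal illustration and not part of Doramagic or ragas: the `Claim` dataclass, its field names, the `render_claim` function, and the wording of the returned strings are all hypothetical choices made for this example.

```python
from dataclasses import dataclass

# Hypothetical record type for this sketch; field names are assumptions.
@dataclass
class Claim:
    claim_id: str
    status: str            # supported | weak | inferred | unverified | contradicted
    text: str
    evidence: tuple = ()   # evidence paths, e.g. ("README.md",)

def render_claim(claim: Claim) -> str:
    """Return how a host AI might phrase a claim under the consumption rules."""
    paths = ", ".join(claim.evidence) or "no evidence path"
    if claim.status == "supported":
        # Usable as a fact, but the claim_id and evidence path must be cited.
        return f"{claim.text} [{claim.claim_id}; {paths}]"
    if claim.status == "weak":
        return f"Low-confidence lead, please verify: {claim.text} [{claim.claim_id}]"
    if claim.status == "inferred":
        return f"Risk or open question (not a fact): {claim.text} [{claim.claim_id}]"
    if claim.status == "contradicted":
        # Show the conflict instead of choosing a version.
        return f"Conflicting sources ({paths}); no version chosen: {claim.text}"
    # unverified, or any unknown status: never state it as a fact.
    return f"Insufficient evidence: {claim.text}"
```

Treating unknown statuses the same as `unverified` is a deliberately conservative default in this sketch.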

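## Who It Suits Best

- **Users who want to understand an open-source project's value and boundaries before installing**: Current evidence comes mainly from project documentation. Evidence: `README.md` Claim: `clm_0002` supported 0.86

## What It Can Do

- **Command-line launch or installation flow** (requires post-install verification): The project documentation contains executable commands; real use requires running them locally or in the host environment. Evidence: `README.md`, `docs/getstarted/install.md` Claim: `clm_0001` supported 0.86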

## How to Start

- `pip install ragas` Evidence: `README.md` Claim: `clm_0003` supported 0.86
- `pip install git+https://github.com/vibrantlabsai/ragas` Evidence: `README.md` Claim: `clm_0004` supported 0.86, `clm_0005` supported 0.86
- `pip install git+https://github.com/vibrantlabsai/ragas.git` Evidence: `docs/getstarted/install.md` Claim: `clm_0005` supported 0.86
- `git clone https://github.com/vibrantlabsai/ragas.git` Evidence: `docs/getstarted/install.md` Claim: `clm_0006` supported 0.86
- `pip install -e .` Evidence: `docs/getstarted/install.md` Claim: `clm_0007` supported 0.86
- `pip install -U "langchain-core>=0.2,<0.3" "langchain-openai>=0.1,<0.2" openai` Evidence: `docs/getstarted/install.md` Claim: `clm_0008` supported 0.86
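
Before running any of the commands listed above, it can help to review what each one is likely to touch. The sketch below only inspects a command string and never executes it. The function name `precheck` and the note wording are hypothetical, and the classification is a rough heuristic rather than a statement about what pip or git will actually do on a given machine.

```python
import shlex

def precheck(command: str) -> list[str]:
    """Return review notes for one install command without running it."""
    tokens = shlex.split(command)
    notes = []
    if tokens[:2] == ["pip", "install"]:
        notes.append("modifies the active Python environment")
        if "-e" in tokens or "--editable" in tokens:
            notes.append("editable install from a local checkout")
        elif any(t.startswith("git+") for t in tokens):
            notes.append("fetches source from a Git URL")
        else:
            notes.append("downloads packages from the configured index")
    elif tokens[:2] == ["git", "clone"]:
        notes.append("downloads a repository into the current directory")
    else:
        # Anything this heuristic does not recognize needs a human look.
        notes.append("unrecognized command, review manually")
    return notes
```

For example, `precheck("pip install -e .")` flags both the environment change and the editable install, which is a cue to run it inside a throwaway virtual environment first.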

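## Pre-Continuation Decision Card

- **Current recommendation**: Administrator/security approval required
- **Why**: Continuing may involve keys, accounts, external services, or sensitive context, so obtain administrator or security approval first.

### 30-Second Decision

- **What to do now**: Administrator/security approval required
- **Minimal safe next step**: Run the Prompt Preview first; if credentials or an enterprise environment are involved, get approval before a trial install
- **Do not trust yet**: Role quality and task fit cannot be taken at face value.
- **Continuing will touch**: role-selection bias, command execution, host AI configuration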

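### What You Can Trust Now

- **Intended-user signal: users who want to understand an open-source project's value and boundaries before installing** (supported): Backed by a supported claim or project evidence, but this still does not equal real installation results. Evidence: `README.md` Claim: `clm_0002` supported 0.86
- **Capability exists: command-line launch or installation flow** (supported): You can trust that the project contains signals of this capability; whether it fits your specific task still requires a trial or post-install verification. Evidence: `README.md`, `docs/getstarted/install.md` Claim: `clm_0001` supported 0.86
- **Quick Start / install command signals exist** (supported): You can trust that the project documentation contains launch or install entry points; do not run them directly in your primary environment on that basis. Evidence: `README.md` Claim: `clm_0003` supported 0.86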

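### What You Cannot Trust Yet

- **Role quality and task fit cannot be taken at face value.** (unverified): A role library shows that many roles exist; it does not show that each role suits your specific task or produces high-quality results.
- **Role descriptions are not real execution capability.** (unverified): Before installation you can only judge whether a role description matches the task profile; you cannot prove it can complete the task inside the host AI.
- **Real output quality cannot be trusted before installation.** (unverified): The Prompt Preview only shows how the guidance works; it cannot prove result quality in a real project.
- **Host AI version compatibility cannot be trusted before installation.** (unverified): Loading rules and version differences across hosts such as Claude, Cursor, Codex, and Gemini must be verified in a real environment.
- **Do not assume it leaves existing host AI behavior untouched.** (inferred): Skill, plugin, and AGENTS/CLAUDE/GEMINI instructions may change the host AI's default behavior. Evidence: `CLAUDE.md`
- **Safe rollback cannot be assumed.** (unverified): Unless the project explicitly provides uninstall and restore instructions, verify in an isolated environment first.
- **Is a real installation compatible with the user's current host AI version?** (unverified): Compatibility can only be verified in the actual host environment.
- **Does the project's output quality meet the user's specific task?** (unverified): A pre-install preview only shows flow and boundaries and cannot replace real evaluation.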

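### What Continuing Will Touch

- **Role-selection bias**: The user's judgment about which expert role should handle the task. Reason: choosing the wrong role makes the AI answer from the wrong professional perspective, wasting time or misleading decisions.
- **Command execution**: Package managers, network downloads, local plugin directories, project configuration, or the user's home directory. Reason: the very first command can change the environment, so decide first whether it is worth running. Evidence: `README.md`, `docs/getstarted/install.md`
- **Host AI configuration**: Plugin, Skill, or rule-loading configuration for hosts such as Claude/Codex/Cursor/Gemini/OpenCode. Reason: host configuration changes how the AI works afterward and may conflict with the user's existing rules. Evidence: `CLAUDE.md`
- **Local environment or project files**: Installation output, plugin caches, project configuration, or local dependency directories. Reason: the write scope and rollback method cannot be proven before installation, so isolated verification is needed. Evidence: `README.md`, `docs/getstarted/install.md`
- **Environment variables / API keys**: The project's entry documentation explicitly mentions API key, token, secret, or account credential configuration. Reason: if a real installation needs credentials, use test credentials first and go through a permission/compliance review. Evidence: `README.md`, `docs/extra/components/choose_evaluator_llm.md`, `docs/extra/components/choose_generator_llm.md`, `docs/getstarted/evals.md`, and others
- **Host AI context**: The AI Context Pack, Prompt Preview, Skill routing, risk rules, and project facts. Reason: imported context influences the host AI's later judgments, so unverified items must not be presented as facts.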

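### Minimal Safe Next Steps

- **Run the Prompt Preview first**: Use the interactive trial to check the task profile and role fit before importing the whole role library. (Applies to: any project, especially when output quality is unknown.)
- **Trial-install only in an isolated directory or test account**: Keep install commands from polluting your primary host AI, real projects, or home directory. (Applies to: cases with signs of command execution, plugin configuration, or local writes.)
- **Back up the host AI configuration first**: Skills, plugins, and rule files may change the default behavior of Claude/Cursor/Codex. (Applies to: cases with a plugin manifest, Skill, or host rule entry point.)
- **Do not use real production credentials**: Once an environment variable or API key enters the host or toolchain, it can create account and compliance risk. (Applies to: cases where API, TOKEN, KEY, SECRET, or similar environment signals appear.)
- **After installing, verify only one minimal task**: Check loading, compatibility, output quality, and rollback before deciding whether to go deeper. (Applies to: moving from a trial into a real workflow.)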
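
The "back up the host AI configuration" step above can be as simple as copying each rule file to a timestamped location before a trial install. The following is a minimal sketch under stated assumptions: the helper name `backup_file` and the `.bak` naming scheme are hypothetical, and which files count as host configuration depends on the host and must be checked by the user.

```python
import shutil
import time
from pathlib import Path

def backup_file(src: Path, backup_dir: Path) -> Path:
    """Copy src into backup_dir under a timestamped name and return the copy's path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = backup_dir / f"{src.name}.{stamp}.bak"
    # copy2 copies file contents and tries to preserve metadata such as timestamps.
    shutil.copy2(src, dest)
    return dest
```

Keeping the returned path alongside a note of the install commands gives a concrete starting point for restoring the pre-install state later.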

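### Exit Paths

- **Preserve the pre-install state**: Record the original host configuration and project state so you can later judge whether recovery is possible.
- **Be ready to remove the host plugin / Skill / rule entry point**: If behavior is abnormal after a trial install, this lets you restore the host AI to its pre-trial state.
- **Keep a record of the original role choice**: If the output drifts off topic, return to the task-profile stage and pick a different role instead of pushing on with the wrong one.
- **Record install commands and write paths**: Without explicit uninstall instructions, you at least need to know which directories or configuration need manual cleanup.
- **Be ready to revoke test API keys or tokens**: This limits the damage quickly if test credentials leak or are misused.
- **No rollback path, no primary environment**: Lack of rollback is a blocker before continuing; do not proceed on trust or luck.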

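## Preview-Only Items

- Explaining who the project suits and what it can do
- Demonstrating a typical conversation flow based on project documentation
- Helping the user judge whether it is worth installing or researching further

## Items That Require Post-Install Verification

- Actually installing a Skill, plugin, or CLI
- Running scripts, modifying local files, or accessing external services
- Verifying real output quality, performance, and compatibility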

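## Boundary and Risk Decision Card

- **Mistaking the pre-install preview for a real run**: The user may overestimate how much configuration, permission, and compatibility verification has already been done. Mitigation: clearly separate prompt_preview_can_do from runtime_required. Claim: `clm_0009` inferred 0.45
- **Command execution modifies the local environment**: Install commands may write to the user's home directory, the host plugin directory, or project configuration. Mitigation: run them first in an isolated environment or test account. Evidence: `README.md`, `docs/getstarted/install.md` Claim: `clm_0010` supported 0.86
- **To confirm**: Is a real installation compatible with the user's current host AI version? Reason: compatibility can only be verified in the actual host environment.
- **To confirm**: Does the project's output quality meet the user's specific task? Reason: a pre-install preview only shows flow and boundaries and cannot replace real evaluation.
- **To confirm**: Do the install commands need network access, elevated permissions, or global writes? Reason: this affects installation risk in both enterprise and personal environments.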

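## Pre-Work Context

### Loading Order

- First read how_to_use.host_ai_instruction to establish the boundaries of this pre-install decision asset.
- Read claim_graph_summary to confirm that facts come from the Claim/Evidence Graph, not from the Human Wiki narrative.
- Then read intended_users, capabilities, and quick_start_candidates to judge whether the user is a match.
- When a specific task needs to be carried out, check role_skill_index first, then evidence_index.
- For questions about real installation, file modification, network access, performance, or compatibility, switch to risk_card and boundaries.runtime_required.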
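
The reading order above can be illustrated with a short sketch. It assumes the pack is available as a nested dictionary (for instance, parsed from JSON) whose keys match the section names mentioned in the list; that serialized shape, along with the helper names `lookup` and `load_in_order`, is an assumption of this example and is not confirmed by the source.

```python
from typing import Any

# Section keys in the order the text recommends reading them.
LOAD_ORDER = [
    "how_to_use.host_ai_instruction",
    "claim_graph_summary",
    "intended_users",
    "capabilities",
    "quick_start_candidates",
]

def lookup(pack: dict, dotted_key: str) -> Any:
    """Resolve a dotted key such as 'a.b' in nested dicts; None if absent."""
    node: Any = pack
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def load_in_order(pack: dict) -> list:
    """Return (key, value) pairs for the sections present, in reading order."""
    loaded = []
    for key in LOAD_ORDER:
        value = lookup(pack, key)
        if value is not None:
            loaded.append((key, value))
    return loaded
```

Sections that are missing are skipped rather than guessed, which matches the pack's rule against filling in facts without evidence.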

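### Task Routing

- **Command-line launch or installation flow**: First state that this capability requires post-install verification, then provide a pre-install checklist. Boundary: must be verified after a real install or run. Evidence: `README.md`, `docs/getstarted/install.md` Claim: `clm_0001` supported 0.86

### Context Size

- Total files: 517
- Important-file coverage: 40/517
- Evidence index entries: 80
- Role / Skill entries: 78

### Handling Insufficient Evidence

- **missing_evidence**: State that the evidence is insufficient and ask the user for the target file, README passage, or post-install verification record; do not fill in facts.
- **out_of_scope_request**: State that the task is outside the evidence scope of this AI Context Pack, and suggest the user consult the Human Manual first or verify after a real installation.
- **runtime_request**: Provide a pre-install checklist and the source of each command, but do not run commands for the user or claim to have run them.
- **source_conflict**: Show the conflicting sources side by side, mark the item as pending verification, and do not force a choice between versions.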

## Prompt Recipes

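### Fit Assessment

- Goal: Judge whether this project fits the user's current task.
- Expected output: A fit conclusion, key reasons, evidence citations, what can be previewed before installing, what must be verified after installing, and a suggested next step.

```text
Based on the ragas AI Context Pack, first ask me 3 necessary questions, then judge whether it fits my task. The answer must include: who it suits, what it can do, what it cannot do, whether it is worth installing, and where the evidence comes from. Every project fact must cite evidence_refs, source_paths, or a claim_id.
```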

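### Pre-Install Experience

- Goal: Let the user get a feel for the core workflow before installing, without presenting the preview as real capability or a marketing promise.
- Expected output: An experience script with boundary labels, a post-install verification checklist, and cautious advice; no promises of real execution and no hard-sell wording.

```text
Treat ragas as a pre-install experience asset, not as an installed tool or a real runtime environment.

Output exactly four sections:
1. First ask me 3 necessary questions.
2. Give an "experience script": use the three labels [previewable before install], [must verify after install], and [insufficient evidence] to show how it might guide the workflow.
3. Give a post-install verification checklist: list which capabilities can only be confirmed after a real installation, real host loading, and a real project run.
4. Give cautious advice: you may only say "worth further research / a trial install", "provide more information before deciding", or "not recommended to continue"; do not endorse the project.

Hard boundaries:
- Do not claim to have installed, run, executed tests, modified files, or produced real results.
- Do not use promise-style wording such as "adapts automatically", "guaranteed to pass", "perfect fit", or "strongly recommend installing".
- When describing how it works after installation, you must use a conditional such as "If installation succeeds and the host loads the Skill correctly, it may ...".
- The experience script may only be written as "sample lines / a hypothetical flow": use "may ask / may suggest / may show", and never "has written, has generated, has passed, is running, is generating".
- The Prompt Preview does not provide install commands; if the user is ready for a trial install, only remind them to read the Quick Start and Risk Card first and to verify in an isolated environment.
- Every project fact must come from a supported claim, evidence_refs, or source_paths; inferred/unverified items may only be used as risks or items to confirm.
```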

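### Role / Skill Selection

- Goal: Pick the best-matching assets from the project's roles or Skills.
- Expected output: A list of candidate roles or Skills, each with its applicable scenario, evidence path, risk boundary, and whether post-install verification is needed.

```text
Read role_skill_index and recommend the 3-5 roles or Skills most relevant to my target task. For each recommendation, explain the applicable scenario, likely output, risk boundary, and evidence_refs.
```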

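### Risk Pre-Check

- Goal: Identify environment, permission, rule-conflict, and quality risks before installing or adopting the project.
- Expected output: A checklist covering environment, permissions, dependencies, licensing, host conflicts, quality risks, and unknowns.

```text
Based on risk_card, boundaries, and quick_start_candidates, give me a pre-install risk checklist. Do not run commands for me; only explain what I should check, why, and what happens if the check fails.
```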

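### Host AI Kickoff Instruction

- Goal: Turn the project context into an instruction for the host AI at the start of a conversation.
- Expected output: A kickoff instruction with clear boundaries and clear evidence citations, suitable for pasting into the host AI.

```text
Based on the ragas AI Context Pack, generate a kickoff instruction that I can paste into the host AI. The instruction must respect not_runtime=true and must not claim that the project has been installed, run, or has produced real results.
```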

## Role / Skill Index

- Indexed 78 role / Skill / project-document entries in total.

- **Key Features** (project_doc): Supercharge Your LLM Application Evaluations 🚀 Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `README.md`
- **Ragas Examples** (project_doc): Official examples demonstrating how to use Ragas for evaluating different types of AI applications including RAG systems, agents, prompts, workflows, and LLM benchmarking. These examples might be unstable and are subject to change. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `examples/README.md`
- **Agentic or Tool use** (project_doc): Agentic or tool use workflows can be evaluated in multiple dimensions. Here are some of the metrics that can be used to evaluate the performance of agents or tools in a given task. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/agents.md`
- **Google Gemini Integration Guide** (project_doc): This guide covers setting up and using Google's Gemini models with Ragas for evaluation. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/integrations/gemini.md`
- **AG-UI Agent Evaluation Examples** (project_doc): This example demonstrates how to evaluate agents built with the AG-UI protocol using Ragas metrics. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `examples/ragas_examples/ag_ui_agent_experiments/README.md`
- **Backend Architecture Guide** (project_doc): Simple plugin architecture for data storage backends. Implement one abstract class, register via entry points. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `src/ragas/backends/README.md`
- **Installation** (project_doc): To get started, install Ragas using pip with the following command: Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/getstarted/install.md`
- **Testset Generation for Agents or Tool use cases** (project_doc): Testset Generation for Agents or Tool use cases Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/test_data_generation/agents.md`
- **CLAUDE.md** (project_doc): This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `CLAUDE.md`
- **Development Guide for Ragas Monorepo** (project_doc): Development Guide for Ragas Monorepo Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `CONTRIBUTING.md`
- **✨ Introduction** (project_doc): Ragas is a library that helps you move from "vibe checks" to systematic evaluation loops for your AI applications. It provides tools to supercharge the evaluation of Large Language Model (LLM) applications, enabling you to evaluate your LLM applications with ease and confidence. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/index.md`
- **QuotedSpansAlignment** (project_doc): What: A metric that measures the fraction of quoted spans in a model's answer that appear verbatim in the retrieved sources. The score is in the range [0, 1], where 1.0 indicates every quoted span is supported by evidence and 0.0 indicates no quoted spans are found in the sources. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/quoted_spans_metric.md`
- **❤️ Community** (project_doc): "Alone we can do so little; together we can do so much." - Helen Keller Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/community/index.md`
- **PDF Export** (project_doc): Purpose: The PDF export feature builds the complete Ragas documentation as a single PDF file using MkDocs with the mkdocs-to-pdf plugin. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/community/pdf_export.md`
- **Evaluation Dataset** (project_doc): An evaluation dataset is a homogeneous collection of data samples (`eval_sample.md`) designed to assess the performance and capabilities of an AI application. In Ragas, evaluation datasets are represented using the EvaluationDataset class, which provides a structured way to organize and manage data samples for evaluation purposes. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/components/eval_dataset.md`
- **Evaluation Sample** (project_doc): An evaluation sample is a single structured data instance that is used to assess and measure the performance of your LLM application in specific scenarios. It represents a single unit of interaction or a specific use case that the AI application is expected to handle. In Ragas, evaluation samples are represented using the SingleTurnSample and MultiTurnSample classes. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/components/eval_sample.md`
- **Components Guide** (project_doc): This guide provides an overview of the different components used inside Ragas. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/components/index.md`
- **Prompt Object** (project_doc): Prompts in Ragas are used inside various metrics and synthetic data generation tasks. In each of these tasks, Ragas also provides a way for the user to modify or replace the default prompt with a custom prompt. This guide provides an overview of the Prompt Object in Ragas. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/components/prompt.md`
- **Datasets and Experiment Results** (project_doc): When we evaluate AI systems, we typically work with two main types of data: Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/datasets.md`
- **Experiments** (project_doc): An experiment is a deliberate change made to your application to test a hypothesis or idea. For example, in a Retrieval-Augmented Generation (RAG) system, you might replace the retriever model to evaluate how a new embedding model impacts chatbot responses. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/experimentation.md`
- **Utilizing User Feedback** (project_doc): User feedback can often be noisy and challenging to harness effectively. However, within the feedback, valuable signals exist that can be leveraged to iteratively enhance your LLM and RAG applications. These signals have the potential to be amplified effectively, aiding in the detection of specific issues within the pipeline and preventing recurring errors. Ragas is equipped to assist you in the analysis of user fee… Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/feedback/index.md`
- **📚 Core Concepts** (project_doc): Experimentation (`experimentation.md`) Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/index.md`
- **Answer Correctness** (project_doc): The assessment of Answer Correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/answer_correctness.md`
- **Answer Relevancy** (project_doc): The Answer Relevancy metric measures how relevant a response is to the user input. It ranges from 0 to 1, with higher scores indicating better alignment with the user input. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/answer_relevance.md`
- **Aspect Critique** (project_doc): Aspect Critique is a binary evaluation metric used to assess submissions based on predefined aspects such as harmlessness and correctness. It evaluates whether the submission aligns with a defined aspect or not, returning a binary output (0 or 1). Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/aspect_critic.md`
- **Context Entities Recall** (project_doc): ContextEntityRecall metric gives the measure of recall of the retrieved context, based on the number of entities present in both reference and retrieved contexts relative to the number of entities present in the reference alone. Simply put, it is a measure of what fraction of entities is recalled from reference. This metric is useful in fact-based use cases like tourism help desk, historical QA, etc. This metric ca… Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/context_entities_recall.md`
- **Context Precision** (project_doc): Context Precision is a metric that evaluates the retriever's ability to rank relevant chunks higher than irrelevant ones for a given query in the retrieved context. Specifically, it assesses the degree to which relevant chunks in the retrieved context are placed at the top of the ranking. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/context_precision.md`
- **Context Recall** (project_doc): Context Recall measures how many of the relevant documents or pieces of information were successfully retrieved. It focuses on not missing important results. Higher recall means fewer relevant documents were left out. In short, recall is about not missing anything important. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/context_recall.md`
- **Factual Correctness** (project_doc): FactualCorrectness is a metric that compares and evaluates the factual accuracy of the generated response with the reference. This metric is used to determine the extent to which the generated response aligns with the reference. The factual correctness score ranges from 0 to 1, with higher values indicating better performance. To measure the alignment between the response and the reference, the metric uses the LLM… Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/factual_correctness.md`
- **Faithfulness** (project_doc): The Faithfulness metric measures how factually consistent a response is with the retrieved context. It ranges from 0 to 1, with higher scores indicating better consistency. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/faithfulness.md`
- **General Purpose Metrics** (project_doc): General purpose evaluation metrics are used to evaluate any given task. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/general_purpose.md`
- **List of available metrics** (project_doc): Ragas provides a set of evaluation metrics that can be used to measure the performance of your LLM application. These metrics are designed to help you objectively measure the performance of your application. Metrics are available for different applications and tasks, such as RAG and Agentic workflows. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/index.md`
- **MultiModalFaithfulness** (project_doc): MultiModalFaithfulness metric measures the factual consistency of the generated answer against both visual and textual context. It is calculated from the answer, retrieved textual context, and visual context. The answer is scaled to a [0, 1] range, with higher scores indicating better faithfulness. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/multi_modal_faithfulness.md`
- **MultiModalRelevance** (project_doc): MultiModalRelevance metric measures the relevance of the generated answer against both visual and textual context. It is calculated from the user input, response, and retrieved contexts (both visual and textual). The answer is scaled to a [0, 1] range, with higher scores indicating better relevance. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/multi_modal_relevance.md`
- **Noise Sensitivity** (project_doc): NoiseSensitivity measures how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents. The score ranges from 0 to 1, with lower values indicating better performance. Noise sensitivity is computed using the user input, reference, response, and the retrieved contexts. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/noise_sensitivity.md`
- **Nvidia Metrics** (project_doc): Answer Accuracy measures the agreement between a model’s response and a reference ground truth for a given question. This is done via two distinct "LLM-as-a-Judge" prompts that each return a rating (0, 2, or 4). The metric converts these ratings into a [0, 1] scale and then takes the average of the two scores from the judges. Higher scores indicate that the model’s answer closely matches the reference. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/nvidia_metrics.md`
- **Rubric-Based Evaluation** (project_doc): Rubric-based evaluation metrics allow you to evaluate LLM responses using custom scoring criteria. Ragas provides two types of rubric metrics: Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/rubrics_based.md`
- **Semantic Similarity** (project_doc): The Semantic Similarity metric evaluates the semantic resemblance between a generated response and a reference ground truth answer. It ranges from 0 to 1, with higher scores indicating better alignment between the generated answer and the ground truth. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/semantic_similarity.md`
- **SQL** (project_doc): Execution based metrics: In these metrics the resulting SQL is compared after executing the SQL query on the database and then comparing the response with the expected results. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/sql.md`
- **Tasks Metrics** (project_doc): The Summarization Score metric measures how well a summary response captures the important information from the reference contexts. The intuition behind this metric is that a good summary should contain all the important information present in the context. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/summarization_score.md`
- **Traditional NLP Metrics** (project_doc): NonLLMStringSimilarity metric measures the similarity between the reference and the response using traditional string distance measures such as Levenshtein, Hamming, and Jaro. This metric is useful for evaluating the similarity of response to the reference text without relying on large language models (LLMs). The metric returns a score between 0 and 1, where 1 indicates a perfect match between the response and the re… Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/available_metrics/traditional.md`
- **Metrics** (project_doc): Overview: learn more about overview and design principles (`overview/index.md`). Available Metrics: learn about available metrics and their inner workings (`available_metrics/index.md`). Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/index.md`
- **Overview of Metrics** (project_doc): You can't improve what you don't measure. Metrics are the feedback loop that makes iteration possible. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/concepts/metrics/overview/index.md`
- **for Google AI Studio** (project_doc): OpenAI: Install the langchain-openai package Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/extra/components/choose_evaluator_llm.md`
- **Ensure you have credentials configured gcloud, workload identity, etc.** (project_doc): OpenAI: Install the langchain-openai package Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/extra/components/choose_generator_llm.md`
- **Evaluate a simple LLM application** (project_doc): The purpose of this guide is to illustrate a simple workflow for testing and evaluating an LLM application with ragas. It assumes minimum knowledge in AI application building and evaluation. Please refer to our installation instruction (`./install.md`) for installing ragas. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/getstarted/evals.md`
- **Run your first experiment** (project_doc): This tutorial walks you through running your first experiment with Ragas using the @experiment decorator and a local CSV backend. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/getstarted/experiments_quickstart.md`
- **🚀 Get Started** (project_doc): Welcome to Ragas! The Get Started guides will walk you through the fundamentals of working with Ragas. These tutorials assume basic knowledge of Python and building LLM application pipelines. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/getstarted/index.md`
- **Quick Start: Get Evaluations Running in a Flash** (project_doc): Quick Start: Get Evaluations Running in a Flash Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/getstarted/quickstart.md`
- **Evaluate a simple RAG system** (project_doc): The purpose of this guide is to illustrate a simple workflow for testing and evaluating a RAG system with ragas. It assumes minimum knowledge in building RAG system and evaluation. Please refer to our installation instruction (`./install.md`) for installing ragas. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/getstarted/rag_eval.md`
- **Testset Generation for RAG** (project_doc): This simple guide will help you generate a testset for evaluating your RAG pipeline using your own documents. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/getstarted/rag_testset_generation.md`
- **How to estimate Cost and Usage of evaluations and testset generation** (project_doc): How to estimate Cost and Usage of evaluations and testset generation Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/applications/_cost.md`
- **Adding to your CI pipeline with Pytest** (project_doc): Adding to your CI pipeline with Pytest Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/applications/add_to_ci.md`
- **How to Align an LLM as a Judge** (project_doc): In this guide, you'll learn how to systematically evaluate and align an LLM-as-judge metric with human expert judgments using Ragas. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/applications/align-llm-as-judge.md`
- **How to Evaluate a New LLM For Your Use Case** (project_doc): How to Evaluate a New LLM For Your Use Case Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/applications/benchmark_llm.md`
- **Compare Embeddings for retriever** (project_doc): The performance of the retriever is a critical and influential factor that determines the overall effectiveness of a Retrieval Augmented Generation (RAG) system. In particular, the quality of the embeddings used plays a pivotal role in determining the quality of the retrieved content. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/applications/compare_embeddings.md`
- **Compare LLMs using Ragas Evaluations** (project_doc): Compare LLMs using Ragas Evaluations Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/applications/compare_llms.md`
- **How to Evaluate and Improve a RAG App** (project_doc): How to Evaluate and Improve a RAG App Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/applications/evaluate-and-improve-rag.md`
- **Evaluating Multi-Turn Conversations** (project_doc): Evaluating Multi-Turn Conversations Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/applications/evaluating_multi_turn_conversations.md`
- **Applications** (project_doc): Ragas in action. Examples of how to use Ragas in various applications and usecases to solve problems you might encounter when you're building. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/applications/index.md`
- **How to Evaluate Your Prompt and Improve It** (project_doc): How to Evaluate Your Prompt and Improve It Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/applications/iterate_prompt.md`
- **A systematic approach for prompt optimization** (project_doc): A systematic approach for prompt optimization Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/applications/prompt_optimization.md`
- **Generating a Synthetic Test Set for RAG-Based Question Answering with Ragas** (project_doc): Generating a Synthetic Test Set for RAG-Based Question Answering with Ragas Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/applications/singlehop_testset_gen.md`
- **How to evaluate a Text to SQL Agent** (project_doc): How to evaluate a Text to SQL Agent Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/applications/text2sql.md`
- **Aligning LLM Evaluators with Human Judgment** (project_doc): Aligning LLM Evaluators with Human Judgment Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/applications/vertexai_alignment.md`
- **Compare models provided by VertexAI on RAG-based Q&A task using Ragas metrics** (project_doc): Compare models provided by VertexAI on RAG-based Q&A task using Ragas metrics Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/applications/vertexai_model_comparision.md`
- **Getting Started: Ragas with Vertex AI** (project_doc): Getting Started: Ragas with Vertex AI Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/applications/vertexai_x_ragas.md`
- **Agent Evaluation Quickstart** (project_doc): The agent evals template provides a setup for evaluating AI agents that solve mathematical problems with correctness metrics. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/cli/agent_evals.md`
- **LLM Benchmarking Quickstart** (project_doc): The benchmark llm template benchmarks and compares different LLM models on discount calculation tasks. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/cli/benchmark_llm.md`
- **Improve RAG Quickstart** (project_doc): The improve rag template demonstrates how to compare different RAG approaches using real-world evaluation data. It includes naive single retrieval and agentic multi-step retrieval RAG modes. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/cli/improve_rag.md`
- **Ragas CLI** (project_doc): The Ragas Command Line Interface (CLI) provides tools for quickly setting up evaluation projects and running experiments from the terminal. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/cli/index.md`
- **Judge Alignment Quickstart** (project_doc): The judge alignment template measures how well an LLM-as-judge aligns with human evaluation standards. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/cli/judge_alignment.md`
- **LlamaIndex Agent Evaluation Quickstart** (project_doc): LlamaIndex Agent Evaluation Quickstart Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/cli/llamaIndex_agent_evals.md`
- **Prompt Evaluation Quickstart** (project_doc): The prompt evals template evaluates and compares different prompt variations with sentiment analysis. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/cli/prompt_evals.md`
- **RAG Evaluation Quickstart** (project_doc): The rag eval template provides a complete RAG evaluation setup with custom metrics, dataset management, and experiment tracking. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/cli/rag_eval.md`
- **Text-to-SQL Evaluation Quickstart** (project_doc): The text2sql template evaluates text-to-SQL systems by comparing SQL execution results. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/cli/text2sql.md`
- **Workflow Evaluation Quickstart** (project_doc): The workflow eval template evaluates complex LLM workflows with email classification and routing. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/cli/workflow_eval.md`
- **Caching in Ragas** (project_doc): You can use caching to speed up your evaluations and testset generation by avoiding redundant computations. We use Exact Match Caching to cache the responses from the LLM and Embedding models. Activation hint: consult when the user needs to understand the project structure, installation, or boundaries. Evidence: `docs/howtos/customizations/_caching.md`
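
To make one of the indexed metric descriptions concrete, the QuotedSpansAlignment entry describes a score equal to the fraction of quoted spans in an answer that appear verbatim in the retrieved sources. The sketch below illustrates that idea only and is not the ragas implementation: the function name, the double-quote regex, the exact-substring matching, and the choice to return `None` when the answer has no quoted spans are all assumptions made for this example.

```python
import re
from typing import Optional

def quoted_span_fraction(answer: str, sources: list) -> Optional[float]:
    """Fraction of double-quoted spans in `answer` found verbatim in any source.

    Returns None when the answer contains no quoted spans (an assumption of
    this sketch; the real metric may handle that case differently).
    """
    spans = re.findall(r'"([^"]+)"', answer)
    if not spans:
        return None
    # A span counts as supported if it is an exact substring of some source.
    matched = sum(any(span in src for src in sources) for span in spans)
    return matched / len(spans)
```

With an answer that quotes two spans, only one of which occurs in the sources, the function returns 0.5. Real scores and edge-case behavior can only be confirmed by installing ragas and running the actual metric.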

## 证据索引

- 共索引 80 条证据。

- **Key Features**（documentation）：Supercharge Your LLM Application Evaluations 🚀 证据：`README.md`
- **Ragas Examples**（documentation）：Official examples demonstrating how to use Ragas for evaluating different types of AI applications including RAG systems, agents, prompts, workflows, and LLM benchmarking. These examples might be unstable and are subject to change. 证据：`examples/README.md`
- **Agentic or Tool use**（documentation）：Agentic or tool use workflows can be evaluated in multiple dimensions. Here are some of the metrics that can be used to evaluate the performance of agents or tools in a given task. 证据：`docs/concepts/metrics/available_metrics/agents.md`
- **Google Gemini Integration Guide**（documentation）：This guide covers setting up and using Google's Gemini models with Ragas for evaluation. 证据：`docs/howtos/integrations/gemini.md`
- **AG-UI Agent Evaluation Examples**（documentation）：This example demonstrates how to evaluate agents built with the AG-UI protocol using Ragas metrics. 证据：`examples/ragas_examples/ag_ui_agent_experiments/README.md`
- **Backend Architecture Guide**（documentation）：Simple plugin architecture for data storage backends. Implement one abstract class, register via entry points. 证据：`src/ragas/backends/README.md`
- **Installation**（documentation）：To get started, install Ragas using pip with the following command: 证据：`docs/getstarted/install.md`
- **Testset Generation for Agents or Tool use cases**（documentation）：Testset Generation for Agents or Tool use cases 证据：`docs/concepts/test_data_generation/agents.md`
- **CLAUDE.md**（documentation）：This file provides guidance to Claude Code claude.ai/code when working with code in this repository. 证据：`CLAUDE.md`
- **Development Guide for Ragas Monorepo**（documentation）：Development Guide for Ragas Monorepo 证据：`CONTRIBUTING.md`
- **License**（source_file）：Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ 证据：`LICENSE`
- **License**（source_file）：Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ 证据：`examples/LICENSE`
- **✨ Introduction** (documentation): Ragas is a library that helps you move from "vibe checks" to systematic evaluation loops for your AI applications. It provides tools to supercharge the evaluation of Large Language Model (LLM) applications, enabling you to evaluate your LLM applications with ease and confidence. Evidence: `docs/index.md`
- **QuotedSpansAlignment** (documentation): A metric that measures the fraction of quoted spans in a model's answer that appear verbatim in the retrieved sources. The score is in the range [0, 1], where 1.0 indicates every quoted span is supported by evidence and 0.0 indicates no quoted spans are found in the sources. Evidence: `docs/quoted_spans_metric.md`
- **❤️ Community** (documentation): "Alone we can do so little; together we can do so much." - Helen Keller Evidence: `docs/community/index.md`
- **PDF Export** (documentation): The PDF export feature builds the complete Ragas documentation as a single PDF file using MkDocs with the mkdocs-to-pdf plugin. Evidence: `docs/community/pdf_export.md`
- **Evaluation Dataset** (documentation): An evaluation dataset is a homogeneous collection of data samples (`eval_sample.md`) designed to assess the performance and capabilities of an AI application. In Ragas, evaluation datasets are represented using the EvaluationDataset class, which provides a structured way to organize and manage data samples for evaluation purposes. Evidence: `docs/concepts/components/eval_dataset.md`
- **Evaluation Sample** (documentation): An evaluation sample is a single structured data instance that is used to assess and measure the performance of your LLM application in specific scenarios. It represents a single unit of interaction or a specific use case that the AI application is expected to handle. In Ragas, evaluation samples are represented using the SingleTurnSample and MultiTurnSample classes. Evidence: `docs/concepts/components/eval_sample.md`
- **Components Guide** (documentation): This guide provides an overview of the different components used inside Ragas. Evidence: `docs/concepts/components/index.md`
- **Prompt Object** (documentation): Prompts in Ragas are used inside various metrics and synthetic data generation tasks. In each of these tasks, Ragas also provides a way for the user to modify or replace the default prompt with a custom prompt. This guide provides an overview of the Prompt Object in Ragas. Evidence: `docs/concepts/components/prompt.md`
- **Datasets and Experiment Results** (documentation): When we evaluate AI systems, we typically work with two main types of data: Evidence: `docs/concepts/datasets.md`
- **Experiments** (documentation): An experiment is a deliberate change made to your application to test a hypothesis or idea. For example, in a Retrieval-Augmented Generation (RAG) system, you might replace the retriever model to evaluate how a new embedding model impacts chatbot responses. Evidence: `docs/concepts/experimentation.md`
- **Utilizing User Feedback** (documentation): User feedback can often be noisy and challenging to harness effectively. However, within the feedback, valuable signals exist that can be leveraged to iteratively enhance your LLM and RAG applications. These signals have the potential to be amplified effectively, aiding in the detection of specific issues within the pipeline and preventing recurring errors. Ragas is equipped to assist you in the analysis of user feedback data, enabling the discovery of patterns and making it a valuable resource for continual improvement. Evidence: `docs/concepts/feedback/index.md`
- **📚 Core Concepts** (documentation): Experimentation (`experimentation.md`) Evidence: `docs/concepts/index.md`
- **Answer Correctness** (documentation): The assessment of Answer Correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness. Evidence: `docs/concepts/metrics/available_metrics/answer_correctness.md`
- **Answer Relevancy** (documentation): The Answer Relevancy metric measures how relevant a response is to the user input. It ranges from 0 to 1, with higher scores indicating better alignment with the user input. Evidence: `docs/concepts/metrics/available_metrics/answer_relevance.md`
- **Aspect Critique** (documentation): Aspect Critique is a binary evaluation metric used to assess submissions based on predefined aspects such as harmlessness and correctness. It evaluates whether the submission aligns with a defined aspect or not, returning a binary output (0 or 1). Evidence: `docs/concepts/metrics/available_metrics/aspect_critic.md`
- **Context Entities Recall** (documentation): The ContextEntityRecall metric measures the recall of the retrieved context, based on the number of entities present in both the reference and the retrieved contexts relative to the number of entities present in the reference alone. Simply put, it measures what fraction of entities is recalled from the reference. This metric is useful in fact-based use cases like tourism help desk, historical QA, etc. It can help evaluate the retrieval mechanism for entities, based on comparison with entities present in the reference, because in cases where entities matter, we need retrieved contexts that cover them. Evidence: `docs/concepts/metrics/available_metrics/context_entities_recall.md`
- **Context Precision** (documentation): Context Precision is a metric that evaluates the retriever's ability to rank relevant chunks higher than irrelevant ones for a given query in the retrieved context. Specifically, it assesses the degree to which relevant chunks in the retrieved context are placed at the top of the ranking. Evidence: `docs/concepts/metrics/available_metrics/context_precision.md`
- **Context Recall** (documentation): Context Recall measures how many of the relevant documents or pieces of information were successfully retrieved. It focuses on not missing important results. Higher recall means fewer relevant documents were left out. In short, recall is about not missing anything important. Evidence: `docs/concepts/metrics/available_metrics/context_recall.md`
- **Factual Correctness** (documentation): FactualCorrectness is a metric that compares and evaluates the factual accuracy of the generated response with the reference. This metric is used to determine the extent to which the generated response aligns with the reference. The factual correctness score ranges from 0 to 1, with higher values indicating better performance. To measure the alignment between the response and the reference, the metric uses the LLM to first break down the response and reference into claims and then uses natural language inference to determine the factual overlap between the response and the reference. Factual overlap is quantified using precision, recall, and F1 score, which can be controlled using the mode… Evidence: `docs/concepts/metrics/available_metrics/factual_correctness.md`
- **Faithfulness** (documentation): The Faithfulness metric measures how factually consistent a response is with the retrieved context. It ranges from 0 to 1, with higher scores indicating better consistency. Evidence: `docs/concepts/metrics/available_metrics/faithfulness.md`
- **General Purpose Metrics** (documentation): General purpose evaluation metrics are used to evaluate any given task. Evidence: `docs/concepts/metrics/available_metrics/general_purpose.md`
- **List of available metrics** (documentation): Ragas provides a set of evaluation metrics that can be used to measure the performance of your LLM application. These metrics are designed to help you objectively measure the performance of your application. Metrics are available for different applications and tasks, such as RAG and Agentic workflows. Evidence: `docs/concepts/metrics/available_metrics/index.md`
- **MultiModalFaithfulness** (documentation): The MultiModalFaithfulness metric measures the factual consistency of the generated answer against both visual and textual context. It is calculated from the answer, retrieved textual context, and visual context. The answer is scaled to a [0, 1] range, with higher scores indicating better faithfulness. Evidence: `docs/concepts/metrics/available_metrics/multi_modal_faithfulness.md`
- **MultiModalRelevance** (documentation): The MultiModalRelevance metric measures the relevance of the generated answer against both visual and textual context. It is calculated from the user input, response, and retrieved contexts (both visual and textual). The answer is scaled to a [0, 1] range, with higher scores indicating better relevance. Evidence: `docs/concepts/metrics/available_metrics/multi_modal_relevance.md`
- **Noise Sensitivity** (documentation): NoiseSensitivity measures how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents. The score ranges from 0 to 1, with lower values indicating better performance. Noise sensitivity is computed using the user input, reference, response, and the retrieved contexts. Evidence: `docs/concepts/metrics/available_metrics/noise_sensitivity.md`
- **Nvidia Metrics** (documentation): Answer Accuracy measures the agreement between a model’s response and a reference ground truth for a given question. This is done via two distinct "LLM-as-a-Judge" prompts that each return a rating (0, 2, or 4). The metric converts these ratings into a [0, 1] scale and then takes the average of the two scores from the judges. Higher scores indicate that the model’s answer closely matches the reference. Evidence: `docs/concepts/metrics/available_metrics/nvidia_metrics.md`
- **Rubric-Based Evaluation** (documentation): Rubric-based evaluation metrics allow you to evaluate LLM responses using custom scoring criteria. Ragas provides two types of rubric metrics: Evidence: `docs/concepts/metrics/available_metrics/rubrics_based.md`
- **Semantic Similarity** (documentation): The Semantic Similarity metric evaluates the semantic resemblance between a generated response and a reference ground truth answer. It ranges from 0 to 1, with higher scores indicating better alignment between the generated answer and the ground truth. Evidence: `docs/concepts/metrics/available_metrics/semantic_similarity.md`
- **SQL** (documentation): Execution-based metrics: in these metrics, the SQL query is executed on the database and the response is then compared with the expected results. Evidence: `docs/concepts/metrics/available_metrics/sql.md`
- **Tasks Metrics** (documentation): The Summarization Score metric measures how well a summary response captures the important information from the reference contexts. The intuition behind this metric is that a good summary should contain all the important information present in the context. Evidence: `docs/concepts/metrics/available_metrics/summarization_score.md`
- **Traditional NLP Metrics** (documentation): The NonLLMStringSimilarity metric measures the similarity between the reference and the response using traditional string distance measures such as Levenshtein, Hamming, and Jaro. This metric is useful for evaluating the similarity of the response to the reference text without relying on large language models (LLMs). The metric returns a score between 0 and 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric. Evidence: `docs/concepts/metrics/available_metrics/traditional.md`
- **Metrics** (documentation): Overview: learn more about overview and design principles (`overview/index.md`). Available Metrics: learn about available metrics and their inner workings (`available_metrics/index.md`). Evidence: `docs/concepts/metrics/index.md`
- **Overview of Metrics** (documentation): You can't improve what you don't measure. Metrics are the feedback loop that makes iteration possible. Evidence: `docs/concepts/metrics/overview/index.md`
- **for Google AI Studio** (documentation): OpenAI tab: install the langchain-openai package. Evidence: `docs/extra/components/choose_evaluator_llm.md`
- **Ensure you have credentials configured (gcloud, workload identity, etc.)** (documentation): OpenAI tab: install the langchain-openai package. Evidence: `docs/extra/components/choose_generator_llm.md`
- **Evaluate a simple LLM application** (documentation): The purpose of this guide is to illustrate a simple workflow for testing and evaluating an LLM application with ragas. It assumes minimum knowledge in AI application building and evaluation. Please refer to our installation instructions (`./install.md`) for installing ragas. Evidence: `docs/getstarted/evals.md`
- **Run your first experiment** (documentation): This tutorial walks you through running your first experiment with Ragas using the @experiment decorator and a local CSV backend. Evidence: `docs/getstarted/experiments_quickstart.md`
- **🚀 Get Started** (documentation): Welcome to Ragas! The Get Started guides will walk you through the fundamentals of working with Ragas. These tutorials assume basic knowledge of Python and building LLM application pipelines. Evidence: `docs/getstarted/index.md`
- **Quick Start: Get Evaluations Running in a Flash** (documentation): Quick Start: Get Evaluations Running in a Flash Evidence: `docs/getstarted/quickstart.md`
- **Evaluate a simple RAG system** (documentation): The purpose of this guide is to illustrate a simple workflow for testing and evaluating a RAG system with ragas. It assumes minimum knowledge in building RAG systems and evaluation. Please refer to our installation instructions (`./install.md`) for installing ragas. Evidence: `docs/getstarted/rag_eval.md`
- **Testset Generation for RAG** (documentation): This simple guide will help you generate a testset for evaluating your RAG pipeline using your own documents. Evidence: `docs/getstarted/rag_testset_generation.md`
- **How to estimate Cost and Usage of evaluations and testset generation** (documentation): How to estimate Cost and Usage of evaluations and testset generation Evidence: `docs/howtos/applications/_cost.md`
- **Adding to your CI pipeline with Pytest** (documentation): Adding to your CI pipeline with Pytest Evidence: `docs/howtos/applications/add_to_ci.md`
- **How to Align an LLM as a Judge** (documentation): In this guide, you'll learn how to systematically evaluate and align an LLM-as-judge metric with human expert judgments using Ragas. Evidence: `docs/howtos/applications/align-llm-as-judge.md`
- **How to Evaluate a New LLM For Your Use Case** (documentation): How to Evaluate a New LLM For Your Use Case Evidence: `docs/howtos/applications/benchmark_llm.md`
- **Compare Embeddings for retriever** (documentation): The performance of the retriever is a critical and influential factor that determines the overall effectiveness of a Retrieval Augmented Generation (RAG) system. In particular, the quality of the embeddings used plays a pivotal role in determining the quality of the retrieved content. Evidence: `docs/howtos/applications/compare_embeddings.md`
- **Compare LLMs using Ragas Evaluations** (documentation): Compare LLMs using Ragas Evaluations Evidence: `docs/howtos/applications/compare_llms.md`
- **How to Evaluate and Improve a RAG App** (documentation): How to Evaluate and Improve a RAG App Evidence: `docs/howtos/applications/evaluate-and-improve-rag.md`
- The remaining 20 evidence items are in `AI_CONTEXT_PACK.json` or `EVIDENCE_INDEX.json`.
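
The QuotedSpansAlignment entry in the list above defines its score as the fraction of quoted spans in an answer that appear verbatim in the retrieved sources. A minimal sketch of that arithmetic follows. It is not the Ragas implementation: the function name, the restriction to straight double quotes, and the choice to return `None` when the answer quotes nothing are all assumptions made for illustration.

```python
import re
from typing import List, Optional


def quoted_span_alignment(answer: str, sources: List[str]) -> Optional[float]:
    """Fraction of double-quoted spans in `answer` found verbatim in any source."""
    # Assumption: only straight double quotes delimit a span.
    spans = [s for s in re.findall(r'"([^"]+)"', answer) if s.strip()]
    if not spans:
        # Assumption: the score is treated as undefined when nothing is quoted.
        return None
    supported = sum(any(span in src for src in sources) for span in spans)
    return supported / len(spans)


answer = 'The report says "growth was 5%" and "costs fell".'
sources = ["In Q2, growth was 5% year over year."]
print(quoted_span_alignment(answer, sources))  # one of two spans is supported
```

The actual metric may normalize case or whitespace and handle other quote styles; those details must be confirmed in `docs/quoted_spans_metric.md` after installation.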

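## Rules the Host AI Must Follow

- **Treat this asset as pre-work context, not as a runtime environment.** The AI Context Pack contains only an evidence-based understanding of the project; it does not contain the target project's executable state. Evidence: `README.md`, `examples/README.md`, `docs/concepts/metrics/available_metrics/agents.md`
- **When answering the user, distinguish content that can be previewed from content that can only be verified after installation.** The value of a pre-install experience comes from reducing mistaken installs and misjudgments, not from posing as a real run. Evidence: `README.md`, `examples/README.md`, `docs/concepts/metrics/available_metrics/agents.md`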

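## Questions the User Should Answer Before Starting

- In which host AI or local environment do you plan to use it?
- Do you only want to try the workflow first, or are you preparing for a real installation?
- Which matters most to you: installation cost, output quality, or conflicts with your existing rules?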

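## Acceptance Criteria

- Every capability statement can be traced back to a file path in evidence_refs.
- AI_CONTEXT_PACK.md does not present a preview as a real run.
- The user can understand within 3 minutes who it suits, what it can do, how to start, and where the risk boundaries lie.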
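
The first criterion can be checked mechanically. The sketch below assumes each claim is a dictionary with `claim_id` and `evidence_refs` keys; that record layout is hypothetical and is not taken from `AI_CONTEXT_PACK.json`, whose real schema should be inspected first.

```python
from typing import Dict, List


def claims_missing_evidence(claims: List[Dict]) -> List[str]:
    """Return the ids of claims that list no non-empty evidence path."""
    missing = []
    for claim in claims:
        # Ignore empty or whitespace-only paths.
        refs = [p for p in claim.get("evidence_refs", []) if p and p.strip()]
        if not refs:
            missing.append(claim.get("claim_id", "<unknown>"))
    return missing


# Hypothetical records, for illustration only.
claims = [
    {"claim_id": "clm_0001", "evidence_refs": ["README.md"]},
    {"claim_id": "clm_example", "evidence_refs": []},
]
print(claims_missing_evidence(claims))
```

An empty result means every claim in the input names at least one path; it does not confirm that the path exists or supports the claim.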

---

## Doramagic Context Augmentation

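The content below strengthens the main body of the Repomix/AI Context Pack. The Human Manual provides only a reading skeleton; the pitfall log is converted into working constraints that the host AI must follow.

## Human Manual Skeleton

Usage rule: this is only a reading route and a set of salience signals for the project, not a factual authority. Specific facts must still be traced back to repo evidence / the Claim Graph.

Hard rules for the host AI:
- Do not treat page titles, section order, summaries, or importance as evidence of project facts.
- When explaining the Human Manual skeleton, state explicitly that it is only a reading route / salience signal.
- Judgments about capability, installation, compatibility, runtime status, and risk must cite repo evidence, a source path, or the Claim Graph.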

- **Ragas project overview and quick start**: importance `high`
  - source_paths: README.md, src/ragas/__init__.py, src/ragas/cli.py, src/ragas/experiment.py, docs/getstarted/quickstart.md
- **Evaluation metrics system (Metrics)**: importance `high`
  - source_paths: src/ragas/metrics/__init__.py, src/ragas/metrics/base.py, src/ragas/metrics/result.py, src/ragas/metrics/discrete.py, src/ragas/metrics/numeric.py
- **Testset generation, data schemas, and backend storage**: importance `high`
  - source_paths: src/ragas/testset/__init__.py, src/ragas/testset/graph.py, src/ragas/testset/persona.py, src/ragas/testset/synthesizers/__init__.py, src/ragas/testset/synthesizers/base.py
- **LLM integration, adapters, prompt optimization, and cost tracking**: importance `high`
  - source_paths: src/ragas/llms/__init__.py, src/ragas/llms/base.py, src/ragas/llms/adapters/base.py, src/ragas/llms/adapters/instructor.py, src/ragas/llms/adapters/litellm.py

## Repo Inspection Evidence

- repo_clone_verified: true
- repo_inspection_verified: true
- repo_commit: `298b68274234c060deacab3cf5fb52aa3a20e885`
- inspected_files: `README.md`, `pyproject.toml`, `docs/_static/annotated_data.json`, `docs/_static/edited_chain_runs.json`, `docs/_static/js/commonroom.js`, `docs/_static/js/header_border.js`, `docs/_static/js/mathjax.js`, `docs/_static/js/mendable_chat_bubble.js`, `docs/_static/js/toggle.js`, `docs/_static/sample_annotated_summary.json`, `docs/alfred.py`, `docs/community/index.md`, `docs/community/pdf_export.md`, `docs/concepts/components/eval_dataset.md`, `docs/concepts/components/eval_sample.md`, `docs/concepts/components/index.md`, `docs/concepts/components/prompt.md`, `docs/concepts/datasets.md`, `docs/concepts/experimentation.md`, `docs/concepts/feedback/index.md`

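Hard rules for the host AI:
- Without repo_clone_verified=true, do not claim to have read the source code.
- Without repo_inspection_verified=true, do not present judgments based on README/docs/package files as facts.
- Without quick_start_verified=true, do not claim that the Quick Start has been run successfully.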
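
These three rules amount to a flag-gated allow-list. A minimal sketch, assuming the flags arrive as a plain dictionary (the function name and the wording of each statement are illustrative, not part of the pack):

```python
from typing import Dict, List

# Each verification flag gates one kind of statement, following the hard rules above.
GATED_STATEMENTS = {
    "repo_clone_verified": "may say the source code was read",
    "repo_inspection_verified": "may state README/docs/package findings as facts",
    "quick_start_verified": "may say the Quick Start ran successfully",
}


def allowed_statements(flags: Dict[str, bool]) -> List[str]:
    """Return the statements permitted by `flags`; a missing flag counts as false."""
    return [text for name, text in GATED_STATEMENTS.items() if flags.get(name) is True]


# The flags listed in this section; quick_start_verified is not among them.
flags = {"repo_clone_verified": True, "repo_inspection_verified": True}
for statement in allowed_statements(flags):
    print(statement)
```

With the flags shown in this section, the Quick Start statement is not permitted, because `quick_start_verified` is absent and is therefore treated as false.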

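## Doramagic Pitfall Constraints

These rules come from project-specific pitfalls found during Doramagic's discovery, verification, or compilation process. The host AI must treat them as working constraints, not as ordinary explanatory text.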

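### Constraint 1: Source evidence: Add EvaluationResult summary and threshold checks

- Trigger: GitHub community evidence shows an installation-related issue in this project that is pending verification: Add EvaluationResult summary and threshold checks
- Why it matters: May increase the cost of trial use for new users and of production integration.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2760 | The source discussion mentions Python-related conditions; re-check before installing or trying it out.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.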

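### Constraint 2: Source evidence: Incorrect class name in deprecation warning for LLMContextPrecisionWithoutReference

- Trigger: GitHub community evidence shows an installation-related issue in this project that is pending verification: Incorrect class name in deprecation warning for LLMContextPrecisionWithoutReference
- Why it matters: May affect upgrades, migration, or version selection.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2748 | The source discussion mentions Python-related conditions; re-check before installing or trying it out.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.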

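### Constraint 3: Source evidence: No module named 'langchain_community.chat_models.vertexai'

- Trigger: GitHub community evidence shows an installation-related issue in this project that is pending verification: No module named 'langchain_community.chat_models.vertexai'
- Why it matters: May block installation or the first run.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2741 | The source discussion mentions Python-related conditions; re-check before installing or trying it out.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.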

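### Constraint 4: Source evidence: ragas 0.4.3: ChatVertexAI import broken — uses removed langchain_community path

- Trigger: GitHub community evidence shows an installation-related issue in this project that is pending verification: ragas 0.4.3: ChatVertexAI import broken — uses removed langchain_community path
- Why it matters: May increase the cost of trial use for new users and of production integration.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2745 | The source discussion mentions Python-related conditions; re-check before installing or trying it out.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.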
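
Constraints 3 and 4 both concern a module path that may be missing in a given environment. Before attempting an install or a first run, a user could probe which candidate module is locatable. The sketch below uses only the standard library; `first_importable` is a hypothetical helper, and listing `langchain_google_vertexai` as an alternative is an assumption that should be checked against the linked issues.

```python
import importlib.util
from typing import Iterable, Optional


def first_importable(module_names: Iterable[str]) -> Optional[str]:
    """Return the first module name that can be located, or None if none can."""
    for name in module_names:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except (ImportError, ValueError):
            # find_spec imports parent packages of a dotted name;
            # a missing parent raises ModuleNotFoundError (an ImportError).
            continue
    return None


CANDIDATES = [
    "langchain_google_vertexai",  # assumption: separate package providing ChatVertexAI
    "langchain_community.chat_models.vertexai",  # path named in the issue titles
]
print(first_importable(CANDIDATES))
```

A `None` result only shows that neither module is present locally; it says nothing about whether the issue is fixed in any Ragas release.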

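### Constraint 5: Source evidence: answer_correctness is not working as expected

- Trigger: GitHub community evidence shows a configuration-related issue in this project that is pending verification: answer_correctness is not working as expected
- Why it matters: May affect upgrades, migration, or version selection.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2585 | The source discussion mentions Python-related conditions; re-check before installing or trying it out.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.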

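### Constraint 6: Source evidence: faithfulness_score: nan

- Trigger: GitHub community evidence shows a configuration-related issue in this project that is pending verification: faithfulness_score: nan
- Why it matters: May increase the cost of trial use for new users and of production integration.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/1309 | The source discussion mentions Python-related conditions; re-check before installing or trying it out.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.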
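
Because a NaN score silently distorts averages, a user who does run an evaluation may want to flag NaN values before aggregating. A minimal sketch in plain Python follows; the score dictionary and its values are made up for illustration, and the code does not call Ragas or explain why a NaN occurs.

```python
import math
from typing import Dict, List


def nan_metrics(scores: Dict[str, float]) -> List[str]:
    """Return the names of metrics whose score is NaN."""
    return [
        name
        for name, value in scores.items()
        if isinstance(value, float) and math.isnan(value)
    ]


# Hypothetical per-sample scores, not real evaluation output.
row = {"faithfulness": float("nan"), "answer_relevancy": 0.91}
print(nan_metrics(row))  # ['faithfulness']
```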

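### Constraint 7: Source evidence: llm_factory raises ValueError when using mistralai client

- Trigger: GitHub community evidence shows a configuration-related issue in this project that is pending verification: llm_factory raises ValueError when using mistralai client
- Why it matters: May block installation or the first run.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2774 | The source discussion mentions Python-related conditions; re-check before installing or trying it out.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.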

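### Constraint 8: Source evidence: AspectCritic not working with openai o3

- Trigger: GitHub community evidence shows a security/permission-related issue in this project that is pending verification: AspectCritic not working with openai o3
- Why it matters: May affect authorization, key configuration, or security boundaries.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2067 | The source discussion mentions Python-related conditions; re-check before installing or trying it out.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.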

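### Constraint 9: Source evidence: Feature request: Add AgentThreatBench memory poison task as a RAG security evaluation

- Trigger: GitHub community evidence shows a security/permission-related issue in this project that is pending verification: Feature request: Add AgentThreatBench memory poison task as a RAG security evaluation
- Why it matters: May affect authorization, key configuration, or security boundaries.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2732 | A usage condition pending verification, surfaced by source type github_issue.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.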

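### Constraint 10: Source evidence: Make python-diskcache dependency optional

- Trigger: GitHub community evidence shows a security/permission-related issue in this project that is pending verification: Make python-diskcache dependency optional
- Why it matters: May increase the cost of trial use for new users and of production integration.
- Evidence: community_evidence:github | https://github.com/vibrantlabsai/ragas/issues/2622 | The source discussion mentions Python-related conditions; re-check before installing or trying it out.
- Hard boundary: Do not present this pitfall as resolved, verified, or ignorable unless later verification evidence clearly shows it has been closed.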
