# https://github.com/hegelai/prompttools Project Manual

Generated: 2026-06-27 07:21:42 UTC

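## Contents

- [Overview, Installation, and Supported Integrations](#page-overview)
- [Experiment System and Multi-Backend Integrations](#page-experiments)
- [Harness Evaluation Framework and Playground UI](#page-harness-playground)
- [Utility Functions, PromptTest Testing Framework, and Observability](#page-utils-testing)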

<a id='page-overview'></a>

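## Overview, Installation, and Supported Integrations

### Related Pages

Related topics: [Experiment System and Multi-Backend Integrations](#page-experiments), [Harness Evaluation Framework and Playground UI](#page-harness-playground)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page: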

- [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)
- [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md)
- [prompttools/utils/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py)
- [prompttools/utils/chunk_text.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/chunk_text.py)
- [prompttools/utils/similarity.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/similarity.py)
- [prompttools/utils/autoeval.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval.py)
- [prompttools/utils/moderation.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/moderation.py)
- [prompttools/utils/validate_json.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/validate_json.py)
- [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md)
</details>


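## Project Overview

`prompttools` is a set of open-source, self-hostable tools maintained by [Hegel AI](https://hegel-ai.com/) for experimenting with, testing, and evaluating large language models (LLMs), vector databases, and prompts. Its core idea is to let developers evaluate model and retrieval quality through three familiar interfaces: code, notebooks, and a local Playground ([README.md](https://github.com/hegelai/prompttools/blob/main/README.md)).

The core unit of the library is the `Experiment` object. You pass in a list of models, prompts, and parameters (such as temperature), call `run()` to execute, and then call `visualize()` to compare the results. All calls execute on the local machine: they are not forwarded to any intermediate server, and the library does not store your API keys or your input/output data ([README.md](https://github.com/hegelai/prompttools/blob/main/README.md)). By default the library does collect error logs through [Sentry](https://sentry.io/) to help improve the experience; you can opt out with the `SENTRY_OPT_OUT` environment variable ([README.md](https://github.com/hegelai/prompttools/blob/main/README.md)).

The utility layer provides many reusable evaluation and processing functions, for example `chunk_text` for splitting long text into chunks ([chunk_text.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/chunk_text.py)), `semantic_similarity` / `cos_similarity` for vector similarity ([similarity.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/similarity.py)), `apply_moderation` for calling OpenAI content moderation ([moderation.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/moderation.py)), and automatic scoring functions such as `autoeval_binary_scoring` and `autoeval_with_documents` ([__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py)).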

## Installation

### Installing the Python Package

The simplest way is to install directly with `pip`:

```
pip install prompttools
```

Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)

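### Installing the Playground

The Playground is a Streamlit-based graphical interface for comparing system prompts, prompt templates, and models ([playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md)). Clone the repository and install the Playground dependencies first: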

```
git clone https://github.com/hegelai/prompttools.git
cd prompttools && pip install -r prompttools/playground/requirements.txt
streamlit run prompttools/playground/playground.py
```

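Community reports include missing dependencies when running `streamlit run` ([Issue #126](https://github.com/hegelai/prompttools/issues/126)) and a warning that Streamlit's `st.experimental_get_query_params` is deprecated and scheduled for removal after 2024-04-11 ([Issue #124](https://github.com/hegelai/prompttools/issues/124), [Issue #127](https://github.com/hegelai/prompttools/issues/127)). Make sure your local Streamlit is a recent version before running the Playground.

If you prefer not to deploy locally, you can use the officially hosted Playground, which does not support the `LlamaCpp` integration ([README.md](https://github.com/hegelai/prompttools/blob/main/README.md)).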

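### Common Installation Failures

- **`AttributeError: module 'openai' has no attribute 'types'`**: usually caused by a mismatch between the installed `openai` client version and the version the library expects ([Issue #122](https://github.com/hegelai/prompttools/issues/122)).
- **LanceDB experiment import failure**: `from prompttools.experiment import LanceDBExperiment` may raise an error; upgrading to the latest version is the suggested first step ([Issue #132](https://github.com/hegelai/prompttools/issues/132)).
- **Dependency version conflicts**: running the OpenAI Chat example notebook may require pinning versions such as `pandas==1.5.3` ([Issue #121](https://github.com/hegelai/prompttools/issues/121)).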

## Supported Integrations

The table below summarizes the integrations the repository currently supports. The "Supported" status comes from the README; "In progress / Exploratory" entries are community follow-up items.

| Category | Integration | Status | Notes / Related Issues |
|------|------|------|------------------|
| LLM | OpenAI (Completion, ChatCompletion, Fine-tuned) | Supported | Assistants API support has also been requested ([Issue #111](https://github.com/hegelai/prompttools/issues/111)) |
| LLM | Azure OpenAI Service | Supported | A change in notebook call arguments once caused a `TypeError` ([Issue #116](https://github.com/hegelai/prompttools/issues/116)) |
| LLM | Anthropic Claude | Supported | `autoeval_scoring` uses Claude as the scoring model ([autoeval_scoring.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_scoring.py)) |
| LLM | Mistral AI / Google Gemini / Google PaLM / Vertex AI | Supported | Vertex AI was added in v0.0.35 ([Release v0.0.35](https://github.com/hegelai/prompttools/releases/tag/v0.0.35)) |
| LLM | LLaMA.Cpp / HuggingFace Hub / Replicate | Supported | Ollama is a community follow-up item ([Issue #39](https://github.com/hegelai/prompttools/issues/39)) |
| Vector database | Chroma / Weaviate / Qdrant / LanceDB / Pinecone | Supported | RAG examples cover several databases ([examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md)) |
| Vector database | Milvus | Exploratory | - |
| Framework | LangChain / MindsDB | Supported | The community formally requested LangChain support ([Issue #5](https://github.com/hegelai/prompttools/issues/5)); Semantic-Kernel is being followed up ([Issue #114](https://github.com/hegelai/prompttools/issues/114)) |
| Computer vision | Stable Diffusion / Replicate Stable Diffusion | Supported | `structural_similarity` provides SSIM evaluation for images ([similarity.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/similarity.py)) |
| New capabilities | Observability / Hosted Playground | Beta | Observability was introduced in v0.0.45; the hosted Playground launched in v0.0.41 |

Sources: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md), [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md), [Release v0.0.35](https://github.com/hegelai/prompttools/releases/tag/v0.0.35), [Release v0.0.41](https://github.com/hegelai/prompttools/releases/tag/v0.0.41), [Release v0.0.45](https://github.com/hegelai/prompttools/releases/tag/v0.0.45).

### Typical Call Flow

```mermaid
flowchart LR
    A[User defines models/prompts/params] --> B[Experiment.run]
    B --> C[Call LLM / vector DB APIs locally]
    C --> D[Collect responses into a DataFrame]
    D --> E[Optional: call utils evaluation functions]
    D --> F[Experiment.visualize / export]
    F --> G[to_csv / to_json / to_mongo_db]
```

Sources: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md), [__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py).

## FAQ and Best Practices

1. **Is my data uploaded anywhere?** No. The code runs locally, and API keys and prompts stay on your machine ([README.md](https://github.com/hegelai/prompttools/blob/main/README.md)).
2. **How do I persist results?** Call `to_csv`, `to_json`, `to_lora_json`, or `to_mongo_db` to export them ([README.md](https://github.com/hegelai/prompttools/blob/main/README.md)).
3. **How do I validate structured model output?** Use `validate_json_response` / `validate_python_response`; you can pass a `pre_process_fn` to clean the text first ([validate_json.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/validate_json.py)).
4. **How do I score automatically?** `autoeval_binary_scoring` is suited to judging whether instructions were followed, and `autoeval_with_documents` is suited to 0–10 scoring against supplied documents ([autoeval.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval.py), [autoeval_with_docs.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_with_docs.py)).
5. **Want to contribute a new integration?** See `CONTRIBUTING.md` and the "Help Wanted" issue list ([README.md](https://github.com/hegelai/prompttools/blob/main/README.md)).

## See Also

- Utility function reference: [prompttools/utils/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py)
- Using the Playground: [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md)
- Notebook example index: [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md)
- Official documentation: [prompttools.readthedocs.io](http://prompttools.readthedocs.io/)

---

<a id='page-experiments'></a>

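## Experiment System and Multi-Backend Integrations

### Related Pages

Related topics: [Overview, Installation, and Supported Integrations](#page-overview), [Harness Evaluation Framework and Playground UI](#page-harness-playground), [Utility Functions, PromptTest Testing Framework, and Observability](#page-utils-testing)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page: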

- [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)
- [prompttools/utils/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py)
- [prompttools/utils/similarity.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/similarity.py)
- [prompttools/utils/chunk_text.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/chunk_text.py)
- [prompttools/utils/autoeval.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval.py)
- [prompttools/utils/autoeval_with_docs.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_with_docs.py)
- [prompttools/utils/autoeval_from_expected.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_from_expected.py)
- [prompttools/utils/autoeval_scoring.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_scoring.py)
- [prompttools/utils/moderation.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/moderation.py)
- [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md)
- [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md)
</details>


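## 1. Overview and Design Goals

`prompttools` is an open-source evaluation framework maintained by [Hegel AI](https://hegel-ai.com/). Its core purpose is to give developers a **self-hostable** toolset for systematic experimentation with and evaluation of LLMs, vector databases, and prompts in **code, notebooks, and a local Playground**. Source: [README.md:1-30]().

The smallest unit of the experiment system is the `Experiment` class. A typical OpenAI Chat experiment needs only a few lines of code for a parameterized run and its visualization: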

```python
from prompttools.experiment import OpenAIChatExperiment

messages = [
    [{"role": "user", "content": "Tell me a joke."}],
    [{"role": "user", "content": "Is 17077 a prime number?"}],
]
models = ["gpt-3.5-turbo", "gpt-4"]
temperatures = [0.0]

openai_experiment = OpenAIChatExperiment(models, messages, temperature=temperatures)
openai_experiment.run()
openai_experiment.visualize()
```

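Source: [README.md:15-30](). This pattern emphasizes three things: **parameterized inputs** (`models`, `messages`, `temperature`), a **uniform execution interface** (`run()`), and **result visualization** (`visualize()`). Backend differences are encapsulated inside the `Experiment` subclasses, which expose a consistent API.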
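To make the idea of parameterized inputs concrete, the following is a minimal sketch of how argument lists could be expanded into one run per combination. It assumes a Cartesian-product expansion; `expand_grid` is a hypothetical helper written for illustration and is not part of the `prompttools` API.

```python
from itertools import product


def expand_grid(**param_lists):
    """Return one dict per combination of the supplied argument lists.

    Hypothetical helper: it illustrates a Cartesian-product expansion and is
    not taken from the prompttools source.
    """
    names = list(param_lists)
    return [dict(zip(names, combo)) for combo in product(*param_lists.values())]


models = ["gpt-3.5-turbo", "gpt-4"]
messages = [
    [{"role": "user", "content": "Tell me a joke."}],
    [{"role": "user", "content": "Is 17077 a prime number?"}],
]
temperatures = [0.0]

# 2 models x 2 message lists x 1 temperature = 4 combinations.
grid = expand_grid(model=models, messages=messages, temperature=temperatures)
for combo in grid:
    print(combo["model"], combo["temperature"], combo["messages"][0]["content"])
```

Each dict in `grid` corresponds to one backend call in this sketch, which is why adding a second temperature would double the number of requests.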

## 2. Multi-Backend Integration Matrix

The `prompttools` experiment system covers three kinds of targets through **subclassing**: **large language models (LLMs), vector databases, and agent/chain frameworks**. Source: [README.md:100-150]().

| Category | Backend | Status |
| --- | --- | --- |
| LLM | OpenAI (Completion / ChatCompletion / Fine-tuned) | Supported |
| LLM | LLaMA.Cpp (LLaMA 1 / LLaMA 2) | Supported |
| LLM | HuggingFace (Hub API, Inference Endpoints) | Supported |
| LLM | Anthropic | Supported |
| LLM | Mistral AI | Supported |
| LLM | Google Gemini / PaLM / Vertex AI | Supported |
| LLM | Azure OpenAI Service | Supported |
| LLM | Replicate | Supported |
| LLM | Ollama | In Progress (see Issue #39) |
| Vector DB | Chroma / Weaviate / Qdrant / LanceDB / Pinecone | Supported |
| Vector DB | Milvus / Epsilla | Exploratory / In Progress |
| Framework | LangChain / MindsDB | Supported |

Each kind of backend has a corresponding `Experiment` subclass file under `prompttools/experiment/experiments/`, and all of them are exposed through the `prompttools.experiment` namespace, so developers do not need to handle the differences between backend SDKs. Source: [README.md:115-150]().

### Notebook Examples

The repository ships a set of notebooks as ready-to-use experiment templates, organized by category under `examples/notebooks/`. Source: [examples/notebooks/README.md:1-40]():

- **Single-model LLM experiments**: `OpenAIChatExperiment.ipynb`, `AnthropicExperiment.ipynb`, `PaLM2Experiment.ipynb`, `LlamaCppExperiment.ipynb`, `HuggingFaceHub.ipynb`, `GPT4RegressionTesting.ipynb`.
- **Framework experiments**: `frameworks/LangChainSequentialChainExperiment.ipynb`, `frameworks/LangChainRouterChainExperiment.ipynb`, `frameworks/MindsDBExperiment.ipynb`.
- **Vector database experiments**: `vectordb_experiments/ChromaDBExperiment.ipynb`, `WeaviateExperiment.ipynb`, `LanceDBExperiment.ipynb`, `QdrantExperiment.ipynb`, `PineconeExperiment.ipynb`, `RetrievalAugmentedGeneration.ipynb`.
- **Image experiments**: `image_experiments/StableDiffusion.ipynb`, `image_experiments/ReplicateStableDiffusion.ipynb`.
- **Human in the loop**: `HumanFeedback.ipynb`, which lets you provide human feedback on model outputs.

These notebooks map one-to-one or one-to-many onto the underlying `Experiment` classes. In most cases you only need to swap in your own data and parameters to reuse the same evaluation pipeline.

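## 3. Core Experiment Flow

```mermaid
flowchart LR
    A["Build input parameters<br/>models / prompts / temperature"] --> B["Instantiate an Experiment subclass"]
    B --> C["Call run()<br/>dispatch backend API calls concurrently"]
    C --> D["Collect responses and metadata"]
    D --> E["Call visualize()<br/>render the results table"]
    E --> F["Apply evaluation functions<br/>similarity / autoeval / moderation"]
    F --> G["Export results<br/>to_csv / to_json / to_lora_json / to_mongo_db"]
```

As the flowchart shows, experiment execution breaks down into four stages: **parameterized construction → backend calls → visualization → evaluation/export**. `to_csv`, `to_json`, `to_lora_json`, and `to_mongo_db` offer several persistence options, which makes it easier to integrate with external CI, regression tests, or LoRA training pipelines. Source: [README.md:55-65]().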

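## 4. Evaluation Tools and Result Processing

The experiment system comes with a set of standalone utility functions, defined under `prompttools/utils/` and exposed together through `prompttools/utils/__init__.py`. Source: [prompttools/utils/__init__.py:1-35]().

### Semantic Similarity

`similarity.py` implements semantic similarity based on either a HuggingFace SentenceTransformer (default model `sentence-transformers/all-MiniLM-L6-v2`) or Chroma, and can optionally use `cv2` + `skimage` to compute the SSIM metric for images. Source: [prompttools/utils/similarity.py:1-50](). The main entry points are `semantic_similarity`, `cos_similarity`, and `structural_similarity`.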
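For readers unfamiliar with the metric, here is a minimal, dependency-free sketch of cosine similarity between two embedding vectors. It is an illustration of the formula only; the function name `cosine_similarity` and the zero-vector convention are assumptions, and the library itself works on model-generated embeddings rather than hand-written lists.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Illustrative only: returns 0.0 for a zero vector, which is a convention
    chosen for this sketch rather than documented library behavior.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Values near 1 indicate embeddings that point the same way, values near 0 indicate unrelated directions, and negative values indicate opposing directions.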

### Automatic Evaluation (LLM-as-a-Judge)

Four LLM judge tools are available for text responses:

- `autoeval.compute()`: calls an OpenAI chat model to make a binary judgment (RIGHT / WRONG) on instruction following. Source: [prompttools/utils/autoeval.py:1-30]().
- `autoeval_with_docs.compute()`: scores an answer on a 0–10 scale against the supplied document context. Source: [prompttools/utils/autoeval_with_docs.py:1-30]().
- `autoeval_from_expected.compute()`: compares the model output with a user-supplied expected answer. Source: [prompttools/utils/autoeval_from_expected.py:1-30]().
- `autoeval_scoring.compute()`: uses Anthropic `claude-2` as the judge to score factual content on a 1–7 scale. Source: [prompttools/utils/autoeval_scoring.py:1-30]().
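The document-grounded judge can be pictured as two small steps: assembling a prompt that places the documents in a `DOCUMENTS:` section, and parsing an integer score from the judge's reply. The sketch below assumes that shape; the prompt wording and the helper names `build_document_prompt` and `parse_score` are hypothetical and are not copied from the library, and no API call is made.

```python
import re


def build_document_prompt(question, answer, documents):
    """Assemble a judge prompt with the documents in a DOCUMENTS: section.

    The wording is invented for this sketch; only the DOCUMENTS: label is
    taken from the description of the library.
    """
    joined = "\n\n".join(documents)
    return (
        f"DOCUMENTS:\n{joined}\n\n"
        f"QUESTION:\n{question}\n\n"
        f"ANSWER:\n{answer}\n\n"
        "Rate the accuracy of the answer from 0 to 10. Reply with the integer only."
    )


def parse_score(judge_reply, low=0, high=10):
    """Extract the first integer in the reply and check that it is in range."""
    match = re.search(r"\d+", judge_reply)
    if match is None:
        raise ValueError(f"no integer found in judge reply: {judge_reply!r}")
    score = int(match.group())
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside {low}-{high}")
    return score


prompt = build_document_prompt(
    question="Who maintains prompttools?",
    answer="Hegel AI",
    documents=["prompttools is maintained by Hegel AI.", "It is open source."],
)
score = parse_score("8")
```

Validating the range after parsing guards against a judge reply such as "42", which would otherwise be stored as a plausible-looking score.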

### Content Moderation

`moderation.apply_moderation()` wraps the OpenAI Moderation API and can extract selected categories (`category_names`) or category scores (`category_score_names`) as additional DataFrame columns. Source: [prompttools/utils/moderation.py:1-30]().
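The column-extraction step can be sketched without calling the API, assuming a moderation result shaped like one entry of the OpenAI response (`flagged`, `categories`, `category_scores`). The helper `extract_moderation_columns` and the output column names are hypothetical; the sample result is hand-written, not real API output.

```python
def extract_moderation_columns(result, category_names=None, category_score_names=None):
    """Pick the requested fields out of one moderation result.

    Hypothetical helper: the output column names are chosen for this sketch.
    """
    columns = {"moderation_flag": result["flagged"]}
    for name in category_names or []:
        columns[name] = result["categories"][name]
    for name in category_score_names or []:
        columns[f"{name}_score"] = result["category_scores"][name]
    return columns


# Hand-written sample in the shape of a single moderation result.
sample_result = {
    "flagged": False,
    "categories": {"hate": False, "violence": False},
    "category_scores": {"hate": 0.01, "violence": 0.02},
}

columns = extract_moderation_columns(
    sample_result,
    category_names=["hate"],
    category_score_names=["violence"],
)
```

Each returned key would become one extra column for the row whose response was moderated.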

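### Text Chunking and Validation

`chunk_text()` splits long text into chunks on whitespace without cutting words, which is useful for building document fragments in RAG scenarios. Source: [prompttools/utils/chunk_text.py:1-25](). In addition, `validate_json_response` and `validate_python_response` check whether the model produced valid structured output.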
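A structured-output check with an optional cleaning step might look like the sketch below. It assumes a 1.0 / 0.0 return convention and a pre-processing callback; `is_valid_json` and `strip_code_fence` are hypothetical names used only for illustration, not the library's functions.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker, built here to keep the example readable


def strip_code_fence(text):
    """Hypothetical cleaning step: drop a surrounding Markdown code fence."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines)


def is_valid_json(text, pre_process_fn=None):
    """Return 1.0 if the (optionally cleaned) text parses as JSON, else 0.0."""
    if pre_process_fn is not None:
        text = pre_process_fn(text)
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return 0.0
    return 1.0


fenced_reply = FENCE + 'json\n{"answer": 42}\n' + FENCE
raw_score = is_valid_json(fenced_reply)
cleaned_score = is_valid_json(fenced_reply, pre_process_fn=strip_code_fence)
```

Models often wrap JSON in a code fence, so the same reply can fail the raw check and pass once the cleaning step is applied.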

## 5. Community Feedback and Known Issues

The experiment system has accumulated some community feedback over its iterations that developers should be aware of:

1. **Streamlit Playground deprecation warning**: the Playground still relies on `st.experimental_get_query_params` / `st.experimental_set_query_params`, which Streamlit flags as deprecated and scheduled for removal after 2024-04-11. Sources: [Issue #124](https://github.com/hegelai/prompttools/issues/124), [Issue #127](https://github.com/hegelai/prompttools/issues/127).
2. **Missing dependency**: before starting the Playground you need to run `pip install -r prompttools/playground/requirements.txt` manually, because `streamlit` is not included in the main `requirements.txt`. Source: [Issue #126](https://github.com/hegelai/prompttools/issues/126).
3. **LanceDB experiment import failure**: in the latest version, `from prompttools.experiment import LanceDBExperiment` may raise an error; check that your LanceDB and PyArrow versions are compatible. Source: [Issue #132](https://github.com/hegelai/prompttools/issues/132).
4. **OpenAI SDK upgrade**: `module 'openai' has no attribute 'types'` indicates that the installed `openai` version is older than the `1.x` line required for the Assistants API; upgrade it explicitly. Source: [Issue #122](https://github.com/hegelai/prompttools/issues/122).
5. **Azure OpenAI notebook error**: in the notebook, both `model` and `prompt` (optionally with `stream`) must be passed, otherwise a `TypeError` is raised. Source: [Issue #116](https://github.com/hegelai/prompttools/issues/116).

Frequently requested extensions also include the **OpenAI Assistants API** (Issue #111), **Microsoft Semantic-Kernel** (Issue #114), **Ollama** (Issue #39), **OpenAI Image Generation** (Issue #113), and **MusicGen** (Issue #82). These are the main candidates for future additions to the backend matrix.

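## 6. Observability and Playground

Starting with v0.0.45, the experiment system also introduces **Observability** (private beta): a single line, `import prompttools.logger`, reports LLM calls from a production path to the Hegel AI hosted platform for monitoring. Source: [Release v0.0.45](https://github.com/hegelai/prompttools/releases/tag/v0.0.45).

The Streamlit Playground also provides a no-code UI: you can compare system prompts, prompt templates, and models (for example GPT-4 versus a local LLaMA 2), with LLM calls made directly from your machine and no user data uploaded. Source: [prompttools/playground/README.md:1-15]().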

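## 7. Usage Recommendations

- Start from an existing `Experiment` notebook in `examples/notebooks/` and replace only the data.
- For offline evaluation of LLM output, consider combining `semantic_similarity` with `autoeval_with_documents`: the former measures semantic coverage and the latter measures factual accuracy.
- When adding a new backend, follow the existing `Experiment` subclass conventions by providing `run()` / `visualize()`, and export the class in `prompttools/experiment/experiments/__init__.py`.
- Before going to production, explicitly set the environment variable `SENTRY_OPT_OUT=1` to turn off the default Sentry error reporting (the project reports only exceptions raised by the library itself). Source: [README.md:85-95]().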
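The Sentry opt-out from the last recommendation can also be applied from Python rather than from the shell. This is a minimal sketch; it assumes the flag needs to be present before the library is imported, which is an assumption about import-time behavior and not something stated in the sources above.

```python
import os

# Set the opt-out flag first so that it is already visible when the library
# is imported (assumed to be the point at which the flag is read).
os.environ["SENTRY_OPT_OUT"] = "1"

# In a real application the library import would follow here, for example:
# import prompttools

sentry_opted_out = os.environ.get("SENTRY_OPT_OUT") == "1"
```

Setting the variable in the deployment environment (shell profile, container spec, or CI configuration) achieves the same result without touching application code.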

## See Also

- [README.md](https://github.com/hegelai/prompttools/blob/main/README.md): main project entry point and quick-start guide
- [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md): notebook example index
- [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md): Streamlit Playground usage guide
- [prompttools/utils/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py): unified exports of the evaluation utilities
- [Release v0.0.45: Observability](https://github.com/hegelai/prompttools/releases/tag/v0.0.45): release notes for the observability feature

---

<a id='page-harness-playground'></a>

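## Harness Evaluation Framework and Playground UI

### Related Pages

Related topics: [Experiment System and Multi-Backend Integrations](#page-experiments), [Utility Functions, PromptTest Testing Framework, and Observability](#page-utils-testing), [Overview, Installation, and Supported Integrations](#page-overview)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page: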

- [prompttools/harness/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/harness/__init__.py)
- [prompttools/harness/harness.py](https://github.com/hegelai/prompttools/blob/main/prompttools/harness/harness.py)
- [prompttools/harness/utility.py](https://github.com/hegelai/prompttools/blob/main/prompttools/harness/utility.py)
- [prompttools/harness/system_prompt_harness.py](https://github.com/hegelai/prompttools/blob/main/prompttools/harness/system_prompt_harness.py)
- [prompttools/harness/prompt_template_harness.py](https://github.com/hegelai/prompttools/blob/main/prompttools/harness/prompt_template_harness.py)
- [prompttools/harness/model_comparison_harness.py](https://github.com/hegelai/prompttools/blob/main/prompttools/harness/model_comparison_harness.py)
- [prompttools/playground/playground.py](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/playground.py)
- [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md)
- [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)
- [prompttools/utils/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py)
- [prompttools/utils/autoeval_with_docs.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_with_docs.py)
- [prompttools/utils/similarity.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/similarity.py)
- [prompttools/utils/moderation.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/moderation.py)
- [prompttools/utils/validate_python.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/validate_python.py)
</details>


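## Overview

`prompttools` offers two complementary evaluation experiences: the **Harness evaluation framework** for engineers, where evaluations are organized as code in scripts or notebooks, and the **Playground graphical interface** (built on Streamlit) for users who prefer not to write code. Both share the same `prompttools/experiment` experiment objects and `prompttools/utils` evaluation tools; the difference lies in the interaction style. Harness emphasizes programmability and reproducibility, while the Playground emphasizes immediate visual feedback and ease of team review. Source: [README.md:1-30]()

> In Issue #5, "LangChain Support", the community asked for Harness to support step-by-step visualization of chains and agents and evaluation of intermediate outputs. This is consistent with the design goal of the existing `system_prompt_harness`, `prompt_template_harness`, and `model_comparison_harness`, which are built around a parameter grid plus evaluation functions. Source: [Issue #5]()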

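## Harness Evaluation Framework

Harness lives in the `prompttools/harness/` package and is responsible for evaluation orchestration: it combines multiple Experiment calls, a parameter grid, and evaluation functions, and exposes uniform `run()` and `visualize()` interfaces. Source: [prompttools/harness/__init__.py:1-50]()

### Core Components

| Component file | Main responsibility |
|---|---|
| `harness.py` | Base class and general orchestration: parameter grid expansion, result collection, cooperation with Experiment |
| `system_prompt_harness.py` | Side-by-side comparison of how different system prompts affect model behavior |
| `prompt_template_harness.py` | Template rendering (similar to Jinja2) to compare the effect of different prompt variable values |
| `model_comparison_harness.py` | Comparison of different models on the same input (for example GPT-4 versus LLaMA 2) |
| `utility.py` | Helper functions shared across harnesses |

Sources: [prompttools/harness/harness.py:1-80](), [prompttools/harness/system_prompt_harness.py:1-50](), [prompttools/harness/prompt_template_harness.py:1-50](), [prompttools/harness/model_comparison_harness.py:1-50](), [prompttools/harness/utility.py:1-50]()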

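### Workflow

```mermaid
flowchart LR
    A[User configures the parameter grid] --> B[Harness.run]
    B --> C[Call Experiment]
    C --> D[Collect model responses]
    D --> E[Apply evaluation functions]
    E --> F[visualize outputs DataFrame and charts]
```

This pipeline standardizes three steps, building inputs, dispatching to models, and measuring responses, so that prompt engineering experiments can be rerun as reliably as unit tests.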
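The three steps can be sketched in a few lines of plain Python. The example below is a toy harness under stated assumptions: `run_template_harness` is a hypothetical function, `str.format` stands in for the template engine, and `echo_model` replaces a real LLM call so that the sketch runs offline.

```python
def run_template_harness(templates, user_inputs, model_fn, eval_fn):
    """Render every template with every input, call the model, and score it.

    Hypothetical sketch of the build -> dispatch -> measure pipeline; it does
    not reproduce the prompttools harness classes.
    """
    rows = []
    for template in templates:
        for user_input in user_inputs:
            prompt = template.format(**user_input)   # build the input
            response = model_fn(prompt)              # dispatch to the model
            score = eval_fn(prompt, response)        # measure the response
            rows.append(
                {"template": template, "prompt": prompt, "response": response, "score": score}
            )
    return rows


def echo_model(prompt):
    """Stand-in for an LLM call: returns the prompt in upper case."""
    return prompt.upper()


def non_empty(prompt, response):
    """Toy metric: 1.0 when the response contains any non-whitespace text."""
    return 1.0 if response.strip() else 0.0


templates = ["Answer briefly: {question}", "Answer step by step: {question}"]
user_inputs = [{"question": "Is 17077 a prime number?"}]
rows = run_template_harness(templates, user_inputs, echo_model, non_empty)
```

Because the model and the metric are passed in as functions, the same loop can be rerun with a different backend or scoring rule, which is what makes the experiment repeatable.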

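## Playground Graphical Interface

`prompttools/playground/playground.py` is a Streamlit application that offers graphical capabilities equivalent to Harness. Its core interaction scenarios are listed in `prompttools/playground/README.md`:

- **Evaluate different instructions (system prompts)**: switch system prompts in the UI and observe the model responses
- **Try different prompt templates**: fill in variables, render in batch, and compare the outputs
- **Compare across models**: compare GPT-4, a local LLaMA 2, and others side by side on the same input

Source: [prompttools/playground/README.md:1-25]()

Launch commands (from the README):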

```
git clone https://github.com/hegelai/prompttools.git
cd prompttools && pip install -r prompttools/playground/requirements.txt
streamlit run prompttools/playground/playground.py
```

The README also notes that a hosted version is deployed on [Streamlit Community Cloud](https://prompttools.streamlit.app/), but it **does not support LlamaCpp**. Sources: [README.md:70-85](), [prompttools/playground/README.md:18-22]()

## Shared Evaluation Tools

Both Harness and the Playground depend on the evaluation functions in `prompttools/utils/`. Commonly used tools exposed by `__init__.py` include:

- **`autoeval_binary_scoring` / `autoeval_from_expected_response` / `autoeval_scoring` / `autoeval_with_documents`**: use GPT-4 or Claude as a judge model to score responses; `autoeval_with_documents` targets RAG scenarios and returns an integer score from 0 to 10. Sources: [prompttools/utils/autoeval_with_docs.py:1-60](), [prompttools/utils/__init__.py:1-30]()
- **`semantic_similarity` / `cos_similarity`**: compute embedding cosine similarity with `sentence-transformers/all-MiniLM-L6-v2` or a Chroma client, and provide the convenience wrapper `evaluate(prompt, response, metadata, expected)`. Source: [prompttools/utils/similarity.py:1-60]()
- **`apply_moderation`**: calls the OpenAI moderation API to run a compliance check on responses and can optionally return category labels and scores. Source: [prompttools/utils/moderation.py:1-40]()
- **`validate_python_response` / `validate_json_response`**: validate generated code or JSON using `pylint` and JSON Schema. Source: [prompttools/utils/validate_python.py:1-40]()

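## Known Issues and Community Feedback

Common community feedback related to the Harness/Playground path:

1. **Streamlit deprecation warning**: Issues #124 and #127 report that `st.experimental_get_query_params` / `st.experimental_set_query_params` will be removed after 2024-04-11 and should be migrated to `st.query_params`. Sources: [Issue #124](), [Issue #127]()
2. **Missing dependency**: Issue #126 points out that `prompttools/playground/requirements.txt` needs to declare `streamlit` explicitly, otherwise `streamlit run` fails. Source: [Issue #126]()
3. **LanceDB Experiment import failure**: Issue #132 reports that `from prompttools.experiment import LanceDBExperiment` raises an error in the latest version, which affects Harness evaluations that depend on LanceDB. Source: [Issue #132]()
4. **LangChain Harness still to be completed**: Issue #5 is a long-running tracker for chain/agent harnesses and visualization of intermediate steps, and is one of the framework's future extension directions. Source: [Issue #5]()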

## See Also

- [OpenAI Chat Experiment Notebook](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/OpenAIChatExperiment.ipynb)
- [Retrieval Augmented Generation Notebook](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/vectordb_experiments/RetrievalAugmentedGeneration.ipynb)
- [Stable Diffusion Notebook](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/image_experiments/StableDiffusion.ipynb)
- [Notebook examples overview](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md)

---

<a id='page-utils-testing'></a>

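## Utility Functions, PromptTest Testing Framework, and Observability

### Related Pages

Related topics: [Experiment System and Multi-Backend Integrations](#page-experiments), [Harness Evaluation Framework and Playground UI](#page-harness-playground)

<details>
<summary>Relevant source files</summary>

The following source files were used to generate this page: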

- [prompttools/utils/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py)
- [prompttools/utils/error.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/error.py)
- [prompttools/utils/autoeval.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval.py)
- [prompttools/utils/autoeval_scoring.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_scoring.py)
- [prompttools/utils/autoeval_with_docs.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_with_docs.py)
- [prompttools/utils/autoeval_from_expected.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_from_expected.py)
- [prompttools/utils/chunk_text.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/chunk_text.py)
- [prompttools/utils/similarity.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/similarity.py)
- [prompttools/utils/validate_python.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/validate_python.py)
- [prompttools/utils/moderation.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/moderation.py)

</details>


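## Overview

`prompttools`, maintained by Hegel AI, provides a toolchain for prompt engineering and LLM evaluation. The `prompttools.utils` subpackage serves as the **evaluation function library**: it pulls the question of how to score out of the individual `Experiment` classes into reusable tools that can be called directly in a Jupyter notebook or combined with the `Experiment.run()` / `Experiment.visualize()` workflow. This page focuses on three kinds of capability: scoring and validation tools, text/image processing tools, and `prompttools.logger`, the observability extension point introduced in v0.0.45.

> Community note: `chunk_text`, `autoeval_with_documents`, and `structural_similarity`, added in v0.0.35, are early members of this module; v0.0.45 opened the door to production observability through `import prompttools.logger`.

Source: [prompttools/utils/__init__.py:1-37]()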

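## Utility Function Library (`prompttools.utils`)

### Automatic Evaluation (the autoeval Family)

The automatic evaluation functions use a strong model as the judge (`gpt-4` by default for the OpenAI-backed functions) and provide a separate entry point for each evaluation semantics:

| Function | Purpose | Score range |
| --- | --- | --- |
| `autoeval_binary_scoring` | Judges whether the model followed the instruction | `0.0` / `1.0` |
| `autoeval_scoring` | Scores a response with Claude as the judge | 1–7 |
| `autoeval_with_documents` | Scores answer accuracy from 0 to 10 against the supplied documents | `int` |
| `autoeval_from_expected_response` | Compares with an expected answer to produce `RIGHT` / `WRONG` | `0.0` / `1.0` |

The OpenAI-backed functions depend on `OPENAI_API_KEY` in the environment and raise `PromptToolsUtilityError` when it is missing. Source: [prompttools/utils/error.py:1-10](). `autoeval_scoring`, which this manual describes as Claude-based, needs an Anthropic API key instead. At its core, `autoeval.compute` builds a two-part `{system, user}` message list and sends it through `openai.chat.completions.create`. Source: [prompttools/utils/autoeval.py:31-44](). `autoeval_with_documents` concatenates the documents into a `DOCUMENTS:` section and asks GPT-4 for an integer score from 0 to 10. Source: [prompttools/utils/autoeval_with_docs.py:14-22]().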
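The binary judge can be sketched as a message builder plus a verdict parser, without making any API call. The system prompt wording and the helper names `build_judge_messages` and `verdict_to_score` are hypothetical; only the `{system, user}` message shape and the RIGHT / WRONG verdict come from the description above.

```python
def build_judge_messages(prompt, response):
    """Build a two-message chat payload for a RIGHT / WRONG judge.

    The instruction text is invented for this sketch, not copied from the library.
    """
    system = (
        "Determine whether the RESPONSE follows the instructions in the PROMPT. "
        "Reply with RIGHT or WRONG only."
    )
    user = f"PROMPT: {prompt}\nRESPONSE: {response}\nANSWER:"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


def verdict_to_score(judge_reply):
    """Map the judge's verdict to 1.0 (RIGHT) or 0.0 (anything else)."""
    return 1.0 if judge_reply.strip().upper().startswith("RIGHT") else 0.0


judge_messages = build_judge_messages("Reply in one word.", "Yes")
right_score = verdict_to_score("RIGHT")
wrong_score = verdict_to_score("wrong")
```

Treating any reply that does not start with RIGHT as a failure keeps the metric conservative when the judge returns something unexpected.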

### Similarity and Ranking Metrics

The `similarity` module uses lazy loading to avoid hard dependencies:

- `_get_embedding_model()` downloads `sentence-transformers/all-MiniLM-L6-v2` on first call. Source: [prompttools/utils/similarity.py:36-41]();
- `_get_chroma_client()` creates a `chromadb.Client()` instance when needed. Source: [prompttools/utils/similarity.py:44-47]();
- when `use_chroma=True` is passed, Chroma computes the similarity; otherwise the code falls back to the HuggingFace embedding path. Source: [prompttools/utils/similarity.py:55-58]().
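The lazy-loading pattern described in the list above can be illustrated with a module-level cache. This is a generic sketch of the pattern, not the library's code: `load_model` is a stand-in for an expensive download, and `load_calls` exists only to make the caching observable.

```python
_cached_model = None
load_calls = []  # records each expensive load so the caching is observable


def load_model():
    """Stand-in for an expensive step such as downloading model weights."""
    load_calls.append("load")
    return {"name": "sentence-transformers/all-MiniLM-L6-v2"}


def get_embedding_model():
    """Load the model on first use and reuse the cached instance afterwards."""
    global _cached_model
    if _cached_model is None:
        _cached_model = load_model()
    return _cached_model


first = get_embedding_model()
second = get_embedding_model()
```

Deferring the load to the first call means that importing the module stays cheap, and users who never request embeddings do not need the optional dependency at all.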

`compute_similarity_against_model`, `semantic_similarity`, and `cos_similarity` are all exposed through `__init__.py`. Source: [prompttools/utils/__init__.py:17-26]().

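### Code and JSON Validation, and Content Moderation

- `validate_python_response` calls `pylint.epylint` to check generated code; it strictly requires `pylint<3.0`, and otherwise prompts the user to supply a custom evaluation function. Source: [prompttools/utils/validate_python.py:11-30]().
- `apply_moderation` wraps OpenAI's `text-moderation-latest` model and can inject selected category names / category scores as new columns in the results `DataFrame`. Source: [prompttools/utils/moderation.py:11-22]().
- `chunk_text` splits on whitespace and then greedily joins words, so that no chunk exceeds `max_chunk_length` characters and no word is cut. Source: [prompttools/utils/chunk_text.py:5-25]().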
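A greedy whitespace chunker along the lines described for `chunk_text` can be sketched as follows. `chunk_words` is a hypothetical reimplementation for illustration; in particular, keeping a word longer than the limit as its own oversized chunk is a choice made for this sketch, not documented library behavior.

```python
def chunk_words(text, max_chunk_length):
    """Greedily pack whole words into chunks of at most max_chunk_length characters.

    A single word longer than the limit is kept intact as its own chunk
    (sketch-specific choice), so words are never cut.
    """
    chunks = []
    current = ""
    for word in text.split():
        candidate = word if not current else f"{current} {word}"
        if len(candidate) <= max_chunk_length or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_words("the quick brown fox jumps", 10)
for chunk in chunks:
    print(len(chunk), chunk)
```

Greedy packing keeps the number of chunks low while guaranteeing that every word appears exactly once and in order, which is usually what a retrieval index needs.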

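## How the PromptTest Framework and Utility Functions Work Together

The `prompttools` experiment abstraction exposes `run()` and `visualize()`, and utility functions take part in two ways:

1. **Direct calls**: `similarity.evaluate(prompt, response, metadata, expected)` and `similarity.structural_similarity(row, expected)` accept row-level data and return a score directly. Source: [prompttools/utils/similarity.py:61-79]().
2. **DataFrame adapters**: a function with the signature `(row: pandas.Series, …)`, such as `autoeval_binary_scoring` or `validate_python_response`, can be registered as a score column through `Experiment.evaluate()`, and the column then appears in the `Experiment.visualize()` output.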
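The row-level convention can be illustrated without pandas by treating each result row as a dict. In the sketch below, `add_metric` is a hypothetical stand-in for registering a score column, and `contains_expected` is a toy metric; neither is part of the library.

```python
def add_metric(rows, metric_name, eval_fn, **eval_kwargs):
    """Apply a row-level evaluation function and store its result per row.

    Plain dicts stand in for DataFrame rows in this sketch.
    """
    for row in rows:
        row[metric_name] = eval_fn(row, **eval_kwargs)
    return rows


def contains_expected(row, expected):
    """Toy row-level metric: 1.0 if the expected text appears in the response."""
    return 1.0 if expected.lower() in row["response"].lower() else 0.0


results = [
    {"prompt": "Is 17077 a prime number?", "response": "Yes, 17077 is prime."},
    {"prompt": "Is 17077 a prime number?", "response": "No."},
]
results = add_metric(results, "mentions_prime", contains_expected, expected="prime")
```

Because the metric only sees one row plus keyword arguments, the same function can be unit-tested in isolation before it is attached to a full experiment.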

The diagram below shows the path from experiment to utility functions to a scored DataFrame in a typical RAG evaluation:

```mermaid
flowchart LR
    A[Experiment.run<br/>call LLM / vector DB] --> B[Raw DataFrame<br/>with prompt, response]
    B --> C{Choose a utility function}
    C -->|autoeval_with_documents| D[GPT-4 judge score<br/>integer 0-10]
    C -->|ranking_correlation| E[Score the retrieval ranking]
    C -->|apply_moderation| F[Content safety classification]
    D --> G[Merged scored DataFrame]
    E --> G
    F --> G
    G --> H[Experiment.visualize<br/>render comparison table]
```

## Observability

The v0.0.45 release announcement introduces an Observability feature on the hosted platform that can be enabled with a single line of code:

```python
import prompttools.logger
```

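This entry point lets teams send production requests and responses to the PromptTools hosted platform for centralized monitoring and evaluation without changing their existing LLM call code. Operational notes worth keeping in mind:

- **Playground and Streamlit version**: the community has repeatedly reported deprecation warnings for `st.experimental_get_query_params` / `st.experimental_set_query_params` ([#124](https://github.com/hegelai/prompttools/issues/124), [#127](https://github.com/hegelai/prompttools/issues/127)); when running `streamlit run prompttools/playground/playground.py`, make sure your local Streamlit is ≥ 1.30;
- **Missing dependency**: `streamlit` is not listed in the top-level `requirements.txt` and has to be installed manually ([#126](https://github.com/hegelai/prompttools/issues/126));
- **LanceDB experiment import error**: on some recent versions, `from prompttools.experiment import LanceDBExperiment` raises an error ([#132](https://github.com/hegelai/prompttools/issues/132)); confirm that the experiments you need import cleanly before adding the observability module.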

## See Also

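- [LLM experiments and visualization (OpenAI / Anthropic / HuggingFace)](#)
- [Vector database experiments (RAG and retrieval evaluation)](#)
- [Playground and Streamlit deployment guide](#)
- [PromptTools hosted platform and team collaboration](#)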

---


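## Doramagic Pitfall Log

Project: hegelai/prompttools

Summary: 14 potential pitfalls found, 2 of them high/blocking; top priority: Maintenance pitfall - Source evidence: package breaks while importing LanceDB experiment.

## 1. Maintenance pitfall · Source evidence: package breaks while importing LanceDB experiment

- Severity: high
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified maintenance/versioning issue in this project: package breaks while importing LanceDB experiment
- User impact: may affect upgrades, migrations, or version selection.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/132 | A usage condition surfaced by source type github_issue that still needs verification.

## 2. Security/permissions pitfall · Source evidence: OpenAI Chat Experiment Example dependency issue fix

- Severity: high
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified security/permissions issue in this project: OpenAI Chat Experiment Example dependency issue fix
- User impact: may affect authorization, key configuration, or security boundaries.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/121 | The source discussion mentions Python-related conditions that should be re-checked before installing or trying the project.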

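## 3. Installation pitfall · Source evidence: AttributeError: module 'openai' has no attribute 'types'

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation issue in this project: AttributeError: module 'openai' has no attribute 'types'
- User impact: may raise the cost of trial use and production adoption for new users.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/122 | The source discussion mentions Python-related conditions that should be re-checked before installing or trying the project.

## 4. Installation pitfall · Source evidence: AzureOpenAIServiceExperiment notebook: TypeError: Missing required arguments; Expected either ('model' and 'prompt') or…

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation issue in this project: AzureOpenAIServiceExperiment notebook: TypeError: Missing required arguments; Expected either ('model' and 'prompt') or ('model', 'prompt' and 'stream') argume…
- User impact: may raise the cost of trial use and production adoption for new users.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/116 | The source discussion mentions Python-related conditions that should be re-checked before installing or trying the project.

## 5. Installation pitfall · Source evidence: Missing requirements - streamlit

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified installation issue in this project: Missing requirements - streamlit
- User impact: may raise the cost of trial use and production adoption for new users.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/126 | A usage condition surfaced by source type github_issue that still needs verification.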

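## 6. Capability pitfall · Capability assessment depends on an assumption

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: if the assumption does not hold, users will not get the promised capability.
- Evidence: capability.assumptions | https://github.com/hegelai/prompttools | README/documentation is current enough for a first validation pass.

## 7. Runtime pitfall · Source evidence: Deprecation error : st.experimental_get_query_params and st.experimental_set_query_params will be removed after 2024-04…

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified runtime issue in this project: Deprecation error : st.experimental_get_query_params and st.experimental_set_query_params will be removed after 2024-04-11
- User impact: may raise the cost of trial use and production adoption for new users.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/124 | A usage condition surfaced by source type github_issue that still needs verification.

## 8. Runtime pitfall · Source evidence: Deprecation warnings

- Severity: medium
- Evidence strength: source_linked
- Finding: GitHub community evidence points to an unverified runtime issue in this project: Deprecation warnings
- User impact: may raise the cost of trial use and production adoption for new users.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/127 | A usage condition surfaced by source type github_issue that still needs verification.

## 9. Runtime pitfall · Runtime may depend on external services

- Severity: medium
- Evidence strength: source_linked
- Finding: the project description contains runtime dependency keywords such as external service/cloud/webhook/database.
- User impact: a successful local install does not guarantee that the capability is usable; an unavailable external service will block the experience.
- Evidence: packet_text.keyword_scan | https://github.com/hegelai/prompttools | matched external service / cloud / webhook / database keyword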

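## 10. Maintenance pitfall · Maintenance activity unknown

- Severity: medium
- Evidence strength: source_linked
- Finding: last_activity_observed was not recorded.
- User impact: new, abandoned, and active projects cannot be told apart, which lowers confidence in the recommendation.
- Evidence: evidence.maintainer_signals | https://github.com/hegelai/prompttools | last_activity_observed missing

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- Evidence: downstream_validation.risk_items | https://github.com/hegelai/prompttools | no_demo; severity=medium

## 12. Security/permissions pitfall · Scoring risk present

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: the risk affects whether the project is suitable for ordinary users to install.
- Evidence: risks.scoring_risks | https://github.com/hegelai/prompttools | no_demo; severity=medium

## 13. Maintenance pitfall · Issue/PR response quality unknown

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: users cannot tell whether anyone will help when they run into problems.
- Evidence: evidence.maintainer_signals | https://github.com/hegelai/prompttools | issue_or_pr_quality=unknown

## 14. Maintenance pitfall · Release cadence unclear

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: install commands and documentation may lag behind the code, which raises the chance of users hitting problems.
- Evidence: evidence.maintainer_signals | https://github.com/hegelai/prompttools | release_recency=unknown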
