# https://github.com/hegelai/prompttools Project Manual

Generated at: 2026-06-27 07:24:49 UTC

## Table of Contents

- [Overview, Supported Integrations, and Quickstart](#page-1)
- [Core Experiments API: LLMs, Vector Databases, and Frameworks](#page-2)
- [Playground, Widgets, and Visualization](#page-3)
- [Utilities, Harness, PromptTest, and Observability](#page-4)

<a id='page-1'></a>

## Overview, Supported Integrations, and Quickstart

### Related Pages

Related topics: [Core Experiments API: LLMs, Vector Databases, and Frameworks](#page-2), [Playground, Widgets, and Visualization](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)
- [prompttools/utils/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py)
- [prompttools/utils/similarity.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/similarity.py)
- [prompttools/utils/chunk_text.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/chunk_text.py)
- [prompttools/utils/autoeval.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval.py)
- [prompttools/utils/autoeval_with_docs.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_with_docs.py)
- [prompttools/utils/autoeval_from_expected.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_from_expected.py)
- [prompttools/utils/validate_python.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/validate_python.py)
- [prompttools/utils/error.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/error.py)
- [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md)
- [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md)
</details>


## 1. Project Overview

`prompttools` is an open-source library created by Hegel AI that provides self-hostable tools for **experimenting with, testing, and evaluating LLMs, vector databases, and prompts**. The project positions itself around three familiar interfaces: code, notebooks, and a local playground, so that developers can rapidly compare prompts, models, and retrieval configurations with minimal boilerplate [Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)].

A core design choice is that **all executions and calls to LLM services happen locally on the user's machine**. The library explicitly states it does not forward requests or log user information [Source: [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md)]. By default, the package does emit error telemetry to Sentry to track its own reliability issues; this can be disabled with the `SENTRY_OPT_OUT` environment variable [Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)].
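
A minimal sketch of opting out from Python rather than from the shell. It assumes the variable only needs to be present in the environment before `prompttools` is first imported; the value `"1"` is an illustrative choice, since the source names the variable but not a required value.

```python
import os

# Assumption: the opt-out is read when `prompttools` is imported, so the
# variable has to be set first. The value "1" is illustrative only.
os.environ["SENTRY_OPT_OUT"] = "1"

# import prompttools  # import the library only after the variable is set
```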

A representative entry point is the `OpenAIChatExperiment` class, which can be instantiated with models, messages, and parameters, then executed and visualized in a few lines [Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)].

## 2. Supported Integrations

The README maintains an explicit matrix of integrations grouped into three categories. The table below summarizes the supported surface as documented:

| Category | Component | Status |
| --- | --- | --- |
| LLMs | OpenAI (Completion, ChatCompletion, Fine-tuned models) | Supported |
| LLMs | LLaMA.Cpp (LLaMA 1, LLaMA 2) | Supported |
| LLMs | HuggingFace (Hub API, Inference Endpoints) | Supported |
| LLMs | Anthropic | Supported |
| LLMs | Mistral AI | Supported |
| LLMs | Google Gemini | Supported |
| LLMs | Google PaLM (legacy) | Supported |
| LLMs | Google Vertex AI | Supported |
| LLMs | Azure OpenAI Service | Supported |
| LLMs | Replicate | Supported |
| LLMs | Ollama | In Progress |
| Vector DBs | Chroma, Weaviate, Qdrant, LanceDB, Pinecone | Supported |
| Vector DBs | Milvus | Exploratory |
| Vector DBs | Epsilla | In Progress |
| Frameworks | LangChain, MindsDB | Supported |

Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)

### Release-Driven Additions

Versioned releases in the public changelog show how the supported surface has expanded:

- **v0.0.35** introduced Google Vertex AI, Azure OpenAI Service, Replicate, Stable Diffusion, Pinecone, Qdrant, and RAG experiments, alongside utility helpers `chunk_text`, `autoeval_with_documents`, and `structural_similarity` [Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)].
- **v0.0.41** launched the hosted Playground (private beta), persisting experiments with version control and team collaboration features [Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)].
- **v0.0.45** added observability features via `import prompttools.logger`, intended for monitoring production LLM usage [Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)].

### Community-Reported Integration Gaps

The community tracker surfaces requested integrations that are not yet shipped. Tracked feature requests include Ollama ([Issue #39](https://github.com/hegelai/prompttools/issues/39)), Microsoft Semantic-Kernel ([Issue #114](https://github.com/hegelai/prompttools/issues/114)), the OpenAI Image Generation API ([Issue #113](https://github.com/hegelai/prompttools/issues/113)), MusicGen/audio model evaluation ([Issue #82](https://github.com/hegelai/prompttools/issues/82)), and broader LangChain harnesses ([Issue #5](https://github.com/hegelai/prompttools/issues/5)). Ollama is explicitly listed as "In Progress" in the official matrix [Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)].

## 3. Quickstart

### Installation

The package is distributed on PyPI and installed via pip [Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)]:

```bash
pip install prompttools
```

To run the playground locally, clone the repository and install the playground's own requirements, because Streamlit is not installed as a dependency of the pip package. Community issue [#126](https://github.com/hegelai/prompttools/issues/126) reports that `streamlit` is missing from the main requirements, so the documented command sequence is:

```bash
git clone https://github.com/hegelai/prompttools.git
cd prompttools && pip install -r prompttools/playground/requirements.txt
streamlit run prompttools/playground/playground.py
```

Source: [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md)

### Minimal Code Example

The canonical first example uses `OpenAIChatExperiment`. The user supplies a list of message threads, a list of model identifiers, and a parameter grid (e.g., temperature values), then calls `.run()` followed by `.visualize()` [Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)]:

```python
from prompttools.experiment import OpenAIChatExperiment

messages = [
    [{"role": "user", "content": "Tell me a joke."}],
    [{"role": "user", "content": "Is 17077 a prime number?"}],
]

models = ["gpt-3.5-turbo", "gpt-4"]
temperatures = [0.0]
openai_experiment = OpenAIChatExperiment(models, messages, temperature=temperatures)
openai_experiment.run()
openai_experiment.visualize()
```

### Notebook Examples

A curated set of runnable notebooks is available under `examples/notebooks/`. Coverage spans single-model LLM experiments (OpenAI Chat, Anthropic, PaLM 2, Vertex AI, LLaMA.Cpp, HuggingFace Hub), function calling, regression testing, human feedback, and multimodal cases such as Stable Diffusion [Source: [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md)]. Vector database notebooks under `vectordb_experiments/` exercise ChromaDB, Weaviate, LanceDB, Qdrant, and Pinecone, while framework notebooks demonstrate LangChain sequential chains, router chains, and MindsDB integration [Source: [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md)].

### Known Setup Pitfalls

Several community-reported issues affect first-run experience:

- **Streamlit deprecation warnings** — Issues [#124](https://github.com/hegelai/prompttools/issues/124) and [#127](https://github.com/hegelai/prompttools/issues/127) report that the playground uses `st.experimental_get_query_params`/`st.experimental_set_query_params`, which were slated for removal after 2024-04-11. Users should migrate to `st.query_params` [Source: [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md)].
- **LanceDB import breakage** — Issue [#132](https://github.com/hegelai/prompttools/issues/132) documents a crash on `from prompttools.experiment import LanceDBExperiment` in the latest version, with a traceback screenshot attached.
- **OpenAI client compatibility** — Issue [#122](https://github.com/hegelai/prompttools/issues/122) reports `AttributeError: module 'openai' has no attribute 'types'` after a clean clone, indicating a need to pin or upgrade the `openai` package.
- **Notebook dependency drift** — Issues [#121](https://github.com/hegelai/prompttools/issues/121) and [#116](https://github.com/hegelai/prompttools/issues/116) document that notebook examples need extra packages such as `fastapi`, `kaleido`, `uvicorn`, `cohere`, and `tiktoken` plus a pinned `pandas==1.5.3`, and that the AzureOpenAIService notebook fails unless the required `model` and `prompt` arguments (optionally with `stream`) are supplied.

## 4. Utility Layer

Beyond experiment drivers, `prompttools.utils` exposes a curated set of evaluation and text utilities, re-exported through `prompttools/utils/__init__.py`:

- **`chunk_text(text, max_chunk_length)`** — Splits a paragraph into space-aware chunks that do not break words [Source: [prompttools/utils/chunk_text.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/chunk_text.py)].
- **`semantic_similarity` / `cos_similarity`** — Lazy-loaded embedding helpers backed by `sentence-transformers/all-MiniLM-L6-v2` or Chroma [Source: [prompttools/utils/similarity.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/similarity.py)].
- **`structural_similarity`** — Image SSIM scorer requiring `opencv-python` and `scikit-image` [Source: [prompttools/utils/similarity.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/similarity.py)].
- **Auto-evaluation** — GPT-4-as-judge scorers: `autoeval_binary_scoring` (RIGHT/WRONG), `autoeval_from_expected_response` (grade against an expected answer), and `autoeval_with_documents` (RAG-grounded 0–10 scoring) [Source: [prompttools/utils/autoeval.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval.py), [prompttools/utils/autoeval_from_expected.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_from_expected.py), [prompttools/utils/autoeval_with_docs.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_with_docs.py)].
- **`validate_python_response`** — Uses `pylint<3.0` (via `epylint`) to check whether a generated string parses as valid Python [Source: [prompttools/utils/validate_python.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/validate_python.py)].
- **`PromptToolsUtilityError`** — Shared exception type for utility-layer failures, e.g., missing `OPENAI_API_KEY` [Source: [prompttools/utils/error.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/error.py)].

These utilities are surfaced in the public namespace as `autoeval_binary_scoring`, `autoeval_with_documents`, `chunk_text`, `semantic_similarity`, `cos_similarity`, `validate_json_response`, `validate_python_response`, and `apply_moderation` [Source: [prompttools/utils/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py)].
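
The word-preserving behavior described for `chunk_text` can be sketched as a greedy loop. This is an illustration under stated assumptions, not the library's implementation: the name `chunk_words` is hypothetical, and the handling of a single word longer than the limit (it gets its own chunk) is a choice made here.

```python
def chunk_words(text: str, max_chunk_length: int) -> list[str]:
    """Greedy word-preserving chunker (illustrative, not the library's code)."""
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else f"{current} {word}"
        if len(candidate) <= max_chunk_length or not current:
            # Keep growing the chunk; an over-long single word stands alone.
            current = candidate
        else:
            # Adding the word would exceed the limit: close the chunk.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


sentence = "the quick brown fox jumps over the lazy dog"
chunks = chunk_words(sentence, 10)
print(chunks)
# ['the quick', 'brown fox', 'jumps over', 'the lazy', 'dog']
```

Because chunks are rebuilt from whole words, joining them with single spaces recovers the whitespace-normalized input.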

## See Also

- [PromptTools Playground (Streamlit UI)](prompttools/playground/README.md)
- [Notebook Examples Index](examples/notebooks/README.md)
- [Release Notes](https://github.com/hegelai/prompttools/releases)

---

<a id='page-2'></a>

## Core Experiments API: LLMs, Vector Databases, and Frameworks

### Related Pages

Related topics: [Overview, Supported Integrations, and Quickstart](#page-1), [Utilities, Harness, PromptTest, and Observability](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)
- [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md)
- [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md)
- [prompttools/utils/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py)
- [prompttools/utils/similarity.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/similarity.py)
- [prompttools/utils/chunk_text.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/chunk_text.py)
- [prompttools/utils/autoeval.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval.py)
- [prompttools/utils/autoeval_from_expected.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_from_expected.py)
- [prompttools/utils/autoeval_with_docs.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_with_docs.py)
- [prompttools/utils/moderation.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/moderation.py)
</details>


## Overview and Purpose

The Core Experiments API is the central public surface of `prompttools`. It exposes a uniform `Experiment` abstraction that lets developers sweep prompts, model parameters, and embedding/retrieval configurations across heterogeneous backends, then collect, evaluate, and visualize the results in a pandas DataFrame. Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md).

The library targets three backend families that share the same call pattern (`.run()` + `.visualize()`):

| Family | Purpose | Example backends |
| --- | --- | --- |
| **LLMs** | Compare prompt/model/parameter combinations | OpenAI, Anthropic, LLaMA.Cpp, HuggingFace, Mistral, Google Gemini/PaLM/Vertex, Azure OpenAI, Replicate |
| **Vector Databases** | Evaluate retrieval quality and embedding functions | Chroma, Weaviate, Qdrant, LanceDB, Pinecone (Milvus exploratory, Epsilla in progress) |
| **Frameworks** | Test composed chains and agents | LangChain, MindsDB (LlamaIndex exploratory) |

Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md) (Supported Integrations section).

## Architecture and Data Flow

All experiment classes inherit from a common `Experiment` base class that defines the run loop, the evaluation hooks, and the persistence methods (`to_csv`, `to_json`, `to_lora_json`, `to_mongo_db`). Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md).

```mermaid
flowchart LR
    A[User code<br/>inputs + param grid] --> B[Experiment subclass<br/>e.g. OpenAIChatExperiment]
    B --> C[".run()"]
    C --> D[Backend SDK<br/>OpenAI / Anthropic / Chroma / LangChain]
    D --> E[pandas DataFrame<br/>prompt, response, metadata]
    E --> F[Evaluation utilities<br/>similarity, autoeval, moderation]
    F --> G[".visualize() / .to_csv / .to_json"]
    E --> H[Hosted Playground / Observability<br/>import prompttools.logger]
```

The pipeline is deliberately local-first: API calls originate from the user's machine and go directly to the provider rather than through a PromptTools server. Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md) (FAQ: "Will this library forward my LLM calls to a server...?").

## LLM Experiments

LLM experiments are constructed from three orthogonal axes: the message/prompt list, the model identifier list, and a parameter dictionary (e.g., `temperature`). The `OpenAIChatExperiment` shown in the README demonstrates the canonical pattern. Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md).

```python
from prompttools.experiment import OpenAIChatExperiment

messages = [
    [{"role": "user", "content": "Tell me a joke."}],
    [{"role": "user", "content": "Is 17077 a prime number?"}],
]
models = ["gpt-3.5-turbo", "gpt-4"]
temperatures = [0.0]
exp = OpenAIChatExperiment(models, messages, temperature=temperatures)
exp.run()
exp.visualize()
```
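
One way to picture the three axes is as a Cartesian product: every model is paired with every message thread and every parameter value. The sketch below only counts and lists those combinations with `itertools.product`; how the library enumerates them internally is an assumption, and the dictionary layout is hypothetical.

```python
from itertools import product

models = ["gpt-3.5-turbo", "gpt-4"]
messages = [
    [{"role": "user", "content": "Tell me a joke."}],
    [{"role": "user", "content": "Is 17077 a prime number?"}],
]
temperatures = [0.0]

# Hypothetical expansion: one argument dict per (model, thread, temperature).
grid = [
    {"model": model, "messages": thread, "temperature": temperature}
    for model, thread, temperature in product(models, messages, temperatures)
]

print(len(grid))  # 2 models x 2 threads x 1 temperature = 4
```

Adding a second temperature would double the number of calls, which is worth keeping in mind when the sweep targets a paid API.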

Equivalent notebooks exist for Anthropic Claude, PaLM 2, Google Vertex chat, LLaMA.Cpp, and HuggingFace Hub, each parameterized to that provider's SDK. Source: [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md).

**Community-known failure modes** on this surface include a `TypeError: Missing required arguments` raised by the Azure OpenAI notebook when the required `model` and `prompt` arguments (optionally with `stream`) are not supplied ([Issue #116](https://github.com/hegelai/prompttools/issues/116)), and `AttributeError: module 'openai' has no attribute 'types'` when the installed `openai` SDK does not match the version the experiment expects ([Issue #122](https://github.com/hegelai/prompttools/issues/122)). Both are tracked separately from the experiment API itself.

## Vector Database and Retrieval Experiments

Vector database experiments take a corpus plus an embedding configuration and measure retrieval accuracy, typically with ranking-correlation utilities against an expected ordering. Source: [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md).
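
To make "ranking correlation against an expected ordering" concrete, the sketch below computes Spearman's rho for two orderings of the same document IDs. Whether the library uses this exact statistic is an assumption, the function name `spearman_rho` is hypothetical, and the closed-form formula used here requires distinct items (no ties).

```python
def spearman_rho(expected: list[str], actual: list[str]) -> float:
    """Spearman rank correlation for two orderings of the same distinct items."""
    if sorted(expected) != sorted(actual):
        raise ValueError("both rankings must contain the same items")
    n = len(expected)
    if n < 2:
        raise ValueError("need at least two items to correlate")
    # Map each item to its position in the retrieved ordering.
    position = {doc_id: rank for rank, doc_id in enumerate(actual)}
    d_squared = sum(
        (rank - position[doc_id]) ** 2 for rank, doc_id in enumerate(expected)
    )
    return 1 - (6 * d_squared) / (n * (n**2 - 1))


expected = ["id1", "id2", "id3", "id4"]
print(spearman_rho(expected, ["id1", "id2", "id3", "id4"]))  # 1.0 (same order)
print(spearman_rho(expected, ["id4", "id3", "id2", "id1"]))  # -1.0 (reversed)
```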

The supported notebooks include Chroma, Weaviate, LanceDB, Qdrant, Pinecone, and a Retrieval-Augmented Generation (RAG) experiment that chains a vector store with an LLM. Source: [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md).

Two helper utilities make these experiments practical:

- `chunk_text(text, max_chunk_length)` splits paragraphs without breaking words, enabling consistent ingestion across embedding configurations. Source: [prompttools/utils/chunk_text.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/chunk_text.py).
- `autoeval_with_documents(row, documents, response_column_name)` asks GPT-4 to grade whether a response is grounded in retrieved documents, returning an integer 0–10. Source: [prompttools/utils/autoeval_with_docs.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_with_docs.py).
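
A judge model answers in free text, so a 0–10 grade has to be extracted and range-checked before it can be stored as an integer. The helper below is a hypothetical sketch of that step; `parse_judge_score` is not a `prompttools` function, and the reply format is assumed.

```python
import re


def parse_judge_score(reply: str, low: int = 0, high: int = 10) -> int:
    """Extract the first integer from a judge reply and check its range."""
    match = re.search(r"-?\d+", reply)
    if match is None:
        raise ValueError(f"no integer found in judge reply: {reply!r}")
    score = int(match.group())
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return score


print(parse_judge_score("Score: 7"))  # 7
```

Raising on an unparseable or out-of-range reply, instead of defaulting to 0, keeps judge failures visible in the results.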

**Community-known issue**: `from prompttools.experiment import LanceDBExperiment` reportedly crashes on the latest published version ([Issue #132](https://github.com/hegelai/prompttools/issues/132)). One plausible cause is that optional backends are imported eagerly rather than lazily, although the issue itself only attaches a traceback screenshot.

## Framework Experiments and Evaluation Utilities

Framework experiments let users treat chains and routers as first-class experimentables. The currently documented notebooks are `LangChainSequentialChainExperiment`, `LangChainRouterChainExperiment`, and `MindsDBExperiment`. Source: [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md).

A request to add Microsoft Semantic-Kernel support ([Issue #114](https://github.com/hegelai/prompttools/issues/114)) and ongoing interest in deeper LangChain support ([Issue #5](https://github.com/hegelai/prompttools/issues/5)) reflect where the framework surface is expected to grow.

Evaluation utilities are re-exported from `prompttools.utils` and attach to experiment rows. Source: [prompttools/utils/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py).

| Utility | Purpose | Source |
| --- | --- | --- |
| `semantic_similarity` / `cos_similarity` | Embedding-based comparison of two strings (HuggingFace or Chroma) | [prompttools/utils/similarity.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/similarity.py) |
| `structural_similarity` | SSIM between images (cv2 + skimage) | [prompttools/utils/similarity.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/similarity.py) |
| `autoeval_binary_scoring` | GPT-4 judges whether the response follows the prompt | [prompttools/utils/autoeval.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval.py) |
| `autoeval_from_expected_response` | GPT-4 grades ACTUAL against EXPECTED | [prompttools/utils/autoeval_from_expected.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_from_expected.py) |
| `autoeval_with_documents` | GPT-4 grades RAG grounding in provided docs | [prompttools/utils/autoeval_with_docs.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_with_docs.py) |
| `apply_moderation` | OpenAI moderation API on a response column | [prompttools/utils/moderation.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/moderation.py) |
| `validate_json_response` / `validate_python_response` | Check that model output is valid JSON or valid Python | [prompttools/utils/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py) |
| `ranking_correlation` | Compare vector DB ordering to an expected ordering | [prompttools/utils/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py) |
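
The embedding-based comparisons in the table reduce to a cosine similarity between two vectors. The sketch below shows that calculation on plain Python lists with the standard library only; the real helpers embed strings first (per the source, with a sentence-transformers model or Chroma), which is omitted here, and the function name is hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```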

## Playground, Persistence, and Observability

Experiment objects can be persisted locally via `to_csv`, `to_json`, `to_lora_json`, or `to_mongo_db`. Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md). For interactive exploration, the Streamlit playground is launched from the cloned repo and shares the same `Experiment` API. Source: [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md).

Two community-reported issues affect the playground specifically and are worth noting here: deprecation warnings from `st.experimental_get_query_params` ([Issue #124](https://github.com/hegelai/prompttools/issues/124), [Issue #127](https://github.com/hegelai/prompttools/issues/127)) and a `streamlit` dependency that is missing from the main requirements ([Issue #126](https://github.com/hegelai/prompttools/issues/126)).

For hosted workflows, the hosted Playground private beta (v0.0.41) persists experiments with version control and team collaboration features, while `import prompttools.logger` (v0.0.45) adds a one-line observability hook to production LLM calls. Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md).

## See Also

- [Evaluation Utilities Reference](evaluation-utilities.md) — deeper documentation of the auto-eval, similarity, and moderation helpers.
- [Playground UI Guide](playground-ui.md) — running and troubleshooting the Streamlit playground.
- [Notebook Examples Index](notebooks.md) — full catalog of runnable notebooks per backend.

---

<a id='page-3'></a>

## Playground, Widgets, and Visualization

### Related Pages

Related topics: [Overview, Supported Integrations, and Quickstart](#page-1), [Core Experiments API: LLMs, Vector Databases, and Frameworks](#page-2)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)
- [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md)
- [prompttools/playground/playground.py](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/playground.py)
- [prompttools/playground/constants.py](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/constants.py)
- [prompttools/playground/data_loader.py](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/data_loader.py)
- [prompttools/playground/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/__init__.py)
- [prompttools/playground/packages.txt](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/packages.txt)
- [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md)
- [prompttools/utils/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py)
</details>


## Overview and Purpose

The **Playground** is `prompttools`' Streamlit-based graphical interface that lets users evaluate prompts, model parameters, and vector-database retrieval settings without writing code. It complements the notebook-driven workflow (`OpenAIChatExperiment.ipynb`, `ChromaDBExperiment.ipynb`, etc.) described in [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md) and exposes the same `experiment.run()` + `experiment.visualize()` flow described in the top-level [README.md](https://github.com/hegelai/prompttools/blob/main/README.md). Per [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md), the playground can:

- Evaluate different system instructions (system prompts)
- Try different prompt templates
- Compare responses across models (e.g., GPT-4 vs. local LLaMA 2)

All calls to LLM services and vector databases execute locally on the user's machine; the package does not forward requests or log responses, as stated in [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md).

```mermaid
flowchart LR
    A[User opens Playground] --> B[Select Model & API]
    B --> C[Configure Prompts / Templates]
    C --> D[Run Experiment]
    D --> E[Experiment.run&#40;&#41;]
    E --> F[Experiment.visualize&#40;&#41;]
    F --> G[Streamlit Widgets: Tables, Charts, Rankings]
    G --> H[Compare Responses Across Models]
```

## Architecture and Module Layout

The playground is structured as a small package:

| File | Role |
| --- | --- |
| `prompttools/playground/playground.py` | Streamlit entry point that renders widgets and dispatches experiments. |
| `prompttools/playground/constants.py` | Defines supported model lists, experiment types, and reusable constants. |
| `prompttools/playground/data_loader.py` | Loads user-supplied CSV/data inputs into the experiment runtime. |
| `prompttools/playground/__init__.py` | Package marker; exposes playground helpers. |
| `prompttools/playground/packages.txt` | Lists system packages required by the hosted Streamlit deployment. |

Per the launch instructions in [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md), the application is started locally with:

```bash
git clone https://github.com/hegelai/prompttools.git
cd prompttools && pip install -r prompttools/playground/requirements.txt
streamlit run prompttools/playground/playground.py
```

The playground can also be reached through the hosted Streamlit Community Cloud deployment at `https://prompttools.streamlit.app/`, as documented in [README.md](https://github.com/hegelai/prompttools/blob/main/README.md). That hosted variant does **not** support the LlamaCpp experiment.

## Widgets and Visualization Pipeline

Each widget in the playground corresponds to a stage of the experiment lifecycle declared in the underlying experiments package. After the user configures inputs and triggers a run, `playground.py` invokes the standard `run()` / `visualize()` contract, which is the same one exposed in the notebook examples summarized in [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md).

The typical visualization surfaces include:

- **Tabular response grids** that pivot input prompts against model/parameter combinations.
- **Charts and ranking correlations** for vector-database experiments (e.g., `ChromaDBExperiment`, `LanceDBExperiment`).
- **Side-by-side response comparison** for chat and completion models.
- **Evaluation overlays** powered by utility functions re-exported from [prompttools/utils/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py) — `semantic_similarity`, `cos_similarity`, `ranking_correlation`, `autoeval_scoring`, `autoeval_with_documents`, `validate_json_response`, and `validate_python_response`.

The `__init__.py` re-exports ensure the playground can attach any of these evaluators to a response column without modifying experiment code:

```python
from prompttools.utils import semantic_similarity, autoeval_with_documents
```

## Configuration, Deployment, and Community-Reported Issues

Several community-reported issues directly shape how the playground should be configured and operated:

1. **Missing Streamlit dependency** — Issue #126 reports that `streamlit` is not declared in the top-level `requirements.txt`. The workaround documented in [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md) is to install `prompttools/playground/requirements.txt` before invoking `streamlit run`.
2. **Streamlit deprecations** — Issues #124 and #127 report that `st.experimental_get_query_params` and `st.experimental_set_query_params` are deprecated and were slated for removal after 2024-04-11. Users running the playground against recent Streamlit versions should migrate to `st.query_params` per the upstream [Streamlit docs](https://docs.streamlit.io/library/api-reference/utilities/st.query_params).
3. **Hosted Playground (0.0.41)** — Release [v0.0.41](https://github.com/hegelai/prompttools/releases/tag/v0.0.41) introduced the hosted Playground as a private beta with experiment persistence and collaboration features.
4. **Observability overlay (0.0.45)** — Release [v0.0.45](https://github.com/hegelai/prompttools/releases/tag/v0.0.45) added `import prompttools.logger` so that teams can monitor production LLM usage from inside the same UI surface.
5. **Integration gaps** — Open community requests include Ollama (#39), Microsoft Semantic-Kernel (#114), OpenAI Assistants API (#111), MusicGen (#82), and OpenAI Image Generation (#113). Until those experiments are added under `prompttools/experiment/experiments/`, their constants are absent from `prompttools/playground/constants.py` and they cannot be selected in the playground UI.

The `packages.txt` file is consulted when deploying to Streamlit Community Cloud so that native dependencies (e.g., for `cv2`, `librosa`, image/audio experiments referenced in [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md)) are present at runtime.

## Failure Modes and Best Practices

- **CSV ingestion errors** — `data_loader.py` requires well-formed inputs; malformed CSVs will surface as Streamlit exceptions rather than silent skips.
- **Missing API keys** — Utility evaluators such as `autoeval_binary_scoring`, `autoeval_from_expected_response`, and `apply_moderation` (re-exported from [prompttools/utils/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py)) raise `PromptToolsUtilityError` when `OPENAI_API_KEY` is unset.
- **Local model limitations** — LlamaCpp experiments run only on the local playground, never on the hosted Streamlit deployment, per the note in [README.md](https://github.com/hegelai/prompttools/blob/main/README.md).
- **Dependency drift** — Issue #121 documents a working pinned notebook dependency set (`fastapi`, `kaleido`, `python-multipart`, `uvicorn`, `cohere`, `tiktoken`, `pandas==1.5.3`) that users running playground-launched notebooks may need to mirror.
- **Vector-DB imports** — Issue #132 reports a crash when importing `LanceDBExperiment`. This propagates to the playground whenever LanceDB is selected, so users should verify the experiment imports cleanly in isolation before relying on the UI surface.
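
The advice in the last bullet, checking that an experiment imports cleanly in isolation, can be sketched with the standard library alone. The helper name `check_import` is hypothetical and not part of `prompttools`, and the module path in the usage note is an assumption to adjust for your installation.

```python
import importlib
from typing import Optional, Tuple


def check_import(module_name: str, attribute: Optional[str] = None) -> Tuple[bool, str]:
    """Try to import a module (and optionally look up one attribute) and report the outcome."""
    try:
        module = importlib.import_module(module_name)
        if attribute is not None:
            getattr(module, attribute)
    except Exception as exc:  # import-time crashes are not always ImportError
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"
```

For example, `check_import("prompttools.experiment", "LanceDBExperiment")` (assumed import path) returns a `(False, reason)` pair instead of crashing, which makes it easier to see the underlying error before opening the playground.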

## See Also

- [Project README](https://github.com/hegelai/prompttools/blob/main/README.md) — high-level overview and supported integrations.
- [Notebook Examples](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md) — programmatic counterparts to every playground view.
- [Release v0.0.41 — Hosted Playground](https://github.com/hegelai/prompttools/releases/tag/v0.0.41) and [Release v0.0.45 — Observability](https://github.com/hegelai/prompttools/releases/tag/v0.0.45).
- `prompttools/utils/__init__.py` for the evaluation utilities the playground attaches to response columns.

---

<a id='page-4'></a>

## Utilities, Harness, PromptTest, and Observability

### Related Pages

Related topics: [Core Experiments API: LLMs, Vector Databases, and Frameworks](#page-2), [Playground, Widgets, and Visualization](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)
- [prompttools/utils/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py)
- [prompttools/utils/error.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/error.py)
- [prompttools/utils/similarity.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/similarity.py)
- [prompttools/utils/autoeval.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval.py)
- [prompttools/utils/autoeval_from_expected.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_from_expected.py)
- [prompttools/utils/autoeval_with_docs.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_with_docs.py)
- [prompttools/utils/validate_python.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/validate_python.py)
- [prompttools/utils/chunk_text.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/chunk_text.py)
- [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md)
- [examples/notebooks/README.md](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md)
</details>


## Overview

`prompttools` provides four cross-cutting capabilities that sit on top of its experiment suite: a **Utilities** module for evaluating, scoring, and chunking text/code, a **Harness** abstraction for embedding full applications (such as LangChain agents) into experiments, the **PromptTest** workflow for asserting behavioral expectations, and an **Observability** layer that ships call metadata to the hosted Hegel AI platform. The Utilities module is fully open-sourced and importable from the `prompttools.utils` namespace, while Harness, PromptTest, and Observability are referenced in the project's roadmap and release notes as the path toward higher-level prompt evaluation and production monitoring.

The README frames the project as a way to "test and experiment with prompts, LLMs, and vector databases" using familiar interfaces such as code, notebooks, and a local playground, and the Utilities module is the bridge that connects raw model responses to these interfaces [Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)].

## Utilities Module

The `prompttools.utils` package re-exports a curated set of helper functions used to score, compare, and post-process model outputs. The full public surface is declared in `prompttools/utils/__init__.py` and includes the following entry points [Source: [prompttools/utils/__init__.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/__init__.py)]:

| Function | Purpose |
| --- | --- |
| `autoeval_binary_scoring` | Judge a response as RIGHT/WRONG with a judge model (GPT-4 by default) |
| `autoeval_from_expected_response` | Compare actual vs. expected with a grader model |
| `autoeval_scoring` | Score a response on an integer scale |
| `autoeval_with_documents` | Grounded RAG scoring using supporting documents |
| `chunk_text` | Split a paragraph into word-preserving chunks |
| `compute_similarity_against_model` | Embedding-based similarity to a model output |
| `apply_moderation` | Run OpenAI moderation on a response |
| `ranking_correlation` | Rank-correlation metric for retrieval results |
| `semantic_similarity`, `cos_similarity` | HuggingFace/Chroma-backed similarity |
| `validate_json_response`, `validate_python_response` | Structural validators for generated code/data |

A custom exception, `PromptToolsUtilityError`, is defined in `prompttools/utils/error.py` and is raised by utilities when preconditions are not met (for example, a missing `OPENAI_API_KEY` environment variable) [Source: [prompttools/utils/error.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/error.py)].
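
The precondition pattern described above can be illustrated with a minimal sketch. `UtilityPreconditionError` and `require_env` are hypothetical names defined locally for the example; they stand in for, and are not, the library's own `PromptToolsUtilityError` or its internal checks.

```python
import os
from typing import Mapping, Optional


class UtilityPreconditionError(Exception):
    """Hypothetical stand-in for PromptToolsUtilityError, defined locally for this sketch."""


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    if not value:
        raise UtilityPreconditionError(
            f"{name} is not set; export it before calling the utility."
        )
    return value
```

Failing early with a dedicated exception type lets callers distinguish a configuration problem from a genuine evaluation failure.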

### Auto-evaluation Utilities

`prompttools/utils/autoeval.py` implements an LLM-as-judge pattern: it asks a chat model (defaulting to GPT-4) to classify a response as `RIGHT` or `WRONG` and returns `1.0` or `0.0` accordingly [Source: [prompttools/utils/autoeval.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval.py)]. `autoeval_from_expected.py` extends the same idea to ground-truth comparisons, asking the judge to compare `PROMPT`, `EXPECTED`, and `ACTUAL` strings [Source: [prompttools/utils/autoeval_from_expected.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_from_expected.py)]. `autoeval_with_docs.py` adds document-grounded evaluation, rendering retrieved contexts through a Jinja template and producing an integer rating between 0 and 10 [Source: [prompttools/utils/autoeval_with_docs.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_with_docs.py)].
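
As a rough illustration of the LLM-as-judge pattern, the sketch below maps a judge's `RIGHT`/`WRONG` verdict to `1.0` or `0.0`. The instruction text, `binary_score`, and `fake_judge` are all assumptions made for the example: the library's actual prompt wording and function signatures may differ, and the judge here is an offline stand-in rather than a real chat-model call.

```python
from typing import Callable

# Hypothetical judge instructions; the library's actual prompt wording may differ.
JUDGE_INSTRUCTIONS = (
    "Decide whether the RESPONSE correctly answers the PROMPT. "
    "Reply with exactly one word: RIGHT or WRONG."
)


def binary_score(prompt: str, response: str, judge: Callable[[str], str]) -> float:
    """Ask a judge callable for a RIGHT/WRONG verdict and map it to 1.0 or 0.0."""
    message = f"{JUDGE_INSTRUCTIONS}\n\nPROMPT: {prompt}\nRESPONSE: {response}"
    verdict = judge(message).strip().upper()
    return 1.0 if verdict.startswith("RIGHT") else 0.0


def fake_judge(message: str) -> str:
    """Offline stand-in for a chat-model call, used only to exercise the sketch."""
    return "RIGHT" if "RESPONSE: 4" in message else "WRONG"
```

Passing the judge in as a callable keeps the scoring logic testable without network access; in practice the callable would wrap a chat-completion request.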

### Similarity and Structural Metrics

`prompttools/utils/similarity.py` lazily initializes a SentenceTransformer model (`all-MiniLM-L6-v2`) and an optional Chroma client so that similarity functions can run without forcing a heavy import at module load [Source: [prompttools/utils/similarity.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/similarity.py)]. Optional dependencies are imported defensively: if `cv2` or `skimage` is missing, `structural_similarity` raises a `ModuleNotFoundError` directing the user to install `opencv-python` and `scikit-image`.
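
The two ideas in this paragraph, lazy initialization and embedding-based similarity, can be sketched without any third-party packages. Everything below is illustrative: `get_embedder` returns a toy letter-count embedder in place of a real SentenceTransformer model, and `cosine_similarity` is a plain-Python version of the metric, not the library's `cos_similarity`.

```python
import math
from functools import lru_cache
from typing import Callable, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@lru_cache(maxsize=1)
def get_embedder() -> Callable[[str], List[float]]:
    """Build the embedder on first use only; later calls reuse the cached instance."""
    # Toy stand-in: vowel counts. A real setup would load a sentence-embedding
    # model here instead, which is the expensive step being deferred.
    vowels = "aeiou"
    return lambda text: [float(text.lower().count(ch)) for ch in vowels]
```

Deferring construction to the first call means that importing the module stays cheap, and code paths that never compute similarity never pay for model loading.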

### Text and Code Validators

`prompttools/utils/chunk_text.py` exposes a single `chunk_text(text, max_chunk_length)` function that splits on whitespace and never breaks a word across chunks [Source: [prompttools/utils/chunk_text.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/chunk_text.py)]. `validate_python.py` writes the response to a temporary file and shells out to `pylint` via `pylint.epylint`; if pylint is not installed, it raises a `RuntimeError` asking the user to either install `pylint<3.0` or supply a custom evaluator [Source: [prompttools/utils/validate_python.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/validate_python.py)].
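
A minimal sketch of word-preserving chunking, written from the description above rather than copied from the library, looks like this. The function name `chunk_words` is hypothetical, and the handling of a single word longer than the limit (it becomes its own chunk) is a choice made for the sketch; the library's `chunk_text` may behave differently in that edge case.

```python
from typing import List


def chunk_words(text: str, max_chunk_length: int) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chunk_length characters."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else f"{current} {word}"
        if len(candidate) <= max_chunk_length or not current:
            # Either the word fits, or it is a single over-long word that
            # must stand alone because words are never split.
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

With a limit of 10 characters, `chunk_words("the quick brown fox jumps", 10)` yields `["the quick", "brown fox", "jumps"]`.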

```mermaid
flowchart LR
    A[Experiment.run] --> B[DataFrame of responses]
    B --> C{Choose Utility}
    C --> D[autoeval_*]
    C --> E[semantic_similarity]
    C --> F[validate_*]
    C --> G[chunk_text]
    D --> H[Scored DataFrame]
    E --> H
    F --> H
    H --> I[visualize / export]
```
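
The flow in the diagram can be approximated with plain Python data structures. The sketch below uses a list of dictionaries in place of a DataFrame, and `add_metric` and `mentions_expected` are hypothetical helpers; it shows the shape of the "apply a utility to each response, collect a score column" step, not the library's own evaluation API.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def add_metric(rows: List[Row], name: str, eval_fn: Callable[[Row], float]) -> List[Row]:
    """Return new rows with one extra column holding eval_fn's score for each row."""
    return [{**row, name: eval_fn(row)} for row in rows]


def mentions_expected(row: Row) -> float:
    """Toy evaluator: 1.0 if the expected string appears in the response, else 0.0."""
    return 1.0 if str(row["expected"]).lower() in str(row["response"]).lower() else 0.0


responses = [
    {"prompt": "Capital of France?", "response": "Paris is the capital.", "expected": "Paris"},
    {"prompt": "Capital of Japan?", "response": "I am not sure.", "expected": "Tokyo"},
]
scored = add_metric(responses, "contains_expected", mentions_expected)
```

Because each evaluator is just a function of one row, several metrics can be layered by calling `add_metric` repeatedly before visualizing or exporting the result.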

## Harness, PromptTest, and Observability

The **Harness** concept is referenced in community requests such as issue #5 (LangChain Support), which proposes "Harnesses and Experiments to support testing LangChains natively" with low-level chain/agent experiments, step-by-step visualizations, and intermediate-output evaluation. The Utilities module is the natural plug-in point for the harness: scoring and validation functions can be applied per-step rather than only on the final response.

**PromptTest** is introduced in the 0.0.41 Hosted Playground release as part of the broader effort to persist experiments and add behavioral assertions that survive across runs (see [GitHub release notes for v0.0.41](https://github.com/hegelai/prompttools/releases/tag/v0.0.41)). It is intended to be authored alongside the existing experiment workflow and evaluated using the same autoeval and similarity utilities documented above.

**Observability** was announced in the 0.0.45 release as a private beta on the hosted Hegel AI platform. The integration is a one-line opt-in via `import prompttools.logger`, which begins forwarding call-level telemetry to the hosted dashboard ([GitHub release notes for v0.0.45](https://github.com/hegelai/prompttools/releases/tag/v0.0.45)). The hosted Playground is accessible at [prompttools.streamlit.app](https://prompttools.streamlit.app/) for users who do not wish to run the Streamlit app locally [Source: [README.md](https://github.com/hegelai/prompttools/blob/main/README.md)].

## Common Failure Modes

Several issues surfaced in the community align directly with this topic:

- **Missing `streamlit` dependency** when running the playground via the documented `streamlit run` command (issue #126). The Playground README correctly lists a separate `prompttools/playground/requirements.txt` that should be installed first [Source: [prompttools/playground/README.md](https://github.com/hegelai/prompttools/blob/main/prompttools/playground/README.md)].
- **Deprecation warnings** from `st.experimental_get_query_params` / `st.experimental_set_query_params` printed at playground launch (issues #124 and #127) — these originate in the Streamlit version pinned to the playground requirements.
- **Optional-dependency errors**: `structural_similarity` raises `ModuleNotFoundError` when `cv2` or `skimage` is absent, and `validate_python_response` raises `RuntimeError` when `pylint<3.0` is absent, so callers should pre-install these packages or, for Python validation, supply a custom evaluator [Source: [prompttools/utils/similarity.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/similarity.py); [prompttools/utils/validate_python.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/validate_python.py)].
- **Missing `OPENAI_API_KEY`**: the autoeval utilities look up `OPENAI_API_KEY` in `os.environ` and raise `PromptToolsUtilityError` if it is unset, so a CI run without the key fails with an exception rather than producing scores [Source: [prompttools/utils/autoeval.py](https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval.py)].
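
For the optional-dependency bullet above, one way to fail early is to check for the modules before calling a utility that needs them. The helpers `missing_modules` and `require_modules` are hypothetical and use only the standard library; they accept top-level module names such as `cv2` or `skimage`.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def require_modules(names: Iterable[str], install_hint: str) -> None:
    """Raise ModuleNotFoundError with an install hint if any module is unavailable."""
    missing = missing_modules(names)
    if missing:
        raise ModuleNotFoundError(
            f"Missing optional dependencies: {', '.join(missing)}. Try: {install_hint}"
        )
```

A call such as `require_modules(["cv2", "skimage"], "pip install opencv-python scikit-image")` at the start of a notebook surfaces the problem before any experiment has run.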

## See Also

- [Overview, Supported Integrations, and Quickstart](#page-1)
- [Core Experiments API: LLMs, Vector Databases, and Frameworks](#page-2) (experiment reference, including vector database and RAG experiments)
- [Notebook Examples Index](https://github.com/hegelai/prompttools/blob/main/examples/notebooks/README.md)

---


## Pitfall Log

Project: hegelai/prompttools

Summary: Found 14 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Maintenance risk - Maintenance risk requires verification.

## 1. Maintenance risk - Maintenance risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/132

## 2. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/121

## 3. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/122

## 4. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/116

## 5. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/126

## 6. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/hegelai/prompttools

## 7. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/124

## 8. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/127

## 9. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: packet_text.keyword_scan | https://github.com/hegelai/prompttools

## 10. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/hegelai/prompttools

## 11. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/hegelai/prompttools

## 12. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/hegelai/prompttools

## 13. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/hegelai/prompttools

## 14. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/hegelai/prompttools

<!-- canonical_name: hegelai/prompttools; human_manual_source: deepwiki_human_wiki -->
