Doramagic Project Pack · Human Manual
prompttools
Open-source tools for prompt testing and experimentation, with support for both LLMs (e.g. OpenAI, LLaMA) and vector databases (e.g. Chroma, Weaviate, LanceDB).
Overview, Supported Integrations, and Quickstart
Related topics: Core Experiments API: LLMs, Vector Databases, and Frameworks, Playground, Widgets, and Visualization
1. Project Overview
prompttools is an open-source library created by Hegel AI that provides self-hostable tools for experimenting with, testing, and evaluating LLMs, vector databases, and prompts. The project positions itself around three familiar interfaces: code, notebooks, and a local playground, so that developers can rapidly compare prompts, models, and retrieval configurations with minimal boilerplate Source: [README.md].
A core design choice is that all executions and calls to LLM services happen locally on the user's machine. The library explicitly states it does not forward requests or log user information Source: [prompttools/playground/README.md]. By default, the package does emit error telemetry to Sentry to track its own reliability issues; this can be disabled with the SENTRY_OPT_OUT environment variable Source: [README.md].
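As a minimal sketch of the opt-out, the variable can be set in-process before the package is imported. The value "1" is an assumption: the README names the variable, but the exact value the library checks for is not stated here.

```python
import os

# Assumption: any non-empty value opts out; "1" is illustrative only.
os.environ.setdefault("SENTRY_OPT_OUT", "1")

# Import prompttools only after the variable is set, so the opt-out is
# visible when the package initializes. The import is left commented out
# so this sketch runs without the package installed.
# import prompttools

print("SENTRY_OPT_OUT =", os.environ["SENTRY_OPT_OUT"])
```

Setting the variable in the shell or in a CI configuration before launching Python would serve the same purpose.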
A representative entry point is the OpenAIChatExperiment class, which can be instantiated with models, messages, and parameters, then executed and visualized in a few lines Source: [README.md].
2. Supported Integrations
The README maintains an explicit matrix of integrations grouped into three categories. The table below summarizes the supported surface as documented:
| Category | Component | Status |
|---|---|---|
| LLMs | OpenAI (Completion, ChatCompletion, Fine-tuned models) | Supported |
| LLMs | LLaMA.Cpp (LLaMA 1, LLaMA 2) | Supported |
| LLMs | HuggingFace (Hub API, Inference Endpoints) | Supported |
| LLMs | Anthropic | Supported |
| LLMs | Mistral AI | Supported |
| LLMs | Google Gemini | Supported |
| LLMs | Google PaLM (legacy) | Supported |
| LLMs | Google Vertex AI | Supported |
| LLMs | Azure OpenAI Service | Supported |
| LLMs | Replicate | Supported |
| LLMs | Ollama | In Progress |
| Vector DBs | Chroma, Weaviate, Qdrant, LanceDB, Pinecone | Supported |
| Vector DBs | Milvus | Exploratory |
| Vector DBs | Epsilla | In Progress |
| Frameworks | LangChain, MindsDB | Supported |
Source: README.md
Release-Driven Additions
Versioned releases in the public changelog show how the supported surface has expanded:
- v0.0.35 introduced Google Vertex AI, Azure OpenAI Service, Replicate, Stable Diffusion, Pinecone, Qdrant, and RAG experiments, alongside the utility helpers chunk_text, autoeval_with_documents, and structural_similarity Source: [README.md].
- v0.0.41 launched the hosted Playground (private beta), persisting experiments with version control and team collaboration features Source: [README.md].
- v0.0.45 added observability features via import prompttools.logger, intended for monitoring production LLM usage Source: [README.md].
Community-Reported Integration Gaps
The community tracker surfaces requested integrations that are not yet shipped. Tracked feature requests include Ollama (Issue #39), Microsoft Semantic-Kernel (Issue #114), the OpenAI Image Generation API (Issue #113), MusicGen/audio model evaluation (Issue #82), and broader LangChain harnesses (Issue #5). Ollama is explicitly listed as "In Progress" in the official matrix Source: [README.md].
3. Quickstart
Installation
The package is distributed on PyPI and installed via pip Source: [README.md]:
pip install prompttools
To run the playground locally, the repo must be cloned and the Streamlit dependency installed separately because the playground is not bundled as a runtime dependency of the pip package. Community issue #126 reports that streamlit is missing from the main requirements, so the documented command sequence is:
git clone https://github.com/hegelai/prompttools.git
cd prompttools && pip install -r prompttools/playground/requirements.txt
streamlit run prompttools/playground/playground.py
Source: prompttools/playground/README.md
Minimal Code Example
The canonical first example uses OpenAIChatExperiment. The user supplies a list of message threads, a list of model identifiers, and a parameter grid (e.g., temperature values), then calls .run() followed by .visualize() Source: [README.md]:
from prompttools.experiment import OpenAIChatExperiment
messages = [
[{"role": "user", "content": "Tell me a joke."}],
[{"role": "user", "content": "Is 17077 a prime number?"}],
]
models = ["gpt-3.5-turbo", "gpt-4"]
temperatures = [0.0]
openai_experiment = OpenAIChatExperiment(models, messages, temperature=temperatures)
openai_experiment.run()
openai_experiment.visualize()
Notebook Examples
A curated set of runnable notebooks is available under examples/notebooks/. Coverage spans single-model LLM experiments (OpenAI Chat, Anthropic, PaLM 2, Vertex AI, LLaMA.Cpp, HuggingFace Hub), function calling, regression testing, human feedback, and multimodal cases such as Stable Diffusion Source: [examples/notebooks/README.md]. Vector database notebooks under vectordb_experiments/ exercise ChromaDB, Weaviate, LanceDB, Qdrant, and Pinecone, while framework notebooks demonstrate LangChain sequential chains, router chains, and MindsDB integration Source: [examples/notebooks/README.md].
Known Setup Pitfalls
Several community-reported issues affect first-run experience:
- Streamlit deprecation warnings: Issues #124 and #127 report that the playground uses st.experimental_get_query_params / st.experimental_set_query_params, which were slated for removal after 2024-04-11. Users should migrate to st.query_params Source: [prompttools/playground/README.md].
- LanceDB import breakage: Issue #132 documents a crash on from prompttools.experiment import LanceDBExperiment in the latest version, with a traceback screenshot attached.
- OpenAI client compatibility: Issue #122 reports AttributeError: module 'openai' has no attribute 'types' after a clean clone, indicating a need to pin or upgrade the openai package.
- Notebook dependency drift: Issues #121 and #116 document that notebook examples need extra packages (fastapi, kaleido, uvicorn, cohere, tiktoken) and the pin pandas==1.5.3, and that AzureOpenAIService notebooks require both model and prompt (optionally with the stream flag).
4. Utility Layer
Beyond experiment drivers, prompttools.utils exposes a curated set of evaluation and text utilities, re-exported through prompttools/utils/__init__.py:
- chunk_text(text, max_chunk_length): Splits a paragraph into space-aware chunks that do not break words Source: [prompttools/utils/chunk_text.py].
- semantic_similarity / cos_similarity: Lazy-loaded embedding helpers backed by sentence-transformers/all-MiniLM-L6-v2 or Chroma Source: [prompttools/utils/similarity.py].
- structural_similarity: Image SSIM scorer requiring opencv-python and scikit-image Source: [prompttools/utils/similarity.py].
- Auto-evaluation: GPT-4-as-judge scorers, namely autoeval_binary_scoring (RIGHT/WRONG), autoeval_from_expected_response (grade against an expected answer), and autoeval_with_documents (RAG-grounded 0–10 scoring) Source: [prompttools/utils/autoeval.py, prompttools/utils/autoeval_from_expected.py, prompttools/utils/autoeval_with_docs.py].
- validate_python_response: Uses pylint<3.0 (via epylint) to check whether a generated string parses as valid Python Source: [prompttools/utils/validate_python.py].
- PromptToolsUtilityError: Shared exception type for utility-layer failures, e.g., missing OPENAI_API_KEY Source: [prompttools/utils/error.py].
These utilities are surfaced in the public namespace as autoeval_binary_scoring, autoeval_with_documents, chunk_text, semantic_similarity, cos_similarity, validate_json_response, validate_python_response, and apply_moderation Source: [prompttools/utils/__init__.py].
See Also
Source: https://github.com/hegelai/prompttools / Human Manual
Core Experiments API: LLMs, Vector Databases, and Frameworks
Related topics: Overview, Supported Integrations, and Quickstart, Utilities, Harness, PromptTest, and Observability
Overview and Purpose
The Core Experiments API is the central public surface of prompttools. It exposes a uniform Experiment abstraction that lets developers sweep prompts, model parameters, and embedding/retrieval configurations across heterogeneous backends, then collect, evaluate, and visualize the results in a pandas DataFrame. Source: README.md.
The library targets three backend families that share the same call pattern (.run() + .visualize()):
| Family | Purpose | Example backends |
|---|---|---|
| LLMs | Compare prompt/model/parameter combinations | OpenAI, Anthropic, LLaMA.Cpp, HuggingFace, Mistral, Google Gemini/PaLM/Vertex, Azure OpenAI, Replicate |
| Vector Databases | Evaluate retrieval quality and embedding functions | Chroma, Weaviate, Qdrant, LanceDB, Pinecone (Milvus/Epsilla exploratory) |
| Frameworks | Test composed chains and agents | LangChain, MindsDB (LlamaIndex exploratory) |
Source: README.md (Supported Integrations section).
Architecture and Data Flow
All experiment classes inherit from a common Experiment base class that defines the run loop, the evaluation hooks, and the persistence methods (to_csv, to_json, to_lora_json, to_mongo_db). Source: README.md.
flowchart LR
A[User code<br/>inputs + param grid] --> B[Experiment subclass<br/>e.g. OpenAIChatExperiment]
B --> C[.run()]
C --> D[Backend SDK<br/>OpenAI / Anthropic / Chroma / LangChain]
D --> E[pandas DataFrame<br/>prompt, response, metadata]
E --> F[Evaluation utilities<br/>similarity, autoeval, moderation]
F --> G[.visualize() / .to_csv / .to_json]
E --> H[Hosted Playground / Observability<br/>import prompttools.logger]
The pipeline is deliberately local-first: API calls originate from the user's machine and only the experiment framework code is loaded. Source: README.md (FAQ: "Will this library forward my LLM calls to a server...?").
LLM Experiments
LLM experiments are constructed from three orthogonal axes: the message/prompt list, the model identifier list, and a parameter dictionary (e.g., temperature). The OpenAIChatExperiment shown in the README demonstrates the canonical pattern. Source: README.md.
from prompttools.experiment import OpenAIChatExperiment
messages = [
[{"role": "user", "content": "Tell me a joke."}],
[{"role": "user", "content": "Is 17077 a prime number?"}],
]
models = ["gpt-3.5-turbo", "gpt-4"]
temperatures = [0.0]
exp = OpenAIChatExperiment(models, messages, temperature=temperatures)
exp.run()
exp.visualize()
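The sweep implied by this example can be sketched without calling any API. The snippet below enumerates every combination of model, message thread, and temperature as a Cartesian product. That the library expands its arguments exactly this way is an assumption, and expand_grid is a hypothetical helper, not part of prompttools.

```python
from itertools import product

def expand_grid(models, messages, **params):
    """Return one dict per combination of model, message thread, and parameter value."""
    names = ["model", "messages", *params]
    axes = [models, messages, *params.values()]
    return [dict(zip(names, combo)) for combo in product(*axes)]

messages = [
    [{"role": "user", "content": "Tell me a joke."}],
    [{"role": "user", "content": "Is 17077 a prime number?"}],
]
models = ["gpt-3.5-turbo", "gpt-4"]

# Two models x two message threads x one temperature value.
grid = expand_grid(models, messages, temperature=[0.0])
for row in grid:
    print(row["model"], row["temperature"], row["messages"][0]["content"])
```

Counting the rows before a run is a cheap way to estimate how many API calls a sweep would issue, since every added parameter value multiplies the total.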
Equivalent notebooks exist for Anthropic Claude, PaLM 2, Google Vertex chat, LLaMA.Cpp, and HuggingFace Hub, each parameterized to that provider's SDK. Source: examples/notebooks/README.md.
Community-known failure modes on this surface include a TypeError: Missing required arguments thrown by the Azure OpenAI notebook when model and prompt (optionally with stream) are not supplied (Issue #116), and AttributeError: module 'openai' has no attribute 'types', which points to a mismatch between the installed openai SDK and the version the experiment expects (Issue #122). Both are tracked separately from the experiment API itself.
Vector Database and Retrieval Experiments
Vector database experiments take a corpus plus an embedding configuration and measure retrieval accuracy, typically with ranking-correlation utilities against an expected ordering. Source: examples/notebooks/README.md.
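To make the idea of comparing a retrieved ordering against an expected one concrete, here is a minimal sketch that computes Spearman's rank correlation by hand for two orderings of the same distinct documents. Which statistic the library's ranking_correlation utility uses is not stated here, so Spearman's rho is an assumed stand-in, and spearman_rho is a hypothetical name.

```python
def spearman_rho(retrieved, expected):
    """Spearman rank correlation between two orderings of the same distinct items."""
    if (
        len(retrieved) != len(expected)
        or set(retrieved) != set(expected)
        or len(retrieved) != len(set(retrieved))
    ):
        raise ValueError("orderings must contain the same distinct items")
    n = len(expected)
    if n < 2:
        raise ValueError("need at least two items to correlate")
    expected_rank = {doc: i for i, doc in enumerate(expected)}
    # Sum of squared rank differences; valid because there are no ties.
    d_squared = sum((i - expected_rank[doc]) ** 2 for i, doc in enumerate(retrieved))
    return 1 - 6 * d_squared / (n * (n * n - 1))

expected = ["doc_a", "doc_b", "doc_c", "doc_d"]
print(spearman_rho(["doc_a", "doc_b", "doc_c", "doc_d"], expected))  # identical order
print(spearman_rho(["doc_d", "doc_c", "doc_b", "doc_a"], expected))  # reversed order
```

A value near 1 means the vector store returned documents in nearly the expected order; a value near -1 means the order was close to reversed.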
The supported notebooks include Chroma, Weaviate, LanceDB, Qdrant, Pinecone, and a Retrieval-Augmented Generation (RAG) experiment that chains a vector store with an LLM. Source: examples/notebooks/README.md.
Two helper utilities make these experiments practical:
- chunk_text(text, max_chunk_length) splits paragraphs without breaking words, enabling consistent ingestion across embedding configurations. Source: prompttools/utils/chunk_text.py.
- autoeval_with_documents(row, documents, response_column_name) asks GPT-4 to grade whether a response is grounded in retrieved documents, returning an integer 0–10. Source: prompttools/utils/autoeval_with_docs.py.
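The word-preserving chunking described above can be sketched as a greedy packer. This is an illustrative reimplementation under assumptions, not the library's code: chunk_words is a hypothetical name, and keeping an overlong word whole (so that its chunk exceeds the limit) is a choice made here, since the source does not say how that case is handled.

```python
def chunk_words(text, max_chunk_length):
    """Greedily pack whitespace-separated words into chunks of at most max_chunk_length characters."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chunk_length or not current:
            current = candidate  # the word fits, or it starts a chunk on its own
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

sentence = "the quick brown fox jumps over the lazy dog"
chunks = chunk_words(sentence, 10)
print(chunks)
```

Joining the chunks with single spaces reproduces the whitespace-normalized input, which is a useful sanity check before ingesting chunks into a vector store.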
Community-known issue: from prompttools.experiment import LanceDBExperiment crashes on the latest published version (Issue #132), which suggests that optional backends are imported eagerly rather than lazily.
Framework Experiments and Evaluation Utilities
Framework experiments let users treat chains and routers as first-class experimentables. The currently documented notebooks are LangChainSequentialChainExperiment, LangChainRouterChainExperiment, and MindsDBExperiment. Source: examples/notebooks/README.md.
A request to add Microsoft Semantic-Kernel support (Issue #114) and ongoing interest in deeper LangChain support (Issue #5) reflect where the framework surface is expected to grow.
Evaluation utilities are re-exported from prompttools.utils and attach to experiment rows. Source: prompttools/utils/__init__.py.
| Utility | Purpose | Source |
|---|---|---|
| semantic_similarity / cos_similarity | Embedding-based comparison of two strings (HuggingFace or Chroma) | prompttools/utils/similarity.py |
| structural_similarity | SSIM between images (cv2 + skimage) | prompttools/utils/similarity.py |
| autoeval_binary_scoring | GPT-4 judges whether the response follows the prompt | prompttools/utils/autoeval.py |
| autoeval_from_expected_response | GPT-4 grades ACTUAL against EXPECTED | prompttools/utils/autoeval_from_expected.py |
| autoeval_with_documents | GPT-4 grades RAG grounding in provided docs | prompttools/utils/autoeval_with_docs.py |
| apply_moderation | OpenAI moderation API on a response column | prompttools/utils/moderation.py |
| validate_json_response / validate_python_response | Check that model output is valid JSON / valid Python | prompttools/utils/__init__.py |
| ranking_correlation | Compare vector DB ordering to an expected ordering | prompttools/utils/__init__.py |
Playground, Persistence, and Observability
Experiment objects can be persisted locally via to_csv, to_json, to_lora_json, or to_mongo_db. Source: README.md. For interactive exploration, the Streamlit playground is launched from the cloned repo and shares the same Experiment API. Source: prompttools/playground/README.md.
Two community-reported issues affect the playground specifically and are worth noting here: deprecation warnings from st.experimental_get_query_params (Issue #124, Issue #127) and streamlit being absent from the package's main requirements, so that it must be installed from the playground's own requirements.txt (Issue #126).
For hosted workflows, release v0.0.41 introduced the hosted Playground (private beta), which persists experiments with version control, and release v0.0.45 added import prompttools.logger, a one-line observability hook for production LLM calls. Source: README.md.
See Also
- Evaluation Utilities Reference — deeper documentation of the auto-eval, similarity, and moderation helpers.
- Playground UI Guide — running and troubleshooting the Streamlit playground.
- Notebook Examples Index — full catalog of runnable notebooks per backend.
Source: https://github.com/hegelai/prompttools / Human Manual
Playground, Widgets, and Visualization
Related topics: Overview, Supported Integrations, and Quickstart, Core Experiments API: LLMs, Vector Databases, and Frameworks
Overview and Purpose
The Playground is prompttools' Streamlit-based graphical interface that lets users evaluate prompts, model parameters, and vector-database retrieval settings without writing code. It complements the notebook-driven workflow (OpenAIChatExperiment.ipynb, ChromaDBExperiment.ipynb, etc.) described in examples/notebooks/README.md and exposes the same experiment.run() + experiment.visualize() flow described in the top-level README.md. Per prompttools/playground/README.md, the playground can:
- Evaluate different system instructions (system prompts)
- Try different prompt templates
- Compare responses across models (e.g., GPT-4 vs. local LLaMA 2)
All calls to LLM services and vector databases execute locally on the user's machine; the package does not forward requests or log responses, as stated in prompttools/playground/README.md.
flowchart LR
A[User opens Playground] --> B[Select Model & API]
B --> C[Configure Prompts / Templates]
C --> D[Run Experiment]
D --> E[Experiment.run()]
E --> F[Experiment.visualize()]
F --> G[Streamlit Widgets: Tables, Charts, Rankings]
G --> H[Compare Responses Across Models]
Architecture and Module Layout
The playground is structured as a small package:
| File | Role |
|---|---|
| prompttools/playground/playground.py | Streamlit entry point that renders widgets and dispatches experiments. |
| prompttools/playground/constants.py | Defines supported model lists, experiment types, and reusable constants. |
| prompttools/playground/data_loader.py | Loads user-supplied CSV/data inputs into the experiment runtime. |
| prompttools/playground/__init__.py | Package marker; exposes playground helpers. |
| prompttools/playground/packages.txt | Lists system packages required by the hosted Streamlit deployment. |
Per the launch instructions in prompttools/playground/README.md, the application is started locally with:
git clone https://github.com/hegelai/prompttools.git
cd prompttools && pip install -r prompttools/playground/requirements.txt
streamlit run prompttools/playground/playground.py
The playground can also be reached through the hosted Streamlit Community Cloud deployment at https://prompttools.streamlit.app/, as documented in README.md. That hosted variant does not support the LlamaCpp experiment.
Widgets and Visualization Pipeline
Each widget in the playground corresponds to a stage of the experiment lifecycle declared in the underlying experiments package. After the user configures inputs and triggers a run, playground.py invokes the standard run() / visualize() contract, which is the same one exposed in the notebook examples summarized in examples/notebooks/README.md.
The typical visualization surfaces include:
- Tabular response grids that pivot input prompts against model/parameter combinations.
- Charts and ranking correlations for vector-database experiments (e.g., ChromaDBExperiment, LanceDBExperiment).
- Side-by-side response comparison for chat and completion models.
- Evaluation overlays powered by utility functions re-exported from prompttools/utils/__init__.py: semantic_similarity, cos_similarity, ranking_correlation, autoeval_scoring, autoeval_with_documents, validate_json_response, and validate_python_response.
The __init__.py re-exports ensure the playground can attach any of these evaluators to a response column without modifying experiment code:
from prompttools.utils import semantic_similarity, autoeval_with_documents
Configuration, Deployment, and Community-Reported Issues
Several community-reported issues directly shape how the playground should be configured and operated:
- Missing Streamlit dependency: Issue #126 reports that streamlit is not declared in the top-level requirements.txt. The workaround documented in prompttools/playground/README.md is to install prompttools/playground/requirements.txt before invoking streamlit run.
- Streamlit deprecations: Issues #124 and #127 report that st.experimental_get_query_params and st.experimental_set_query_params are deprecated and were slated for removal after 2024-04-11. Users running the playground against recent Streamlit versions should migrate to st.query_params per the upstream Streamlit docs.
- Hosted Playground (0.0.41): Release v0.0.41 introduced the hosted Playground as a private beta with experiment persistence and collaboration features.
- Observability overlay (0.0.45): Release v0.0.45 added import prompttools.logger so that teams can monitor production LLM usage on the hosted platform.
- Integration gaps: Open community requests include Ollama (#39), Microsoft Semantic-Kernel (#114), OpenAI Assistants API (#111), MusicGen (#82), and OpenAI Image Generation (#113). Until those experiments are added under prompttools/experiment/experiments/, their constants are absent from prompttools/playground/constants.py and they cannot be selected in the playground UI.
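For the Streamlit deprecation in particular, one possible migration path is a small compatibility shim that prefers st.query_params and falls back to the experimental getter on older releases. This is a sketch under assumptions: read_query_params is a hypothetical helper that is not in the playground code, and the stand-in objects below only mimic the two Streamlit APIs so the example runs without Streamlit installed.

```python
from types import SimpleNamespace

def read_query_params(st):
    """Return query parameters as a plain dict, preferring the newer API when present."""
    if hasattr(st, "query_params"):
        return dict(st.query_params)
    # Older releases: the experimental getter returns a dict of lists,
    # so keep the last value of each key.
    legacy = st.experimental_get_query_params()
    return {key: values[-1] for key, values in legacy.items() if values}

# Stand-ins for the streamlit module (assumed shapes, for illustration only).
new_style = SimpleNamespace(query_params={"model": "gpt-4"})
old_style = SimpleNamespace(experimental_get_query_params=lambda: {"model": ["gpt-4"]})

print(read_query_params(new_style))
print(read_query_params(old_style))
```

In a real app the function would be called with the imported streamlit module itself, which keeps the rest of the code independent of the installed Streamlit version.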
The packages.txt file is consulted when deploying to Streamlit Community Cloud so that native dependencies (e.g., for cv2, librosa, image/audio experiments referenced in examples/notebooks/README.md) are present at runtime.
Failure Modes and Best Practices
- CSV ingestion errors: data_loader.py requires well-formed inputs; malformed CSVs will surface as Streamlit exceptions rather than silent skips.
- Missing API keys: Utility evaluators such as autoeval_binary_scoring, autoeval_from_expected_response, and apply_moderation (re-exported from prompttools/utils/__init__.py) raise PromptToolsUtilityError when OPENAI_API_KEY is unset.
- Local model limitations: LlamaCpp experiments run only on the local playground, never on the hosted Streamlit deployment, per the note in README.md.
- Dependency drift: Issue #121 documents a working notebook dependency set (fastapi, kaleido, python-multipart, uvicorn, cohere, tiktoken, pandas==1.5.3) that users running playground-launched notebooks may need to mirror.
- Vector-DB imports: Issue #132 reports a crash when importing LanceDBExperiment. This propagates to the playground whenever LanceDB is selected, so users should verify the experiment imports cleanly in isolation before relying on the UI surface.
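Checking that a backend imports cleanly in isolation can be sketched with the standard library alone. The helper below is hypothetical (check_import is not part of prompttools) and the module names in the loop are placeholders; in practice one might pass "prompttools.experiment" or "lancedb" instead.

```python
import importlib

def check_import(module_name):
    """Try to import a module and return (ok, error_message)."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # import-time crashes are not always ImportError
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""

# "json" always imports; the second name is deliberately nonexistent.
for name in ["json", "module_that_does_not_exist_xyz"]:
    ok, error = check_import(name)
    print(name, "OK" if ok else f"FAILED ({error})")
```

Running such a check before launching the playground separates import problems from UI problems, which makes reports against issues like #132 easier to reproduce.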
See Also
- Project README — high-level overview and supported integrations.
- Notebook Examples — programmatic counterparts to every playground view.
- Release v0.0.41 — Hosted Playground and Release v0.0.45 — Observability.
- prompttools/utils/__init__.py for the evaluation utilities the playground attaches to response columns.
Source: https://github.com/hegelai/prompttools / Human Manual
Utilities, Harness, PromptTest, and Observability
Related topics: Core Experiments API: LLMs, Vector Databases, and Frameworks, Playground, Widgets, and Visualization
Overview
prompttools provides four cross-cutting capabilities that sit on top of its experiment suite: a Utilities module for evaluating, scoring, and chunking text/code, a Harness abstraction for embedding full applications (such as LangChain agents) into experiments, the PromptTest workflow for asserting behavioral expectations, and an Observability layer that ships call metadata to the hosted Hegel AI platform. The Utilities module is fully open-sourced and importable from the prompttools.utils namespace, while Harness, PromptTest, and Observability are referenced in the project's roadmap and release notes as the path toward higher-level prompt evaluation and production monitoring.
The README frames the project as a way to "test and experiment with prompts, LLMs, and vector databases" using familiar interfaces such as code, notebooks, and a local playground, and the Utilities module is the bridge that connects raw model responses to these interfaces Source: [README.md].
Utilities Module
The prompttools.utils package re-exports a curated set of helper functions used to score, compare, and post-process model outputs. The full public surface is declared in prompttools/utils/__init__.py and includes the following entry points Source: [prompttools/utils/__init__.py]:
| Function | Purpose |
|---|---|
autoeval_binary_scoring | Judge a response as RIGHT/WRONG via GPT-4 |
autoeval_from_expected_response | Compare actual vs. expected with a grader model |
autoeval_scoring | Score a response on an integer scale |
autoeval_with_documents | Grounded RAG scoring using supporting documents |
chunk_text | Split a paragraph into word-preserving chunks |
compute_similarity_against_model | Embedding-based similarity to a model output |
apply_moderation | Run OpenAI moderation on a response |
ranking_correlation | Rank-correlation metric for retrieval results |
semantic_similarity, cos_similarity | HuggingFace/Chroma-backed similarity |
validate_json_response, validate_python_response | Structural validators for generated code/data |
A custom exception, PromptToolsUtilityError, is defined in prompttools/utils/error.py and is raised by utilities when preconditions are not met (for example, a missing OPENAI_API_KEY environment variable) Source: [prompttools/utils/error.py].
Auto-evaluation Utilities
prompttools/utils/autoeval.py implements an LLM-as-judge pattern: it asks a chat model (defaulting to GPT-4) to classify a response as RIGHT or WRONG and returns 1.0 or 0.0 accordingly Source: [prompttools/utils/autoeval.py]. autoeval_from_expected.py extends the same idea to ground-truth comparisons, asking the judge to compare PROMPT, EXPECTED, and ACTUAL strings Source: [prompttools/utils/autoeval_from_expected.py]. autoeval_with_docs.py adds document-grounded evaluation, rendering retrieved contexts through a Jinja template and producing an integer rating between 0 and 10 Source: [prompttools/utils/autoeval_with_docs.py].
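The RIGHT/WRONG-to-score mapping can be illustrated without an API key by stubbing the judge. Everything named here is hypothetical: verdict_to_score and fake_judge are not library functions, the stub replaces what would be a chat-model call, and checking for WRONG before RIGHT is a design choice of this sketch rather than documented behavior.

```python
def verdict_to_score(judge_reply):
    """Map a judge model's free-text reply to 1.0 (RIGHT) or 0.0 (WRONG)."""
    text = judge_reply.upper()
    if "WRONG" in text:
        return 0.0
    if "RIGHT" in text:
        return 1.0
    raise ValueError(f"unrecognized verdict: {judge_reply!r}")

def fake_judge(prompt, response):
    """Stand-in for a chat-model call; a real judge would query an LLM here."""
    return "RIGHT" if "4" in response else "WRONG"

print(verdict_to_score(fake_judge("What is 2 + 2?", "2 + 2 = 4")))
print(verdict_to_score(fake_judge("What is 2 + 2?", "It is 5")))
```

Raising on an unrecognized verdict, instead of silently scoring it 0.0, keeps judge formatting failures distinguishable from genuinely wrong answers.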
Similarity and Structural Metrics
prompttools/utils/similarity.py lazily initializes a SentenceTransformer model (all-MiniLM-L6-v2) and an optional Chroma client so that similarity functions can run without forcing a heavy import at module load Source: [prompttools/utils/similarity.py]. Optional dependencies are imported defensively: if cv2 or skimage is missing, structural_similarity raises a ModuleNotFoundError directing the user to install opencv-python and scikit-image.
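The metric underneath embedding-based similarity is cosine similarity between two vectors. The sketch below computes it in plain Python on toy vectors; it is not the library's implementation, which works on model-generated embeddings, and cosine_similarity is a name chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```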
Text and Code Validators
prompttools/utils/chunk_text.py exposes a single chunk_text(text, max_chunk_length) function that splits on whitespace and never breaks a word across chunks Source: [prompttools/utils/chunk_text.py]. validate_python.py writes the response to a temporary file and shells out to pylint via pylint.epylint; if pylint is not installed, it raises a RuntimeError asking the user to either install pylint<3.0 or supply a custom evaluator Source: [prompttools/utils/validate_python.py].
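Where pylint is unavailable, one lighter-weight custom evaluator is a syntax-only check with the standard-library ast module. This is an assumption about what such an evaluator could look like, not the library's method: is_valid_python is a hypothetical name, the 1.0/0.0 return convention is assumed, and the check does not reproduce pylint's analysis.

```python
import ast

def is_valid_python(source):
    """Return 1.0 if the string parses as Python source, else 0.0."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return 0.0
    return 1.0

print(is_valid_python("def add(a, b):\n    return a + b\n"))
print(is_valid_python("def add(a, b) return a + b"))
```

Because ast.parse only checks syntax, code that references undefined names still passes; a linter or a test run would be needed to catch that class of error.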
flowchart LR
A[Experiment.run] --> B[DataFrame of responses]
B --> C{Choose Utility}
C --> D[autoeval_*]
C --> E[semantic_similarity]
C --> F[validate_*]
C --> G[chunk_text]
D --> H[Scored DataFrame]
E --> H
F --> H
H --> I[visualize / export]
Harness, PromptTest, and Observability
The Harness concept is referenced in community requests such as issue #5 (LangChain Support), which proposes "Harnesses and Experiments to support testing LangChains natively" with low-level chain/agent experiments, step-by-step visualizations, and intermediate-output evaluation. The Utilities module is the natural plug-in point for the harness: scoring and validation functions can be applied per-step rather than only on the final response.
PromptTest is introduced in the 0.0.41 Hosted Playground release as part of the broader effort to persist experiments and add behavioral assertions that survive across runs (see GitHub release notes for v0.0.41). It is intended to be authored alongside the existing experiment workflow and evaluated using the same autoeval and similarity utilities documented above.
Observability was announced in the 0.0.45 release as a private beta on the hosted Hegel AI platform. The integration is a one-line opt-in via import prompttools.logger, which begins forwarding call-level telemetry to the hosted dashboard (GitHub release notes for v0.0.45). The hosted Playground is accessible at prompttools.streamlit.app for users who do not wish to run the Streamlit app locally Source: [README.md].
Common Failure Modes
Several issues surfaced in the community align directly with this topic:
- Missing streamlit dependency when running the playground via the documented streamlit run command (issue #126). The Playground README lists a separate prompttools/playground/requirements.txt that should be installed first Source: [prompttools/playground/README.md].
- Deprecation warnings from st.experimental_get_query_params / st.experimental_set_query_params printed at playground launch (issues #124 and #127); these come from the playground's calls to the deprecated APIs under the installed Streamlit version.
- Optional-dependency errors: utilities like structural_similarity and validate_python_response raise ModuleNotFoundError / RuntimeError when cv2, skimage, or pylint<3.0 are absent, so callers should pre-install these or supply a custom evaluator Source: [prompttools/utils/similarity.py; prompttools/utils/validate_python.py].
- Missing OPENAI_API_KEY: the autoeval utilities check os.environ["OPENAI_API_KEY"] and raise PromptToolsUtilityError if it is unset, so a missing key in CI aborts scoring instead of producing results Source: [prompttools/utils/autoeval.py].
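A preflight check for required environment variables can surface the missing-key case before any evaluation starts. The sketch below is self-contained and hypothetical: UtilityPreconditionError is a local stand-in for the library's PromptToolsUtilityError, and require_env is not a prompttools function.

```python
import os

class UtilityPreconditionError(Exception):
    """Local stand-in for the library's PromptToolsUtilityError."""

def require_env(name, environ=None):
    """Return the variable's value, or raise if it is missing or empty."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "")
    if not value:
        raise UtilityPreconditionError(f"{name} is not set; export it before running evaluations")
    return value

# Demonstrated against explicit dicts so the sketch does not depend on the real environment.
print(require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-placeholder"}))
try:
    require_env("OPENAI_API_KEY", {})
except UtilityPreconditionError as exc:
    print("preflight failed:", exc)
```

Calling such a check at the top of a CI job makes the failure explicit and early, instead of surfacing partway through an experiment run.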
See Also
- Getting Started (Quickstart & Integrations)
- Experiment Reference
- Vector Database & RAG Experiments
- Notebook Examples Index
Source: https://github.com/hegelai/prompttools / Human Manual
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 14 structured pitfall item(s), including 2 high/blocking item(s). Top priority: Maintenance risk - Maintenance risk requires verification.
1. Maintenance risk: Maintenance risk requires verification
- Severity: high
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/132
2. Security or permission risk: Security or permission risk requires verification
- Severity: high
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/121
3. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/122
4. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/116
5. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/126
6. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | https://github.com/hegelai/prompttools
7. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/124
8. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/hegelai/prompttools/issues/127
9. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: packet_text.keyword_scan | https://github.com/hegelai/prompttools
10. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/hegelai/prompttools
11. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | https://github.com/hegelai/prompttools
12. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: risks.scoring_risks | https://github.com/hegelai/prompttools
Source: Doramagic discovery, validation, and Project Pack records
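The recommended check repeated throughout the log (reproduce the install and quickstart path in an isolated environment) can be sketched with the Python standard library. This is a minimal sketch, not the project's own tooling: the helper names `env_python`, `build_check_commands`, and `run_check` are hypothetical, and it assumes the package installs from PyPI as `prompttools` and imports under the same name.

```python
import subprocess
import sys
import venv
from pathlib import Path
from typing import List


def env_python(env_dir: Path) -> Path:
    """Return the interpreter path inside a virtual environment directory."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def build_check_commands(env_dir: Path, package: str = "prompttools") -> List[List[str]]:
    """Build the commands for a clean install followed by an import smoke test."""
    py = str(env_python(env_dir))
    return [
        [py, "-m", "pip", "install", "--upgrade", "pip"],
        [py, "-m", "pip", "install", package],
        # Assumption: the import name matches the distribution name.
        [py, "-c", "import prompttools"],
    ]


def run_check(env_dir: Path) -> None:
    """Create a fresh environment and run each step, stopping at the first failure."""
    venv.EnvBuilder(with_pip=True, clear=True).create(env_dir)
    for cmd in build_check_commands(env_dir):
        subprocess.run(cmd, check=True)
```

Calling `run_check(Path("prompttools-check-env"))` needs network access and raises `subprocess.CalledProcessError` at the first failing step, which makes it easier to tell an installation problem from an import-time one.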
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using prompttools with real data or production workflows.
- Deprecation error : st.experimental_get_query_params and st.experimental - github / github_issue
- package breaks while importing LanceDB experiment - github / github_issue
- Support for OpenAI Assistants API - github / github_issue
- Deprecation warnings - github / github_issue
- Missing requirements - streamlit - github / github_issue
- AzureOpenAIServiceExperiment notebook: TypeError: Missing required argum - github / github_issue
- AttributeError: module 'openai' has no attribute 'types' - github / github_issue
- OpenAI Chat Experiment Example dependency issue fix - github / github_issue
- Add support for Microsoft Semantic-Kernel - github / github_issue
- prompttools 0.0.45 - Introducing Observability features! - github / github_release
- prompttools 0.0.41 - Hosted Playground Launch - github / github_release
- prompttools 0.0.35 - github / github_release
Source: Project Pack community evidence and pitfall evidence
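Several of the linked issues concern dependencies (Streamlit, OpenAI, LanceDB), so it may help to record which versions are installed before reading them. The following is a minimal sketch using only the standard library. The package list is an assumption drawn from the issue titles above, not a statement of the project's actual requirements, and the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional

# Assumption: distribution names inferred from the issue titles listed above.
PACKAGES = ("prompttools", "openai", "streamlit", "lancedb")


def installed_versions(names: Iterable[str] = PACKAGES) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

Comparing this output with the versions mentioned in an issue thread is one way to judge whether a reported error is likely to apply to a given environment.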