Doramagic Project Pack · Human Manual
promptfoo
Promptfoo is described in its manifest as an "LLM eval & testing toolkit" distributed as a Node.js ES module with dual entry points for import and require, and ships CLI binaries promptfoo...
Core Evaluation Engine & Architecture
Related topics: LLM Provider Ecosystem & Custom Integrations, Web UI, Code Scanning, Server & Deployment
Purpose and Scope
Promptfoo is a comprehensive LLM evaluation and testing toolkit, distributed as a Node.js package with two binary entry points (promptfoo and pf) defined in package.json. The package targets Node.js ^20.20.0 || >=22.22.0, uses ES modules by default, and is organized as a monorepo with workspaces for src/app and site directories. Its core evaluation engine is responsible for orchestrating prompt execution, provider invocation, assertion grading, and result aggregation across the many supported LLM backends.
The engine is designed around three interacting subsystems:
- Configuration and provider resolution — declarative YAML/JSON configurations that reference any of 50+ supported model providers.
- Test execution and concurrency — scheduled fan-out of prompt/test-case combinations with retries, caching, and timeout handling.
- Assertion grading and result reporting — pluggable graders ranging from deterministic string matchers to LLM-as-judge rubrics and multimodal scorers.
MCP Tool Surface
Promptfoo exposes its evaluation capabilities as a set of Model Context Protocol (MCP) tools, allowing AI agents to drive evaluations programmatically. The tool registry in src/commands/mcp/lib/toolRegistry.ts acts as the single source of truth for tool metadata. Core evaluation tools include list_evaluations, get_evaluation_details, run_evaluation, and share_evaluation, each decorated with MCP-spec annotations (readOnlyHint, idempotentHint, longRunningHint) that follow the 2025-03-26 specification.
run_evaluation accepts a configPath, optional testCaseIndices filter (single index, array, or {start, end} range), promptFilter, providerFilter, maxConcurrency (1–20), and timeoutMs (1s–5min). Result pagination uses resultLimit and resultOffset, with a default page size of 20.
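The parameter bounds above can be illustrated with a small validator. This is only a sketch: the type names, `normalizeIndices`, `validateArgs`, and `paginate` are hypothetical helpers rather than the tool's actual schema, and treating the `{start, end}` range as inclusive is an assumption.

```typescript
// Hypothetical shapes mirroring the parameters described above; the real
// schema lives in toolRegistry.ts and may differ.
type TestCaseIndices = number | number[] | { start: number; end: number };

interface RunEvaluationArgs {
  configPath: string;
  testCaseIndices?: TestCaseIndices;
  maxConcurrency?: number; // documented range: 1 to 20
  timeoutMs?: number; // documented range: 1 s to 5 min
  resultLimit?: number; // documented default page size: 20
  resultOffset?: number;
}

// Expand any of the three accepted filter forms into a flat index list.
// Assumption: the {start, end} range is treated as inclusive here.
function normalizeIndices(filter: TestCaseIndices): number[] {
  if (typeof filter === "number") return [filter];
  if (Array.isArray(filter)) return [...filter];
  const out: number[] = [];
  for (let i = filter.start; i <= filter.end; i++) out.push(i);
  return out;
}

// Return a list of human-readable problems; an empty list means the
// arguments fall inside the documented bounds.
function validateArgs(args: RunEvaluationArgs): string[] {
  const problems: string[] = [];
  if (args.configPath.trim() === "") problems.push("configPath is required");
  const c = args.maxConcurrency;
  if (c !== undefined && (!Number.isInteger(c) || c < 1 || c > 20)) {
    problems.push("maxConcurrency must be an integer from 1 to 20");
  }
  const t = args.timeoutMs;
  if (t !== undefined && (t < 1_000 || t > 300_000)) {
    problems.push("timeoutMs must be between 1000 and 300000");
  }
  return problems;
}

// Slice one page out of a result list using the documented default of 20.
function paginate<T>(results: T[], limit = 20, offset = 0): T[] {
  return results.slice(offset, offset + limit);
}
```

Checking bounds before dispatch lets an agent surface a readable error instead of waiting for a long-running call to fail.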
Supporting utilities in src/commands/mcp/lib/utils.ts standardize responses: createToolResponse wraps payloads into TextContent JSON blocks with success, timestamp, and isError fields; withTimeout races a promise against a 5-minute timer; and safeStringify handles circular references and BigInt values encountered during provider response serialization.
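To make the serialization concern concrete, here is a minimal sketch of one way to handle circular references and BigInt values with a `JSON.stringify` replacer. It is not promptfoo's `safeStringify`; the function name, the `"[Circular]"` placeholder, and the string encoding of BigInt are all assumptions.

```typescript
// Sketch only: serialize values that plain JSON.stringify rejects.
// The "[Circular]" placeholder and string-encoded BigInt are assumptions,
// not necessarily what promptfoo's safeStringify emits.
function safeStringifySketch(value: unknown): string {
  const seen = new WeakSet<object>();
  return JSON.stringify(value, (_key, val: unknown) => {
    // JSON has no BigInt type, so emit the decimal digits as a string.
    if (typeof val === "bigint") return val.toString();
    if (typeof val === "object" && val !== null) {
      // Any object encountered a second time is replaced. This breaks
      // cycles, but it also collapses shared non-cyclic references.
      if (seen.has(val)) return "[Circular]";
      seen.add(val);
    }
    return val;
  });
}
```

Without a replacer like this, `JSON.stringify` throws a `TypeError` on both a self-referencing object and a BigInt value.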
Provider and Assertion Architecture
The engine is provider-agnostic. Each provider integration (OpenAI, Anthropic, Google Vertex, xAI, Cerebras, Bedrock, Codex SDK, Claude Agent SDK, OpenCode SDK, and more) implements a uniform call interface that returns a normalized output plus latency. Provider-specific features are surfaced through configuration blocks, for example Anthropic's output_format JSON schema in examples/anthropic/structured-outputs/README.md, xAI's image and search tools in examples/xai/chat/README.md, and OpenAI's MCP tools array in examples/openai-mcp/README.md.
Assertions operate on the normalized provider output. The schema-driven is-valid-openai-tools-call assertion validates both function and MCP tool invocations, while weighted compositions such as contains-any plus llm-rubric allow multi-layered grading. The release history indicates that 0.121.15 added multimodal output grading (#9617) and that 0.121.12 added repeat-aware fetch cache options (#6844) — both directly relevant to community requests for per-test-case repeat semantics discussed in #9700.
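As a rough illustration of weighted composition, the sketch below combines per-assertion scores into one number with a weighted average. The `AssertionResultSketch` shape and the averaging rule are assumptions made for illustration; the engine's actual aggregation may differ.

```typescript
// Hypothetical per-assertion result: a score in [0, 1] and a relative weight.
interface AssertionResultSketch {
  type: string;
  score: number;
  weight: number;
}

// Weighted average of assertion scores. Returns 0 when no weight is present
// so that an empty or all-zero-weight list never divides by zero.
function weightedScore(results: AssertionResultSketch[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0;
  const weightedSum = results.reduce((sum, r) => sum + r.score * r.weight, 0);
  return weightedSum / totalWeight;
}
```

With a cheap deterministic check at weight 1 and a rubric at weight 3, the rubric dominates the combined score while the deterministic check still contributes.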
Redteam Subsystem
The redteam layer reuses the core engine but adds iterative attack providers. The README in src/redteam/providers/README.md documents an iteration loop with on-topic checking, target invocation, vision-based judging, score comparison, and an unblocking feature that detects when a target model asks a clarifying question and generates an automated response. The examples/redteam-guardrails/README.md preset bundles 42 plugins spanning prompt injection, harmful content, and PII — each plugin contributes test cases that the same evaluation engine grades, demonstrating how the architecture cleanly separates plugin definitions from execution.
    flowchart LR
        Config[YAML Config] --> Engine[Eval Engine]
        MCP[MCP Tool Call] --> Engine
        Engine --> Providers[Provider Adapters]
        Providers --> Graders[Assertion Graders]
        Graders --> Results[Aggregated Results]
        Redteam[Redteam Plugins] --> Engine
        Engine --> Cache[(Fetch Cache)]
Configuration Flexibility
TypeScript configs are supported via loaders, as noted in examples/config-ts/README.md, which uses Zod schemas for structured outputs across OpenAI and Gemini providers. Node.js 20+ requires the --import flag or a tsx loader; native TypeScript execution is a future direction.
See Also
- Provider-specific examples under examples/*/README.md
- src/redteam/providers/README.md for adversarial orchestration
- Release notes at github.com/promptfoo/promptfoo/releases for engine-level changes
Source: https://github.com/promptfoo/promptfoo / Human Manual
LLM Provider Ecosystem & Custom Integrations
Related topics: Core Evaluation Engine & Architecture, Red Teaming & Adversarial Security Testing
Overview
Promptfoo is described in its manifest as an "LLM eval & testing toolkit" distributed as a Node.js ES module with dual entry points for import and require, and ships CLI binaries promptfoo and pf (Source: package.json). Beyond its core evaluation engine, the project maintains a large surface area of provider adapters that let users evaluate prompts against hosted APIs, open-source models served through gateways, agent runtimes, and protocol-based tool servers. The ecosystem is intentionally heterogeneous: it covers commercial chat models, image generators, TTS providers, embeddings/vectorization endpoints, agent SDKs, Model Context Protocol (MCP) tool servers, and bespoke self-hosted gateways.
Provider adapters are exposed through example directories under examples/ that pair a runnable promptfooconfig.yaml with prose documentation, allowing users to bootstrap new evaluations through npx promptfoo@latest init --example <name> (Source: examples/provider-atlascloud/README.md).
Provider Categories
The provider ecosystem organizes adapters into several functional groups, each with distinct configuration patterns and runtime expectations.
| Category | Representative Providers | Defining Trait |
|---|---|---|
| Hosted chat / completion | OpenAI, Anthropic (Claude), Google Vertex, Cerebras, Atlas Cloud, Replicate-hosted Llama | API key + base URL; standard generation parameters (temperature, max_tokens, top_p) |
| Multimodal generation | openai:image:gpt-image-1.5, openai:image:gpt-image-1, ElevenLabs TTS, QuiverAI image pipelines | Size/quality/format options; per-image moderation; pronunciation dictionaries |
| Agent runtimes | Claude Agent SDK, OpenAI Codex app-server, OpenAI Deep Research | Spawn subprocess; permission policy and sandbox mode controls |
| Tool / protocol servers | OpenAI MCP integration, A2A provider | Server URL, allowed tools list, approval policy |
| Red-team adversarial targets | Custom, Crescendo, GOAT red-team providers | Multi-turn loop with on-topic scoring and feedback iteration |
| Documentation site | Docusaurus 2 build with llms-full.txt / llms.txt plugin | Static site generation for promptfoo.dev |
Source: examples/openai-images/README.md, examples/provider-elevenlabs/tts-advanced/README.md, examples/claude-agent-sdk/README.md, examples/openai-mcp/README.md, src/redteam/providers/README.md, site/README.md.
Configuration Patterns
Hosted Model Providers
Hosted providers expose a consistent id:<model> shorthand and a config: block for transport and sampling parameters. Cerebras demonstrates JSON schema enforcement on structured outputs, returning deterministic fields such as cuisine, difficulty, and ingredients (Source: examples/provider-cerebras/README.md). Atlas Cloud shows that OpenAI-compatible gateways can be retargeted via apiBaseUrl and a custom apiKeyEnvar, enabling evaluation across model families behind a single provider account (Source: examples/provider-atlascloud/README.md).
Google Vertex configurations are split per model family (promptfooconfig.gemini.yaml, promptfooconfig.claude.yaml, promptfooconfig.llama.yaml, promptfooconfig.search.yaml, promptfooconfig.response-schema.yaml) and run with promptfoo eval -c <file> (Source: examples/google-vertex/README.md). Replicate examples highlight typical model knobs such as temperature, max_tokens, and top_p while documenting that cold-start latency for some models can reach 30–60 seconds (Source: examples/provider-replicate/llama4-scout/README.md).
Tool and Protocol Integrations
The OpenAI MCP example shows how to declare remote tool servers in YAML:
    tools:
      - type: mcp
        server_label: deepwiki
        server_url: https://mcp.deepwiki.com/mcp
        allowed_tools: ['ask_question', 'read_wiki_structure']
        require_approval: never
Approval can be scoped granularly with require_approval.never.tool_names so that only specific tools bypass confirmation. Assertion patterns for MCP workflows include is-valid-openai-tools-call for structural validation of function and MCP tool invocations, plus content-level checks such as contains: 'MCP Tool Result' and not-contains: 'MCP Tool Error' (Source: examples/openai-mcp/README.md).
Anthropic's web tools example pairs web_search_20250305 and web_fetch_20250910 tool definitions with allowed_domains, citation control, and max_content_tokens to bound retrieval (Source: examples/anthropic/web-tools/README.md).
Agent Runtime Providers
The Claude Agent SDK example surfaces several advanced controls: sandbox configuration with network restrictions, JavaScript runtime selection (node, bun, deno), extra CLI arguments, setting sources for SDK initialization, permission bypass for automated testing, and AskUserQuestion handling via ask_user_question.behavior plus append_allowed_tools (Source: examples/claude-agent-sdk/README.md). The OpenAI Codex app-server provider starts its own codex app-server subprocess rather than attaching to an already-running Codex Desktop process, and defaults to sandbox_mode: read-only, approval_policy: never, and skip_git_repo_check: true (Source: examples/openai-codex-app-server/README.md). OpenAI Deep Research documents that reasoning models burn significant tokens and recommends validating citation URLs before relying on outputs (Source: examples/openai-deep-research/README.md).
TypeScript Configurations
Promptfoo supports TypeScript configuration via external loaders; Node.js 20+ can run ES modules with the --import flag, and tsx is recommended for the best developer experience. Configs can use Zod schemas that the system automatically adapts for different providers (OpenAI and Gemini), enforcing strict schema adherence for structured JSON outputs (Source: examples/config-ts/README.md).
MCP Server-Side Surface
Promptfoo also exposes itself as an MCP tool server. The TOOL_DEFINITIONS array in toolRegistry.ts is the single source of truth for tool documentation, including list_evaluations, get_evaluation_details, run_evaluation, and share_evaluation. Each entry declares parameters, annotations such as readOnlyHint, idempotentHint, and longRunningHint, and a category such as evaluation (Source: src/commands/mcp/lib/toolRegistry.ts). run_evaluation is annotated as longRunningHint: true and accepts configPath, testCaseIndices, promptFilter, providerFilter, maxConcurrency (1–20), timeoutMs (1s–5min), resultLimit, and resultOffset, reflecting how external agents can drive evaluations through MCP.
Companion Tooling
Outside the core CLI, the repository ships code-scan-action, a GitHub Action that uses AI agents to find LLM-related vulnerabilities such as prompt injection, PII exposure, and excessive agency. It can post PR review comments and, with sarif-output-path configured, surface findings in GitHub Code Scanning via github/codeql-action/upload-sarif (Source: code-scan-action/README.md). The accompanying documentation site is built with Docusaurus 2 and includes a custom plugin that emits llms-full.txt and llms.txt so LLM-based tools can analyze or search the documentation corpus (Source: site/README.md).
Recent Ecosystem Evolution
Release notes show continuous provider expansion:
- 0.121.8 added Claude Agent SDK bumps and a GPT model.
- 0.121.9 added gpt-5.5 model support.
- 0.121.11 brought QuiverAI Arrow 1.1, a vectorize endpoint, and a GPT Image-2 pipeline.
- 0.121.12 introduced repeat-aware fetch cache options.
- 0.121.13 added the MiniMax provider and Claude Opus 4.8 support.
- 0.121.14 added the A2A provider and an agent-rubric grader.
- 0.121.15 enabled grading of multimodal outputs.
Community requests such as a per-test-case repeat option (#9700) reflect ongoing demand for finer control over how many times individual tests run, particularly for negative assertions that may otherwise pass by chance.
Source: https://github.com/promptfoo/promptfoo / Human Manual
Red Teaming & Adversarial Security Testing
Related topics: LLM Provider Ecosystem & Custom Integrations, Web UI, Code Scanning, Server & Deployment
Purpose and Scope
Promptfoo's red team module is a comprehensive framework for adversarial security testing of LLM applications and foundation models. It enables developers and security teams to systematically probe AI systems for vulnerabilities, safety issues, and policy violations before deployment, replacing ad-hoc manual probing with a reproducible, automated evaluation pipeline. Source: README.md:1-15
The framework targets a broad spectrum of systems:
- Foundation models such as DeepSeek R1 0528 and GPT-5.4, evaluated side-by-side for security responses. Source: examples/redteam-deepseek-foundation/README.md:1-25
- Agentic systems including autonomous coding agents tested for prompt injection, terminal output injection, secret exfiltration, sandbox escapes, and verifier sabotage. Source: examples/redteam-coding-agent/README.md:1-30
- MCP-enabled assistants that use Model Context Protocol for tool discovery, parameter handling, and recursive function execution. Source: examples/redteam-mcp/README.md:1-20
- Multi-modal models that accept image inputs, evaluated with UnsafeBench and VLGuard datasets. Source: examples/redteam-multi-modal/README.md:1-55
- API-driven applications instrumented with OWASP API Security Top 10 vulnerabilities. Source: examples/redteam-api-top-10/README.md:1-15
Core Architecture
The red team system is organized into four cooperating components: plugins, strategies, providers, and tracing.
    flowchart LR
        A[promptfooconfig.yaml] --> P[Plugins]
        A --> S[Strategies]
        P --> T[Test Cases]
        S --> T
        T --> PR[Providers]
        PR --> TM[Target Model]
        PR --> AM[Adversarial Model]
        TM --> G[Grading & Assertions]
        AM --> G
        G --> R[Reports]
Plugins
Plugins generate adversarial test cases from curated datasets or heuristic generators. The guardrails preset, for instance, activates 42 plugins spanning prompt injection, jailbreaking, harmful content, system prompt override, and prompt extraction. Source: examples/redteam-guardrails/README.md:10-40 Other notable plugins include HarmBench for standardized jailbreak evaluation and VLGuard for multi-modal content moderation. Source: examples/redteam-harmbench/README.md:1-20, Source: examples/redteam-multi-modal/README.md:35-55
Strategies
Strategies modify or wrap test cases to maximize attack success. The iterative provider generates adversarial prompts and refines them based on a judge's feedback loop, penalizing certain phrases and enforcing on-topic checks. Source: src/redteam/providers/README.md:1-30 Other strategies include indirect-web-pwn, which tests data exfiltration by injecting hidden instructions into fetched web pages. Source: examples/redteam-indirect-web-pwn/README.md:1-25
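The refine-and-judge loop can be sketched as follows. Every name here (`IterativeHooks`, `runIterativeSketch`, the hook signatures, the stop threshold) is hypothetical and stands in for the attacker, target, and judge models; the sketch omits phrase penalties and any vision-based judging.

```typescript
// Hypothetical hooks standing in for the attacker, target, and judge models.
interface IterativeHooks {
  generateAttack(goal: string, feedback: string): Promise<string>;
  isOnTopic(goal: string, attack: string): Promise<boolean>;
  callTarget(attack: string): Promise<string>;
  judge(goal: string, response: string): Promise<{ score: number; feedback: string }>;
}

interface BestAttempt {
  attack: string;
  response: string;
  score: number;
}

// Generate, check, send, judge, and keep the highest-scoring attempt.
async function runIterativeSketch(
  goal: string,
  hooks: IterativeHooks,
  maxIterations: number,
  stopAtScore: number,
): Promise<BestAttempt | undefined> {
  let best: BestAttempt | undefined;
  let feedback = "";
  for (let i = 0; i < maxIterations; i++) {
    const attack = await hooks.generateAttack(goal, feedback);
    // Off-topic prompts never reach the target; the attacker is nudged back.
    if (!(await hooks.isOnTopic(goal, attack))) {
      feedback = "Stay on topic.";
      continue;
    }
    const response = await hooks.callTarget(attack);
    const verdict = await hooks.judge(goal, response);
    // Score comparison: only a strictly better attempt replaces the best.
    if (best === undefined || verdict.score > best.score) {
      best = { attack, response, score: verdict.score };
    }
    if (verdict.score >= stopAtScore) break;
    feedback = verdict.feedback;
  }
  return best;
}
```

Injecting the three models as hooks keeps the loop testable with deterministic stubs, which is how the sketch can be exercised without any API keys.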
Providers
Providers supply the target under test and any auxiliary adversarial or grading models. Recent releases added MiniMax, A2A, and broader multi-modal support (community release notes v0.121.13–0.121.15). For OpenAI-hosted MCP integration, providers expose tool configuration including server_label, server_url, allowed_tools, and selective approval policies. Source: examples/openai-mcp/README.md:1-30
Tracing
Tracing captures internal execution spans (LLM calls, guardrail decisions, tool invocations) and feeds them into attack generation or grading. Trace data is sanitizable to strip sensitive attributes and can be filtered by span name pattern or limited by depth and count. Source: examples/redteam-tracing-example/README.md:1-50
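A minimal sketch of the filtering and sanitizing described above, assuming a simplified span shape. `SpanSketch`, `TraceFilterSketch`, and `sanitizeSpans` are hypothetical names and do not reflect promptfoo's tracing configuration keys.

```typescript
// Hypothetical, simplified span: a name, a nesting depth, and string attributes.
interface SpanSketch {
  name: string;
  depth: number;
  attributes: Record<string, string>;
}

interface TraceFilterSketch {
  namePattern: RegExp; // use a non-global pattern; /g makes test() stateful
  maxDepth: number;
  maxSpans: number;
  sensitiveKeys: string[];
}

// Keep spans matching the name pattern and depth limit, cap the count,
// then return copies with sensitive attributes removed.
function sanitizeSpans(spans: SpanSketch[], filter: TraceFilterSketch): SpanSketch[] {
  return spans
    .filter((s) => filter.namePattern.test(s.name) && s.depth <= filter.maxDepth)
    .slice(0, filter.maxSpans)
    .map((s) => ({
      ...s,
      attributes: Object.fromEntries(
        Object.entries(s.attributes).filter(([key]) => !filter.sensitiveKeys.includes(key)),
      ),
    }));
}
```

The function returns new span objects, so the original trace stays intact for debugging while only the sanitized copy is handed to attack generation or grading.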
Configuration and Usage
Red team evaluations are configured in promptfooconfig.yaml and executed via:
    npx promptfoo@latest redteam run
    npx promptfoo@latest view
Source: examples/redteam-deepseek-foundation/README.md:25-45, Source: examples/redteam-indirect-web-pwn/README.md:30-45
A representative configuration specifies plugins, strategies, and a target provider. For example, tracing can be enabled globally or overridden per strategy so that crescendo-style attacks see guardrail decisions while GOAT-style attacks see tool call data. Source: examples/redteam-tracing-example/README.md:10-50
For coding agents, the target provider's working_dir should point at a disposable checkout, and synthetic canary secrets should be used in place of production credentials. Source: examples/redteam-coding-agent/README.md:25-40
Common Failure Modes
When running red team evaluations, watch for:
- Missing API credentials: Foundation-model targets require provider keys (e.g., OPENROUTER_API_KEY, OPENAI_API_KEY); absent keys cause silent skips. Source: examples/redteam-deepseek-foundation/README.md:15-25
- Remote-generation dependency: Plugin collections like coding-agent:core require the remote generation service. Setting PROMPTFOO_DISABLE_REDTEAM_REMOTE_GENERATION=true disables these collections entirely. Source: examples/redteam-coding-agent/README.md:20-30
- Credential leakage in traces: Trace spans can capture sensitive payloads; sanitize attributes when capturing them. Source: examples/redteam-tracing-example/README.md:30-50
- Flaky pass/fail on negative assertions: A model may pass a refusal check once by chance; the community has requested a per-test-case repeat option to verify consistency (community issue #9700). Recent cache changes in v0.121.12 added repeat-aware fetch options, supporting this direction.
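The flakiness concern in the last item can be made concrete with a small consistency check: run the same check several times and pass only if every run passes. `passesConsistently` is a hypothetical helper for illustration, not an existing promptfoo option.

```typescript
// Run a check `repeat` times and pass only if no run fails.
// A single lucky pass therefore cannot mask an intermittent failure.
function passesConsistently(
  check: () => boolean,
  repeat: number,
): { passed: boolean; failures: number } {
  let failures = 0;
  for (let i = 0; i < repeat; i++) {
    if (!check()) failures++;
  }
  return { passed: failures === 0, failures };
}
```

Reporting the failure count alongside the verdict shows how often a refusal check slipped, which is more informative than a bare pass or fail.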
Community-Driven Evolution
Recent releases show the red team module expanding along community-requested lines:
- Multimodal grading (v0.121.15) extends assertions to non-text outputs.
- Agent-rubric grader and A2A provider (v0.121.14) broaden agentic evaluation.
- Repeat-aware fetch cache (v0.121.12) supports reliability checks via repeated runs.
- Humanized per-cell latency (v0.121.10) improves result-table readability.
These changes illustrate promptfoo's pattern of treating red teaming as an evolving, community-shaped discipline rather than a static checklist.
See Also
- Providers and Model Integration
- Assertions and Grading
- Configuration Reference
- CLI Commands
- Tracing and Observability
Source: https://github.com/promptfoo/promptfoo / Human Manual
Web UI, Code Scanning, Server & Deployment
Related topics: Core Evaluation Engine & Architecture, Red Teaming & Adversarial Security Testing
Promptfoo ships as a multi-surface product: a CLI for evaluations, a React-based web UI for browsing results, a Docusaurus documentation site, an MCP server for tool-based interactions, and a dedicated GitHub Action for code scanning. This page describes the moving parts of those surfaces, how they relate to the CLI entry point, and the most important deployment and operational considerations.
Repository Surfaces and Workspaces
The npm package promptfoo (currently 0.121.17) declares two npm workspaces in addition to the core CLI package: src/app and site. This structure signals that the evaluation Web UI and the public documentation site are first-class deliverables shipped alongside the evaluator (package.json).
The CLI exposes two binary entry points, promptfoo and pf, both pointing at dist/src/entrypoint.js. The package supports both ESM and CJS consumers via conditional exports, and ships a separate ./contracts entry for shared type definitions. Node engines are pinned to ^20.20.0 || >=22.22.0, which is the supported runtime for the CLI, the server, and the bundled UIs (package.json).
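To show what the engines range admits, the sketch below hand-rolls a check for ^20.20.0 || >=22.22.0 without a semver library. `parseVersion` and `isSupportedNode` are hypothetical helpers, and prerelease tags are ignored as a simplifying assumption.

```typescript
// Parse "v20.20.0" or "20.20.0" into numeric parts; prerelease tags are ignored.
function parseVersion(version: string): [number, number, number] {
  const match = /^v?(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (!match) throw new Error(`Unrecognized version: ${version}`);
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// True when the version satisfies ^20.20.0 || >=22.22.0.
function isSupportedNode(version: string): boolean {
  const [major, minor] = parseVersion(version);
  if (major === 20) return minor >= 20; // ^20.20.0 means >=20.20.0 and <21.0.0
  if (major === 22) return minor >= 22; // >=22.22.0 within the 22.x line
  return major > 22; // any later major also satisfies >=22.22.0
}
```

A startup guard could call `isSupportedNode(process.version)` and exit early with a clear message, rather than failing later on an unsupported runtime.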
The published files whitelist deliberately excludes dist/test, mocks, and source maps, so production deployments only carry the compiled evaluator, the bundled Web UI assets, and the MCP server surface.
Web UI and Documentation Site
Evaluation Web UI (`src/app`)
The src/app workspace hosts the React-based evaluation viewer launched by promptfoo view. Although the workspace README is not part of the fetched context, the package wiring confirms it is built and shipped as a workspace artifact. Release 0.121.10 added "humanize per-cell latency in eval results table" to the Web UI, showing the surface is actively maintained for readability of evaluation output (release 0.121.10). The viewer is intended for inspecting prompts, providers, test cases, and per-cell latency / pass-rate metrics produced by the evaluator.
Companion example READMEs demonstrate the kinds of outputs the UI is expected to render — for example, the eval-markdown-rendering example is explicitly paired with promptfoo eval followed by promptfoo view, indicating the viewer is the canonical way to read structured and markdown-bearing evaluation results (examples/eval-markdown-rendering/README.md).
Documentation Site (`site`)
The public documentation is a Docusaurus 2 site. The standard commands are npm install, npm start for local development (with hot reload), and npm run build for static output (site/README.md).
A custom Docusaurus plugin generates two auxiliary files at build time:
- llms-full.txt: a concatenated dump of all docs content, useful for LLM-based search and indexing.
- llms.txt: a structured index (titles, paths, descriptions) of the documentation set.
Both files are emitted into the build root and require no extra configuration from site authors (site/README.md).
MCP Server Surface
The MCP (Model Context Protocol) server is implemented under src/commands/mcp and is the integration point for external agents that want to drive promptfoo programmatically. The tool catalog is centralized in a single TOOL_DEFINITIONS array declared in src/commands/mcp/lib/toolRegistry.ts, which is the authoritative source for tool names, descriptions, parameter signatures, and MCP annotations (readOnlyHint, idempotentHint, longRunningHint).
The published tools fall into four categories:
| Category | Examples | Notable Parameters |
|---|---|---|
| Evaluation | list_evaluations, get_evaluation_details, run_evaluation, share_evaluation | Pagination, dataset filter, test-case indices, prompt/provider filters, concurrency, timeout |
| Red team | (redteam tools, defined in same registry) | Dataset-scoped operations |
| Configuration | Prompt / provider authoring helpers | Static promptfooconfig discovery |
| Meta | Server health, capability introspection | Read-only, idempotent |
Each run_evaluation call accepts maxConcurrency (1–20) and timeoutMs (1s–5min) directly from the tool signature, exposing runtime controls over the underlying evaluator (src/commands/mcp/lib/toolRegistry.ts).
Server utilities in src/commands/mcp/lib/utils.ts standardize the wire format:
- createToolResponse(tool, success, data?, error?) wraps payloads in a { tool, success, timestamp, data?, error? } envelope and serializes them as a single text content block. isError is set to the inverse of success, so MCP clients can branch on it.
- DEFAULT_TOOL_TIMEOUT_MS is fixed at 5 minutes and is the value used when callers omit timeoutMs.
- withTimeout(promise, ms, message) races the underlying promise against a setTimeout reject and always clears the timer in a .finally, preventing leaked handles on long-running tools such as run_evaluation.
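The envelope and timeout behavior described above can be sketched as follows. The `Sketch`-suffixed names are hypothetical, the ISO-8601 timestamp format is an assumption, and the real helpers in utils.ts may differ in detail.

```typescript
// Sketch of the response envelope and timeout helper described above.
// Field names follow the description; everything else is an assumption.
interface ToolResponseSketch {
  content: { type: "text"; text: string }[];
  isError: boolean;
}

function createToolResponseSketch(
  tool: string,
  success: boolean,
  data?: unknown,
  error?: string,
): ToolResponseSketch {
  // Assumption: the timestamp is an ISO-8601 string. Undefined data or
  // error fields are dropped by JSON.stringify.
  const envelope = { tool, success, timestamp: new Date().toISOString(), data, error };
  return {
    content: [{ type: "text", text: JSON.stringify(envelope) }],
    isError: !success,
  };
}

function withTimeoutSketch<T>(promise: Promise<T>, ms: number, message: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(message)), ms);
  });
  // Whichever settles first wins; the timer is cleared either way so no
  // handle outlives the call.
  return Promise.race([promise, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

Clearing the timer in `finally` matters on the fast path: without it, a call that finishes in milliseconds would still keep a five-minute timer alive in the process.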
    flowchart LR
        Agent[External Agent / IDE] -->|MCP request| Server[MCP Server<br/>src/commands/mcp]
        Server -->|lookup| Registry[TOOL_DEFINITIONS<br/>toolRegistry.ts]
        Server -->|wrap response| Utils[createToolResponse<br/>utils.ts]
        Server -->|invoke| Eval[promptfoo evaluator]
        Eval -->|results| UI[Web UI<br/>src/app]
        Eval -->|docs / examples| Site[Docusaurus site]
        Server -->|scan request| CSA[code-scan-action]
Code Scanning GitHub Action
The code-scan-action package is a standalone GitHub Action that wraps promptfoo's code-scans run CLI. Its entry point builds the CLI invocation in code-scan-action/src/main.ts, composing the following positional and flag arguments:
- repoPath (defaulting to GITHUB_WORKSPACE or cwd),
- --config <path>,
- --base <baseBranch>,
- --compare HEAD,
- --json for structured output,
- --github-pr owner/repo#number to link findings to the originating PR,
- an optional --api-host when a self-hosted promptfoo API is configured.
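The argument list above could be assembled along these lines. This is a sketch rather than the code in main.ts: `ScanInputsSketch` and `buildScanArgs` are hypothetical names, and the exact argument order is an assumption.

```typescript
// Hypothetical inputs; only apiHost is optional, matching the list above.
interface ScanInputsSketch {
  repoPath: string;
  configPath: string;
  baseBranch: string;
  githubPr: string; // owner/repo#number
  apiHost?: string;
}

// Build the argv for the code-scans run subcommand as an array, so no
// shell quoting is needed when it is handed to a process spawner.
function buildScanArgs(inputs: ScanInputsSketch): string[] {
  const args = [
    "code-scans",
    "run",
    inputs.repoPath,
    "--config",
    inputs.configPath,
    "--base",
    inputs.baseBranch,
    "--compare",
    "HEAD",
    "--json",
    "--github-pr",
    inputs.githubPr,
  ];
  // The API host flag is appended only for self-hosted deployments.
  if (inputs.apiHost) args.push("--api-host", inputs.apiHost);
  return args;
}
```

Keeping the arguments as an array avoids escaping problems with values such as owner/repo#number, where `#` would start a comment in many shells.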
Recent releases expand the deployment surface of the action:
- 0.1.6 added SARIF output support, so findings can be ingested by GitHub code-scanning dashboards and other SARIF consumers (release code-scan-action-0.1.6).
- 0.1.7 fixed structured output for the fork-PR skip path, ensuring the action still emits a parseable response when the run is skipped on forks (release code-scan-action-0.1.7).
For local development and CI dry-runs, createMockScanResponse() produces a deterministic ScanResponse with two representative findings (a hardcoded API key and a SQL injection). This makes it easy to validate downstream consumers (PR comment rendering, SARIF uploaders) without calling the real scanner (code-scan-action/src/main.ts).
Deployment Considerations
- Node runtime. Pin to Node ^20.20.0 or >=22.22.0; 20.x releases older than 20.20.0 and Node 21 are not supported (package.json).
- MCP timeouts. The default 5-minute ceiling in utils.ts is the longest a single tool call may run before withTimeout rejects it. Long-running redteam or evaluation jobs should be split into smaller batches, or the caller's timeout increased in lockstep with run_evaluation.timeoutMs.
- Code scan in forks. Forked pull requests cannot access repo secrets; the action's structured skip output (since 0.1.7) is what consumers should rely on for a clean no-op signal (release code-scan-action-0.1.7).
- Web UI assets. The src/app workspace ships prebuilt assets; deploying the evaluator in a container typically means copying dist/ and pointing the host port at the server's bind address.
- Docs site. The Docusaurus build emits llms-full.txt and llms.txt automatically, so any static host (e.g., Netlify, S3 + CloudFront) can serve the documentation without additional post-processing (site/README.md).
See Also
- Anthropic Web Tools example — provider configuration that pairs naturally with the Web UI's trace viewer.
- Anthropic Structured Outputs example — exercises schema-validated tool calls surfaced in the evaluator and UI.
- OpenAI Codex App Server example — agentic provider whose sessions are visible in the eval viewer.
Source: https://github.com/promptfoo/promptfoo / Human Manual
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 19 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.
1. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Developers should check this installation risk before relying on the project: 0.121.8
- User impact: Upgrade or migration may change expected behavior: 0.121.8
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: 0.121.8. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_release | https://github.com/promptfoo/promptfoo/releases/tag/0.121.8
2. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Developers should check this installation risk before relying on the project: code-scan-action: 0.1.6
- User impact: Upgrade or migration may change expected behavior: code-scan-action: 0.1.6
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: code-scan-action: 0.1.6. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_release | https://github.com/promptfoo/promptfoo/releases/tag/code-scan-action-0.1.6
3. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.host_targets | https://github.com/promptfoo/promptfoo
4. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Developers should check this configuration risk before relying on the project: 0.121.15
- User impact: Upgrade or migration may change expected behavior: 0.121.15
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: 0.121.15. Context: Observed during version upgrade or migration.
- Evidence: failure_mode_cluster:github_release | https://github.com/promptfoo/promptfoo/releases/tag/0.121.15
5. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Developers should check this configuration risk before relying on the project: Per-test-case repeat option to control how many times individual tests run
- User impact: Developers may misconfigure credentials, environment, or host setup: Per-test-case repeat option to control how many times individual tests run
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Per-test-case repeat option to control how many times individual tests run. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_issue | https://github.com/promptfoo/promptfoo/issues/9700
6. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/promptfoo/promptfoo/issues/9700
7. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | https://github.com/promptfoo/promptfoo
8. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Developers should check this runtime risk before relying on the project: 0.121.12
- User impact: Upgrade or migration may change expected behavior: 0.121.12
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: 0.121.12. Context: Observed when using Node.js.
- Evidence: failure_mode_cluster:github_release | https://github.com/promptfoo/promptfoo/releases/tag/0.121.12
9. Runtime risk: Runtime risk requires verification
- Severity: medium
- Finding: Developers should check this runtime risk before relying on the project: 0.121.14
- User impact: Upgrade or migration may change expected behavior: 0.121.14
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: 0.121.14. Context: Source discussion did not expose a precise runtime context.
- Evidence: failure_mode_cluster:github_release | https://github.com/promptfoo/promptfoo/releases/tag/0.121.14
10. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Developers should check this migration risk before relying on the project: 0.121.13
- User impact: Upgrade or migration may change expected behavior: 0.121.13
- Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: 0.121.13. Context: Observed when using Node.js.
- Evidence: failure_mode_cluster:github_release | https://github.com/promptfoo/promptfoo/releases/tag/0.121.13
11. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/promptfoo/promptfoo
12. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | https://github.com/promptfoo/promptfoo
Source: Doramagic discovery, validation, and Project Pack records
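The release-specific items above (4, 8, 9, and 10) can be turned into a quick pre-packaging check. The following is a minimal sketch, not part of promptfoo or Doramagic: the helper name flaggedRiskFor and the lookup table are hypothetical, and the table only restates the release numbers and risk categories listed above. It assumes you already know which promptfoo version your project pins.

```typescript
// Releases that the risk list above ties to a finding (items 4, 8, 9, 10).
// This table is an illustration copied from this page, not an official advisory feed.
const flaggedReleases = new Map<string, string>([
  ["0.121.12", "runtime risk"],
  ["0.121.13", "maintenance risk"],
  ["0.121.14", "runtime risk"],
  ["0.121.15", "configuration risk"],
]);

// Hypothetical helper: returns the risk category recorded for a pinned
// promptfoo version, or null when the version is not in the list above.
function flaggedRiskFor(version: string): string | null {
  // Accept inputs such as " v0.121.15 " by trimming and dropping a leading "v".
  const normalized = version.trim().replace(/^v/, "");
  return flaggedReleases.get(normalized) ?? null;
}

console.log(flaggedRiskFor("0.121.15")); // prints "configuration risk"
console.log(flaggedRiskFor("0.121.10")); // prints null
```

A null result only means the version has no release-specific item on this page; the generic checks (reproducing the official install and quickstart path in an isolated environment) still apply.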
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using promptfoo with real data or production workflows.
- Per-test-case repeat option to control how many times individual tests - github / github_issue
- code-scan-action: 0.1.8 - github / github_release
- 0.121.17 - github / github_release
- 0.121.16 - github / github_release
- 0.121.15 - github / github_release
- 0.121.14 - github / github_release
- code-scan-action: 0.1.7 - github / github_release
- 0.121.13 - github / github_release
- code-scan-action: 0.1.6 - github / github_release
- 0.121.12 - github / github_release
- 0.121.11 - github / github_release
- 0.121.10 - github / github_release
Source: Project Pack community evidence and pitfall evidence