promptfoo Manual Preview

Doramagic Project Pack · Human Manual

promptfoo

Core Evaluation Engine & Architecture

Related topics: LLM Provider Ecosystem & Custom Integrations, Web UI, Code Scanning, Server & Deployment

Purpose and Scope

Promptfoo is a comprehensive LLM evaluation and testing toolkit, distributed as a Node.js package with two binary entry points (promptfoo and pf) defined in package.json. The package targets Node.js ^20.20.0 || >=22.22.0, uses ES modules by default, and is organized as a monorepo with workspaces for src/app and site directories. Its core evaluation engine is responsible for orchestrating prompt execution, provider invocation, assertion grading, and result aggregation across the many supported LLM backends.

The engine is designed around three interacting subsystems:

Configuration and provider resolution — declarative YAML/JSON configurations that reference any of 50+ supported model providers.
Test execution and concurrency — scheduled fan-out of prompt/test-case combinations with retries, caching, and timeout handling.
Assertion grading and result reporting — pluggable graders ranging from deterministic string matchers to LLM-as-judge rubrics and multimodal scorers.

MCP Tool Surface

Promptfoo exposes its evaluation capabilities as a set of Model Context Protocol (MCP) tools, allowing AI agents to drive evaluations programmatically. The tool registry in src/commands/mcp/lib/toolRegistry.ts acts as the single source of truth for tool metadata. Core evaluation tools include list_evaluations, get_evaluation_details, run_evaluation, and share_evaluation, each decorated with annotations (readOnlyHint, idempotentHint, longRunningHint) in the style of the 2025-03-26 MCP specification. Of these, readOnlyHint and idempotentHint are standard annotation fields in that specification; longRunningHint is not one of its standard fields.

run_evaluation accepts a configPath, optional testCaseIndices filter (single index, array, or {start, end} range), promptFilter, providerFilter, maxConcurrency (1–20), and timeoutMs (1s–5min). Result pagination uses resultLimit and resultOffset, with a default page size of 20.
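The parameter ranges above can be sketched as a small TypeScript validator. This is a minimal illustration, not the code in toolRegistry.ts: the names RunEvaluationArgs, expandIndices, and normalizeRunArgs are hypothetical, the filter fields are assumed to be plain strings, and the {start, end} range is assumed to be inclusive at both ends.

```typescript
// Hypothetical shapes; only the field names and bounds come from the text above.
type TestCaseIndices = number | number[] | { start: number; end: number };

interface RunEvaluationArgs {
  configPath: string;
  testCaseIndices?: TestCaseIndices;
  promptFilter?: string; // assumed to be a plain string
  providerFilter?: string; // assumed to be a plain string
  maxConcurrency?: number; // 1 to 20
  timeoutMs?: number; // 1 second to 5 minutes
  resultLimit?: number; // page size, default 20
  resultOffset?: number;
}

// Expand any of the three accepted index forms into a flat list.
// Assumption: a {start, end} range is inclusive at both ends.
function expandIndices(sel: TestCaseIndices): number[] {
  if (typeof sel === "number") return [sel];
  if (Array.isArray(sel)) return [...sel];
  const out: number[] = [];
  for (let i = sel.start; i <= sel.end; i++) out.push(i);
  return out;
}

// Reject out-of-range values and fill in the documented default page size.
function normalizeRunArgs(args: RunEvaluationArgs) {
  const { maxConcurrency, timeoutMs } = args;
  if (maxConcurrency !== undefined && (maxConcurrency < 1 || maxConcurrency > 20)) {
    throw new RangeError("maxConcurrency must be between 1 and 20");
  }
  if (timeoutMs !== undefined && (timeoutMs < 1_000 || timeoutMs > 300_000)) {
    throw new RangeError("timeoutMs must be between 1000 and 300000");
  }
  return {
    ...args,
    indices:
      args.testCaseIndices === undefined ? undefined : expandIndices(args.testCaseIndices),
    resultLimit: args.resultLimit ?? 20,
    resultOffset: args.resultOffset ?? 0,
  };
}
```

Validating before dispatch keeps a malformed agent request from starting a long-running evaluation.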

Supporting utilities in src/commands/mcp/lib/utils.ts standardize responses: createToolResponse wraps each payload in a JSON envelope carrying success and timestamp fields, serializes it as a text content block, and sets isError on the result; withTimeout races a promise against a timer (5 minutes by default); and safeStringify handles circular references and BigInt values encountered during provider response serialization.
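To illustrate why a helper like safeStringify is needed, here is a minimal sketch of one way to serialize values that contain cycles or BigInt fields. It is a hypothetical re-implementation, not the code in utils.ts; the "[Circular]" marker and the choice to emit BigInt as a decimal string are assumptions.

```typescript
// Hypothetical re-implementation for illustration; not the code in utils.ts.
function safeStringify(value: unknown, indent = 2): string {
  const seen = new WeakSet<object>();
  return JSON.stringify(
    value,
    (_key, val: unknown) => {
      // JSON.stringify throws on BigInt, so emit its decimal string instead.
      if (typeof val === "bigint") return val.toString();
      if (typeof val === "object" && val !== null) {
        // Assumption: any repeated object reference is replaced with a marker.
        // This also flags shared references that are not true cycles.
        if (seen.has(val)) return "[Circular]";
        seen.add(val);
      }
      return val;
    },
    indent,
  );
}
```

Plain JSON.stringify would throw a TypeError on either input, which is why provider responses are usually passed through a replacer of this kind before being returned to a client.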

Provider and Assertion Architecture

The engine is provider-agnostic. Each provider integration (OpenAI, Anthropic, Google Vertex, xAI, Cerebras, Bedrock, Codex SDK, Claude Agent SDK, OpenCode SDK, and more) implements a uniform call interface that returns a normalized output plus latency. Provider-specific features are surfaced through configuration blocks, for example Anthropic's output_format JSON schema in examples/anthropic/structured-outputs/README.md, xAI's image and search tools in examples/xai/chat/README.md, and OpenAI's MCP tools array in examples/openai-mcp/README.md.
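The idea of a uniform call interface can be sketched as follows. The interface and helper names (ProviderResult, Provider, makeProvider) are hypothetical and much smaller than promptfoo's real provider types; the sketch only shows how an adapter might return a normalized output plus latency regardless of the backend.

```typescript
// Hypothetical minimal shapes for illustration; promptfoo's real provider
// types carry more fields than shown here.
interface ProviderResult {
  output: string;
  latencyMs: number;
  error?: string;
}

interface Provider {
  id(): string;
  call(prompt: string): Promise<ProviderResult>;
}

// Wrap any text-generating function so callers always get the same shape back.
function makeProvider(
  id: string,
  generate: (prompt: string) => Promise<string>,
): Provider {
  return {
    id: () => id,
    async call(prompt) {
      const started = Date.now();
      try {
        const output = await generate(prompt);
        return { output, latencyMs: Date.now() - started };
      } catch (err) {
        // Normalize failures instead of throwing, so grading can still run.
        return { output: "", latencyMs: Date.now() - started, error: String(err) };
      }
    },
  };
}
```

Because every adapter returns the same shape, assertion code never needs to know which backend produced the text.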

Assertions operate on the normalized provider output. The schema-driven is-valid-openai-tools-call assertion validates both function and MCP tool invocations, while weighted compositions such as contains-any plus llm-rubric allow multi-layered grading. The release history indicates that 0.121.15 added multimodal output grading (#9617) and that 0.121.12 added repeat-aware fetch cache options (#6844). The latter is directly relevant to the community request for per-test-case repeat semantics discussed in #9700.
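A weighted composition of a deterministic check and a rubric score can be sketched as below. The GradedAssertion shape and both function names are hypothetical, and the combining rule (a weight-normalized average) is an assumption made for illustration rather than a statement of how the engine aggregates scores.

```typescript
// Hypothetical grader shapes for illustration; not promptfoo's internal types.
interface GradedAssertion {
  type: string;
  weight: number;
  score: number; // 0 to 1
}

// Deterministic layer: score 1 if any candidate substring appears in the output.
function containsAny(output: string, candidates: string[]): number {
  return candidates.some((c) => output.includes(c)) ? 1 : 0;
}

// Assumption: the combined score is the weight-normalized average of the parts.
function weightedScore(results: GradedAssertion[]): number {
  const total = results.reduce((sum, r) => sum + r.weight, 0);
  if (total === 0) return 0;
  return results.reduce((sum, r) => sum + r.score * r.weight, 0) / total;
}

const output = "Paris is the capital of France.";
const combined = weightedScore([
  { type: "contains-any", weight: 1, score: containsAny(output, ["Paris", "paris"]) },
  { type: "llm-rubric", weight: 3, score: 0.5 }, // stand-in for a judge model's score
]);
console.log(combined); // 0.625
```

Giving the rubric a larger weight lets the cheap string check act as a sanity gate while the judge score dominates the result.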

Redteam Subsystem

The redteam layer reuses the core engine but adds iterative attack providers. The README in src/redteam/providers/README.md documents an iteration loop with on-topic checking, target invocation, vision-based judging, score comparison, and an unblocking feature that detects when a target model asks a clarifying question and generates an automated response. The examples/redteam-guardrails/README.md preset bundles 42 plugins spanning prompt injection, harmful content, and PII — each plugin contributes test cases that the same evaluation engine grades, demonstrating how the architecture cleanly separates plugin definitions from execution.

flowchart LR
    Config[YAML Config] --> Engine[Eval Engine]
    MCP[MCP Tool Call] --> Engine
    Engine --> Providers[Provider Adapters]
    Providers --> Graders[Assertion Graders]
    Graders --> Results[Aggregated Results]
    Redteam[Redteam Plugins] --> Engine
    Engine --> Cache[(Fetch Cache)]
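The iteration loop described for the redteam providers can be sketched as a small control loop. Every name here (LoopSteps, iterate, and the step functions) is hypothetical, each step is injected as a function so the flow is visible without any model calls, and the unblocking step is simplified to a single follow-up turn.

```typescript
// Hypothetical step signatures; real providers would call models for each step.
interface LoopSteps {
  propose(goal: string, feedback: string): Promise<string>; // attacker model
  isOnTopic(goal: string, attack: string): Promise<boolean>;
  callTarget(attack: string): Promise<string>;
  judge(goal: string, response: string): Promise<number>; // higher is worse for the target
  isClarifyingQuestion(response: string): Promise<boolean>;
  answerQuestion(question: string): Promise<string>; // automated "unblocking" reply
}

async function iterate(goal: string, steps: LoopSteps, maxIterations: number) {
  let best = { attack: "", response: "", score: -Infinity };
  let feedback = "";
  for (let i = 0; i < maxIterations; i++) {
    const attack = await steps.propose(goal, feedback);
    if (!(await steps.isOnTopic(goal, attack))) {
      feedback = "Previous attempt drifted off topic.";
      continue;
    }
    let response = await steps.callTarget(attack);
    // Unblocking: if the target asks a question, answer it and take the reply.
    if (await steps.isClarifyingQuestion(response)) {
      response = await steps.callTarget(await steps.answerQuestion(response));
    }
    const score = await steps.judge(goal, response);
    // Score comparison: keep only the strongest attempt so far.
    if (score > best.score) best = { attack, response, score };
    feedback = `Last score: ${score}`;
  }
  return best;
}
```

Keeping the steps behind an interface mirrors the separation the section describes: attack generation and judging can change without touching the loop.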

Configuration Flexibility

TypeScript configs are supported via loaders, as noted in examples/config-ts/README.md, which uses Zod schemas for structured outputs across OpenAI and Gemini providers. Node.js 20+ requires the --import flag or a tsx loader; native TypeScript execution is a future direction.

LLM Provider Ecosystem & Custom Integrations

Related topics: Core Evaluation Engine & Architecture, Red Teaming & Adversarial Security Testing

Overview

Promptfoo is described in its manifest as an "LLM eval & testing toolkit" distributed as a Node.js ES module with dual entry points for import and require, and ships CLI binaries promptfoo and pf (Source: package.json). Beyond its core evaluation engine, the project maintains a large surface area of provider adapters that let users evaluate prompts against hosted APIs, open-source models served through gateways, agent runtimes, and protocol-based tool servers. The ecosystem is intentionally heterogeneous: it covers commercial chat models, image generators, TTS providers, embeddings/vectorization endpoints, agent SDKs, Model Context Protocol (MCP) tool servers, and bespoke self-hosted gateways.

Provider adapters are exposed through example directories under examples/ that pair a runnable promptfooconfig.yaml with prose documentation, allowing users to bootstrap new evaluations through npx promptfoo@latest init --example <name> (Source: examples/provider-atlascloud/README.md).

Provider Categories

The provider ecosystem organizes adapters into several functional groups, each with distinct configuration patterns and runtime expectations.

| Category | Representative Providers | Defining Trait |
| --- | --- | --- |
| Hosted chat / completion | OpenAI, Anthropic (Claude), Google Vertex, Cerebras, Atlas Cloud, Replicate-hosted Llama | API key + base URL; standard generation parameters (temperature, max_tokens, top_p) |
| Multimodal generation | `openai:image:gpt-image-1.5`, `openai:image:gpt-image-1`, ElevenLabs TTS, QuiverAI image pipelines | Size/quality/format options; per-image moderation; pronunciation dictionaries |
| Agent runtimes | Claude Agent SDK, OpenAI Codex app-server, OpenAI Deep Research | Spawn subprocess; permission policy and sandbox mode controls |
| Tool / protocol servers | OpenAI MCP integration, A2A provider | Server URL, allowed tools list, approval policy |
| Red-team adversarial targets | Custom, Crescendo, GOAT red-team providers | Multi-turn loop with on-topic scoring and feedback iteration |
| Documentation site | Docusaurus 2 build with `llms-full.txt` / `llms.txt` plugin | Static site generation for `promptfoo.dev` |

Source: examples/openai-images/README.md, examples/provider-elevenlabs/tts-advanced/README.md, examples/claude-agent-sdk/README.md, examples/openai-mcp/README.md, src/redteam/providers/README.md, site/README.md.

Configuration Patterns

Hosted Model Providers

Hosted providers expose a consistent id:<model> shorthand and a config: block for transport and sampling parameters. Cerebras demonstrates JSON schema enforcement on structured outputs, returning deterministic fields such as cuisine, difficulty, and ingredients (Source: examples/provider-cerebras/README.md). Atlas Cloud shows that OpenAI-compatible gateways can be retargeted via apiBaseUrl and a custom apiKeyEnvar, enabling evaluation across model families behind a single provider account (Source: examples/provider-atlascloud/README.md).
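As a sketch of the gateway pattern, the object below retargets an OpenAI-style provider at a different base URL and reads its key from a custom environment variable. The apiBaseUrl and apiKeyEnvar field names come from the text above; the provider id, gateway URL, variable name, and the resolveApiKey helper are placeholders invented for illustration.

```typescript
// Hypothetical values: the provider id, gateway URL, and env var name are placeholders.
const gatewayProvider = {
  id: "openai:chat:example-model",
  config: {
    apiBaseUrl: "https://gateway.example.com/v1",
    apiKeyEnvar: "EXAMPLE_GATEWAY_API_KEY",
    temperature: 0.2,
    max_tokens: 512,
  },
};

// Resolve the key the way an adapter might: read the variable named by apiKeyEnvar.
function resolveApiKey(env: Record<string, string | undefined>, envar: string): string {
  const key = env[envar];
  if (!key) throw new Error(`Missing API key: set ${envar}`);
  return key;
}
```

Naming the variable in configuration, rather than hard-coding the key, lets several gateways coexist in one evaluation without their credentials colliding.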

Google Vertex configurations are split per model family (promptfooconfig.gemini.yaml, promptfooconfig.claude.yaml, promptfooconfig.llama.yaml, promptfooconfig.search.yaml, promptfooconfig.response-schema.yaml) and run with promptfoo eval -c <file> (Source: examples/google-vertex/README.md). Replicate examples highlight typical model knobs such as temperature, max_tokens, and top_p while documenting that cold-start latency for some models can reach 30–60 seconds (Source: examples/provider-replicate/llama4-scout/README.md).

Tool and Protocol Integrations

The OpenAI MCP example shows how to declare remote tool servers in YAML:

tools:
  - type: mcp
    server_label: deepwiki
    server_url: https://mcp.deepwiki.com/mcp
    allowed_tools: ['ask_question', 'read_wiki_structure']
    require_approval: never

Approval can be scoped granularly with require_approval.never.tool_names so that only specific tools bypass confirmation. Assertion patterns for MCP workflows include is-valid-openai-tools-call for structural validation of function and MCP tool invocations, plus content-level checks such as contains: 'MCP Tool Result' and not-contains: 'MCP Tool Error' (Source: examples/openai-mcp/README.md).
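The granular approval form can be sketched in TypeScript as follows. The field names mirror the YAML above; the needsApproval helper is hypothetical, and its defaults (an omitted require_approval means approval is required, and tools outside allowed_tools are rejected) are assumptions made for illustration.

```typescript
// Shapes mirror the YAML above; the helper name and its defaults are hypothetical.
type RequireApproval = "always" | "never" | { never: { tool_names: string[] } };

interface McpToolConfig {
  type: "mcp";
  server_label: string;
  server_url: string;
  allowed_tools?: string[];
  require_approval?: RequireApproval;
}

// Assumption: omitted require_approval means every call needs approval, and
// tools outside allowed_tools are not callable at all.
function needsApproval(tool: McpToolConfig, toolName: string): boolean {
  if (tool.allowed_tools && !tool.allowed_tools.includes(toolName)) {
    throw new Error(`${toolName} is not in allowed_tools`);
  }
  const policy = tool.require_approval ?? "always";
  if (policy === "never") return false;
  if (policy === "always") return true;
  return !policy.never.tool_names.includes(toolName);
}

const deepwiki: McpToolConfig = {
  type: "mcp",
  server_label: "deepwiki",
  server_url: "https://mcp.deepwiki.com/mcp",
  allowed_tools: ["ask_question", "read_wiki_structure"],
  // Only read_wiki_structure bypasses confirmation.
  require_approval: { never: { tool_names: ["read_wiki_structure"] } },
};
```

With this configuration, read_wiki_structure runs unattended while ask_question still requires confirmation.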

Anthropic's web tools example pairs web_search_20250305 and web_fetch_20250910 tool definitions with allowed_domains, citation control, and max_content_tokens to bound retrieval (Source: examples/anthropic/web-tools/README.md).

Agent Runtime Providers

The Claude Agent SDK example surfaces several advanced controls: sandbox configuration with network restrictions, JavaScript runtime selection (node, bun, deno), extra CLI arguments, setting sources for SDK initialization, permission bypass for automated testing, and AskUserQuestion handling via ask_user_question.behavior plus append_allowed_tools (Source: examples/claude-agent-sdk/README.md). The OpenAI Codex app-server provider starts its own codex app-server subprocess rather than attaching to an already-running Codex Desktop process, and defaults to sandbox_mode: read-only, approval_policy: never, and skip_git_repo_check: true (Source: examples/openai-codex-app-server/README.md). OpenAI Deep Research documents that reasoning models burn significant tokens and recommends validating citation URLs before relying on outputs (Source: examples/openai-deep-research/README.md).

TypeScript Configurations

Promptfoo supports TypeScript configuration via external loaders; Node.js 20+ can run ES modules with the --import flag, and tsx is recommended for the best developer experience. Configs can use Zod schemas that the system automatically adapts for different providers (OpenAI and Gemini), enforcing strict schema adherence for structured JSON outputs (Source: examples/config-ts/README.md).

MCP Server-Side Surface

Promptfoo also exposes itself as an MCP tool server. The TOOL_DEFINITIONS array in toolRegistry.ts is the single source of truth for tool documentation, including list_evaluations, get_evaluation_details, run_evaluation, and share_evaluation. Each entry declares parameters, annotations such as readOnlyHint, idempotentHint, and longRunningHint, and a category such as evaluation (Source: src/commands/mcp/lib/toolRegistry.ts). run_evaluation is annotated as longRunningHint: true and accepts configPath, testCaseIndices, promptFilter, providerFilter, maxConcurrency (1–20), timeoutMs (1s–5min), resultLimit, and resultOffset, reflecting how external agents can drive evaluations through MCP.

Companion Tooling

Outside the core CLI, the repository ships code-scan-action, a GitHub Action that uses AI agents to find LLM-related vulnerabilities such as prompt injection, PII exposure, and excessive agency. It can post PR review comments and, with sarif-output-path configured, surface findings in GitHub Code Scanning via github/codeql-action/upload-sarif (Source: code-scan-action/README.md). The accompanying documentation site is built with Docusaurus 2 and includes a custom plugin that emits llms-full.txt and llms.txt so LLM-based tools can analyze or search the documentation corpus (Source: site/README.md).

Recent Ecosystem Evolution

Release notes show continuous provider expansion:

0.121.8: Claude Agent SDK bumps and a GPT model.
0.121.9: gpt-5.5 model support.
0.121.11: QuiverAI Arrow 1.1, a vectorize endpoint, and a GPT Image-2 pipeline.
0.121.12: repeat-aware fetch cache options.
0.121.13: the MiniMax provider and Claude Opus 4.8 support.
0.121.14: the A2A provider and an agent-rubric grader.
0.121.15: grading of multimodal outputs.

Community requests such as a per-test-case repeat option (#9700) reflect ongoing demand for finer control over how many times individual tests run, particularly for negative assertions that may otherwise pass by chance.

Red Teaming & Adversarial Security Testing

Related topics: LLM Provider Ecosystem & Custom Integrations, Web UI, Code Scanning, Server & Deployment

Purpose and Scope

Promptfoo's red team module is a comprehensive framework for adversarial security testing of LLM applications and foundation models. It enables developers and security teams to systematically probe AI systems for vulnerabilities, safety issues, and policy violations before deployment, replacing ad-hoc manual probing with a reproducible, automated evaluation pipeline. Source: README.md:1-15

The framework targets a broad spectrum of systems:

Foundation models such as DeepSeek R1 0528 and GPT-5.4, evaluated side-by-side for security responses. Source: examples/redteam-deepseek-foundation/README.md:1-25
Agentic systems including autonomous coding agents tested for prompt injection, terminal output injection, secret exfiltration, sandbox escapes, and verifier sabotage. Source: examples/redteam-coding-agent/README.md:1-30
MCP-enabled assistants that use Model Context Protocol for tool discovery, parameter handling, and recursive function execution. Source: examples/redteam-mcp/README.md:1-20
Multi-modal models that accept image inputs, evaluated with UnsafeBench and VLGuard datasets. Source: examples/redteam-multi-modal/README.md:1-55
API-driven applications instrumented with OWASP API Security Top 10 vulnerabilities. Source: examples/redteam-api-top-10/README.md:1-15

Core Architecture

The red team system is organized into four cooperating components: plugins, strategies, providers, and tracing.

flowchart LR
    A[promptfooconfig.yaml] --> P[Plugins]
    A --> S[Strategies]
    P --> T[Test Cases]
    S --> T
    T --> PR[Providers]
    PR --> TM[Target Model]
    PR --> AM[Adversarial Model]
    TM --> G[Grading & Assertions]
    AM --> G
    G --> R[Reports]

Plugins

Plugins generate adversarial test cases from curated datasets or heuristic generators. The guardrails preset, for instance, activates 42 plugins spanning prompt injection, jailbreaking, harmful content, system prompt override, and prompt extraction (Source: examples/redteam-guardrails/README.md:10-40). Other notable plugins include HarmBench for standardized jailbreak evaluation and VLGuard for multi-modal content moderation (Source: examples/redteam-harmbench/README.md:1-20, examples/redteam-multi-modal/README.md:35-55).

Strategies

Strategies modify or wrap test cases to maximize attack success. The iterative provider generates adversarial prompts and refines them based on a judge's feedback loop, penalizing certain phrases and enforcing on-topic checks (Source: src/redteam/providers/README.md:1-30). Other strategies include indirect-web-pwn, which tests data exfiltration by injecting hidden instructions into fetched web pages (Source: examples/redteam-indirect-web-pwn/README.md:1-25).

Providers

Providers supply the target under test and any auxiliary adversarial or grading models. Recent releases added MiniMax, A2A, and broader multi-modal support (community release notes v0.121.13–0.121.15). For OpenAI-hosted MCP integration, providers expose tool configuration including server_label, server_url, allowed_tools, and selective approval policies. Source: examples/openai-mcp/README.md:1-30

Tracing

Tracing captures internal execution spans (LLM calls, guardrail decisions, tool invocations) and feeds them into attack generation or grading. Trace data is sanitizable to strip sensitive attributes and can be filtered by span name pattern or limited by depth and count. Source: examples/redteam-tracing-example/README.md:1-50
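The filtering and sanitizing behavior can be sketched as below. The Span and TraceFilter shapes, the option names, and the "[REDACTED]" marker are all hypothetical; the sketch also assumes that filtering by name and depth happens before the count limit is applied.

```typescript
// Hypothetical span shape and option names, for illustration only.
interface Span {
  name: string;
  depth: number;
  attributes: Record<string, string>;
}

interface TraceFilter {
  namePattern: RegExp; // should not use the global flag, since test() is stateful with it
  maxDepth: number;
  maxSpans: number;
  sensitiveKeys: string[];
}

// Redact rather than drop, so a grader can still see that the field existed.
function sanitize(attrs: Record<string, string>, sensitiveKeys: string[]): Record<string, string> {
  const clean: Record<string, string> = {};
  for (const [key, value] of Object.entries(attrs)) {
    clean[key] = sensitiveKeys.includes(key) ? "[REDACTED]" : value;
  }
  return clean;
}

// Assumption: name and depth filters run first, then the count limit.
function selectSpans(spans: Span[], filter: TraceFilter): Span[] {
  return spans
    .filter((s) => filter.namePattern.test(s.name) && s.depth <= filter.maxDepth)
    .slice(0, filter.maxSpans)
    .map((s) => ({ ...s, attributes: sanitize(s.attributes, filter.sensitiveKeys) }));
}
```

Sanitizing at selection time means sensitive attributes never reach the attack generator or the grader prompt.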

Configuration and Usage

Red team evaluations are configured in promptfooconfig.yaml and executed via:

npx promptfoo@latest redteam run
npx promptfoo@latest view

Source: examples/redteam-deepseek-foundation/README.md:25-45, Source: examples/redteam-indirect-web-pwn/README.md:30-45

A representative configuration specifies plugins, strategies, and a target provider. For example, tracing can be enabled globally or overridden per strategy so that crescendo-style attacks see guardrail decisions while GOAT-style attacks see tool call data. Source: examples/redteam-tracing-example/README.md:10-50

For coding agents, the target provider's working_dir should point at a disposable checkout, and synthetic canary secrets should be used in place of production credentials. Source: examples/redteam-coding-agent/README.md:25-40

Common Failure Modes

When running red team evaluations, watch for:

Missing API credentials — Foundation-model targets require provider keys (e.g., OPENROUTER_API_KEY, OPENAI_API_KEY); absent keys cause silent skips. Source: examples/redteam-deepseek-foundation/README.md:15-25
Remote-generation dependency — Plugin collections like coding-agent:core require the remote generation service. Setting PROMPTFOO_DISABLE_REDTEAM_REMOTE_GENERATION=true disables these collections entirely. Source: examples/redteam-coding-agent/README.md:20-30
Credential leakage in traces — Trace spans can capture sensitive payloads; sanitize attributes when capturing them. Source: examples/redteam-tracing-example/README.md:30-50
Flaky pass/fail on negative assertions — A model may pass a refusal check once by chance; the community has requested a per-test-case repeat option to verify consistency (community issue #9700). Recent cache changes in v0.121.12 added repeat-aware fetch options, supporting this direction.
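The flaky-pass problem in the last item can be made concrete with a small repetition helper. This is a hypothetical sketch, not promptfoo's repeat handling: repeatCheck and RepeatSummary are invented names, and the check callback is a synchronous stand-in for one model call plus its assertion.

```typescript
// Hypothetical helper; promptfoo's own repeat handling may differ.
interface RepeatSummary {
  runs: number;
  passes: number;
  passRate: number;
  consistent: boolean;
}

function repeatCheck(check: (attempt: number) => boolean, runs: number): RepeatSummary {
  if (!Number.isInteger(runs) || runs < 1) {
    throw new RangeError("runs must be a positive integer");
  }
  let passes = 0;
  for (let attempt = 0; attempt < runs; attempt++) {
    if (check(attempt)) passes++;
  }
  // A refusal check is only trusted if every repetition passed.
  return { runs, passes, passRate: passes / runs, consistent: passes === runs };
}

// Canned outputs standing in for four samples from the same prompt.
const samples = ["I can't help with that.", "Sure, here is how...", "I can't help.", "I cannot assist."];
const summary = repeatCheck((i) => !samples[i].includes("Sure"), samples.length);
console.log(summary.passRate, summary.consistent); // 0.75 false
```

A single run of this prompt would pass three times out of four, which is exactly the case a per-test repeat count is meant to expose.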

Community-Driven Evolution

Recent releases show the red team module expanding along community-requested lines:

Multimodal grading (v0.121.15) extends assertions to non-text outputs.
Agent-rubric grader and A2A provider (v0.121.14) broaden agentic evaluation.
Repeat-aware fetch cache (v0.121.12) supports reliability checks via repeated runs.
Humanized per-cell latency (v0.121.10) improves result-table readability.

These changes illustrate promptfoo's pattern of treating red teaming as an evolving, community-shaped discipline rather than a static checklist.

Web UI, Code Scanning, Server & Deployment

Related topics: Core Evaluation Engine & Architecture, Red Teaming & Adversarial Security Testing

Promptfoo ships as a multi-surface product: a CLI for evaluations, a React-based web UI for browsing results, a Docusaurus documentation site, an MCP server for tool-based interactions, and a dedicated GitHub Action for code scanning. This page describes the moving parts of those surfaces, how they relate to the CLI entry point, and the most important deployment and operational considerations.

Repository Surfaces and Workspaces

The npm package promptfoo (currently 0.121.17) declares two npm workspaces in addition to the core CLI package: src/app and site. This structure signals that the evaluation Web UI and the public documentation site are first-class deliverables shipped alongside the evaluator (package.json).

The CLI exposes two binary entry points, promptfoo and pf, both pointing at dist/src/entrypoint.js. The package supports both ESM and CJS consumers via conditional exports, and ships a separate ./contracts entry for shared type definitions. Node engines are pinned to ^20.20.0 || >=22.22.0, which is the supported runtime for the CLI, the server, and the bundled UIs (package.json).

The published files whitelist deliberately excludes dist/test, mocks, and source maps, so production deployments only carry the compiled evaluator, the bundled Web UI assets, and the MCP server surface.

Web UI and Documentation Site

Evaluation Web UI (`src/app`)

The src/app workspace hosts the React-based evaluation viewer launched by promptfoo view. The workspace README was not among the sources reviewed here, but the workspace declaration in package.json indicates that it is built and shipped with the package. Release 0.121.10 added "humanize per-cell latency in eval results table" to the Web UI, which suggests the surface is actively maintained for readability of evaluation output (release 0.121.10). The viewer is intended for inspecting prompts, providers, test cases, and per-cell latency and pass-rate metrics produced by the evaluator.

Companion example READMEs demonstrate the kinds of outputs the UI is expected to render — for example, the eval-markdown-rendering example is explicitly paired with promptfoo eval followed by promptfoo view, indicating the viewer is the canonical way to read structured and markdown-bearing evaluation results (examples/eval-markdown-rendering/README.md).

Documentation Site (`site`)

The public documentation is a Docusaurus 2 site. The standard commands are npm install, npm start for local development (with hot reload), and npm run build for static output (site/README.md).

A custom Docusaurus plugin generates two auxiliary files at build time:

llms-full.txt — a concatenated dump of all docs content, useful for LLM-based search and indexing.
llms.txt — a structured index (titles, paths, descriptions) of the documentation set.

Both files are emitted into the build root and require no extra configuration from site authors (site/README.md).

MCP Server Surface

The MCP (Model Context Protocol) server is implemented under src/commands/mcp and is the integration point for external agents that want to drive promptfoo programmatically. The tool catalog is centralized in a single TOOL_DEFINITIONS array declared in src/commands/mcp/lib/toolRegistry.ts, which is the authoritative source for tool names, descriptions, parameter signatures, and MCP annotations (readOnlyHint, idempotentHint, longRunningHint).

The published tools fall into four categories:

| Category | Examples | Notable Parameters |
| --- | --- | --- |
| Evaluation | `list_evaluations`, `get_evaluation_details`, `run_evaluation`, `share_evaluation` | Pagination, dataset filter, test-case indices, prompt/provider filters, concurrency, timeout |
| Red team | (redteam tools, defined in same registry) | Dataset-scoped operations |
| Configuration | Prompt / provider authoring helpers | Static `promptfooconfig` discovery |
| Meta | Server health, capability introspection | Read-only, idempotent |

Each run_evaluation call accepts maxConcurrency (1–20) and timeoutMs (1s–5min) directly from the tool signature, exposing runtime controls over the underlying evaluator (src/commands/mcp/lib/toolRegistry.ts).

Server utilities in src/commands/mcp/lib/utils.ts standardize the wire format:

createToolResponse(tool, success, data?, error?) wraps payloads in a { tool, success, timestamp, data?, error? } envelope and serializes them as a single text content block. isError is set to the inverse of success, so MCP clients can branch on it.
DEFAULT_TOOL_TIMEOUT_MS is fixed at 5 minutes and is the value used when callers omit timeoutMs.
withTimeout(promise, ms, message) races the underlying promise against a setTimeout reject and always clears the timer in a .finally, preventing leaked handles on long-running tools such as run_evaluation.
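The three utilities above can be sketched as follows. This is a hypothetical re-implementation written from the description, not the code in utils.ts: the ToolResult type name, the ISO-8601 timestamp format, the string type of error, and the default timeout message are assumptions.

```typescript
// Hypothetical re-implementation based on the description above; not the code in utils.ts.
const DEFAULT_TOOL_TIMEOUT_MS = 5 * 60 * 1000;

interface ToolResult {
  content: { type: "text"; text: string }[];
  isError: boolean;
}

function createToolResponse(
  tool: string,
  success: boolean,
  data?: unknown,
  error?: string, // assumed to be a plain message string
): ToolResult {
  // Assumption: the timestamp is an ISO-8601 string.
  const envelope = { tool, success, timestamp: new Date().toISOString(), data, error };
  // JSON.stringify omits undefined fields, so data and error stay optional on the wire.
  return {
    content: [{ type: "text", text: JSON.stringify(envelope, null, 2) }],
    isError: !success,
  };
}

function withTimeout<T>(
  promise: Promise<T>,
  ms: number = DEFAULT_TOOL_TIMEOUT_MS,
  message = "Operation timed out",
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(message)), ms);
  });
  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Clearing the timer in finally matters for short-lived processes: a pending 5-minute timer would otherwise keep the Node event loop alive after the tool call has already returned.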

flowchart LR
  Agent[External Agent / IDE] -->|MCP request| Server[MCP Server<br/>src/commands/mcp]
  Server -->|lookup| Registry[TOOL_DEFINITIONS<br/>toolRegistry.ts]
  Server -->|wrap response| Utils[createToolResponse<br/>utils.ts]
  Server -->|invoke| Eval[promptfoo evaluator]
  Eval -->|results| UI[Web UI<br/>src/app]
  Eval -->|docs / examples| Site[Docusaurus site]
  Server -->|scan request| CSA[code-scan-action]

Code Scanning GitHub Action

The code-scan-action package is a standalone GitHub Action that wraps promptfoo's code-scans run CLI. Its entry point builds the CLI invocation in code-scan-action/src/main.ts, composing the following positional and flag arguments:

repoPath (defaulting to GITHUB_WORKSPACE or cwd),
--config <path>, --base <baseBranch>, --compare HEAD,
--json for structured output,
--github-pr owner/repo#number to link findings to the originating PR,
an optional --api-host when a self-hosted promptfoo API is configured.
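The argument list above can be sketched as a small builder. The flags are the ones listed; the ScanOptions shape and the buildScanArgs name are hypothetical, and treating --config as optional is an assumption, so this should not be read as the code in code-scan-action/src/main.ts.

```typescript
// Hypothetical helper; option names are illustrative, flags come from the list above.
interface ScanOptions {
  repoPath?: string;
  configPath?: string; // assumption: --config is only passed when a path is given
  baseBranch: string;
  pr?: { owner: string; repo: string; number: number };
  apiHost?: string;
}

function buildScanArgs(
  opts: ScanOptions,
  env: Record<string, string | undefined>,
  cwd: string,
): string[] {
  // repoPath falls back to GITHUB_WORKSPACE, then to the current directory.
  const args = ["code-scans", "run", opts.repoPath ?? env["GITHUB_WORKSPACE"] ?? cwd];
  if (opts.configPath) args.push("--config", opts.configPath);
  args.push("--base", opts.baseBranch, "--compare", "HEAD", "--json");
  if (opts.pr) {
    args.push("--github-pr", `${opts.pr.owner}/${opts.pr.repo}#${opts.pr.number}`);
  }
  if (opts.apiHost) args.push("--api-host", opts.apiHost);
  return args;
}
```

Building an argument array, rather than a single shell string, avoids quoting problems when branch names or paths contain spaces.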

Recent releases expand the deployment surface of the action:

0.1.6 added SARIF output support, so findings can be ingested by GitHub code-scanning dashboards and other SARIF consumers (release code-scan-action-0.1.6).
0.1.7 fixed structured output for the fork-PR skip path, ensuring the action still emits a parseable response when the run is skipped on forks (release code-scan-action-0.1.7).

For local development and CI dry-runs, createMockScanResponse() produces a deterministic ScanResponse with two representative findings (a hardcoded API key and a SQL injection). This makes it easy to validate downstream consumers (PR comment rendering, SARIF uploaders) without calling the real scanner (code-scan-action/src/main.ts).

Deployment Considerations

Node runtime. Use a Node version that satisfies ^20.20.0 || >=22.22.0. Releases of 20.x older than 20.20.0, all of Node 21, and releases of 22.x older than 22.22.0 fall outside that range (package.json).
MCP timeouts. The 5-minute default in utils.ts bounds how long a single tool call is awaited when the caller omits timeoutMs, and run_evaluation.timeoutMs itself is limited to 1s–5min. Long-running redteam or evaluation jobs should therefore be split into smaller batches, and any client-side timeout should be at least as long as run_evaluation.timeoutMs.
Code scan in forks. Forked pull requests cannot access repo secrets; the action's structured skip output (since 0.1.7) is what consumers should rely on for a clean no-op signal (release code-scan-action-0.1.7).
Web UI assets. The src/app workspace ships prebuilt assets; deploying the evaluator in a container typically means copying dist/ and pointing the host port at the server's bind address.
Docs site. The Docusaurus build emits llms-full.txt and llms.txt automatically, so any static host (e.g., Netlify, S3 + CloudFront) can serve the documentation without additional post-processing (site/README.md).
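The Node engine range from the first item can be checked with a few lines of code. This is a minimal sketch that hard-codes the range ^20.20.0 || >=22.22.0 instead of using a semver library; the function name is hypothetical, and pre-release tags are ignored as a simplifying assumption.

```typescript
// Checks a version string against ^20.20.0 || >=22.22.0 without a semver library.
// Assumption: pre-release tags (for example "22.22.0-rc.1") are ignored.
function isSupportedNode(version: string): boolean {
  const match = /^v?(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (!match) return false;
  const major = Number(match[1]);
  const minor = Number(match[2]);
  if (major === 20) return minor >= 20; // ^20.20.0 means >=20.20.0 and <21.0.0
  if (major === 22) return minor >= 22; // >=22.22.0 within the 22.x line
  return major > 22; // >=22.22.0 also admits every later major
}
```

A container entrypoint could call isSupportedNode(process.version) and exit early with a clear message instead of failing later on a missing runtime feature.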

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 19 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

1. Installation risk: Installation risk requires verification

Severity: medium
Finding: Developers should check this installation risk before relying on the project: 0.121.8
User impact: Upgrade or migration may change expected behavior: 0.121.8
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: 0.121.8. Context: Source discussion did not expose a precise runtime context.
Evidence: failure_mode_cluster:github_release | https://github.com/promptfoo/promptfoo/releases/tag/0.121.8

2. Installation risk: Installation risk requires verification

Severity: medium
Finding: Developers should check this installation risk before relying on the project: code-scan-action: 0.1.6
User impact: Upgrade or migration may change expected behavior: code-scan-action: 0.1.6
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: code-scan-action: 0.1.6. Context: Source discussion did not expose a precise runtime context.
Evidence: failure_mode_cluster:github_release | https://github.com/promptfoo/promptfoo/releases/tag/code-scan-action-0.1.6

3. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.host_targets | https://github.com/promptfoo/promptfoo

4. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Developers should check this configuration risk before relying on the project: 0.121.15
User impact: Upgrade or migration may change expected behavior: 0.121.15
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: 0.121.15. Context: Observed during version upgrade or migration.
Evidence: failure_mode_cluster:github_release | https://github.com/promptfoo/promptfoo/releases/tag/0.121.15

5. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Developers should check this configuration risk before relying on the project: Per-test-case repeat option to control how many times individual tests run
User impact: Developers may misconfigure credentials, environment, or host setup: Per-test-case repeat option to control how many times individual tests run
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Per-test-case repeat option to control how many times individual tests run. Context: Source discussion did not expose a precise runtime context.
Evidence: failure_mode_cluster:github_issue | https://github.com/promptfoo/promptfoo/issues/9700

6. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/promptfoo/promptfoo/issues/9700

7. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | https://github.com/promptfoo/promptfoo

8. Runtime risk: Runtime risk requires verification

Severity: medium
Finding: Release 0.121.12 is flagged as a runtime risk. Check it before relying on the project.
User impact: Upgrading or migrating to 0.121.12 may change expected behavior.
Recommended check: Before packaging this project, run the relevant install/config/quickstart check against 0.121.12. Context: observed when using Node.js.
Evidence: failure_mode_cluster:github_release | https://github.com/promptfoo/promptfoo/releases/tag/0.121.12
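
Because this risk was observed when using Node.js, one pre-flight step worth considering is to confirm that the runtime matches the engine range this manual reports in Purpose and Scope (^20.20.0 || >=22.22.0). The sketch below is only an illustration: `parseVersion` and `satisfiesNodeRange` are hypothetical helpers, not part of promptfoo, and the range is hard-coded from the manual rather than read from package.json. Prerelease tags are ignored.

```typescript
// Hypothetical helpers, not part of promptfoo. They hard-code the Node.js
// range "^20.20.0 || >=22.22.0" quoted in this manual.

// Parse "v20.20.1" or "20.20.1" into [major, minor, patch].
function parseVersion(v: string): [number, number, number] {
  const m = /^v?(\d+)\.(\d+)\.(\d+)/.exec(v);
  if (!m) {
    throw new Error(`Unrecognized version: ${v}`);
  }
  return [Number(m[1]), Number(m[2]), Number(m[3])];
}

// Return true when the version falls inside ^20.20.0 || >=22.22.0.
function satisfiesNodeRange(version: string): boolean {
  const [major, minor] = parseVersion(version);
  // ^20.20.0 means >=20.20.0 and <21.0.0.
  const caret20 = major === 20 && minor >= 20;
  // >=22.22.0 means 22.22.0 or anything newer, including later majors.
  const gte22 = major > 22 || (major === 22 && minor >= 22);
  return caret20 || gte22;
}
```

In a Node.js script, calling `satisfiesNodeRange(process.version)` before running the install and quickstart check would surface an unsupported runtime early. If the declared range changes in a later release, the hard-coded bounds would need updating.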

9. Runtime risk: Runtime risk requires verification

Severity: medium
Finding: Release 0.121.14 is flagged as a runtime risk. Check it before relying on the project.
User impact: Upgrading or migrating to 0.121.14 may change expected behavior.
Recommended check: Before packaging this project, run the relevant install/config/quickstart check against 0.121.14. Context: the source discussion did not expose a precise runtime context.
Evidence: failure_mode_cluster:github_release | https://github.com/promptfoo/promptfoo/releases/tag/0.121.14

10. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Release 0.121.13 is flagged as a migration risk. Check it before relying on the project.
User impact: Upgrading or migrating to 0.121.13 may change expected behavior.
Recommended check: Before packaging this project, run the relevant install/config/quickstart check against 0.121.13. Context: observed when using Node.js.
Evidence: failure_mode_cluster:github_release | https://github.com/promptfoo/promptfoo/releases/tag/0.121.13

11. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/promptfoo/promptfoo

12. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: downstream_validation.risk_items | https://github.com/promptfoo/promptfoo

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12 (count of project-level external discussion links exposed on this manual page).

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using promptfoo with real data or production workflows.

Per-test-case repeat option to control how many times individual tests run - github / github_issue
code-scan-action: 0.1.8 - github / github_release
0.121.17 - github / github_release
0.121.16 - github / github_release
0.121.15 - github / github_release
0.121.14 - github / github_release
code-scan-action: 0.1.7 - github / github_release
0.121.13 - github / github_release
code-scan-action: 0.1.6 - github / github_release
0.121.12 - github / github_release
0.121.11 - github / github_release
0.121.10 - github / github_release

Source: Project Pack community evidence and pitfall evidence
