agentassay Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

agentassay

Introduction and Layered Architecture

Related topics: Token-Efficient Testing Pipeline and Statistical Engine, Framework Adapters, CLI, Dashboard and pytest Integration

Overview

AgentAssay is an agent testing framework that delivers statistical guarantees for non-deterministic AI agent workflows without burning token budgets. The project's tagline — "Test More. Spend Less. Ship Confident." — captures the core value proposition: produce rigorous verdicts on agent behavior while reducing token costs 5–20x compared to naïve re-run strategies. As described in README.md:1-30, AgentAssay is part of the Qualixar AI Agent Reliability Platform and is backed by the paper "AgentAssay: Formal Regression Testing for Non-Deterministic AI Agent Workflows" (arXiv:2603.02601).

The framework ships ten framework adapters, a pytest plugin, a CLI, and a Streamlit dashboard. The CLI's five core commands are run, compare, mutate, coverage, and report; test-report, demo, and dashboard are also available. Per the v0.1.2 release notes, release v0.1.2 is a production-grade build with zero CI warnings, full mypy strict-mode compliance, and ruff-clean code across Python 3.10, 3.11, and 3.12.

The Six-Layer Stack

The codebase is organized as a six-layer architecture, with each layer building on the abstractions of the layer below. Layers 1–5 ensure that every test run produces rigorous results; Layer 6 optimizes *how many* runs are needed. The full stack is documented in README.md:120-145.

flowchart TB
    L6["Layer 6: Efficiency<br/>Fingerprinting · Budget Optimization · Trace Analysis<br/>Multi-Fidelity · Warm-Start Sequential"]
    L5["Layer 5: Integration<br/>Framework Adapters · pytest Plugin · CLI · Reporting"]
    L4["Layer 4: Analysis<br/>Coverage (5D) · Mutation · Metamorphic · Contract Oracle"]
    L3["Layer 3: Verdicts<br/>Stochastic Verdicts · Deployment Gates"]
    L2["Layer 2: Statistics<br/>Hypothesis Tests · Confidence Intervals · SPRT · Effect Size"]
    L1["Layer 1: Core<br/>Data Models · Execution Engine · Trace Format"]
    L6 --> L5 --> L4 --> L3 --> L2 --> L1

Layer 1 — Core

The core layer provides the foundational data models, the execution engine that runs scenarios through an agent, and the canonical ExecutionTrace format. The execution engine captures per-step traces, latencies, token counts, and success flags consumed by every higher layer. Frameworks plug into the core via a uniform run(input_data) -> ExecutionTrace contract, as seen in the MCP adapter at src/agentassay/integrations/mcp_adapter.py:120-150 and the Semantic Kernel adapter at src/agentassay/integrations/semantic_kernel_adapter.py:60-100.

Layer 2 — Statistics

Layer 2 implements the statistical machinery: hypothesis tests, Wilson confidence intervals, Sequential Probability Ratio Tests (SPRT), and effect-size calculations. These primitives are what allow AgentAssay to produce three-valued verdicts (PASS / FAIL / INCONCLUSIVE) with confidence intervals rather than misleading point estimates. The Wilson interval is exposed in CLI code at src/agentassay/cli/cmd_report.py:60-75 for HTML report generation.
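To make the confidence-interval idea concrete, here is a minimal sketch of a Wilson score interval using only the standard library. The function name matches the wilson_interval mentioned above, but the signature and defaults are assumptions for illustration, not the library's actual API.

```python
import math
from statistics import NormalDist


def wilson_interval(successes: int, trials: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (illustrative signature)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / trials
    denom = 1 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


low, high = wilson_interval(24, 30)
print(f"observed pass rate 0.80, 95% CI [{low:.3f}, {high:.3f}]")
```

Unlike the naive normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly at small trial counts, which is the regime agent tests usually run in.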

Layer 3 — Verdicts

Layer 3 consumes Layer 2 statistics to issue stochastic verdicts and deployment gates. A gate decision is a structured record (DEPLOY / BLOCK / WARN with reason text and rules JSON) persisted for audit, as shown in src/agentassay/persistence/storage.py:80-130. Gates can be wired into CI/CD pipelines to block broken deployments with statistical evidence.
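As an illustration of what such a gate record might look like, the sketch below maps a three-valued verdict onto a gate decision. The GateDecision class, the decide_gate function, and the PASS-to-DEPLOY, FAIL-to-BLOCK, INCONCLUSIVE-to-WARN policy are all hypothetical; the real mapping and record layout may differ.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    decision: str    # "DEPLOY", "BLOCK", or "WARN"
    reason: str      # human-readable explanation kept for the audit trail
    rules_json: str  # JSON-encoded rules that produced the decision


# Hypothetical policy: the real mapping from verdicts to gates may differ.
_VERDICT_TO_DECISION = {"PASS": "DEPLOY", "FAIL": "BLOCK", "INCONCLUSIVE": "WARN"}


def decide_gate(verdict: str, pass_rate: float, threshold: float) -> GateDecision:
    """Turn a stochastic verdict into a structured gate record."""
    if verdict not in _VERDICT_TO_DECISION:
        raise ValueError(f"unknown verdict: {verdict!r}")
    rules = json.dumps({"min_pass_rate": threshold}, sort_keys=True)
    reason = f"verdict={verdict}, observed pass rate {pass_rate:.2f} vs threshold {threshold:.2f}"
    return GateDecision(_VERDICT_TO_DECISION[verdict], reason, rules)


gate = decide_gate("FAIL", pass_rate=0.62, threshold=0.80)
print(gate.decision, "-", gate.reason)
```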

Layer 4 — Analysis

Layer 4 adds four analytical passes over traces and runs: five-dimensional coverage (tool, path, state, boundary, model), mutation testing, metamorphic testing, and a contract oracle. The mutation subsystem is defined in src/agentassay/mutation/base.py:1-80, which establishes a MutationOperator abstract base class with name, category (one of prompt, tool, model, context), and description class attributes plus an abstract mutate(config, scenario) -> tuple[AgentConfig, TestScenario]. Concrete operators such as PromptSynonymMutator in src/agentassay/mutation/prompt_ops.py:60-110 replace key prompt words with synonyms to validate test sensitivity. Operators are required to deep-copy inputs so the originals are never mutated (src/agentassay/mutation/base.py:55-70).

Layer 5 — Integration

Layer 5 is the surface area users actually touch: framework adapters, the pytest plugin, the CLI, and the dashboard. Framework adapters exist for LangGraph, CrewAI, OpenAI Agents, AutoGen/AG2, smolagents, MCP (direct and Anthropic bridge), Semantic Kernel, and AWS Bedrock — see src/agentassay/integrations/mcp_anthropic.py:1-60 for the Anthropic bridge and src/agentassay/integrations/bedrock_adapter.py:1-80 for the Bedrock adapter signature. The CLI report command in src/agentassay/cli/cmd_report.py:1-50 generates a self-contained HTML report with verdict, pass-rate table, and Wilson confidence interval. The Streamlit dashboard is a thin router in src/agentassay/dashboard/app.py:1-60 that dispatches to four views: Overview, Test Run, History, and Fingerprints. All views share utility helpers from src/agentassay/dashboard/helpers.py:1-40 for metric cards and dark-themed Plotly charts. Working examples for every framework live under examples/framework-specific/ as catalogued in examples/README.md:1-30.

Layer 6 — Efficiency

Layer 6 is the differentiator. It houses behavioral fingerprinting, adaptive budget optimization, trace-first offline analysis, multi-fidelity proxy testing, and warm-start sequential testing. Together these techniques achieve the headline 5–20x token reduction: trace-first analysis reuses existing production traces for coverage, contracts, and metamorphic checks at zero token cost, while adaptive budgets calibrate variance and compute the exact minimum N per scenario.

Persistence and Audit Trail

Behind every layer is a SQLite-backed persistence layer with a fixed schema covering projects, runs, trials, verdicts, coverage, fingerprints, gate decisions, and cost accounting. The DDL is defined in src/agentassay/persistence/schema.py:1-80 and the ResultStore in src/agentassay/persistence/storage.py:1-50 provides insert methods for each entity. The default database path is ~/.agentassay/results.db, overridable via the AGENTASSAY_DB_PATH environment variable (src/agentassay/dashboard/app.py:30-40).
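The path resolution described above can be sketched as follows. The helper name resolve_db_path is hypothetical; only the environment variable name and the default location come from the text.

```python
import os
from pathlib import Path


def resolve_db_path(env=None) -> Path:
    """Return the results database path, honouring AGENTASSAY_DB_PATH if set."""
    env = os.environ if env is None else env
    override = env.get("AGENTASSAY_DB_PATH")
    if override:
        # expanduser lets users write "~/somewhere/results.db" in the variable.
        return Path(override).expanduser()
    return Path.home() / ".agentassay" / "results.db"


print(resolve_db_path())
```

Passing the environment as a parameter is a testing convenience in this sketch: callers can supply a plain dict instead of mutating os.environ.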

Token-Efficient Testing Pipeline and Statistical Engine

Related topics: Introduction and Layered Architecture, Framework Adapters, CLI, Dashboard and pytest Integration, Analysis Methods, Persistence, Reporting and Deployment Operations

Overview

AgentAssay's token-efficient testing pipeline is the differentiator that separates it from conventional agent evaluation frameworks. It occupies Layer 6 (Efficiency), the top of the documented six-layer architecture, and relies on the Layer 2 statistical engine to reduce the number of trials needed to reach a statistically defensible verdict. Source: README.md:1-200

The headline claim of a "5–20x token cost reduction" comes from composing three independent techniques (source: README.md:1-150):

Behavioral Fingerprinting — extract compact, low-dimensional trace vectors and detect regressions with Hotelling's T² instead of comparing noisy raw text. Source: examples/README.md:1-60.
Adaptive Budget Optimization — calibrate behavioral variance on a small pilot, then compute the exact minimum N for the target confidence. Source: README.md:1-150.
Trace-First Offline Analysis — re-use existing production traces to compute coverage, contracts, and metamorphic relations at zero additional token cost. Source: README.md:1-150.

These techniques are bundled with multi-fidelity proxy testing and warm-start sequential testing, allowing cheaper models and prior results to short-circuit the full evaluation. Source: README.md:1-200.

Architecture

flowchart TD
    A[Scenario Definition] --> B[Pilot Run n=5-10]
    B --> C[Behavioral Variance Estimation]
    C --> D[Adaptive Budget N*]
    D --> E[Fingerprint Extraction]
    E --> F[Hotelling T² or Fisher Test]
    F --> G{SPRT Decision}
    G -->|Pass| H[Verdict: PASS]
    G -->|Fail| I[Verdict: FAIL]
    G -->|Insufficient Evidence| J[Verdict: INCONCLUSIVE]
    H --> K[Wilson CI + Effect Size]
    I --> K
    J --> L[Warm-Start Prior for Next Run]
    L --> A
    K --> M[HTML / JSON Report]

The pipeline mirrors the layered architecture described in the README. Layer 1 (Core) provides the ExecutionTrace and StepTrace data models used as input; Layer 2 (Statistics) supplies the wilson_interval and hypothesis primitives; Layer 3 (Verdicts) consumes them; Layer 4 (Analysis) feeds coverage and metamorphic checks from existing traces; Layer 5 (Integration) exposes them via the CLI; and Layer 6 (Efficiency) orchestrates the short-circuit. Source: README.md:1-200.

Statistical Engine

The statistical engine is the rigorous core that the efficiency layer leans on. It exposes three families of primitives:

Confidence Intervals — wilson_interval is invoked from both the CLI report generator and the test-report path. Source: src/agentassay/cli/cmd_report.py:1-90, src/agentassay/cli/cmd_test_report.py:1-60.
Hypothesis Tests — The legacy module enumerates FISHER, CHI2, BARNARD for binary regression detection and MANN_WHITNEY, KS, WELCH_T for continuous score comparison, returning a frozen HypothesisResult dataclass. Source: src/agentassay/statistics/hypothesis_legacy.py:1-90.
Effect Size — cohens_h is re-exported from agentassay.statistics.effect_size and consumed by the hypothesis module. Source: src/agentassay/statistics/hypothesis_legacy.py:1-60, src/agentassay/statistics/effect_size.py:1-40.

These primitives are pure functions, which makes them safe to invoke from both the trial runner and offline trace analyzers. Verdicts are not a single PASS/FAIL bit: the engine distinguishes PASS, FAIL, and INCONCLUSIVE based on confidence-interval overlap and SPRT boundaries, then attaches an effect size and Wilson CI for context. Source: src/agentassay/verdicts/verdict.py:1-60.
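A simplified sketch of two of these primitives is shown below: Cohen's h as an effect size for two pass rates, and a three-valued verdict derived from a confidence interval and a threshold. The cohens_h name appears in the text; its signature, the verdict_from_interval helper, and the decision rule (which ignores SPRT boundaries) are assumptions made for illustration.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def verdict_from_interval(ci_low: float, ci_high: float, threshold: float) -> str:
    """Three-valued verdict from a pass-rate confidence interval (simplified rule)."""
    if ci_low >= threshold:
        return "PASS"          # the whole interval clears the bar
    if ci_high < threshold:
        return "FAIL"          # the whole interval sits below the bar
    return "INCONCLUSIVE"      # the interval straddles the threshold


print(round(cohens_h(0.9, 0.7), 3))
print(verdict_from_interval(0.63, 0.90, threshold=0.80))
```

Both functions are pure, so they can be called from a live trial runner or from an offline trace analyzer with the same results.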

Behavioral Fingerprinting

BehavioralFingerprint is the entry point for trace compression. It produces a low-dimensional vector representation of what an agent *did* — tool sequences, state transitions, decision patterns — rather than what it *said*. Source: src/agentassay/efficiency/fingerprint.py:1-60.
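To illustrate the idea of compressing behavior into a vector, the sketch below turns a tool-call sequence into a short numeric fingerprint. The feature choice (tool frequencies, step count, tool switches) and the fingerprint function are hypothetical; the features BehavioralFingerprint actually extracts are not specified here.

```python
from collections import Counter


def fingerprint(tool_sequence: list[str], vocabulary: list[str]) -> list[float]:
    """Map a tool-call sequence to a fixed-length behavioral vector (illustrative features)."""
    counts = Counter(tool_sequence)
    n = len(tool_sequence)
    # Relative frequency of each known tool, in vocabulary order.
    freqs = [counts[tool] / n if n else 0.0 for tool in vocabulary]
    # Number of adjacent steps where the agent switched to a different tool.
    switches = sum(1 for a, b in zip(tool_sequence, tool_sequence[1:]) if a != b)
    return freqs + [float(n), float(switches)]


vec = fingerprint(["search", "search", "book"], ["search", "book", "cancel"])
print(vec)
```

Two runs that phrase their answers differently but call the same tools in the same order produce the same vector, which is what makes the representation less noisy than raw text.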

The fingerprint is consumed by the regression test in efficiency.regression, which applies Hotelling's T² for multivariate change detection. Because the fingerprint dimensionality is small, the test requires far fewer samples than a string-similarity comparison, yielding the 3–5x savings quoted in the examples documentation. Source: examples/README.md:1-60, src/agentassay/efficiency/regression.py:1-60.
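For readers unfamiliar with Hotelling's T², here is a pure-Python sketch of the two-sample statistic, restricted to two-dimensional fingerprints so the covariance matrix can be inverted by hand. The function names are hypothetical and this is not the code in efficiency.regression; a real implementation would handle arbitrary dimensions with a linear-algebra library.

```python
def _mean(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def _scatter(rows, mu):
    """Sum of outer products of deviations from the mean (2x2)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def hotelling_t2(a, b):
    """Two-sample Hotelling T^2 and its F-statistic for 2-D samples."""
    n1, n2, p = len(a), len(b), 2
    m1, m2 = _mean(a), _mean(b)
    s1, s2 = _scatter(a, m1), _scatter(b, m2)
    dof = n1 + n2 - 2
    pooled = [[(s1[i][j] + s2[i][j]) / dof for j in range(2)] for i in range(2)]
    det = pooled[0][0] * pooled[1][1] - pooled[0][1] * pooled[1][0]
    if det == 0:
        raise ValueError("pooled covariance is singular")
    inv = [[pooled[1][1] / det, -pooled[0][1] / det],
           [-pooled[1][0] / det, pooled[0][0] / det]]
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    quad = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    t2 = (n1 * n2) / (n1 + n2) * quad
    # Under the null hypothesis this follows F(p, n1 + n2 - p - 1).
    f_stat = (n1 + n2 - p - 1) / (p * dof) * t2
    return t2, f_stat


baseline = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
candidate = [(2.0, 2.0), (3.0, 2.0), (2.0, 3.0), (3.0, 3.0)]
t2, f_stat = hotelling_t2(baseline, candidate)
print(f"T^2 = {t2:.1f}, F = {f_stat:.1f}")
```

A large F-statistic relative to the F(p, n1 + n2 - p - 1) critical value indicates that the candidate's behavior has shifted away from the baseline.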

The same fingerprint is also fed to the coverage collector so that downstream 5D coverage metrics (tool, path, state, boundary, model) can be computed on a fingerprint histogram without re-running the agent. Source: src/agentassay/coverage/aggregate.py:1-60, README.md:1-200.

Adaptive Budget & Sequential Testing

agentassay.efficiency.budget is the module that removes the "guess N=30" problem. The flow is:

Run a small calibration set (5–10 trials).
Measure the behavioral variance from the fingerprint vectors.
Solve for the smallest N that achieves the configured power and alpha.

This delivers the 2–4x reduction in trial count quoted in the documentation. Source: examples/README.md:1-60, src/agentassay/efficiency/budget.py:1-60.
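The three steps above can be sketched with a textbook power calculation. This is a deliberate simplification: it uses a single scalar score per pilot trial rather than a fingerprint vector, and a normal-approximation formula for a two-sided test. The minimum_trials function and its parameters are hypothetical, not the API of agentassay.efficiency.budget.

```python
import math
from statistics import NormalDist, variance


def minimum_trials(pilot_scores, min_effect, alpha=0.05, power=0.80):
    """Smallest N that detects a mean shift of min_effect at the given alpha and power."""
    if len(pilot_scores) < 2:
        raise ValueError("need at least two pilot trials to estimate variance")
    if min_effect <= 0:
        raise ValueError("min_effect must be positive")
    sigma2 = variance(pilot_scores)                 # step 2: variance from the pilot
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_power = NormalDist().inv_cdf(power)
    n = (z_alpha + z_power) ** 2 * sigma2 / min_effect ** 2  # step 3: solve for N
    # Never report fewer trials than the pilot already consumed.
    return max(len(pilot_scores), math.ceil(n))


pilot = [-1.0, 0.0, 1.0]  # step 1: a tiny calibration set with sample variance 1.0
print(minimum_trials(pilot, min_effect=0.5))
```

The formula makes the savings mechanism visible: required N scales linearly with the measured variance, so a low-variance agent needs proportionally fewer trials than a fixed "N=30" rule would spend.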

warm_start and multi_fidelity extend the same logic across *runs* and *models*. warm_start lets a new run incorporate the posterior from the previous run via SPRT, reaching a verdict earlier. multi_fidelity lets a cheap model screen the scenario first, escalating to the expensive model only when the cheap run is ambiguous. Source: src/agentassay/efficiency/warm_start.py:1-60, src/agentassay/efficiency/multi_fidelity.py:1-60.
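A minimal sketch of Wald's SPRT for pass/fail outcomes shows how prior evidence can shorten a run. Modelling warm start as a nonzero starting log-likelihood ratio (prior_llr) is an assumption made for illustration; the sprt function and its parameters are hypothetical and do not describe the warm_start module's actual interface.

```python
import math


def sprt(outcomes, p0, p1, alpha=0.05, beta=0.05, prior_llr=0.0):
    """Wald SPRT for H0: pass rate = p0 against H1: pass rate = p1 (with p1 > p0).

    Returns (verdict, trials_used, final_log_likelihood_ratio).
    """
    upper = math.log((1 - beta) / alpha)   # crossing this accepts H1
    lower = math.log(beta / (1 - alpha))   # crossing this accepts H0
    llr = prior_llr                        # warm start: evidence carried over
    used = 0
    for used, passed in enumerate(outcomes, start=1):
        if passed:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "PASS", used, llr
        if llr <= lower:
            return "FAIL", used, llr
    return "INCONCLUSIVE", used, llr


cold = sprt([True] * 30, p0=0.6, p1=0.8)
warm = sprt([True] * 30, p0=0.6, p1=0.8, prior_llr=1.0)
print(cold[:2], warm[:2])
```

With the same stream of passing trials, the warm-started test stops earlier than the cold one because it begins closer to the upper boundary.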

Trace-First Offline Analysis

Trace-first analysis is what makes "5–20x" possible in practice. Because coverage, contract, and metamorphic checks operate on ExecutionTrace objects already in the database, they run for free. Source: src/agentassay/persistence/schema.py:1-90, src/agentassay/persistence/storage.py:1-80.

The AgentCoverageCollector aggregates fingerprints across the stored trials and feeds the HTML reporter. Source: src/agentassay/cli/cmd_demo.py:1-90, src/agentassay/coverage/aggregate.py:1-60. The HTMLReporter and JSONExporter then emit the self-contained artifacts used by CI gates. Source: src/agentassay/reporting/html.py:1-60, src/agentassay/reporting/json_export.py:1-60.
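To show why offline analysis costs no tokens, the sketch below computes tool coverage directly from stored trace JSON without invoking any agent. The trace layout (a "steps" list whose entries carry a "tool" key) and the tool_coverage function are hypothetical; the real ExecutionTrace serialization may differ.

```python
import json


def tool_coverage(trace_jsons, declared_tools):
    """Fraction of declared tools exercised by the stored traces, plus the unused ones."""
    declared = set(declared_tools)
    seen = set()
    for raw in trace_jsons:
        trace = json.loads(raw)
        for step in trace.get("steps", []):
            tool = step.get("tool")
            if tool in declared:
                seen.add(tool)
    ratio = len(seen) / len(declared) if declared else 0.0
    return ratio, sorted(declared - seen)


stored = [
    json.dumps({"steps": [{"tool": "search"}, {"tool": "book"}]}),
    json.dumps({"steps": [{"tool": "search"}, {"note": "final answer"}]}),
]
ratio, missing = tool_coverage(stored, ["search", "book", "cancel"])
print(f"tool coverage {ratio:.2f}, never called: {missing}")
```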

Common Failure Modes

Insufficient pilot size: if the calibration run has fewer than 5 trials, variance estimates become unstable and the adaptive budget falls back to a conservative N. Source: src/agentassay/efficiency/budget.py:1-60.
Schema mismatch on stored traces: missing columns (e.g. cost, token_count) on the trials table cause the offline analyzers to skip those fields silently. Source: src/agentassay/persistence/schema.py:1-90.
Verdict stuck at INCONCLUSIVE: this usually means the confidence interval straddles the threshold. Run more trials to narrow the interval, or use warm-start to accumulate prior evidence; a tighter alpha widens the interval and makes the verdict harder to resolve. Source: src/agentassay/verdicts/verdict.py:1-60.

Framework Adapters, CLI, Dashboard and pytest Integration

Related topics: Token-Efficient Testing Pipeline and Statistical Engine, Analysis Methods, Persistence, Reporting and Deployment Operations

AgentAssay exposes its stochastic testing engine through four integration surfaces: a uniform framework-adapter API for instrumenting popular agent runtimes, a Click-based CLI for headless and CI usage, an experimental Streamlit dashboard for interactive exploration, and a pytest plugin for users who prefer pytest-style authoring. Together they form Layer 5 ("Integration") of the six-layer architecture described in the project README, wrapping the statistical core (Layers 1–4) with the ergonomics different teams need.

1. The `AgentAdapter` Contract

Every framework integration implements the abstract base class AgentAdapter defined in src/agentassay/integrations/base.py. The contract is intentionally narrow: a single run(input_data: dict) -> ExecutionTrace method, plus constructor metadata (model, agent_name, metadata). A FrameworkNotInstalledError is raised by adapters that depend on optional SDKs, accompanied by a copy-pasteable install hint such as pip install agentassay[mcp] (see the _INSTALL_HINT constant in src/agentassay/integrations/mcp_adapter.py).
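The shape of this contract can be sketched with stand-in classes. The class names AgentAdapter and ExecutionTrace come from the text, but every field, default, and the CallableAdapter wrapper below are assumptions for illustration and do not reproduce the library's definitions.

```python
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ExecutionTrace:
    """Stand-in trace record; the real model has more fields."""
    agent_name: str
    model: str
    output: Any = None
    success: bool = False
    error: Optional[str] = None
    latency_ms: float = 0.0
    metadata: dict = field(default_factory=dict)


class AgentAdapter(ABC):
    """Stand-in for the narrow adapter contract: one run() method plus metadata."""

    def __init__(self, model: str, agent_name: str, metadata: Optional[dict] = None):
        self.model = model
        self.agent_name = agent_name
        self.metadata = metadata or {}

    @abstractmethod
    def run(self, input_data: dict) -> ExecutionTrace:
        ...


class CallableAdapter(AgentAdapter):
    """Hypothetical adapter that wraps any plain Python callable."""

    def __init__(self, fn: Callable[[dict], Any], model: str = "unknown", agent_name: str = "custom"):
        super().__init__(model=model, agent_name=agent_name)
        self._fn = fn

    def run(self, input_data: dict) -> ExecutionTrace:
        start = time.perf_counter()
        try:
            output, ok, err = self._fn(input_data), True, None
        except Exception as exc:  # a failed trial is data, not a crash
            output, ok, err = None, False, str(exc)
        elapsed_ms = (time.perf_counter() - start) * 1000
        return ExecutionTrace(self.agent_name, self.model, output, ok, err, elapsed_ms, dict(self.metadata))


trace = CallableAdapter(lambda data: data["query"].upper()).run({"query": "book a flight"})
print(trace.success, trace.output)
```

Catching exceptions inside run() is a design choice in this sketch: a stochastic test harness needs failed trials recorded as traces so they count toward the pass rate.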

Framework-specific adapters

Adapter	Source file	Purpose
`CustomAdapter`	`integrations/custom_adapter.py`	Wrap any user-supplied callable; the entry point for the `quickstart.py` example
`LangGraphAdapter`	`integrations/langgraph_adapter.py`	Capture node-level traces from compiled LangGraph state machines
`CrewAIAdapter`	`integrations/crewai_adapter.py`	Wrap CrewAI crews and instrument agent/role execution
`OpenAIAgentsAdapter`	`integrations/openai_adapter.py`	Bridge the OpenAI Agents SDK tool-call loop
`AutoGenAdapter`	`integrations/autogen_adapter.py`	Support AutoGen/AG2 group chats
`SmolagentsAdapter`	`integrations/smolagents_adapter.py`	Wrap HuggingFace smolagents `CodeAgent` and `ToolCallingAgent` agents
`SemanticKernelAdapter`	`integrations/semantic_kernel_adapter.py`	Invoke Microsoft Semantic Kernel plugin functions, optionally registering `FunctionInvocationFilter` hooks for per-function timing (src/agentassay/integrations/semantic_kernel_adapter.py)
`MCPToolsAdapter`	`integrations/mcp_adapter.py`	Operate in two modes — direct MCP client or Anthropic-API mode (`mcp_mode="anthropic"`) where MCP tool definitions are converted to Anthropic's tool schema via src/agentassay/integrations/mcp_anthropic.py
`BedrockAdapter`	`integrations/bedrock_adapter.py`	Invoke AWS Bedrock Agents by `agent_id` / `agent_alias_id` (e.g. `TSTALIASID` for draft), capturing trace, rationale, and observation steps

flowchart LR
    A[Scenario Input] --> B[AgentAdapter subclass]
    B --> C[Framework SDK]
    C --> B
    B --> D[ExecutionTrace]
    D --> E[Stochastic Engine]
    E --> F[Verdict / Coverage / Mutation]

Installation uses framework-scoped extras: pip install agentassay[langgraph], [crewai], [openai], [autogen], [smolagents], [mcp], or [all]. Optional frameworks are not hard dependencies, so the base wheel stays slim (examples/README.md).
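One common way to keep optional frameworks out of the hard dependencies is to check for the SDK at adapter construction time and fail with an install hint. The sketch below illustrates that pattern; the require_framework helper, the ImportError base class, and the message wording are assumptions, not the library's implementation.

```python
import importlib.util


class FrameworkNotInstalledError(ImportError):
    """Raised when an adapter's optional SDK is missing (base class is an assumption)."""


def require_framework(module_name: str, extra: str) -> None:
    """Fail fast with a copy-pasteable hint if a top-level module is not importable."""
    if importlib.util.find_spec(module_name) is None:
        raise FrameworkNotInstalledError(
            f"'{module_name}' is not installed. Install it with: pip install agentassay[{extra}]"
        )


require_framework("json", "all")  # standard-library module, so this passes silently
try:
    require_framework("surely_not_an_installed_sdk", "mcp")
except FrameworkNotInstalledError as exc:
    print(exc)
```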

2. Command-Line Interface

The CLI is defined in src/agentassay/cli/main.py as a click.group whose version output, _version_message, prints the Qualixar/AGPL attribution notice. Subcommands are mounted from sibling modules:

agentassay run — execute a stochastic trial plan (defined in cmd_run.py)
agentassay compare — diff two result sets to detect regression (cmd_compare.py)
agentassay coverage — compute 5-dimensional coverage from a directory of traces (cmd_coverage.py); supports --traces and a tool whitelist like --tools search,book,cancel
agentassay mutate — apply prompt, tool, or model mutation operators (cmd_mutate.py)
agentassay report — produce a self-contained HTML report using wilson_interval for the displayed pass-rate band (src/agentassay/cli/cmd_report.py)
agentassay test-report — run the full pytest suite and emit an HTML dashboard with module-level pass/fail breakdown (src/agentassay/cli/cmd_test_report.py)
agentassay demo and agentassay dashboard — interactive entry points

The dashboard command is marked [EXPERIMENTAL] in its docstring and accepts --port (default 8501), --host (default localhost), --no-browser, and a --theme choice of dark or light (src/agentassay/cli/cmd_dashboard.py).

3. Streamlit Dashboard

The dashboard entry point is src/agentassay/dashboard/app.py. It uses st.set_page_config and a sidebar radio control exposing four pages: Overview, Test Run, History, and Fingerprints (the version caption is rendered from agentassay.__version__). Persistence is backed by a SQLite ResultStore whose path is taken from the AGENTASSAY_DB_PATH environment variable, defaulting to ~/.agentassay/results.db. A QueryAPI wraps the store and is passed into per-view renderers (view_overview, view_test_run, view_history, view_fingerprints) — these imports are lazy to keep cold-start time low.

4. pytest Integration

For teams already invested in pytest, AgentAssay provides a marker-based entry point. A representative test from the documentation looks like:

import pytest

@pytest.mark.agentassay(n=30, threshold=0.80)
def test_booking_agent_passes():
    ...

The full integration is shipped via the [dev] extra (pip install agentassay[dev]) and is demonstrated by examples/pytest-plugin/test_my_agent.py. Statistical assertions — Wilson confidence intervals, hypothesis tests, and three-valued PASS/FAIL/INCONCLUSIVE verdicts — are surfaced as native pytest outcomes, so the same suite runs in local development and in CI deployment gates.

Analysis Methods, Persistence, Reporting and Deployment Operations

Related topics: Token-Efficient Testing Pipeline and Statistical Engine, Framework Adapters, CLI, Dashboard and pytest Integration

Overview

AgentAssay treats AI agent testing as a stochastic, statistical problem rather than a deterministic unit-test problem. Beyond running trials, the framework provides a complete post-execution pipeline: analysis methods that probe test sensitivity, persistence that stores every run for reproducibility, reporting that surfaces verdicts to humans, and deployment operations that gate CI/CD pipelines with statistical evidence. These capabilities are described in the architecture overview in README.md as Layers 4 through 6 of the layered design.

The v0.1.2 production release (per the v0.1.2 release notes) hardened all of these subsystems with strict mypy typing, ruff-clean linting, and CI gates across Python 3.10/3.11/3.12, so the analysis, persistence, and reporting paths are considered production-ready.

Analysis Methods

Mutation Framework

Analysis begins with mutation testing — perturbing the agent or its inputs to verify that the test suite can detect the perturbation. The foundation is defined in src/agentassay/mutation/base.py, where MutationOperator is an abstract base class declaring the contract every operator must satisfy:

Attribute / Method	Purpose
`name`	Stable identifier (e.g. `"prompt_synonym"`).
`category`	One of `"prompt"`, `"tool"`, `"model"`, `"context"`.
`description`	Human-readable explanation of the mutation.
`mutate(config, scenario)`	Returns a new `(AgentConfig, TestScenario)` pair; must deep-copy inputs.
`describe_mutation()`	Returns a record of what was changed in the most recent call.

Source: src/agentassay/mutation/base.py

A concrete example is PromptSynonymMutator in src/agentassay/mutation/prompt_ops.py. It targets the "prompt" or "instructions" key in scenario.input_data, replacing key words with synonyms drawn from a built-in verb/modifier table (e.g. "create" -> ["generate", "produce", "build", ...]). If neither prompt key exists, the mutation is a no-op, which is by design rather than a failure. The mutator accepts max_replacements, an optional custom synonym_table, and an RNG seed for reproducibility.
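The behavior described above can be sketched as follows. Plain dicts stand in for AgentConfig and TestScenario, the class is named SynonymMutator to make clear it is not the library's PromptSynonymMutator, and the synonym table is a one-entry placeholder.

```python
import copy
import random
from abc import ABC, abstractmethod


class MutationOperator(ABC):
    """Stand-in base class mirroring the attributes listed in the table."""
    name = "base"
    category = "prompt"
    description = ""

    @abstractmethod
    def mutate(self, config: dict, scenario: dict) -> tuple[dict, dict]:
        ...


class SynonymMutator(MutationOperator):
    name = "prompt_synonym"
    category = "prompt"
    description = "Replace key prompt words with synonyms."

    def __init__(self, synonym_table=None, max_replacements=1, seed=None):
        self.table = synonym_table or {"create": ["generate", "produce", "build"]}
        self.max_replacements = max_replacements
        self.rng = random.Random(seed)  # seeded for reproducible mutations

    def mutate(self, config, scenario):
        # Deep-copy first so the caller's objects are never modified.
        new_config, new_scenario = copy.deepcopy(config), copy.deepcopy(scenario)
        data = new_scenario.get("input_data", {})
        key = next((k for k in ("prompt", "instructions") if k in data), None)
        if key is None:
            return new_config, new_scenario  # no prompt key: intentional no-op
        words, replaced = data[key].split(), 0
        for i, word in enumerate(words):
            if replaced >= self.max_replacements:
                break
            options = self.table.get(word.lower())
            if options:
                words[i] = self.rng.choice(options)
                replaced += 1
        data[key] = " ".join(words)
        return new_config, new_scenario


original = {"input_data": {"prompt": "create a booking"}}
_, mutated = SynonymMutator(seed=0).mutate({}, original)
print(original["input_data"]["prompt"], "->", mutated["input_data"]["prompt"])
```

If the test suite still passes on the mutated scenario when it should not, the suite is insensitive to that perturbation, which is the signal mutation testing is designed to surface.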

Other Analysis Layers

The README positions coverage (5-dimensional), metamorphic testing, and the contract oracle alongside mutation as Layer 4 analysis primitives. These consume the ExecutionTrace produced by the adapters described below and emit structured signals that downstream statistical verdicts operate on.

Persistence

Schema

Persistence is implemented as a SQLite-backed result store. The DDL in src/agentassay/persistence/schema.py defines a normalized relational model with foreign-key relationships:

projects: top-level container, keyed by id.
runs: references projects(id), stores agent_name, agent_version, model, framework, config_json, timing, and status.
trials: references runs(id), stores per-trial success, latency_ms, cost, token_count, step_count, error_msg, and the full trace_json.
verdicts, coverage, fingerprints, gate_decisions, costs: derived tables that attach statistical outcomes, coverage measurements, behavioral fingerprints, gate decisions, and cost records back to the originating run.

Source: src/agentassay/persistence/schema.py
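A reduced version of this relational model can be sketched with the standard sqlite3 module. Table and column names follow the list above; the column types, the projects.name column, and the omission of timing columns and derived tables are assumptions made to keep the example short.

```python
import sqlite3

SCHEMA = """
CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE runs (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES projects(id),
    agent_name TEXT, agent_version TEXT, model TEXT, framework TEXT,
    config_json TEXT, status TEXT
);
CREATE TABLE trials (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES runs(id),
    success INTEGER, latency_ms REAL, cost REAL, token_count INTEGER,
    step_count INTEGER, error_msg TEXT, trace_json TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript(SCHEMA)

conn.execute("INSERT INTO projects (id, name) VALUES (1, 'booking-agent')")
conn.execute(
    "INSERT INTO runs (id, project_id, agent_name, model, framework, status) "
    "VALUES (1, 1, 'booker', 'some-model', 'custom', 'completed')"
)
conn.executemany(
    "INSERT INTO trials (run_id, success, latency_ms, token_count) VALUES (1, ?, ?, ?)",
    [(1, 820.0, 512), (1, 910.5, 498), (0, 1200.0, 640)],
)
conn.commit()

# Because every trial is stored, the pass rate can be recomputed at any time.
pass_rate = conn.execute("SELECT AVG(success) FROM trials WHERE run_id = 1").fetchone()[0]
print(f"stored pass rate: {pass_rate:.2f}")
```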

Storage API

src/agentassay/persistence/storage.py wraps the schema with typed insertion helpers. For example, save_gate_decision() accepts run_id, pipeline (identifier such as "main" or "staging"), decision ("DEPLOY", "BLOCK", or "WARN"), a reason string, a JSON-encoded rules_json, and optional commit_sha / pr_number fields — exactly the metadata a CI/CD system needs to audit a gate event. save_cost() records token and dollar accounting per run, providing the cost signal that drives Layer 6 efficiency optimizations.
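A helper with the parameters listed above might look like the sketch below. It takes a raw sqlite3 connection rather than a ResultStore instance, and the table definition and CHECK constraint are assumptions; only the parameter names and the three decision values come from the text.

```python
import sqlite3

GATE_DDL = """
CREATE TABLE gate_decisions (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL,
    pipeline TEXT NOT NULL,
    decision TEXT NOT NULL CHECK (decision IN ('DEPLOY', 'BLOCK', 'WARN')),
    reason TEXT NOT NULL,
    rules_json TEXT NOT NULL,
    commit_sha TEXT,
    pr_number INTEGER
)
"""


def save_gate_decision(store, run_id, pipeline, decision, reason, rules_json,
                       commit_sha=None, pr_number=None) -> int:
    """Insert one gate event and return its row id (illustrative, not the real method)."""
    cur = store.execute(
        "INSERT INTO gate_decisions "
        "(run_id, pipeline, decision, reason, rules_json, commit_sha, pr_number) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (run_id, pipeline, decision, reason, rules_json, commit_sha, pr_number),
    )
    store.commit()
    return cur.lastrowid


store = sqlite3.connect(":memory:")
store.execute(GATE_DDL)
row_id = save_gate_decision(
    store, 1, "main", "BLOCK", "pass rate below threshold",
    '{"min_pass_rate": 0.8}', commit_sha="abc1234", pr_number=42,
)
print(store.execute("SELECT decision, commit_sha, pr_number FROM gate_decisions").fetchone())
```

Enforcing the allowed decision values in the database itself means a malformed gate event is rejected at write time rather than discovered during an audit.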

Dashboard

src/agentassay/dashboard/app.py is a Streamlit application that mounts a ResultStore against AGENTASSAY_DB_PATH (defaulting to ~/.agentassay/results.db) and routes between four views: Overview, Test Run, History, and Fingerprints. Each view imports a dedicated renderer module on demand, keeping startup cost low. Source: src/agentassay/dashboard/app.py

Reporting

The CLI reporting path is implemented in src/agentassay/cli/cmd_report.py. The agentassay report command accepts --results (a JSON file of trial outcomes) and --output (default agentassay-report.html), loads the results through load_json, extracts the per-trial pass/fail vector via extract_passed_list, and computes a Wilson confidence interval using wilson_interval from src/agentassay/statistics/confidence.py. The resulting HTML is self-contained — pass rate, CI bounds, and a verdict panel — suitable for archiving or sharing without external assets. The CLI also exposes agentassay run, compare, mutate, coverage, and test-report commands per README.md.

Deployment Operations

Deployment operations integrate framework adapters, statistical verdicts, and the persistence layer into a single gateable workflow:

flowchart LR
    A[Framework Adapter] --> B[Trial Execution]
    B --> C[ExecutionTrace]
    C --> D[Persistence / ResultStore]
    C --> E[Analysis: Coverage + Mutation]
    E --> F[Stochastic Verdict]
    F --> G{Gate Decision}
    G -->|DEPLOY| H[CI/CD Pass]
    G -->|BLOCK/WARN| I[CI/CD Fail]
    D --> J[Dashboard / Report]

The adapter layer feeds deployment operations with structured traces. src/agentassay/integrations/mcp_adapter.py exposes direct MCP and Anthropic-MCP modes, returning an ExecutionTrace with per-tool-call StepTrace entries and timing captured via time.perf_counter(). src/agentassay/integrations/bedrock_adapter.py wraps AWS Bedrock Agents and captures each rationale, observation, and knowledge-base lookup as a step. src/agentassay/integrations/semantic_kernel_adapter.py registers a temporary FunctionInvocationFilter on the kernel to capture function invocations non-intrusively. src/agentassay/integrations/langgraph_adapter.py classifies LangGraph node outputs by inspecting tool_calls, message content, or node-name heuristics so each graph node becomes a typed step.

All adapter output funnels into the ResultStore, where verdicts, gate decisions, and cost records are persisted. The gate_decisions table ties every DEPLOY/BLOCK/WARN outcome back to a commit_sha and pr_number, giving auditors a complete chain of custody from code change to statistical evidence.

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 6 structured pitfall items, none of them high or blocking. Top priority: Capability evidence risk (requires verification).

1. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | https://github.com/qualixar/agentassay

2. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/qualixar/agentassay

3. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: downstream_validation.risk_items | https://github.com/qualixar/agentassay

4. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: risks.scoring_risks | https://github.com/qualixar/agentassay

5. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: issue_or_pr_quality=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/qualixar/agentassay

6. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: release_recency=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/qualixar/agentassay

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using agentassay with real data or production workflows.

v0.1.2 — Production Release - github / github_release
v0.1.1 — Initial Public Release - github / github_release
Capability evidence risk requires verification - GitHub / issue

Source: Project Pack community evidence and pitfall evidence
