# https://github.com/qualixar/agentassay Project Manual

Generated at: 2026-06-20 05:22:01 UTC

## Table of Contents

- [Introduction and Layered Architecture](#page-1)
- [Token-Efficient Testing Pipeline and Statistical Engine](#page-2)
- [Framework Adapters, CLI, Dashboard and pytest Integration](#page-3)
- [Analysis Methods, Persistence, Reporting and Deployment Operations](#page-4)

<a id='page-1'></a>

## Introduction and Layered Architecture

### Related Pages

Related topics: [Token-Efficient Testing Pipeline and Statistical Engine](#page-2), [Framework Adapters, CLI, Dashboard and pytest Integration](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/qualixar/agentassay/blob/main/README.md)
- [src/agentassay/mutation/base.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/mutation/base.py)
- [src/agentassay/mutation/prompt_ops.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/mutation/prompt_ops.py)
- [src/agentassay/integrations/mcp_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/mcp_adapter.py)
- [src/agentassay/integrations/mcp_anthropic.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/mcp_anthropic.py)
- [src/agentassay/integrations/semantic_kernel_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/semantic_kernel_adapter.py)
- [src/agentassay/integrations/bedrock_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/bedrock_adapter.py)
- [src/agentassay/cli/cmd_report.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/cli/cmd_report.py)
- [src/agentassay/persistence/schema.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/persistence/schema.py)
- [src/agentassay/persistence/storage.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/persistence/storage.py)
- [src/agentassay/dashboard/app.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/dashboard/app.py)
- [src/agentassay/dashboard/helpers.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/dashboard/helpers.py)
- [examples/README.md](https://github.com/qualixar/agentassay/blob/main/examples/README.md)
</details>

# Introduction and Layered Architecture

## Overview

AgentAssay is an agent testing framework that delivers statistical guarantees for non-deterministic AI agent workflows without burning token budgets. The project's tagline — "Test More. Spend Less. Ship Confident." — captures the core value proposition: produce rigorous verdicts on agent behavior while reducing token costs 5–20x compared to naïve re-run strategies. As described in [README.md:1-30](https://github.com/qualixar/agentassay/blob/main/README.md), AgentAssay is part of the Qualixar AI Agent Reliability Platform and is backed by the paper "AgentAssay: Formal Regression Testing for Non-Deterministic AI Agent Workflows" (arXiv:2603.02601).

The framework ships ten framework adapters, a pytest plugin, a CLI with five core commands (`run`, `compare`, `mutate`, `coverage`, `report`) alongside auxiliary `test-report`, `demo`, and `dashboard` entry points, and a Streamlit dashboard. Release v0.1.2 is a production-grade build with zero CI warnings, full mypy strict-mode compliance, and ruff-clean code across Python 3.10/3.11/3.12 (per the [v0.1.2 release notes](https://github.com/qualixar/agentassay/releases/tag/v0.1.2)).

## The Six-Layer Stack

The codebase is organized as a six-layer architecture, with each layer building on the abstractions of the layer below. Layers 1–5 ensure that every test run produces rigorous results; Layer 6 optimizes *how many* runs are needed. The full stack is documented in [README.md:120-145](https://github.com/qualixar/agentassay/blob/main/README.md).

```mermaid
flowchart TB
    L6["Layer 6: Efficiency<br/>Fingerprinting · Budget Optimization · Trace Analysis<br/>Multi-Fidelity · Warm-Start Sequential"]
    L5["Layer 5: Integration<br/>Framework Adapters · pytest Plugin · CLI · Reporting"]
    L4["Layer 4: Analysis<br/>Coverage (5D) · Mutation · Metamorphic · Contract Oracle"]
    L3["Layer 3: Verdicts<br/>Stochastic Verdicts · Deployment Gates"]
    L2["Layer 2: Statistics<br/>Hypothesis Tests · Confidence Intervals · SPRT · Effect Size"]
    L1["Layer 1: Core<br/>Data Models · Execution Engine · Trace Format"]
    L6 --> L5 --> L4 --> L3 --> L2 --> L1
```

### Layer 1 — Core

The core layer provides the foundational data models, the execution engine that runs scenarios through an agent, and the canonical `ExecutionTrace` format. The execution engine captures per-step traces, latencies, token counts, and success flags consumed by every higher layer. Frameworks plug into the core via a uniform `run(input_data) -> ExecutionTrace` contract, as seen in the MCP adapter at [src/agentassay/integrations/mcp_adapter.py:120-150](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/mcp_adapter.py) and the Semantic Kernel adapter at [src/agentassay/integrations/semantic_kernel_adapter.py:60-100](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/semantic_kernel_adapter.py).

### Layer 2 — Statistics

Layer 2 implements the statistical machinery: hypothesis tests, Wilson confidence intervals, Sequential Probability Ratio Tests (SPRT), and effect-size calculations. These primitives are what allow AgentAssay to produce three-valued verdicts (PASS / FAIL / INCONCLUSIVE) with confidence intervals rather than misleading point estimates. The Wilson interval is exposed in CLI code at [src/agentassay/cli/cmd_report.py:60-75](https://github.com/qualixar/agentassay/blob/main/src/agentassay/cli/cmd_report.py) for HTML report generation.
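To make the difference between an interval and a point estimate concrete, here is a minimal sketch of the textbook Wilson score interval. The function name `wilson_interval_sketch` and its signature are illustrative assumptions; the library's own `wilson_interval` may take different arguments.

```python
import math
from statistics import NormalDist


def wilson_interval_sketch(
    successes: int, trials: int, confidence: float = 0.95
) -> tuple[float, float]:
    """Textbook Wilson score interval for a binomial pass rate (hypothetical helper)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / trials

    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


low, high = wilson_interval_sketch(24, 30)
print(f"observed 0.80, 95% interval [{low:.3f}, {high:.3f}]")
```

With 24 passes in 30 trials the observed rate is 0.80, but the interval is wide enough that a threshold of 0.80 cannot be confirmed or rejected, which is the situation the INCONCLUSIVE verdict exists for.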

### Layer 3 — Verdicts

Layer 3 consumes Layer 2 statistics to issue stochastic verdicts and deployment gates. A gate decision is a structured record (`DEPLOY` / `BLOCK` / `WARN` with reason text and rules JSON) persisted for audit, as shown in [src/agentassay/persistence/storage.py:80-130](https://github.com/qualixar/agentassay/blob/main/src/agentassay/persistence/storage.py). Gates can be wired into CI/CD pipelines to block broken deployments with statistical evidence.
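The step from an interval to a gate decision can be sketched as follows. This is an illustration under stated assumptions: the class `GateDecisionSketch`, the function names, and the PASS-to-DEPLOY, FAIL-to-BLOCK, INCONCLUSIVE-to-WARN mapping are hypothetical and are not taken from the AgentAssay source.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecisionSketch:
    """Hypothetical audit record; field names are assumptions."""

    decision: str    # "DEPLOY", "BLOCK", or "WARN"
    reason: str      # human-readable explanation kept for audit
    rules_json: str  # serialized rules that produced the decision


def verdict_from_interval(ci_low: float, ci_high: float, threshold: float) -> str:
    """Three-valued verdict: decide only when the whole interval agrees."""
    if ci_low >= threshold:
        return "PASS"
    if ci_high < threshold:
        return "FAIL"
    return "INCONCLUSIVE"


def gate(ci_low: float, ci_high: float, threshold: float) -> GateDecisionSketch:
    verdict = verdict_from_interval(ci_low, ci_high, threshold)
    # Assumed mapping from verdict to gate outcome.
    decision = {"PASS": "DEPLOY", "FAIL": "BLOCK", "INCONCLUSIVE": "WARN"}[verdict]
    reason = (
        f"{verdict}: pass-rate interval [{ci_low:.3f}, {ci_high:.3f}] "
        f"vs threshold {threshold:.2f}"
    )
    rules = json.dumps({"threshold": threshold}, sort_keys=True)
    return GateDecisionSketch(decision, reason, rules)


print(gate(0.63, 0.90, 0.80))
```

Keeping the reason text and the serialized rules next to the decision is what makes the record usable as audit evidence later.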

### Layer 4 — Analysis

Layer 4 adds four analytical passes over traces and runs: five-dimensional coverage (tool, path, state, boundary, model), mutation testing, metamorphic testing, and a contract oracle. The mutation subsystem is defined in [src/agentassay/mutation/base.py:1-80](https://github.com/qualixar/agentassay/blob/main/src/agentassay/mutation/base.py), which establishes a `MutationOperator` abstract base class with `name`, `category` (one of `prompt`, `tool`, `model`, `context`), and `description` class attributes plus an abstract `mutate(config, scenario) -> tuple[AgentConfig, TestScenario]`. Concrete operators such as `PromptSynonymMutator` in [src/agentassay/mutation/prompt_ops.py:60-110](https://github.com/qualixar/agentassay/blob/main/src/agentassay/mutation/prompt_ops.py) replace key prompt words with synonyms to validate test sensitivity. Operators are required to deep-copy inputs so the originals are never mutated ([src/agentassay/mutation/base.py:55-70](https://github.com/qualixar/agentassay/blob/main/src/agentassay/mutation/base.py)).

### Layer 5 — Integration

Layer 5 is the surface area users actually touch: framework adapters, the pytest plugin, the CLI, and the dashboard. Framework adapters exist for LangGraph, CrewAI, OpenAI Agents, AutoGen/AG2, smolagents, MCP (direct and Anthropic bridge), Semantic Kernel, and AWS Bedrock — see [src/agentassay/integrations/mcp_anthropic.py:1-60](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/mcp_anthropic.py) for the Anthropic bridge and [src/agentassay/integrations/bedrock_adapter.py:1-80](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/bedrock_adapter.py) for the Bedrock adapter signature. The CLI report command in [src/agentassay/cli/cmd_report.py:1-50](https://github.com/qualixar/agentassay/blob/main/src/agentassay/cli/cmd_report.py) generates a self-contained HTML report with verdict, pass-rate table, and Wilson confidence interval. The Streamlit dashboard is a thin router in [src/agentassay/dashboard/app.py:1-60](https://github.com/qualixar/agentassay/blob/main/src/agentassay/dashboard/app.py) that dispatches to four views: Overview, Test Run, History, and Fingerprints. All views share utility helpers from [src/agentassay/dashboard/helpers.py:1-40](https://github.com/qualixar/agentassay/blob/main/src/agentassay/dashboard/helpers.py) for metric cards and dark-themed Plotly charts. Working examples for every framework live under `examples/framework-specific/` as catalogued in [examples/README.md:1-30](https://github.com/qualixar/agentassay/blob/main/examples/README.md).

### Layer 6 — Efficiency

Layer 6 is the differentiator. It houses behavioral fingerprinting, adaptive budget optimization, trace-first offline analysis, multi-fidelity proxy testing, and warm-start sequential testing. Together these techniques achieve the headline 5–20x token reduction: trace-first analysis reuses existing production traces for coverage, contracts, and metamorphic checks at zero token cost, while adaptive budgets calibrate variance and compute the exact minimum `N` per scenario.

## Persistence and Audit Trail

Behind every layer is a SQLite-backed persistence layer with a fixed schema covering projects, runs, trials, verdicts, coverage, fingerprints, gate decisions, and cost accounting. The DDL is defined in [src/agentassay/persistence/schema.py:1-80](https://github.com/qualixar/agentassay/blob/main/src/agentassay/persistence/schema.py) and the `ResultStore` in [src/agentassay/persistence/storage.py:1-50](https://github.com/qualixar/agentassay/blob/main/src/agentassay/persistence/storage.py) provides insert methods for each entity. The default database path is `~/.agentassay/results.db`, overridable via the `AGENTASSAY_DB_PATH` environment variable ([src/agentassay/dashboard/app.py:30-40](https://github.com/qualixar/agentassay/blob/main/src/agentassay/dashboard/app.py)).

## See Also

- Token-Efficient Testing concept — the design rationale for Layer 6
- Stochastic Testing concept — the rationale for three-valued verdicts
- Coverage Metrics — the five-dimensional coverage model in Layer 4
- CLI Reference — full documentation of the `run`, `compare`, `mutate`, `coverage`, and `report` commands
- Framework Adapters — adapter-specific guides for LangGraph, CrewAI, MCP, Bedrock, etc.

---

<a id='page-2'></a>

## Token-Efficient Testing Pipeline and Statistical Engine

### Related Pages

Related topics: [Introduction and Layered Architecture](#page-1), [Framework Adapters, CLI, Dashboard and pytest Integration](#page-3), [Analysis Methods, Persistence, Reporting and Deployment Operations](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [src/agentassay/efficiency/fingerprint.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/efficiency/fingerprint.py)
- [src/agentassay/efficiency/budget.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/efficiency/budget.py)
- [src/agentassay/efficiency/multi_fidelity.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/efficiency/multi_fidelity.py)
- [src/agentassay/efficiency/warm_start.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/efficiency/warm_start.py)
- [src/agentassay/efficiency/regression.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/efficiency/regression.py)
- [src/agentassay/efficiency/distribution.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/efficiency/distribution.py)
- [src/agentassay/statistics/confidence.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/statistics/confidence.py)
- [src/agentassay/statistics/hypothesis_legacy.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/statistics/hypothesis_legacy.py)
- [src/agentassay/statistics/effect_size.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/statistics/effect_size.py)
- [src/agentassay/verdicts/verdict.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/verdicts/verdict.py)
- [src/agentassay/coverage/aggregate.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/coverage/aggregate.py)
- [src/agentassay/reporting/html.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/reporting/html.py)
- [src/agentassay/reporting/json_export.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/reporting/json_export.py)
</details>

# Token-Efficient Testing Pipeline and Statistical Engine

## Overview

AgentAssay's token-efficient testing pipeline is what separates it from conventional agent evaluation frameworks. The system sits on **Layer 6 (Efficiency)**, the top of the documented six-layer architecture, and builds on the statistical engine in Layer 2. Its job is to reduce the trial count required to reach a statistically defensible verdict. Source: [README.md:1-200]()

The headline claim of "5–20x token cost reduction" comes from composing three independent techniques (Source: [README.md:1-150]()):

1. **Behavioral Fingerprinting** — extract compact, low-dimensional trace vectors and detect regressions with Hotelling's T² instead of comparing noisy raw text. Source: [examples/README.md:1-60]().
2. **Adaptive Budget Optimization** — calibrate behavioral variance on a small pilot, then compute the exact minimum N for the target confidence. Source: [README.md:1-150]().
3. **Trace-First Offline Analysis** — re-use existing production traces to compute coverage, contracts, and metamorphic relations at zero additional token cost. Source: [README.md:1-150]().

These techniques are bundled with multi-fidelity proxy testing and warm-start sequential testing, allowing cheaper models and prior results to short-circuit the full evaluation. Source: [README.md:1-200]().

## Architecture

```mermaid
flowchart TD
    A[Scenario Definition] --> B[Pilot Run n=5-10]
    B --> C[Behavioral Variance Estimation]
    C --> D[Adaptive Budget N*]
    D --> E[Fingerprint Extraction]
    E --> F[Hotelling T² or Fisher Test]
    F --> G{SPRT Decision}
    G -->|Pass| H[Verdict: PASS]
    G -->|Fail| I[Verdict: FAIL]
    G -->|Insufficient Evidence| J[Verdict: INCONCLUSIVE]
    H --> K[Wilson CI + Effect Size]
    I --> K
    J --> L[Warm-Start Prior for Next Run]
    L --> A
    K --> M[HTML / JSON Report]
```

The pipeline mirrors the layered architecture described in the README. Layer 1 (Core) provides the `ExecutionTrace` and `StepTrace` data models used as input; Layer 2 (Statistics) supplies the `wilson_interval` and hypothesis primitives; Layer 3 (Verdicts) consumes them; Layer 4 (Analysis) feeds coverage and metamorphic checks from existing traces; Layer 5 (Integration) exposes them via the CLI; and Layer 6 (Efficiency) orchestrates the short-circuit. Source: [README.md:1-200]().

## Statistical Engine

The statistical engine is the rigorous core that the efficiency layer leans on. It exposes three families of primitives:

- **Confidence Intervals** — `wilson_interval` is invoked from both the CLI report generator and the test-report path. Source: [src/agentassay/cli/cmd_report.py:1-90](), [src/agentassay/cli/cmd_test_report.py:1-60]().
- **Hypothesis Tests** — The legacy module enumerates `FISHER`, `CHI2`, `BARNARD` for binary regression detection and `MANN_WHITNEY`, `KS`, `WELCH_T` for continuous score comparison, returning a frozen `HypothesisResult` dataclass. Source: [src/agentassay/statistics/hypothesis_legacy.py:1-90]().
- **Effect Size** — `cohens_h` is re-exported from `agentassay.statistics.effect_size` and consumed by the hypothesis module. Source: [src/agentassay/statistics/hypothesis_legacy.py:1-60](), [src/agentassay/statistics/effect_size.py:1-40]().

These primitives are pure functions, which makes them safe to invoke from both the trial runner and offline trace analyzers. Verdicts are not a single PASS/FAIL bit: the engine distinguishes PASS, FAIL, and INCONCLUSIVE based on confidence-interval overlap and SPRT boundaries, then attaches an effect size and Wilson CI for context. Source: [src/agentassay/verdicts/verdict.py:1-60]().
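As an illustration of the effect-size primitive, the following sketch implements the standard Cohen's h formula for two proportions. The name `cohens_h_sketch` is hypothetical, and the library's `cohens_h` may differ in signature and sign convention.

```python
import math


def cohens_h_sketch(p1: float, p2: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Baseline pass rate 0.90 versus candidate pass rate 0.70.
h = cohens_h_sketch(0.90, 0.70)
print(f"h = {h:.3f}")
```

By Cohen's conventional rule of thumb, |h| near 0.2 is a small effect, near 0.5 a medium one, and near 0.8 a large one, so a drop from 0.90 to 0.70 counts as roughly medium.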

## Behavioral Fingerprinting

`BehavioralFingerprint` is the entry point for trace compression. It produces a low-dimensional vector representation of what an agent *did* — tool sequences, state transitions, decision patterns — rather than what it *said*. Source: [src/agentassay/efficiency/fingerprint.py:1-60]().

The fingerprint is consumed by the regression test in `efficiency.regression`, which applies Hotelling's T² for multivariate change detection. Because the fingerprint dimensionality is small, the test requires far fewer samples than a string-similarity comparison, yielding the 3–5x savings quoted in the examples documentation. Source: [examples/README.md:1-60](), [src/agentassay/efficiency/regression.py:1-60]().
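A two-sample Hotelling's T² statistic on fingerprint vectors can be sketched in pure Python as below. This is the textbook pooled-covariance form and is an assumption, not the code in `efficiency.regression`; all function names are hypothetical, and the fingerprints are plain lists of floats.

```python
from statistics import fmean

Vector = list[float]


def _mean_vector(samples: list[Vector]) -> Vector:
    return [fmean(s[j] for s in samples) for j in range(len(samples[0]))]


def _scatter(samples: list[Vector], mean: Vector) -> list[Vector]:
    """Sum of outer products of deviations from the mean."""
    dim = len(mean)
    out = [[0.0] * dim for _ in range(dim)]
    for s in samples:
        d = [s[j] - mean[j] for j in range(dim)]
        for i in range(dim):
            for j in range(dim):
                out[i][j] += d[i] * d[j]
    return out


def _solve(a: list[Vector], b: Vector) -> Vector:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("pooled covariance is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def hotelling_t2(baseline: list[Vector], candidate: list[Vector]) -> float:
    """T² = (n1*n2/(n1+n2)) * d' S_pooled^-1 d, with d the mean difference."""
    n1, n2 = len(baseline), len(candidate)
    m1, m2 = _mean_vector(baseline), _mean_vector(candidate)
    s1, s2 = _scatter(baseline, m1), _scatter(candidate, m2)
    dim = len(m1)
    pooled = [
        [(s1[i][j] + s2[i][j]) / (n1 + n2 - 2) for j in range(dim)]
        for i in range(dim)
    ]
    diff = [x - y for x, y in zip(m1, m2)]
    weights = _solve(pooled, diff)
    return (n1 * n2 / (n1 + n2)) * sum(d * w for d, w in zip(diff, weights))


baseline = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
shifted = [[x + 2.0, y] for x, y in baseline]
print(hotelling_t2(baseline, shifted))
```

For a p-dimensional fingerprint, the statistic is conventionally converted to an F value via F = T² · (n1 + n2 − p − 1) / (p · (n1 + n2 − 2)) with (p, n1 + n2 − p − 1) degrees of freedom. The test needs n1 + n2 − 2 ≥ p for the pooled covariance to be invertible, which is why a compact fingerprint keeps sample requirements low.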

The same fingerprint is also fed to the coverage collector so that downstream 5D coverage metrics (tool, path, state, boundary, model) can be computed on a fingerprint histogram without re-running the agent. Source: [src/agentassay/coverage/aggregate.py:1-60](), [README.md:1-200]().

## Adaptive Budget & Sequential Testing

`agentassay.efficiency.budget` is the module that removes the "guess N=30" problem. The flow is:

1. Run a small calibration set (5–10 trials).
2. Measure the behavioral variance from the fingerprint vectors.
3. Solve for the smallest N that achieves the configured power and alpha.

This delivers the 2–4x reduction in trial count quoted in the documentation. Source: [examples/README.md:1-60](), [src/agentassay/efficiency/budget.py:1-60]().
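The three steps above can be sketched with the textbook normal-approximation sample-size formula. This is an assumption for illustration: the function name, its parameters, and the use of a scalar per-trial score (rather than fingerprint vectors) are hypothetical, and the actual solver in `agentassay.efficiency.budget` may work differently.

```python
import math
from statistics import NormalDist, variance


def minimum_trials(
    pilot_scores: list[float],
    min_detectable_shift: float,
    alpha: float = 0.05,
    power: float = 0.80,
    floor: int = 5,
) -> int:
    """Smallest N for a one-sided test to detect the given shift (approximate)."""
    if len(pilot_scores) < 2:
        raise ValueError("need at least two pilot trials to estimate variance")
    if min_detectable_shift <= 0:
        raise ValueError("min_detectable_shift must be positive")

    sigma2 = variance(pilot_scores)            # step 2: variance from the pilot
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical value for the false-positive rate
    z_power = NormalDist().inv_cdf(power)      # critical value for the target power

    # Step 3: N = (z_alpha + z_power)^2 * sigma^2 / delta^2, rounded up.
    n = (z_alpha + z_power) ** 2 * sigma2 / min_detectable_shift**2
    return max(floor, math.ceil(n))


stable_pilot = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]  # step 1: ten pilot trials, one failure
noisy_pilot = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]   # same size, much higher variance
print(minimum_trials(stable_pilot, 0.1), minimum_trials(noisy_pilot, 0.1))
```

The stable agent needs far fewer trials than the noisy one for the same detectable shift, which is the intuition behind replacing a fixed N with a calibrated one.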

`warm_start` and `multi_fidelity` extend the same logic across *runs* and *models*. `warm_start` lets a new run incorporate the posterior from the previous run via SPRT, reaching a verdict earlier. `multi_fidelity` lets a cheap model screen the scenario first, escalating to the expensive model only when the cheap run is ambiguous. Source: [src/agentassay/efficiency/warm_start.py:1-60](), [src/agentassay/efficiency/multi_fidelity.py:1-60]().
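To illustrate how prior evidence can shorten a sequential test, the sketch below implements Wald's SPRT for pass/fail outcomes with an optional starting log-likelihood ratio. Treating warm start as a carried-over log-likelihood ratio is an assumption made for illustration; the function name, the parameters, and the hypothesised rates `p0` and `p1` are hypothetical and not taken from `warm_start.py`.

```python
import math
from collections.abc import Iterable


def sprt_sketch(
    outcomes: Iterable[bool],
    p0: float = 0.70,        # pass rate under H0 (regressed agent), assumed value
    p1: float = 0.90,        # pass rate under H1 (healthy agent), assumed value
    alpha: float = 0.05,     # tolerated false-PASS rate
    beta: float = 0.10,      # tolerated false-FAIL rate
    prior_llr: float = 0.0,  # evidence carried over from an earlier run
) -> tuple[str, int, float]:
    """Return (verdict, trials consumed, final log-likelihood ratio)."""
    upper = math.log((1 - beta) / alpha)   # cross it: accept H1
    lower = math.log(beta / (1 - alpha))   # cross it: accept H0
    step_pass = math.log(p1 / p0)
    step_fail = math.log((1 - p1) / (1 - p0))

    llr, used = prior_llr, 0
    for ok in outcomes:
        used += 1
        llr += step_pass if ok else step_fail
        if llr >= upper:
            return "PASS", used, llr
        if llr <= lower:
            return "FAIL", used, llr
    return "INCONCLUSIVE", used, llr


cold = sprt_sketch([True] * 30)
warm = sprt_sketch([True] * 30, prior_llr=1.5)
print(cold[:2], warm[:2])
```

With these settings an unbroken run of passes reaches PASS after 12 trials from a cold start and after 6 trials when starting from a prior of 1.5, so half the token spend is avoided in this example.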

## Trace-First Offline Analysis

Trace-first analysis is what makes "5–20x" possible in practice. Because coverage, contract, and metamorphic checks operate on `ExecutionTrace` objects already in the database, they run for free. Source: [src/agentassay/persistence/schema.py:1-90](), [src/agentassay/persistence/storage.py:1-80]().

The `AgentCoverageCollector` aggregates fingerprints across the stored trials and feeds the HTML reporter. Source: [src/agentassay/cli/cmd_demo.py:1-90](), [src/agentassay/coverage/aggregate.py:1-60](). The `HTMLReporter` and `JSONExporter` then emit the self-contained artifacts used by CI gates. Source: [src/agentassay/reporting/html.py:1-60](), [src/agentassay/reporting/json_export.py:1-60]().
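The zero-token nature of offline analysis is easiest to see with tool coverage. In this minimal sketch each stored trace is reduced to the list of tool names it called; that representation and the function name `tool_coverage` are assumptions, not the `AgentCoverageCollector` API.

```python
from collections.abc import Iterable, Sequence


def tool_coverage(
    traces: Iterable[Sequence[str]], declared_tools: Iterable[str]
) -> tuple[float, list[str]]:
    """Fraction of declared tools seen in stored traces, plus the unexercised ones."""
    declared = set(declared_tools)
    if not declared:
        raise ValueError("declared_tools must not be empty")
    observed = {tool for trace in traces for tool in trace}
    covered = observed & declared
    return len(covered) / len(declared), sorted(declared - covered)


stored_traces = [["search", "book"], ["search"], ["search", "book"]]
ratio, missing = tool_coverage(stored_traces, ["search", "book", "cancel"])
print(f"tool coverage {ratio:.2f}, never exercised: {missing}")
```

No agent call is made here: the computation reads only what earlier runs already recorded.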

## Common Failure Modes

- **Insufficient pilot size** — If the calibration run has fewer than 5 trials, variance estimates become unstable and the adaptive budget falls back to a conservative N. Source: [src/agentassay/efficiency/budget.py:1-60]().
- **Schema mismatch on stored traces** — Missing columns (e.g. `cost`, `token_count`) on the `trials` table cause the offline analyzers to skip those fields silently. Source: [src/agentassay/persistence/schema.py:1-90]().
- **Verdict stuck at INCONCLUSIVE** — Usually indicates that the confidence interval straddles the threshold. Collect more trials to narrow the interval, or use warm-start to accumulate prior evidence; tightening alpha alone widens the interval and makes the problem worse. Source: [src/agentassay/verdicts/verdict.py:1-60]().

## See Also

- [Architecture Overview](architecture-overview.md)
- [Coverage Metrics](coverage-metrics.md)
- [Stochastic Testing](stochastic-testing.md)
- [CLI Reference](cli-reference.md)

---

<a id='page-3'></a>

## Framework Adapters, CLI, Dashboard and pytest Integration

### Related Pages

Related topics: [Token-Efficient Testing Pipeline and Statistical Engine](#page-2), [Analysis Methods, Persistence, Reporting and Deployment Operations](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [src/agentassay/integrations/base.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/base.py)
- [src/agentassay/integrations/custom_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/custom_adapter.py)
- [src/agentassay/integrations/langgraph_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/langgraph_adapter.py)
- [src/agentassay/integrations/crewai_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/crewai_adapter.py)
- [src/agentassay/integrations/openai_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/openai_adapter.py)
- [src/agentassay/integrations/semantic_kernel_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/semantic_kernel_adapter.py)
- [src/agentassay/integrations/mcp_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/mcp_adapter.py)
- [src/agentassay/integrations/bedrock_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/bedrock_adapter.py)
- [src/agentassay/cli/main.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/cli/main.py)
- [src/agentassay/cli/cmd_run.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/cli/cmd_run.py)
- [src/agentassay/cli/cmd_compare.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/cli/cmd_compare.py)
- [src/agentassay/cli/cmd_coverage.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/cli/cmd_coverage.py)
- [src/agentassay/cli/cmd_mutate.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/cli/cmd_mutate.py)
- [src/agentassay/cli/cmd_report.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/cli/cmd_report.py)
- [src/agentassay/cli/cmd_dashboard.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/cli/cmd_dashboard.py)
- [src/agentassay/dashboard/app.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/dashboard/app.py)
</details>

# Framework Adapters, CLI, Dashboard and pytest Integration

AgentAssay exposes its stochastic testing engine through four integration surfaces: a uniform **framework-adapter** API for instrumenting popular agent runtimes, a **Click-based CLI** for headless and CI usage, an experimental **Streamlit dashboard** for interactive exploration, and a **pytest plugin** for users who prefer pytest-style authoring. Together they form Layer 5 ("Integration") of the six-layer architecture described in the project README, wrapping the statistical core (Layers 1–4) with the ergonomics different teams need.

## 1. The `AgentAdapter` Contract

Every framework integration implements the abstract base class `AgentAdapter` defined in [src/agentassay/integrations/base.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/base.py). The contract is intentionally narrow: a single `run(input_data: dict) -> ExecutionTrace` method, plus constructor metadata (`model`, `agent_name`, `metadata`). A `FrameworkNotInstalledError` is raised by adapters that depend on optional SDKs, accompanied by a copy-pasteable install hint such as `pip install agentassay[mcp]` (see the `_INSTALL_HINT` constant in [src/agentassay/integrations/mcp_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/mcp_adapter.py)).

### Framework-specific adapters

| Adapter | Source file | Purpose |
|---------|-------------|---------|
| `CustomAdapter` | `integrations/custom_adapter.py` | Wrap any user-supplied callable; the entry point for the `quickstart.py` example |
| `LangGraphAdapter` | `integrations/langgraph_adapter.py` | Capture node-level traces from compiled LangGraph state machines |
| `CrewAIAdapter` | `integrations/crewai_adapter.py` | Wrap CrewAI crews and instrument agent/role execution |
| `OpenAIAgentsAdapter` | `integrations/openai_adapter.py` | Bridge the OpenAI Agents SDK tool-call loop |
| `AutoGenAdapter` | `integrations/autogen_adapter.py` | Support AutoGen/AG2 group chats |
| `SmolagentsAdapter` | `integrations/smolagents_adapter.py` | HuggingFace smolagents `CodeAgent` and `ToolCallingAgent` |
| `SemanticKernelAdapter` | `integrations/semantic_kernel_adapter.py` | Invoke Microsoft Semantic Kernel plugin functions, optionally registering `FunctionInvocationFilter` hooks for per-function timing ([src/agentassay/integrations/semantic_kernel_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/semantic_kernel_adapter.py)) |
| `MCPToolsAdapter` | `integrations/mcp_adapter.py` | Operate in two modes — direct MCP client **or** Anthropic-API mode (`mcp_mode="anthropic"`) where MCP tool definitions are converted to Anthropic's tool schema via [src/agentassay/integrations/mcp_anthropic.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/mcp_anthropic.py) |
| `BedrockAdapter` | `integrations/bedrock_adapter.py` | Invoke AWS Bedrock Agents by `agent_id` / `agent_alias_id` (e.g. `TSTALIASID` for draft), capturing trace, rationale, and observation steps |

```mermaid
flowchart LR
    A[Scenario Input] --> B[AgentAdapter subclass]
    B --> C[Framework SDK]
    C --> B
    B --> D[ExecutionTrace]
    D --> E[Stochastic Engine]
    E --> F[Verdict / Coverage / Mutation]
```

Installation uses framework-scoped extras: `pip install agentassay[langgraph]`, `[crewai]`, `[openai]`, `[autogen]`, `[smolagents]`, `[mcp]`, or `[all]`. Optional frameworks are not hard dependencies, so the base wheel stays slim ([examples/README.md](https://github.com/qualixar/agentassay/blob/main/examples/README.md)).

## 2. Command-Line Interface

The CLI is defined in [src/agentassay/cli/main.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/cli/main.py) as a `click.group` whose `_version_message` prints the Qualixar/AGPL notice. Subcommands are mounted from sibling modules:

- `agentassay run` — execute a stochastic trial plan (defined in `cmd_run.py`)
- `agentassay compare` — diff two result sets to detect regression (`cmd_compare.py`)
- `agentassay coverage` — compute 5-dimensional coverage from a directory of traces (`cmd_coverage.py`); supports `--traces` and a tool whitelist like `--tools search,book,cancel`
- `agentassay mutate` — apply prompt, tool, or model mutation operators (`cmd_mutate.py`)
- `agentassay report` — produce a self-contained HTML report using `wilson_interval` for the displayed pass-rate band ([src/agentassay/cli/cmd_report.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/cli/cmd_report.py))
- `agentassay test-report` — run the full pytest suite and emit an HTML dashboard with module-level pass/fail breakdown ([src/agentassay/cli/cmd_test_report.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/cli/cmd_test_report.py))
- `agentassay demo` and `agentassay dashboard` — interactive entry points

The dashboard command is marked `[EXPERIMENTAL]` in its docstring and accepts `--port` (default 8501), `--host` (default `localhost`), `--no-browser`, and a `--theme` choice of `dark` or `light` ([src/agentassay/cli/cmd_dashboard.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/cli/cmd_dashboard.py)).

## 3. Streamlit Dashboard

The dashboard entry point is [src/agentassay/dashboard/app.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/dashboard/app.py). It uses `st.set_page_config` and a sidebar radio control exposing four pages: **Overview**, **Test Run**, **History**, and **Fingerprints** (the version caption is rendered from `agentassay.__version__`). Persistence is backed by a SQLite `ResultStore` whose path is taken from the `AGENTASSAY_DB_PATH` environment variable, defaulting to `~/.agentassay/results.db`. A `QueryAPI` wraps the store and is passed into per-view renderers (`view_overview`, `view_test_run`, `view_history`, `view_fingerprints`) — these imports are lazy to keep cold-start time low.

## 4. pytest Integration

For teams already invested in pytest, AgentAssay provides a marker-based entry point. A representative test from the documentation looks like:

```python
import pytest

@pytest.mark.agentassay(n=30, threshold=0.80)
def test_booking_agent_passes():
    ...
```

The full integration is shipped via the `[dev]` extra (`pip install agentassay[dev]`) and is demonstrated by `examples/pytest-plugin/test_my_agent.py`. Statistical assertions — Wilson confidence intervals, hypothesis tests, and three-valued PASS/FAIL/INCONCLUSIVE verdicts — are surfaced as native pytest outcomes, so the same suite runs in local development and in CI deployment gates.

## See Also

- [Architecture Overview](architecture-overview.md)
- [Token-Efficient Testing](concepts/token-efficient-testing.md)
- [Stochastic Testing](concepts/stochastic-testing.md)
- [CLI Reference](reference/cli.md)
- [Examples Index](https://github.com/qualixar/agentassay/blob/main/examples/README.md)

---

<a id='page-4'></a>

## Analysis Methods, Persistence, Reporting and Deployment Operations

### Related Pages

Related topics: [Token-Efficient Testing Pipeline and Statistical Engine](#page-2), [Framework Adapters, CLI, Dashboard and pytest Integration](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [src/agentassay/mutation/base.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/mutation/base.py)
- [src/agentassay/mutation/prompt_ops.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/mutation/prompt_ops.py)
- [src/agentassay/cli/cmd_report.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/cli/cmd_report.py)
- [src/agentassay/persistence/schema.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/persistence/schema.py)
- [src/agentassay/persistence/storage.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/persistence/storage.py)
- [src/agentassay/dashboard/app.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/dashboard/app.py)
- [src/agentassay/integrations/mcp_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/mcp_adapter.py)
- [src/agentassay/integrations/bedrock_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/bedrock_adapter.py)
- [src/agentassay/integrations/semantic_kernel_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/semantic_kernel_adapter.py)
- [src/agentassay/integrations/langgraph_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/langgraph_adapter.py)
- [README.md](https://github.com/qualixar/agentassay/blob/main/README.md)
</details>

# Analysis Methods, Persistence, Reporting and Deployment Operations

## Overview

AgentAssay treats AI agent testing as a stochastic, statistical problem rather than a deterministic unit-test problem. Beyond running trials, the framework provides a complete post-execution pipeline: analysis methods that probe test sensitivity, persistence that stores every run for reproducibility, reporting that surfaces verdicts to humans, and deployment operations that gate CI/CD pipelines with statistical evidence. In the layered design described in [README.md](https://github.com/qualixar/agentassay/blob/main/README.md), these capabilities fall mainly in Layers 3 through 5: deployment gates in Layer 3, analysis in Layer 4, and reporting in Layer 5, with persistence underpinning every layer.

The v0.1.2 production release (per the [v0.1.2 release notes](https://github.com/qualixar/agentassay/releases/tag/v0.1.2)) hardened all of these subsystems with strict mypy typing, ruff-clean linting, and CI gates across Python 3.10/3.11/3.12, so the analysis, persistence, and reporting paths are considered production-ready.

## Analysis Methods

### Mutation Framework

Analysis begins with mutation testing — perturbing the agent or its inputs to verify that the test suite can detect the perturbation. The foundation is defined in [src/agentassay/mutation/base.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/mutation/base.py), where `MutationOperator` is an abstract base class declaring the contract every operator must satisfy:

| Attribute / Method | Purpose |
|---|---|
| `name` | Stable identifier (e.g. `"prompt_synonym"`). |
| `category` | One of `"prompt"`, `"tool"`, `"model"`, `"context"`. |
| `description` | Human-readable explanation of the mutation. |
| `mutate(config, scenario)` | Returns a new `(AgentConfig, TestScenario)` pair; **must** deep-copy inputs. |
| `describe_mutation()` | Returns a record of what was changed in the most recent call. |

Source: [src/agentassay/mutation/base.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/mutation/base.py)
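The contract in the table can be sketched as a small abstract base class. This is a minimal illustration, not the code in `base.py`: the `AgentConfig` and `TestScenario` dataclasses are simplified stand-ins, `UppercasePromptMutator` is a hypothetical operator invented for the example, and the dictionary returned by `describe_mutation()` is an assumed shape.

```python
from __future__ import annotations

import copy
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AgentConfig:
    """Hypothetical stand-in for the real AgentAssay config model."""
    model: str = "example-model"
    params: dict[str, Any] = field(default_factory=dict)


@dataclass
class TestScenario:
    """Hypothetical stand-in for the real AgentAssay scenario model."""
    input_data: dict[str, Any] = field(default_factory=dict)


class MutationOperator(ABC):
    """Sketch of the contract: identity attributes plus two methods."""
    name: str = "base"
    category: str = "prompt"  # one of "prompt", "tool", "model", "context"
    description: str = ""

    @abstractmethod
    def mutate(
        self, config: AgentConfig, scenario: TestScenario
    ) -> tuple[AgentConfig, TestScenario]:
        """Return a new pair; implementations must not modify the inputs."""

    @abstractmethod
    def describe_mutation(self) -> dict[str, Any]:
        """Return a record of what the most recent mutate() call changed."""


class UppercasePromptMutator(MutationOperator):
    """Hypothetical operator used only to exercise the contract."""
    name = "prompt_uppercase"
    category = "prompt"
    description = "Upper-cases the prompt text."

    def __init__(self) -> None:
        self._last: dict[str, Any] = {}

    def mutate(self, config, scenario):
        # Deep-copy first so the caller's objects are never touched.
        new_config = copy.deepcopy(config)
        new_scenario = copy.deepcopy(scenario)
        prompt = new_scenario.input_data.get("prompt")
        changed = isinstance(prompt, str)
        if changed:
            new_scenario.input_data["prompt"] = prompt.upper()
        self._last = {"operator": self.name, "changed": changed}
        return new_config, new_scenario

    def describe_mutation(self):
        return dict(self._last)
```

Deep-copying before any modification is what lets the same baseline config and scenario be reused across many operators without one mutation leaking into the next.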

A concrete example is `PromptSynonymMutator` in [src/agentassay/mutation/prompt_ops.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/mutation/prompt_ops.py). It targets the `"prompt"` or `"instructions"` key in `scenario.input_data`, replacing key words with synonyms drawn from a built-in verb/modifier table (e.g. `"create" -> ["generate", "produce", "build", ...]`). If neither prompt key exists, the mutation is a no-op, which is by design rather than a failure. The mutator accepts `max_replacements`, an optional custom `synonym_table`, and an RNG `seed` for reproducibility.
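The behavior described above can be approximated with a short standalone function. This is a sketch under stated assumptions, not the `PromptSynonymMutator` implementation: the function name `synonym_mutate`, the two-entry `DEFAULT_SYNONYMS` table, the precedence of `"prompt"` over `"instructions"`, and the word-splitting strategy are all illustrative choices.

```python
import copy
import random
import re

# Illustrative table only; the real built-in table is larger.
DEFAULT_SYNONYMS = {
    "create": ["generate", "produce", "build"],
    "quickly": ["rapidly", "promptly"],
}


def synonym_mutate(input_data, max_replacements=2, synonym_table=None, seed=None):
    """Return (mutated_copy, changes) without modifying input_data."""
    table = synonym_table or DEFAULT_SYNONYMS
    rng = random.Random(seed)  # a private RNG keeps runs reproducible
    mutated = copy.deepcopy(input_data)

    # Assumed precedence: "prompt" first, then "instructions".
    key = next(
        (k for k in ("prompt", "instructions") if isinstance(mutated.get(k), str)),
        None,
    )
    if key is None:
        return mutated, []  # no prompt-like key: deliberate no-op

    # Split on non-word runs but keep them, so spacing survives the rejoin.
    words = re.split(r"(\W+)", mutated[key])
    candidates = [i for i, w in enumerate(words) if w.lower() in table]
    rng.shuffle(candidates)

    changes = []
    for i in sorted(candidates[:max_replacements]):
        replacement = rng.choice(table[words[i].lower()])
        changes.append((words[i], replacement))
        words[i] = replacement

    mutated[key] = "".join(words)
    return mutated, changes
```

Calling the function twice with the same `seed` yields the same mutated prompt, which is the property that makes a mutation run repeatable in CI.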

### Other Analysis Layers

The README positions coverage (5-dimensional), metamorphic testing, and the contract oracle alongside mutation as Layer 4 analysis primitives. These consume the `ExecutionTrace` produced by the adapters described below and emit structured signals that downstream statistical verdicts operate on.

## Persistence

### Schema

Persistence is implemented as a SQLite-backed result store. The DDL in [src/agentassay/persistence/schema.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/persistence/schema.py) defines a normalized relational model with foreign-key relationships:

- `projects` — top-level container, keyed by `id`.
- `runs` — references `projects(id)`, stores `agent_name`, `agent_version`, `model`, `framework`, `config_json`, timing, and `status`.
- `trials` — references `runs(id)`, stores per-trial `success`, `latency_ms`, `cost`, `token_count`, `step_count`, `error_msg`, and the full `trace_json`.
- `verdicts`, `coverage`, `fingerprints`, `gate_decisions`, `costs` — derived tables that attach statistical outcomes, coverage measurements, behavioral fingerprints, and gate decisions back to the originating run.

Source: [src/agentassay/persistence/schema.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/persistence/schema.py)
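To make the relationships concrete, the three core tables can be sketched with the standard-library `sqlite3` module. The column names follow the list above, but the column types, the `name`, `started_at`, and `finished_at` columns, and the helper functions `open_store` and `pass_rate` are assumptions for illustration. The actual DDL lives in `schema.py`.

```python
import sqlite3

# Assumed DDL: column names follow the manual, types are illustrative.
DDL = """
CREATE TABLE projects (
    id   TEXT PRIMARY KEY,
    name TEXT
);
CREATE TABLE runs (
    id            TEXT PRIMARY KEY,
    project_id    TEXT NOT NULL REFERENCES projects(id),
    agent_name    TEXT,
    agent_version TEXT,
    model         TEXT,
    framework     TEXT,
    config_json   TEXT,
    started_at    TEXT,
    finished_at   TEXT,
    status        TEXT
);
CREATE TABLE trials (
    id          INTEGER PRIMARY KEY,
    run_id      TEXT NOT NULL REFERENCES runs(id),
    success     INTEGER,
    latency_ms  REAL,
    cost        REAL,
    token_count INTEGER,
    step_count  INTEGER,
    error_msg   TEXT,
    trace_json  TEXT
);
"""


def open_store(path=":memory:"):
    """Open a database and create the sketch schema."""
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(DDL)
    return conn


def pass_rate(conn, run_id):
    """Fraction of successful trials for one run, or None if it has none."""
    row = conn.execute(
        "SELECT AVG(success) FROM trials WHERE run_id = ?", (run_id,)
    ).fetchone()
    return row[0]
```

With foreign keys enabled, inserting a trial whose `run_id` does not exist raises `sqlite3.IntegrityError`, so every trial is guaranteed to trace back to a run and a project.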

### Storage API

[src/agentassay/persistence/storage.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/persistence/storage.py) wraps the schema with typed insertion helpers. For example, `save_gate_decision()` accepts `run_id`, `pipeline` (identifier such as `"main"` or `"staging"`), `decision` (`"DEPLOY"`, `"BLOCK"`, or `"WARN"`), a `reason` string, a JSON-encoded `rules_json`, and optional `commit_sha` / `pr_number` fields — exactly the metadata a CI/CD system needs to audit a gate event. `save_cost()` records token and dollar accounting per run, providing the cost signal that drives Layer 6 efficiency optimizations.
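A typed insertion helper of this kind might look like the sketch below. The parameter names come from the description above, while the free-function form taking a `conn` argument, the table layout, and the validation steps are assumptions. The real helper in `storage.py` may be a method with a different signature.

```python
import json
import sqlite3

VALID_DECISIONS = {"DEPLOY", "BLOCK", "WARN"}

# Assumed table layout for the sketch.
GATE_DDL = """
CREATE TABLE IF NOT EXISTS gate_decisions (
    id         INTEGER PRIMARY KEY,
    run_id     TEXT NOT NULL,
    pipeline   TEXT NOT NULL,
    decision   TEXT NOT NULL,
    reason     TEXT,
    rules_json TEXT,
    commit_sha TEXT,
    pr_number  INTEGER
)
"""


def save_gate_decision(conn, run_id, pipeline, decision, reason, rules_json,
                       commit_sha=None, pr_number=None):
    """Insert one gate event and return its row id."""
    if decision not in VALID_DECISIONS:
        raise ValueError(f"unknown gate decision: {decision!r}")
    json.loads(rules_json)  # reject malformed JSON before it is stored
    cur = conn.execute(
        "INSERT INTO gate_decisions "
        "(run_id, pipeline, decision, reason, rules_json, commit_sha, pr_number) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (run_id, pipeline, decision, reason, rules_json, commit_sha, pr_number),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute(GATE_DDL)
row_id = save_gate_decision(
    conn,
    run_id="r1",
    pipeline="main",
    decision="BLOCK",
    reason="pass rate below threshold",
    rules_json=json.dumps({"min_pass_rate": 0.9}),
    commit_sha="abc1234",  # placeholder value
    pr_number=42,          # placeholder value
)
```

Validating `decision` and `rules_json` at write time keeps the audit table free of values that downstream queries would have to special-case.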

### Dashboard

[src/agentassay/dashboard/app.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/dashboard/app.py) is a Streamlit application that mounts a `ResultStore` against `AGENTASSAY_DB_PATH` (defaulting to `~/.agentassay/results.db`) and routes between four views: **Overview**, **Test Run**, **History**, and **Fingerprints**. Each view imports a dedicated renderer module on demand, keeping startup cost low.

Source: [src/agentassay/dashboard/app.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/dashboard/app.py)

## Reporting

The CLI reporting path is implemented in [src/agentassay/cli/cmd_report.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/cli/cmd_report.py). The `agentassay report` command accepts `--results` (a JSON file of trial outcomes) and `--output` (default `agentassay-report.html`), loads the results through `load_json`, extracts the per-trial pass/fail vector via `extract_passed_list`, and computes a Wilson confidence interval using `wilson_interval` from [src/agentassay/statistics/confidence.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/statistics/confidence.py). The resulting HTML is self-contained — pass rate, CI bounds, and a verdict panel — suitable for archiving or sharing without external assets. The CLI also exposes `agentassay run`, `compare`, `mutate`, `coverage`, and `test-report` commands per [README.md](https://github.com/qualixar/agentassay/blob/main/README.md).
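For readers unfamiliar with the Wilson score interval, the computation applied to a pass/fail vector can be sketched as follows. The formula is the standard Wilson interval. The function name `wilson_interval_sketch`, its signature, and the clamping to `[0, 1]` are assumptions and may differ from `wilson_interval` in `confidence.py`.

```python
import math
from statistics import NormalDist


def wilson_interval_sketch(passed, confidence=0.95):
    """Wilson score interval for the pass rate of a list of booleans."""
    n = len(passed)
    if n == 0:
        raise ValueError("need at least one trial")
    p_hat = sum(bool(x) for x in passed) / n

    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)

    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom

    # Clamp to guard against floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# 8 passes out of 10 trials gives an interval of roughly (0.49, 0.94).
low, high = wilson_interval_sketch([True] * 8 + [False] * 2)
```

Unlike the naive normal approximation, the Wilson interval stays inside `[0, 1]` and remains informative when every trial passes or fails, which is common with small trial counts.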

## Deployment Operations

Deployment operations integrate framework adapters, statistical verdicts, and the persistence layer into a single gateable workflow:

```mermaid
flowchart LR
    A[Framework Adapter] --> B[Trial Execution]
    B --> C[ExecutionTrace]
    C --> D[Persistence / ResultStore]
    C --> E[Analysis: Coverage + Mutation]
    E --> F[Stochastic Verdict]
    F --> G{Gate Decision}
    G -->|DEPLOY| H[CI/CD Pass]
    G -->|BLOCK/WARN| I[CI/CD Fail]
    D --> J[Dashboard / Report]
```
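The final branch of the diagram, where a gate decision becomes a CI/CD pass or fail, can be sketched as an exit-code mapping. This follows the diagram literally, with `WARN` grouped with `BLOCK` on the failing branch. The function name `gate_exit_code` and the specific exit codes are assumptions, and the actual CLI may treat `WARN` differently.

```python
# Assumed mapping, read off the flowchart: only DEPLOY lets the pipeline pass.
EXIT_CODES = {"DEPLOY": 0, "BLOCK": 1, "WARN": 1}


def gate_exit_code(decision: str) -> int:
    """Translate a gate decision into a process exit code for CI."""
    try:
        return EXIT_CODES[decision]
    except KeyError:
        raise ValueError(f"unknown gate decision: {decision!r}") from None
```

A CI step would typically end with `sys.exit(gate_exit_code(decision))`, so the pipeline fails whenever the statistical evidence does not support deployment.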

The adapter layer feeds deployment operations with structured traces. [src/agentassay/integrations/mcp_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/mcp_adapter.py) exposes direct MCP and Anthropic-MCP modes, returning an `ExecutionTrace` with per-tool-call `StepTrace` entries and timing captured via `time.perf_counter()`. [src/agentassay/integrations/bedrock_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/bedrock_adapter.py) wraps AWS Bedrock Agents and captures each rationale, observation, and knowledge-base lookup as a step. [src/agentassay/integrations/semantic_kernel_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/semantic_kernel_adapter.py) registers a temporary `FunctionInvocationFilter` on the kernel to capture function invocations non-intrusively. [src/agentassay/integrations/langgraph_adapter.py](https://github.com/qualixar/agentassay/blob/main/src/agentassay/integrations/langgraph_adapter.py) classifies LangGraph node outputs by inspecting `tool_calls`, message `content`, or node-name heuristics so each graph node becomes a typed step.

All adapter output funnels into the `ResultStore`, where verdicts, gate decisions, and cost records are persisted. The `gate_decisions` table ties every DEPLOY/BLOCK/WARN outcome back to a `commit_sha` and `pr_number`, giving auditors a complete chain of custody from code change to statistical evidence.

## See Also

- [Architecture Overview](https://github.com/qualixar/agentassay/blob/main/docs/architecture/overview.md)
- [CLI Reference](https://github.com/qualixar/agentassay/blob/main/docs/reference/cli.md)
- [Coverage Metrics](https://github.com/qualixar/agentassay/blob/main/docs/concepts/coverage.md)
- [Stochastic Testing](https://github.com/qualixar/agentassay/blob/main/docs/concepts/stochastic-testing.md)
- [Token-Efficient Testing](https://github.com/qualixar/agentassay/blob/main/docs/concepts/token-efficient-testing.md)

---


## Pitfall Log

Project: qualixar/agentassay

Summary: Found 6 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Capability evidence risk - Capability evidence risk requires verification.

## 1. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/qualixar/agentassay

## 2. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/qualixar/agentassay

## 3. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/qualixar/agentassay

## 4. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/qualixar/agentassay

## 5. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/qualixar/agentassay

## 6. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/qualixar/agentassay
