# https://github.com/tangle-network/browser-agent-driver Project Manual

Generated at: 2026-06-21 00:54:38 UTC

## Table of Contents

- [Overview & Core Agent Architecture](#page-1)
- [Stealth, Anti-Bot & CAPTCHA (v0.23.0 / Gen 27)](#page-2)
- [Design Audit & Auto-Fix](#page-3)
- [Benchmarking, Evaluation, Memory & Wallet](#page-4)

<a id='page-1'></a>

## Overview & Core Agent Architecture

### Related Pages

Related topics: [Stealth, Anti-Bot & CAPTCHA (v0.23.0 / Gen 27)](#page-2), [Design Audit & Auto-Fix](#page-3), [Benchmarking, Evaluation, Memory & Wallet](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/README.md)
- [package.json](https://github.com/tangle-network/browser-agent-driver/blob/main/package.json)
- [src/cli/commands/run.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/run.ts)
- [src/cli/commands/showcase.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/showcase.ts)
- [src/brain/tasks/goal-verification.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/goal-verification.ts)
- [src/brain/tasks/link-scout.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/link-scout.ts)
- [src/brain/tasks/knowledge.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/knowledge.ts)
- [src/brain/tasks/design-audit.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/design-audit.ts)
- [src/brain/tasks/evaluate.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/evaluate.ts)
- [src/providers/sandbox-backend.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/providers/sandbox-backend.ts)
- [bench/scenarios/README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/bench/scenarios/README.md)
- [bench/competitive/README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/bench/competitive/README.md)
</details>

## Purpose and Scope

`@tangle-network/browser-agent-driver` (binary: `bad`) is a general-purpose agentic browser automation system that completes real user outcomes on arbitrary websites: search, extraction, form filling, price comparison, and complex UI navigation. The package is published as a dual CLI and library; the CLI binary `bad` is declared in [package.json:8-10](https://github.com/tangle-network/browser-agent-driver/blob/main/package.json), and the library entry point is exposed through the package `exports` field. According to the README headline metrics, the system reaches 91.3% on WebVoyager (590 tasks across 15 sites) at $0.09 per task, with the default model being `gpt-5.4`.

The scope spans three primary use modes surfaced in [README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/README.md):

- A one-shot CLI (`bad run --goal "..."`) for ad-hoc automation.
- A programmatic SDK (`new BrowserAgent({ driver, config })`) for application integration.
- A benchmark harness under `bench/` for CI-grade regression and competitive evaluation.

## Architecture Overview

The agent loop is decoupled from the browser through a `Driver` interface, allowing the same decision engine to run against a local Playwright Chromium, a Steel cloud browser, or any other conforming implementation. The diagram below summarizes how the CLI, Brain, Driver, and reporting layer interact for a typical `bad run` invocation.

```mermaid
flowchart LR
  CLI[bad CLI<br/>run.ts] --> Brain[Brain<br/>decision engine]
  Brain -->|generate| ModelProvider[(Model Provider<br/>OpenAI / Anthropic /<br/>sandbox-backend)]
  Brain -->|decide / verify / scout| Driver[Driver<br/>Playwright or Steel]
  Driver --> Browser[(Chromium / Cloud)]
  CLI --> Reports[Reporters<br/>json / md / html / junit]
  CLI --> Renderer[Live Renderer<br/>stdout / TUI]
```

The CLI entry point in [src/cli/commands/run.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/run.ts) handles argument parsing, reporter fan-out, stream webhooks, and clean shutdown of browser, persistent context, and single-driver resources. A secondary command, `bad showcase`, delegates to a capture-and-evaluate pipeline via the thin wrapper in [src/cli/commands/showcase.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/showcase.ts), which simply forwards CLI flags to the internal `handleShowcase` implementation.

## Brain Decision Engine

The Brain is the LLM-driven core that observes page state and emits actions. To keep the file maintainable as decision subtasks grow, the Brain uses a **delegate + host-interface pattern**: each task lives in its own module under `src/brain/tasks/`, and the Brain class `implements` a small host interface that exposes only the slice of state the task needs. This makes missing or mistyped members a compile-time error.

The tasks observed in the source tree include:

| Task module | Responsibility |
|---|---|
| [goal-verification.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/goal-verification.ts) | Judge whether the agent's claimed result actually achieved the goal on the current page. Supports a dedicated verifier model and falls back to the navigation model under adaptive routing. |
| [link-scout.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/link-scout.ts) | Recommend the single best next visible link from a scored candidate list, using only the top 5 to save 2–8k tokens. |
| [knowledge.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/knowledge.ts) | Distill a completed trajectory into reusable timing/selector/pattern/quirk facts for future runs. |
| [design-audit.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/design-audit.ts) | Vision-based layout, typography, spacing, contrast, and UX analysis returning structured findings with severities. |
| [evaluate.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/evaluate.ts) | Rate the visual quality and professional polish of the current page via a dedicated `EVALUATE_PROMPT`. |

Each host interface declares only the slice of Brain state the task reads — typically `provider`, `navProvider`, `navModelName`, `buildUserContent`, and `generate` — keeping the dependency surface explicit and testable.
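
The delegate + host-interface pattern can be sketched as follows. This is a minimal illustration, not the project's code: `EvaluateHost`, `evaluateSketch`, `BrainSketch`, and the simplified `UserContent` and `GenerateResult` types are hypothetical names, and the real signatures under `src/brain/` may differ.

```typescript
// Hypothetical, simplified types; the real signatures in src/brain/ may differ.
type UserContent = { text: string; screenshot?: string };
type GenerateResult = { text: string; tokensUsed: number };

// Host interface: only the slice of Brain state this task reads.
interface EvaluateHost {
  buildUserContent(text: string, screenshot?: string, forceVision?: boolean): UserContent;
  generate(system: string, content: UserContent, maxTokens?: number): Promise<GenerateResult>;
}

// The task body lives in its own module and sees only the host slice.
async function evaluateSketch(
  host: EvaluateHost,
  pageText: string,
  screenshot?: string,
): Promise<GenerateResult> {
  const content = host.buildUserContent(pageText, screenshot, true);
  return host.generate("Rate the visual quality of this page.", content);
}

// The class implements the host; dropping or mistyping a member fails type-checking.
class BrainSketch implements EvaluateHost {
  buildUserContent(text: string, screenshot?: string, forceVision = false): UserContent {
    return forceVision && screenshot ? { text, screenshot } : { text };
  }

  async generate(system: string, content: UserContent): Promise<GenerateResult> {
    // Stub standing in for a real model call.
    return { text: `${system} :: ${content.text}`, tokensUsed: 0 };
  }

  // Thin delegator: the class keeps a one-line method, the body lives elsewhere.
  evaluate(pageText: string, screenshot?: string): Promise<GenerateResult> {
    return evaluateSketch(this, pageText, screenshot);
  }
}
```

Because the task function accepts the interface rather than the class, a test can pass a small stub object instead of constructing a full Brain.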

## Driver Layer and Model Routing

The Driver abstraction decouples the agent loop from the browser transport. The default is a local Playwright driver, while a Steel driver is used for anti-bot, residential proxies, and CAPTCHA solve-as-a-service. Because the Brain only knows the `Driver` interface, alternative implementations can be substituted without touching decision logic.

Models are configurable per role. The README documents a `models` map covering `planner`, `executor`, `verifier`, and `supervisor`, letting cheaper models handle navigation while a stronger model supervises or audits. The goal verifier in [src/brain/tasks/goal-verification.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/goal-verification.ts) demonstrates this: if `verifierProvider` is unset, it falls back to `navProvider` when `adaptiveModelRouting` is enabled, otherwise it reuses the main `provider`. Link scouting in [src/brain/tasks/link-scout.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/link-scout.ts) follows the same precedence: `scoutProvider` → `navProvider` → `provider`.
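
The fallback order described above can be written as two small resolver functions. This is a sketch under assumptions: `ProviderSlots` and both function names are hypothetical, and the extra check that `navProvider` is defined is a defensive guard added here, not something the source states.

```typescript
// Hypothetical minimal shape; the real Brain fields may carry more state.
interface ProviderSlots<P> {
  provider: P;                 // main model provider (always set)
  navProvider?: P;             // cheaper navigation provider
  verifierProvider?: P;        // dedicated verifier, if configured
  scoutProvider?: P;           // dedicated link scout, if configured
  adaptiveModelRouting: boolean;
}

// Verifier: dedicated provider, else navProvider under adaptive routing, else main.
function resolveVerifierProvider<P>(s: ProviderSlots<P>): P {
  if (s.verifierProvider !== undefined) return s.verifierProvider;
  if (s.adaptiveModelRouting && s.navProvider !== undefined) return s.navProvider;
  return s.provider;
}

// Link scout: scoutProvider -> navProvider -> provider, as stated in the text.
function resolveScoutProvider<P>(s: ProviderSlots<P>): P {
  return s.scoutProvider ?? s.navProvider ?? s.provider;
}
```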

For sandboxed execution, [src/providers/sandbox-backend.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/providers/sandbox-backend.ts) builds a transcript from `ModelMessage[]`, serializes text and image attachments, and infers the backend type from the model name (`claude`/`sonnet`/`opus`/`haiku` → `claude-code`, `gpt`/`o1`/`o3`/`o4`/`codex` → `codex`), throwing if inference fails.
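
A minimal sketch of the backend inference, assuming case-insensitive substring matching on the model name. The function name `inferBackendType` and the exact matching strategy are assumptions; only the keyword lists and the two backend names come from the description above.

```typescript
type BackendType = "claude-code" | "codex";

const CLAUDE_HINTS = ["claude", "sonnet", "opus", "haiku"];
const CODEX_HINTS = ["gpt", "o1", "o3", "o4", "codex"];

// Map a model name to a sandbox backend, throwing when no pattern matches.
function inferBackendType(modelName: string): BackendType {
  const name = modelName.toLowerCase();
  if (CLAUDE_HINTS.some((hint) => name.includes(hint))) return "claude-code";
  if (CODEX_HINTS.some((hint) => name.includes(hint))) return "codex";
  throw new Error(`Cannot infer sandbox backend type from model "${modelName}"`);
}
```

Throwing instead of choosing a default keeps a misconfigured model name from silently running on the wrong backend.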

## Reporting, Scenarios, and Failure Modes

After every run, [src/cli/commands/run.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/run.ts) writes report files for each requested format (`json`, `markdown`, `html`, `junit`) under the configured report directory, renders them through the live view, and finally — in JSON mode — echoes the structured result to stdout. Cleanup runs in a `finally` block that detaches the interrupt controller, flushes the webhook streamer, and closes driver, persistent context, and browser resources regardless of success.

The benchmark tracks documented in [bench/scenarios/README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/bench/scenarios/README.md) split tasks into `local-deterministic`, `staging-auth`, `public-web`, `webbench`, and `restricted-manual` to keep flaky internet and policy-sensitive flows (captchas, third-party account provisioning) out of CI reliability metrics. The competitive harness in [bench/competitive/README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/bench/competitive/README.md) targets comparability against browser-use, Stagehand, Skyvern, and Computer Use — measuring cost-per-task alongside success rate.

Common failure modes worth understanding:

- **Provider inference failure**: the sandbox backend throws if model name matches neither Claude nor GPT/Codex patterns.
- **Reporter errors are swallowed**: report generation is best-effort, so missing templates will not abort the run.
- **Cleanup throws are caught**: `close()` calls use `.catch(() => {})`, so resource leaks may be silent.
- **Stealth regressions**: per the v0.23.0 release notes (Gen 27), previously blocked sites pass only with system Chrome, Patchright, and Bezier mouse humanization active, so disabling any of them can bring the blocks back.
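
The cleanup behavior in the list above can be illustrated with a small helper. This is a sketch, not the code in `run.ts`: `Closable` and `runWithCleanup` are hypothetical names, and the optional `onCloseError` callback is an addition that shows one way to make otherwise silent close failures visible.

```typescript
interface Closable {
  close(): Promise<void>;
}

// Run `work`, then close every resource whether or not `work` succeeded.
async function runWithCleanup<T>(
  work: () => Promise<T>,
  resources: Array<Closable | undefined>,
  onCloseError: (err: unknown) => void = () => {}, // default mirrors .catch(() => {})
): Promise<T> {
  try {
    return await work();
  } finally {
    for (const resource of resources) {
      // Best-effort close: a failing close() never masks the result of `work`.
      await resource?.close().catch(onCloseError);
    }
  }
}
```

Passing a logger as `onCloseError` keeps the best-effort semantics while leaving a trace when a browser or context fails to close.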

## See Also

- [Stealth & Anti-Bot Configuration](./Stealth-and-Anti-Bot.md)
- [Brain Decision Tasks](./Brain-Decision-Tasks.md)
- [Driver Implementations](./Drivers.md)
- [Benchmark Suites](./Benchmarks.md)
- [CLI Reference](./CLI-Reference.md)

---

<a id='page-2'></a>

## Stealth, Anti-Bot & CAPTCHA (v0.23.0 / Gen 27)

### Related Pages

Related topics: [Overview & Core Agent Architecture](#page-1), [Benchmarking, Evaluation, Memory & Wallet](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/README.md)
- [src/cli/commands/run.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/run.ts)
- [src/brain/tasks/goal-verification.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/goal-verification.ts)
- [src/brain/tasks/link-scout.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/link-scout.ts)
- [src/brain/tasks/design-audit.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/design-audit.ts)
- [src/cli/commands/showcase.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/showcase.ts)
- [bench/scenarios/README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/bench/scenarios/README.md)
- [bench/competitive/README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/bench/competitive/README.md)
- [package.json](https://github.com/tangle-network/browser-agent-driver/blob/main/package.json)
</details>

## Overview and Scope

v0.23.0 (Gen 27) consolidates browser-agent-driver's evasion surface into a single release focused on real-world anti-bot blocking, CAPTCHA handling, and form intelligence. The release notes claim that 9 of 13 previously blocked sites now pass on the WebBench-50 evaluation, with system Chrome, Patchright, and CAPTCHA solvers shipped as defaults.

The stealth subsystem sits at the browser-launch layer and combines four independent evasion techniques: TLS/JA3 fingerprinting, CDP protocol patching, mouse kinematics, and proxy routing. CAPTCHA handling is treated as a recovery job the main agent loop can invoke when blocked. The integration point for all of this is `bad run`: configuration is auto-detected for both CLI and SDK consumers, and CLI flags override config values.

## Stealth and Anti-Bot Evasion

The browser launch code in `src/cli/commands/run.ts` is the primary integration point for stealth configuration. For stealth profiles, the launch plan upgrades the bundled Chromium channel to system Chrome: `...(isStealthProfile && browserName === 'chromium' ? { channel: 'chrome' } : {})`. Source: [src/cli/commands/run.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/run.ts). System Chrome provides a real TLS/JA3/HTTP2 fingerprint that bundled Chromium cannot reproduce. The launch code itself documents why the upgrade is gated to stealth profiles only: system Chrome renders differently than bundled Chromium on some sites, producing Allrecipes click timeouts and Amazon layout shifts.

Proxy support is wired through the launch plan: `...(launchPlan.proxyServer ? { proxy: { server: launchPlan.proxyServer, ...(launchPlan.proxyBypass ? { bypass: launchPlan.proxyBypass } : {}) } } : {})`. Source: [src/cli/commands/run.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/run.ts). Residential, SOCKS5, and HTTP proxies are accepted; the `--proxy` CLI flag and `BAD_PROXY_URL` environment variable both feed `launchPlan.proxyServer`.
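
The two conditional spreads quoted above compose into one options object. The sketch below assumes a simplified `LaunchPlan` and `LaunchOptions`; those types and the `buildLaunchOptions` name are hypothetical, while the spread expressions follow the quoted snippets.

```typescript
// Hypothetical shapes; only the fields quoted in the text are modeled.
interface LaunchPlan {
  proxyServer?: string;
  proxyBypass?: string;
}

interface LaunchOptions {
  headless: boolean;
  channel?: "chrome";
  proxy?: { server: string; bypass?: string };
}

function buildLaunchOptions(
  plan: LaunchPlan,
  browserName: string,
  isStealthProfile: boolean,
  headless = true,
): LaunchOptions {
  return {
    headless,
    // Upgrade to system Chrome only for stealth profiles on Chromium.
    ...(isStealthProfile && browserName === "chromium" ? { channel: "chrome" as const } : {}),
    // Attach proxy settings only when a server is configured.
    ...(plan.proxyServer
      ? { proxy: { server: plan.proxyServer, ...(plan.proxyBypass ? { bypass: plan.proxyBypass } : {}) } }
      : {}),
  };
}
```

Spreading an empty object leaves the key absent instead of setting it to `undefined`, so a default profile produces the same options object it would without stealth or proxy support.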

Headless Chromium exposes itself with `HeadlessChrome/...` in the default User-Agent, which CDNs like Akamai reject with `ERR_HTTP2_PROTOCOL_ERROR` before any JS stealth patch can run. The launch code builds a clean UA from the live browser version and a platform-specific token:

```typescript
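// Excerpt: `browser` is the already-launched browser instance.
// Use the live version so the UA matches the binary, minus the "HeadlessChrome" marker.
const ver = browser.version()
// Pick the OS token that matches the host platform.
const platformToken = process.platform === 'win32'
  ? 'Windows NT 10.0; Win64; x64'
  : process.platform === 'linux'
    ? 'X11; Linux x86_64'
    : 'Macintosh; Intel Mac OS X 10_15_7'
return `Mozilla/5.0 (${platformToken}) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/${ver} Safari/537.36`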
```

Source: [src/cli/commands/run.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/run.ts).

Additional stealth layers documented in the README include Patchright (a Playwright fork that patches CDP protocol leaks), mouse humanization with Bezier curves (8–15 control points plus gaussian click offset), browser fingerprint patches for `navigator.webdriver`, plugins, languages, WebGL, and canvas noise, and a blocklist of 99+ analytics/tracking domains. The `--use-gl=desktop` flag enables real GPU WebGL rendering. Source: [README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/README.md).
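
To make the mouse humanization concrete, here is one way a Bezier path with 8 to 15 control points and a gaussian click offset could be generated. It is an illustrative sketch only: the function names, jitter size, sigma, and step count are assumptions, and the project's actual implementation may differ.

```typescript
type Point = { x: number; y: number };

// Evaluate a Bezier curve of any degree at t in [0, 1] (de Casteljau's algorithm).
function bezierPoint(control: Point[], t: number): Point {
  let pts = control;
  while (pts.length > 1) {
    const next: Point[] = [];
    for (let i = 0; i < pts.length - 1; i++) {
      next.push({
        x: (1 - t) * pts[i].x + t * pts[i + 1].x,
        y: (1 - t) * pts[i].y + t * pts[i + 1].y,
      });
    }
    pts = next;
  }
  return pts[0];
}

// Standard normal sample via the Box-Muller transform.
function gaussian(rng: () => number): number {
  const u = 1 - rng(); // in (0, 1], avoids log(0)
  const v = rng();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Build a curved path from `from` to a point near `target`.
function humanizedPath(
  from: Point,
  target: Point,
  rng: () => number = Math.random,
  steps = 30,        // assumed sampling density
  jitterPx = 20,     // assumed sideways wobble of interior control points
  clickSigmaPx = 2,  // assumed spread of the click offset
): Point[] {
  // Gaussian click offset: aim near the target, not at its exact center.
  const end: Point = {
    x: target.x + gaussian(rng) * clickSigmaPx,
    y: target.y + gaussian(rng) * clickSigmaPx,
  };
  const count = 8 + Math.floor(rng() * 8); // 8 to 15 control points
  const control: Point[] = [];
  for (let i = 0; i < count; i++) {
    const t = i / (count - 1);
    const interior = i > 0 && i < count - 1;
    control.push({
      x: from.x + (end.x - from.x) * t + (interior ? (rng() - 0.5) * jitterPx : 0),
      y: from.y + (end.y - from.y) * t + (interior ? (rng() - 0.5) * jitterPx : 0),
    });
  }
  return Array.from({ length: steps + 1 }, (_, i) => bezierPoint(control, i / steps));
}
```

Injecting `rng` keeps the path reproducible in tests while defaulting to `Math.random` in normal use.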

## CAPTCHA Solving

CAPTCHA handling is enabled by default and configured via the `captcha` option: `{ captcha: { enabled: true, maxAttempts: 5 } }`. Source: [README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/README.md). Three CAPTCHA families are supported:

- **reCAPTCHA v2** — checkbox click followed by an LLM-vision image-grid solver.
- **Cloudflare Turnstile** — checkbox with behavioral click heuristics.
- **Google "unusual traffic"** — detected on the page and a solver attempted automatically.

Recovery is automatic: cookie consent, modal blockers, A-B-A-B oscillation loops, form-field resets, date-picker stalls, and CAPTCHA challenges are all handled in the agent loop before the run terminates. Source: [README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/README.md). The `goal-verification` task consults the current page state and a screenshot via `buildUserContent(textContent, state.screenshot, true)` before declaring success, so a CAPTCHA interstitial that survives the run will cause verification to fail rather than report a false positive. Source: [src/brain/tasks/goal-verification.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/goal-verification.ts).

The `link-scout` task can use its own cheaper model (`scoutProvider`, `scoutModelName`) to pick the next visible link from a candidate list, reducing the cost of recovery loops on anti-bot pages. Source: [src/brain/tasks/link-scout.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/link-scout.ts).

## Configuration, CLI Flags, and Failure Modes

| Concern | Surface | Notes |
|---|---|---|
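| Proxy routing | `--proxy` flag or `BAD_PROXY_URL` env | Sets `launchPlan.proxyServer` and optional `launchPlan.proxyBypass`. Source: [src/cli/commands/run.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/run.ts) |
| Real WebGL | `--use-gl=desktop` | Avoids software-renderer fingerprint. Source: [README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/README.md) |
| CAPTCHA policy | `captcha: { enabled, maxAttempts }` | Default enabled, `maxAttempts: 5`. Source: [README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/README.md) |
| Profile gating | `isStealthProfile` | Channel upgrade only fires when stealth profile is active. Source: [src/cli/commands/run.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/run.ts) |
| Benchmark suite | `bench:scoreboard`, `webbench:import` | Package scripts declared in `package.json`. Source: [package.json](https://github.com/tangle-network/browser-agent-driver/blob/main/package.json) |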

The scenario suite splits high-friction flows into a `restricted-manual` track that requires human-in-the-loop and is never run unattended in CI, while `webbench` and `public-web` tracks capture realistic anti-bot exposure under benchmark profiles (`default`, `webbench`, `webvoyager`). Source: [bench/scenarios/README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/bench/scenarios/README.md). Competitive benchmarking against browser-use, Stagehand, Skyvern, and the foundation-model Computer Use agents lives under `bench/competitive/` and is the empirical ground truth for whether stealth and CAPTCHA work buys net task-completion improvement. Source: [bench/competitive/README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/bench/competitive/README.md).

Known failure modes from the source:

| Symptom | Likely cause |
|---|---|
| Site rejected with `ERR_HTTP2_PROTOCOL_ERROR` | The default `HeadlessChrome/...` User-Agent is still being sent; confirm the clean UA is applied and that `isStealthProfile` triggers system Chrome |
| Layout shifts or click timeouts on Allrecipes/Amazon | System Chrome renders differently than bundled Chromium; the stealth upgrade is intentionally profile-scoped |
| CAPTCHA still unsolved after `maxAttempts` | LLM-vision solver exhausted; raise `captcha.maxAttempts` and inspect screenshots in the run directory |

Source: [src/cli/commands/run.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/run.ts), [README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/README.md).

## See Also

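- Configuration Reference: [README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/README.md)
- CLI Reference: `bad run`, `bad snapshot`, `bad design-audit`, `bad view`, `bad competitive`
- Benchmark Suite: [bench/scenarios/README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/bench/scenarios/README.md), [bench/competitive/README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/bench/competitive/README.md)
- Brain tasks: [src/brain/tasks/goal-verification.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/goal-verification.ts), [src/brain/tasks/link-scout.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/link-scout.ts)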

---

<a id='page-3'></a>

## Design Audit & Auto-Fix

### Related Pages

Related topics: [Overview & Core Agent Architecture](#page-1), [Benchmarking, Evaluation, Memory & Wallet](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [src/brain/tasks/design-audit.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/design-audit.ts)
- [src/cli/commands/design-audit.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/design-audit.ts)
- [src/cli/args.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/args.ts)
- [src/brain/tasks/evaluate.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/evaluate.ts)
- [src/brain/tasks/goal-verification.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/goal-verification.ts)
- [README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/README.md)
</details>

## Overview

`bad design-audit` is a dedicated subsystem inside Browser Agent Driver that grades the visual quality, layout, typography, contrast, and overall UX polish of a target URL and emits structured findings. Unlike the agentic loop used by `bad run`, design audit is a single-pass, vision-driven evaluation that scores a page against an explicit list of checkpoints. The audit logic was extracted from `brain/index.ts` into a delegate-and-host module, `src/brain/tasks/design-audit.ts`, and the file's own header documents the split: `Brain.auditDesign` remains a thin delegator, while the body lives in `auditDesignImpl` and reads Brain state through the `BrainDesignAuditHost` interface ([src/brain/tasks/design-audit.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/design-audit.ts)).

The audit serves two related goals:

1. **Objective scoring** — produce a numeric score plus categorized findings (category, severity, ROI hints).
2. **Optional patch passthrough** — the same module carries `roi`, `reference`, and `judge` knobs that feed an auto-fix loop, so a high-severity finding can drive a remediation candidate rather than just a report line.

The CLI entry point is `runDesignAudit`, wired in `src/cli/commands/design-audit.ts`. It forwards every relevant flag (`--url`, `--pages`, `--profile`, `--model`, `--reference`, `--judge`, `--evolve`, etc.) into a single typed options bag ([src/cli/commands/design-audit.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/design-audit.ts)).

## Architecture

```mermaid
flowchart LR
    CLI["CLI: bad design-audit"] --> CMD["runDesignAudit()"]
    CMD --> Brain["Brain.auditDesign (delegator)"]
    Brain --> Impl["auditDesignImpl()"]
    Impl --> Host["BrainDesignAuditHost<br/>(generate, buildUserContent, debug)"]
    Host --> Model["LLM (vision-capable)"]
    Model --> Parse["JSON parse → DesignFinding[]"]
    Parse --> Score["score + designSystemScore"]
    Parse --> Optional["Patch / ROI passthrough"]
    Optional --> Report["json | html | junit sink"]
```

The host interface is intentionally narrow. It only exposes `debug`, `buildUserContent`, and `generate`, so the body cannot reach into unrelated Brain state. Because Brain `implements BrainDesignAuditHost`, a missing or mistyped member fails `tsc` at compile time ([src/brain/tasks/design-audit.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/design-audit.ts)).

## Pipeline and Prompt

`auditDesignImpl` builds a single user message that mixes the goal, the explicit checkpoint list, the current URL/title, and the page snapshot, then attaches the screenshot with `forceVision: true`. This guarantees a vision-capable model is consulted even if vision is disabled elsewhere in the session ([src/brain/tasks/design-audit.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/design-audit.ts)).

```text
GOAL: <goal>
CHECKPOINTS:
1. <c1>
2. <c2>
...

CURRENT PAGE:
URL: <state.url>
Title: <state.title>
ELEMENTS:
<state.snapshot>
```
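
Assembling that message is plain string building. The helper below is a hypothetical sketch (`buildAuditPrompt` and `PageStateSketch` are illustrative names, not identifiers from the source) that reproduces the layout shown above.

```typescript
// Hypothetical slice of the page state; only the fields in the layout above are used.
interface PageStateSketch {
  url: string;
  title: string;
  snapshot: string;
}

function buildAuditPrompt(goal: string, checkpoints: string[], state: PageStateSketch): string {
  // Number the checkpoints starting at 1, one per line.
  const numbered = checkpoints.map((c, i) => `${i + 1}. ${c}`).join("\n");
  return [
    `GOAL: ${goal}`,
    "CHECKPOINTS:",
    numbered,
    "",
    "CURRENT PAGE:",
    `URL: ${state.url}`,
    `Title: ${state.title}`,
    "ELEMENTS:",
    state.snapshot,
  ].join("\n");
}
```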

The system prompt defaults to `DESIGN_AUDIT_PROMPT` from `brain/prompts.ts`, but a caller can override it via the `systemPrompt` argument. The response is sent through `generate(..., 8000)` to cap output, then parsed with two layers of fallback: trim surrounding fences first, then regex-extract a JSON object if the model returns prose or a truncated payload. Both branches assign into a `parseError` field rather than throwing, so a malformed response still produces a report line ([src/brain/tasks/design-audit.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/design-audit.ts)).
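
The two-layer fallback can be sketched as follows. This is an approximation under assumptions: `parseAuditResponse`, `toResult`, and the `AuditParseResult` shape are hypothetical, and the real parser may extract more fields.

```typescript
interface AuditParseResult {
  score?: number;
  findings: unknown[];
  raw: string;
  parseError?: string;
}

// Built with repeat() so this sample contains no literal code fence.
const FENCE = "`".repeat(3);
const OPEN_FENCE = new RegExp("^" + FENCE + "(?:json)?\\s*", "i");
const CLOSE_FENCE = new RegExp("\\s*" + FENCE + "$");

function toResult(value: unknown, raw: string): AuditParseResult {
  const obj = (typeof value === "object" && value !== null ? value : {}) as {
    score?: unknown;
    findings?: unknown;
  };
  return {
    score: typeof obj.score === "number" ? obj.score : undefined,
    findings: Array.isArray(obj.findings) ? obj.findings : [],
    raw,
  };
}

function parseAuditResponse(raw: string): AuditParseResult {
  // Layer 1: trim a surrounding code fence, then parse.
  const unfenced = raw.trim().replace(OPEN_FENCE, "").replace(CLOSE_FENCE, "");
  try {
    return toResult(JSON.parse(unfenced), raw);
  } catch (first) {
    // Layer 2: pull the outermost {...} span out of surrounding prose.
    const match = unfenced.match(/\{[\s\S]*\}/);
    if (!match) return { findings: [], raw, parseError: String(first) };
    try {
      return toResult(JSON.parse(match[0]), raw);
    } catch (second) {
      // Record the failure instead of throwing so the report still gets a line.
      return { findings: [], raw, parseError: String(second) };
    }
  }
}
```

Keeping `raw` on every result lets a downstream tool re-prompt or inspect the original text when `parseError` is set.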

## Configuration Surface

`bad design-audit` exposes more than 20 flags. The most relevant ones for the audit + auto-fix loop are summarized below.

| Flag | Purpose |
|------|---------|
| `--url` / `--pages` | Target URL and number of pages to crawl |
| `--profile`, `--model`, `--provider`, `--api-key`, `--base-url` | Model selection (vision-capable) |
| `--sink` | Output format: `json`, `markdown`, `html`, or `junit` |
| `--json`, `--headless`, `--debug` | Output verbosity and runtime mode |
| `--storage-state` | Reuse cookies/storage for authenticated audits |
| `--extract-tokens` | Extract design tokens (colors, fonts) from the page |
| `--evolve`, `--evolve-rounds` | Iterative auto-fix loop: re-audit after each patch |
| `--project-dir` | Where to write patches |
| `--reproducibility` | Lock seeds/snapshots for a reproducible audit |
| `--rubrics-dir` | Override the builtin rubric set with a custom one |
| `--audit-passes` | Multi-pass auditing for richer findings |
| `--skip-ethics` | Bypass the Layer 7 ethics rollup floor (testing only) |
| `--ethics-rules-dir` | Override builtin ethics rules |
| `--audience`, `--regulatory-context`, `--audience-vulnerability`, `--modality` | Audience predicates that weight findings |
| `--reference`, `--reference-grounded` | Opt-in reference-grounded taste judge (v1) |
| `--judge`, `--judge-models` | Judge mode (`text` or `vision`) and ensemble list |

All flags are declared in `src/cli/args.ts` and forwarded unchanged through `runDesignAudit` ([src/cli/args.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/args.ts), [src/cli/commands/design-audit.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/design-audit.ts)).

## Auto-Fix Loop

The auto-fix path is opt-in via `--evolve`. Each round:

1. Run `auditDesignImpl` on the current state of the page.
2. Emit `DesignFinding[]` (category, severity, ROI).
3. For findings above the ROI threshold, project them into a patch candidate and write it under `--project-dir`.
4. Re-render and re-audit until either `--evolve-rounds` is exhausted or the score crosses a stop band.
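
The round structure above can be sketched as a loop over injected `audit` and `applyPatch` callbacks. All names here (`evolve`, `EvolveDeps`, `EvolveOptions`, `stopScore`, `roiThreshold`) are hypothetical stand-ins; the real loop's stop band and patch projection are not specified in this page.

```typescript
interface Finding {
  category: string;
  severity: string;
  roi: number;
}

interface AuditResult {
  score: number;
  findings: Finding[];
}

interface EvolveDeps {
  audit(): Promise<AuditResult>;               // stands in for one audit pass
  applyPatch(finding: Finding): Promise<void>; // e.g. write a patch candidate to the project dir
}

interface EvolveOptions {
  maxRounds: number;    // analogous to --evolve-rounds
  roiThreshold: number; // assumed cutoff for patch-worthy findings
  stopScore: number;    // assumed lower edge of the stop band
}

async function evolve(deps: EvolveDeps, opts: EvolveOptions): Promise<AuditResult[]> {
  const history: AuditResult[] = [];
  for (let round = 0; round < opts.maxRounds; round++) {
    const result = await deps.audit();
    history.push(result);
    if (result.score >= opts.stopScore) break; // good enough, stop early
    const actionable = result.findings.filter((f) => f.roi >= opts.roiThreshold);
    if (actionable.length === 0) break; // nothing worth patching
    for (const finding of actionable) await deps.applyPatch(finding);
  }
  return history;
}
```

Returning the per-round history makes it easy to see whether the score is still improving before spending another round.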

The same module ships the `designSystemScore` map and `tokensUsed` counter so a downstream agent can decide whether to spend another round. Two sibling tasks in `src/brain/tasks/` illustrate the related decision patterns the auto-fix loop composes with:

- `evaluateImpl` (`src/brain/tasks/evaluate.ts`) — produces a `QualityEvaluation` (subjective taste rating) using `EVALUATE_PROMPT`, the same transport funnel (`buildUserContent` + `generate`).
- `verifyGoalCompletionImpl` (`src/brain/tasks/goal-verification.ts`) — confirms a user-stated goal is satisfied before a run is marked complete, including a `buildFirstPartyBoundaryNote` site-boundary check that prevents the verifier from claiming success on a first-party page that hasn't actually moved.

Both follow the same delegate-and-host pattern, which means an auto-fix pass can mix "did the design improve?" (audit) with "did the user's stated outcome improve?" (goal verification) and "does the page look professional now?" (evaluate) without duplicating transport plumbing.

## Usage

Minimal:

```bash
bad design-audit --url https://example.com
```

With auto-fix, custom rubric, and reference-grounded judging:

```bash
bad design-audit \
  --url https://example.com \
  --pages 3 \
  --evolve --evolve-rounds 5 \
  --project-dir ./audit-out \
  --rubrics-dir ./my-rubrics \
  --reference ./ref.png --reference-grounded \
  --judge vision --judge-models gpt-5.4,claude-opus-4.6
```

Reporters follow the same multi-format pattern used by `bad run`: `json`, `markdown` (with turn detail), `html`, and `junit` are emitted in parallel when requested, with best-effort error handling so a broken reporter never aborts the audit ([src/cli/commands/run.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/run.ts)).

## Common Failure Modes

- **Malformed model output.** Handled by the trim-fence + regex-extract fallback in `auditDesignImpl`; the raw text and `parseError` are still surfaced in the report so a downstream tool can re-prompt.
- **Non-vision model selected.** Mitigated by `forceVision: true` in the audit's `buildUserContent` call.
- **Ethics floor blocks the run.** Use `--skip-ethics` only in test scenarios; production should leave the Layer 7 gate on ([src/cli/args.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/args.ts)).
- **Auto-fix stalls.** Increase `--evolve-rounds` or lower the ROI threshold so more findings qualify for patching; the score and `tokensUsed` per round are emitted so progress can be measured externally.

## See Also

- [Brain Decision Engine](./brain-decision-engine.md)
- [CLI Reference](./cli-reference.md)
- [Configuration Guide](./configuration.md)
- [Goal Verification](./goal-verification.md)

---

<a id='page-4'></a>

## Benchmarking, Evaluation, Memory & Wallet

### Related Pages

Related topics: [Overview & Core Agent Architecture](#page-1), [Stealth, Anti-Bot & CAPTCHA (v0.23.0 / Gen 27)](#page-2)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/README.md)
- [bench/scenarios/README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/bench/scenarios/README.md)
- [bench/competitive/README.md](https://github.com/tangle-network/browser-agent-driver/blob/main/bench/competitive/README.md)
- [package.json](https://github.com/tangle-network/browser-agent-driver/blob/main/package.json)
- [src/brain/tasks/evaluate.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/evaluate.ts)
- [src/brain/tasks/design-audit.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/design-audit.ts)
- [src/brain/tasks/goal-verification.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/goal-verification.ts)
- [src/brain/tasks/knowledge.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/knowledge.ts)
- [src/brain/tasks/link-scout.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/brain/tasks/link-scout.ts)
- [src/cli/commands/run.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/cli/commands/run.ts)
- [src/providers/sandbox-backend.ts](https://github.com/tangle-network/browser-agent-driver/blob/main/src/providers/sandbox-backend.ts)
</details>

The `browser-agent-driver` ("bad") project ships a tightly integrated suite for measuring agent quality, scoring design and goal outcomes, distilling reusable knowledge from completed runs, and exercising browser-extension wallets during DeFi-style tasks. This page documents each pillar, the scripts that drive them, and how they compose.

## 1. Benchmarking Infrastructure

The benchmark layer is split into two complementary harnesses.

### Scenario Suite

`bench/scenarios/README.md` defines five tracks. `local-deterministic` runs on controlled fixtures and is required in CI; `staging-auth` exercises real product flows with seeded storage state; `public-web` validates against stable public pages (non-critical in CI); `webbench` derives cases from the Halluminate WebBench corpus for cross-agent comparability; and `restricted-manual` covers captcha/phone-verified flows that must never run unattended. Each task is tagged with categories such as `navigation`, `form-completion`, `product-usage`, `research`, `scraping`, `auth`, and `blocker-recovery`. Source: [bench/scenarios/README.md:1-50](https://github.com/tangle-network/browser-agent-driver/blob/main/bench/scenarios/README.md).

The canonical runner is `scripts/run-scenario-track.mjs`, invoked as `node scripts/run-scenario-track.mjs --cases <file> --config <file> --model <id> --benchmark-profile <profile> --modes <list>`. Benchmark profiles tune the noise/cost tradeoff: `default` is balanced, `webbench` is fast and low-noise, and `webvoyager` is evidence-rich. Source: [bench/scenarios/README.md:50-80](https://github.com/tangle-network/browser-agent-driver/blob/main/bench/scenarios/README.md).

A/B experiments use `npm run ab:experiment`, with outputs `summary.json` (Wilson CIs, bootstrap delta CI), `runs.csv`, `passrate-series.csv`, `summary.md`, and blocker-adjusted `cleanPassRate` metrics. The `Tier1 Reliability Gate` runs deterministic fixtures with `--min-full-pass-rate 1 --min-fast-pass-rate 1` and emits `tier1-gate-summary.{json,md}`. Source: [bench/scenarios/README.md:80-130](https://github.com/tangle-network/browser-agent-driver/blob/main/bench/scenarios/README.md).
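
For readers unfamiliar with the Wilson CIs mentioned above, the standard Wilson score interval for a pass rate can be computed as below. This is the textbook formula, not the harness's code; the function name and the default `z = 1.96` (roughly 95% confidence) are choices made for this sketch.

```typescript
// Wilson score interval for a binomial proportion (successes out of trials).
function wilsonInterval(
  successes: number,
  trials: number,
  z = 1.96,
): { low: number; high: number } {
  if (trials <= 0) throw new RangeError("trials must be positive");
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  // Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
  return { low: Math.max(0, center - half), high: Math.min(1, center + half) };
}
```

With 10 passes out of 10 runs the lower bound is only about 0.72, which is why a perfect pass rate on a small fixture set still carries real uncertainty.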

### Competitive Harness

`bench/competitive/README.md` frames a head-to-head comparison of `bad` against `browser-use`, `Stagehand`, `Skyvern`, OpenAI Computer Use, and Claude Computer Use. Each `(framework, task)` cell captures `success`, `wallTimeSeconds`, `turnCount`, `llmCallCount`, token buckets, and `costUsd` computed from a shared pricing table. Source: [bench/competitive/README.md:1-50](https://github.com/tangle-network/browser-agent-driver/blob/main/bench/competitive/README.md).

The driver scripts `pnpm bench:competitive:setup`, `pnpm bench:competitive:run`, and `pnpm bench:competitive:dashboard` install runners, execute cells, and render `results/_dashboard.md`. Reported headline numbers: 91.3% on WebVoyager (590 tasks, 15 sites) at \$0.09/task, 100% on a held-out competitive bench, and 95.7% on WebBench-50 excluding DataDome sites. Source: [README.md:1-30](https://github.com/tangle-network/browser-agent-driver/blob/main/README.md).
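
Computing `costUsd` from a shared pricing table amounts to a weighted sum over token buckets. The sketch below assumes two buckets and per-million-token prices; the type names, the `example-model` entry, and its prices are placeholders for illustration, not real vendor pricing or the harness's actual table.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// USD per million tokens.
interface ModelPrice {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Placeholder prices for illustration only.
const PRICING: Record<string, ModelPrice> = {
  "example-model": { inputPerMTok: 2, outputPerMTok: 8 },
};

function costUsd(
  model: string,
  usage: TokenUsage,
  table: Record<string, ModelPrice> = PRICING,
): number {
  const price = table[model];
  // Fail loudly so a missing entry is never reported as a free run.
  if (!price) throw new Error(`No price entry for model "${model}"`);
  return (usage.inputTokens * price.inputPerMTok + usage.outputTokens * price.outputPerMTok) / 1e6;
}
```

Using one table for every framework keeps cost-per-task comparable even when the frameworks report usage in different ways.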

The scenario tracks defined in `bench/scenarios/README.md` compare as follows:

| Track | CI Required | Drift Risk | Example Categories |
|-------|-------------|------------|--------------------|
| `local-deterministic` | Yes | None | navigation, form-completion |
| `staging-auth` | Yes | Low | auth, product-usage |
| `public-web` | No | High | research, scraping |
| `webbench` | No | High | navigation, scraping |
| `restricted-manual` | Never (human-in-loop) | High | blocker-recovery, auth |

## 2. Brain Evaluation Tasks

The Brain decision engine exposes four structured evaluation tasks, each implemented as a thin delegator on `Brain` plus a host-interface slice so the compiler proves completeness.

- **`evaluateImpl`** rates a page on a 1–10 scale and returns `{ score, assessment, strengths, issues, suggestions, raw, tokensUsed }`. It always forces vision (`forceVision: true`) so the model sees the actual screenshot. Source: [src/brain/tasks/evaluate.ts:30-80]().
- **`auditDesignImpl`** returns `{ score, findings, raw }` where `findings` carry categories, severities, and optional ROI/patch passthrough for design regressions. Source: [src/brain/tasks/design-audit.ts:20-60]().
- **`verifyGoalCompletionImpl`** asks the verifier model (which may differ from the main model via `verifierProvider` / `verifierModel` or adaptive routing on `navModelName`) whether the claimed result actually matches the live `PageState`. Source: [src/brain/tasks/goal-verification.ts:20-70]().
- **`recommendLinkCandidateImpl`** (link scout) picks the single best next visible link from a deterministic top-5 ranking, optionally using vision when `scoutUseVision` is set. Source: [src/brain/tasks/link-scout.ts:20-60]().

```mermaid
flowchart LR
  A[Page State] --> B[evaluate]
  A --> C[auditDesign]
  A --> D[verifyGoalCompletion]
  E[Link Candidates] --> F[linkScout]
  B --> G[score 1-10]
  C --> H[findings]
  D --> I{achieved?}
  F --> J[next ref]
```
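
The "thin delegator plus host-interface slice" pattern can be sketched as follows. The result field names come from the `evaluateImpl` description above, but their types, the `EvaluateHost` interface, its `generate` method, the prompt text, and the `StubBrain` class are hypothetical stand-ins, not the repository's API.

```typescript
// Field names from the evaluate task; the types are assumptions.
interface EvaluateResult {
  score: number;
  assessment: string;
  strengths: string[];
  issues: string[];
  suggestions: string[];
  raw: string;
  tokensUsed?: number;
}

// Host slice: the only capability this task needs (hypothetical member).
interface EvaluateHost {
  generate(
    prompt: string,
    opts: { forceVision: boolean },
  ): Promise<{ text: string; tokensUsed?: number }>;
}

// Task implementation depends on the slice, not on the whole class.
async function evaluateImpl(host: EvaluateHost, goal: string): Promise<EvaluateResult> {
  const { text, tokensUsed } = await host.generate(
    `Rate this page from 1 to 10 for: ${goal}`,
    { forceVision: true }, // the evaluate task always forces vision
  );
  const parsed = JSON.parse(text) as Partial<EvaluateResult>;
  return {
    score: Number(parsed.score),
    assessment: parsed.assessment ?? "",
    strengths: parsed.strengths ?? [],
    issues: parsed.issues ?? [],
    suggestions: parsed.suggestions ?? [],
    raw: text,
    tokensUsed,
  };
}

// Stand-in for Brain with a canned model reply. Because it declares
// `implements EvaluateHost`, the compiler rejects it if a member is missing.
class StubBrain implements EvaluateHost {
  lastForceVision: boolean | undefined;

  async generate(_prompt: string, opts: { forceVision: boolean }) {
    this.lastForceVision = opts.forceVision;
    const text =
      '{"score":7,"assessment":"Clear layout","strengths":[],"issues":[],"suggestions":[]}';
    return { text, tokensUsed: 42 };
  }

  // Thin delegator: no logic of its own, just forwards to the task.
  evaluate(goal: string): Promise<EvaluateResult> {
    return evaluateImpl(this, goal);
  }
}
```

Keeping each task behind its own narrow interface means a task can be unit-tested with a small stub, and adding a new dependency to a task surfaces as a compile error on the class rather than a runtime failure.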

## 3. Knowledge Extraction (Memory)

`extractKnowledgeImpl` distills a completed trajectory into a bounded list of reusable facts with explicit `type` values: `timing` (wait durations), `selector` (reliable element handles), `pattern` (multi-step interaction sequences), and `quirk` (app-specific gotchas). The model is capped at 10 facts and must respond with raw JSON (the parser strips Markdown `json` code fences before validation). Source: [src/brain/tasks/knowledge.ts:20-80]().

These facts feed downstream caches that speed up repeat visits to the same domain, a behaviour implied by the prompt design ("help an agent complete similar tasks faster next time"). Quality over quantity is enforced twice: in the system prompt, and by post-validating each entry against the `VALID_TYPES` allow-list. Source: [src/brain/tasks/knowledge.ts:60-90]().
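
The parse-and-validate step described above can be sketched as follows. `VALID_TYPES` and the four type names come from the source description; the `text` field, the helper names, the decision to return an empty list on malformed JSON, and truncation as the way the 10-fact cap is applied are assumptions.

```typescript
const VALID_TYPES = ["timing", "selector", "pattern", "quirk"] as const;
type FactType = (typeof VALID_TYPES)[number];

// Hypothetical fact shape; only `type` is confirmed by the description.
interface KnowledgeFact {
  type: FactType;
  text: string;
}

const MAX_FACTS = 10;
const FENCE = "`".repeat(3); // Markdown code fence marker

// Removes a leading fence line (with or without a "json" tag) and a trailing fence.
function stripJsonFences(raw: string): string {
  let text = raw.trim();
  if (text.startsWith(FENCE)) {
    const firstNewline = text.indexOf("\n");
    text = firstNewline === -1 ? "" : text.slice(firstNewline + 1);
  }
  if (text.endsWith(FENCE)) text = text.slice(0, -FENCE.length);
  return text.trim();
}

function parseKnowledge(raw: string): KnowledgeFact[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(stripJsonFences(raw));
  } catch {
    return []; // assumption: malformed output yields no facts
  }
  if (!Array.isArray(parsed)) return [];

  const facts: KnowledgeFact[] = [];
  for (const entry of parsed) {
    if (typeof entry !== "object" || entry === null) continue;
    const { type, text } = entry as Record<string, unknown>;
    // Keep only entries whose type is on the allow-list.
    if (
      typeof type === "string" &&
      (VALID_TYPES as readonly string[]).includes(type) &&
      typeof text === "string"
    ) {
      facts.push({ type: type as FactType, text });
    }
  }
  return facts.slice(0, MAX_FACTS);
}
```

Validating after parsing, rather than trusting the prompt alone, means a model that ignores the instructions cannot push unknown fact types or an oversized list into memory.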

## 4. Wallet & DeFi Testing

Wallet flows are first-class in the run pipeline. The CLI shutdown sequence guarantees the auto-approver is stopped and the persistent context is closed before the process exits, even on error. Source: [src/cli/commands/run.ts:200-240]().

`package.json` exposes setup-time helpers:

- `wallet:setup` — installs the wallet extension via `bench/wallet/setup-extension.mjs`.
- `wallet:onboard` — drives onboarding through `bench/wallet/setup-onboarding.mjs`.
- `wallet:configure` — configures the extension via `bench/wallet/...` (truncated in context).

Source: [package.json:1-40](). These scripts pair with the run-time `stopWalletAutoApprover` hook so wallet UX flows can be exercised end-to-end without manual seeding.
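
The shutdown guarantee described in this section is the kind of behaviour a nested `try`/`finally` provides. The sketch below is an illustration under assumptions: `stopWalletAutoApprover` is named after the hook mentioned above, while the `WalletSession` interface, `closeContext`, and `runWithWalletCleanup` are hypothetical and do not mirror the actual code in `src/cli/commands/run.ts`.

```typescript
// Hypothetical handle over the two resources that must be released.
interface WalletSession {
  stopWalletAutoApprover(): Promise<void>;
  closeContext(): Promise<void>;
}

// Runs the task, then always stops the auto-approver and closes the
// persistent context, whether the task resolved or threw.
async function runWithWalletCleanup<T>(
  session: WalletSession,
  task: () => Promise<T>,
): Promise<T> {
  try {
    return await task();
  } finally {
    try {
      await session.stopWalletAutoApprover();
    } finally {
      // Inner finally: the context closes even if stopping the approver fails.
      await session.closeContext();
    }
  }
}
```

The nesting matters: with a single `finally` containing both calls, a failure while stopping the auto-approver would skip closing the context and could leave a browser process behind.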

## See Also

- `README.md` — headline benchmarks, install, CLI quick-start.
- `bench/scenarios/README.md` — full scenario track taxonomy and CLI flags.
- `bench/competitive/README.md` — competitor runner design and metrics.
- `src/brain/tasks/knowledge.ts` — memory fact schema and extraction prompt.

---


## Pitfall Log

Project: tangle-network/browser-agent-driver

Summary: Found 7 structured pitfall item(s), including 0 high/blocking item(s). Top priority: Configuration risk - Configuration risk requires verification.

## 1. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.host_targets | https://github.com/tangle-network/browser-agent-driver

## 2. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/tangle-network/browser-agent-driver

## 3. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/tangle-network/browser-agent-driver

## 4. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/tangle-network/browser-agent-driver

## 5. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/tangle-network/browser-agent-driver

## 6. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: `issue_or_pr_quality=unknown`.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/tangle-network/browser-agent-driver

## 7. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: `release_recency=unknown`.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/tangle-network/browser-agent-driver

<!-- canonical_name: tangle-network/browser-agent-driver; human_manual_source: deepwiki_human_wiki -->