Doramagic Project Pack · Human Manual

agent-browser

Introduction to Agent Browser

Related topics: Installation Guide, Architecture Overview

Agent Browser is a high-performance, native Rust CLI tool designed for browser automation and AI agent integration. Unlike traditional browser automation frameworks that rely on Node.js wrappers or third-party libraries, Agent Browser communicates directly with Chrome/Chromium via the Chrome DevTools Protocol (CDP), providing a lightweight and reliable solution for web interaction tasks.

Overview

Agent Browser serves as a bridge between AI agents and web browsers, enabling autonomous web navigation, interaction, and data extraction. It is compatible with a wide range of AI agent platforms including Cursor, Claude Code, Codex, Continue, and Windsurf.

| Aspect | Description |
| --- | --- |
| Language | Rust (native CLI) |
| Protocol | Chrome DevTools Protocol (CDP) |
| Dependencies | No Playwright or Puppeteer dependency |
| Platform | Chrome/Chromium |
| License | See repository LICENSE |

Sources: skills/agent-browser/SKILL.md

Architecture

Agent Browser follows a modular architecture with distinct layers for CLI handling, native browser control, and extensible skills.

graph TD
    A[User / AI Agent] --> B[CLI Layer<br/>Rust Commands]
    B --> C[Native Actions Layer<br/>CDP Dispatcher]
    C --> D[Chrome/Chromium<br/>via CDP]
    
    E[Skills System] --> B
    E --> F[Core Skills]
    E --> G[Specialized Skills]
    
    G --> G1[Electron Apps]
    G --> G2[Slack Workspace]
    G --> G3[Exploratory Testing]
    G --> G4[Cloud Providers]
    
    H[Session Management] --> C
    H --> H1[Auth Vault]
    H --> H2[State Persistence]
    H --> H3[Video Recording]

Sources: skill-data/core/SKILL.md, skills/agent-browser/SKILL.md

Core Concepts

Accessibility-Tree Snapshots

Agent Browser generates accessibility-tree snapshots that provide structured, human-readable representations of web pages. Each interactive element receives a unique reference ID (e.g., @e1, @e2) that can be used for subsequent interactions.

Example snapshot output:

Page: Example - Log in
URL: https://example.com/login

@e1 [heading] "Log in"
@e2 [form]
  @e3 [input type="email"] placeholder="Email"
  @e4 [input type="password"] placeholder="Password"
  @e5 [button type="submit"] "Continue"
  @e6 [link] "Forgot password?"

Sources: skill-data/core/references/snapshot-refs.md, skill-data/core/SKILL.md

Element Reference Notation

Element references follow a consistent notation pattern:

@e1 [tag attribute="value"] "text content" placeholder="hint"

| Component | Description |
| --- | --- |
| `@e1` | Unique reference ID |
| `tag` | HTML tag name |
| `attribute="value"` | Key attributes |
| `"text content"` | Visible text |
| `placeholder="hint"` | Additional attributes |

Sources: skill-data/core/references/snapshot-refs.md
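
To make the notation concrete, here is a minimal Python sketch that splits one snapshot line into its components. It is not part of agent-browser: the names `SnapshotElement` and `parse_snapshot_line` are hypothetical, and the sketch assumes the line layout and two-space nesting shown in the example output above.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional

# One snapshot line: optional indent, "@eN", "[tag attrs]", then text and/or attributes.
LINE_RE = re.compile(
    r'^(?P<indent>\s*)(?P<ref>@e\d+)\s+\[(?P<inside>[^\]]*)\]\s*(?P<rest>.*)$'
)
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')
TEXT_RE = re.compile(r'"([^"]*)"')


@dataclass
class SnapshotElement:
    ref: str
    tag: str
    attrs: Dict[str, str]
    text: Optional[str]
    depth: int


def parse_snapshot_line(line: str) -> Optional[SnapshotElement]:
    """Parse one snapshot line; return None for lines that carry no ref."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    inside = m.group("inside")
    parts = inside.split(None, 1)
    tag = parts[0] if parts else ""
    # Attributes written inside the brackets, e.g. type="email".
    attrs = dict(ATTR_RE.findall(inside))
    rest = m.group("rest")
    # A leading quoted string after the brackets is the visible text.
    text_match = TEXT_RE.match(rest)
    text = text_match.group(1) if text_match else None
    trailing = rest[text_match.end():] if text_match else rest
    # Attributes written after the brackets, e.g. placeholder="Email".
    attrs.update(ATTR_RE.findall(trailing))
    # Assumption: two spaces of indentation per nesting level.
    depth = len(m.group("indent")) // 2
    return SnapshotElement(m.group("ref"), tag, attrs, text, depth)
```

Header lines such as `Page:` and `URL:` return `None`, so a caller can feed the whole snapshot line by line and keep only the parsed elements.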

Command Reference

Navigation Commands

| Command | Description |
| --- | --- |
| `agent-browser open [url]` | Launch browser with optional navigation |
| `agent-browser back` | Navigate backward |
| `agent-browser forward` | Navigate forward |
| `agent-browser reload` | Reload current page |
| `agent-browser close` | Close browser |
| `agent-browser connect <port>` | Connect to existing browser via CDP |

Sources: skill-data/core/references/commands.md

Interaction Commands

| Command | Description |
| --- | --- |
| `agent-browser click <ref>` | Click an element |
| `agent-browser fill <ref> <text>` | Type text into input |
| `agent-browser select <ref> <value>` | Select dropdown option |
| `agent-browser check <ref>` | Check a checkbox |
| `agent-browser scroll <direction> <pixels>` | Scroll page |

Sources: cli/src/native/actions.rs

Data Retrieval Commands

| Command | Description |
| --- | --- |
| `agent-browser snapshot [-i]` | Get page snapshot (interactive only with -i) |
| `agent-browser screenshot [path]` | Capture screenshot |
| `agent-browser get text <ref>` | Get visible text |
| `agent-browser get attr <ref> <name>` | Get attribute value |
| `agent-browser get url` | Get current URL |
| `agent-browser get title` | Get page title |

Sources: cli/src/output.rs, cli/src/native/actions.rs

Network Control Commands

| Command | Description |
| --- | --- |
| `agent-browser network route <url>` | Intercept network request |
| `agent-browser network unroute <url>` | Remove interception |
| `agent-browser network requests [--clear]` | View/clear network requests |
| `agent-browser network har <start\|stop> [path]` | Capture HAR file |

Sources: skill-data/core/references/commands.md, cli/src/output.rs
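
A captured HAR file can be post-processed with ordinary JSON tooling. The sketch below is illustrative only: it assumes the file follows the standard HAR 1.2 layout (`log.entries[]` with `request.url` and `response.status`), which the source does not confirm for agent-browser, and `failed_requests` is a hypothetical helper name.

```python
import json
from typing import Dict, List


def failed_requests(har_text: str, min_status: int = 400) -> List[Dict[str, object]]:
    """Return url/status pairs for entries whose HTTP status is >= min_status."""
    har = json.loads(har_text)
    failures = []
    # Assumption: standard HAR 1.2 layout with a top-level "log" object.
    for entry in har.get("log", {}).get("entries", []):
        status = entry.get("response", {}).get("status", 0)
        if status >= min_status:
            failures.append({"url": entry["request"]["url"], "status": status})
    return failures
```

In practice the text would come from the path passed to the HAR stop command, read with `open(path).read()`.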

Cookies and Storage Commands

agent-browser cookies get           # View all cookies
agent-browser cookies set --url <url> --name <name> --value <val>
agent-browser cookies clear         # Clear all cookies
agent-browser storage local         # Manage localStorage
agent-browser storage session       # Manage sessionStorage

Sources: cli/src/output.rs

Browser Settings Commands

| Command | Description |
| --- | --- |
| `agent-browser set viewport <w> <h>` | Set viewport size |
| `agent-browser set device <name>` | Emulate device |
| `agent-browser set geo <lat> <lng>` | Set geolocation |
| `agent-browser set offline on\|off` | Toggle offline mode |
| `agent-browser set headers <json>` | Set custom headers |
| `agent-browser set media dark\|light` | Set color scheme |

Sources: cli/src/output.rs

Sessions and State Management

Agent Browser supports multiple concurrent browser sessions with state persistence.

graph LR
    A[Session A] --> B[State File A]
    C[Session B] --> D[State File B]
    E[Auth Vault] --> A
    E[Auth Vault] --> C

Key Features:

  • Named Sessions: --session <name> flag for multiple sessions
  • State Persistence: Save and restore browser state
  • Auth Vault: Secure credential storage
  • Video Recording: Capture browser activity

Sources: skill-data/core/SKILL.md, skills/agent-browser/SKILL.md

Skills System

Agent Browser uses an extensible skills system that provides specialized workflows for different environments.

Core Skills

agent-browser skills get core             # Core workflows and common patterns
agent-browser skills get core --full      # Include full command reference

Specialized Skills

| Skill | Description | Command |
| --- | --- | --- |
| Electron | Desktop app automation | `agent-browser skills get electron` |
| Slack | Workspace automation | `agent-browser skills get slack` |
| Dogfood | Exploratory testing/QA | `agent-browser skills get dogfood` |
| Vercel Sandbox | Cloud browser in microVMs | `agent-browser skills get vercel-sandbox` |
| AgentCore | AWS Bedrock cloud browsers | `agent-browser skills get agentcore` |

Sources: skills/agent-browser/SKILL.md

React Developer Tools Integration

Agent Browser includes built-in React DevTools support for debugging React applications:

| Command | Description |
| --- | --- |
| `agent-browser react_tree` | View React component tree |
| `agent-browser react_inspect` | Inspect component props/state |
| `agent-browser react_renders_start` | Track render counts |
| `agent-browser react_renders_stop` | Stop render tracking |

Sources: cli/src/native/actions.rs, cli/src/react/suspense.rs

Suspense Boundary Analysis

Agent Browser can analyze React Suspense boundaries with actionability scoring:

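| Blocker Kind | Weight | Actionability |
| --- | --- | --- |
| ClientHook | 7 | 90% |
| RequestApi | 6 | 88% |
| ServerFetch | 5 | 82% |
| Cache | 4 | 74% |
| Stream | 3 | 60% |
| Unknown | 2 | 35% |
| Framework | 1 | 18% |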

Sources: cli/src/react/suspense.rs
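
The scoring table can be read as a priority order for triage. The following Python sketch ranks a list of blocker kinds using the weights and actionability values copied from the table. How agent-browser combines these numbers internally is not stated in the source, so sorting by weight first is an assumption, and `rank_blockers` is a hypothetical name.

```python
from typing import List, Tuple

# (weight, actionability in percent), copied from the table above.
BLOCKER_SCORES = {
    "ClientHook": (7, 90),
    "RequestApi": (6, 88),
    "ServerFetch": (5, 82),
    "Cache": (4, 74),
    "Stream": (3, 60),
    "Unknown": (2, 35),
    "Framework": (1, 18),
}


def rank_blockers(kinds: List[str]) -> List[str]:
    """Order blocker kinds from highest to lowest weight."""

    def score(kind: str) -> Tuple[int, int]:
        # Assumption: kinds missing from the table are scored like "Unknown".
        return BLOCKER_SCORES.get(kind, BLOCKER_SCORES["Unknown"])

    return sorted(kinds, key=score, reverse=True)
```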

Dashboard Interface

Agent Browser includes a web-based dashboard for visual browser management:

graph TD
    A[Dashboard] --> B[Controls Panel]
    A --> C[Result Panel]
    A --> D[Network Panel]
    A --> E[Extensions Panel]
    
    B --> B1[URL Input]
    B --> B2[Mode Selector]
    B --> B3[Action Controls]
    
    C --> C1[Screenshot View]
    C --> C2[Snapshot View]
    C --> C3[Step History]
    
    D --> D1[Request List]
    D --> D2[HAR Export]
    
    E --> E1[Extension List]
    E --> E2[Extension Details]

The dashboard is built with React and supports:

  • Resizable panels for flexible layouts
  • Theme switching (light/dark)
  • Mobile-responsive design
  • Real-time step history

Sources: examples/environments/app/page.tsx, packages/dashboard/src/components/network-panel.tsx, packages/dashboard/src/components/extensions-panel.tsx

Best Practices

1. Always Snapshot Before Interacting

# CORRECT - Snapshot first to get refs
agent-browser open https://example.com
agent-browser snapshot -i          # Get refs first
agent-browser click @e1            # Use ref

# WRONG - Ref doesn't exist yet
agent-browser open https://example.com
agent-browser click @e1            # Will fail!

2. Re-snapshot After Navigation

Element references change when the page navigates. Always take a new snapshot after clicking links or navigating to new pages.
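
A script that drives agent-browser can enforce the two rules above with a small guard. This is a sketch of the workflow rule only, written in Python with a hypothetical `RefTracker` class. It does not call agent-browser and says nothing about how the tool tracks refs internally.

```python
import re
from typing import Set

# Matches refs such as @e1 or @e42, but not text like "user@e1.example".
REF_RE = re.compile(r'(?<![\w.])@e\d+\b')


class RefTracker:
    """Remembers which @eN refs appeared in the most recent snapshot."""

    def __init__(self) -> None:
        self._known: Set[str] = set()

    def record_snapshot(self, snapshot_text: str) -> None:
        # Replace, rather than extend, the known set: old refs are stale.
        self._known = set(REF_RE.findall(snapshot_text))

    def invalidate(self) -> None:
        # Call after navigation or any action that changes the page.
        self._known.clear()

    def require(self, ref: str) -> str:
        if ref not in self._known:
            raise LookupError(f"{ref} is not in the latest snapshot; run snapshot -i again")
        return ref
```

Calling `invalidate()` after every click that may navigate makes a stale ref fail fast in the script instead of at the browser.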

3. Use Sessions for Complex Workflows

agent-browser --session my-session open https://example.com
agent-browser --session my-session snapshot -i
# ... perform actions ...
agent-browser --session my-session close

Sources: skill-data/core/references/snapshot-refs.md

Installation and Setup

Prerequisites

  • Chrome or Chromium (the agent-browser install command can download Chrome for Testing if no browser is installed)
  • Operating system: macOS, Linux, or Windows

Installation

Refer to the repository's installation instructions for your platform. Agent Browser is distributed as a native binary with no runtime dependencies.

Configuration Files

| File | Purpose |
| --- | --- |
| ~/.agent-browser/ | Default config directory |
| Sessions | Stored in config directory |
| Auth Vault | Encrypted credential storage |

Sources: AGENTS.md

Summary

Agent Browser provides a powerful, efficient, and AI-agent-friendly approach to browser automation. Its key differentiators include:

  • Native Rust implementation for high performance
  • Direct CDP communication without third-party dependencies
  • Accessibility-tree snapshots for reliable element targeting
  • Session management for complex multi-step workflows
  • Extensible skills system for specialized environments
  • Built-in React DevTools integration for debugging

These features make Agent Browser an ideal choice for AI agents, automated testing pipelines, and developer workflows requiring precise browser control.

Source: https://github.com/vercel-labs/agent-browser / Human Manual

Installation Guide

Related topics: Introduction to Agent Browser

Overview

The agent-browser project is a native Rust CLI tool designed for browser automation, providing AI agents with reliable web interaction capabilities. Unlike traditional browser automation tools that rely on Node.js wrappers, agent-browser delivers a fast, lightweight solution built directly in Rust with Chrome/Chromium support via Chrome DevTools Protocol (CDP). The installation process handles downloading the necessary Chrome browser binaries, setting up platform-specific binaries, and configuring dependencies for the dashboard UI.

Sources: AGENTS.md

Prerequisites

System Requirements

Before installing agent-browser, ensure your system meets the following requirements:

| Requirement | Details |
| --- | --- |
| Operating System | macOS, Linux, or Windows (7 platform binaries built) |
| Chrome/Chromium | Required for browser automation functionality |
| Rust Toolchain | Required for building from source |
| Node.js/pnpm | Required for dashboard development |

The project builds all 7 platform binaries during CI/CD, including native binaries for different architectures. Chrome is downloaded directly from Chrome for Testing during the installation process, eliminating the need for system-installed Chrome browsers.

Sources: AGENTS.md

Required Dependencies

| Dependency | Purpose | Installation Method |
| --- | --- | --- |
| Chrome/Chromium | Browser automation target | Auto-downloaded via install command |
| Cargo/Rust | Building CLI from source | rustup.rs |
| pnpm | Dashboard package management | `npm install -g pnpm` |

Installation Methods

Method 1: npm Package Installation (Recommended)

The recommended installation method uses the npm registry for cross-platform compatibility:

npm install -g agent-browser

After installation, you must run the setup command to download Chrome binaries:

agent-browser install

Sources: skills/agent-browser/SKILL.md

Method 2: Building from Source

For development or customization, build the CLI from source:

# Clone the repository
git clone https://github.com/vercel-labs/agent-browser.git
cd agent-browser

# Install dependencies and build
cd cli && cargo build --release

The Rust codebase architecture follows a modular structure:

graph TD
    A[cli/src/native/] --> B[daemon/]
    A --> C[actions/]
    A --> D[browser/]
    A --> E[CDP client/]
    A --> F[snapshot/]
    A --> G[state/]

The --engine flag allows selecting between Chrome and Lightpanda browser engines, providing flexibility in automation scenarios.

Sources: AGENTS.md

Method 3: Docker Installation

For containerized environments, Docker builds are supported:

# Build from the project's Dockerfile
docker build -t agent-browser -f docker/Dockerfile.build .

Docker installation is particularly useful for CI/CD pipelines and reproducible automation environments where system dependencies need to be isolated.

Post-Installation Setup

Chrome Binary Download

After installing the CLI package, you must download the Chrome binary:

agent-browser install

This command retrieves Chrome directly from Chrome for Testing, ensuring a compatible and up-to-date browser binary is available for all automation tasks. The --download-path flag can specify a custom location:

agent-browser --download-path /custom/path install

Sources: cli/src/flags.rs:45-49

Verifying Installation

Verify the installation by checking the version and available commands:

agent-browser --version
agent-browser --help

The CLI provides comprehensive command documentation through the help system:

| Command | Description |
| --- | --- |
| `agent-browser open <url>` | Open a URL in the browser |
| `agent-browser snapshot` | Capture accessibility tree with element refs |
| `agent-browser click @<ref>` | Click element by reference |
| `agent-browser skills get <name>` | Retrieve skill documentation |
| `agent-browser install` | Download Chrome binaries |

Sources: cli/src/output.rs

Skill Documentation Loading

Agent-browser uses a skill-based documentation system that loads content dynamically based on the installed version:

# Load core workflows and common patterns
agent-browser skills get core

# Include full command reference and templates
agent-browser skills get core --full

# List all available skills
agent-browser skills list

Available specialized skills:

| Skill | Purpose |
| --- | --- |
| electron | Electron desktop apps (VS Code, Slack, Discord, Figma) |
| slack | Slack workspace automation |
| dogfood | Exploratory testing and QA |
| vercel-sandbox | Agent-browser inside Vercel Sandbox microVMs |
| agentcore | AWS Bedrock AgentCore cloud browsers |

Sources: skills/agent-browser/SKILL.md

Platform-Specific Considerations

macOS

On macOS, if you encounter security prompts about unsigned applications, you may need to allow the application in System Preferences > Security & Privacy, or run:

xattr -d com.apple.quarantine /path/to/agent-browser

Linux

On Linux, Chrome needs shared libraries such as GTK 3 and NSS. Install them via your package manager:

# Debian/Ubuntu
sudo apt-get install libgtk-3-0 libnss3

# Fedora
sudo dnf install gtk3 nss

Windows

Windows installations automatically configure the required runtime dependencies. Ensure Windows Subsystem for Linux (WSL) compatibility if running in hybrid environments.

Running Tests

After installation, verify the setup by running the test suite:

# Unit tests (fast, no Chrome required)
cd cli && cargo test

# End-to-end tests (requires Chrome installed)
cd cli && cargo test e2e -- --ignored --test-threads=1

The project contains approximately 320 unit tests and 18 e2e tests. E2E tests launch real headless Chrome instances and must run serially to avoid instance contention.

Sources: AGENTS.md

Troubleshooting

Chrome Download Failures

If the install command fails to download Chrome:

  1. Check network connectivity to Chrome for Testing
  2. Verify write permissions to the download directory
  3. Use --download-path to specify an alternative location with proper permissions

Permission Denied Errors

Ensure the agent-browser binary has execute permissions:

chmod +x /path/to/agent-browser

Engine Selection

If Chrome automation fails, try specifying the engine explicitly:

agent-browser --engine chrome open https://example.com

The --engine flag supports Chrome (default) and Lightpanda engines for different automation scenarios.

Next Steps

After successful installation:

  1. Load core skill documentation: agent-browser skills get core --full
  2. Open a test URL: agent-browser open https://example.com
  3. Capture a snapshot: agent-browser snapshot -i
  4. Explore specialized skills for your use case

Sources: skills/agent-browser/SKILL.md

Sources: AGENTS.md

Element References System

Related topics: State Inspection Commands, Interaction Commands

The Element References System is a core mechanism in agent-browser that provides human-readable identifiers for DOM elements during browser automation tasks. Instead of relying on fragile CSS selectors or XPath expressions, the system assigns a unique reference ID (such as @e1, @e2) to each element when a snapshot is taken. A ref stays valid for the page state that produced it and can be used in subsequent automation commands until navigation or a dynamic content change makes a new snapshot necessary.

Overview

Element references serve as the primary interface between automation scripts and the browser's accessibility tree. When a snapshot is taken, each interactive element receives a reference ID that can be used in commands like click, fill, type, and get without requiring re-selection.

graph TD
    A[Browser Page] --> B[snapshot Command]
    B --> C[Accessibility Tree Traversal]
    C --> D[Element Identification]
    D --> E[Reference Assignment]
    E --> F[@e1 @e2 @e3 ...]
    F --> G[Automation Commands]
    G --> H[click @e1]
    G --> I[fill @e2]
    G --> J[get text @e3]

Reference Notation Format

Element references follow a standardized notation format that encodes element metadata:

@e1 [tag type="value"] "text content" placeholder="hint"
│    │   │             │               │
│    │   │             │               └─ Additional attributes
│    │   │             └─ Visible text
│    │   └─ Key attributes shown
│    └─ HTML tag name
└─ Unique ref ID

Sources: skill-data/core/references/snapshot-refs.md

Reference Components

| Component | Description | Example |
| --- | --- | --- |
| `@eN` | Unique reference identifier | `@e1`, `@e42` |
| Tag | HTML element type | `button`, `input`, `link` |
| Type attribute | Element type classification | `type="email"`, `type="password"` |
| Text content | Visible text on element | `"Submit"`, `"Log in"` |
| Placeholder | Input placeholder text | `placeholder="Email"` |

Common Reference Patterns

The snapshot system recognizes common element patterns and standardizes their reference notation:

@e1 [button] "Submit"                    # Button with text
@e2 [input type="email"]                 # Email input
@e3 [input type="password"]              # Password input
@e4 [a href="/page"] "Link Text"         # Anchor link
@e5 [select]                             # Dropdown
@e6 [textarea] placeholder="Message"     # Text area
@e7 [div class="modal"]                  # Container element
@e8 [img alt="Logo"]                     # Image with alt text
@e9 [checkbox] checked                   # Checked checkbox
@e10 [radio] selected                    # Selected radio button

Sources: skill-data/core/references/snapshot-refs.md

Snapshot Command Options

The snapshot command generates element references with various filtering and formatting options:

agent-browser snapshot                    # Full tree (verbose)
agent-browser snapshot -i                 # Interactive elements only (preferred)
agent-browser snapshot -i -u              # Include href URLs on links
agent-browser snapshot -i -c              # Compact mode (no empty structural nodes)
agent-browser snapshot -i -d 3            # Cap depth at 3 levels
agent-browser snapshot -s "#main"         # Scope to a CSS selector
agent-browser snapshot -i --json          # Machine-readable output

Sources: skill-data/core/SKILL.md

Option Reference

| Option | Purpose | Use Case |
| --- | --- | --- |
| `-i` | Interactive elements only | Preferred for automation |
| `-u` | Include href URLs | When link destinations matter |
| `-c` | Compact output | Complex pages with many empty nodes |
| `-d N` | Depth limit | Focus on specific page sections |
| `-s SELECTOR` | CSS scope | Target specific page regions |
| `--json` | JSON format | Programmatic processing |
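
When snapshots are taken from a script, the options above can be assembled into an argument list instead of a hand-built string. The Python sketch below uses only the flags listed in this section. `snapshot_argv` is a hypothetical helper, and rejecting a negative depth is an assumption made here, not documented tool behavior.

```python
from typing import List, Optional


def snapshot_argv(
    interactive: bool = True,
    urls: bool = False,
    compact: bool = False,
    depth: Optional[int] = None,
    selector: Optional[str] = None,
    as_json: bool = False,
) -> List[str]:
    """Build the argument list for an agent-browser snapshot call."""
    argv = ["agent-browser", "snapshot"]
    if interactive:
        argv.append("-i")  # interactive elements only
    if urls:
        argv.append("-u")  # include href URLs on links
    if compact:
        argv.append("-c")  # drop empty structural nodes
    if depth is not None:
        if depth < 0:
            raise ValueError("depth must not be negative")
        argv += ["-d", str(depth)]
    if selector is not None:
        argv += ["-s", selector]  # scope to a CSS selector
    if as_json:
        argv.append("--json")
    return argv
```

The resulting list can be passed to `subprocess.run(argv, capture_output=True, text=True)`, which avoids shell quoting problems with selectors such as `#main`.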

Element Reference Commands

Element references are used with various commands to interact with page elements:

Direct Element Commands

agent-browser click @e1                   # Click element
agent-browser click @e1 --new-tab          # Click and open in new tab
agent-browser fill @e2 "text"             # Fill input field
agent-browser type @e2 "text"             # Type character by character
agent-browser press Enter                 # Press key on focused element

State Inspection Commands

agent-browser get text @e1                # Get visible text
agent-browser get html @e1                # Get innerHTML
agent-browser get attr @e1 href           # Get specific attribute
agent-browser get value @e1               # Get input value
agent-browser get title                   # Get page title
agent-browser get url                     # Get current URL
agent-browser get count ".item"           # Count matching elements

State Checking Commands

The is command verifies element states:

agent-browser is visible @e1
agent-browser is enabled @e1
agent-browser is checked @e1

Sources: cli/src/output.rs

Find Command and Locators

The find command provides an alternative to snapshot-based reference acquisition by locating elements using various criteria:

agent-browser find <locator> <value> <action> [text]

Supported Locators

| Locator | Description | Example |
| --- | --- | --- |
| role | ARIA role selector | `find role button click` |
| text | Text content match | `find text "Submit" click` |
| label | Label text association | `find label "Email" fill` |
| placeholder | Placeholder attribute | `find placeholder "Search"` |
| alt | Alt text (images) | `find alt "Logo" click` |
| title | Title attribute | `find title "Help" click` |
| testid | Test identifier | `find testid "submit-btn" click` |
| first | First matching selector | `find first button click` |
| last | Last matching selector | `find last link click` |
| nth | Nth matching element | `find nth 5 button click` |

Sources: cli/src/commands.rs

Find Command Options

| Option | Purpose |
| --- | --- |
| `--exact` | Perform exact string matching |
| `--name <name>` | Filter by accessible name (role locator) |
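
The find syntax and locator list can be checked before a command is ever run. The Python sketch below validates the locator against the table above and assembles the arguments in the order shown in the examples. `find_argv` is a hypothetical helper, and placing `--exact` and `--name` after the positional arguments is an assumption.

```python
from typing import List, Optional

# Locator names copied from the Supported Locators table.
LOCATORS = {
    "role", "text", "label", "placeholder", "alt",
    "title", "testid", "first", "last", "nth",
}


def find_argv(
    locator: str,
    value: str,
    action: str,
    text: Optional[str] = None,
    *,
    index: Optional[int] = None,
    exact: bool = False,
    name: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `agent-browser find`."""
    if locator not in LOCATORS:
        raise ValueError(f"unsupported locator: {locator}")
    argv = ["agent-browser", "find", locator]
    if locator == "nth":
        # The nth example places the index before the selector.
        if index is None:
            raise ValueError("the nth locator needs an index")
        argv.append(str(index))
    argv += [value, action]
    if text is not None:
        argv.append(text)
    if exact:
        argv.append("--exact")
    if name is not None:
        argv += ["--name", name]
    return argv
```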

Action Dispatch System

Element reference commands are dispatched to handlers through the action routing system:

graph LR
    A[Command Input] --> B["dispatch(\"click\", state)"]
    B --> C{Match Action}
    C -->|click| D[handle_click]
    C -->|fill| E[handle_fill]
    C -->|get| F[handle_get]
    C -->|is| G[handle_is]
    C -->|find| H[handle_find]

The action router maps command strings to their respective handlers in the native daemon:

"click" => handle_dispatch(cmd, state).await,
"fill" => handle_dispatch(cmd, state).await,
"get" => handle_dispatch(cmd, state).await,
"is" => handle_dispatch(cmd, state).await,
"find" => handle_dispatch(cmd, state).await,

Sources: cli/src/native/actions.rs
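
The routing idea can be illustrated outside Rust with a dictionary of handlers. This Python sketch shows the pattern only and is not a port of `actions.rs`: the handler bodies, the command fields (`action`, `ref`, `text`) and the response shape are all hypothetical.

```python
from typing import Callable, Dict

Handler = Callable[[dict, dict], dict]


def handle_click(cmd: dict, state: dict) -> dict:
    # Hypothetical handler: record the action and echo the target ref.
    state["last_action"] = "click"
    return {"ok": True, "action": "click", "ref": cmd.get("ref")}


def handle_fill(cmd: dict, state: dict) -> dict:
    state["last_action"] = "fill"
    return {"ok": True, "action": "fill", "ref": cmd.get("ref"), "text": cmd.get("text")}


# Action name -> handler, playing the role of the match arms shown above.
HANDLERS: Dict[str, Handler] = {
    "click": handle_click,
    "fill": handle_fill,
}


def dispatch(cmd: dict, state: dict) -> dict:
    """Route a command to its handler, or report an unknown action."""
    handler = HANDLERS.get(cmd.get("action"))
    if handler is None:
        return {"ok": False, "error": f"unknown action: {cmd.get('action')}"}
    return handler(cmd, state)
```

A table keeps the routing data separate from the handlers, so adding an action means adding one entry.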

Available Element Actions

| Action | Handler | Purpose |
| --- | --- | --- |
| click | handle_dispatch | Mouse click |
| fill | handle_dispatch | Fill input with text |
| type | handle_dispatch | Character-by-character typing |
| press | handle_dispatch | Keyboard press |
| hover | handle_dispatch | Mouse hover |
| select | handle_dispatch | Select dropdown option |
| check | handle_dispatch | Check checkbox/radio |
| uncheck | handle_dispatch | Uncheck checkbox |
| focus | handle_dispatch | Focus element |
| blur | handle_dispatch | Blur element |

Iframe Support

Element references automatically handle iframe content. When a snapshot is taken, iframe elements are resolved and their child accessibility trees are included inline:

agent-browser snapshot -i
# Output:
# @e1 [heading] "Checkout"
# @e2 [Iframe] "payment-frame"
#   @e3 [input] "Card number"
#   @e4 [input] "Expiry"
#   @e5 [button] "Pay"
# @e6 [button] "Cancel"

References to elements inside iframes carry frame context, allowing direct interactions without manual frame switching:

agent-browser click @e3                    # Works inside iframe
agent-browser fill @e4 "12/25"

Sources: skill-data/core/references/snapshot-refs.md

Best Practices

Always Snapshot Before Interacting

# CORRECT
agent-browser open https://example.com
agent-browser snapshot -i          # Get refs first
agent-browser click @e1            # Use ref

# WRONG
agent-browser open https://example.com
agent-browser click @e1            # Ref doesn't exist yet!

Re-Snapshot After Navigation

agent-browser click @e5            # Navigates to new page
agent-browser snapshot -i          # Get new refs
agent-browser click @e1            # Use new refs

Re-Snapshot After Dynamic Changes

agent-browser click @e1            # Opens dropdown
agent-browser snapshot -i          # See dropdown items
agent-browser click @e7            # Select item

Snapshot Specific Regions

For complex pages, snapshot specific areas to reduce noise:

# Snapshot just a form
agent-browser snapshot @e9

Session-Dependent References

Element references are session-dependent: the same element on the same page may receive a different reference ID in another browser session. The Slack skill documentation, for example, gives only typical ranges for common Slack elements and locates each one by searching the snapshot for its text:

| Element | Typical Ref Range | How to Find |
| --- | --- | --- |
| Home tab | e10-e20 | `snapshot -i \| grep "Home"` |
| DMs tab | e10-e20 | `snapshot -i \| grep "DMs"` |
| Activity tab | e10-e20 | `snapshot -i \| grep "Activity"` |
| Search | e5-e10 | `snapshot -i \| grep "Search"` |
| More unreads | e20-e30 | `snapshot -i \| grep "More unreads"` |
| Channel refs | e30+ | `snapshot -i \| grep "channel-name"` |

Sources: skill-data/slack/references/slack-tasks.md
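
The same lookup can be done in a script once the snapshot text has been captured. The Python sketch below mirrors the grep approach: it returns the ref of every snapshot line containing a given string. `refs_matching` is a hypothetical helper, and like plain grep it matches case-sensitively.

```python
from typing import List


def refs_matching(snapshot_text: str, needle: str) -> List[str]:
    """Return the @eN ref of each snapshot line that contains needle."""
    refs = []
    for line in snapshot_text.splitlines():
        stripped = line.strip()
        # Only lines that start with a ref can be used as command targets.
        if stripped.startswith("@e") and needle in stripped:
            refs.append(stripped.split()[0])
    return refs
```

Looking the ref up by text on every run avoids hard-coding an ID that may differ in the next session.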

Architecture Summary

graph TD
    subgraph "CLI Layer"
        A[User Command] --> B[commands.rs Parser]
        B --> C[Command Dispatch]
    end
    
    subgraph "Native Daemon"
        C --> D[actions.rs Router]
        D --> E[State Manager]
        E --> F[CDP Client]
    end
    
    subgraph "Browser Layer"
        F --> G[Chrome DevTools Protocol]
        G --> H[Accessibility Tree]
    end
    
    subgraph "Reference Generation"
        H --> I[Element ID Assignment]
        I --> J[@eN Reference Labels]
    end
    
    J --> K[Snapshot Output]
    K --> L[Automation Commands]

The Element References System provides the foundation for reliable browser automation by abstracting DOM complexity behind human-readable identifiers. Each identifier is valid for the snapshot that produced it, so a new snapshot is needed after navigation or dynamic page changes.

Sources: skill-data/core/references/snapshot-refs.md

Architecture Overview

Related topics: Daemon and CDP Protocol, Introduction to Agent Browser

agent-browser is a Rust-based browser automation framework that provides high-performance browser control through native CDP (Chrome DevTools Protocol) communication. The system is designed for AI agent integration, enabling reliable and observable browser automation.

System Architecture

The architecture follows a layered approach with clear separation between the CLI interface, daemon process, and browser engine.

graph TB
    subgraph "Client Layer"
        CLI[CLI Interface]
        Dashboard[Web Dashboard]
    end

    subgraph "Daemon Layer"
        WS[WebSocket Server]
        Dispatcher[Action Dispatcher]
        State[State Manager]
    end

    subgraph "CDP Layer"
        CDP[CDP Client]
        Protocol[Protocol Handler]
    end

    subgraph "Browser Engine"
        Chrome[Chrome/Chromium]
        Lightpanda[Lightpanda]
    end

    CLI --> WS
    Dashboard --> WS
    WS --> Dispatcher
    Dispatcher --> CDP
    CDP --> Chrome
    CDP --> Lightpanda
    Dispatcher --> State

Core Components

Daemon Architecture

The browser automation daemon is the central coordinator that manages browser sessions and handles command dispatching. It runs as a persistent process that maintains browser state across multiple operations.

Key Responsibilities:

| Component | Responsibility |
| --- | --- |
| WebSocket Server | Accepts client connections with origin validation |
| Action Dispatcher | Routes commands to appropriate handlers |
| State Manager | Maintains session state and snapshots |
| CDP Client | Manages protocol-level communication |

Sources: cli/src/native/mod.rs

Action Dispatch System

The action system provides a comprehensive set of browser automation commands. Actions are dispatched based on command type and handle specific browser operations.

Action Categories:

| Category | Commands |
| --- | --- |
| Navigation | goto, back, forward, reload, waitforurl, waitforloadstate |
| Interaction | click, fill, press, select, check, uncheck, multiselect |
| Content | snapshot, innertext, innerhtml, gettext, getattribute |
| State | cookies_get, cookies_set, storage_get, storage_set |
| Network | route, unroute, requests, har |
| React Debug | react_tree, react_inspect, react_renders_start |

Sources: cli/src/native/actions.rs:1-50

CDP Client Layer

The CDP (Chrome DevTools Protocol) client handles low-level communication with the browser engine. This abstraction allows the system to work with different browser engines through a unified interface.

Supported Engines:

| Engine | Selection Flag |
| --- | --- |
| Chrome/Chromium | `--engine chrome` (default) |
| Lightpanda | `--engine lightpanda` |

Sources: cli/src/native/mod.rs

Communication Protocol

WebSocket Server

The daemon exposes a WebSocket server for client communication. Security is enforced through origin validation.

graph LR
    Client[Client App] -->|WebSocket| OriginCheck[Origin Check]
    OriginCheck -->|Allowed| Accept[Accept Connection]
    OriginCheck -->|Blocked| Reject[403 Forbidden]

Origin Validation:

The server validates the Origin header on incoming WebSocket requests. Connections from disallowed origins receive a 403 Forbidden response before any data exchange occurs.

if !is_allowed_origin(origin.as_deref()) {
    return Err(reject); // Status: FORBIDDEN
}

Sources: cli/src/native/stream/websocket.rs:15-30
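
The check can be pictured as an allow-list over the parsed Origin header. The Python sketch below illustrates the idea and is not the daemon's actual rule set: the allowed hosts, the accepted schemes, and the treatment of a missing header are all assumptions, since the body of `is_allowed_origin` is not shown in the source.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumption: only loopback origins are trusted.
ALLOWED_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_allowed_origin(origin: Optional[str], allow_missing: bool = True) -> bool:
    """Return True when the Origin header value points at a trusted host."""
    if origin is None:
        # Assumption: non-browser clients such as a CLI send no Origin header.
        return allow_missing
    parts = urlsplit(origin)
    return parts.scheme in {"http", "https"} and parts.hostname in ALLOWED_HOSTS
```

Rejecting before the WebSocket upgrade completes, as the text describes, means a page on an untrusted site never gets a channel to the daemon.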

Request/Response Flow

All commands follow a request-response pattern:

  1. Client sends JSON command via WebSocket
  2. Server validates origin
  3. Dispatcher routes to appropriate handler
  4. Handler executes CDP operation
  5. Result returned as JSON response
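
The steps above can be sketched from the client side as a pair of JSON helpers. The wire format is not documented in the source, so the field names used here (`id`, `action`, plus flat parameters) are hypothetical, as are the function names.

```python
import json
from typing import Any, Dict


def encode_command(request_id: int, action: str, **params: Any) -> str:
    """Serialize one command as a JSON text frame (hypothetical shape)."""
    message = {"id": request_id, "action": action}
    message.update(params)
    return json.dumps(message)


def decode_response(raw: str) -> Dict[str, Any]:
    """Parse a JSON response and check that it can be matched to a request."""
    message = json.loads(raw)
    if not isinstance(message, dict) or "id" not in message:
        raise ValueError("response must be a JSON object with an id")
    return message
```

Carrying an `id` in both directions lets a client match each response to its request when several commands are in flight on the same connection.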

State Management

Session State

The daemon maintains persistent state for each browser session:

| State Component | Description |
| --- | --- |
| Tabs | Active tab list and current tab reference |
| Frame | Current frame hierarchy |
| Viewport | Window dimensions |
| Recording | Video recording status |

Sources: cli/src/native/stream/websocket.rs:5-15

Snapshot System

The snapshot system represents each page as an accessibility tree and assigns element references (@e1, @e2, etc.) for element selection. References stay valid until the page navigates or its content changes.

Best Practice: Always snapshot before interacting with elements, as refs change after navigation or dynamic content changes.

Sources: skill-data/core/references/snapshot-refs.md
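The reason for re-snapshotting can be illustrated with a small lookup table that is rebuilt on every snapshot and cleared on navigation. This is a sketch under assumptions: the RefMap type, the use of numeric node IDs, and the error message are all hypothetical.

```rust
use std::collections::HashMap;

/// Maps snapshot refs such as "@e1" to node IDs (hypothetical model).
struct RefMap {
    entries: HashMap<String, u64>,
}

impl RefMap {
    /// Assigns @e1, @e2, ... to nodes in snapshot order.
    fn from_snapshot(node_ids: &[u64]) -> Self {
        let entries = node_ids
            .iter()
            .enumerate()
            .map(|(i, &id)| (format!("@e{}", i + 1), id))
            .collect();
        RefMap { entries }
    }

    /// Looks up a ref; a miss means the caller needs a fresh snapshot.
    fn resolve(&self, reference: &str) -> Result<u64, String> {
        self.entries
            .get(reference)
            .copied()
            .ok_or_else(|| format!("unknown ref {}; take a new snapshot", reference))
    }

    /// Called after navigation or a large DOM change: old refs are dropped.
    fn invalidate(&mut self) {
        self.entries.clear();
    }
}
```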

React Inspection System

For React-based applications, the daemon provides specialized inspection capabilities:

Blocker Detection

The system identifies React Suspense boundaries and classifies them by impact:

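| Blocker Kind | Weight | Actionability |
| --- | --- | --- |
| ClientHook | 7 | 90 |
| RequestApi | 6 | 88 |
| ServerFetch | 5 | 82 |
| Cache | 4 | 74 |
| Stream | 3 | 60 |
| Unknown | 2 | 35 |
| Framework | 1 | 18 |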

Boundary Classification

| Boundary Kind | Description |
| --- | --- |
| RouteSegment | Next.js/App Router segment boundary |
| ExplicitSuspense | User-declared <Suspense> component |
| Component | Implicit boundary from component structure |

Sources: cli/src/native/react/suspense.rs:30-60
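The blocker table can be expressed as an enum with two lookup methods. The sketch below reads the table's weight column as 7 down to 1 and its actionability column as 90 down to 18. Sorting by descending weight is one plausible use and is an assumption, since the source does not show how suspense.rs combines the two scores.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BlockerKind {
    ClientHook,
    RequestApi,
    ServerFetch,
    Cache,
    Stream,
    Unknown,
    Framework,
}

impl BlockerKind {
    /// Impact weight as listed in the table.
    fn weight(self) -> u8 {
        match self {
            BlockerKind::ClientHook => 7,
            BlockerKind::RequestApi => 6,
            BlockerKind::ServerFetch => 5,
            BlockerKind::Cache => 4,
            BlockerKind::Stream => 3,
            BlockerKind::Unknown => 2,
            BlockerKind::Framework => 1,
        }
    }

    /// Actionability score as listed in the table.
    fn actionability(self) -> u8 {
        match self {
            BlockerKind::ClientHook => 90,
            BlockerKind::RequestApi => 88,
            BlockerKind::ServerFetch => 82,
            BlockerKind::Cache => 74,
            BlockerKind::Stream => 60,
            BlockerKind::Unknown => 35,
            BlockerKind::Framework => 18,
        }
    }
}

/// Orders blockers from highest to lowest weight (hypothetical ranking rule).
fn rank(mut blockers: Vec<BlockerKind>) -> Vec<BlockerKind> {
    blockers.sort_by(|a, b| b.weight().cmp(&a.weight()));
    blockers
}
```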

CLI Architecture

The CLI provides both interactive and scripted access to browser automation:

Command Structure

agent-browser <command> [args]

Primary Command Groups:

| Group | Purpose |
| --- | --- |
| agent-browser open | Navigate to URL |
| agent-browser <action> | Execute automation action |
| agent-browser set | Configure browser settings |
| agent-browser network | Manage network interception |
| agent-browser state | Save/load/restore sessions |
| agent-browser tab | Manage browser tabs |
| agent-browser screenshot | Capture page images |
| agent-browser install | Download Chrome |

Sources: cli/src/output.rs

Dashboard Architecture

The web-based dashboard provides visual monitoring and control:

graph TD
    Dashboard[Dashboard App] -->|API| Daemon
    Dashboard -->|Display| Results[screenshots/snapshots]
    Dashboard -->|Controls| Form[Control Form]

Dashboard Features:

  • Resizable split view (controls + results)
  • Responsive layout for mobile/desktop
  • Real-time screenshot display with base64 encoding
  • Snapshot viewer with step history
  • Step-by-step playback of automation sequences

Sources: packages/dashboard/src/components/extensions-panel.tsx

Installation and Dependencies

Chrome Installation

The install command downloads Chrome directly from Chrome for Testing:

agent-browser install

This ensures the Chrome binary is available for CDP communication without requiring system-wide Chrome installation.

Testing Architecture

Unit Tests

Roughly 320 fast tests verify individual components without requiring Chrome:

cd cli && cargo test

End-to-End Tests

Integration tests that launch real headless Chrome:

cd cli && cargo test e2e -- --ignored --test-threads=1

Requirements:

  • Chrome must be installed
  • Tests run serially to avoid browser instance contention

Security Considerations

| Aspect | Implementation |
| --- | --- |
| Origin Validation | WebSocket connections validated before acceptance |
| Session Isolation | Each session maintains separate state |
| Credential Storage | Authentication vault for secure credential handling |

Summary

agent-browser implements a clean three-tier architecture:

  1. Client Layer - CLI and dashboard provide user interfaces
  2. Daemon Layer - Rust-based server handles command dispatch and state
  3. CDP Layer - Browser-agnostic protocol client enables Chrome/Lightpanda support

The design prioritizes reliability (stable element refs), observability (snapshots, screenshots, video recording), and extensibility (skill-based system for specialized automation tasks).

Sources: cli/src/native/mod.rs

Daemon and CDP Protocol

Related topics: Architecture Overview, Browser Engine Integration

Overview

The agent-browser project implements a native Rust-based browser automation daemon that communicates with Chrome/Chromium browsers via the Chrome DevTools Protocol (CDP). The architecture separates the automation logic from browser control through WebSocket-based CDP connections, enabling AI agents to interact with web pages through a CLI interface.

Architecture Layer Diagram:

graph TD
    A[CLI Interface] --> B[Action Dispatcher]
    B --> C[CDP Client]
    C --> D[WebSocket Stream]
    D --> E[CDP Loop Handler]
    E --> F[Chrome Browser Instance]
    
    G[CDP Protocol Files] --> F
    H[Generated CDP Types] --> C

Daemon Architecture

Native Daemon Components

The daemon lives in cli/src/native/ and handles all browser automation tasks. The main components include:

| Component | Location | Purpose |
| --- | --- | --- |
| Daemon | cli/src/native/daemon/ | Process management and state coordination |
| Actions | cli/src/native/actions.rs | Command handlers for browser operations |
| Browser | cli/src/native/browser/ | Browser instance lifecycle |
| CDP Client | cli/src/native/cdp/client.rs | Protocol communication |
| CDP Loop | cli/src/native/stream/cdp_loop.rs | Message processing loop |

Sources: cli/src/native/actions.rs

Action Dispatch

The action handler maps command names to their implementation functions. Supported actions include:

let result = match action {
    "launch" => handle_launch(cmd, state).await,
    "navigate" => handle_navigate(cmd, state).await,
    "url" => handle_url(state).await,
    "cdp_url" => handle_cdp_url(state),
    "inspect" => handle_inspect(state).await,
    "title" => handle_title(state).await,
    "content" => handle_content(state).await,
    "evaluate" => handle_evaluate(cmd, state).await,
    "close" => handle_close(state).await,
    "snapshot" => handle_snapshot(cmd, state).await,
    "screenshot" => handle_screenshot(cmd, state).await,
    "click" => handle_click(cmd, state).await,
    "dblclick" => handle_dblclick(cmd, state).await,
    "fill" => handle_fill(cmd, state).await,
    "type" => handle_type(cmd, state).await,
    "press" => handle_press(cmd, state).await,
    "hover" => handle_hover(cmd, state).await,
    "scroll" => handle_scroll(cmd, state).await,
    // ... additional actions
};

Sources: cli/src/native/actions.rs:50-75

Browser Engine Selection

The --engine flag selects between Chrome and Lightpanda browsers. Chrome is downloaded from Chrome for Testing via the install command.

CDP Protocol Implementation

Protocol Files

The CDP protocol definitions are stored in JSON format:

| File | Description |
| --- | --- |
| browser_protocol.json | Core browser domains (Page, Network, Runtime, etc.) |
| js_protocol.json | JavaScript debugging domains |

Sources: cli/cdp-protocol/browser_protocol.json

Auto-Generated Types

CDP types are auto-generated from protocol JSON files:

/// Auto-generated CDP types from protocol JSON files in `cdp-protocol/`.
///
/// To populate: download `browser_protocol.json` and `js_protocol.json` from
/// <https://github.com/nicolo-ribaudo/nicolo-ribaudo.github.io/> (or any
/// Chromium source) into `cli/cdp-protocol/` and rebuild.
#[allow(clippy::upper_case_acronyms)]
pub mod generated {
    include!(concat!(env!("OUT_DIR"), "/cdp_generated.rs"));
}

Sources: cli/src/native/cdp/types.rs

CDP Client Structure

The CDP client manages communication with the browser:

graph LR
    A[Command] --> B[CDP Client]
    B --> C[WebSocket Writer]
    C --> D[Browser CDP Endpoint]
    
    E[Browser Events] --> F[WebSocket Reader]
    F --> G[Event Handler]
    G --> H[State Updates]

WebSocket Communication

Stream Module Architecture

The WebSocket communication is handled by the stream module located in cli/src/native/stream/:

| Module | File | Purpose |
| --- | --- | --- |
| Stream Core | cli/src/native/stream/mod.rs | Stream trait definitions and utilities |
| WebSocket | cli/src/native/stream/websocket.rs | WebSocket connection handling |
| CDP Loop | cli/src/native/stream/cdp_loop.rs | CDP message processing loop |

WebSocket Connection

The WebSocket module establishes and maintains connections to the Chrome DevTools endpoint:

sequenceDiagram
    participant CLI as CLI Command
    participant Client as CDP Client
    participant WS as WebSocket
    participant Chrome as Chrome Browser
    
    CLI->>Client: connect(url)
    Client->>WS: establish_connection()
    WS->>Chrome: WebSocket Handshake
    Chrome-->>WS: 101 Switching Protocols
    WS-->>Client: Connected
    
    loop Message Exchange
        CLI->>Client: send_command()
        Client->>WS: write_message()
        WS->>Chrome: CDP JSON Message
        Chrome-->>WS: CDP Response/Event
        WS-->>Client: read_message()
        Client-->>CLI: Result
    end

CDP Loop Handler

The CDP loop processes incoming messages and manages the event queue:

  • Handles CDP events from the browser
  • Routes responses to pending command callbacks
  • Manages connection state and reconnection logic

Sources: cli/src/native/stream/cdp_loop.rs
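In CDP, a response carries the id of the command it answers, while an event carries a method name and no id. A loop can use that difference to route messages, as in the sketch below. The Router type, its fields, and the use of std channels are assumptions for illustration. The real cdp_loop.rs is asynchronous and parses JSON, both of which are omitted here.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// An already parsed incoming CDP message (JSON parsing is out of scope).
enum Incoming {
    Response { id: u64, result: String },
    Event { method: String },
}

/// Hypothetical router: pending commands wait on a channel keyed by id.
struct Router {
    next_id: u64,
    pending: HashMap<u64, Sender<String>>,
    events: Vec<String>,
}

impl Router {
    fn new() -> Self {
        Router { next_id: 1, pending: HashMap::new(), events: Vec::new() }
    }

    /// Reserves an id for an outgoing command and returns where its
    /// response will be delivered.
    fn register(&mut self) -> (u64, Receiver<String>) {
        let id = self.next_id;
        self.next_id += 1;
        let (tx, rx) = channel();
        self.pending.insert(id, tx);
        (id, rx)
    }

    /// Delivers a response to its waiting caller or queues an event.
    fn route(&mut self, msg: Incoming) {
        match msg {
            Incoming::Response { id, result } => {
                if let Some(tx) = self.pending.remove(&id) {
                    // The caller may have given up; ignore a closed channel.
                    let _ = tx.send(result);
                }
            }
            Incoming::Event { method } => self.events.push(method),
        }
    }
}
```

Removing the entry from the pending map on delivery keeps the map from growing over a long-lived session.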

Browser Connection

Connection Methods

The daemon supports multiple connection methods:

| Method | Command | Use Case |
| --- | --- | --- |
| Launch new browser | agent-browser open | Fresh browser instance |
| Connect to existing | agent-browser connect 9222 | Attach to running browser |

# Launch with navigation
agent-browser open <url>

# Connect to running browser on specific port
agent-browser connect 9222

# Launch without navigation (clean slate)
agent-browser open

CDP WebSocket URL

The CDP WebSocket URL can be retrieved programmatically:

agent-browser cdp_url

This returns the WebSocket debugger URL for programmatic browser attachment.

Browser Version Info

The connection retrieves browser metadata:

#[derive(Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct BrowserVersionInfo {
    #[serde(rename = "webSocketDebuggerUrl")]
    pub web_socket_debugger_url: Option<String>,
    #[serde(rename = "Browser")]
    pub browser: Option<String>,
}

Sources: cli/src/native/cdp/types.rs

CDP Protocol Domains

Supported Domains

agent-browser supports the following CDP domains:

| Domain | Purpose | Key Commands |
| --- | --- | --- |
| Page | Page navigation and loading | navigate, reload, back, forward |
| Runtime | JavaScript execution | evaluate, callFunctionOn |
| DOM | DOM manipulation | getDocument, describeNode |
| Input | User input simulation | dispatchEvent, insertText |
| Network | Network request interception | setRequestInterception, getResponseBody |
| Target | Browser target management | createTarget, attachToTarget |
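On the wire, each CDP command is a JSON object with an id, a Domain.method name, and a params object. The sketch below builds a Page.navigate command by hand to show that shape. The helper names are hypothetical, and a real client would normally serialize with a JSON library rather than string formatting.

```rust
/// Escapes a string for embedding inside a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds a CDP command envelope. `params_json` must already be valid JSON.
fn cdp_command(id: u64, method: &str, params_json: &str) -> String {
    format!(
        "{{\"id\":{},\"method\":\"{}\",\"params\":{}}}",
        id,
        escape_json(method),
        params_json
    )
}

/// Builds a Page.navigate command for the given URL.
fn navigate_command(id: u64, url: &str) -> String {
    let params = format!("{{\"url\":\"{}\"}}", escape_json(url));
    cdp_command(id, "Page.navigate", &params)
}
```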

Browser Automation Actions

The following high-level actions are available via CDP:

# Navigation
agent-browser open <url>
agent-browser back
agent-browser forward
agent-browser reload

# DOM Interaction
agent-browser click @e1
agent-browser fill @e2 "text"
agent-browser type @e3 "input"
agent-browser hover @e4
agent-browser scroll down 500

# State Queries
agent-browser snapshot
agent-browser screenshot
agent-browser get text @e1
agent-browser get attr @e1 href

# JavaScript
agent-browser evaluate "document.title"

Error Handling

WebDriver Fallback

When the WebDriver backend is in use, the daemon returns a descriptive error for any action that backend does not support:

Err(anyhow::anyhow!(
    "Action '{}' is not supported on the WebDriver backend",
    action
))

CDP Error Propagation

CDP errors are propagated through the action chain, enabling detailed error messages for debugging failed browser operations.

Performance Considerations

Session Management

  • Each browser session maintains a persistent CDP connection
  • Sessions can be named and persisted for multi-session workflows
  • State persistence allows resuming automation tasks

Network Idle Detection

The daemon supports waiting for network idle states:

agent-browser wait --load networkidle

This is essential for SPAs and applications with dynamic content loading.

Security Model

Credential Management

The daemon provides a secure credential vault for browser authentication:

agent-browser set credentials <user> <pass>

Cookies can be set from various formats:

agent-browser cookies set --curl <file> [--domain <host>]

Auto-detects JSON, cURL, and Cookie-header file formats.

Extension Points

Custom CDP Scripts

Execute arbitrary JavaScript in the browser context:

agent-browser addscript <script>
agent-browser addinitscript <script>

Custom Styles

Inject CSS for visual testing:

agent-browser addstyle <css>

Summary

The Daemon and CDP Protocol architecture enables agent-browser to provide a performant, Rust-native browser automation solution. By implementing direct CDP communication over WebSockets, the project avoids dependencies on Node.js wrappers like Playwright or Puppeteer while maintaining full compatibility with Chrome's DevTools Protocol capabilities.

The separation of concerns between the action dispatcher, CDP client, and WebSocket stream layers ensures maintainability and enables future extensions for additional browser engines and protocol features.

Sources: cli/src/native/actions.rs

Interaction Commands

Related topics: Navigation Commands, State Inspection Commands, Element References System

Interaction Commands are the core primitives that enable AI agents to programmatically control and manipulate web pages in the agent-browser system. These commands provide atomic operations for clicking elements, entering text, scrolling, and capturing page state through an accessibility-tree based reference system.

Architecture Overview

The interaction system follows a command dispatch pattern where incoming commands are routed to appropriate handlers based on their operation type. The architecture separates concerns between command parsing, execution, and output formatting.

graph TD
    A[User/Agent Input] --> B[Command Parser]
    B --> C[actions.rs Dispatcher]
    C --> D[interaction.rs Handlers]
    D --> E[CDP Protocol Layer]
    E --> F[Browser Engine]
    F --> G[Page Response]
    G --> H[output.rs Formatter]
    H --> I[Terminal/Agent]
    
    C -.->|click, fill, type, scroll| D
    C -.->|mouse, keyboard| D
    C -.->|snapshot, screenshot| D

Component Responsibilities

| Component | File | Purpose |
| --- | --- | --- |
| Command Dispatcher | actions.rs | Routes commands to handlers |
| Interaction Handlers | interaction.rs | Executes atomic browser operations |
| Output Formatter | output.rs | Formats and presents results |
| CDP Layer | Native | Chrome DevTools Protocol communication |

Element Reference System

Interaction commands use an element reference system (@e1, @e2, etc.) to identify targets on the page. These references are obtained through snapshot operations and represent unique identifiers in the accessibility tree.

graph LR
    A[Page HTML] --> B[Accessibility Tree]
    B --> C[Snapshot Command]
    C --> D[@e1 button "Submit"]
    C --> E[@e2 input "Email"]
    D --> F[Click @e1]
    E --> G[Fill @e2 "text"]

Reference Format:

@e1 [tag type="value"] "text content" placeholder="hint"
│    │   │             │               │
│    │   │             │               └─ Additional attributes
│    │   │             └─ Visible text
│    │   └─ Key attributes shown
│    └─ HTML tag name
└─ Unique ref ID

Sources: skill-data/core/references/snapshot-refs.md:1-50
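An agent that consumes snapshot text needs to pull the ref, tag, and visible text out of each line. The parser below is a minimal sketch based only on the format shown above. The struct and function names are hypothetical, and it does not handle a closing bracket inside an attribute value or escaped quotes in the text.

```rust
#[derive(Debug, PartialEq)]
struct SnapshotLine {
    reference: String,    // e.g. "@e1"
    tag: String,          // first token inside the brackets
    text: Option<String>, // quoted visible text, if present
}

/// Parses one snapshot line such as `@e1 [button type="submit"] "Continue"`.
/// Returns None for lines that do not start with a numeric @e ref.
fn parse_snapshot_line(line: &str) -> Option<SnapshotLine> {
    let line = line.trim();
    let (reference, rest) = line.split_once(' ')?;
    let digits = reference.strip_prefix("@e")?;
    if digits.is_empty() || !digits.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }
    // The bracketed part holds the tag followed by key attributes.
    let rest = rest.trim_start().strip_prefix('[')?;
    let close = rest.find(']')?;
    let tag = rest[..close].split_whitespace().next()?.to_string();
    // Visible text, when present, is the quoted string right after the bracket.
    let after = rest[close + 1..].trim_start();
    let text = after
        .strip_prefix('"')
        .and_then(|t| t.find('"').map(|end| t[..end].to_string()));
    Some(SnapshotLine { reference: reference.to_string(), tag, text })
}
```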

Core Interaction Commands

Element Selection Commands

| Command | Description | Parameters |
| --- | --- | --- |
| find | Find elements by locator | <locator> <value> [action] [text] |
| count | Count matching elements | <selector> |
| is | Check element state | <what> <selector> |

Locators supported: role, text, label, placeholder, alt, title, testid, first, last, nth

Sources: cli/src/output.rs:1-20

Mouse Commands

graph TD
    A[mouse] --> B[move <x> <y>]
    A --> C[down <btn>]
    A --> D[up <btn>]
    A --> E[wheel <dy> <dx>]
    
    B --> F[Dispatch mousemove event]
    C --> G[Dispatch mousedown event]
    D --> H[Dispatch mouseup event]
    E --> I[Dispatch wheel event]

| Command | Description |
| --- | --- |
| mouse move <x> <y> | Move cursor to coordinates |
| mouse down [btn] | Press mouse button (default: left) |
| mouse up [btn] | Release mouse button |
| mouse wheel <dy> [dx] | Scroll wheel (delta Y/X) |

Sources: cli/src/native/actions.rs:1-30

Keyboard Commands

| Command | Description | Example |
| --- | --- | --- |
| type | Type text (with key events) | type @e1 "hello" |
| press | Press special key | press Enter |
| setvalue | Set input value directly | setvalue @e1 "value" |

Special Keys: Enter, Tab, Escape, Backspace, ArrowUp, ArrowDown, ArrowLeft, ArrowRight, F1-F12, Control, Alt, Shift

Sources: cli/src/native/actions.rs:1-30

Scroll Commands

| Command | Description |
| --- | --- |
| scroll down <px> | Scroll down by pixels |
| scroll up <px> | Scroll up by pixels |
| scroll left <px> | Scroll left by pixels |
| scroll right <px> | Scroll right by pixels |

Sources: skill-data/core/SKILL.md:1-50

State Inspection Commands

graph TD
    A[get command] --> B{Property Type}
    B -->|attr| C[Get attribute value]
    B -->|value| D[Get input value]
    B -->|text| E[Get visible text]
    B -->|html| F[Get innerHTML]
    B -->|title| G[Get page title]
    B -->|url| H[Get current URL]
    B -->|box| I[Get bounding box]
    B -->|styles| J[Get computed styles]

| Command | Description |
| --- | --- |
| get text <ref> | Get visible text of element |
| get value <ref> | Get input field value |
| get attr <ref> <name> | Get specific attribute |
| get html <ref> | Get innerHTML |
| get title | Get page title |
| get url | Get current URL |
| get box <ref> | Get bounding box coordinates |
| get styles <ref> | Get computed CSS styles |
| get cdp-url | Get CDP debugging URL |

Sources: cli/src/output.rs:1-20

Click Variations

The click command supports several modifiers for different interaction patterns:

| Command | Description |
| --- | --- |
| click <ref> | Standard left-click |
| click <ref> --new-tab | Click and open in new tab |
| click <ref> --double | Double-click |
| click <ref> --right | Right-click (context menu) |
| tap <ref> | Mobile-style tap (touch events) |

Sources: skill-data/core/SKILL.md:1-50

Form Input Commands

Text Input

graph LR
    A[Input Commands] --> B[type]
    A --> C[fill]
    A --> D[setvalue]
    
    B --> E[Triggers keydown/keyup]
    C --> F[Direct value set]
    D --> G[Direct value assignment]

| Command | Description | Behavior |
| --- | --- | --- |
| fill <ref> <text> | Fill input field | Replaces existing value, triggers input events |
| type <ref> <text> | Type text character by character | Triggers full key event sequence |
| setvalue <ref> <value> | Set value directly | Bypasses sanitization |

Sources: cli/src/native/actions.rs:1-30
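The practical difference between type and fill is the number and kind of events the page observes. The model below is a deliberate simplification and an assumption about ordering: real browsers also fire events such as beforeinput, and the exact sequence agent-browser dispatches is not shown in the source.

```rust
#[derive(Debug, PartialEq)]
enum DomEvent {
    KeyDown(char),
    Input(String), // carries the field value after the change
    KeyUp(char),
}

/// Simplified model of `type`: one key press per character, with the
/// field value growing by one character each time.
fn type_events(text: &str) -> Vec<DomEvent> {
    let mut value = String::new();
    let mut events = Vec::new();
    for c in text.chars() {
        events.push(DomEvent::KeyDown(c));
        value.push(c);
        events.push(DomEvent::Input(value.clone()));
        events.push(DomEvent::KeyUp(c));
    }
    events
}

/// Simplified model of `fill`: the value is replaced in one step.
fn fill_events(text: &str) -> Vec<DomEvent> {
    vec![DomEvent::Input(text.to_string())]
}
```

Under this model, typing "ab" produces six events while filling produces one, which is why fill tends to be faster and type is the better choice when a page validates on key events.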

Other Input Types

| Command | Target | Description |
| --- | --- | --- |
| check <ref> | Checkbox | Check a checkbox |
| uncheck <ref> | Checkbox | Uncheck a checkbox |
| select <ref> <value> | Select | Select option by value |
| upload <ref> <path> | File input | Upload file |

Sources: cli/src/native/actions.rs:1-30

Wait and Timing

Wait commands control execution timing for dynamic content:

| Command | Description |
| --- | --- |
| wait <ms> | Wait for milliseconds |
| wait --load | Wait for page load event |
| wait networkidle | Wait for network to be idle |
| wait --load networkidle | Combined load + network idle |

Sources: skill-data/core/SKILL.md:1-50

Command Chaining with Batches

Multiple commands can be executed in a single batch operation for efficiency:

graph TD
    A[Batch Command] --> B[Parse JSON Array]
    B --> C[Execute Sequentially]
    C --> D[Command 1]
    D --> E[Command 2]
    E --> F[Command N]
    F --> G[Return Combined Results]

Example batch command:

agent-browser batch \
  '["open"]' \
  '["network","route","*","--abort","--resource-type","script"]' \
  '["cookies","set","--curl","cookies.curl","--domain","localhost"]' \
  '["navigate","http://localhost:3000/target"]'

Sources: skill-data/core/references/commands.md:1-30

State Management

Browser State Commands

| Command | Description |
| --- | --- |
| is <state> <ref> | Check if element is visible, enabled, checked |
| is open | Check if browser is open |
| is closed | Check if browser is closed |

Visibility and Enabled States

graph TD
    A[Check State] --> B{Element Type}
    B -->|Button/Input| C[Check: enabled]
    B -->|Checkbox| D[Check: checked]
    B -->|Any| E[Check: visible]
    
    C --> F[Return boolean]
    D --> F
    E --> F

Sources: cli/src/output.rs:1-20

Advanced Interactions

React-Specific Commands

For React applications, specialized inspection commands are available:

| Command | Description |
| --- | --- |
| react_tree | Get component tree |
| react_inspect <ref> | Inspect React component |
| react_renders_start | Start render tracking |
| react_renders_stop | Stop render tracking |

Sources: cli/src/native/actions.rs:1-30

Dialog Handling

graph TD
    A[Dialog Appears] --> B{dialog type}
    B -->|alert| C[handle_alert]
    B -->|confirm| D[handle_confirm]
    B -->|prompt| E[handle_prompt]
    
    C --> F[dialog accept --message "text"]
    D --> F
    E --> G[dialog accept "input"]
    G --> F

| Command | Description |
| --- | --- |
| dialog accept [message] | Accept dialog with optional message |
| dialog dismiss | Cancel/dismiss dialog |

Sources: cli/src/native/actions.rs:1-30

Common Workflow Patterns

Basic Navigation and Interaction

# 1. Open page
agent-browser open https://example.com

# 2. Take snapshot to get refs
agent-browser snapshot -i

# 3. Interact with elements
agent-browser click @e1
agent-browser fill @e2 "user@example.com"
agent-browser press Enter

# 4. Wait for response
agent-browser wait 1000

Form Submission Flow

agent-browser open https://example.com/login
agent-browser snapshot -i
agent-browser fill @e_email "user@example.com"
agent-browser fill @e_password "secretpassword"
agent-browser click @e_submit
agent-browser wait --load networkidle
agent-browser screenshot result.png

Error Handling Pattern

# Check if operation succeeded
agent-browser is visible @e_success_message

# If failed, inspect state
agent-browser snapshot -i
agent-browser get text @e_error_message

Command Reference Summary

Interaction Operations Matrix

| Category | Commands |
| --- | --- |
| Mouse | click, mouse move/down/up/wheel, dblclick |
| Keyboard | type, press, setvalue |
| Scroll | scroll up/down/left/right |
| Forms | fill, check, uncheck, select, upload |
| Inspect | get text/value/attr/html/title/url/box/styles |
| State | find, count, is |
| Timing | wait |

Sources: cli/src/native/actions.rs:1-30, cli/src/output.rs:1-20, skill-data/core/SKILL.md:1-50

Best Practices

  1. Always snapshot before interacting - Element refs are obtained from snapshots and must be fetched after page load or navigation
  2. Re-snapshot after navigation - New pages have new accessibility trees with different refs
  3. Use appropriate wait conditions - Wait for networkidle when content loads dynamically
  4. Prefer fill over type - fill is faster and more reliable for automated workflows
  5. Use type for form validation - When you need key events to trigger validation logic

Sources: skill-data/core/references/snapshot-refs.md:1-50

State Inspection Commands

Related topics: Interaction Commands, Element References System

State Inspection Commands in agent-browser provide mechanisms to examine, retrieve, and manage browser state including cookies, web storage, session data, console errors, and DOM element properties. These commands enable debugging, state verification, and persistence of browser sessions across operations.

Architecture Overview

State inspection in agent-browser operates through a layered architecture where the CLI command layer parses user input, the actions layer dispatches to appropriate handlers, and the browser backend (CDP/WebDriver) executes the actual state retrieval.

graph TD
    A[CLI Input] --> B[commands.rs Parser]
    B --> C[actions.rs Dispatcher]
    C --> D[State Handlers]
    C --> E[Storage Handlers]
    C --> F[Element Handlers]
    D --> G[Browser Backend<br/>Chrome CDP / WebDriver]
    E --> G
    F --> G
    G --> H[State Output]
    
    D -. includes .-> D1[cookies_get/set/clear]
    D -. includes .-> D2[state_save/load/list/clean]
    E -. includes .-> E1[storage_get/set/clear]
    F -. includes .-> F1[gettext/getattr/isvisible]

Sources: cli/src/native/actions.rs:1-150

Command Categories

State inspection commands are organized into five primary categories:

| Category | Purpose | Commands |
| --- | --- | --- |
| Cookie Inspection | Manage HTTP cookies | cookies_get, cookies_set, cookies_clear |
| Web Storage | Inspect localStorage/sessionStorage | storage_get, storage_set, storage_clear |
| Session State | Save/load browser sessions | state_save, state_load, state_list, state_clean |
| Element Properties | Query DOM element attributes | gettext, getattribute, inputvalue, isvisible, isenabled, ischecked |
| Error Inspection | Retrieve console errors | errors |

Sources: cli/src/native/actions.rs:80-100

Cookie Inspection

Cookies can be inspected and managed through the cookies command family.

Get Cookies

Retrieves all cookies for the current domain:

agent-browser cookies get

Set Cookie

Sets a cookie with explicit parameters:

agent-browser cookies set --url <url> --name <name> --value <value> [--domain <domain>] [--path <path>] [--httpOnly] [--secure] [--sameSite <strict|lax|none>] [--expires <timestamp>]

Set Cookie from File

Auto-detects and imports cookies from JSON, cURL, or Cookie-header format:

agent-browser cookies set --curl <file> [--domain <host>]
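The source states that the file format is auto-detected but does not show how. The sketch below illustrates one possible heuristic together with a parser for the simplest case, a Cookie header. The enum, the function names, and the detection rules are all assumptions, not the CLI's actual logic.

```rust
#[derive(Debug, PartialEq)]
enum CookieFileFormat {
    Json,
    Curl,
    CookieHeader,
}

/// Guesses the format from the first non-blank characters (hypothetical rules).
fn detect_format(contents: &str) -> Option<CookieFileFormat> {
    let t = contents.trim_start();
    if t.starts_with('[') || t.starts_with('{') {
        Some(CookieFileFormat::Json)
    } else if t.starts_with("curl ") {
        Some(CookieFileFormat::Curl)
    } else if t.contains('=') {
        Some(CookieFileFormat::CookieHeader)
    } else {
        None
    }
}

/// Splits "Cookie: a=1; b=2" (the prefix is optional) into name/value pairs.
fn parse_cookie_header(header: &str) -> Vec<(String, String)> {
    let body = header.trim();
    let body = match body.get(..7) {
        Some(p) if p.eq_ignore_ascii_case("cookie:") => &body[7..],
        _ => body,
    };
    body.split(';')
        .filter_map(|pair| {
            let (name, value) = pair.split_once('=')?;
            let name = name.trim();
            if name.is_empty() {
                None
            } else {
                Some((name.to_string(), value.trim().to_string()))
            }
        })
        .collect()
}
```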

Clear Cookies

Removes all cookies:

agent-browser cookies clear

Sources: cli/src/output.rs:1-50

Web Storage Inspection

Web storage commands manage the browser's localStorage and sessionStorage.

Storage Commands

| Command | Description |
| --- | --- |
| storage_get | Retrieve value from localStorage or sessionStorage |
| storage_set | Set a key-value pair in storage |
| storage_clear | Clear all items from selected storage |

# Get storage value
agent-browser storage_get <local|session> <key>

# Set storage value
agent-browser storage_set <local|session> <key> <value>

# Clear storage
agent-browser storage_clear <local|session>

Sources: cli/src/native/actions.rs:85-90

Session State Management

The agent-browser maintains persistent state in ~/.agent-browser (or <tempdir>/agent-browser when home directory cannot be resolved).

State Directory Structure

graph LR
    A[~/.agent-browser] --> B[sessions/]
    A --> C[auth/]
    A --> D[encryption.key]
    B --> E[<session-id>/]
    E --> F[state.json]
    E --> G[screenshots/]

Sources: cli/src/native/state.rs:80-95

State Commands

| Command | Description |
| --- | --- |
| state_save | Save current browser state to disk |
| state_load | Restore browser state from saved file |
| state_list | List all saved states |
| state_clean | Remove states older than specified days |
| state_rename | Rename an existing state |

# Save current state
agent-browser state_save <path> [--name <name>]

# Load saved state
agent-browser state_load <path>

# List all states
agent-browser state_list

# Clean old states (default: 30 days)
agent-browser state_clean [--days <n>]

# Rename a state
agent-browser state_rename --path <path> --name <new_name>
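The age check behind state_clean can be sketched with std::time alone. The function name and the choice to keep files whose timestamp lies in the future are assumptions. The source only states that states older than a number of days are removed.

```rust
use std::time::{Duration, SystemTime};

/// Returns true when a state last modified at `modified` is older than
/// `max_age_days` relative to `now`. A timestamp in the future is kept.
fn is_stale(modified: SystemTime, now: SystemTime, max_age_days: u64) -> bool {
    let max_age = Duration::from_secs(max_age_days * 24 * 60 * 60);
    match now.duration_since(modified) {
        Ok(age) => age > max_age,
        // `modified` is later than `now` (clock skew): do not delete.
        Err(_) => false,
    }
}
```

Passing now as a parameter instead of calling SystemTime::now() inside the function keeps the check deterministic and easy to test.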

State Directory Resolution

pub fn get_state_dir() -> PathBuf {
    if let Some(home) = dirs::home_dir() {
        home.join(".agent-browser")
    } else {
        std::env::temp_dir().join("agent-browser")
    }
}

pub fn get_sessions_dir() -> PathBuf {
    get_state_dir().join("sessions")
}

Sources: cli/src/native/state.rs:80-90

Element Property Inspection

Element inspection commands retrieve properties and states of DOM elements using element references obtained from snapshots.

Get Text Content

Retrieves the visible text of an element:

agent-browser gettext @e1

Get HTML Content

Retrieves element innerHTML or innerText:

agent-browser innerhtml @e1
agent-browser innertext @e1

Get Attributes

Retrieves any attribute value from an element:

agent-browser getattribute @e1 href
agent-browser getattribute @e1 src

Get Input Value

Retrieves the current value of input elements:

agent-browser inputvalue @e1

Check Element State

Verify element state properties:

agent-browser isvisible @e1
agent-browser isenabled @e1
agent-browser ischecked @e1

Count Matching Elements

Count elements matching a selector:

agent-browser count ".item-class"

Get Bounding Box

Retrieve element dimensions and position:

agent-browser boundingbox @e1

Get Styles

Retrieve computed CSS styles:

agent-browser styles @e1

Sources: cli/src/native/actions.rs:30-60

Find Elements

The find command locates DOM elements using various locator strategies.

Supported Locators

| Locator | Description | Example |
| --- | --- | --- |
| role | Find by ARIA role | find role button --exact |
| text | Find by text content | find text "Submit" |
| label | Find form label | find label "Email" |
| placeholder | Find by placeholder | find placeholder "Search..." |
| alt | Find by alt attribute | find alt "profile" |
| title | Find by title attribute | find title "Close" |
| testid | Find by test ID | find testid submit-btn |
| first | First element matching selector | find first ".item" |
| last | Last element matching selector | find last ".item" |

Find Command Syntax

agent-browser find <locator> <value> [action] [--exact] [--name <name>]

Examples

# Find button by role and click
agent-browser find role button --exact click

# Find input by placeholder
agent-browser find placeholder "email" fill "user@example.com"

# Find link by text
agent-browser find text "Learn more"

Sources: cli/src/commands.rs:150-200

Console Error Inspection

Retrieve JavaScript errors logged to the browser console.

Get Errors

agent-browser errors

Returns a list of all console errors captured during the session.

Console Monitoring

Enable or disable console message capture:

agent-browser console enable
agent-browser console disable

Snapshot-Based Inspection

Snapshots provide a hierarchical view of the page, built from its accessibility tree, with an element reference on each node.

Snapshot Modes

| Flag | Description |
| --- | --- |
| -i | Interactive elements only (preferred) |
| -u | Include href URLs on links |
| -c | Compact mode (no empty structural nodes) |
| -d <n> | Cap depth at n levels |
| -s <selector> | Scope to CSS selector |
| --json | Machine-readable JSON output |

Snapshot Output Format

Page: Example - Log in
URL: https://example.com/login

@e1 [heading] "Log in"
@e2 [form]
  @e3 [input type="email"] placeholder="Email"
  @e4 [input type="password"] placeholder="Password"
  @e5 [button type="submit"] "Continue"
  @e6 [link] "Forgot password?"

Snapshot Workflow

graph TD
    A[Open Page] --> B[Snapshot -i]
    B --> C[Parse Element Refs]
    C --> D[Click @e3]
    D --> E[Snapshot -i]
    E --> F[Find Input Fields]
    F --> G[Fill @e3 "email"]
    G --> H[Fill @e4 "password"]
    H --> I[Click @e5]

Sources: skill-data/core/SKILL.md:1-80

Complete Command Reference

State Inspection Summary

| Command | Category | Description |
| --- | --- | --- |
| cookies get | Cookie | List all cookies |
| cookies set --name X --value Y | Cookie | Set a cookie |
| cookies clear | Cookie | Clear all cookies |
| storage_get <type> <key> | Storage | Get storage value |
| storage_set <type> <key> <val> | Storage | Set storage value |
| storage_clear <type> | Storage | Clear storage |
| state_save <path> | Session | Save browser state |
| state_load <path> | Session | Load browser state |
| state_list | Session | List saved states |
| state_clean [--days <n>] | Session | Clean old states |
| errors | Console | Get console errors |
| gettext @eN | Element | Get element text |
| getattribute @eN <attr> | Element | Get attribute |
| isvisible @eN | Element | Check visibility |
| count <selector> | Element | Count elements |

Sources: cli/src/native/actions.rs:70-100

Usage Patterns

Inspecting Page State

# Full page inspection workflow
agent-browser open https://example.com
agent-browser snapshot -i           # Get element refs
agent-browser get title             # Page title
agent-browser get url               # Current URL
agent-browser errors                # Check for console errors

Verifying Element State

agent-browser click @e1             # Click element
agent-browser wait 500             # Wait for response
agent-browser isvisible @e2        # Verify visibility
agent-browser gettext @e3          # Get text content

Persisting Session State

agent-browser open https://app.example.com
agent-browser cookies set --name session --value abc123
agent-browser storage_set local user "john"
agent-browser state_save ./my-session   # Persist state
# Later...
agent-browser state_load ./my-session  # Restore state

Summary

The state inspection commands in agent-browser cover the following areas of browser state:

  • Cookie Management: Full CRUD operations on HTTP cookies with file import support
  • Web Storage: Access to localStorage and sessionStorage
  • Session Persistence: Save, load, list, and clean browser sessions
  • Element Inspection: Query text, attributes, states, and styles
  • Element Location: Find elements by role, text, label, placeholder, and other attributes
  • Console Monitoring: Capture and retrieve JavaScript errors

These commands work together with the snapshot system to enable precise browser automation workflows with full state observability.

Sources: cli/src/native/actions.rs:1-150

Browser Engine Integration

Related topics: Daemon and CDP Protocol, Installation Guide

Note: This page is a partial analysis based on indirect evidence. The browser engine source files (lightpanda.rs, discovery.rs, webdriver/mod.rs, safari.rs, ios.rs) were not available when it was written, so any behavior attributed to them below is unverified.

Architecture Overview

agent-browser drives the browser over the Chrome DevTools Protocol (CDP), carried on a WebSocket connection:

┌─────────────────┐     CDP/WebSocket      ┌─────────────────┐
│  agent-browser  │ ──────────────────────▶│  Chrome/Chromium│
│      CLI        │                        │    Browser      │
└─────────────────┘                        └─────────────────┘
        │
        ├── Session Management
        ├── Element Reference System (@e1, @e2, ...)
        └── Command Dispatch

Supported Browser Contexts

| Context Type | Implementation | Protocol |
| --- | --- | --- |
| Chrome/Chromium | CDP Native | WebSocket |
| Electron | CDP Native | WebSocket |
| Remote Debugging | --remote-debugging-port | CDP |
| Safari (iOS) | WebDriver | W3C WebDriver |

Session Management

Sessions are managed through port-based connections:

// From session-tree.tsx
interface Session {
  port: number;
  session: string;
  provider?: string;
  pending?: boolean;
}

Sessions can be connected via:

agent-browser connect 9222

Command Dispatch Architecture

The CLI uses a dispatch pattern for handling browser commands:

// From cli/src/native/actions.rs (partial)
match subcmd.as_str() {
    "click" => handle_click(cmd, state).await,
    "fill" => handle_fill(cmd, state).await,
    "snapshot" => handle_snapshot(cmd, state).await,
    "screenshot" => handle_screenshot(cmd, state).await,
    "get" => handle_get(cmd, state).await,
    // ... additional commands
}
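
The same dispatch pattern can be sketched as a lookup table. This is a language-neutral illustration, not the project's code: the handler bodies and return values are invented for the example, and the excerpt above does not show how the real `match` treats an unknown subcommand, so the error fallback here is an assumption.

```python
from typing import Callable, Dict, List


def handle_click(args: List[str]) -> dict:
    return {"action": "click", "target": args[0]}


def handle_fill(args: List[str]) -> dict:
    return {"action": "fill", "target": args[0], "text": args[1]}


def handle_snapshot(args: List[str]) -> dict:
    # "-i" is the flag used in the workflow examples of this manual.
    return {"action": "snapshot", "interactive": "-i" in args}


# Subcommand name -> handler, mirroring the arms of the Rust match.
HANDLERS: Dict[str, Callable[[List[str]], dict]] = {
    "click": handle_click,
    "fill": handle_fill,
    "snapshot": handle_snapshot,
}


def dispatch(subcmd: str, args: List[str]) -> dict:
    handler = HANDLERS.get(subcmd)
    if handler is None:
        # Assumed fallback: report the unknown subcommand instead of raising.
        return {"error": f"unknown command: {subcmd}"}
    return handler(args)
```

Keeping one handler per subcommand means a new command is added by writing a function and registering one entry, without touching the dispatcher itself.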

Browser Engine Providers

Based on the codebase structure, agent-browser supports multiple browser engine providers:

| Provider | File Reference | Purpose |
| --- | --- | --- |
| Lightpanda | lightpanda.rs | Lightweight browser engine |
| Safari | safari.rs | macOS/iOS Safari via WebDriver |
| iOS | ios.rs | iOS WebKit via WebDriver |
| Chrome CDP | discovery.rs | Auto-discovery of Chrome instances |

CDP Discovery Mechanism

The discovery.rs module handles automatic detection of browser instances:

  • Scans for Chrome/Chromium processes
  • Identifies remote debugging ports
  • Matches browser version compatibility
  • Establishes WebSocket connections
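
The port and version steps above can be illustrated with Chrome's own debugging endpoint. A browser started with `--remote-debugging-port` serves a small JSON document at `/json/version` that includes the browser version string and the `webSocketDebuggerUrl` to connect to. The sketch below reads that document. Whether discovery.rs works this way is an assumption, since the file was not available; `BrowserEndpoint`, `parse_version_payload`, and `probe` are hypothetical names, and process scanning is not covered.

```python
import json
import urllib.request
from dataclasses import dataclass


@dataclass
class BrowserEndpoint:
    browser: str        # e.g. "Chrome/120.0.6099.109"
    major_version: int  # e.g. 120, usable for a compatibility check
    ws_url: str         # WebSocket URL for the CDP connection


def parse_version_payload(raw: str) -> BrowserEndpoint:
    """Extract version and WebSocket URL from a /json/version response body."""
    data = json.loads(raw)
    browser = data["Browser"]
    # The version follows the slash; its first dotted component is the major version.
    major = int(browser.split("/", 1)[1].split(".", 1)[0])
    return BrowserEndpoint(browser, major, data["webSocketDebuggerUrl"])


def probe(port: int, timeout: float = 1.0) -> BrowserEndpoint:
    """Query a local debugging port; raises OSError if nothing is listening."""
    url = f"http://127.0.0.1:{port}/json/version"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_version_payload(resp.read().decode("utf-8"))


# Illustrative response body; the identifier in the URL is made up.
SAMPLE_PAYLOAD = json.dumps({
    "Browser": "Chrome/120.0.6099.109",
    "Protocol-Version": "1.3",
    "webSocketDebuggerUrl": "ws://127.0.0.1:9222/devtools/browser/example-id",
})

endpoint = parse_version_payload(SAMPLE_PAYLOAD)
```

Calling `probe(9222)` against a running browser would return the endpoint that a command such as `agent-browser connect 9222` presumably attaches to.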

WebDriver Integration

For non-Chrome browsers, WebDriver protocols are used:

# Safari WebDriver
agent-browser set driver safari

# iOS WebDriver  
agent-browser set driver ios

Session State Management

| State | Description |
| --- | --- |
| Active | Currently connected and responsive |
| Pending | Connection in progress |
| Closed | Session terminated |
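
One way to relate these states to the `Session` fields shown earlier is sketched below. The mapping is an assumption made for illustration: the `pending` flag is read as the Pending state, and a separate reachability check (not shown, passed in as `reachable`) separates Active from Closed. The project may derive the states differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    # Mirrors the TypeScript Session interface from session-tree.tsx.
    port: int
    session: str
    provider: Optional[str] = None
    pending: bool = False


def session_state(s: Session, reachable: bool) -> str:
    """Classify a session as Pending, Active, or Closed (assumed mapping)."""
    if s.pending:
        return "Pending"   # connection still being established
    return "Active" if reachable else "Closed"


states = [
    session_state(Session(9222, "default"), reachable=True),
    session_state(Session(9223, "login", pending=True), reachable=False),
    session_state(Session(9224, "old"), reachable=False),
]
```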

Command Reference for Engine Interaction

# Connect to specific port
agent-browser connect <port>

# Session operations
agent-browser session new
agent-browser session list
agent-browser session close

# Engine-specific settings
agent-browser set viewport <width> <height>
agent-browser set device <device-name>
agent-browser set geo <lat> <lng>
agent-browser set offline [on|off]

Limitations

This page cannot provide complete documentation for browser engine integration without access to the engine source files (lightpanda.rs, discovery.rs, webdriver/mod.rs, safari.rs, ios.rs).

These files are required for accurate implementation details about:

  • CDP command serialization/deserialization
  • WebDriver protocol mapping
  • Browser-specific quirks handling
  • Session lifecycle management

Source: https://github.com/vercel-labs/agent-browser / Human Manual

Authentication and Session Persistence

This page documents the authentication workflows and session persistence mechanisms in agent-browser, covering how to handle login flows, save/restore authenticated states, manage credentials securely, and persist browser sessions across runs.

Overview

agent-browser provides multiple layers of authentication and session persistence:

  1. Credential Management — Store and retrieve login credentials via an encrypted auth vault
  2. State Persistence — Save and restore full browser state (cookies, localStorage, sessionStorage)
  3. Session Management — Auto-save/restore named sessions without manual file handling
  4. Profile Persistence — Use Chrome user data directories for full browser profile persistence

These mechanisms layer on top of the core CDP (Chrome DevTools Protocol) browser automation, which is used to serialize and deserialize authentication artifacts. No Playwright or Puppeteer layer is involved.

Sources: cli/src/native/actions.rs:action_dispatch (dispatch table)

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Doramagic extracted 16 source-linked risk signals. Review them before installing or handing real data to the project.

1. Installation risk: Chrome 147.0 crashes with "trap int3" when running in docker

  • Severity: high
  • Finding: Installation risk is backed by a source signal: Chrome 147.0 crashes with "trap int3" when running in docker. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/vercel-labs/agent-browser/issues/1339

2. Installation risk: Detected: Trojan:Win32/Posilod.EB!cl

  • Severity: high
  • Finding: Installation risk is backed by a source signal: Detected: Trojan:Win32/Posilod.EB!cl. Treat it as a review item until the current version is checked.
  • User impact: First-time setup may fail or require extra isolation and rollback planning.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/vercel-labs/agent-browser/issues/1281

3. Configuration risk: snapshot -s <selector> produces duplicate elements when AX tree contains virtual nodes without backendDOMNodeId

  • Severity: high
  • Finding: Configuration risk is backed by a source signal: snapshot -s <selector> produces duplicate elements when AX tree contains virtual nodes without backendDOMNodeId. Treat it as a review item until the current version is checked.
  • User impact: Users may get misleading failures or incomplete behavior unless configuration is checked carefully.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/vercel-labs/agent-browser/issues/1338

4. Project risk: Feature Request: Chrome Extension-based Connection for Seamless Login State Reuse

  • Severity: high
  • Finding: Project risk is backed by a source signal: Feature Request: Chrome Extension-based Connection for Seamless Login State Reuse. Treat it as a review item until the current version is checked.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/vercel-labs/agent-browser/issues/1319

5. Security or permission risk: Dashboard privileged POST routes should reject cross-origin requests

  • Severity: high
  • Finding: Check this security or permission risk before relying on the project: Dashboard privileged POST routes should reject cross-origin requests
  • User impact: Developers may expose sensitive permissions or credentials: Dashboard privileged POST routes should reject cross-origin requests
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Dashboard privileged POST routes should reject cross-origin requests. Context: Source discussion did not expose a precise runtime context.
  • Evidence: failure_mode_cluster:github_issue | fmev_bc39fa851aecda51d6ae79863b570093 | https://github.com/vercel-labs/agent-browser/issues/1345 | Dashboard privileged POST routes should reject cross-origin requests

6. Security or permission risk: --auto-connect fails too quickly when Chrome asks for remote debugging permission

  • Severity: high
  • Finding: Check this security or permission risk before relying on the project: --auto-connect fails too quickly when Chrome asks for remote debugging permission
  • User impact: Developers may expose sensitive permissions or credentials: --auto-connect fails too quickly when Chrome asks for remote debugging permission
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: --auto-connect fails too quickly when Chrome asks for remote debugging permission. Context: Source discussion did not expose a precise runtime context.
  • Evidence: failure_mode_cluster:github_issue | fmev_50f6336937705c962c78ed48a466eb98 | https://github.com/vercel-labs/agent-browser/issues/1365 | --auto-connect fails too quickly when Chrome asks for remote debugging permission

7. Security or permission risk: Support XDG Base Directory paths for agent-browser state, config, and installs

  • Severity: high
  • Finding: Security or permission risk is backed by a source signal: Support XDG Base Directory paths for agent-browser state, config, and installs. Treat it as a review item until the current version is checked.
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: Source-linked evidence: https://github.com/vercel-labs/agent-browser/issues/1361

8. Installation risk: After failed close, subsequent open reports success but returns stale content from prior URL

  • Severity: medium
  • Finding: Check this installation risk before relying on the project: After failed close, subsequent open reports success but returns stale content from prior URL
  • User impact: Developers may fail before the first successful local run: After failed close, subsequent open reports success but returns stale content from prior URL
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: After failed close, subsequent open reports success but returns stale content from prior URL. Context: Observed when using node, python, linux
  • Evidence: failure_mode_cluster:github_issue | fmev_fce1ca55e45e13ba327a52473c958037 | https://github.com/vercel-labs/agent-browser/issues/1367 | After failed close, subsequent open reports success but returns stale content from prior URL

9. Installation risk: Chrome 147.0 crashes with "trap int3" when running in docker

  • Severity: medium
  • Finding: Check this installation risk before relying on the project: Chrome 147.0 crashes with "trap int3" when running in docker
  • User impact: Developers may fail before the first successful local run: Chrome 147.0 crashes with "trap int3" when running in docker
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Chrome 147.0 crashes with "trap int3" when running in docker. Context: Observed when using docker, windows, linux
  • Evidence: failure_mode_cluster:github_issue | fmev_de7dc45e4f45905d10cb44680cd26da5 | https://github.com/vercel-labs/agent-browser/issues/1339 | Chrome 147.0 crashes with "trap int3" when running in docker

10. Installation risk: Detected: Trojan:Win32/Posilod.EB!cl

  • Severity: medium
  • Finding: Check this installation risk before relying on the project: Detected: Trojan:Win32/Posilod.EB!cl
  • User impact: Developers may fail before the first successful local run: Detected: Trojan:Win32/Posilod.EB!cl
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Detected: Trojan:Win32/Posilod.EB!cl. Context: Observed when using windows
  • Evidence: failure_mode_cluster:github_issue | fmev_11d6daa01783b3f8d6cc4984b34591d9 | https://github.com/vercel-labs/agent-browser/issues/1281 | Detected: Trojan:Win32/Posilod.EB!cl

11. Installation risk: Feature: network throttle for emulating slow connections / per-URL delay

  • Severity: medium
  • Finding: Check this installation risk before relying on the project: Feature: network throttle for emulating slow connections / per-URL delay
  • User impact: Developers may fail before the first successful local run: Feature: network throttle for emulating slow connections / per-URL delay
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: Feature: network throttle for emulating slow connections / per-URL delay. Context: Observed during installation or first-run setup.
  • Evidence: failure_mode_cluster:github_issue | fmev_af068ec0790d0398008062aef7b5d1a5 | https://github.com/vercel-labs/agent-browser/issues/1372 | Feature: network throttle for emulating slow connections / per-URL delay

12. Installation risk: High LLM turn count due to frequent snapshot calls when using agent-browser skills

  • Severity: medium
  • Finding: Check this installation risk before relying on the project: High LLM turn count due to frequent snapshot calls when using agent-browser skills
  • User impact: Developers may fail before the first successful local run: High LLM turn count due to frequent snapshot calls when using agent-browser skills
  • Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: High LLM turn count due to frequent snapshot calls when using agent-browser skills. Context: Observed when using node, playwright, windows
  • Evidence: failure_mode_cluster:github_issue | fmev_1ea0ed85aeff64de383d8fa15586474d | https://github.com/vercel-labs/agent-browser/issues/1351 | High LLM turn count due to frequent snapshot calls when using agent-browser skills

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12 (count of project-level external discussion links exposed on this manual page).

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Source: Project Pack community evidence and pitfall evidence