# https://github.com/vercel-labs/agent-browser 项目说明书生成时间：2026-05-15 11:32:46 UTC ## 目录 - [Introduction to Agent Browser](#introduction) - [Installation Guide](#installation-guide) - [Element References System](#element-references) - [Architecture Overview](#architecture-overview) - [Daemon and CDP Protocol](#daemon-and-cdp) - [Navigation Commands](#navigation-commands) - [Interaction Commands](#interaction-commands) - [State Inspection Commands](#state-inspection-commands) - [Browser Engine Integration](#browser-engines) - [Authentication and Session Persistence](#authentication) ## Introduction to Agent Browser ### 相关页面相关主题：[Installation Guide](#installation-guide), [Architecture Overview](#architecture-overview)

相关源码文件

以下源码文件用于生成本页说明： - [skill-data/core/SKILL.md](https://github.com/vercel-labs/agent-browser/blob/main/skill-data/core/SKILL.md) - [skills/agent-browser/SKILL.md](https://github.com/vercel-labs/agent-browser/blob/main/skills/agent-browser/SKILL.md) - [skill-data/core/references/commands.md](https://github.com/vercel-labs/agent-browser/blob/main/skill-data/core/references/commands.md) - [skill-data/core/references/snapshot-refs.md](https://github.com/vercel-labs/agent-browser/blob/main/skill-data/core/references/snapshot-refs.md) - [cli/src/native/actions.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/native/actions.rs) - [cli/src/output.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/output.rs) - [AGENTS.md](https://github.com/vercel-labs/agent-browser/blob/main/AGENTS.md) - [examples/environments/app/page.tsx](https://github.com/vercel-labs/agent-browser/blob/main/examples/environments/app/page.tsx)

# Introduction to Agent Browser Agent Browser is a high-performance, native Rust CLI tool designed for browser automation and AI agent integration. Unlike traditional browser automation frameworks that rely on Node.js wrappers or third-party libraries, Agent Browser communicates directly with Chrome/Chromium via the Chrome DevTools Protocol (CDP), providing a lightweight and reliable solution for web interaction tasks. ## Overview Agent Browser serves as a bridge between AI agents and web browsers, enabling autonomous web navigation, interaction, and data extraction. It is compatible with a wide range of AI agent platforms including Cursor, Claude Code, Codex, Continue, and Windsurf. | Aspect | Description | |--------|-------------| | **Language** | Rust (native CLI) | | **Protocol** | Chrome DevTools Protocol (CDP) | | **Dependencies** | No Playwright or Puppeteer dependency | | **Platform** | Chrome/Chromium | | **License** | See repository LICENSE | **资料来源：** [skills/agent-browser/SKILL.md]() ## Architecture Agent Browser follows a modular architecture with distinct layers for CLI handling, native browser control, and extensible skills. ```mermaid graph TD A[User / AI Agent] --> B[CLI Layer
Rust Commands] B --> C[Native Actions Layer
CDP Dispatcher] C --> D[Chrome/Chromium
via CDP] E[Skills System] --> B E --> F[Core Skills] E --> G[Specialized Skills] G --> G1[Electron Apps] G --> G2[Slack Workspace] G --> G3[Exploratory Testing] G --> G4[Cloud Providers] H[Session Management] --> C H --> H1[Auth Vault] H --> H2[State Persistence] H --> H3[Video Recording] ``` **资料来源：** [skill-data/core/SKILL.md](), [skills/agent-browser/SKILL.md]() ## Core Concepts ### Accessibility-Tree Snapshots Agent Browser generates accessibility-tree snapshots that provide structured, human-readable representations of web pages. Each interactive element receives a unique reference ID (e.g., `@e1`, `@e2`) that can be used for subsequent interactions. Example snapshot output: ``` Page: Example - Log in URL: https://example.com/login @e1 [heading] "Log in" @e2 [form] @e3 [input type="email"] placeholder="Email" @e4 [input type="password"] placeholder="Password" @e5 [button type="submit"] "Continue" @e6 [link] "Forgot password?" ``` **资料来源：** [skill-data/core/references/snapshot-refs.md](), [skill-data/core/SKILL.md]() ### Element Reference Notation Element references follow a consistent notation pattern: ``` @e1 [tag attribute="value"] "text content" placeholder="hint" ``` | Component | Description | |-----------|-------------| | `@e1` | Unique reference ID | | `tag` | HTML tag name | | `attribute="value"` | Key attributes | | `"text content"` | Visible text | | `placeholder="hint"` | Additional attributes | **资料来源：** [skill-data/core/references/snapshot-refs.md]() ## Command Reference ### Navigation Commands | Command | Description | |---------|-------------| | `agent-browser open [url]` | Launch browser with optional navigation | | `agent-browser back` | Navigate backward | | `agent-browser forward` | Navigate forward | | `agent-browser reload` | Reload current page | | `agent-browser close` | Close browser | | `agent-browser connect ` | Connect to existing browser via CDP | **资料来源：** [skill-data/core/references/commands.md]() ### Interaction Commands | Command | Description | |---------|-------------| | `agent-browser click ` | Click an element | | `agent-browser fill ` | Type text into input | | `agent-browser select ` | Select dropdown option | | `agent-browser check ` | Check a checkbox | | `agent-browser scroll ` | Scroll page | **资料来源：** [cli/src/native/actions.rs]() ### Data Retrieval Commands | Command | Description | |---------|-------------| | `agent-browser snapshot [-i]` | Get page snapshot (interactive only with `-i`) | | `agent-browser screenshot [path]` | Capture screenshot | | `agent-browser get text ` | Get visible text | | `agent-browser get attr ` | Get attribute value | | `agent-browser get url` | Get current URL | | `agent-browser get title` | Get page title | **资料来源：** [cli/src/output.rs](), [cli/src/native/actions.rs]() ### Network Control Commands | Command | Description | |---------|-------------| | `agent-browser network route ` | Intercept network request | | `agent-browser network unroute ` | Remove interception | | `agent-browser network requests [--clear]` | View/clear network requests | | `agent-browser network har [path]` | Capture HAR file | **资料来源：** [skill-data/core/references/commands.md](), [cli/src/output.rs]() ### Cookie and Storage Management ```bash agent-browser cookies get # View all cookies agent-browser cookies set --url --name --value agent-browser cookies clear # Clear all cookies agent-browser storage local # Manage localStorage agent-browser storage session # Manage sessionStorage ``` **资料来源：** [cli/src/output.rs]() ### Browser Settings Commands | Command | Description | |---------|-------------| | `agent-browser set viewport ` | Set viewport size | | `agent-browser set device ` | Emulate device | | `agent-browser set geo ` | Set geolocation | | `agent-browser set offline on\|off` | Toggle offline mode | | `agent-browser set headers ` | Set custom headers | | `agent-browser set media dark\|light` | Set color scheme | **资料来源：** [cli/src/output.rs]() ## Sessions and State Management Agent Browser supports multiple concurrent browser sessions with state persistence. ```mermaid graph LR A[Session A] --> B[State File A] C[Session B] --> D[State File B] E[Auth Vault] --> A E[Auth Vault] --> C ``` **Key Features:** - **Named Sessions**: `--session ` flag for multiple sessions - **State Persistence**: Save and restore browser state - **Auth Vault**: Secure credential storage - **Video Recording**: Capture browser activity **资料来源：** [skill-data/core/SKILL.md](), [skills/agent-browser/SKILL.md]() ## Skills System Agent Browser uses an extensible skills system that provides specialized workflows for different environments. ### Core Skills ```bash agent-browser skills get core # Core workflows and common patterns agent-browser skills get core --full # Include full command reference ``` ### Specialized Skills | Skill | Description | Command | |-------|-------------|---------| | **Electron** | Desktop app automation | `agent-browser skills get electron` | | **Slack** | Workspace automation | `agent-browser skills get slack` | | **Dogfood** | Exploratory testing/QA | `agent-browser skills get dogfood` | | **Vercel Sandbox** | Cloud browser in microVMs | `agent-browser skills get vercel-sandbox` | | **AgentCore** | AWS Bedrock cloud browsers | `agent-browser skills get agentcore` | **资料来源：** [skills/agent-browser/SKILL.md]() ## React Developer Tools Integration Agent Browser includes built-in React DevTools support for debugging React applications: | Command | Description | |---------|-------------| | `agent-browser react_tree` | View React component tree | | `agent-browser react_inspect` | Inspect component props/state | | `agent-browser react_renders_start` | Track render counts | | `agent-browser react_renders_stop` | Stop render tracking | **资料来源：** [cli/src/native/actions.rs](), [cli/src/react/suspense.rs]() ### Suspense Boundary Analysis Agent Browser can analyze React Suspense boundaries with actionability scoring: | Blocker Kind | Weight | Actionability | |--------------|--------|---------------| | ClientHook | 7 | 90% | | RequestApi | 6 | 88% | | ServerFetch | 5 | 82% | | Cache | 4 | 74% | | Stream | 3 | 60% | | Unknown | 2 | 35% | | Framework | 1 | 18% | **资料来源：** [cli/src/react/suspense.rs]() ## Dashboard Interface Agent Browser includes a web-based dashboard for visual browser management: ```mermaid graph TD A[Dashboard] --> B[Controls Panel] A --> C[Result Panel] A --> D[Network Panel] A --> E[Extensions Panel] B --> B1[URL Input] B --> B2[Mode Selector] B --> B3[Action Controls] C --> C1[Screenshot View] C --> C2[Snapshot View] C --> C3[Step History] D --> D1[Request List] D --> D2[HAR Export] E --> E1[Extension List] E --> E2[Extension Details] ``` The dashboard is built with React and supports: - Resizable panels for flexible layouts - Theme switching (light/dark) - Mobile-responsive design - Real-time step history **资料来源：** [examples/environments/app/page.tsx](), [packages/dashboard/src/components/network-panel.tsx](), [packages/dashboard/src/components/extensions-panel.tsx]() ## Best Practices ### 1. Always Snapshot Before Interacting ```bash # CORRECT - Snapshot first to get refs agent-browser open https://example.com agent-browser snapshot -i # Get refs first agent-browser click @e1 # Use ref # WRONG - Ref doesn't exist yet agent-browser open https://example.com agent-browser click @e1 # Will fail! ``` ### 2. Re-snapshot After Navigation Element references change when the page navigates. Always take a new snapshot after clicking links or navigating to new pages. ### 3. Use Sessions for Complex Workflows ```bash agent-browser --session my-session open https://example.com agent-browser --session my-session snapshot -i # ... perform actions ... agent-browser --session my-session close ``` **资料来源：** [skill-data/core/references/snapshot-refs.md]() ## Installation and Setup ### Prerequisites - Chrome or Chromium browser installed - Operating system: macOS, Linux, or Windows ### Installation Refer to the repository's installation instructions for your platform. Agent Browser is distributed as a native binary with no runtime dependencies. ### Configuration Files | File | Purpose | |------|---------| | `~/.agent-browser/` | Default config directory | | Sessions | Stored in config directory | | Auth Vault | Encrypted credential storage | **资料来源：** [AGENTS.md]() ## Summary Agent Browser provides a powerful, efficient, and AI-agent-friendly approach to browser automation. Its key differentiators include: - **Native Rust implementation** for high performance - **Direct CDP communication** without third-party dependencies - **Accessibility-tree snapshots** for reliable element targeting - **Session management** for complex multi-step workflows - **Extensible skills system** for specialized environments - **Built-in React DevTools** integration for debugging These features make Agent Browser an ideal choice for AI agents, automated testing pipelines, and developer workflows requiring precise browser control. --- ## Installation Guide ### 相关页面相关主题：[Introduction to Agent Browser](#introduction)

相关源码文件

以下源码文件用于生成本页说明： - [README.md](https://github.com/vercel-labs/agent-browser/blob/main/README.md) - [AGENTS.md](https://github.com/vercel-labs/agent-browser/blob/main/AGENTS.md) - [skills/agent-browser/SKILL.md](https://github.com/vercel-labs/agent-browser/blob/main/skills/agent-browser/SKILL.md) - [cli/src/output.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/output.rs) - [cli/src/flags.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/flags.rs)

# Installation Guide ## Overview The agent-browser project is a native Rust CLI tool designed for browser automation, providing AI agents with reliable web interaction capabilities. Unlike traditional browser automation tools that rely on Node.js wrappers, agent-browser delivers a fast, lightweight solution built directly in Rust with Chrome/Chromium support via Chrome DevTools Protocol (CDP). The installation process handles downloading the necessary Chrome browser binaries, setting up platform-specific binaries, and configuring dependencies for the dashboard UI. 资料来源：[AGENTS.md](https://github.com/vercel-labs/agent-browser/blob/main/AGENTS.md) ## Prerequisites ### System Requirements Before installing agent-browser, ensure your system meets the following requirements: | Requirement | Details | |-------------|---------| | **Operating System** | macOS, Linux, or Windows (7 platform binaries built) | | **Chrome/Chromium** | Required for browser automation functionality | | **Rust Toolchain** | Required for building from source | | **Node.js/pnpm** | Required for dashboard development | The project builds all 7 platform binaries during CI/CD, including native binaries for different architectures. Chrome is downloaded directly from Chrome for Testing during the installation process, eliminating the need for system-installed Chrome browsers. 资料来源：[AGENTS.md](https://github.com/vercel-labs/agent-browser/blob/main/AGENTS.md) ### Required Dependencies | Dependency | Purpose | Installation Method | |------------|---------|---------------------| | Chrome/Chromium | Browser automation target | Auto-downloaded via `install` command | | Cargo/Rust | Building CLI from source | [rustup.rs](https://rustup.rs) | | pnpm | Dashboard package management | `npm install -g pnpm` | ## Installation Methods ### Method 1: npm Package Installation (Recommended) The recommended installation method uses the npm registry for cross-platform compatibility: ```bash npm install -g @agent-browser/cli ``` After installation, you must run the setup command to download Chrome binaries: ```bash agent-browser install ``` 资料来源：[skills/agent-browser/SKILL.md](https://github.com/vercel-labs/agent-browser/blob/main/skills/agent-browser/SKILL.md) ### Method 2: Building from Source For development or customization, build the CLI from source: ```bash # Clone the repository git clone https://github.com/vercel-labs/agent-browser.git cd agent-browser # Install dependencies and build cd cli && cargo build --release ``` The Rust codebase architecture follows a modular structure: ```graph TD A[cli/src/native/] --> B[daemon/] A --> C[actions/] A --> D[browser/] A --> E[CDP client/] A --> F[snapshot/] A --> G[state/] ``` The `--engine` flag allows selecting between Chrome and Lightpanda browser engines, providing flexibility in automation scenarios. 资料来源：[AGENTS.md](https://github.com/vercel-labs/agent-browser/blob/main/AGENTS.md) ### Method 3: Docker Installation For containerized environments, Docker builds are supported: ```bash # Build from the project's Dockerfile docker build -t agent-browser -f docker/Dockerfile.build . ``` Docker installation is particularly useful for CI/CD pipelines and reproducible automation environments where system dependencies need to be isolated. ## Post-Installation Setup ### Chrome Binary Download After installing the CLI package, you must download the Chrome binary: ```bash agent-browser install ``` This command retrieves Chrome directly from Chrome for Testing, ensuring a compatible and up-to-date browser binary is available for all automation tasks. The `--download-path` flag can specify a custom location: ```bash agent-browser --download-path /custom/path install ``` 资料来源：[cli/src/flags.rs:45-49](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/flags.rs) ### Verifying Installation Verify the installation by checking the version and available commands: ```bash agent-browser --version agent-browser --help ``` The CLI provides comprehensive command documentation through the help system: | Command | Description | |---------|-------------| | `agent-browser open ` | Open a URL in the browser | | `agent-browser snapshot` | Capture accessibility tree with element refs | | `agent-browser click @` | Click element by reference | | `agent-browser skills get ` | Retrieve skill documentation | | `agent-browser install` | Download Chrome binaries | 资料来源：[cli/src/output.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/output.rs) ## Skill Documentation Loading Agent-browser uses a skill-based documentation system that loads content dynamically based on the installed version: ```bash # Load core workflows and common patterns agent-browser skills get core # Include full command reference and templates agent-browser skills get core --full # List all available skills agent-browser skills list ``` Available specialized skills: | Skill | Purpose | |-------|---------| | `electron` | Electron desktop apps (VS Code, Slack, Discord, Figma) | | `slack` | Slack workspace automation | | `dogfood` | Exploratory testing and QA | | `vercel-sandbox` | Agent-browser inside Vercel Sandbox microVMs | | `agentcore` | AWS Bedrock AgentCore cloud browsers | 资料来源：[skills/agent-browser/SKILL.md](https://github.com/vercel-labs/agent-browser/blob/main/skills/agent-browser/SKILL.md) ## Platform-Specific Considerations ### macOS On macOS, if you encounter security prompts about unsigned applications, you may need to allow the application in System Preferences > Security & Privacy, or run: ```bash xattr -d com.apple.quarantine /path/to/agent-browser ``` ### Linux Linux distributions require WebKit/GTK dependencies for Chrome. Install via your package manager: ```bash # Debian/Ubuntu sudo apt-get install libgtk-3-0 libnss3 # Fedora sudo dnf install gtk3 nss ``` ### Windows Windows installations automatically configure the required runtime dependencies. Ensure Windows Subsystem for Linux (WSL) compatibility if running in hybrid environments. ## Running Tests After installation, verify the setup by running the test suite: ```bash # Unit tests (fast, no Chrome required) cd cli && cargo test # End-to-end tests (requires Chrome installed) cd cli && cargo test e2e -- --ignored --test-threads=1 ``` The project contains approximately 320 unit tests and 18 e2e tests. E2E tests launch real headless Chrome instances and must run serially to avoid instance contention. 资料来源：[AGENTS.md](https://github.com/vercel-labs/agent-browser/blob/main/AGENTS.md) ## Troubleshooting ### Chrome Download Failures If the `install` command fails to download Chrome: 1. Check network connectivity to `Chrome for Testing` 2. Verify write permissions to the download directory 3. Use `--download-path` to specify an alternative location with proper permissions ### Permission Denied Errors Ensure the agent-browser binary has execute permissions: ```bash chmod +x /path/to/agent-browser ``` ### Engine Selection If Chrome automation fails, try specifying the engine explicitly: ```bash agent-browser --engine chrome open https://example.com ``` The `--engine` flag supports Chrome (default) and Lightpanda engines for different automation scenarios. ## Next Steps After successful installation: 1. Load core skill documentation: `agent-browser skills get core --full` 2. Open a test URL: `agent-browser open https://example.com` 3. Capture a snapshot: `agent-browser snapshot -i` 4. Explore specialized skills for your use case 资料来源：[skills/agent-browser/SKILL.md](https://github.com/vercel-labs/agent-browser/blob/main/skills/agent-browser/SKILL.md) --- ## Element References System ### 相关页面相关主题：[State Inspection Commands](#state-inspection-commands), [Interaction Commands](#interaction-commands)

Relevant Source Files

以下源码文件用于生成本页说明： - [cli/src/output.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/output.rs) - [cli/src/native/actions.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/native/actions.rs) - [cli/src/commands.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/commands.rs) - [skill-data/core/references/snapshot-refs.md](https://github.com/vercel-labs/agent-browser/blob/main/skill-data/core/references/snapshot-refs.md) - [skill-data/core/SKILL.md](https://github.com/vercel-labs/agent-browser/blob/main/skill-data/core/SKILL.md)

# Element References System The Element References System is a core mechanism in agent-browser that provides stable, human-readable identifiers for DOM elements during browser automation tasks. Instead of relying on fragile CSS selectors or XPath expressions, the system assigns unique reference IDs (such as `@e1`, `@e2`) that persist across page states and can be used reliably in subsequent automation commands. ## Overview Element references serve as the primary interface between automation scripts and the browser's accessibility tree. When a snapshot is taken, each interactive element receives a reference ID that can be used in commands like `click`, `fill`, `type`, and `get` without requiring re-selection. ```mermaid graph TD A[Browser Page] --> B[snapshot Command] B --> C[Accessibility Tree Traversal] C --> D[Element Identification] D --> E[Reference Assignment] E --> F[@e1 @e2 @e3 ...] F --> G[Automation Commands] G --> H[click @e1] G --> I[fill @e2] G --> J[get text @e3] ``` ## Reference Notation Format Element references follow a standardized notation format that encodes element metadata: ``` @e1 [tag type="value"] "text content" placeholder="hint" │ │ │ │ │ │ │ │ │ └─ Additional attributes │ │ │ └─ Visible text │ │ └─ Key attributes shown │ └─ HTML tag name └─ Unique ref ID ``` 资料来源：[skill-data/core/references/snapshot-refs.md]() ### Reference Components | Component | Description | Example | |-----------|-------------|---------| | `@eN` | Unique reference identifier | `@e1`, `@e42` | | Tag | HTML element type | `button`, `input`, `link` | | Type attribute | Element type classification | `type="email"`, `type="password"` | | Text content | Visible text on element | `"Submit"`, `"Log in"` | | Placeholder | Input placeholder text | `placeholder="Email"` | ## Common Reference Patterns The snapshot system recognizes common element patterns and standardizes their reference notation: ```bash @e1 [button] "Submit" # Button with text @e2 [input type="email"] # Email input @e3 [input type="password"] # Password input @e4 [a href="/page"] "Link Text" # Anchor link @e5 [select] # Dropdown @e6 [textarea] placeholder="Message" # Text area @e7 [div class="modal"] # Container element @e8 [img alt="Logo"] # Image with alt text @e9 [checkbox] checked # Checked checkbox @e10 [radio] selected # Selected radio button ``` 资料来源：[skill-data/core/references/snapshot-refs.md]() ## Snapshot Command Options The `snapshot` command generates element references with various filtering and formatting options: ```bash agent-browser snapshot # Full tree (verbose) agent-browser snapshot -i # Interactive elements only (preferred) agent-browser snapshot -i -u # Include href URLs on links agent-browser snapshot -i -c # Compact mode (no empty structural nodes) agent-browser snapshot -i -d 3 # Cap depth at 3 levels agent-browser snapshot -s "#main" # Scope to a CSS selector agent-browser snapshot -i --json # Machine-readable output ``` 资料来源：[skill-data/core/SKILL.md]() ### Option Reference | Option | Purpose | Use Case | |--------|---------|----------| | `-i` | Interactive elements only | Preferred for automation | | `-u` | Include href URLs | When link destinations matter | | `-c` | Compact output | Complex pages with many empty nodes | | `-d N` | Depth limit | Focus on specific page sections | | `-s SELECTOR` | CSS scope | Target specific page regions | | `--json` | JSON format | Programmatic processing | ## Element Reference Commands Element references are used with various commands to interact with page elements: ### Direct Element Commands ```bash agent-browser click @e1 # Click element agent-browser click @e1 --new-tab # Click and open in new tab agent-browser fill @e2 "text" # Fill input field agent-browser type @e2 "text" # Type character by character agent-browser press Enter # Press key on focused element ``` ### State Inspection Commands ```bash agent-browser get text @e1 # Get visible text agent-browser get html @e1 # Get innerHTML agent-browser get attr @e1 href # Get specific attribute agent-browser get value @e1 # Get input value agent-browser get title # Get page title agent-browser get url # Get current URL agent-browser get count ".item" # Count matching elements ``` ### State Checking Commands The `is` command verifies element states: ```bash agent-browser is visible @e1 agent-browser is enabled @e1 agent-browser is checked @e1 ``` 资料来源：[cli/src/output.rs]() ## Find Command and Locators The `find` command provides an alternative to snapshot-based reference acquisition by locating elements using various criteria: ```bash agent-browser find [text] ``` ### Supported Locators | Locator | Description | Example | |---------|-------------|---------| | `role` | ARIA role selector | `find role button click` | | `text` | Text content match | `find text "Submit" click` | | `label` | Label text association | `find label "Email" fill` | | `placeholder` | Placeholder attribute | `find placeholder "Search"` | | `alt` | Alt text (images) | `find alt "Logo" click` | | `title` | Title attribute | `find title "Help" click` | | `testid` | Test identifier | `find testid "submit-btn" click` | | `first` | First matching selector | `find first button click` | | `last` | Last matching selector | `find last link click` | | `nth` | Nth matching element | `find nth 5 button click` | 资料来源：[cli/src/commands.rs]() ### Find Command Options | Option | Purpose | |--------|---------| | `--exact` | Perform exact string matching | | `--name ` | Filter by accessible name (role locator) | ## Action Dispatch System Element reference commands are dispatched to handlers through the action routing system: ```mermaid graph LR A[Command Input] --> B["dispatch(\"click\", state)"] B --> C{Match Action} C -->|click| D[handle_click] C -->|fill| E[handle_fill] C -->|get| F[handle_get] C -->|is| G[handle_is] C -->|find| H[handle_find] ``` The action router maps command strings to their respective handlers in the native daemon: ```rust "click" => handle_dispatch(cmd, state).await, "fill" => handle_dispatch(cmd, state).await, "get" => handle_dispatch(cmd, state).await, "is" => handle_dispatch(cmd, state).await, "find" => handle_dispatch(cmd, state).await, ``` 资料来源：[cli/src/native/actions.rs]() ### Available Element Actions | Action | Handler | Purpose | |--------|---------|---------| | `click` | `handle_dispatch` | Mouse click | | `fill` | `handle_dispatch` | Fill input with text | | `type` | `handle_dispatch` | Character-by-character typing | | `press` | `handle_dispatch` | Keyboard press | | `hover` | `handle_dispatch` | Mouse hover | | `select` | `handle_dispatch` | Select dropdown option | | `check` | `handle_dispatch` | Check checkbox/radio | | `uncheck` | `handle_dispatch` | Uncheck checkbox | | `focus` | `handle_dispatch` | Focus element | | `blur` | `handle_dispatch` | Blur element | ## Iframe Support Element references automatically handle iframe content. When a snapshot is taken, iframe elements are resolved and their child accessibility trees are included inline: ```bash agent-browser snapshot -i # Output: # @e1 [heading] "Checkout" # @e2 [Iframe] "payment-frame" # @e3 [input] "Card number" # @e4 [input] "Expiry" # @e5 [button] "Pay" # @e6 [button] "Cancel" ``` References to elements inside iframes carry frame context, allowing direct interactions without manual frame switching: ```bash agent-browser click @e3 # Works inside iframe agent-browser fill @e4 "12/25" ``` 资料来源：[skill-data/core/references/snapshot-refs.md]() ## Best Practices ### Always Snapshot Before Interacting ```bash # CORRECT agent-browser open https://example.com agent-browser snapshot -i # Get refs first agent-browser click @e1 # Use ref # WRONG agent-browser open https://example.com agent-browser click @e1 # Ref doesn't exist yet! ``` ### Re-Snapshot After Navigation ```bash agent-browser click @e5 # Navigates to new page agent-browser snapshot -i # Get new refs agent-browser click @e1 # Use new refs ``` ### Re-Snapshot After Dynamic Changes ```bash agent-browser click @e1 # Opens dropdown agent-browser snapshot -i # See dropdown items agent-browser click @e7 # Select item ``` ### Snapshot Specific Regions For complex pages, snapshot specific areas to reduce noise: ```bash # Snapshot just a form agent-browser snapshot @e9 ``` ## Session-Dependent References Element references are session-dependent and may vary between browser sessions. The same element on the same page might receive different reference IDs in different sessions: | Element | Typical Ref Range | How to Find | |---------|------------------|-------------| | Home tab | e10-e20 | `snapshot -i \| grep "Home"` | | DMs tab | e10-e20 | `snapshot -i \| grep "DMs"` | | Activity tab | e10-e20 | `snapshot -i \| grep "Activity"` | | Search | e5-e10 | `snapshot -i \| grep "Search"` | | More unreads | e20-e30 | `snapshot -i \| grep "More unreads"` | | Channel refs | e30+ | `snapshot -i \| grep "channel-name"` | 资料来源：[skill-data/slack/references/slack-tasks.md]() ## Architecture Summary ```mermaid graph TD subgraph "CLI Layer" A[User Command] --> B[commands.rs Parser] B --> C[Command Dispatch] end subgraph "Native Daemon" C --> D[actions.rs Router] D --> E[State Manager] E --> F[CDP Client] end subgraph "Browser Layer" F --> G[Chrome DevTools Protocol] G --> H[Accessibility Tree] end subgraph "Reference Generation" H --> I[Element ID Assignment] I --> J[@eN Reference Labels] end J --> K[Snapshot Output] K --> L[Automation Commands] ``` The Element References System provides the foundation for reliable browser automation by abstracting DOM complexity behind human-readable identifiers that remain stable across page states and navigation events. --- ## Architecture Overview ### 相关页面相关主题：[Daemon and CDP Protocol](#daemon-and-cdp), [Introduction to Agent Browser](#introduction)

相关源码文件

以下源码文件用于生成本页说明： - [cli/src/native/mod.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/native/mod.rs) - [cli/src/native/actions.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/native/actions.rs) - [cli/src/native/stream/websocket.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/native/stream/websocket.rs) - [cli/src/native/react/suspense.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/native/react/suspense.rs) - [cli/src/output.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/output.rs) - [cli/src/connection.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/connection.rs)

# Architecture Overview agent-browser is a Rust-based browser automation framework that provides high-performance browser control through native CDP (Chrome DevTools Protocol) communication. The system is designed for AI agent integration, enabling reliable and observable browser automation. ## System Architecture The architecture follows a layered approach with clear separation between the CLI interface, daemon process, and browser engine. ```mermaid graph TB subgraph "Client Layer" CLI[CLI Interface] Dashboard[Web Dashboard] end subgraph "Daemon Layer" WS[WebSocket Server] Dispatcher[Action Dispatcher] State[State Manager] end subgraph "CDP Layer" CDP[CDP Client] Protocol[Protocol Handler] end subgraph "Browser Engine" Chrome[Chrome/Chromium] Lightpanda[Lightpanda] end CLI --> WS Dashboard --> WS WS --> Dispatcher Dispatcher --> CDP CDP --> Chrome CDP --> Lightpanda Dispatcher --> State ``` ## Core Components ### Daemon Architecture The browser automation daemon is the central coordinator that manages browser sessions and handles command dispatching. It runs as a persistent process that maintains browser state across multiple operations. **Key Responsibilities:** | Component | Responsibility | |-----------|----------------| | WebSocket Server | Accepts client connections with origin validation | | Action Dispatcher | Routes commands to appropriate handlers | | State Manager | Maintains session state and snapshots | | CDP Client | Manages protocol-level communication | 资料来源：[cli/src/native/mod.rs]() ### Action Dispatch System The action system provides a comprehensive set of browser automation commands. Actions are dispatched based on command type and handle specific browser operations. **Action Categories:** | Category | Commands | |----------|----------| | Navigation | `goto`, `back`, `forward`, `reload`, `waitforurl`, `waitforloadstate` | | Interaction | `click`, `fill`, `press`, `select`, `check`, `uncheck`, `multiselect` | | Content | `snapshot`, `innertext`, `innerhtml`, `gettext`, `getattribute` | | State | `cookies_get`, `cookies_set`, `storage_get`, `storage_set` | | Network | `route`, `unroute`, `requests`, `har` | | React Debug | `react_tree`, `react_inspect`, `react_renders_start` | 资料来源：[cli/src/native/actions.rs:1-50]() ### CDP Client Layer The CDP (Chrome DevTools Protocol) client handles low-level communication with the browser engine. This abstraction allows the system to work with different browser engines through a unified interface. **Supported Engines:** | Engine | Selection Flag | |--------|----------------| | Chrome/Chromium | `--engine chrome` (default) | | Lightpanda | `--engine lightpanda` | 资料来源：[cli/src/native/mod.rs]() ## Communication Protocol ### WebSocket Server The daemon exposes a WebSocket server for client communication. Security is enforced through origin validation. ```mermaid graph LR Client[Client App] -->|WebSocket| OriginCheck[Origin Check] OriginCheck -->|Allowed| Accept[Accept Connection] OriginCheck -->|Blocked| Reject[403 Forbidden] ``` **Origin Validation:** The server validates the `Origin` header on incoming WebSocket requests. Connections from disallowed origins receive a `403 Forbidden` response before any data exchange occurs. ```rust if !is_allowed_origin(origin.as_deref()) { return Err(reject); // Status: FORBIDDEN } ``` 资料来源：[cli/src/native/stream/websocket.rs:15-30]() ### Request/Response Flow All commands follow a request-response pattern: 1. Client sends JSON command via WebSocket 2. Server validates origin 3. Dispatcher routes to appropriate handler 4. Handler executes CDP operation 5. Result returned as JSON response ## State Management ### Session State The daemon maintains persistent state for each browser session: | State Component | Description | |-----------------|-------------| | Tabs | Active tab list and current tab reference | | Frame | Current frame hierarchy | | Viewport | Window dimensions | | Recording | Video recording status | 资料来源：[cli/src/native/stream/websocket.rs:5-15]() ### Snapshot System The snapshot system provides accessibility-tree based page representation with stable element references (`@e1`, `@e2`, etc.) for reliable element selection across page mutations. **Best Practice:** Always snapshot before interacting with elements, as refs change after navigation or dynamic content changes. 资料来源：[skill-data/core/references/snapshot-refs.md]() ## React Inspection System For React-based applications, the daemon provides specialized inspection capabilities: ### Blocker Detection The system identifies React Suspense boundaries and classifies them by impact: | Blocker Kind | Weight | Actionability | |--------------|--------|---------------| | ClientHook | 7 | 90 | | RequestApi | 6 | 88 | | ServerFetch | 5 | 82 | | Cache | 4 | 74 | | Stream | 3 | 60 | | Unknown | 2 | 35 | | Framework | 1 | 18 | ### Boundary Classification | Boundary Kind | Description | |---------------|-------------| | RouteSegment | Next.js/App Router segment boundary | | ExplicitSuspense | User-declared `` component | | Component | Implicit boundary from component structure | 资料来源：[cli/src/native/react/suspense.rs:30-60]() ## CLI Architecture The CLI provides both interactive and scripted access to browser automation: ### Command Structure ``` agent-browser [args] ``` **Primary Command Groups:** | Group | Purpose | |-------|---------| | `agent-browser open` | Navigate to URL | | `agent-browser ` | Execute automation action | | `agent-browser set` | Configure browser settings | | `agent-browser network` | Manage network interception | | `agent-browser state` | Save/load/restore sessions | | `agent-browser tab` | Manage browser tabs | | `agent-browser screenshot` | Capture page images | | `agent-browser install` | Download Chrome | 资料来源：[cli/src/output.rs]() ## Dashboard Architecture The web-based dashboard provides visual monitoring and control: ```mermaid graph TD Dashboard[Dashboard App] -->|API| Daemon Dashboard -->|Display| Results[screenshots/snapshots] Dashboard -->|Controls| Form[Control Form] ``` **Dashboard Features:** - Resizable split view (controls + results) - Responsive layout for mobile/desktop - Real-time screenshot display with base64 encoding - Snapshot viewer with step history - Step-by-step playback of automation sequences 资料来源：[packages/dashboard/src/components/extensions-panel.tsx]() ## Installation and Dependencies ### Chrome Installation The `install` command downloads Chrome directly from Chrome for Testing: ```bash agent-browser install ``` This ensures the Chrome binary is available for CDP communication without requiring system-wide Chrome installation. ## Testing Architecture ### Unit Tests Fast tests (~320) that verify individual components without Chrome dependency: ```bash cd cli && cargo test ``` ### End-to-End Tests Integration tests that launch real headless Chrome: ```bash cd cli && cargo test e2e -- --ignored --test-threads=1 ``` **Requirements:** - Chrome must be installed - Tests run serially to avoid browser instance contention ## Security Considerations | Aspect | Implementation | |--------|----------------| | Origin Validation | WebSocket connections validated before acceptance | | Session Isolation | Each session maintains separate state | | Credential Storage | Authentication vault for secure credential handling | ## Summary agent-browser implements a clean three-tier architecture: 1. **Client Layer** - CLI and dashboard provide user interfaces 2. **Daemon Layer** - Rust-based server handles command dispatch and state 3. **CDP Layer** - Browser-agnostic protocol client enables Chrome/Lightpanda support The design prioritizes reliability (stable element refs), observability (snapshots, screenshots, video recording), and extensibility (skill-based system for specialized automation tasks). --- ## Daemon and CDP Protocol ### 相关页面相关主题：[Architecture Overview](#architecture-overview), [Browser Engine Integration](#browser-engines)

相关源码文件

以下源码文件用于生成本页说明： - [cli/src/native/cdp/client.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/native/cdp/client.rs) - [cli/src/native/cdp/chrome.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/native/cdp/chrome.rs) - [cli/src/native/stream/mod.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/native/stream/mod.rs) - [cli/src/native/stream/websocket.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/native/stream/websocket.rs) - [cli/src/native/stream/cdp_loop.rs](https://github.com/vercel-labs/agent-browser/blob/main/cli/src/native/stream/cdp_loop.rs) - [cli/cdp-protocol/browser_protocol.json](https://github.com/vercel-labs/agent-browser/blob/main/cli/cdp-protocol/browser_protocol.json) - [cli/cdp-protocol/js_protocol.json](https://github.com/vercel-labs/agent-browser/blob/main/cli/cdp-protocol/js_protocol.json)

# Daemon and CDP Protocol ## Overview The agent-browser project implements a native Rust-based browser automation daemon that communicates with Chrome/Chromium browsers via the Chrome DevTools Protocol (CDP). The architecture separates the automation logic from browser control through WebSocket-based CDP connections, enabling AI agents to interact with web pages through a CLI interface. **Architecture Layer Diagram:** ```mermaid graph TD A[CLI Interface] --> B[Action Dispatcher] B --> C[CDP Client] C --> D[WebSocket Stream] D --> E[CDP Loop Handler] E --> F[Chrome Browser Instance] G[CDP Protocol Files] --> F H[Generated CDP Types] --> C ``` ## Daemon Architecture ### Native Daemon Components The daemon lives in `cli/src/native/` and handles all browser automation tasks. The main components include: | Component | Location | Purpose | |-----------|----------|---------| | Daemon | `cli/src/native/daemon/` | Process management and state coordination | | Actions | `cli/src/native/actions.rs` | Command handlers for browser operations | | Browser | `cli/src/native/browser/` | Browser instance lifecycle | | CDP Client | `cli/src/native/cdp/client.rs` | Protocol communication | | CDP Loop | `cli/src/native/stream/cdp_loop.rs` | Message processing loop | 资料来源：[cli/src/native/actions.rs](cli/src/native/actions.rs) ### Action Dispatch The action handler maps command names to their implementation functions. Supported actions include: ```rust let result = match action { "launch" => handle_launch(cmd, state).await, "navigate" => handle_navigate(cmd, state).await, "url" => handle_url(state).await, "cdp_url" => handle_cdp_url(state), "inspect" => handle_inspect(state).await, "title" => handle_title(state).await, "content" => handle_content(state).await, "evaluate" => handle_evaluate(cmd, state).await, "close" => handle_close(state).await, "snapshot" => handle_snapshot(cmd, state).await, "screenshot" => handle_screenshot(cmd, state).await, "click" => handle_click(cmd, state).await, "dblclick" => handle_dblclick(cmd, state).await, "fill" => handle_fill(cmd, state).await, "type" => handle_type(cmd, state).await, "press" => handle_press(cmd, state).await, "hover" => handle_hover(cmd, state).await, "scroll" => handle_scroll(cmd, state).await, // ... additional actions }; ``` 资料来源：[cli/src/native/actions.rs:50-75](cli/src/native/actions.rs) ### Browser Engine Selection The `--engine` flag selects between Chrome and Lightpanda browsers. Chrome is downloaded from Chrome for Testing via the `install` command. ## CDP Protocol Implementation ### Protocol Files The CDP protocol definitions are stored in JSON format: | File | Description | |------|-------------| | `browser_protocol.json` | Core browser domains (Page, Network, Runtime, etc.) | | `js_protocol.json` | JavaScript debugging domains | 资料来源：[cli/cdp-protocol/browser_protocol.json](cli/cdp-protocol/browser_protocol.json) ### Auto-Generated Types CDP types are auto-generated from protocol JSON files: ```rust /// Auto-generated CDP types from protocol JSON files in `cdp-protocol/`. /// /// To populate: download `browser_protocol.json` and `js_protocol.json` from /// (or any /// Chromium source) into `cli/cdp-protocol/` and rebuild. #[allow(clippy::upper_case_acronyms)] pub mod generated { include!(concat!(env!("OUT_DIR"), "/cdp_generated.rs")); } ``` 资料来源：[cli/src/native/cdp/types.rs](cli/src/native/cdp/types.rs) ### CDP Client Structure The CDP client manages communication with the browser: ```mermaid graph LR A[Command] --> B[CDP Client] B --> C[WebSocket Writer] C --> D[Browser CDP Endpoint] E[Browser Events] --> F[WebSocket Reader] F --> G[Event Handler] G --> H[State Updates] ``` ## WebSocket Communication ### Stream Module Architecture The WebSocket communication is handled by the stream module located in `cli/src/native/stream/`: | Module | File | Purpose | |--------|------|---------| | Stream Core | `cli/src/native/stream/mod.rs` | Stream trait definitions and utilities | | WebSocket | `cli/src/native/stream/websocket.rs` | WebSocket connection handling | | CDP Loop | `cli/src/native/stream/cdp_loop.rs` | CDP message processing loop | ### WebSocket Connection The WebSocket module establishes and maintains connections to the Chrome DevTools endpoint: ```mermaid sequenceDiagram participant CLI as CLI Command participant Client as CDP Client participant WS as WebSocket participant Chrome as Chrome Browser CLI->>Client: connect(url) Client->>WS: establish_connection() WS->>Chrome: WebSocket Handshake Chrome-->>WS: 101 Switching Protocols WS-->>Client: Connected loop Message Exchange CLI->>Client: send_command() Client->>WS: write_message() WS->>Chrome: CDP JSON Message Chrome-->>WS: CDP Response/Event WS-->>Client: read_message() Client-->>CLI: Result end ``` ### CDP Loop Handler The CDP loop processes incoming messages and manages the event queue: - Handles CDP events from the browser - Routes responses to pending command callbacks - Manages connection state and reconnection logic 资料来源：[cli/src/native/stream/cdp_loop.rs](cli/src/native/stream/cdp_loop.rs) ## Browser Connection ### Connection Methods The daemon supports multiple connection methods: | Method | Command | Use Case | |--------|---------|----------| | Launch new browser | `agent-browser open` | Fresh browser instance | | Connect to existing | `agent-browser connect 9222` | Attach to running browser | ```bash # Launch with navigation agent-browser open # Connect to running browser on specific port agent-browser connect 9222 # Launch without navigation (clean slate) agent-browser open ``` ### CDP WebSocket URL The CDP WebSocket URL can be retrieved programmatically: ```bash agent-browser cdp_url ``` This returns the WebSocket debugger URL for programmatic browser attachment. ### Browser Version Info The connection retrieves browser metadata: ```rust #[derive(Deserialize)] #[serde(rename_all = "camelCase")] pub struct BrowserVersionInfo { #[serde(rename = "webSocketDebuggerUrl")] pub web_socket_debugger_url: Option, #[serde(rename = "Browser")] pub browser: Option, } ``` 资料来源：[cli/src/native/cdp/types.rs](cli/src/native/cdp/types.rs) ## CDP Protocol Domains ### Supported Domains The agent-browser supports CDP domains for: | Domain | Purpose | Key Commands | |--------|---------|--------------| | Page | Page navigation and loading | navigate, reload, back, forward | | Runtime | JavaScript execution | evaluate, callFunctionOn | | DOM | DOM manipulation | getDocument, describeNode | | Input | User input simulation | dispatchEvent, insertText | | Network | Network request interception | setRequestInterception, getResponseBody | | Target | Browser target management | createTarget, attachToTarget | ### Browser Automation Actions The following high-level actions are available via CDP: ```bash # Navigation agent-browser open agent-browser back agent-browser forward agent-browser reload # DOM Interaction agent-browser click @e1 agent-browser fill @e2 "text" agent-browser type @e3 "input" agent-browser hover @e4 agent-browser scroll down 500 # State Queries agent-browser snapshot agent-browser screenshot agent-browser get text @e1 agent-browser get attr @e1 href # JavaScript agent-browser evaluate "document.title" ``` ## Error Handling ### WebDriver Fallback The daemon gracefully handles unsupported actions when using WebDriver backend: ```rust Err(anyhow::anyhow!( "Action '{}' is not supported on the WebDriver backend", action )) ``` ### CDP Error Propagation CDP errors are propagated through the action chain, enabling detailed error messages for debugging failed browser operations. ## Performance Considerations ### Session Management - Each browser session maintains a persistent CDP connection - Sessions can be named and persisted for multi-session workflows - State persistence allows resuming automation tasks ### Network Idle Detection The daemon supports waiting for network idle states: ```bash agent-browser wait --load networkidle ``` This is essential for SPAs and applications with dynamic content loading. ## Security Model ### Credential Management The daemon provides a secure credential vault for browser authentication: ```bash agent-browser set credentials ``` ### Cookie Management Cookies can be set from various formats: ```bash agent-browser cookies set --curl [--domain ] ``` Auto-detects JSON, cURL, and Cookie-header file formats. ## Extension Points ### Custom CDP Scripts Execute arbitrary JavaScript in the browser context: ```bash agent-browser addscript