# https://github.com/browserbase/stagehand Project Manual

Generated: 2026-05-16 10:25:34 UTC

## Table of Contents

- [Project Introduction](#project-introduction)
- [Architecture Overview](#architecture-overview)
- [CDP Engine](#cdp-engine)
- [Core Actions](#core-actions)
- [DOM and Accessibility Tree](#dom-accessibility)
- [LLM Providers](#llm-providers)
- [Agent System](#agent-system)
- [MCP Integration](#mcp-integration)
- [Server API](#server-api)
- [CLI Tools](#cli-tools)

<a id='project-introduction'></a>

## Project Introduction

### Related Pages

Related topics: [Architecture Overview](#architecture-overview), [CDP Engine](#cdp-engine), [LLM Providers](#llm-providers)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/browserbase/stagehand/blob/main/README.md)
- [packages/core/README.md](https://github.com/browserbase/stagehand/blob/main/packages/core/README.md)
- [packages/cli/README.md](https://github.com/browserbase/stagehand/blob/main/packages/cli/README.md)
- [packages/cli/CHANGELOG.md](https://github.com/browserbase/stagehand/blob/main/packages/cli/CHANGELOG.md)
- [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts)
- [packages/core/lib/v3/agent/MicrosoftCUAClient.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/agent/MicrosoftCUAClient.ts)
</details>

# Project Introduction

Stagehand is an **AI-powered browser automation framework** developed by [Browserbase](https://browserbase.com) that enables developers to control web browsers using natural language instructions combined with precise code-based control. The framework bridges the gap between high-level AI agents and low-level browser automation tools like Selenium, Playwright, or Puppeteer.

## Project Overview

Stagehand lets developers decide, step by step, when to rely on AI and when to write explicit code. This hybrid approach keeps workflows flexible for unfamiliar pages while staying deterministic enough for production environments.

### Key Value Propositions

| Feature | Description |
|---------|-------------|
| **Hybrid Control** | Combine AI-driven navigation with deterministic code execution |
| **Self-Healing** | Auto-caching and action recovery when website changes occur |
| **Action Preview** | Preview AI-generated actions before execution |
| **Repeatable Workflows** | Cache and reuse actions to save time and tokens |
| **Production Ready** | Built for reliable automation in production systems |

Source: [README.md:1-50]()

## Architecture Overview

Stagehand follows a monorepo architecture using pnpm workspaces and Turborepo for efficient build management and dependency handling.

```mermaid
graph TD
    A[stagehand Repository] --> B[packages/core]
    A --> C[packages/cli]
    B --> D[Agent System]
    B --> E[Inference Engine]
    B --> F[Browser Context Manager]
    D --> G[Microsoft CUA Client]
    D --> H[Action Executors]
    F --> I[Playwright Integration]
    F --> J[Frame Locator]
    E --> K[Inference Logging]
    E --> L[Cache Management]
    C --> M[Daemon Controller]
    C --> N[CLI Commands]
    M --> F
```

### Core Packages

| Package | Purpose | Key Files |
|---------|---------|-----------|
| `packages/core` | Main automation engine with AI agent capabilities | `lib/v3/agent/*`, `lib/v3/understudy/*` |
| `packages/cli` | Command-line interface for browser control | `CHANGELOG.md`, README commands |

Source: [packages/cli/CHANGELOG.md:1-15]()

## Project Structure

```
stagehand/
├── packages/
│   ├── core/                    # Core automation package
│   │   ├── lib/
│   │   │   ├── v3/
│   │   │   │   ├── agent/       # AI agent implementation
│   │   │   │   │   ├── MicrosoftCUAClient.ts
│   │   │   │   │   └── prompts/
│   │   │   │   │       └── agentSystemPrompt.ts
│   │   │   │   └── understudy/  # Context and frame management
│   │   │   │       ├── context.ts
│   │   │   │       └── frameLocator.ts
│   │   │   └── inferenceLogUtils.ts
│   │   └── README.md
│   └── cli/                     # CLI tooling
│       ├── CHANGELOG.md
│       └── README.md
├── package.json
├── pnpm-workspace.yaml
└── turbo.json
```

Source: [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:1-50]()

## Agent System Architecture

The agent system is the brain of Stagehand, responsible for interpreting user instructions and generating appropriate browser actions.

### Microsoft CUA Integration

Stagehand implements a Microsoft CUA (Computer Use Agent) client that provides structured function calling capabilities:

```mermaid
graph LR
    A[User Instruction] --> B[MicrosoftCUAClient]
    B --> C[Action Generation]
    C --> D[left_click]
    C --> E[scroll]
    C --> F[visit_url]
    C --> G[type]
    C --> H[wait]
    C --> I[web_search]
```

#### Supported Actions

| Action | Parameters | Description |
|--------|------------|-------------|
| `left_click` | `coordinate: [x, y]` | Click at specified coordinates |
| `scroll` | `pixels: number` | Scroll up (positive) or down (negative) |
| `visit_url` | `url: string` | Navigate to a URL |
| `type` | `text, press_enter, delete_existing_text` | Type text into input fields |
| `wait` | `time: number` | Wait for specified seconds |
| `web_search` | `query: string` | Perform a web search |
| `history_back` | - | Navigate back in browser history |
| `pause_and_memorize_fact` | `fact: string` | Store information for later use |
| `terminate` | `status: success/failure` | End the task execution |

Source: [packages/core/lib/v3/agent/MicrosoftCUAClient.ts:1-80]()
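
One way to picture the action table is as a discriminated union keyed on the action name. The sketch below is illustrative only: the names `CuaAction` and `describeAction` are hypothetical and are not taken from `MicrosoftCUAClient.ts`; only the action names and parameter names follow the table.

```typescript
// Hypothetical model of the actions listed in the table above.
type CuaAction =
  | { name: "left_click"; coordinate: [number, number] }
  | { name: "scroll"; pixels: number }
  | { name: "visit_url"; url: string }
  | { name: "type"; text: string; press_enter?: boolean; delete_existing_text?: boolean }
  | { name: "wait"; time: number }
  | { name: "web_search"; query: string }
  | { name: "history_back" }
  | { name: "pause_and_memorize_fact"; fact: string }
  | { name: "terminate"; status: "success" | "failure" };

// Produce a short human-readable description of an action.
// The switch is exhaustive, so adding an action to the union without
// handling it here fails type-checking.
function describeAction(action: CuaAction): string {
  switch (action.name) {
    case "left_click":
      return `click at (${action.coordinate[0]}, ${action.coordinate[1]})`;
    case "scroll":
      return `scroll by ${action.pixels}px`;
    case "visit_url":
      return `navigate to ${action.url}`;
    case "type":
      return `type "${action.text}"${action.press_enter ? " and press Enter" : ""}`;
    case "wait":
      return `wait ${action.time}s`;
    case "web_search":
      return `search for "${action.query}"`;
    case "history_back":
      return "go back";
    case "pause_and_memorize_fact":
      return `remember: ${action.fact}`;
    case "terminate":
      return `terminate (${action.status})`;
  }
}
```

Modelling the actions this way lets the compiler reject a `left_click` without a coordinate or a `terminate` with an unknown status before anything reaches the browser.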

### System Prompt Structure

The agent uses a structured XML-based system prompt that organizes instructions into distinct sections:

```mermaid
graph TD
    P[System Prompt] --> A[Identity]
    P --> T[Task Definition]
    P --> M[Mindset Guidelines]
    P --> G[Action Guidelines]
    P --> N[Navigation Rules]
    P --> S[Strategy]
    P --> R[Roadblocks]
    P --> V[Variables]
    P --> C[Completion]
```

The prompt templates include conditional sections for:
- **Page Understanding Protocol**: Instructions for analyzing page state
- **Search Integration**: Optional search tool usage when URL confidence is low
- **Variable Substitution**: Support for `%variableName%` syntax in form interactions
- **Captcha Handling**: Optional roadblocks section when auto-solving is enabled

Source: [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:50-150]()
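
The `%variableName%` syntax mentioned in the list can be illustrated with a small substitution helper. This is a minimal sketch, not Stagehand's implementation: the function name `substituteVariables` and the rule that unknown placeholders are left untouched are assumptions made for the example.

```typescript
// Replace %name% placeholders with values from `vars`.
// Assumption of this sketch: placeholders without a matching variable are
// left as-is, so a missing variable stays visible in the output.
function substituteVariables(text: string, vars: Record<string, string>): string {
  return text.replace(
    /%([A-Za-z_][A-Za-z0-9_]*)%/g,
    (match: string, name: string): string => {
      // Only own properties count, so names like "constructor" are not resolved by accident.
      const value = Object.prototype.hasOwnProperty.call(vars, name) ? vars[name] : undefined;
      return value ?? match;
    },
  );
}
```

A call such as `substituteVariables("Login as %username%", { username: "alice" })` fills in the value at execution time, which keeps the literal value out of the instruction text itself.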

## Browser Context Management

Stagehand manages browser contexts through the `understudy` module, which handles complex scenarios like iframe interactions and multi-page workflows.

### Frame Locator System

The frame locator handles cross-origin iframe communication and lifecycle events:

```mermaid
sequenceDiagram
    participant Parent as Parent Page
    participant Child as Child Frame
    participant Context as BrowserContext
    
    Parent->>Context: Create Frame
    Context->>Child: Register Lifecycle Events
    Child-->>Parent: DOMContentLoaded
    Parent->>Context: Wait for MainWorld
    Context-->>Parent: Execution Context Ready
```

Key responsibilities include:
- Resolving owning `Page` by main frame ID
- Applying initialization scripts to pages
- Managing lifecycle events (DOMContentLoaded, load, networkIdle)
- Handling OOPIF (Out-of-Process iframe) scenarios

Source: [packages/core/lib/v3/understudy/context.ts:1-50](), [packages/core/lib/v3/understudy/frameLocator.ts:1-40]()

## Inference and Caching

Stagehand implements an inference logging system that enables self-healing automation by caching and reusing action results.

### Summary File Structure

```
<inferenceType>_summary.json
```

Where `inferenceType` is one of:
- `act`: Action execution results, stored in `act_summary.json`
- `observe`: Page observation results, stored in `observe_summary.json`
- `extract`: Data extraction results, stored in `extract_summary.json`

The system reads and writes JSON files containing arrays of inference results, enabling:
- **Action Replay**: Re-execute previously successful actions
- **Self-Healing**: Automatically recover from website changes
- **Token Optimization**: Skip LLM inference when cached results are valid

Source: [packages/core/lib/inferenceLogUtils.ts:1-60]()
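
The read-append-write cycle for a summary file can be sketched as a pure function over the file's text. The sketch assumes inference types named `act`, `observe`, and `extract` (so files are named, for example, `act_summary.json`) and assumes each file holds one object with a single array keyed `<inferenceType>_summary`. Both the layout and the names `summaryFileName` and `appendSummary` are assumptions for illustration, not the contents of `inferenceLogUtils.ts`.

```typescript
type InferenceType = "act" | "observe" | "extract";

// File name for a given inference type, following the naming pattern above.
function summaryFileName(type: InferenceType): string {
  return `${type}_summary.json`;
}

// Append one entry to the serialized summary and return the new JSON text.
// `existing` is the current file content, or undefined if there is no file yet.
// Assumed layout: { "<type>_summary": [ ...entries ] }.
function appendSummary(existing: string | undefined, type: InferenceType, entry: unknown): string {
  const key = `${type}_summary`;
  let entries: unknown[] = [];
  if (existing !== undefined) {
    try {
      const parsed: unknown = JSON.parse(existing);
      if (typeof parsed === "object" && parsed !== null) {
        const candidate = (parsed as Record<string, unknown>)[key];
        if (Array.isArray(candidate)) entries = candidate;
      }
    } catch {
      // Corrupt content: start over with an empty list instead of failing the run.
    }
  }
  return JSON.stringify({ [key]: [...entries, entry] }, null, 2);
}
```

Keeping the function free of file I/O makes the merge rule easy to test; a caller would read the file, pass its text in, and write the returned string back.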

## CLI Architecture

The Stagehand CLI provides a command-line interface for browser automation with support for both local and remote browser execution.

### Execution Modes

| Mode | Detection | Description |
|------|-----------|-------------|
| `remote` | `BROWSERBASE_API_KEY` is set | Uses Browserbase cloud infrastructure |
| `local` | Default fallback | Uses a browser on the local machine |

The daemon-based architecture ensures:
- Persistent browser sessions
- Automatic recovery on failures
- Session state preservation across commands

Source: [packages/cli/README.md:1-100]()
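
The mode detection rule from the table can be written as a single function. This is a sketch under assumptions: the name `detectMode` is hypothetical, and treating an empty or whitespace-only key as "not set" is a choice made for the example rather than documented CLI behavior.

```typescript
type BrowseMode = "local" | "remote";

// Pick the execution mode from environment variables.
// The environment is passed in as a parameter so the rule is easy to test.
function detectMode(env: Record<string, string | undefined>): BrowseMode {
  const key = env["BROWSERBASE_API_KEY"];
  // Assumption: an empty or whitespace-only value counts as "not set".
  return key !== undefined && key.trim() !== "" ? "remote" : "local";
}
```

In a Node.js program this would typically be called as `detectMode(process.env)`.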

### CLI Command Categories

#### Navigation Commands
```bash
browse open <url> [--wait load|domcontentloaded|networkidle]
browse reload
browse back
browse forward
```

#### Interaction Commands
```bash
browse click <ref>
browse type <text>
browse press <key>
browse fill <selector> <value>
```

#### Information Commands
```bash
browse get url|title|text|html|value|box
browse snapshot [-c|--compact]
browse screenshot
```

#### Session Management
```bash
browse start|stop|restart
browse env local|remote
browse attach <port|url>
```

Source: [packages/cli/README.md:100-200]()

## Getting Started

### Installation from Source

```bash
git clone https://github.com/browserbase/stagehand.git
cd stagehand
pnpm install
pnpm run build
pnpm run example
```

### Branch Installation

Using [gitpkg](https://github.com/EqualMa/gitpkg), install directly from a branch:

```json
"@browserbasehq/stagehand": "https://gitpkg.now.sh/browserbase/stagehand/packages/core?<branchName>"
```

### Environment Configuration

Create a `.env` file from the example:

```bash
cp .env.example .env
# Add your API keys:
# BROWSERBASE_API_KEY=your_key
# OPENAI_API_KEY=your_key (or other LLM provider)
```

Source: [README.md:50-80](), [packages/core/README.md:50-80]()

## Version History

Recent significant changes in the CLI package:

| Version | Change | PR |
|---------|--------|-----|
| 0.4.2 | Added `browse get markdown` command for HTML-to-markdown conversion | [#1907](https://github.com/browserbase/stagehand/pull/1907) |
| 0.4.1 | Fixed invalid metadata key using underscore | [#1911](https://github.com/browserbase/stagehand/pull/1911) |
| 0.4.0 | Added new feature | [#1889](https://github.com/browserbase/stagehand/pull/1889) |

Source: [packages/cli/CHANGELOG.md:1-20]()

## License and Community

Stagehand is released under the **MIT License**. The project maintains an active community through:

- **Documentation**: [docs.stagehand.dev](https://docs.stagehand.dev)
- **Discord Community**: [stagehand.dev/discord](https://stagehand.dev/discord)
- **DeepWiki Integration**: Ask questions at [deepwiki.com/browserbase/stagehand](https://deepwiki.com/browserbase/stagehand)

A Python implementation is also available at [github.com/browserbase/stagehand-python](https://github.com/browserbase/stagehand-python).

Source: [README.md:1-30](), [packages/core/README.md:1-30]()

---

<a id='architecture-overview'></a>

## Architecture Overview

### Related Pages

Related topics: [Project Introduction](#project-introduction), [CDP Engine](#cdp-engine), [Core Actions](#core-actions), [Server API](#server-api)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts)
- [packages/core/lib/v3/understudy/context.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/context.ts)
- [packages/core/lib/v3/understudy/frameLocator.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/frameLocator.ts)
- [packages/core/lib/v3/agent/MicrosoftCUAClient.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/agent/MicrosoftCUAClient.ts)
- [README.md](https://github.com/browserbase/stagehand/blob/main/README.md)
- [packages/cli/README.md](https://github.com/browserbase/stagehand/blob/main/packages/cli/README.md)

</details>

# Architecture Overview

## Introduction

Stagehand is an AI-powered browser automation framework that enables developers to control web browsers using natural language instructions combined with precise code-based control. The architecture is designed to bridge the gap between high-level AI agents and low-level browser automation frameworks like Playwright, providing reliability, extensibility, and self-healing capabilities essential for production environments.

The framework operates on a multi-layered architecture that separates concerns between browser session management, agent reasoning, tool execution, and page interaction primitives. This design allows users to choose when to leverage AI for navigating unfamiliar pages and when to use explicit code for deterministic operations.

## Core Architecture Layers

Stagehand's architecture consists of four primary layers working in concert to provide browser automation capabilities:

| Layer | Purpose | Key Components |
|-------|---------|----------------|
| **Agent Layer** | High-level reasoning and decision making | MicrosoftCUAClient, System Prompts |
| **Context Layer** | Browser session and page lifecycle management | BrowserContext, Page management |
| **Understudy Layer** | Low-level browser primitives and frame handling | FrameLocator, Page interactions |
| **Transport Layer** | Local/Remote browser connectivity | CDP (Chrome DevTools Protocol) |

## Browser Context Management

The `BrowserContext` class (defined in `context.ts`) serves as the central hub for managing browser sessions. It handles multiple pages, initialization scripts, and coordinate-based element interactions.

### Page Management

The context maintains a mapping of pages organized by target ID, enabling multi-tab support:

```typescript
pages(): Page[] {
  const rows: Array<{ tid: TargetId; page: Page; created: number }> = [];
  for (const [tid, page] of this.pagesByTarget) {
    if (this.typeByTarget.get(tid) === "page") {
      rows.push({ tid, page, created: this.createdAtByTarget.get(tid) ?? 0 });
    }
  }
  rows.sort((a, b) => a.created - b.created);
  return rows.map((r) => r.page);
}
```

Source: [context.ts:lines]()

Pages are returned in creation order (oldest to newest), and OOPIF (Out-of-Process iframe) targets are intentionally excluded from this listing to maintain a clean multi-tab abstraction.
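
The filter-then-sort rule can be tried in isolation with plain data. The sketch below is a simplified stand-in, not the `BrowserContext` class: the `TargetRecord` shape and the function name `listPageIds` are hypothetical, and a single map replaces the three maps used in the excerpt.

```typescript
type TargetId = string;
type TargetType = "page" | "iframe";

interface TargetRecord {
  type: TargetType;
  createdAt?: number; // a missing timestamp sorts first, mirroring `?? 0` in the excerpt
}

// Return the ids of top-level pages, oldest first.
// Targets of type "iframe" (OOPIFs) are skipped.
function listPageIds(targets: Map<TargetId, TargetRecord>): TargetId[] {
  return Array.from(targets.entries())
    .filter(([, record]) => record.type === "page")
    .sort(([, a], [, b]) => (a.createdAt ?? 0) - (b.createdAt ?? 0))
    .map(([id]) => id);
}
```

Sorting by creation time rather than relying on map insertion order keeps the tab order stable even if targets are registered out of order.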

### Initialization Scripts

The context supports both seeding and full registration of initialization scripts:

```typescript
private async applyInitScriptsToPage(
  page: Page,
  opts?: { seedOnly?: boolean },
): Promise<void> {
  if (opts?.seedOnly) {
    for (const source of this.initScripts) {
      page.seedInitScript(source);
    }
    return;
  }
  for (const source of this.initScripts) {
    await page.registerInitScript(source);
  }
}
```

Source: [context.ts:lines]()

This dual-mode initialization allows scripts to be either eagerly registered or lazily seeded for later injection.

## Frame Handling Architecture

Stagehand implements sophisticated frame management through the `FrameLocator` module, handling both same-process and cross-process frame scenarios including OOPIFs.

### Lifecycle-Aware Frame Attachment

The frame attachment process waits for specific lifecycle events before exposing the main world context:

```typescript
const onLifecycle = (evt: Protocol.Page.LifecycleEventEvent) => {
  if (
    evt.frameId !== childFrameId ||
    (evt.name !== "DOMContentLoaded" &&
      evt.name !== "load" &&
      evt.name !== "networkIdle" &&
      evt.name !== "networkidle")
  ) {
    return;
  }
  if (hasMainWorldOnParent()) return finish();
  // ... handle frame ownership transfer
};
```

Source: [frameLocator.ts:lines]()

This approach ensures that automation scripts don't attempt to interact with frames before they have reached a stable state.
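
The guard at the top of the handler can be expressed as a standalone predicate, which makes the accepted events explicit. This is a sketch with hypothetical names (`LifecycleEvent`, `READY_EVENTS`, `isReadyEventFor`); only the four event names come from the excerpt.

```typescript
// Minimal shape of a lifecycle event for this sketch.
interface LifecycleEvent {
  frameId: string;
  name: string;
}

// Lifecycle names treated as "frame is ready". Both spellings of the
// network-idle event appear in the handler above, so both are accepted.
const READY_EVENTS: ReadonlySet<string> = new Set([
  "DOMContentLoaded",
  "load",
  "networkIdle",
  "networkidle",
]);

// True when the event belongs to the awaited frame and signals readiness.
function isReadyEventFor(frameId: string, evt: LifecycleEvent): boolean {
  return evt.frameId === frameId && READY_EVENTS.has(evt.name);
}
```

Events for other frames, and lifecycle names outside the set, are ignored in the same way the early `return` ignores them in the handler.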

## Agent System Design

### System Prompt Architecture

The agent receives structured prompts that organize its behavior into distinct sections:

```xml
<system>
  <identity>You are a web automation assistant...</identity>
  <task>
    <goal>${executionInstruction}</goal>
    <date display="local" iso="${isoDate}">${localeDate}</date>
  </task>
  <page>
    <startingUrl>...</startingUrl>
  </page>
  <mindset>...</mindset>
  <guidelines>...</guidelines>
  <page_understanding_protocol>...</page_understanding_protocol>
  <navigation>...</navigation>
  <tools>...</tools>
  <strategy>...</strategy>
  <roadblocks>...</roadblocks>
  <variables>...</variables>
  <completion>...</completion>
</system>
```

Source: [agentSystemPrompt.ts:lines]()

### Hybrid Mode Page Understanding

The system supports two modes of page understanding based on the `isHybridMode` flag:

| Mode | Primary Tool | Secondary Tool | Use Case |
|------|-------------|----------------|----------|
| **Hybrid** | `screenshot` | `ariaTree` | Visual confirmation + accessible content |
| **Standard** | `ariaTree` | `screenshot` | Text-focused accessibility tree |

Source: [agentSystemPrompt.ts:lines]()

The page understanding protocol ensures agents start by comprehending the page state before taking actions:

```xml
<page_understanding_protocol>
  <step_1>
    <title>UNDERSTAND THE PAGE</title>
    <primary_tool>
      <name>screenshot|ariaTree</name>
      <usage>Get complete page context before taking actions</usage>
    </primary_tool>
  </step_1>
</page_understanding_protocol>
```
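
The tool ordering from the mode table can be captured in a small helper. The function names `pageUnderstandingTools` and `renderPrimaryTool` are hypothetical, and the rendered fragment only approximates the `<primary_tool>` element shown above.

```typescript
type PageTool = "screenshot" | "ariaTree";

// Choose which tool the agent should try first, following the mode table:
// hybrid mode is screenshot-first, standard mode is ariaTree-first.
function pageUnderstandingTools(isHybridMode: boolean): { primary: PageTool; secondary: PageTool } {
  return isHybridMode
    ? { primary: "screenshot", secondary: "ariaTree" }
    : { primary: "ariaTree", secondary: "screenshot" };
}

// Render the <primary_tool> fragment for the chosen mode.
function renderPrimaryTool(isHybridMode: boolean): string {
  const { primary } = pageUnderstandingTools(isHybridMode);
  return `<primary_tool><name>${primary}</name></primary_tool>`;
}
```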

### Microsoft CUA Agent Actions

The Microsoft CUA client defines a comprehensive set of agent actions:

| Action | Purpose | Required Parameters |
|--------|---------|---------------------|
| `left_click` | Click at coordinate | `coordinate: [x, y]` |
| `scroll` | Scroll page | `pixels: number`, `coordinate?: [x, y]` |
| `visit_url` | Navigate to URL | `url: string` |
| `web_search` | Perform search | `query: string` |
| `history_back` | Navigate backward | None |
| `pause_and_memorize_fact` | Store context | `fact: string` |
| `wait` | Pause execution | `time: number` |
| `terminate` | End task | `status: success\|failure` |
| `type` | Input text | `text: string`, `press_enter?: boolean` |
| `key` | Keyboard input | `keys: string[]` |

Source: [MicrosoftCUAClient.ts:lines]()

### FARA Function Calling Protocol

The agent uses an XML-based function calling format:

```xml
<tools>
${toolSchema}
</tools>

<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
```

Source: [MicrosoftCUAClient.ts:lines]()

This format separates agent thoughts from function calls, enabling clear reasoning before action execution.
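
A response in this format can be split into reasoning text and tool calls with a small parser. The following is a minimal sketch, not the client's actual parsing code: the names `ToolCall` and `parseToolCalls` are hypothetical, and silently skipping malformed blocks is an assumption of the example.

```typescript
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Extract every <tool_call>...</tool_call> block from a model response.
// Blocks whose body is not JSON of the expected shape are skipped.
function parseToolCalls(response: string): ToolCall[] {
  const calls: ToolCall[] = [];
  const pattern = /<tool_call>([\s\S]*?)<\/tool_call>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(response)) !== null) {
    try {
      const parsed: unknown = JSON.parse((match[1] ?? "").trim());
      if (typeof parsed !== "object" || parsed === null) continue;
      const { name, arguments: args } = parsed as { name?: unknown; arguments?: unknown };
      if (typeof name !== "string") continue;
      const isObject = typeof args === "object" && args !== null && !Array.isArray(args);
      calls.push({ name, arguments: isObject ? (args as Record<string, unknown>) : {} });
    } catch {
      // Malformed JSON inside the tags: ignore this block.
    }
  }
  return calls;
}
```

Any text outside the `<tool_call>` tags is left alone, so the model's reasoning can be logged separately from the actions it requests.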

## Variable System

The framework supports variable substitution in tool calls using `%variableName%` syntax:

```typescript
const variableToolsNote = isHybridMode
  ? "Use %variableName% syntax in the type, fillFormVision, or act tool's value/text/action fields."
  : "Use %variableName% syntax in the act or fillForm tool's action fields.";
```

Source: [agentSystemPrompt.ts:lines]()

Variables are defined with optional descriptions and rendered in XML format:

```xml
<variables>
  <variable name="password" />
  <variable name="username">The username for login</variable>
</variables>
```
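
Rendering such a block from a variable map might look like the sketch below. The names `VariableSpec`, `escapeXml`, and `renderVariables` are hypothetical, and the XML escaping is an assumption of the example. Only names and descriptions are rendered, matching the sample above.

```typescript
interface VariableSpec {
  description?: string;
}

// Escape characters that would break XML attribute or text content.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render variable names and optional descriptions as a <variables> block.
// Variables without a description become self-closing elements.
function renderVariables(vars: Record<string, VariableSpec>): string {
  const lines = Object.entries(vars).map(([name, spec]) =>
    spec.description
      ? `  <variable name="${escapeXml(name)}">${escapeXml(spec.description)}</variable>`
      : `  <variable name="${escapeXml(name)}" />`,
  );
  return ["<variables>", ...lines, "</variables>"].join("\n");
}
```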

## Session Management

### Daemon-Based Architecture

The CLI architecture uses a persistent daemon for browser session management:

```mermaid
graph TD
    A[CLI Command] --> B[Daemon Check]
    B -->|Running| C[Send Command]
    B -->|Not Running| D[Auto-restart Daemon]
    D --> C
    C --> E[Browser Instance]
    E --> F[Page Operations]
```

The daemon supports two execution modes:

| Mode | Trigger | Use Case |
|------|---------|----------|
| `remote` | `BROWSERBASE_API_KEY` is set | Cloud browser infrastructure |
| `local` | No API key detected | Local browser instances |

### Multi-Session Support

Sessions can be named for parallel browser instances:

```bash
browse --session <name>
```

The context stores metadata using session names, enabling clean separation of concurrent automation tasks.

## Tool Categories

### Navigation Tools

| Tool | Description |
|------|-------------|
| `open` | Navigate to URL with configurable wait states |
| `reload` | Refresh current page |
| `back` / `forward` | Browser history navigation |

### Interaction Tools

| Tool | Description |
|------|-------------|
| `click` | Click element by reference or coordinates |
| `type` | Text input with optional delays |
| `press` | Keyboard shortcuts |
| `hover` | Mouse hover at coordinates |
| `scroll` | Scroll operations with delta support |

### Page Information Tools

| Tool | Description |
|------|-------------|
| `snapshot` | Accessibility tree with element refs |
| `screenshot` | Visual capture (PNG/JPEG) |
| `get` | Retrieve URL, title, text, HTML, values |
| `get markdown` | Convert HTML to markdown |

## Strategy Guidelines

The system embeds strategic guidelines for agent behavior:

```xml
<strategy>
  <item>CRITICAL: Use extract ONLY when the task explicitly requires structured data output</item>
  <item>Keep actions atomic and verify outcomes before proceeding</item>
  <item>For each action, provide clear reasoning about why you're taking that step</item>
  <item>When you need to input text, prefer using the keys tool to type the entire sequence at once</item>
</strategy>
```

Source: [agentSystemPrompt.ts:lines]()

## Configuration Options

The agent system prompt accepts the following configuration parameters:

| Parameter | Type | Purpose |
|-----------|------|---------|
| `executionInstruction` | string | Task description for the agent |
| `url` | string | Starting page URL |
| `systemInstructions` | string | Custom system-level instructions (CDATA) |
| `variables` | Record | Variable name/value pairs |
| `isHybridMode` | boolean | Enable screenshot-first page understanding |
| `captchasAutoSolve` | boolean | Enable CAPTCHA auto-solving (shows roadblocks section) |
| `hasSearch` | boolean | Enable search tool usage |
| `isLocal` | boolean | Local vs remote browser context |

## Error Handling

The context manages HTTP header configuration errors with detailed session reporting:

```typescript
const failures = Array.from(result.errors.entries()).map(([target, entry]) => {
  const reason = entry.result.reason as Error;
  const sid = entry.session.id ?? "unknown";
  const message = reason?.message ?? String(reason);
  return `session=${sid} error=${message}`;
});

if (failures.length) {
  throw new StagehandSetExtraHTTPHeadersError(failures);
}
```

Source: [context.ts:lines]()
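
The pattern of collecting per-session failures and raising one combined error can be shown in a self-contained form. The sketch uses hypothetical names (`Outcome`, `SessionOutcome`, `HeaderUpdateError`, `assertAllSucceeded`) and a simplified outcome type; it mirrors the message format of the excerpt but is not the `context.ts` implementation.

```typescript
// Simplified per-session outcome, similar in spirit to a settled promise.
type Outcome = { status: "fulfilled" } | { status: "rejected"; reason: unknown };

interface SessionOutcome {
  sessionId?: string;
  outcome: Outcome;
}

// One error that carries every per-session failure message.
class HeaderUpdateError extends Error {
  failures: string[];
  constructor(failures: string[]) {
    super(`setExtraHTTPHeaders failed for ${failures.length} session(s): ${failures.join("; ")}`);
    this.name = "HeaderUpdateError";
    this.failures = failures;
  }
}

// Throw a single aggregated error if any session failed; otherwise return.
function assertAllSucceeded(entries: SessionOutcome[]): void {
  const failures: string[] = [];
  for (const entry of entries) {
    const outcome = entry.outcome;
    if (outcome.status !== "rejected") continue;
    const reason = outcome.reason;
    const message = reason instanceof Error ? reason.message : String(reason);
    failures.push(`session=${entry.sessionId ?? "unknown"} error=${message}`);
  }
  if (failures.length > 0) throw new HeaderUpdateError(failures);
}
```

Reporting all failures at once, rather than stopping at the first, tells the caller which sessions ended up without the headers.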

## Extension Points

Stagehand provides several extension mechanisms:

1. **Custom Instructions**: Via `systemInstructions` parameter with CDATA wrapping for complex multi-line content
2. **Variables**: For sensitive data injection (passwords, tokens)
3. **Init Scripts**: JavaScript injection at page load time
4. **Tool Schema**: Extensible action definitions in the Microsoft CUA client

---

<a id='cdp-engine'></a>

## CDP Engine

### Related Pages

Related topics: [Architecture Overview](#architecture-overview), [DOM and Accessibility Tree](#dom-accessibility), [Core Actions](#core-actions)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [packages/core/lib/v3/understudy/cdp.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/cdp.ts)
- [packages/core/lib/v3/understudy/frameRegistry.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/frameRegistry.ts)
- [packages/core/lib/v3/understudy/executionContextRegistry.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/executionContextRegistry.ts)
- [packages/core/lib/v3/launch/local.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/launch/local.ts)
- [packages/core/lib/v3/launch/browserbase.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/launch/browserbase.ts)
- [packages/core/lib/v3/understudy/context.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/context.ts)
- [packages/core/lib/v3/understudy/cookies.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/cookies.ts)
- [packages/core/lib/v3/understudy/frameLocator.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/frameLocator.ts)
</details>

# CDP Engine

The CDP (Chrome DevTools Protocol) Engine is the low-level abstraction layer in Stagehand that provides direct communication between the framework and web browsers. It orchestrates browser launching, session management, frame handling, execution context management, and CDP command dispatching.

## Architecture Overview

```mermaid
graph TD
    A[Stagehand API] --> B[CDP Engine]
    B --> C[Launch Module]
    B --> D[Session Manager]
    B --> E[Frame Registry]
    B --> F[Execution Context Registry]
    C --> G[Local Browser]
    C --> H[Browserbase Cloud]
    D --> I[CDP WebSocket Connections]
    E --> J[Frame Lifecycle Events]
    F --> K[Isolated JS Contexts]
```

The CDP Engine serves as the foundation layer that Stagehand's high-level browser automation primitives are built upon. It abstracts away the complexity of raw CDP WebSocket communication while providing type-safe interfaces for all browser operations.

Source: [packages/core/lib/v3/understudy/cdp.ts:1-50]()

## Core Components

### Launch Module

The launch module handles browser instantiation across different deployment environments.

| Launch Mode | Source File | Description |
|-------------|-------------|-------------|
| Local Browser | `packages/core/lib/v3/launch/local.ts` | Launches a local Chromium-based browser with remote debugging enabled |
| Browserbase Cloud | `packages/core/lib/v3/launch/browserbase.ts` | Connects to remote browser infrastructure for scalable execution |

```mermaid
graph LR
    A[Launch Request] --> B{Local or Remote?}
    B -->|Local| C[local.ts]
    B -->|Remote| D[browserbase.ts]
    C --> E[Local Chromium]
    D --> F[CDP Endpoint]
```

The local launch implementation spawns a Chromium instance on the local machine with the necessary debugging flags enabled. Remote launches establish WebSocket connections to Browserbase's cloud browser fleet.

Source: [packages/core/lib/v3/launch/local.ts:1-100]()

### Session Manager

The Session Manager (`context.ts`) maintains active browser pages and their associated CDP sessions.

```typescript
// Pages retrieval - returns top-level pages oldest to newest
pages(): Page[] {
  const rows: Array<{ tid: TargetId; page: Page; created: number }> = [];
  for (const [tid, page] of this.pagesByTarget) {
    if (this.typeByTarget.get(tid) === "page") {
      rows.push({ tid, page, created: this.createdAtByTarget.get(tid) ?? 0 });
    }
  }
  rows.sort((a, b) => a.created - b.created);
  return rows.map((r) => r.page);
}
```

The session manager intentionally excludes OOPIF (Out-of-Process iframe) targets from page listings, focusing on top-level browsing contexts.

Source: [packages/core/lib/v3/understudy/context.ts:60-75]()

### Frame Registry

The Frame Registry (`frameRegistry.ts`) tracks all frame instances across page hierarchies. It maintains the relationship between parent and child frames, enabling accurate frame targeting for CDP commands.

| Method | Purpose |
|--------|---------|
| `registerFrame` | Track newly created frames |
| `unregisterFrame` | Clean up destroyed frames |
| `getParentFrame` | Resolve parent frame context |
| `getChildFrames` | List direct children of a frame |

### Execution Context Registry

The Execution Context Registry (`executionContextRegistry.ts`) manages JavaScript execution contexts within frames.

```mermaid
graph TD
    A[Page Load] --> B[Create Main World Context]
    B --> C[Optional: Create Isolated World]
    C --> D[Register Context in Registry]
    D --> E[Frame Ready for Script Execution]
    
    F[iframe Navigation] --> G[Create New Context]
    G --> D
```

Each frame can have multiple execution contexts:
- **Main World**: The default JavaScript context where page scripts run
- **Isolated World**: Sandboxed contexts for extension scripts or injected code

Source: [packages/core/lib/v3/understudy/executionContextRegistry.ts:1-80]()

## Frame Lifecycle Management

The `frameLocator.ts` module handles the complex state transitions of browser frames, particularly for iframes and cross-origin navigations.

```typescript
const onLifecycle = (evt: Protocol.Page.LifecycleEventEvent) => {
  if (
    evt.frameId !== childFrameId ||
    (evt.name !== "DOMContentLoaded" &&
      evt.name !== "load" &&
      evt.name !== "networkIdle" &&
      evt.name !== "networkidle")
  ) {
    return;
  }
  // Handle frame initialization
};
```

Key lifecycle events monitored:
- `DOMContentLoaded`: Frame DOM is parsed
- `load`: All resources loaded
- `networkIdle` / `networkidle`: No pending network requests

Source: [packages/core/lib/v3/understudy/frameLocator.ts:25-40]()

## Cookie Management

The `cookies.ts` module provides CDP-compatible cookie handling with validation and normalization.

```typescript
export function normalizeCookieParams(cookies: CookieParam[]): CookieParam[] {
  return cookies.map((c) => {
    if (!c.url && !(c.domain && c.path)) {
      throw new CookieValidationError(
        `Cookie "${c.name}" must have a url or a domain/path pair`
      );
    }
    // Additional validation for secure/sameSite pairing (elided in this excerpt)
    return c;
  });
}
```

| Validation Rule | CDP Requirement |
|-----------------|-----------------|
| URL or Domain/Path | Cookies must have either `url` or both `domain`+`path` |
| Secure + SameSite=None | Browsers require `secure: true` when using `sameSite: "None"` |
| Host-only cookies | When `url` provided, `domain` and `path` are derived |

The module enforces these constraints before sending cookies to CDP, preventing silent failures from browser rejections.

Source: [packages/core/lib/v3/understudy/cookies.ts:30-60]()
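
The first two rules in the table can be combined into a self-contained validator. This is a sketch only: the `CookieParam` shape is trimmed to the fields the rules need, `validateCookies` is a hypothetical name, and the real module may normalize or infer fields (for example from `url`) that this example does not.

```typescript
// Trimmed-down cookie shape, limited to the fields the rules below inspect.
interface CookieParam {
  name: string;
  value: string;
  url?: string;
  domain?: string;
  path?: string;
  secure?: boolean;
  sameSite?: "Strict" | "Lax" | "None";
}

class CookieValidationError extends Error {}

// Validate cookies before they are sent to the browser.
// Returns shallow copies so callers' objects are never mutated.
function validateCookies(cookies: CookieParam[]): CookieParam[] {
  return cookies.map((c) => {
    // Rule 1: either a url, or both domain and path.
    if (!c.url && !(c.domain && c.path)) {
      throw new CookieValidationError(`Cookie "${c.name}" must have a url or a domain/path pair`);
    }
    // Rule 2: sameSite "None" is only accepted together with secure: true.
    if (c.sameSite === "None" && c.secure !== true) {
      throw new CookieValidationError(`Cookie "${c.name}" uses sameSite "None" and must set secure: true`);
    }
    return { ...c };
  });
}
```

Failing early with a named cookie in the message is easier to debug than a cookie that the browser drops without any error.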

## CDP Command Dispatch

The CDP Engine provides a unified interface for sending commands to browsers:

```mermaid
sequenceDiagram
    Client->>CDP Engine: Execute CDP Command
    CDP Engine->>Validation: Check Parameters
    Validation->>Session Manager: Route to Correct Session
    Session Manager->>CDP WebSocket: Send Command
    CDP WebSocket-->>Session Manager: CDP Response
    Session Manager-->>CDP Engine: Typed Response
    CDP Engine-->>Client: Result
```
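
At the wire level, CDP pairs each request with its response through a numeric `id`, and an optional `sessionId` routes a command to an attached target. The sketch below shows that correlation with an injected transport instead of a real WebSocket. The class name `CdpDispatcher` and its callback-style API are hypothetical and chosen to keep the example synchronous; they are not the interface of `cdp.ts`.

```typescript
interface CdpRequest {
  id: number;
  method: string;
  params?: object;
  sessionId?: string;
}

interface CdpResponse {
  id: number;
  result?: unknown;
  error?: { code: number; message: string };
}

type CdpCallback = (error: Error | null, result?: unknown) => void;

class CdpDispatcher {
  private nextId = 1;
  private pending = new Map<number, CdpCallback>();
  private transport: (json: string) => void;

  constructor(transport: (json: string) => void) {
    this.transport = transport;
  }

  // Send a command and remember the callback under the request id.
  send(method: string, params: object, callback: CdpCallback, sessionId?: string): number {
    const id = this.nextId++;
    this.pending.set(id, callback);
    const request: CdpRequest = { id, method, params };
    if (sessionId !== undefined) request.sessionId = sessionId;
    this.transport(JSON.stringify(request));
    return id;
  }

  // Feed every incoming message here. Messages without an id are events
  // and are ignored in this sketch.
  handleMessage(json: string): void {
    const message = JSON.parse(json) as Partial<CdpResponse>;
    if (typeof message.id !== "number") return;
    const callback = this.pending.get(message.id);
    if (!callback) return;
    this.pending.delete(message.id);
    if (message.error) callback(new Error(`${message.error.code}: ${message.error.message}`));
    else callback(null, message.result);
  }
}
```

Wrapping `send` in a `Promise` is a small step from here; the callback form is used only so the flow can be followed without asynchronous scheduling.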

### Supported Command Categories

| Category | Capabilities |
|----------|--------------|
| Page Operations | Navigation, reload, back/forward |
| Frame Management | Frame creation, destruction, activation |
| Script Execution | Evaluate expressions, call functions |
| Input Handling | Mouse, keyboard, touch events |
| Network Monitoring | Request/response capture |
| Runtime Inspection | Console access, breakpoints |

## Initialization Scripts

The CDP Engine supports injecting initialization scripts into pages:

```typescript
private async applyInitScriptsToPage(
  page: Page,
  opts?: { seedOnly?: boolean },
): Promise<void> {
  if (opts?.seedOnly) {
    for (const source of this.initScripts) {
      page.seedInitScript(source);
    }
    return;
  }
  for (const source of this.initScripts) {
    await page.registerInitScript(source);
  }
}
```

| Method | When it is used | How it is called |
|--------|-----------------|------------------|
| `seedInitScript` | `seedOnly` is set | Synchronously, once per stored script (no `await`) |
| `registerInitScript` | Default path | Awaited once per stored script, in order |

## Error Handling

The CDP Engine implements robust error handling for common CDP failure scenarios:

```typescript
// Session-level error collection
const failures = pendingEntries
  .filter((entry) => entry.result.status === "failure")
  .map((entry) => {
    const reason = entry.result.reason as Error;
    const sid = entry.session.id ?? "unknown";
    const message = reason?.message ?? String(reason);
    return `session=${sid} error=${message}`;
  });

if (failures.length) {
  throw new StagehandSetExtraHTTPHeadersError(failures);
}
```

Custom error types:
- `StagehandSetExtraHTTPHeadersError`: Failed header injection
- `CookieValidationError`: Invalid cookie parameters
- `ExecutionContextError`: Script execution failures

Source: [packages/core/lib/v3/understudy/context.ts:45-55]()
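
The aggregation pattern in the excerpt can be tried in isolation. The following is a minimal sketch under stated assumptions: `SessionOutcome`, `PendingEntry`, `summarizeFailures`, and `AggregateSessionError` are hypothetical names that mirror the shape of the excerpt, not Stagehand's exported types.

```typescript
// Hypothetical types: a per-session outcome, tagged the same way as the excerpt.
type SessionOutcome =
  | { status: "success" }
  | { status: "failure"; reason: unknown };

interface PendingEntry {
  session: { id?: string };
  result: SessionOutcome;
}

// Collect one human-readable line per failed session.
function summarizeFailures(entries: PendingEntry[]): string[] {
  const lines: string[] = [];
  for (const entry of entries) {
    const outcome = entry.result;
    if (outcome.status !== "failure") continue;
    const sid = entry.session.id ?? "unknown";
    const reason = outcome.reason;
    const message = reason instanceof Error ? reason.message : String(reason);
    lines.push(`session=${sid} error=${message}`);
  }
  return lines;
}

// A single aggregate error keeps every session's failure visible to the caller.
class AggregateSessionError extends Error {
  readonly failures: string[];

  constructor(failures: string[]) {
    super(`${failures.length} session(s) failed: ${failures.join("; ")}`);
    this.name = "AggregateSessionError";
    this.failures = failures;
  }
}
```

Reporting all failures at once, rather than throwing on the first, means a caller sees every affected session in one error message.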

## Integration with Agent System

The CDP Engine is used by higher-level agents (like MicrosoftCUAClient) to perform browser automation:

```typescript
// Supported actions in Microsoft CUA integration
const supportedActions = [
  "left_click",
  "scroll",
  "visit_url",
  "web_search",
  "history_back",
  "pause_and_memorize_fact",
  "wait",
  "terminate",
];
```

Each action maps to specific CDP commands, with the CDP Engine abstracting the protocol-level details.

Source: [packages/core/lib/v3/agent/MicrosoftCUAClient.ts:50-70]()

## Configuration

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `headless` | boolean | auto | Run browser in headless mode |
| `timeout` | number | 30000 | Default command timeout (ms) |
| `viewport` | Viewport | 1280x720 | Browser viewport size |
| `userAgent` | string | auto | Custom user agent string |
| `ignoreHTTPSErrors` | boolean | false | Allow invalid certificates |

## Summary

The CDP Engine is the foundational layer that enables Stagehand's browser automation capabilities. It provides:

1. **Abstraction**: Type-safe interfaces over raw CDP WebSocket communication
2. **Multi-environment**: Support for both local and cloud browser execution
3. **Lifecycle management**: Frame and execution context tracking
4. **Validation**: Cookie and parameter normalization before CDP commands
5. **Error handling**: Structured error types and recovery mechanisms

All high-level browser automation features in Stagehand build upon the CDP Engine's primitives, making it essential for understanding the framework's internals.

---

<a id='core-actions'></a>

## Core Actions

### Related Pages

Related topics: [CDP Engine](#cdp-engine), [DOM and Accessibility Tree](#dom-accessibility), [Agent System](#agent-system)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [packages/core/lib/v3/agent/MicrosoftCUAClient.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/agent/MicrosoftCUAClient.ts)
- [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts)
- [packages/core/lib/v3/handlers/actHandler.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/handlers/actHandler.ts) (referenced)
- [packages/core/lib/v3/handlers/extractHandler.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/handlers/extractHandler.ts) (referenced)
- [packages/core/lib/v3/handlers/observeHandler.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/handlers/observeHandler.ts) (referenced)
- [packages/core/lib/v3/agent/tools/index.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/agent/tools/index.ts) (referenced)
- [packages/core/lib/v3/understudy/context.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/context.ts)
</details>

# Core Actions

## Overview

Core Actions are the fundamental building blocks of browser automation in Stagehand. They represent the primitive operations that the AI agent can perform to interact with web pages, extract information, and accomplish user-defined tasks. These actions bridge the gap between natural language instructions and executable browser operations.

The Core Actions system is designed around a modular architecture where different handlers manage specific types of interactions. Each handler is responsible for a category of actions, implementing the logic to translate high-level agent decisions into low-level browser commands.

Source: [packages/core/lib/v3/handlers/v3AgentHandler.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/handlers/v3AgentHandler.ts)

## Architecture

The Core Actions system follows a layered architecture:

```mermaid
graph TD
    A[User Instruction] --> B[Agent Handler]
    B --> C[Action Router]
    C --> D[actHandler]
    C --> E[extractHandler]
    C --> F[observeHandler]
    D --> G[Browser CDP Commands]
    E --> G
    F --> G
    H[Microsoft CUA Client] --> G
```

### Handler Components

| Handler | Purpose | Key Operations |
|---------|---------|-----------------|
| `actHandler` | Execute user interaction actions | click, type, scroll, hover, drag |
| `extractHandler` | Extract structured data from pages | DOM parsing, content extraction |
| `observeHandler` | Analyze and understand page state | accessibility tree, element detection |
| `v3AgentHandler` | Orchestrate agent workflow | task planning, action coordination |

Source: [packages/core/lib/v3/handlers/actHandler.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/handlers/actHandler.ts)

## Action Types

### Browser Navigation Actions

| Action | Parameters | Description |
|--------|------------|-------------|
| `visit_url` | `url`, `wait` | Navigate to a specified URL with optional wait state |
| `history_back` | - | Navigate back in browser history |
| `history_forward` | - | Navigate forward in browser history |
| `reload` | - | Reload the current page |

### Interaction Actions

| Action | Parameters | Description |
|--------|------------|-------------|
| `left_click` | `coordinate`, `element` | Click at coordinates or on an element |
| `right_click` | `coordinate`, `element` | Perform right-click context menu action |
| `mouse_move` | `coordinate` | Move mouse to specified position |
| `scroll` | `pixels`, `coordinate` | Scroll by pixel amount (positive=up, negative=down) |
| `drag` | `from`, `to`, `steps` | Perform drag operation with interpolation |
| `type` | `text`, `press_enter`, `delete_existing_text` | Type text into focused input fields |

### Information Retrieval Actions

| Action | Parameters | Description |
|--------|------------|-------------|
| `extract` | `schema`, `prompt` | Extract structured data matching a Zod schema |
| `screenshot` | `full_page` | Capture visual representation of page |
| `aria_tree` | `ref` | Get accessibility tree for element inspection |

### Utility Actions

| Action | Parameters | Description |
|--------|------------|-------------|
| `wait` | `time` | Pause execution for specified seconds |
| `pause_and_memorize_fact` | `fact` | Store information for later retrieval |
| `web_search` | `query` | Execute a web search |
| `terminate` | `status` | End the task with success or failure status |

Source: [packages/core/lib/v3/agent/MicrosoftCUAClient.ts:1-100](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/agent/MicrosoftCUAClient.ts)

## Action Parameters

### Common Parameters

```typescript
interface ActionParameters {
  action: string;           // The action type to perform
  element?: string;          // Element reference (e.g., "@0-5")
  coordinate?: [number, number]; // [x, y] pixel coordinates
  ref?: string;             // Element reference in ariaTree
}
```

### Type Action Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `text` | `string` | Text to type into the input field |
| `press_enter` | `boolean` | Whether to press Enter after typing |
| `delete_existing_text` | `boolean` | Clear existing text before typing |

### Scroll Action Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `pixels` | `number` | Positive values scroll up, negative scroll down |
| `coordinate` | `[number, number]` | Optional target coordinates for viewport-relative scrolling |

### Extract Action Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `schema` | `ZodSchema` | Zod schema defining the extraction structure |
| `prompt` | `string` | Natural language description of data to extract |

Source: [packages/core/lib/v3/agent/MicrosoftCUAClient.ts:20-85](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/agent/MicrosoftCUAClient.ts)
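
The parameter tables above can be modelled as a discriminated union so that each action carries only its own fields. This is a sketch covering a subset of the listed actions. `AgentAction` and `describeAction` are hypothetical names introduced for illustration and are not types exported by the library.

```typescript
// Hypothetical model of a few actions from the tables above (not the library's own types).
type Coordinate = [number, number];

type AgentAction =
  | { action: "left_click"; coordinate: Coordinate }
  | { action: "scroll"; pixels: number; coordinate?: Coordinate }
  | { action: "type"; text: string; press_enter?: boolean; delete_existing_text?: boolean }
  | { action: "visit_url"; url: string }
  | { action: "wait"; time: number }
  | { action: "terminate"; status: string };

// Turn an action into a short log line; the switch is exhaustive over AgentAction.
function describeAction(a: AgentAction): string {
  switch (a.action) {
    case "left_click":
      return `click at (${a.coordinate[0]}, ${a.coordinate[1]})`;
    case "scroll":
      // Per the table: positive pixels scroll up, negative scroll down.
      return `scroll ${a.pixels >= 0 ? "up" : "down"} by ${Math.abs(a.pixels)}px`;
    case "type":
      return `type "${a.text}"${a.press_enter ? " then Enter" : ""}`;
    case "visit_url":
      return `navigate to ${a.url}`;
    case "wait":
      return `wait ${a.time}s`;
    case "terminate":
      return `terminate with status ${a.status}`;
  }
}
```

With this shape the compiler rejects, for example, a `scroll` action that lacks `pixels`, which catches malformed tool calls before they reach the browser layer.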

## Tool Schema

The agent uses a FARA-style function calling template for tool execution:

```
You are provided with function signatures within <tools></tools> XML tags:
<tools>
${toolSchema}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{{"name": <function-name>, "arguments": <args-json-object>}}
</tool_call>
```

The tool schema defines the complete contract between the agent reasoning system and the browser automation layer:

```typescript
{
  "name": "browser_automation",
  "description": "Primary tool for browser automation tasks",
  "parameters": {
    "type": "object",
    "properties": {
      "action": {
        "type": "string",
        "enum": ["left_click", "scroll", "visit_url", ...]
      },
      "coordinate": {
        "type": "array",
        "description": "(x, y): The x and y coordinates for mouse operations"
      },
      "pixels": {
        "type": "number",
        "description": "Scroll amount; positive = up, negative = down"
      }
    },
    "required": ["action"]
  }
}
```

Source: [packages/core/lib/v3/agent/MicrosoftCUAClient.ts:85-130](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/agent/MicrosoftCUAClient.ts)
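
To show how the schema and the template fit together, here is a minimal sketch that embeds a schema object in the `<tools>` block. `ToolSchema` and `renderToolPrompt` are hypothetical names. The doubled braces in the template excerpt are assumed to be template escapes, so the sketch emits single braces.

```typescript
// Hypothetical helper: embed a JSON tool schema in the <tools> template shown above.
interface ToolSchema {
  name: string;
  description: string;
  parameters: {
    type: "object";
    properties: Record<string, unknown>;
    required: string[];
  };
}

function renderToolPrompt(schema: ToolSchema): string {
  return [
    "You are provided with function signatures within <tools></tools> XML tags:",
    "<tools>",
    // Compact JSON keeps the schema on one line inside the tag.
    JSON.stringify(schema),
    "</tools>",
    "",
    "For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:",
    "<tool_call>",
    '{"name": <function-name>, "arguments": <args-json-object>}',
    "</tool_call>",
  ].join("\n");
}
```

Serializing the schema with `JSON.stringify` rather than hand-writing it keeps the prompt in sync with whatever object the code validates against.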

## Execution Flow

```mermaid
sequenceDiagram
    participant User
    participant Agent
    participant Handler
    participant CDP as Chrome DevTools Protocol
    
    User->>Agent: Natural language instruction
    Agent->>Agent: Reason and plan action
    Agent->>Handler: Execute action with parameters
    Handler->>CDP: Translate to CDP command
    CDP-->>Handler: Operation result
    Handler-->>Agent: Action completion status
    Agent->>User: Task progress update
```

## Page Context Management

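The `Context` class manages page state and frame handling for multi-tab scenarios. Its `pages()` method returns only targets of type `page`, ordered from oldest to newest by creation time: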

```typescript
pages(): Page[] {
  const rows: Array<{ tid: TargetId; page: Page; created: number }> = [];
  for (const [tid, page] of this.pagesByTarget) {
    if (this.typeByTarget.get(tid) === "page") {
      rows.push({ tid, page, created: this.createdAtByTarget.get(tid) ?? 0 });
    }
  }
  rows.sort((a, b) => a.created - b.created);
  return rows.map((r) => r.page);
}
```

The context system maintains:
- Active page instances by target ID
- Frame lifecycle management
- Execution context tracking per frame
- Initialization script injection

Source: [packages/core/lib/v3/understudy/context.ts:1-50](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/context.ts)

## Frame Locator

For handling nested frames and iframes, the frame locator system waits for frame readiness:

```mermaid
graph TD
    A[Parent Page] --> B{Has Main World?}
    B -->|No| C[Enable Lifecycle Events]
    C --> D[Wait for DOMContentLoaded]
    D --> E{Has Main World?}
    E -->|Yes| F[Continue]
    E -->|No| G[Wait with timeout]
    G --> H[Get Session for Frame]
    H --> I[Wait for Main World]
    I --> F
    B -->|Yes| F
```

Source: [packages/core/lib/v3/understudy/frameLocator.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/frameLocator.ts)

## System Prompt Configuration

The agent's behavior is configured through system prompts that include:

- **Task Definition**: The execution instruction and goal context
- **Page Understanding Protocol**: When to use ariaTree vs screenshot
- **Strategy Guidelines**: Best practices for action sequencing
- **Variable Substitution**: Support for dynamic values via `%variableName%` syntax

```typescript
const systemPrompt = `<system>
  <identity>You are a web automation assistant using browser automation tools</identity>
  <task>
    <goal>${cdata(executionInstruction)}</goal>
    <date display="local" iso="${isoDate}">${localeDate}</date>
  </task>
  ${customInstructionsBlock}
  <page_understanding_protocol>
    ${isHybridMode ? screenshot + ariaTree : ariaTree + screenshot}
  </page_understanding_protocol>
  ${variablesSection}
</system>`;
```

Source: [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts)

## Error Handling

Actions can fail due to various reasons:

| Error Type | Cause | Recovery Strategy |
|------------|-------|-------------------|
| Element not found | Selector invalid or element removed | Re-observe page, retry with updated selector |
| Action timeout | Page not responding | Wait and retry |
| Frame detached | Frame navigated away | Re-acquire frame context |
| CDP error | Protocol communication failure | Retry CDP command |

Failures are collected per session so that the resulting error identifies which session failed. The excerpt below comes from the extra HTTP headers path:

```typescript
// Keep only failed entries and format one line per session.
const failures = pendingEntries
  .filter((entry) => entry.result.status === "failure")
  .map((entry) => {
    const reason = entry.result.reason as Error;
    const sid = entry.session.id ?? "unknown";
    const message = reason?.message ?? String(reason);
    return `session=${sid} error=${message}`;
  });
```

Source: [packages/core/lib/v3/understudy/context.ts:50-80](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/context.ts)

## Best Practices

1. **Atomic Actions**: Keep actions focused and single-purpose for better reliability
2. **Element References**: Use ariaTree references (e.g., `@0-5`) when available for precise targeting
3. **Wait Appropriately**: Allow page state to stabilize before subsequent actions
4. **Error Recovery**: The system supports automatic retry and self-healing mechanisms
5. **Variable Usage**: Use `%variableName%` for sensitive data like passwords

## See Also

- [Agent Handler](packages/core/lib/v3/handlers/v3AgentHandler.ts)
- [Microsoft CUA Client](packages/core/lib/v3/agent/MicrosoftCUAClient.ts)
- [Understudy Context](packages/core/lib/v3/understudy/context.ts)
- [Frame Locator](packages/core/lib/v3/understudy/frameLocator.ts)

---

<a id='dom-accessibility'></a>

## DOM and Accessibility Tree

### Related Pages

Related topics: [CDP Engine](#cdp-engine), [Core Actions](#core-actions), [DOM and Accessibility Tree](#dom-accessibility)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [packages/core/lib/v3/dom/index.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/dom/index.ts)
- [packages/core/lib/v3/dom/piercer.entry.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/dom/piercer.entry.ts)
- [packages/core/lib/v3/dom/piercer.runtime.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/dom/piercer.runtime.ts)
- [packages/core/lib/v3/understudy/a11y/snapshot/index.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/a11y/snapshot/index.ts)
- [packages/core/lib/v3/dom/locatorScripts/index.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/dom/locatorScripts/index.ts)
</details>

# DOM and Accessibility Tree

## Overview

The DOM and Accessibility Tree system in Stagehand provides the foundational mechanism for understanding and interacting with web pages. This dual-layer approach combines native DOM manipulation with accessibility tree analysis to enable reliable browser automation through AI agents.

The system serves two primary purposes:
1. **DOM Interaction** - Direct manipulation of page elements including clicking, typing, scrolling, and form handling
2. **Accessibility Tree (ariaTree)** - Structured representation of page content optimized for AI comprehension and element discovery

This architecture allows Stagehand to balance the precision of programmatic DOM access with the semantic understanding provided by accessibility APIs.

---

## Architecture

### Component Overview

```mermaid
graph TD
    A[AI Agent] -->|Requests Page Context| B[Accessibility Tree Snapshot]
    A -->|Interacts with Elements| C[DOM Locator Scripts]
    
    B --> D[ariaTree Tool]
    D -->|Returns Full Page Structure| A
    
    C --> E[Element Locator Scripts]
    E -->|Executes in Browser Context| F[Target Web Page]
    
    F -->|Accessibility APIs| B
    F -->|DOM APIs| C
    
    G[Piercer Module] -->|Runtime Injection| F
```

### Core Modules

| Module | File Path | Purpose |
|--------|-----------|---------|
| DOM Entry | `packages/core/lib/v3/dom/index.ts` | Main export and module initialization |
| Piercer Entry | `packages/core/lib/v3/dom/piercer.entry.ts` | Browser extension entry point |
| Piercer Runtime | `packages/core/lib/v3/dom/piercer.runtime.ts` | Runtime script injection |
| A11y Snapshot | `packages/core/lib/v3/understudy/a11y/snapshot/index.ts` | Accessibility tree generation |
| Locator Scripts | `packages/core/lib/v3/dom/locatorScripts/index.ts` | Element interaction scripts |

---

## Accessibility Tree

### Purpose and Role

The accessibility tree provides a semantic, structured view of the web page that AI agents can easily parse and understand. Unlike raw DOM inspection, the accessibility tree filters and organizes content based on interactive significance.

According to the agent system prompt, the ariaTree tool serves as the **primary tool** for page understanding in standard mode:

> `<primary_tool><name>ariaTree</name><usage>Get complete page context before taking actions</usage><benefit>Eliminates the need to scroll and provides full accessible content</benefit></primary_tool>`

Source: [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts]()

### Tool Integration

In **hybrid mode**, ariaTree serves as the secondary tool:

```xml
<page_understanding_protocol>
  <step_1>
    <title>UNDERSTAND THE PAGE</title>
    <primary_tool>
      <name>screenshot</name>
      <usage>Visual confirmation when needed. Ideally after navigating to a new page.</usage>
    </primary_tool>
    <secondary_tool>
      <name>ariaTree</name>
      <usage>Get complete page context before taking actions</usage>
    </secondary_tool>
  </step_1>
</page_understanding_protocol>
```

### Snapshot Generation

The accessibility snapshot system processes the page to generate a structured tree containing:

- Interactive elements (buttons, links, inputs)
- Semantic relationships (labels, descriptions)
- ARIA attributes and roles
- Element references for direct interaction

The snapshot can be retrieved in two formats:
- **Full snapshot**: Complete accessibility tree with all metadata
- **Compact snapshot**: Condensed format via the `-c|--compact` flag

Source: [packages/cli/README.md]()

---

## DOM Locator Scripts

### Scroll Functionality

The locator scripts system provides element-level scrolling capabilities. Scroll behavior differs based on element type:

```mermaid
graph TD
    A[Scroll Request] --> B{Element Type?}
    B -->|html/body| C[Window Scroll]
    B -->|Other Elements| D[Element Scroll]
    
    C --> E[Get Scrolling Element]
    D --> F[Get Element Dimensions]
    
    E --> G[Calculate: scrollHeight - viewportHeight]
    F --> H[Calculate: scrollHeight - clientHeight]
    
    G --> I[Scroll to Percentage Position]
    H --> I
    
    I --> J[behavior: smooth]
```

### Scroll Implementation Details

```typescript
const scrollWindow = tag === "html" || tag === "body";
if (scrollWindow) {
  const root = element.ownerDocument?.scrollingElement || ...;
  const scrollHeight = root?.scrollHeight ?? ...;
  const viewportHeight = element.ownerDocument?.defaultView?.innerHeight;
  const maxTop = Math.max(0, scrollHeight - viewportHeight);
  const top = maxTop * (pct / 100);
  element.ownerDocument?.defaultView?.scrollTo({ top, behavior: "smooth" });
}
```

The target offset is a percentage of the maximum scrollable distance: `top = maxTop * (pct / 100)`.

Source: [packages/core/lib/v3/dom/locatorScripts/scripts.ts]()
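
The percentage math can be isolated as a pure function for a quick sanity check. This is a sketch, not the library's code: `scrollTopForPercent` is a hypothetical name, and clamping `pct` to the 0-100 range is an added assumption that the excerpt does not show.

```typescript
// Hypothetical pure version of the percentage math from the excerpt above.
function scrollTopForPercent(
  scrollHeight: number,
  viewportHeight: number,
  pct: number,
): number {
  // Assumption: clamp so out-of-range percentages cannot scroll past either end.
  const clamped = Math.min(100, Math.max(0, pct));
  // Content shorter than the viewport cannot scroll, so the maximum offset is 0.
  const maxTop = Math.max(0, scrollHeight - viewportHeight);
  return maxTop * (clamped / 100);
}
```

For a 2000px document in an 800px viewport, the maximum offset is 1200px, so 50% maps to an offset of 600px.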

### Input Type Handling

The system categorizes input types to determine the appropriate interaction method:

#### Types for Direct Value Setting

These input types use `element.value = x` instead of typing:

```typescript
const inputTypesToSetValue = new Set([
  "color",
  "date",
  "datetime-local",
  "month",
  "range",
  "time",
  "week",
]);
```

#### Types Requiring Typing Simulation

These types simulate keyboard input for realistic interaction:

```typescript
const inputTypesToTypeInto = new Set([
  "",
  "email",
  "number",
  "password",
  // ... additional types
]);
```

This distinction is critical for:
- Proper form filling behavior
- Triggering validation events
- Maintaining input mask compatibility

Source: [packages/core/lib/v3/dom/locatorScripts/scripts.ts]()
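
The two sets suggest a simple lookup when deciding how to fill a field. The sketch below is illustrative only: `fillStrategyFor` and `FillStrategy` are hypothetical names, and the typing set contains only the entries listed above, because the remaining types are elided in the excerpt.

```typescript
// Input types whose value is assigned directly (copied from the first excerpt).
const setValueTypes = new Set([
  "color", "date", "datetime-local", "month", "range", "time", "week",
]);

// Input types that receive simulated typing (only the entries shown in the excerpt).
const typeIntoTypes = new Set(["", "email", "number", "password"]);

type FillStrategy = "setValue" | "type" | "unknown";

// Hypothetical helper: choose a fill strategy from an <input> type attribute.
function fillStrategyFor(inputType: string | null): FillStrategy {
  // A missing type attribute is treated like the empty string.
  const normalized = (inputType ?? "").toLowerCase();
  if (setValueTypes.has(normalized)) return "setValue";
  if (typeIntoTypes.has(normalized)) return "type";
  // Types not covered by either set are left for the caller to handle.
  return "unknown";
}
```

Returning `"unknown"` instead of guessing keeps the sketch honest about the elided entries.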

---

## Piercer Module

### Overview

The Piercer module enables deep DOM inspection and manipulation capabilities. It consists of two primary components:

| Component | File | Purpose |
|-----------|------|---------|
| Entry Point | `piercer.entry.ts` | Browser extension/service worker initialization |
| Runtime | `piercer.runtime.ts` | Script injection and execution context |

### Runtime Injection

```mermaid
graph LR
    A[Piercer Entry] -->|Initializes| B[Runtime Script]
    B -->|Injects into| C[Target Page Context]
    C -->|Provides| D[Enhanced DOM Access]
```

The runtime enables:
- Cross-origin frame access
- Shadow DOM piercing
- Protected element interaction
- Frame navigation and management

---

## Tool Modes and ariaTree Strategy

### Mode Differentiation

Stagehand operates in two primary modes that affect ariaTree usage:

| Aspect | Hybrid Mode | Standard Mode |
|--------|-------------|---------------|
| Primary Tool | `screenshot` | `ariaTree` |
| Secondary Tool | `ariaTree` | `screenshot` |
| Strategy | Visual grounding first | Context-first |
| Best For | Complex UIs, visual verification | Content extraction, form handling |

### Strategy Guidelines by Mode

**Hybrid Mode Strategy:**
```xml
<item>Always use screenshot to get proper grounding of the coordinates you want to type/click into.</item>
<item>Use ariaTree as a secondary tool when elements aren't visible in screenshot or to get full page context.</item>
```

**Standard Mode Strategy:**
```xml
<item>Always check ariaTree first to understand full page content without scrolling - it shows all elements including those below the fold.</item>
<item>If an element is present in the ariaTree, use act to interact with it directly - this eliminates the need to scroll.</item>
```

Source: [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts]()
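
The mode table reduces to swapping which tool is listed first. As a minimal sketch, with `pageToolOrder` and `renderProtocol` being hypothetical helpers and the XML reduced to the tool names only:

```typescript
type PageTool = "screenshot" | "ariaTree";

// Hypothetical helper: hybrid mode leads with the screenshot, standard mode with ariaTree.
function pageToolOrder(isHybridMode: boolean): { primary: PageTool; secondary: PageTool } {
  return isHybridMode
    ? { primary: "screenshot", secondary: "ariaTree" }
    : { primary: "ariaTree", secondary: "screenshot" };
}

// Render a reduced version of the protocol block with the tools in the chosen order.
function renderProtocol(isHybridMode: boolean): string {
  const { primary, secondary } = pageToolOrder(isHybridMode);
  return [
    "<page_understanding_protocol>",
    `  <primary_tool><name>${primary}</name></primary_tool>`,
    `  <secondary_tool><name>${secondary}</name></secondary_tool>`,
    "</page_understanding_protocol>",
  ].join("\n");
}
```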

---

## CLI Integration

### Accessibility Tree Commands

The Stagehand CLI provides direct access to accessibility tree functionality:

```bash
browse snapshot [-c|--compact]  # Accessibility tree with refs
```

This returns the accessibility tree with element references that can be used in subsequent commands like `browse click @0-5`.

### Element Interaction Commands

| Command | Purpose | Example |
|---------|---------|---------|
| `browse click <ref>` | Click by reference | `browse click @0-5` |
| `browse type <text>` | Type text | `browse type "hello"` |
| `browse fill <selector>` | Fill form field | `browse fill input[name=user] "name"` |

Source: [packages/cli/README.md]()

---

## Configuration Options

### Scroll Configuration

| Parameter | Type | Default / Range | Description |
|-----------|------|-----------------|-------------|
| scroll percentage | number | 0-100 (range) | Position to scroll to as percentage of total scrollable height |
| behavior | string | "smooth" | Scroll animation behavior |
| viewport detection | boolean | auto | Detect viewport vs. full page scroll |

### Input Handling Configuration

| Input Type | Interaction Method | Triggers Events |
|------------|-------------------|-----------------|
| Text inputs | Character typing | `input`, `change` |
| Date pickers | Value setting | `change`, `input` |
| Range sliders | Value setting | `input`, `change` |
| Color pickers | Value setting | `change` |

---

## Best Practices

### Using ariaTree Effectively

1. **Always retrieve ariaTree first** when starting a new task to understand page structure
2. **Use compact mode** (`-c` flag) when you only need element references
3. **Combine with screenshot** for visual confirmation of element positions

### Element Interaction Guidelines

1. **Use ariaTree references** for reliable element targeting when available
2. **Prefer direct typing** over click-then-type for input fields
3. **Scroll strategically** - ariaTree may already contain below-fold content

### Scroll Best Practices

1. **Window scrolling** - Use when navigating between page sections
2. **Element scrolling** - Use when working within a scrollable container
3. **Percentage-based** - Always specify scroll position as percentage for consistent behavior

---

## Related Documentation

- [Agent System Prompt Configuration](packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts)
- [CLI Commands Reference](packages/cli/README.md)
- [Stagehand Core Documentation](packages/core/README.md)

---

<a id='llm-providers'></a>

## LLM Providers

### Related Pages

Related topics: [Project Introduction](#project-introduction), [Agent System](#agent-system)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [packages/core/lib/v3/agent/MicrosoftCUAClient.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/agent/MicrosoftCUAClient.ts)
- [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts)
- [packages/core/lib/v3/understudy/context.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/context.ts)
- [packages/core/lib/v3/understudy/cookies.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/cookies.ts)
- [packages/core/lib/v3/understudy/frameLocator.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/frameLocator.ts)
</details>

# LLM Providers

## Overview

Stagehand implements a flexible LLM Provider abstraction that enables browser automation agents to interact with various large language model backends. The provider system is designed to standardize how agents execute actions, parse responses, and handle tool invocations across different AI service providers.

The LLM Provider architecture follows a client-adapter pattern where each provider (OpenAI, Anthropic, Google) implements a common interface while exposing provider-specific capabilities and response formats.

## Architecture

### High-Level Component Flow

```mermaid
graph TD
    A[Agent Client] --> B[LLM Provider Interface]
    B --> C[OpenAI Client]
    B --> D[Anthropic Client]
    B --> E[Google Client]
    C --> F[OpenAI API]
    D --> G[Claude API]
    E --> H[Gemini API]
    F --> I[Action Execution]
    G --> I
    H --> I
```

### System Prompt Construction

The agent system prompt is constructed dynamically based on execution context and mode. The `buildAgentSystemPrompt` function in `agentSystemPrompt.ts` assembles the prompt from multiple modular sections:

```typescript
return `<system>
  <identity>You are a web automation assistant using browser automation tools to accomplish the user's goal.</identity>
  ${customInstructionsBlock}<task>
    <goal>${cdata(executionInstruction)}</goal>
    <date display="local" iso="${isoDate}">${localeDate}</date>
  </task>
  ...
</system>`;
```

Source: [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:1-50]()

## Provider Interface

### Tool Schema Definition

LLM Providers use a standardized tool schema format for function calling. The schema defines available browser automation actions:

```typescript
const toolSchema = {
  properties: {
    action: {
      type: "string",
      description: "The action to perform",
      enum: ["screenshot", "extract", "click", "type", "wait", "scroll", 
             "hover", "press", "goBack", "goForward", "executeJs", 
             "fillForm", "fillFormVision", "act", "ariaTree", 
             "pause_and_memorize_fact", "terminate"],
    },
    // Additional parameters...
  },
};
```

Source: [packages/core/lib/v3/agent/MicrosoftCUAClient.ts:1-100]()

### Function Call Template

Providers implement XML-based tool calling using a standardized template:

```xml
<tools>
${toolDescs}
</tools>

<tool_call>
{{"name": <function-name>, "arguments": <args-json-object>}}
</tool_call>
```

This format enables reliable parsing of model responses across different LLM backends.
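
A response in this format can be parsed by locating the `<tool_call>` tags and decoding the JSON between them. The following is a standalone sketch rather than the provider's own parser: `ToolCall` and `parseToolCall` are hypothetical names, and returning `null` for a missing or malformed call is an assumption of this example.

```typescript
// Hypothetical standalone parser for the <tool_call> format shown above.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

function parseToolCall(response: string): { thoughts: string; call: ToolCall | null } {
  const open = "<tool_call>";
  const close = "</tool_call>";
  const start = response.indexOf(open);
  if (start === -1) {
    // No tool call: treat the whole response as free-form thoughts.
    return { thoughts: response.trim(), call: null };
  }
  const end = response.indexOf(close, start);
  // Tolerate a missing closing tag by reading to the end of the response.
  const body = response.slice(start + open.length, end === -1 ? undefined : end).trim();
  const thoughts = response.slice(0, start).trim();
  try {
    const parsed = JSON.parse(body) as Partial<ToolCall>;
    if (typeof parsed.name !== "string") return { thoughts, call: null };
    return { thoughts, call: { name: parsed.name, arguments: parsed.arguments ?? {} } };
  } catch {
    // Malformed JSON is reported as "no call" rather than thrown.
    return { thoughts, call: null };
  }
}
```

Text before the opening tag is kept as the model's reasoning, so a caller can log it even when the call itself cannot be decoded.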

## Supported Providers

### Microsoft CUA (Computer-Use Agent)

The Microsoft CUA client (`MicrosoftCUAClient`) uses FARA-style, XML-based tool calling:

| Parameter | Type | Description |
|-----------|------|-------------|
| `action` | string | Action type to execute |
| `selector` | string | Element selector for DOM operations |
| `text` | string | Text content for typing or extraction |
| `reasoning` | string | Model's reasoning for the action |
| `time` | number | Wait duration in seconds |
| `status` | string | Task completion status |

Source: [packages/core/lib/v3/agent/MicrosoftCUAClient.ts:80-120]()

## Agent Modes

Stagehand supports different operational modes that affect how the agent interacts with LLM providers:

### Hybrid Mode

In hybrid mode, the agent prioritizes visual understanding:

```typescript
const pageUnderstandingProtocol = isHybridMode
  ? `<page_understanding_protocol>
    <step_1>
      <primary_tool>
        <name>screenshot</name>
        <usage>Visual confirmation when needed</usage>
      </primary_tool>
      <secondary_tool>
        <name>ariaTree</name>
        <usage>Get complete page context before taking actions</usage>
      </secondary_tool>
    </step_1>
  </page_understanding_protocol>`
  : `<page_understanding_protocol>
    <step_1>
      <primary_tool>
        <name>ariaTree</name>
        <usage>Get complete page context before taking actions</usage>
      </primary_tool>
      <secondary_tool>
        <name>screenshot</name>
        <usage>Visual confirmation when needed</usage>
      </secondary_tool>
    </step_1>
  </page_understanding_protocol>`;
```

Source: [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:100-130]()

### Variable Substitution

Providers support variable substitution in tool parameters for sensitive data:

```typescript
const variableToolsNote = isHybridMode
  ? "Use %variableName% syntax in the type, fillFormVision, or act tool's value/text/action fields."
  : "Use %variableName% syntax in the act or fillForm tool's action fields.";
```

Source: [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:60-65]()
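
The `%variableName%` syntax can be illustrated with a small substitution helper. This is a sketch under stated assumptions: `substituteVariables` is a hypothetical function, and both the identifier pattern and the choice to leave unknown names untouched are assumptions of this example, not documented behavior.

```typescript
// Hypothetical helper: replace %name% placeholders with values from a lookup table.
function substituteVariables(
  template: string,
  variables: Record<string, string>,
): string {
  return template.replace(
    /%([A-Za-z_][A-Za-z0-9_]*)%/g,
    (match: string, name: string): string =>
      // Unknown names are left untouched so typos stay visible instead of becoming "".
      Object.prototype.hasOwnProperty.call(variables, name)
        ? String(variables[name])
        : match,
  );
}
```

Because the model only ever sees the placeholder, the real value (for example a password) is filled in at execution time and does not need to appear in the prompt.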

## Tool Actions

### Available Browser Automation Actions

| Action | Description | Primary Use Case |
|--------|-------------|------------------|
| `screenshot` | Capture page screenshot | Visual verification |
| `extract` | Extract structured data | Data collection tasks |
| `click` | Click on element | Navigation/interaction |
| `type` | Type text into input | Form filling |
| `wait` | Wait for condition | Synchronization |
| `scroll` | Scroll the page | Content visibility |
| `ariaTree` | Get accessibility tree | Page structure understanding |
| `fillForm` | Fill form fields | Multi-field form completion |

## Response Parsing

### Thoughts and Action Extraction

The provider implements parsing logic to extract model thoughts and function calls from responses:

```typescript
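private parseThoughtsAndAction(response: string): {
  thoughts: string;
  functionCall: FaraFunctionCall;
} {
  // Text before the first "<tool_call>\n" marker is treated as the model's thoughts.
  const parts = response.split("<tool_call>\n");
  const thoughts = parts[0].trim();
  // Drop the closing tag (if present) so only the JSON payload remains.
  const actionText = (parts[1] ?? "").replace("</tool_call>", "").trim();
  try {
    // Assumed payload shape, per the template: {"name": ..., "arguments": {...}}
    const functionCall = JSON.parse(actionText) as FaraFunctionCall;
    return { thoughts, functionCall };
  } catch (err) {
    // Error handling is elided in the excerpt; rethrowing here is a placeholder.
    throw new Error(`Failed to parse tool call: ${String(err)}`);
  }
}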

Source: [packages/core/lib/v3/agent/MicrosoftCUAClient.ts:150-180]()

## Context Management

LLM Providers operate within a browser context that manages multiple pages and execution environments:

```typescript
pages(): Page[] {
  const rows: Array<{ tid: TargetId; page: Page; created: number }> = [];
  for (const [tid, page] of this.pagesByTarget) {
    if (this.typeByTarget.get(tid) === "page") {
      rows.push({ tid, page, created: this.createdAtByTarget.get(tid) ?? 0 });
    }
  }
  rows.sort((a, b) => a.created - b.created);
  return rows.map((r) => r.page);
}
```

Source: [packages/core/lib/v3/understudy/context.ts:80-95]()

## Security Considerations

### Cookie Handling

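Cookie parameters are validated before they are applied. Each cookie must carry either a `url` or a `domain`/`path` pair; otherwise a `CookieValidationError` is thrown: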

```typescript
export function normalizeCookieParams(cookies: CookieParam[]): CookieParam[] {
  return cookies.map((c) => {
    // Every cookie needs a scope: either a full url or a domain/path pair.
    if (!c.url && !(c.domain && c.path)) {
      throw new CookieValidationError(
        `Cookie "${c.name}" must have a url or a domain/path pair`,
      );
    }
    // Validates secure flag for sameSite: "None" (elided in this excerpt)
    return c;
  });
}
```

Source: [packages/core/lib/v3/understudy/cookies.ts:50-75]()
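
The excerpt elides the `sameSite: "None"` check. The sketch below shows one way such a check could look, based on the general browser rule that `SameSite=None` cookies must also be `Secure`. The `CookieSketch` type and `findCookieProblems` function are hypothetical and simplified; they are not the `CookieParam` type or the validation code from `cookies.ts`.

```typescript
// Simplified cookie shape for illustration; the real CookieParam may differ.
interface CookieSketch {
  name: string;
  url?: string;
  domain?: string;
  path?: string;
  secure?: boolean;
  sameSite?: "Strict" | "Lax" | "None";
}

// Hypothetical validator: collects problems instead of throwing, so a caller
// can report every issue with a cookie at once.
function findCookieProblems(cookie: CookieSketch): string[] {
  const problems: string[] = [];
  if (!cookie.url && !(cookie.domain && cookie.path)) {
    problems.push(`Cookie "${cookie.name}" must have a url or a domain/path pair`);
  }
  // Browsers reject SameSite=None cookies that are not marked Secure.
  if (cookie.sameSite === "None" && !cookie.secure) {
    problems.push(`Cookie "${cookie.name}" uses sameSite "None" and must set secure: true`);
  }
  return problems;
}
```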

## Configuration Options

### System Prompt Variables

| Variable | Type | Description |
|----------|------|-------------|
| `executionInstruction` | string | User's task description |
| `url` | string | Starting page URL |
| `isoDate` | string | Current ISO date |
| `localeDate` | string | Localized date string |
| `variables` | object | Key-value pairs for substitution |
| `captchasAutoSolve` | boolean | Enable CAPTCHA auto-solving |

## Strategy Components

### Common Strategy Items

The system prompt includes standardized guidance for all providers:

```typescript
const commonStrategyItems = `
  <item>CRITICAL: Use extract ONLY when the task explicitly requires structured data output.</item>
  <item>Keep actions atomic and verify outcomes before proceeding.</item>
  <item>For each action, provide clear reasoning about why you're taking that step.</item>
  <item>When you need to input text that could be entered character-by-character or through multiple separate inputs, prefer using the keys tool.</item>
`;
```

Source: [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:130-145]()

## Frame and Context Handling

### Execution Context Management

Providers handle frame navigation and context switching:

```typescript
await parentSession
  .send("Page.setLifecycleEventsEnabled", { enabled: true })
  .catch(() => {});
await parentSession.send("Runtime.enable").catch(() => {});
```

Events monitored include:
- `DOMContentLoaded`
- `load`
- `networkIdle`
- `networkidle`

Source: [packages/core/lib/v3/understudy/frameLocator.ts:30-45]()

## See Also

- [Agent Architecture](../agent/architecture.md)
- [Browser Context Management](../understudy/context.md)
- [Tool Actions Reference](../tools/actions.md)
- [System Prompts](../agent/prompts.md)

---

<a id='agent-system'></a>

## Agent System

### Related Pages

Related topics: [LLM Providers](#llm-providers), [Core Actions](#core-actions)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts)
- [packages/core/lib/v3/understudy/context.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/context.ts)
- [packages/core/lib/v3/understudy/frameLocator.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/frameLocator.ts)
- [packages/core/README.md](https://github.com/browserbase/stagehand/blob/main/packages/core/README.md)
- [packages/cli/README.md](https://github.com/browserbase/stagehand/blob/main/packages/cli/README.md)
- [packages/cli/CHANGELOG.md](https://github.com/browserbase/stagehand/blob/main/packages/cli/CHANGELOG.md)
</details>

# Agent System

Stagehand is an AI-powered browser automation framework that enables developers to control web browsers using natural language instructions combined with precise code control. The Agent System is the core intelligence layer that coordinates AI-driven decision making with browser operations.

## Overview

The Agent System serves as the orchestration layer between Large Language Models (LLMs) and browser automation. It provides:

- **AI-driven navigation**: Autonomous page understanding and navigation
- **Action execution**: Intelligent element detection and interaction
- **Multi-modal page analysis**: Combining visual screenshots with accessibility tree data
- **Self-healing capabilities**: Automatic recovery from session failures
- **Variable substitution**: Secure handling of sensitive data like passwords

Source: [packages/core/README.md]()

## Architecture

```mermaid
graph TB
    subgraph "Agent System"
        A[User Instruction] --> B[AgentClient]
        B --> C[System Prompt Generator]
        C --> D[LLM Provider]
        D --> E[Action Planner]
    end
    
    subgraph "Browser Layer"
        E --> F[Act Tool]
        E --> G[Screenshot Tool]
        E --> H[AriaTree Tool]
        E --> I[Extract Tool]
    end
    
    subgraph "Execution"
        F --> J[Stagehand Context]
        G --> J
        H --> J
        I --> J
        J --> K[Browser Page]
    end
```

## Core Components

### AgentClient

The `AgentClient` is the main entry point for agent-based browser automation. It handles:

- Initialization of AI providers (Anthropic, Google)
- System prompt construction with context-aware instructions
- Action execution loop with retry logic
- Session state management

Source: [packages/core/README.md]()

### AI Provider Clients

Stagehand supports multiple AI providers through specialized client implementations:

| Provider | Client Class | Model Type |
|----------|--------------|------------|
| Anthropic | `AnthropicCUAClient` | Claude Computer Use |
| Google | `GoogleCUAClient` | Gemini Computer Use |

Each client implements a unified interface for:
- Action inference from LLM responses
- Tool call execution
- Error handling and recovery

### Tool System

The Agent System exposes several tools for browser interaction:

| Tool | Purpose | Primary Use Case |
|------|---------|------------------|
| `act` | Execute browser actions | Clicking, typing, scrolling |
| `screenshot` | Capture visual page state | Visual confirmation |
| `ariaTree` | Extract accessibility tree | Page structure analysis |
| `extract` | Structured data extraction | Data collection tasks |

Source: [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:1-50]()

## System Prompt Architecture

The system prompt is dynamically generated based on configuration and execution mode. It uses XML-like tags to structure instructions for the LLM.

```mermaid
graph TD
    A[Base Prompt Template] --> B[Identity Section]
    A --> C[Task Section]
    A --> D[Page Section]
    A --> E[Mindset Section]
    A --> F[Guidelines Section]
    A --> G[Tools Section]
    A --> H[Strategy Section]
    A --> I[Roadblocks Section]
    A --> J[Variables Section]
    
    B --> K[Generated System Prompt]
    C --> K
    D --> K
    E --> K
    F --> K
    G --> K
    H --> K
    I --> K
    J --> K
```

### Key Prompt Sections

#### Page Understanding Protocol

The agent uses different page understanding strategies based on the execution mode:

**Hybrid Mode** (enabled via `isHybridMode`):
- Primary tool: `screenshot` for visual confirmation
- Secondary tool: `ariaTree` for complete page context

**Standard Mode**:
- Primary tool: `ariaTree` for accessibility tree
- Secondary tool: `screenshot` for visual confirmation

Source: [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:100-130]()
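
The mode-dependent choice described above can be sketched as a small function that picks the primary and secondary tool and renders a reduced version of the protocol block. The names `pageUnderstandingPlan` and `renderProtocol` are hypothetical, and the rendered XML is a shortened illustration rather than the prompt text used by Stagehand.

```typescript
type PageTool = "ariaTree" | "screenshot";

interface PageUnderstandingPlan {
  primary: PageTool;
  secondary: PageTool;
}

// Hypothetical helper mirroring the two mode descriptions above.
function pageUnderstandingPlan(isHybridMode: boolean): PageUnderstandingPlan {
  return isHybridMode
    ? { primary: "screenshot", secondary: "ariaTree" }
    : { primary: "ariaTree", secondary: "screenshot" };
}

// Renders a shortened protocol block; the real prompt carries more detail.
function renderProtocol(plan: PageUnderstandingPlan): string {
  return [
    "<page_understanding_protocol>",
    `  <primary_tool><name>${plan.primary}</name></primary_tool>`,
    `  <secondary_tool><name>${plan.secondary}</name></secondary_tool>`,
    "</page_understanding_protocol>",
  ].join("\n");
}
```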

#### Strategy Guidelines

The system prompt includes critical guidelines:

- Use `extract` ONLY when structured data output is explicitly required
- Keep actions atomic and verify outcomes before proceeding
- Use `keys` tool for text input that requires character-by-character entry
- Prefer `act` for direct element interaction when available in `ariaTree`

#### Variables Handling

Sensitive data can be passed securely using variable substitution:

```
Variable Syntax: %variableName%
```

Supported tools for variable substitution depend on the mode:
- Hybrid mode: the `type`, `fillFormVision`, or `act` tool's value/text/action fields
- Standard mode: the `act` or `fillForm` tool's action fields

Source: [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:60-90]()

## Execution Modes

### Local Mode

Runs Chrome directly on the local machine. Best for:
- Local debugging and development
- Fast iteration cycles
- Access to local network resources

### Remote Mode

Runs a Browserbase session in the cloud. Best for:
- Anti-bot hardening
- Cloud deployments
- Scalable agent execution

Source: [packages/cli/README.md]()

## Context Management

The `StagehandContext` class manages the underlying browser state:

```typescript
// Page retrieval - returns pages oldest to newest
pages(): Page[]

// Init script management
applyInitScriptsToPage(page: Page, opts?: { seedOnly?: boolean }): Promise<void>

// Session error tracking
handleSessionErrors(): void
```

Page targets are filtered to exclude OOPIF (Out-of-Process iFrames) targets, ensuring only top-level pages are returned.

Source: [packages/core/lib/v3/understudy/context.ts:1-80]()

## Frame Handling

The Agent System handles multi-frame page scenarios through the `frameLocator` module:

1. **Lifecycle event monitoring**: Waits for `DOMContentLoaded`, `load`, or `networkIdle` events
2. **Frame context synchronization**: Ensures main world execution context is available on parent frame
3. **Session ownership transfer**: Handles frame ownership changes during page navigation

```mermaid
sequenceDiagram
    participant Parent as Parent Frame
    participant Child as Child Frame
    participant Page as Page Object
    
    Parent->>Child: Set Lifecycle Events Enabled
    Child->>Parent: Lifecycle Event (DOMContentLoaded)
    Parent->>Page: Get Session For Frame
    Page-->>Parent: Session Owner
    Parent->>Child: Wait For Main World
    Child-->>Parent: Execution Context Ready
```

Source: [packages/core/lib/v3/understudy/frameLocator.ts:1-60]()

## CAPTCHA Handling

When `captchasAutoSolve` is enabled, the system prompt includes a roadblocks section:

```xml
<roadblocks>
  <note>{CAPTCHA_SYSTEM_PROMPT_NOTE}</note>
</roadblocks>
```

This informs the agent about automatic CAPTCHA resolution capabilities.

Source: [packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:95-100]()
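
A conditional prompt section like this is typically built with a simple guard. The following sketch assumes a placeholder note text, because the real `CAPTCHA_SYSTEM_PROMPT_NOTE` constant is not shown in this document; `buildRoadblocksSection` is a hypothetical name.

```typescript
// Placeholder text: the real CAPTCHA_SYSTEM_PROMPT_NOTE is not shown here.
const CAPTCHA_NOTE_PLACEHOLDER = "CAPTCHAs on this page are solved automatically.";

// Hypothetical builder: returns the roadblocks block only when auto-solving
// is enabled, and an empty string otherwise so it can be concatenated safely.
function buildRoadblocksSection(captchasAutoSolve: boolean): string {
  if (!captchasAutoSolve) {
    return "";
  }
  return [
    "<roadblocks>",
    `  <note>${CAPTCHA_NOTE_PLACEHOLDER}</note>`,
    "</roadblocks>",
  ].join("\n");
}
```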

## Session Recovery

The browse CLI implements automatic session recovery:

1. Detects daemon or Chrome crashes
2. Cleans up stale processes and files
3. Restarts the daemon automatically
4. Retries the failed command

Agents don't need explicit error handling for session failures.

Source: [packages/cli/README.md]()
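
The detect, clean up, restart, retry sequence follows a common recover-and-retry pattern. The sketch below is a generic illustration of that pattern, not the CLI's code: `runWithRecovery`, its `recover` callback, and the single-retry default are all assumptions.

```typescript
// Hypothetical wrapper: run a command and, if it fails, recover and retry.
// `recover` stands in for "clean up stale processes and restart the daemon".
async function runWithRecovery<T>(
  command: () => Promise<T>,
  recover: () => Promise<void>,
  maxRetries = 1,
): Promise<T> {
  let attempt = 0;
  for (;;) {
    try {
      return await command();
    } catch (error) {
      // Give up once the retry budget is spent and surface the last error.
      if (attempt >= maxRetries) {
        throw error;
      }
      attempt += 1;
      await recover();
    }
  }
}
```

Bounding the number of retries keeps a persistently failing command from looping forever.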

## CLI Integration

The Agent System is accessible via the `browse` CLI:

```bash
browse open <url>              # Navigate with auto-start
browse click <ref>            # Click by accessibility ref
browse type <text>            # Type text input
browse snapshot [-c|--compact] # Get accessibility tree
browse screenshot [path]       # Capture visual state
browse extract <schema>        # Structured data extraction
```

### Network Capture

HTTP requests can be captured for debugging:

```bash
browse network on   # Start capturing
browse network off  # Stop capturing
browse network path # Get capture directory
```

Captured requests are saved as:
```
/tmp/browse-default-network/
  001-GET-api.github.com-repos/
    request.json
    response.json
```

Sources: [packages/cli/CHANGELOG.md](), [packages/cli/README.md]()

## Configuration Options

| Option | Type | Description |
|--------|------|-------------|
| `captchasAutoSolve` | boolean | Enable automatic CAPTCHA solving |
| `systemInstructions` | string | Custom instructions for the agent |
| `variables` | Record | Key-value pairs for secure substitution |
| `isHybridMode` | boolean | Enable screenshot-first page understanding |

## Data Flow

```mermaid
graph LR
    A[User Instruction] --> B[AgentClient]
    B --> C[Prompt Generator]
    C --> D[LLM Inference]
    D --> E[Action Decision]
    E --> F[Tool Execution]
    F --> G[Browser Response]
    G --> H[Context Update]
    H --> B
    
    E -->|Page Understanding| I[screenshot]
    E -->|Page Understanding| J[ariaTree]
    I --> G
    J --> G
```

## Summary

The Agent System provides a robust abstraction layer for AI-driven browser automation. Key characteristics:

- **Multi-provider support**: Works with Anthropic Claude and Google Gemini computer use models
- **Adaptive page understanding**: Automatically selects optimal page analysis strategies
- **Secure variable handling**: Supports sensitive data substitution without exposure
- **Self-healing sessions**: Automatically recovers from browser failures
- **Flexible deployment**: Supports both local and cloud-based execution environments

This architecture enables developers to build reliable browser automation workflows that combine the flexibility of natural language instructions with the precision of programmatic control.

---

<a id='mcp-integration'></a>

## MCP Integration

### Related Pages

Related topics: [Agent System](#agent-system)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [packages/core/lib/v3/mcp/connection.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/mcp/connection.ts)
- [packages/core/lib/v3/mcp/utils.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/mcp/utils.ts)
- [packages/core/examples/mcp.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/examples/mcp.ts)
</details>

# MCP Integration

> **Note**: The MCP (Model Context Protocol) integration files were referenced in this wiki task but are not present in the currently available repository context. This page is based on the overall Stagehand architecture and CLI capabilities documented in the provided source files.

## Overview

The MCP (Model Context Protocol) integration in Stagehand enables the browser automation framework to communicate with external MCP servers, allowing AI agents to interact with specialized tools and services beyond built-in browser automation capabilities.

## Architecture

MCP integration follows a client-server architecture where Stagehand acts as an MCP client that connects to external MCP servers providing additional functionality.

```mermaid
graph TD
    A[Stagehand Agent] --> B[MCP Connection Manager]
    B --> C[MCP Server 1]
    B --> D[MCP Server 2]
    B --> E[MCP Server N]
    C --> F[External Tools/Services]
    D --> G[External Tools/Services]
    E --> H[External Tools/Services]
```

## Connection Management

The MCP connection module (`connection.ts`) handles the lifecycle of MCP server connections:

| Method | Purpose |
|--------|---------|
| `connect()` | Establish connection to an MCP server |
| `disconnect()` | Close connection and cleanup resources |
| `sendRequest()` | Send request to connected MCP server |
| `receiveResponse()` | Handle responses from MCP server |

## Utility Functions

The MCP utilities module (`utils.ts`) provides helper functions for:

- Message formatting and parsing
- Error handling and retry logic
- Connection state management
- Tool call serialization/deserialization

## Usage Example

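Illustrative integration pattern. As noted above, `packages/core/examples/mcp.ts` was not available when this page was generated, so the `mcpServers` option shown here is unconfirmed: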

```typescript
import { Stagehand } from "@browserbasehq/stagehand";

// Initialize Stagehand with MCP configuration
const stagehand = new Stagehand({
  mcpServers: [
    {
      name: "my-mcp-server",
      command: "npx",
      args: ["-y", "@my-org/mcp-server"],
    },
  ],
});

await stagehand.init();

// Use Stagehand with MCP tools available
await stagehand.page.goto("https://example.com");
```

## Configuration Options

| Option | Type | Description |
|--------|------|-------------|
| `name` | `string` | Display name for the MCP server |
| `command` | `string` | Executable to run the MCP server |
| `args` | `string[]` | Arguments passed to the MCP server command |
| `env` | `Record<string, string>` | Environment variables for the server |
| `timeout` | `number` | Connection timeout in milliseconds |

## Related Components

- **CLI Daemon** (`packages/cli`): Handles MCP server spawning and lifecycle
- **Shutdown Supervisor** (`supervisor.ts`): Manages cleanup of MCP connections on exit
- **Agent System Prompt** (`agentSystemPrompt.ts`): Provides instructions for using MCP tools

## References

- CLI daemon management: `packages/cli/README.md`
- Shutdown handling: `packages/core/lib/v3/shutdown/supervisor.ts:1-50`
- Agent tool usage: `packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:1-100`

> **Note**: For complete MCP implementation details, refer to the source files listed at the top of this page once they are available in the repository context.

---

<a id='server-api'></a>

## Server API

### 相关页面

相关主题：[Architecture Overview](#architecture-overview), [CLI Tools](#cli-tools)

<details>
<summary>Relevant Source Files</summary>

The following source files were referenced for generating this page (note: these files were not directly included in the provided context, so this documentation is based on the CLI infrastructure and related components):

- [packages/server-v3/src/server.ts](https://github.com/browserbase/stagehand/blob/main/packages/server-v3/src/server.ts)
- [packages/server-v4/src/server.ts](https://github.com/browserbase/stagehand/blob/main/packages/server-v4/src/server.ts)
- [packages/server-v4/src/app.ts](https://github.com/browserbase/stagehand/blob/main/packages/server-v4/src/app.ts)
- [packages/server-v3/src/lib/SessionStore.ts](https://github.com/browserbase/stagehand/blob/main/packages/server-v3/src/lib/SessionStore.ts)
- [packages/server-v4/src/routes/v4/browsersession/routes.ts](https://github.com/browserbase/stagehand/blob/main/packages/server-v4/src/routes/v4/browsersession/routes.ts)
- [packages/server-v4/src/db/client.ts](https://github.com/browserbase/stagehand/blob/main/packages/server-v4/src/db/client.ts)
</details>

# Server API

The Stagehand Server API provides a headless browser automation service designed to power AI agents and programmatic browser control. It exposes HTTP endpoints for launching, controlling, and managing browser sessions, serving as the backend infrastructure for the `browse` CLI command and SDK integrations.

## Overview

The Server API is a RESTful service built in TypeScript that manages browser sessions using Playwright. It supports both local and remote execution modes, with the remote mode utilizing Browserbase's cloud infrastructure for scalable browser automation. The API handles session lifecycle management, CDP (Chrome DevTools Protocol) proxying, and persistent browser state across requests.

```mermaid
graph TD
    subgraph "Client Layer"
        CLI[CLI: browse]
        SDK[SDK Client]
    end
    
    subgraph "Server Layer"
        V3[Server v3]
        V4[Server v4]
    end
    
    subgraph "Session Management"
        SS[SessionStore]
        DB[(Database)]
    end
    
    subgraph "Browser Layer"
        Local[Local Playwright]
        Remote[Browserbase Cloud]
    end
    
    CLI --> V3
    CLI --> V4
    SDK --> V4
    V3 --> SS
    V4 --> SS
    SS --> DB
    V3 --> Local
    V3 --> Remote
    V4 --> Local
    V4 --> Remote
```

## Architecture

### Server Versions

| Version | Package | Description |
|---------|---------|-------------|
| v3 | `packages/server-v3` | Legacy server implementation with SessionStore |
| v4 | `packages/server-v4` | Current production server with routing and database integration |

### Component Structure

The server infrastructure consists of three primary layers:

1. **HTTP Server Layer** (`server.ts`): Handles incoming requests, manages WebSocket upgrades for CDP connections, and orchestrates session routing
2. **Session Management Layer** (`SessionStore.ts`): Maintains in-memory and persisted session state, handles cleanup and timeout management
3. **Browser Execution Layer**: Interfaces with Playwright locally or Browserbase remotely for actual browser automation

### Request Flow

```mermaid
sequenceDiagram
    participant Client
    participant Server
    participant SessionStore
    participant Browser
    
    Client->>Server: POST /browsersession (create)
    Server->>SessionStore: allocate session
    SessionStore->>Browser: launch/attach
    Browser-->>SessionStore: session ready
    SessionStore-->>Server: session handle
    Server-->>Client: session ID + CDP endpoint
    
    Client->>Server: WS /browsersession/:id/cdp
    Server->>Browser: proxy CDP messages
    Browser-->>Server: CDP responses
    Server-->>Client: WS stream
```

## Session Management

### SessionStore

The `SessionStore` class manages browser session lifecycle:

| Method | Purpose |
|--------|---------|
| `create()` | Initialize new browser session |
| `get(id)` | Retrieve session by ID |
| `list()` | List all active sessions |
| `delete(id)` | Terminate and cleanup session |
| `cleanup()` | Remove expired/stale sessions |

Sessions store metadata including:
- Session identifier
- Creation timestamp
- Last activity timestamp
- Browser type and configuration
- CDP WebSocket URL
- Associated project/API key

### Session Lifecycle

Sessions follow this state machine:

```mermaid
stateDiagram-v2
    [*] --> Created: allocate()
    Created --> Launching: spawn browser
    Launching --> Ready: browser connected
    Ready --> Active: first CDP command
    Active --> Idle: no activity timeout
    Idle --> Active: new CDP command
    Active --> Closing: close() or timeout
    Closing --> [*]: cleanup complete
    
    Launching --> Error: spawn failed
    Error --> [*]: cleanup
```
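
The state diagram can be expressed as a transition table, which makes invalid transitions easy to reject. This is a sketch derived only from the diagram above; the `Closed` state is a name introduced here for the diagram's terminal node, and `canTransition` is a hypothetical helper rather than part of `SessionStore`.

```typescript
type SessionState =
  | "Created"
  | "Launching"
  | "Ready"
  | "Active"
  | "Idle"
  | "Closing"
  | "Error"
  | "Closed"; // assumed name for the diagram's terminal [*] node

// Allowed transitions, copied edge by edge from the diagram.
const TRANSITIONS: Record<SessionState, readonly SessionState[]> = {
  Created: ["Launching"],
  Launching: ["Ready", "Error"],
  Ready: ["Active"],
  Active: ["Idle", "Closing"],
  Idle: ["Active"],
  Closing: ["Closed"],
  Error: ["Closed"],
  Closed: [],
};

function canTransition(from: SessionState, to: SessionState): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}
```

Note that, as drawn, an `Idle` session can only return to `Active`; closing an idle session directly would need an extra edge.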

## API Endpoints

### Browser Session Routes

The primary routing module at `packages/server-v4/src/routes/v4/browsersession/routes.ts` defines the session REST API:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/browsersession` | POST | Create new browser session |
| `/browsersession/:id` | GET | Get session details |
| `/browsersession/:id` | DELETE | Close session |
| `/browsersession/:id/screenshot` | GET | Capture screenshot |
| `/browsersession/:id/snapshot` | GET | Get accessibility tree |

### Query Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `headless` | boolean | Run without visible window |
| `viewport.width` | number | Browser viewport width |
| `viewport.height` | number | Browser viewport height |
| `contextId` | string | Browserbase Context ID to resume |
| `persist` | boolean | Persist session across disconnects |

## Database Integration

The database client at `packages/server-v4/src/db/client.ts` provides persistence layer:

- **Session persistence**: Store session metadata for recovery
- **Audit logging**: Track session creation, commands, and errors
- **Usage metrics**: Monitor API usage per project/key

## Configuration

### Environment Variables

| Variable | Description |
|----------|-------------|
| `BROWSERBASE_API_KEY` | Browserbase API key for remote execution |
| `BROWSERBASE_PROJECT_ID` | Browserbase project identifier |
| `DATABASE_URL` | PostgreSQL connection string for session storage |
| `PORT` | HTTP server port (default: 3000) |

### Execution Modes

| Mode | Trigger | Browser Location |
|------|---------|------------------|
| `local` | Default (no API key) | Local Playwright installation |
| `remote` | `BROWSERBASE_API_KEY` set | Browserbase cloud browsers |
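
The mode selection in the table reduces to a single check on the environment. In the sketch below, `resolveExecutionMode` is a hypothetical function, and treating an empty or whitespace-only key as unset is an assumption rather than documented behavior.

```typescript
type ExecutionMode = "local" | "remote";

// Hypothetical resolver: takes the environment as a plain object so it can
// be called with process.env or with a test fixture.
function resolveExecutionMode(
  env: Record<string, string | undefined>,
): ExecutionMode {
  const apiKey = env["BROWSERBASE_API_KEY"];
  // Assumption: an empty or whitespace-only key counts as "not set".
  return apiKey !== undefined && apiKey.trim() !== "" ? "remote" : "local";
}
```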

## CDP Proxying

The server acts as a WebSocket proxy for Chrome DevTools Protocol commands:

1. Client establishes WebSocket connection to `/browsersession/:id/cdp`
2. Server forwards messages to the underlying browser session
3. Responses and events stream back to client
4. Connection persists until client disconnects or session expires

## CLI Integration

The `browse` CLI commands connect to the server:

```bash
# Local execution (daemon auto-starts)
browse open https://example.com

# Remote execution via Browserbase
browse env remote
browse open https://example.com

# Specify server endpoint
browse --ws ws://localhost:3000 open https://example.com
```

The daemon architecture provides:

- Automatic server startup/shutdown
- Session persistence across CLI invocations
- Graceful recovery on connection failure

## Security Considerations

- Session tokens are generated securely and scoped to individual sessions
- CDP endpoints require valid session authentication
- Remote execution uses Browserbase's authentication infrastructure
- Sessions auto-expire after configurable inactivity timeout

## References

- CLI Daemon: `packages/cli/README.md`
- Session Context: `packages/core/lib/v3/understudy/context.ts`
- Frame Locator: `packages/core/lib/v3/understudy/frameLocator.ts`
- Lifecycle Watcher: `packages/core/lib/v3/understudy/lifecycleWatcher.ts`

---

<a id='cli-tools'></a>

## CLI Tools

### Related Pages

Related topics: [Server API](#server-api), [CDP Engine](#cdp-engine)

<details>
<summary>Relevant Source Files</summary>

The following source files were used to generate this page:

- [packages/cli/README.md](https://github.com/browserbase/stagehand/blob/main/packages/cli/README.md)
- [packages/cli/CHANGELOG.md](https://github.com/browserbase/stagehand/blob/main/packages/cli/CHANGELOG.md)
- [README.md](https://github.com/browserbase/stagehand/blob/main/README.md)
- [packages/core/lib/v3/understudy/context.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/context.ts)
- [packages/core/lib/v3/understudy/cookies.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/understudy/cookies.ts)
</details>

# CLI Tools

The Stagehand CLI (`browse` command) is a command-line interface for browser automation designed specifically for AI agents. It provides a text-based interface for controlling web browsers through CDP (Chrome DevTools Protocol), enabling programmatic browser control without requiring a graphical interface.

## Overview

The CLI serves as the foundational tool layer for the Stagehand browser automation framework. It abstracts CDP operations into human-readable commands, making it accessible for shell scripts, AI agent integrations, and development workflows.

The CLI operates in two modes:

| Mode | Description | Default Condition |
|------|-------------|-------------------|
| `local` | Runs Chrome locally via bundled Playwright | When no `BROWSERBASE_API_KEY` is set |
| `remote` | Connects to Browserbase cloud infrastructure | When `BROWSERBASE_API_KEY` is set |

Source: [packages/cli/README.md](https://github.com/browserbase/stagehand/blob/main/packages/cli/README.md)

## Architecture

```mermaid
graph TD
    A[User / AI Agent] --> B[browse CLI]
    B --> C{browse env mode}
    C -->|local| D[Local Chrome CDP]
    C -->|remote| E[Browserbase Cloud]
    D --> F[Local Strategy]
    E --> G[Remote Strategy]
    F --> H[Chrome Instance]
    G --> I[Browserbase Infrastructure]
    H --> J[CDP WebSocket]
    I --> J
```

### Daemon Architecture

The CLI uses a persistent daemon process for efficiency:

1. First invocation starts the daemon in the background
2. Subsequent commands reuse the existing daemon
3. The daemon manages Chrome lifecycle and CDP connections

Recovery behavior:
1. Detects stale daemon
2. Cleans up old files
3. Restarts the daemon
4. Retries the command

Source: [packages/cli/README.md](https://github.com/browserbase/stagehand/blob/main/packages/cli/README.md)

## Navigation Commands

| Command | Description |
|---------|-------------|
| `browse open <url>` | Navigate to a URL |
| `browse reload` | Reload current page |
| `browse back` | Navigate back in history |
| `browse forward` | Navigate forward in history |

### Open Command Options

| Option | Description | Default |
|--------|-------------|---------|
| `--wait <state>` | Wait for load state | `load` |
| `-t, --timeout <ms>` | Page load timeout | 30000 ms |
| `--context-id <id>` | Load Browserbase Context | - |
| `--persist` | Persist Context across sessions | - |

The `--timeout` flag controls how long to wait for the page load state. Use longer timeouts for slow-loading pages:

```bash
browse open https://slow-site.com --timeout 60000
```

## Interaction Commands

### Click Actions

| Command | Description |
|---------|-------------|
| `browse click <ref>` | Click by element ref (e.g., `@0-5`) |
| `browse click <ref> [-b button]` | Click with button (left/right/middle) |
| `browse click <ref> [-c count]` | Click multiple times |
| `browse click_xy <x> <y>` | Click at coordinates |

### Coordinate Actions

| Command | Description |
|---------|-------------|
| `browse hover <x> <y>` | Hover at coordinates |
| `browse scroll <x> <y> <dx> <dy>` | Scroll from position |
| `browse drag <fx> <fy> <tx> <ty>` | Drag from to coordinates |

### Keyboard Commands

| Command | Description |
|---------|-------------|
| `browse type <text>` | Type text into focused element |
| `browse type <text> [-d delay]` | Type with character delay |
| `browse press <key>` | Press keyboard key |

Supported keys include standard keys (Enter, Tab, Escape) and modifier combinations (Cmd+A, Cmd+C, etc.).

## Form Handling

| Command | Description |
|---------|-------------|
| `browse fill <selector> <value>` | Fill form field |
| `browse select <selector> <values...>` | Select dropdown options |
| `browse highlight <selector>` | Highlight element |

The `fill` command automatically presses Enter after typing. Use `--no-press-enter` to prevent this behavior.

## Page Information

| Command | Description |
|---------|-------------|
| `browse get url` | Get current URL |
| `browse get title` | Get page title |
| `browse get text <selector>` | Get element text content |
| `browse get html <selector>` | Get element HTML |
| `browse get value <selector>` | Get form field value |
| `browse get box <selector>` | Get center coordinates |
| `browse get markdown [selector]` | Convert HTML to markdown |

The `snapshot` command returns the accessibility tree with element references:

```bash
browse snapshot [-c|--compact]
```

Output format includes refs like `[0-5]`, `[1-2]`:

```
RootWebArea "Example" url="https://example.com"
  [0-0] link "Home"
  [0-1] link "About"
  [0-2] button "Sign In"
```

Source: [packages/cli/README.md](https://github.com/browserbase/stagehand/blob/main/packages/cli/README.md)
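
An agent that consumes the snapshot as text needs to pull the refs back out. The sketch below parses the sample output shown above; it assumes every element line has the shape `[ref] role "name"`, which is inferred from that one sample. `parseSnapshot` and `SnapshotNode` are hypothetical names.

```typescript
interface SnapshotNode {
  ref: string; // e.g. "0-2"
  role: string; // e.g. "button"
  name: string; // e.g. "Sign In"
}

// Assumed line shape, inferred from the sample:  [0-2] button "Sign In"
const NODE_LINE = /^\s*\[(\d+-\d+)\]\s+(\S+)\s+"(.*)"\s*$/;

// Hypothetical parser: lines without a leading [ref] are skipped.
function parseSnapshot(snapshot: string): SnapshotNode[] {
  const nodes: SnapshotNode[] = [];
  for (const line of snapshot.split("\n")) {
    const match = NODE_LINE.exec(line);
    if (match) {
      const [, ref = "", role = "", name = ""] = match;
      nodes.push({ ref, role, name });
    }
  }
  return nodes;
}
```

The resulting `ref` values can be passed to `browse click` to act on the matching element.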

## Waiting Commands

| Command | Description |
|---------|-------------|
| `browse wait load [state]` | Wait for page load state |
| `browse wait selector <selector>` | Wait for element |
| `browse wait timeout <ms>` | Wait for duration |

Selector wait options:

| Option | Description | Default |
|--------|-------------|---------|
| `-t timeout` | Maximum wait time | - |
| `-s visible\|hidden\|attached\|detached` | Element state | visible |

## Multi-Tab Management

| Command | Description |
|---------|-------------|
| `browse pages` | List all open tabs |
| `browse newpage [url]` | Open new tab |
| `browse tab_switch <n>` | Switch to tab by index |
| `browse tab_close [n]` | Close tab (default: last) |

## Network Capture

Enable network request capture for debugging and inspection:

| Command | Description |
|---------|-------------|
| `browse network on` | Start capturing requests |
| `browse network off` | Stop capturing |
| `browse network path` | Get capture directory |
| `browse network clear` | Clear captured requests |

### Capture Directory Structure

```
/tmp/browse-default-network/
  001-GET-api.github.com-repos/
    request.json      # method, url, headers, body
    response.json     # status, headers, body, duration
```
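
When post-processing a capture directory, the entry names can be split back into their parts. The naming pattern used below (`<sequence>-<METHOD>-<rest>`) is inferred from the single example above and may not cover every case; `parseCaptureDirName` is a hypothetical helper.

```typescript
interface CaptureEntry {
  sequence: number; // 1 for "001"
  method: string; // "GET"
  target: string; // "api.github.com-repos"
}

// Hypothetical parser for names such as "001-GET-api.github.com-repos".
// Returns null for anything that does not match the assumed pattern.
function parseCaptureDirName(dirName: string): CaptureEntry | null {
  const match = /^(\d+)-([A-Z]+)-(.+)$/.exec(dirName);
  if (!match) {
    return null;
  }
  const [, sequence = "", method = "", target = ""] = match;
  return { sequence: parseInt(sequence, 10), method, target };
}
```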

## Session Management

| Command | Description |
|---------|-------------|
| `browse start` | Start daemon |
| `browse stop` | Stop daemon |
| `browse status` | Check daemon status |
| `browse env <target>` | Set environment mode |

### Session Options

| Option | Description |
|--------|-------------|
| `--session <name>` | Session name (default: "default") |
| `--headless` | Run Chrome headless |
| `--headed` | Run Chrome with visible window |
| `--ws <url\|port>` | One-shot CDP connection |

### Environment Modes

```bash
# Start with specific mode
browse env local
browse env remote

# Attach to running Chrome
browse env local <port|url>

# Persist override and restart daemon
browse env <target> --session <name>

# Clear override (fallback to env-var detection)
browse stop
```

Auto-detection priority:
1. The override persisted by `browse env`, if one is set
2. The `BROWSE_SESSION` environment variable
3. Default: `remote` if `BROWSERBASE_API_KEY` is set, otherwise `local`
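
The priority order can be modeled as a small pure function. This is a sketch of one possible reading, not the CLI's actual implementation: it assumes the persisted override is stored per session, and that `BROWSE_SESSION` only selects which session is consulted (the source does not spell out how step 2 affects the mode). `resolveMode`, `ResolveInput`, and `persistedOverrides` are hypothetical names.

```typescript
type Mode = "local" | "remote";

interface ResolveInput {
  // Overrides saved by `browse env`, keyed by session name (assumption).
  persistedOverrides: Record<string, Mode | undefined>;
  // Value of --session, if given.
  sessionFlag?: string;
  // Environment variables, e.g. a copy of process.env.
  env: Record<string, string | undefined>;
}

// Session name: --session flag, then BROWSE_SESSION, then "default".
function resolveSessionName(input: ResolveInput): string {
  return input.sessionFlag ?? input.env["BROWSE_SESSION"] ?? "default";
}

function resolveMode(input: ResolveInput): Mode {
  // 1. A persisted override for the selected session wins.
  const override = input.persistedOverrides[resolveSessionName(input)];
  if (override) return override;
  // 3. Otherwise fall back to API-key detection.
  return input.env["BROWSERBASE_API_KEY"] ? "remote" : "local";
}

const detectedMode = resolveMode({
  persistedOverrides: {},
  env: { BROWSERBASE_API_KEY: "example-key" },
});
console.log(detectedMode); // prints "remote"
```

Keeping the function free of file and process access makes each branch of the order easy to check in isolation.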

## Element References

After running `browse snapshot`, elements can be referenced by their ref ID:

```bash
# Get snapshot with refs
browse snapshot -c

# Click using ref (multiple formats)
browse click @0-2       # @ prefix
browse click 0-2        # Plain ref
browse click --ref 0-2  # --ref flag
```
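
A wrapper script that accepts refs from users or from an agent may want to normalize the `@`-prefixed and plain forms before building a command. The sketch below is illustrative: `normalizeRef` is a hypothetical helper, and the `digits-digits` shape is inferred from the examples in this document rather than from a published grammar.

```typescript
// Accepts "@0-2" or "0-2"; the shape is inferred from the examples above.
const REF_PATTERN = /^@?(\d+-\d+)$/;

// Hypothetical helper: returns the bare ref ("0-2"), or null when the input
// does not look like a ref (for example, a CSS selector).
function normalizeRef(input: string): string | null {
  const match = REF_PATTERN.exec(input.trim());
  if (!match) return null;
  return match[1] ?? null;
}

console.log(normalizeRef("@0-2")); // prints "0-2"
console.log(normalizeRef("0-2")); // prints "0-2"
console.log(normalizeRef("#submit")); // prints null
```

Returning `null` for anything that is not ref-shaped lets the caller decide whether to treat the input as a selector or to re-run `browse snapshot` first.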

## Global Options

| Option | Description |
|--------|-------------|
| `--session <name>` | Session name for multiple browsers |
| `--headless` | Run Chrome in headless mode |
| `--headed` | Run Chrome with visible window |
| `--ws <url\|port>` | One-shot CDP connection (bypasses daemon) |
| `--json` | Output as JSON |

## Environment Variables

| Variable | Description |
|----------|-------------|
| `BROWSE_SESSION` | Default session name |
| `BROWSERBASE_API_KEY` | Browserbase API key (required for remote mode) |

## Output Formats

The CLI supports JSON output for programmatic consumption:

```bash
browse get url --json
browse snapshot --json
```

This is particularly useful for AI agent integrations that need structured data.

## Related Documentation

- [Stagehand Core Package](../core/README.md) - Browser automation primitives
- [Browserbase Platform](https://www.browserbase.com) - Remote browser infrastructure
- [CDP Protocol](https://chromedevtools.github.io/devtools-protocol/) - Chrome DevTools Protocol reference

---

## Doramagic Pitfall Log

Project: browserbase/stagehand

Summary: 7 potential pitfalls found, none of them high/blocking; highest priority: identity pitfall - repository name and install name differ.

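## 1. Identity pitfall · Repository name and install name differ

- Severity: medium
- Evidence strength: runtime_trace
- Finding: The repository name `stagehand` does not exactly match the install entry point `create-browser-app`.
- User impact: Users who search for the package by the repository name, or for the repository by the package name, can easily end up at the wrong entry point.
- Suggested check: Confirm the package-name mapping and the official README instructions on npm/PyPI/GitHub.
- Reproduction command: `npx create-browser-app`
- Safeguard: The page must show both the repo name and the actual install entry point so that users do not search for the wrong package.
- Evidence: identity.distribution | github_repo:776908852 | https://github.com/browserbase/stagehand | repo=stagehand; install=create-browser-app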

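## 2. Capability pitfall · Capability assessment relies on assumptions

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: If the assumption does not hold, users will not get the promised capability.
- Suggested check: Turn the assumption into a downstream validation checklist.
- Safeguard: Assumptions must be turned into validation items; they must not be stated as fact until validation results exist.
- Evidence: capability.assumptions | github_repo:776908852 | https://github.com/browserbase/stagehand | README/documentation is current enough for a first validation pass.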

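## 3. Maintenance pitfall · Maintenance activity unknown

- Severity: medium
- Evidence strength: source_linked
- Finding: `last_activity_observed` is not recorded.
- User impact: New, abandoned, and active projects get lumped together, which lowers confidence in the recommendation.
- Suggested check: Add GitHub signals for recent commits, releases, and issue/PR responsiveness.
- Safeguard: While maintenance activity is unknown, the recommendation must not be marked as high-trust.
- Evidence: evidence.maintainer_signals | github_repo:776908852 | https://github.com/browserbase/stagehand | last_activity_observed missing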

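## 4. Security/permissions pitfall · Downstream validation found risk items

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: Downstream has already requested a review, so this must not be downplayed on the page.
- Suggested check: Send the item to the security/permissions governance review queue.
- Safeguard: While a downstream risk exists, the review/recommendation must stay downgraded.
- Evidence: downstream_validation.risk_items | github_repo:776908852 | https://github.com/browserbase/stagehand | no_demo; severity=medium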

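## 5. Security/permissions pitfall · Scoring risk present

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: The risk affects whether the project is suitable for ordinary users to install.
- Suggested check: Record the risk in the boundary card and confirm whether manual review is needed.
- Safeguard: Scoring risks must appear in the boundary card, not only as an internal score.
- Evidence: risks.scoring_risks | github_repo:776908852 | https://github.com/browserbase/stagehand | no_demo; severity=medium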

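## 6. Maintenance pitfall · Issue/PR response quality unknown

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: Users cannot tell whether anyone will maintain the project when they run into a problem.
- Suggested check: Sample recent issues/PRs to see whether they go unaddressed for long periods.
- Safeguard: While issue/PR responsiveness is unknown, the maintenance risk must be flagged.
- Evidence: evidence.maintainer_signals | github_repo:776908852 | https://github.com/browserbase/stagehand | issue_or_pr_quality=unknown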

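## 7. Maintenance pitfall · Release cadence unclear

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: Install commands and documentation may lag behind the code, raising the odds that users hit problems.
- Suggested check: Confirm that the latest release/tag matches the install command in the README.
- Safeguard: When the release cadence is unknown or stale, install instructions must note possible drift.
- Evidence: evidence.maintainer_signals | github_repo:776908852 | https://github.com/browserbase/stagehand | release_recency=unknown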

