stagehand Manual Preview

Doramagic Project Pack · Human Manual

stagehand

Project Introduction

Related topics: Architecture Overview, CDP Engine, LLM Providers

Stagehand is an AI-powered browser automation framework developed by Browserbase that enables developers to control web browsers using natural language instructions combined with precise code-based control. The framework bridges the gap between high-level AI agents and low-level browser automation tools like Selenium, Playwright, or Puppeteer.

Project Overview

Stagehand lets developers choose, step by step, when to rely on AI and when to write explicit code. This hybrid approach keeps automation flexible on unfamiliar pages while remaining reliable in production environments.

Key Value Propositions

Feature	Description
Hybrid Control	Combine AI-driven navigation with deterministic code execution
Self-Healing	Auto-caching and action recovery when website changes occur
Action Preview	Preview AI-generated actions before execution
Repeatable Workflows	Cache and reuse actions to save time and tokens
Production Ready	Built for reliable automation in production systems

Sources: README.md:1-50

Architecture Overview

Stagehand follows a monorepo architecture using pnpm workspaces and Turborepo for efficient build management and dependency handling.

graph TD
    A[stagehand Repository] --> B[packages/core]
    A --> C[packages/cli]
    B --> D[Agent System]
    B --> E[Inference Engine]
    B --> F[Browser Context Manager]
    D --> G[Microsoft CUA Client]
    D --> H[Action Executors]
    F --> I[Playwright Integration]
    F --> J[Frame Locator]
    E --> K[Inference Logging]
    E --> L[Cache Management]
    C --> M[Daemon Controller]
    C --> N[CLI Commands]
    M --> F

Core Packages

Package	Purpose	Key Files
`packages/core`	Main automation engine with AI agent capabilities	`lib/v3/agent/`, `lib/v3/understudy/`
`packages/cli`	Command-line interface for browser control	`CHANGELOG.md`, README commands

Sources: packages/cli/CHANGELOG.md:1-15

Project Structure

stagehand/
├── packages/
│   ├── core/                    # Core automation package
│   │   ├── lib/
│   │   │   ├── v3/
│   │   │   │   ├── agent/       # AI agent implementation
│   │   │   │   │   ├── MicrosoftCUAClient.ts
│   │   │   │   │   └── prompts/
│   │   │   │   │       └── agentSystemPrompt.ts
│   │   │   │   └── understudy/  # Context and frame management
│   │   │   │       ├── context.ts
│   │   │   │       └── frameLocator.ts
│   │   │   └── inferenceLogUtils.ts
│   │   └── README.md
│   └── cli/                     # CLI tooling
│       ├── CHANGELOG.md
│       └── README.md
├── package.json
├── pnpm-workspace.yaml
└── turbo.json

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:1-50

Agent System Architecture

The agent system is the brain of Stagehand, responsible for interpreting user instructions and generating appropriate browser actions.

Microsoft CUA Integration

Stagehand implements a Microsoft CUA (Computer Use Agent) client that provides structured function calling capabilities:

graph LR
    A[User Instruction] --> B[MicrosoftCUAClient]
    B --> C[Action Generation]
    C --> D[left_click]
    C --> E[scroll]
    C --> F[visit_url]
    C --> G[type]
    C --> H[wait]
    C --> I[web_search]

#### Supported Actions

Action	Parameters	Description
`left_click`	`coordinate: [x, y]`	Click at specified coordinates
`scroll`	`pixels: number`	Scroll up (positive) or down (negative)
`visit_url`	`url: string`	Navigate to a URL
`type`	`text, press_enter, delete_existing_text`	Type text into input fields
`wait`	`time: number`	Wait for specified seconds
`web_search`	`query: string`	Perform a web search
`history_back`	-	Navigate back in browser history
`pause_and_memorize_fact`	`fact: string`	Store information for later use
`terminate`	`status: success/failure`	End the task execution

Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:1-80
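The action table above can be modelled as a TypeScript discriminated union. This is a minimal sketch rather than the definitions in MicrosoftCUAClient.ts: the `CuaAction` type and the `describeAction` helper are hypothetical names, and only the action names and parameters listed in the table come from the source.

```typescript
// Hypothetical model of the actions in the table above. The type and
// helper names are illustrative, not Stagehand's actual exports.
type CuaAction =
  | { name: "left_click"; coordinate: [number, number] }
  | { name: "scroll"; pixels: number }
  | { name: "visit_url"; url: string }
  | { name: "type"; text: string; press_enter?: boolean; delete_existing_text?: boolean }
  | { name: "wait"; time: number }
  | { name: "web_search"; query: string }
  | { name: "history_back" }
  | { name: "pause_and_memorize_fact"; fact: string }
  | { name: "terminate"; status: "success" | "failure" };

// Turns an action into a short human-readable description.
function describeAction(action: CuaAction): string {
  switch (action.name) {
    case "left_click":
      return `click at (${action.coordinate[0]}, ${action.coordinate[1]})`;
    case "scroll":
      // Per the table: positive pixels scroll up, negative scroll down.
      return action.pixels >= 0
        ? `scroll up ${action.pixels}px`
        : `scroll down ${-action.pixels}px`;
    case "visit_url":
      return `navigate to ${action.url}`;
    case "type":
      return `type "${action.text}"${action.press_enter ? " and press Enter" : ""}`;
    case "wait":
      return `wait ${action.time}s`;
    case "web_search":
      return `search the web for "${action.query}"`;
    case "history_back":
      return "go back";
    case "pause_and_memorize_fact":
      return `remember: ${action.fact}`;
    case "terminate":
      return `finish with status ${action.status}`;
  }
}
```

Because the union is keyed on `name`, the compiler can check that every action in the table is handled and that each one carries the right parameters.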

System Prompt Structure

The agent uses a structured XML-based system prompt that organizes instructions into distinct sections:

graph TD
    P[System Prompt] --> A[Identity]
    P --> T[Task Definition]
    P --> M[Mindset Guidelines]
    P --> G[Action Guidelines]
    P --> N[Navigation Rules]
    P --> S[Strategy]
    P --> R[Roadblocks]
    P --> V[Variables]
    P --> C[Completion]

The prompt templates include conditional sections for:

Page Understanding Protocol: Instructions for analyzing page state
Search Integration: Optional search tool usage when URL confidence is low
Variable Substitution: Support for %variableName% syntax in form interactions
Captcha Handling: Optional roadblocks section when auto-solving is enabled

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:50-150

Browser Context Management

Stagehand manages browser contexts through the understudy module, which handles complex scenarios like iframe interactions and multi-page workflows.

Frame Locator System

The frame locator handles cross-origin iframe communication and lifecycle events:

sequenceDiagram
    participant Parent as Parent Page
    participant Child as Child Frame
    participant Context as BrowserContext
    
    Parent->>Context: Create Frame
    Context->>Child: Register Lifecycle Events
    Child-->>Parent: DOMContentLoaded
    Parent->>Context: Wait for MainWorld
    Context-->>Parent: Execution Context Ready

Key responsibilities include:

Resolving owning Page by main frame ID
Applying initialization scripts to pages
Managing lifecycle events (DOMContentLoaded, load, networkIdle)
Handling OOPIF (Out-of-Process iframe) scenarios

Sources: packages/core/lib/v3/understudy/context.ts:1-50, packages/core/lib/v3/understudy/frameLocator.ts:1-40

Inference and Caching

Stagehand implements an inference logging system that enables self-healing automation by caching and reusing action results.

Summary File Structure

<inferenceType>_summary.json

Where inferenceType is one of act, observe, or extract, giving:

act_summary.json: Action execution results
observe_summary.json: Page observation results
extract_summary.json: Data extraction results

The system reads and writes JSON files containing arrays of inference results, enabling:

Action Replay: Re-execute previously successful actions
Self-Healing: Automatically recover from website changes
Token Optimization: Skip LLM inference when cached results are valid

Sources: packages/core/lib/inferenceLogUtils.ts:1-60
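The summary-file convention can be sketched as two small pure helpers. This is an illustration under stated assumptions, not the code in inferenceLogUtils.ts: it assumes the inference types are `act`, `observe`, and `extract`, and that each file holds one array under a `<inferenceType>_summary` key. The function names are hypothetical.

```typescript
// Hypothetical helpers. Only the `<inferenceType>_summary.json` naming is
// taken from the text; the in-file array key is an assumption.
type InferenceType = "act" | "observe" | "extract";
type SummaryFile<T> = Record<string, T[]>;

function summaryFileName(type: InferenceType): string {
  return `${type}_summary.json`;
}

// Returns a new summary object with `entry` appended; `existing` is not mutated.
function appendSummary<T>(
  existing: SummaryFile<T> | undefined,
  type: InferenceType,
  entry: T,
): SummaryFile<T> {
  const key = `${type}_summary`; // assumed array key inside the file
  const previous = existing?.[key] ?? [];
  return { ...(existing ?? {}), [key]: [...previous, entry] };
}
```

A caller would read the JSON file if it exists, pass the parsed object through `appendSummary`, and write the result back.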

CLI Architecture

The Stagehand CLI provides a command-line interface for browser automation with support for both local and remote browser execution.

Execution Modes

Mode	Detection	Description
`remote`	`BROWSERBASE_API_KEY` is set	Uses Browserbase cloud infrastructure
`local`	Default fallback	Uses local Playwright browser

The daemon-based architecture ensures:

Persistent browser sessions
Automatic recovery on failures
Session state preservation across commands

Sources: packages/cli/README.md:1-100
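The mode detection in the table above amounts to a single environment check. A minimal sketch follows; `detectMode` is a hypothetical name, and treating an empty or whitespace-only key as unset is an assumption rather than documented CLI behaviour.

```typescript
type BrowseMode = "remote" | "local";

// `env` is passed in explicitly so the function stays pure and easy to test.
function detectMode(env: Record<string, string | undefined>): BrowseMode {
  const key = env["BROWSERBASE_API_KEY"];
  // Assumption: a blank key counts as "not set".
  return key !== undefined && key.trim() !== "" ? "remote" : "local";
}
```

In a Node.js process this would typically be called as `detectMode(process.env)`.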

CLI Command Categories

#### Navigation Commands

browse open <url> [--wait load|domcontentloaded|networkidle]
browse reload
browse back
browse forward

#### Interaction Commands

browse click <ref>
browse type <text>
browse press <key>
browse fill <selector> <value>

#### Information Commands

browse get url|title|text|html|value|box
browse snapshot [-c|--compact]
browse screenshot

#### Session Management

browse start|stop|restart
browse env local|remote
browse attach <port|url>

Sources: packages/cli/README.md:100-200

Getting Started

Installation from Source

git clone https://github.com/browserbase/stagehand.git
cd stagehand
pnpm install
pnpm run build
pnpm run example

Branch Installation

Using gitpkg, install directly from a branch:

"@browserbasehq/stagehand": "https://gitpkg.now.sh/browserbase/stagehand/packages/core?<branchName>"

Environment Configuration

Create a .env file from the example:

cp .env.example .env
# Add your API keys:
# BROWSERBASE_API_KEY=your_key
# OPENAI_API_KEY=your_key (or other LLM provider)

Sources: README.md:50-80, packages/core/README.md:50-80

Version History

Recent significant changes in the CLI package:

Version	Change	PR
0.4.2	Added `browse get markdown` command for HTML-to-markdown conversion	#1907
0.4.1	Fixed invalid metadata key using underscore	#1911
0.4.0	Added new feature	#1889

Sources: packages/cli/CHANGELOG.md:1-20

License and Community

Stagehand is released under the MIT License. The project maintains an active community through:

Documentation: docs.stagehand.dev
Discord Community: stagehand.dev/discord
DeepWiki Integration: Ask questions at deepwiki.com/browserbase/stagehand

A Python implementation is also available at github.com/browserbase/stagehand-python.

Sources: README.md:1-30, packages/core/README.md:1-30


Architecture Overview

Related topics: Project Introduction, CDP Engine, Core Actions, Server API

Introduction

Stagehand is an AI-powered browser automation framework that enables developers to control web browsers using natural language instructions combined with precise code-based control. The architecture is designed to bridge the gap between high-level AI agents and low-level browser automation frameworks like Playwright, providing reliability, extensibility, and self-healing capabilities essential for production environments.

The framework operates on a multi-layered architecture that separates concerns between browser session management, agent reasoning, tool execution, and page interaction primitives. This design allows users to choose when to leverage AI for navigating unfamiliar pages and when to use explicit code for deterministic operations.

Core Architecture Layers

Stagehand's architecture consists of four primary layers working in concert to provide browser automation capabilities:

Layer	Purpose	Key Components
Agent Layer	High-level reasoning and decision making	MicrosoftCUAClient, System Prompts
Context Layer	Browser session and page lifecycle management	BrowserContext, Page management
Understudy Layer	Low-level browser primitives and frame handling	FrameLocator, Page interactions
Transport Layer	Local/Remote browser connectivity	CDP (Chrome DevTools Protocol)

Browser Context Management

The BrowserContext class (defined in context.ts) serves as the central hub for managing browser sessions. It handles multiple pages, initialization scripts, and coordinate-based element interactions.

Page Management

The context maintains a mapping of pages organized by target ID, enabling multi-tab support:

pages(): Page[] {
  const rows: Array<{ tid: TargetId; page: Page; created: number }> = [];
  for (const [tid, page] of this.pagesByTarget) {
    if (this.typeByTarget.get(tid) === "page") {
      rows.push({ tid, page, created: this.createdAtByTarget.get(tid) ?? 0 });
    }
  }
  rows.sort((a, b) => a.created - b.created);
  return rows.map((r) => r.page);
}

Sources: context.ts:lines

Pages are returned in creation order (oldest to newest), and OOPIF (Out-of-Process iframe) targets are intentionally excluded from this listing to maintain a clean multi-tab abstraction.
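To see the ordering and filtering in isolation, here is a self-contained sketch that reuses the same three maps with simplified stand-in types. `TargetBook`, `PageLike`, and the `add` method are hypothetical; only the filter-then-sort logic mirrors the excerpt.

```typescript
// Simplified stand-ins for the real types.
type TargetId = string;
type TargetType = "page" | "iframe";
interface PageLike {
  url: string;
}

class TargetBook {
  private readonly pagesByTarget = new Map<TargetId, PageLike>();
  private readonly typeByTarget = new Map<TargetId, TargetType>();
  private readonly createdAtByTarget = new Map<TargetId, number>();

  add(tid: TargetId, page: PageLike, type: TargetType, created: number): void {
    this.pagesByTarget.set(tid, page);
    this.typeByTarget.set(tid, type);
    this.createdAtByTarget.set(tid, created);
  }

  // Top-level pages only, oldest first; iframe (OOPIF) targets are skipped.
  pages(): PageLike[] {
    const rows: Array<{ page: PageLike; created: number }> = [];
    this.pagesByTarget.forEach((page, tid) => {
      if (this.typeByTarget.get(tid) === "page") {
        rows.push({ page, created: this.createdAtByTarget.get(tid) ?? 0 });
      }
    });
    rows.sort((a, b) => a.created - b.created);
    return rows.map((r) => r.page);
  }
}
```

Registering two pages and one iframe target, then calling `pages()`, returns only the two pages, ordered by their creation timestamps regardless of insertion order.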

Initialization Scripts

The context supports both seeding and full registration of initialization scripts:

private async applyInitScriptsToPage(
  page: Page,
  opts?: { seedOnly?: boolean },
): Promise<void> {
  if (opts?.seedOnly) {
    for (const source of this.initScripts) {
      page.seedInitScript(source);
    }
    return;
  }
  for (const source of this.initScripts) {
    await page.registerInitScript(source);
  }
}

Sources: context.ts:lines

This dual-mode initialization allows scripts to be either eagerly registered or lazily seeded for later injection.

Frame Handling Architecture

Stagehand implements sophisticated frame management through the FrameLocator module, handling both same-process and cross-process frame scenarios including OOPIFs.

Lifecycle-Aware Frame Attachment

The frame attachment process waits for specific lifecycle events before exposing the main world context:

const onLifecycle = (evt: Protocol.Page.LifecycleEventEvent) => {
  if (
    evt.frameId !== childFrameId ||
    (evt.name !== "DOMContentLoaded" &&
      evt.name !== "load" &&
      evt.name !== "networkIdle" &&
      evt.name !== "networkidle")
  ) {
    return;
  }
  if (hasMainWorldOnParent()) return finish();
  // ... handle frame ownership transfer
};

Sources: frameLocator.ts:lines

This approach ensures that automation scripts don't attempt to interact with frames before they have reached a stable state.
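The guard at the top of the handler can be expressed as a small predicate. This is a sketch with a simplified event type; `isReadyEventFor` and `READY_EVENTS` are hypothetical names, while the four event names are the ones checked in the excerpt.

```typescript
// Simplified shape of a lifecycle event; the real type comes from the CDP
// protocol typings.
interface LifecycleEvent {
  frameId: string;
  name: string;
}

// Both spellings of the network-idle event are accepted, as in the excerpt.
const READY_EVENTS = new Set(["DOMContentLoaded", "load", "networkIdle", "networkidle"]);

// True only when the event belongs to the awaited child frame and signals
// one of the stable lifecycle states.
function isReadyEventFor(evt: LifecycleEvent, childFrameId: string): boolean {
  return evt.frameId === childFrameId && READY_EVENTS.has(evt.name);
}
```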

Agent System Design

System Prompt Architecture

The agent receives structured prompts that organize its behavior into distinct sections:

<system>
  <identity>You are a web automation assistant...</identity>
  <task>
    <goal>${executionInstruction}</goal>
    <date display="local" iso="${isoDate}">${localeDate}</date>
  </task>
  <page>
    <startingUrl>...</startingUrl>
  </page>
  <mindset>...</mindset>
  <guidelines>...</guidelines>
  <page_understanding_protocol>...</page_understanding_protocol>
  <navigation>...</navigation>
  <tools>...</tools>
  <strategy>...</strategy>
  <roadblocks>...</roadblocks>
  <variables>...</variables>
  <completion>...</completion>
</system>

Sources: agentSystemPrompt.ts:lines

Hybrid Mode Page Understanding

The system supports two modes of page understanding based on the isHybridMode flag:

Mode	Primary Tool	Secondary Tool	Use Case
Hybrid	`screenshot`	`ariaTree`	Visual confirmation + accessible content
Standard	`ariaTree`	`screenshot`	Text-focused accessibility tree

Sources: agentSystemPrompt.ts:lines
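The table reduces to a two-way switch on `isHybridMode`. A minimal sketch, assuming the tool names are exactly `screenshot` and `ariaTree` as shown; `pageUnderstandingTools` is a hypothetical helper, not a function from agentSystemPrompt.ts.

```typescript
type UnderstandingTool = "screenshot" | "ariaTree";

// Picks the primary and secondary page-understanding tools for a mode.
function pageUnderstandingTools(isHybridMode: boolean): {
  primary: UnderstandingTool;
  secondary: UnderstandingTool;
} {
  return isHybridMode
    ? { primary: "screenshot", secondary: "ariaTree" }
    : { primary: "ariaTree", secondary: "screenshot" };
}
```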

The page understanding protocol ensures agents start by comprehending the page state before taking actions:

<page_understanding_protocol>
  <step_1>
    <title>UNDERSTAND THE PAGE</title>
    <primary_tool>
      <name>screenshot|ariaTree</name>
      <usage>Get complete page context before taking actions</usage>
    </primary_tool>
  </step_1>
</page_understanding_protocol>

Microsoft CUA Agent Actions

The Microsoft CUA client defines a comprehensive set of agent actions:

Action	Purpose	Parameters
`left_click`	Click at coordinate	`coordinate: [x, y]`
`scroll`	Scroll page	`pixels: number`, `coordinate?: [x, y]`
`visit_url`	Navigate to URL	`url: string`
`web_search`	Perform search	`query: string`
`history_back`	Navigate backward	None
`pause_and_memorize_fact`	Store context	`fact: string`
`wait`	Pause execution	`time: number`
`terminate`	End task	`status: success | failure`
`type`	Input text	`text: string`, `press_enter?: boolean`
`key`	Keyboard input	`keys: string[]`

Sources: MicrosoftCUAClient.ts:lines

FARA Function Calling Protocol

The agent uses a function calling format that wraps the tool schema and each JSON call in XML tags:

<tools>
${toolSchema}
</tools>

<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

Sources: MicrosoftCUAClient.ts:lines

This format separates agent thoughts from function calls, enabling clear reasoning before action execution.
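On the receiving side, a client has to pull the JSON payloads back out of the model's text. The following is a hedged sketch of such a parser, not the parsing code in MicrosoftCUAClient.ts; `parseToolCalls` and `isToolCall` are hypothetical names, and silently skipping malformed blocks is a design assumption.

```typescript
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Runtime check that a parsed value has the {"name", "arguments"} shape.
function isToolCall(value: unknown): value is ToolCall {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate.name === "string" &&
    typeof candidate.arguments === "object" &&
    candidate.arguments !== null
  );
}

// Extracts every well-formed <tool_call>...</tool_call> block from the output.
function parseToolCalls(output: string): ToolCall[] {
  const calls: ToolCall[] = [];
  const pattern = /<tool_call>\s*([\s\S]*?)\s*<\/tool_call>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(output)) !== null) {
    try {
      const parsed: unknown = JSON.parse(match[1] ?? "");
      if (isToolCall(parsed)) calls.push(parsed);
    } catch {
      // Malformed JSON inside a block is skipped rather than aborting the parse.
    }
  }
  return calls;
}
```

Any text outside the tags, such as the agent's reasoning, is ignored by the parser, which is what keeps thoughts and calls separate.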

Variable System

The framework supports variable substitution in tool calls using %variableName% syntax:

const variableToolsNote = isHybridMode
  ? "Use %variableName% syntax in the type, fillFormVision, or act tool's value/text/action fields."
  : "Use %variableName% syntax in the act or fillForm tool's action fields.";

Sources: agentSystemPrompt.ts:lines

Variables are defined with optional descriptions and rendered in XML format:

<variables>
  <variable name="password" />
  <variable name="username">The username for login</variable>
</variables>
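A rough sketch of both halves of the variable system follows: rendering the `<variables>` block and substituting `%variableName%` placeholders at execution time. The `VariableSpec` shape and both function names are hypothetical, and keeping values out of the rendered XML is an assumption based on the example, which shows only names and descriptions.

```typescript
interface VariableSpec {
  value: string;
  description?: string;
}

function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Renders names and optional descriptions only; values never enter the prompt.
function renderVariablesXml(vars: Record<string, VariableSpec>): string {
  const lines = Object.keys(vars).map((name) => {
    const description = vars[name]?.description;
    return description
      ? `  <variable name="${escapeXml(name)}">${escapeXml(description)}</variable>`
      : `  <variable name="${escapeXml(name)}" />`;
  });
  return ["<variables>", ...lines, "</variables>"].join("\n");
}

// Replaces %name% placeholders with values; unknown names are left untouched.
function substituteVariables(text: string, vars: Record<string, VariableSpec>): string {
  const lookup = new Map<string, VariableSpec>();
  Object.keys(vars).forEach((name) => {
    const spec = vars[name];
    if (spec) lookup.set(name, spec);
  });
  return text.replace(/%([A-Za-z_][A-Za-z0-9_]*)%/g, (whole: string, name: string) => {
    const spec = lookup.get(name);
    return spec ? spec.value : whole;
  });
}
```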

Session Management

Daemon-Based Architecture

The CLI architecture uses a persistent daemon for browser session management:

graph TD
    A[CLI Command] --> B[Daemon Check]
    B -->|Running| C[Send Command]
    B -->|Not Running| D[Auto-restart Daemon]
    D --> C
    C --> E[Browser Instance]
    E --> F[Page Operations]

The daemon supports two execution modes:

Mode	Trigger	Use Case
`remote`	`BROWSERBASE_API_KEY` is set	Cloud browser infrastructure
`local`	No API key detected	Local browser instances

Multi-Session Support

Sessions can be named for parallel browser instances:

browse --session <name>

The context stores metadata using session names, enabling clean separation of concurrent automation tasks.

Tool Categories

Navigation Tools

Tool	Description
`open`	Navigate to URL with configurable wait states
`reload`	Refresh current page
`back` / `forward`	Browser history navigation

Interaction Tools

Tool	Description
`click`	Click element by reference or coordinates
`type`	Text input with optional delays
`press`	Keyboard shortcuts
`hover`	Mouse hover at coordinates
`scroll`	Scroll operations with delta support

Page Information Tools

Tool	Description
`snapshot`	Accessibility tree with element refs
`screenshot`	Visual capture (PNG/JPEG)
`get`	Retrieve URL, title, text, HTML, values
`get markdown`	Convert HTML to markdown

Strategy Guidelines

The system embeds strategic guidelines for agent behavior:

<strategy>
  <item>CRITICAL: Use extract ONLY when the task explicitly requires structured data output</item>
  <item>Keep actions atomic and verify outcomes before proceeding</item>
  <item>For each action, provide clear reasoning about why you're taking that step</item>
  <item>When you need to input text, prefer using the keys tool to type the entire sequence at once</item>
</strategy>

Sources: agentSystemPrompt.ts:lines

Configuration Options

The agent system prompt accepts the following configuration parameters:

Parameter	Type	Purpose
`executionInstruction`	string	Task description for the agent
`url`	string	Starting page URL
`systemInstructions`	string	Custom system-level instructions (CDATA)
`variables`	Record	Variable name/value pairs
`isHybridMode`	boolean	Enable screenshot-first page understanding
`captchasAutoSolve`	boolean	Enable CAPTCHA auto-solving (shows roadblocks section)
`hasSearch`	boolean	Enable search tool usage
`isLocal`	boolean	Local vs remote browser context
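
The parameter table can be read as a TypeScript options object. The sketch below is an assumption-laden illustration: which fields are optional, the `AgentPromptOptions` name, and the `enabledOptionalSections` helper are all hypothetical. Only the parameter names, their types, and the link between `captchasAutoSolve` and the roadblocks section come from the source.

```typescript
// Hypothetical options shape derived from the table above.
interface AgentPromptOptions {
  executionInstruction: string;
  url: string;
  systemInstructions?: string;
  variables?: Record<string, string>;
  isHybridMode?: boolean;
  captchasAutoSolve?: boolean;
  hasSearch?: boolean;
  isLocal?: boolean;
}

// Lists which conditional prompt sections a given configuration would enable.
function enabledOptionalSections(opts: AgentPromptOptions): string[] {
  const sections: string[] = [];
  if (opts.captchasAutoSolve) sections.push("roadblocks");
  if (opts.variables && Object.keys(opts.variables).length > 0) sections.push("variables");
  if (opts.hasSearch) sections.push("search");
  return sections;
}
```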

Error Handling

The context manages HTTP header configuration errors with detailed session reporting:

const failures = Array.from(result.errors.entries()).map(([target, entry]) => {
  const reason = entry.result.reason as Error;
  const sid = entry.session.id ?? "unknown";
  const message = reason?.message ?? String(reason);
  return `session=${sid} error=${message}`;
});

if (failures.length) {
  throw new StagehandSetExtraHTTPHeadersError(failures);
}

Sources: context.ts:lines
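The same collect-then-throw pattern can be shown in a self-contained form. This is a sketch, not Stagehand's implementation: `Settled`, `SessionOutcome`, `HeaderUpdateError`, and both functions are hypothetical stand-ins, with `Settled` mirroring the shape of the results returned by `Promise.allSettled`.

```typescript
// Minimal mirror of a settled promise result.
type Settled =
  | { status: "fulfilled" }
  | { status: "rejected"; reason: unknown };

interface SessionOutcome {
  sessionId?: string;
  result: Settled;
}

// Hypothetical aggregate error carrying one message per failed session.
class HeaderUpdateError extends Error {
  readonly failures: string[];
  constructor(failures: string[]) {
    super(`Failed to set extra HTTP headers: ${failures.join("; ")}`);
    this.name = "HeaderUpdateError";
    this.failures = failures;
  }
}

// Formats each rejected outcome as "session=<id> error=<message>".
function describeFailures(outcomes: SessionOutcome[]): string[] {
  const failures: string[] = [];
  for (const { sessionId, result } of outcomes) {
    if (result.status !== "rejected") continue;
    const reason = result.reason;
    const message = reason instanceof Error ? reason.message : String(reason);
    failures.push(`session=${sessionId ?? "unknown"} error=${message}`);
  }
  return failures;
}

// Throws a single error if any session failed; otherwise returns normally.
function assertAllSucceeded(outcomes: SessionOutcome[]): void {
  const failures = describeFailures(outcomes);
  if (failures.length > 0) throw new HeaderUpdateError(failures);
}
```

Waiting for every session to settle before throwing means one failing session does not hide failures in the others.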

Extension Points

Stagehand provides several extension mechanisms:

Custom Instructions: Via systemInstructions parameter with CDATA wrapping for complex multi-line content
Variables: For sensitive data injection (passwords, tokens)
Init Scripts: JavaScript injection at page load time
Tool Schema: Extensible action definitions in the Microsoft CUA client


CDP Engine

Related topics: Architecture Overview, DOM and Accessibility Tree, Core Actions

The CDP (Chrome DevTools Protocol) Engine is the low-level abstraction layer in Stagehand that provides direct communication between the framework and web browsers. It orchestrates browser launching, session management, frame handling, execution context management, and CDP command dispatching.

Architecture Overview

graph TD
    A[Stagehand API] --> B[CDP Engine]
    B --> C[Launch Module]
    B --> D[Session Manager]
    B --> E[Frame Registry]
    B --> F[Execution Context Registry]
    C --> G[Local Browser]
    C --> H[Browserbase Cloud]
    D --> I[CDP WebSocket Connections]
    E --> J[Frame Lifecycle Events]
    F --> K[Isolated JS Contexts]

The CDP Engine serves as the foundation layer that Stagehand's high-level browser automation primitives are built upon. It abstracts away the complexity of raw CDP WebSocket communication while providing type-safe interfaces for all browser operations.

Sources: packages/core/lib/v3/understudy/cdp.ts:1-50

Core Components

Launch Module

The launch module handles browser instantiation across different deployment environments.

Launch Mode	Source File	Description
Local Browser	`packages/core/lib/v3/launch/local.ts`	Launches Chromium locally using Playwright's browser management
Browserbase Cloud	`packages/core/lib/v3/launch/browserbase.ts`	Connects to remote browser infrastructure for scalable execution

graph LR
    A[Launch Request] --> B{Local or Remote?}
    B -->|Local| C[local.ts]
    B -->|Remote| D[browserbase.ts]
    C --> E[Playwright Browser]
    D --> F[CDP Endpoint]

The local launch implementation uses Playwright's browser management system to spawn Chromium instances with the necessary debugging flags enabled. Remote launches establish WebSocket connections to Browserbase's cloud browser fleet.

Sources: packages/core/lib/v3/launch/local.ts:1-100

Session Manager

The Session Manager (context.ts) maintains active browser pages and their associated CDP sessions.

// Pages retrieval - returns top-level pages oldest to newest
pages(): Page[] {
  const rows: Array<{ tid: TargetId; page: Page; created: number }> = [];
  for (const [tid, page] of this.pagesByTarget) {
    if (this.typeByTarget.get(tid) === "page") {
      rows.push({ tid, page, created: this.createdAtByTarget.get(tid) ?? 0 });
    }
  }
  rows.sort((a, b) => a.created - b.created);
  return rows.map((r) => r.page);
}

The session manager intentionally excludes OOPIF (Out-of-Process iframe) targets from page listings, focusing on top-level browsing contexts.

Sources: packages/core/lib/v3/understudy/context.ts:60-75

Frame Registry

The Frame Registry (frameRegistry.ts) tracks all frame instances across page hierarchies. It maintains the relationship between parent and child frames, enabling accurate frame targeting for CDP commands.

Method	Purpose
`registerFrame`	Track newly created frames
`unregisterFrame`	Clean up destroyed frames
`getParentFrame`	Resolve parent frame context
`getChildFrames`	List direct children of a frame
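
A minimal sketch of a registry with these four methods is shown below. The method names come from the table, but the signatures, the single parent map, and the choice to drop descendants along with a removed frame are assumptions; the real frameRegistry.ts may differ.

```typescript
type FrameId = string;

class FrameRegistry {
  // Maps each known frame to its parent; the main frame maps to null.
  private readonly parentOf = new Map<FrameId, FrameId | null>();

  registerFrame(frameId: FrameId, parentId: FrameId | null = null): void {
    this.parentOf.set(frameId, parentId);
  }

  // Removes the frame and, by assumption, all of its descendants.
  unregisterFrame(frameId: FrameId): void {
    for (const child of this.getChildFrames(frameId)) {
      this.unregisterFrame(child);
    }
    this.parentOf.delete(frameId);
  }

  // Returns null for the main frame and for unknown frames.
  getParentFrame(frameId: FrameId): FrameId | null {
    return this.parentOf.get(frameId) ?? null;
  }

  // Direct children only, in registration order.
  getChildFrames(frameId: FrameId): FrameId[] {
    const children: FrameId[] = [];
    this.parentOf.forEach((parent, id) => {
      if (parent === frameId) children.push(id);
    });
    return children;
  }
}
```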

Execution Context Registry

The Execution Context Registry (executionContextRegistry.ts) manages JavaScript execution contexts within frames.

graph TD
    A[Page Load] --> B[Create Main World Context]
    B --> C[Optional: Create Isolated World]
    C --> D[Register Context in Registry]
    D --> E[Frame Ready for Script Execution]
    
    F[iframe Navigation] --> G[Create New Context]
    G --> D

Each frame can have multiple execution contexts:

Main World: The default JavaScript context where page scripts run
Isolated World: Sandboxed contexts for extension scripts or injected code

Sources: packages/core/lib/v3/understudy/executionContextRegistry.ts:1-80

Frame Lifecycle Management

The frameLocator.ts module handles the complex state transitions of browser frames, particularly for iframes and cross-origin navigations.

const onLifecycle = (evt: Protocol.Page.LifecycleEventEvent) => {
  if (
    evt.frameId !== childFrameId ||
    (evt.name !== "DOMContentLoaded" &&
      evt.name !== "load" &&
      evt.name !== "networkIdle" &&
      evt.name !== "networkidle")
  ) {
    return;
  }
  // Handle frame initialization
};

Key lifecycle events monitored:

DOMContentLoaded: Frame DOM is parsed
load: All resources loaded
networkIdle / networkidle: No pending network requests

Sources: packages/core/lib/v3/understudy/frameLocator.ts:25-40

The cookies.ts module provides CDP-compatible cookie handling with validation and normalization.

export function normalizeCookieParams(cookies: CookieParam[]): CookieParam[] {
  return cookies.map((c) => {
    // CDP needs either a url or an explicit domain/path pair to scope the cookie
    if (!c.url && !(c.domain && c.path)) {
      throw new CookieValidationError(
        `Cookie "${c.name}" must have a url or a domain/path pair`
      );
    }
    // Additional validation for secure/sameSite pairing
    return c; // without a return, map() would yield an array of undefined
  });
}

Validation Rule	CDP Requirement
URL or Domain/Path	Cookies must have either `url` or both `domain`+`path`
Secure + SameSite=None	Browsers require `secure: true` when using `sameSite: "None"`
Host-only cookies	When `url` provided, `domain` and `path` are derived

The module enforces these constraints before sending cookies to CDP, preventing silent failures from browser rejections.

Sources: packages/core/lib/v3/understudy/cookies.ts:30-60

CDP Command Dispatch

The CDP Engine provides a unified interface for sending commands to browsers:

sequenceDiagram
    Client->>CDP Engine: Execute CDP Command
    CDP Engine->>Validation: Check Parameters
    Validation->>Session Manager: Route to Correct Session
    Session Manager->>CDP WebSocket: Send Command
    CDP WebSocket-->>Session Manager: CDP Response
    Session Manager-->>CDP Engine: Typed Response
    CDP Engine-->>Client: Result
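
The request/response leg of this sequence relies on CDP's message format: each command carries a numeric `id`, and the browser echoes that `id` in the matching response, while events carry no `id`. The sketch below shows that correlation with an injected transport function. `CdpDispatcher` and its members are hypothetical names and do not reflect the structure of cdp.ts.

```typescript
interface CdpRequest {
  id: number;
  method: string;
  params?: Record<string, unknown>;
  sessionId?: string;
}

interface CdpResponse {
  id: number;
  result?: unknown;
  error?: { code: number; message: string };
}

class CdpDispatcher {
  private nextId = 1;
  private readonly pending = new Map<
    number,
    { resolve: (value: unknown) => void; reject: (error: Error) => void }
  >();

  // `transmit` stands in for the WebSocket's send function.
  constructor(private readonly transmit: (raw: string) => void) {}

  // Sends a command and resolves when the response with the same id arrives.
  send(method: string, params?: Record<string, unknown>, sessionId?: string): Promise<unknown> {
    const id = this.nextId++;
    const request: CdpRequest = { id, method, params, sessionId };
    return new Promise<unknown>((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      this.transmit(JSON.stringify(request));
    });
  }

  // Feeds one incoming WebSocket message into the dispatcher.
  handleMessage(raw: string): void {
    const message = JSON.parse(raw) as Partial<CdpResponse>;
    if (typeof message.id !== "number") return; // events have no id
    const entry = this.pending.get(message.id);
    if (!entry) return;
    this.pending.delete(message.id);
    if (message.error) entry.reject(new Error(message.error.message));
    else entry.resolve(message.result);
  }
}
```

Injecting `transmit` keeps the correlation logic independent of any particular WebSocket library, which also makes it straightforward to test with an in-memory array.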

Supported Command Categories

Category	Capabilities
Page Operations	Navigation, reload, back/forward
Frame Management	Frame creation, destruction, activation
Script Execution	Evaluate expressions, call functions
Input Handling	Mouse, keyboard, touch events
Network Monitoring	Request/response capture
Runtime Inspection	Console access, breakpoints

Initialization Scripts

The CDP Engine supports injecting initialization scripts into pages:

private async applyInitScriptsToPage(
  page: Page,
  opts?: { seedOnly?: boolean },
): Promise<void> {
  if (opts?.seedOnly) {
    for (const source of this.initScripts) {
      page.seedInitScript(source);
    }
    return;
  }
  for (const source of this.initScripts) {
    await page.registerInitScript(source);
  }
}

Script Type	Timing	Use Case
`seedInitScript`	Before page loads	Pre-inject polyfills, shims
`registerInitScript`	After page load	Runtime modifications

Error Handling

The CDP Engine collects per-session failures and reports them through typed errors:

// Session-level error collection
const failures = pendingEntries
  .filter((entry) => entry.result.status === "failure")
  .map((entry) => {
    const reason = entry.result.reason as Error;
    const sid = entry.session.id ?? "unknown";
    const message = reason?.message ?? String(reason);
    return `session=${sid} error=${message}`;
  });

if (failures.length) {
  throw new StagehandSetExtraHTTPHeadersError(failures);
}

Custom error types:

StagehandSetExtraHTTPHeadersError: Failed header injection
CookieValidationError: Invalid cookie parameters
ExecutionContextError: Script execution failures

Sources: packages/core/lib/v3/understudy/context.ts:45-55

Integration with Agent System

The CDP Engine is used by higher-level agents (like MicrosoftCUAClient) to perform browser automation:

// Supported actions in Microsoft CUA integration
const supportedActions = [
  "left_click",
  "scroll",
  "visit_url",
  "web_search",
  "history_back",
  "pause_and_memorize_fact",
  "wait",
  "terminate",
];

Each action maps to specific CDP commands, with the CDP Engine abstracting the protocol-level details.

Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:50-70

Configuration

Option	Type	Default	Description
`headless`	boolean	auto	Run browser in headless mode
`timeout`	number	30000	Default command timeout (ms)
`viewport`	Viewport	1280x720	Browser viewport size
`userAgent`	string	auto	Custom user agent string
`ignoreHTTPSErrors`	boolean	false	Allow invalid certificates

Summary

The CDP Engine is the foundational layer that enables Stagehand's browser automation capabilities. It provides:

Abstraction: Type-safe interfaces over raw CDP WebSocket communication
Multi-environment: Support for both local and cloud browser execution
Lifecycle management: Frame and execution context tracking
Validation: Cookie and parameter normalization before CDP commands
Error handling: Structured error types and recovery mechanisms

All high-level browser automation features in Stagehand build upon the CDP Engine's primitives, making it essential for understanding the framework's internals.


Core Actions

Related topics: CDP Engine, DOM and Accessibility Tree, Agent System

Overview

Core Actions are the fundamental building blocks of browser automation in Stagehand. They represent the primitive operations that the AI agent can perform to interact with web pages, extract information, and accomplish user-defined tasks. These actions bridge the gap between natural language instructions and executable browser operations.

The Core Actions system is designed around a modular architecture where different handlers manage specific types of interactions. Each handler is responsible for a category of actions, implementing the logic to translate high-level agent decisions into low-level browser commands.

Sources: packages/core/lib/v3/handlers/v3AgentHandler.ts

Architecture

The Core Actions system follows a layered architecture:

graph TD
    A[User Instruction] --> B[Agent Handler]
    B --> C[Action Router]
    C --> D[actHandler]
    C --> E[extractHandler]
    C --> F[observeHandler]
    D --> G[Browser CDP Commands]
    E --> G
    F --> G
    H[Microsoft CUA Client] --> G

Handler Components

Handler	Purpose	Key Operations
`actHandler`	Execute user interaction actions	click, type, scroll, hover, drag
`extractHandler`	Extract structured data from pages	DOM parsing, content extraction
`observeHandler`	Analyze and understand page state	accessibility tree, element detection
`v3AgentHandler`	Orchestrate agent workflow	task planning, action coordination

Sources: packages/core/lib/v3/handlers/actHandler.ts

Action Types

Browser Navigation Actions

Action	Parameters	Description
`visit_url`	`url`, `wait`	Navigate to a specified URL with optional wait state
`history_back`	-	Navigate back in browser history
`history_forward`	-	Navigate forward in browser history
`reload`	-	Reload the current page

Interaction Actions

Action	Parameters	Description
`left_click`	`coordinate`, `element`	Click at coordinates or on an element
`right_click`	`coordinate`, `element`	Perform right-click context menu action
`mouse_move`	`coordinate`	Move mouse to specified position
`scroll`	`pixels`, `coordinate`	Scroll by pixel amount (positive=up, negative=down)
`drag`	`from`, `to`, `steps`	Perform drag operation with interpolation
`type`	`text`, `press_enter`, `delete_existing_text`	Type text into focused input fields

Information Retrieval Actions

Action	Parameters	Description
`extract`	`schema`	Extract structured data matching a Zod schema
`screenshot`	`full_page`	Capture visual representation of page
`aria_tree`	`ref`	Get accessibility tree for element inspection

Utility Actions

Action	Parameters	Description
`wait`	`time`	Pause execution for specified seconds
`pause_and_memorize_fact`	`fact`	Store information for later retrieval
`web_search`	`query`	Execute a web search
`terminate`	`status`	End the task with success or failure status

Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:1-100

Action Parameters

Common Parameters

interface ActionParameters {
  action: string;           // The action type to perform
  element?: string;          // Element reference (e.g., "@0-5")
  coordinate?: [number, number]; // [x, y] pixel coordinates
  ref?: string;             // Element reference in ariaTree
}

Type Action Parameters

Parameter	Type	Description
`text`	`string`	Text to type into the input field
`press_enter`	`boolean`	Whether to press Enter after typing
`delete_existing_text`	`boolean`	Clear existing text before typing

Scroll Action Parameters

Parameter	Type	Description
`pixels`	`number`	Positive values scroll up, negative scroll down
`coordinate`	`[number, number]`	Optional target coordinates for viewport-relative scrolling

Extract Action Parameters

Parameter	Type	Description
`schema`	`ZodSchema`	Zod schema defining the extraction structure
`prompt`	`string`	Natural language description of data to extract

Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:20-85
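The parameter tables above can be modeled as a discriminated union. The sketch below is illustrative only: the type names (`TypeAction`, `ScrollAction`, `AgentAction`) and the helpers `toWheelDeltaY` and `describe` are assumptions, not types or functions exported by Stagehand. It shows how the documented sign convention (positive `pixels` scrolls up) maps onto a wheel delta, where a positive `deltaY` scrolls down.

```typescript
// Hypothetical types mirroring the parameter tables above; not Stagehand's actual exports.
type Coordinate = [number, number];

interface TypeAction {
  action: "type";
  text: string;
  press_enter?: boolean;
  delete_existing_text?: boolean;
}

interface ScrollAction {
  action: "scroll";
  pixels: number; // positive scrolls up, negative scrolls down
  coordinate?: Coordinate;
}

type AgentAction = TypeAction | ScrollAction;

// Wheel events use the opposite sign: a positive deltaY scrolls down.
function toWheelDeltaY(scroll: ScrollAction): number {
  return scroll.pixels === 0 ? 0 : -scroll.pixels;
}

// Produces a short human-readable summary of an action.
function describe(a: AgentAction): string {
  switch (a.action) {
    case "type":
      return `type "${a.text}"${a.press_enter ? " then Enter" : ""}`;
    case "scroll":
      return `scroll ${a.pixels >= 0 ? "up" : "down"} by ${Math.abs(a.pixels)}px`;
  }
}
```

Keeping the sign flip in one helper avoids scattering the "positive means up" rule across call sites.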

Tool Schema

The Microsoft CUA client uses the FARA function calling template for tool execution:

You are provided with function signatures within <tools></tools> XML tags:
<tools>
${toolSchema}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{{"name": <function-name>, "arguments": <args-json-object>}}
</tool_call>

The tool schema defines the complete contract between the agent reasoning system and the browser automation layer:

{
  "name": "browser_automation",
  "description": "Primary tool for browser automation tasks",
  "parameters": {
    "type": "object",
    "properties": {
      "action": {
        "type": "string",
        "enum": ["left_click", "scroll", "visit_url", ...]
      },
      "coordinate": {
        "type": "array",
        "description": "(x, y): The x and y coordinates for mouse operations"
      },
      "pixels": {
        "type": "number",
        "description": "Scroll amount; positive = up, negative = down"
      }
    },
    "required": ["action"]
  }
}

Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:85-130
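Because the schema is the contract between the model and the automation layer, it can be useful to check a model-produced argument object against it before executing anything. The following is a minimal sketch under stated assumptions: `ToolSchema` and `validateArgs` are hypothetical names, and the `enum` list contains only the three values visible in the excerpt above (the source elides the rest).

```typescript
// Minimal, hypothetical subset of the schema shown above.
interface ToolSchema {
  name: string;
  parameters: {
    properties: Record<string, { type: string; enum?: string[] }>;
    required: string[];
  };
}

const schema: ToolSchema = {
  name: "browser_automation",
  parameters: {
    properties: {
      action: { type: "string", enum: ["left_click", "scroll", "visit_url"] },
      coordinate: { type: "array" },
      pixels: { type: "number" },
    },
    required: ["action"],
  },
};

// Returns a list of human-readable problems; an empty list means the call is acceptable.
function validateArgs(s: ToolSchema, args: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const key of s.parameters.required) {
    if (!(key in args)) problems.push(`missing required "${key}"`);
  }
  for (const [key, value] of Object.entries(args)) {
    const prop = s.parameters.properties[key];
    if (!prop) {
      problems.push(`unknown parameter "${key}"`);
      continue;
    }
    const actual = Array.isArray(value) ? "array" : typeof value;
    if (actual !== prop.type) problems.push(`"${key}" should be ${prop.type}`);
    if (prop.enum && typeof value === "string" && !prop.enum.includes(value)) {
      problems.push(`"${key}" must be one of ${prop.enum.join(", ")}`);
    }
  }
  return problems;
}
```

Returning a list rather than throwing lets the caller feed every problem back to the model in a single turn.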

Execution Flow

sequenceDiagram
    participant User
    participant Agent
    participant Handler
    participant CDP as Chrome DevTools Protocol
    
    User->>Agent: Natural language instruction
    Agent->>Agent: Reason and plan action
    Agent->>Handler: Execute action with parameters
    Handler->>CDP: Translate to CDP command
    CDP-->>Handler: Operation result
    Handler-->>Agent: Action completion status
    Agent->>User: Task progress update
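
The sequence above amounts to a plan-execute loop. The sketch below is a simplified illustration, not Stagehand's agent loop: `Planner`, `Handler`, and `runAgent` are hypothetical names, and the loop is synchronous only so the control flow is easy to follow (real handlers talk to CDP asynchronously).

```typescript
// Hypothetical shapes; the real agent and handlers are asynchronous and far richer.
interface PlannedAction {
  action: string;
  args: Record<string, unknown>;
}

interface StepResult {
  ok: boolean;
  detail: string;
}

type Planner = (instruction: string, history: StepResult[]) => PlannedAction;
type Handler = (planned: PlannedAction) => StepResult;

function runAgent(
  instruction: string,
  plan: Planner,
  handle: Handler,
  maxSteps = 10,
): StepResult[] {
  const history: StepResult[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const planned = plan(instruction, history); // reason and plan the next action
    if (planned.action === "terminate") break; // the agent decided the task is finished
    history.push(handle(planned)); // execute and record the completion status
  }
  return history;
}
```

The `maxSteps` bound is a safeguard assumed here so that a planner that never emits `terminate` cannot loop forever.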

Page Context Management

The Context class manages page state and frame handling for multi-tab scenarios:

pages(): Page[] {
  const rows: Array<{ tid: TargetId; page: Page; created: number }> = [];
  for (const [tid, page] of this.pagesByTarget) {
    if (this.typeByTarget.get(tid) === "page") {
      rows.push({ tid, page, created: this.createdAtByTarget.get(tid) ?? 0 });
    }
  }
  rows.sort((a, b) => a.created - b.created);
  return rows.map((r) => r.page);
}

The context system maintains:

Active page instances by target ID
Frame lifecycle management
Execution context tracking per frame
Initialization script injection

Sources: packages/core/lib/v3/understudy/context.ts:1-50

Frame Locator

For handling nested frames and iframes, the frame locator system waits for frame readiness:

graph TD
    A[Parent Page] --> B{Has Main World?}
    B -->|No| C[Enable Lifecycle Events]
    C --> D[Wait for DOMContentLoaded]
    D --> E{Has Main World?}
    E -->|Yes| F[Continue]
    E -->|No| G[Wait with timeout]
    G --> H[Get Session for Frame]
    H --> I[Wait for Main World]
    I --> F
    B -->|Yes| F

Sources: packages/core/lib/v3/understudy/frameLocator.ts

System Prompt Configuration

The agent's behavior is configured through system prompts that include:

Task Definition: The execution instruction and goal context
Page Understanding Protocol: When to use ariaTree vs screenshot
Strategy Guidelines: Best practices for action sequencing
Variable Substitution: Support for dynamic values via %variableName% syntax

const systemPrompt = `<system>
  <identity>You are a web automation assistant using browser automation tools</identity>
  <task>
    <goal>${cdata(executionInstruction)}</goal>
    <date display="local" iso="${isoDate}">${localeDate}</date>
  </task>
  ${customInstructionsBlock}
  <page_understanding_protocol>
    ${isHybridMode ? screenshot + ariaTree : ariaTree + screenshot}
  </page_understanding_protocol>
  ${variablesSection}
</system>`;

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts

Error Handling

Actions can fail for several reasons:

Error Type	Cause	Recovery Strategy
Element not found	Selector invalid or element removed	Re-observe page, retry with updated selector
Action timeout	Page not responding	Wait and retry
Frame detached	Frame navigated away	Re-acquire frame context
CDP error	Protocol communication failure	Retry CDP command

Failed actions are logged with session context for debugging:

const failures = await Promise.allSettled(
  this.context.extraHTTPHeaders.map(async ({ entry }) => {
    const result = await entry.result;
    const reason = entry.result.reason as Error;
    const message = reason?.message ?? String(reason);
    return `session=${sid} error=${message}`;
  })
);

Sources: packages/core/lib/v3/understudy/context.ts:50-80
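To make the `session=… error=…` log format concrete, here is a small self-contained sketch that turns the output of `Promise.allSettled` into failure lines. `summarizeFailures` and its parameters are hypothetical and are not taken from `context.ts`; only the message format follows the excerpt above.

```typescript
// Hypothetical helper: formats rejected results in the "session=... error=..." style shown above.
function summarizeFailures(
  sessionIds: string[],
  results: PromiseSettledResult<unknown>[],
): string[] {
  const lines: string[] = [];
  results.forEach((result, i) => {
    if (result.status !== "rejected") return; // fulfilled results are not failures
    const reason: unknown = result.reason;
    const message = reason instanceof Error ? reason.message : String(reason);
    lines.push(`session=${sessionIds[i] ?? "unknown"} error=${message}`);
  });
  return lines;
}
```

Results are matched to sessions by index, which works because `Promise.allSettled` preserves the order of its input.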

Best Practices

Atomic Actions: Keep actions focused and single-purpose for better reliability
Element References: Use ariaTree references (e.g., @0-5) when available for precise targeting
Wait Appropriately: Allow page state to stabilize before subsequent actions
Error Recovery: The system supports automatic retry and self-healing mechanisms
Variable Usage: Use %variableName% for sensitive data like passwords

DOM and Accessibility Tree

Related topics: CDP Engine, Core Actions, DOM and Accessibility Tree

Overview

The DOM and Accessibility Tree system in Stagehand provides the foundational mechanism for understanding and interacting with web pages. This dual-layer approach combines native DOM manipulation with accessibility tree analysis to enable reliable browser automation through AI agents.

The system serves two primary purposes:

DOM Interaction - Direct manipulation of page elements including clicking, typing, scrolling, and form handling
Accessibility Tree (ariaTree) - Structured representation of page content optimized for AI comprehension and element discovery

This architecture allows Stagehand to balance the precision of programmatic DOM access with the semantic understanding provided by accessibility APIs.

Source: https://github.com/browserbase/stagehand / Human Manual

LLM Providers

Related topics: Project Introduction, Agent System

Overview

Stagehand implements a flexible LLM Provider abstraction that enables browser automation agents to interact with various large language model backends. The provider system is designed to standardize how agents execute actions, parse responses, and handle tool invocations across different AI service providers.

The LLM Provider architecture follows a client-adapter pattern where each provider (OpenAI, Anthropic, Google) implements a common interface while exposing provider-specific capabilities and response formats.

Architecture

High-Level Component Flow

graph TD
    A[Agent Client] --> B[LLM Provider Interface]
    B --> C[OpenAI Client]
    B --> D[Anthropic Client]
    B --> E[Google Client]
    C --> F[OpenAI API]
    D --> G[Claude API]
    E --> H[Gemini API]
    F --> I[Action Execution]
    G --> I
    H --> I

System Prompt Construction

The agent system prompt is constructed dynamically based on execution context and mode. The buildAgentSystemPrompt function in agentSystemPrompt.ts assembles the prompt from multiple modular sections:

return `<system>
  <identity>You are a web automation assistant using browser automation tools to accomplish the user's goal.</identity>
  ${customInstructionsBlock}<task>
    <goal>${cdata(executionInstruction)}</goal>
    <date display="local" iso="${isoDate}">${localeDate}</date>
  </task>
  ...
</system>`;

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:1-50
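The template relies on a `cdata` helper so that user-supplied text cannot break the XML-like structure. The sketch below shows one common way to write such a helper and a reduced prompt builder. Both functions are assumptions for illustration: the real `cdata` and `buildAgentSystemPrompt` may differ, and the `<customInstructions>` tag name is invented here.

```typescript
// Hypothetical re-creation of a cdata() helper; the real one in agentSystemPrompt.ts may differ.
function cdata(text: string): string {
  // "]]>" would end the section early, so split it across two CDATA sections.
  return `<![CDATA[${text.split("]]>").join("]]]]><![CDATA[>")}]]>`;
}

// Reduced prompt builder covering only the identity-free task portion of the template.
function buildPrompt(goal: string, now: Date, customInstructions?: string): string {
  const custom = customInstructions
    ? `<customInstructions>${cdata(customInstructions)}</customInstructions>\n  `
    : "";
  return `<system>
  ${custom}<task>
    <goal>${cdata(goal)}</goal>
    <date iso="${now.toISOString()}"/>
  </task>
</system>`;
}
```

Wrapping the goal in CDATA means an instruction containing `<` or `&` is passed through verbatim instead of being read as markup.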

Provider Interface

Tool Schema Definition

LLM Providers use a standardized tool schema format for function calling. The schema defines available browser automation actions:

const toolSchema = {
  properties: {
    action: {
      type: "string",
      description: "The action to perform",
      enum: ["screenshot", "extract", "click", "type", "wait", "scroll", 
             "hover", "press", "goBack", "goForward", "executeJs", 
             "fillForm", "fillFormVision", "act", "ariaTree", 
             "pause_and_memorize_fact", "terminate"],
    },
    // Additional parameters...
  },
};

Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:1-100

Function Call Template

Providers implement XML-based tool calling using a standardized template:

<tools>
${toolDescs}
</tools>

<tool_call>
{{"name": <function-name>, "arguments": <args-json-object>}}
</tool_call>

This format enables reliable parsing of model responses across different LLM backends.

Supported Providers

Microsoft Computer Use Agent (CUA)

The Microsoft CUA client implements the FARA function calling pattern with XML-based tool calling:

Parameter	Type	Description
`action`	string	Action type to execute
`selector`	string	Element selector for DOM operations
`text`	string	Text content for typing or extraction
`reasoning`	string	Model's reasoning for the action
`time`	number	Wait duration in seconds
`status`	string	Task completion status

Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:80-120

Agent Modes

Stagehand supports different operational modes that affect how the agent interacts with LLM providers:

Hybrid Mode

In hybrid mode, the agent prioritizes visual understanding:

const pageUnderstandingProtocol = isHybridMode
  ? `<page_understanding_protocol>
    <step_1>
      <primary_tool>
        <name>screenshot</name>
        <usage>Visual confirmation when needed</usage>
      </primary_tool>
      <secondary_tool>
        <name>ariaTree</name>
        <usage>Get complete page context before taking actions</usage>
      </secondary_tool>
    </step_1>
  </page_understanding_protocol>`
  : `<page_understanding_protocol>
    <step_1>
      <primary_tool>
        <name>ariaTree</name>
        <usage>Get complete page context before taking actions</usage>
      </primary_tool>
      <secondary_tool>
        <name>screenshot</name>
        <usage>Visual confirmation when needed</usage>
      </secondary_tool>
    </step_1>
  </page_understanding_protocol>`;

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:100-130

Variable Substitution

Providers support variable substitution in tool parameters for sensitive data:

const variableToolsNote = isHybridMode
  ? "Use %variableName% syntax in the type, fillFormVision, or act tool's value/text/action fields."
  : "Use %variableName% syntax in the act or fillForm tool's action fields.";

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:60-65
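The idea behind `%variableName%` is that the model only ever sees the placeholder, and the real value is filled in just before the action runs. The following is a minimal sketch of that substitution step. `substituteVariables` is a hypothetical helper written for this manual; Stagehand's own implementation is not shown here and may treat unknown names differently.

```typescript
// Hypothetical substitution helper; Stagehand's own implementation is not shown in this manual.
function substituteVariables(
  template: string,
  variables: Record<string, string>,
): string {
  return template.replace(
    /%([A-Za-z_][A-Za-z0-9_]*)%/g,
    (match: string, name: string) =>
      // Unknown names are left untouched so a typo stays visible in the output.
      Object.prototype.hasOwnProperty.call(variables, name)
        ? variables[name] ?? match
        : match,
  );
}
```

For example, `substituteVariables("type %password% into the password field", { password: "s3cret" })` yields the instruction with the secret filled in, while the prompt sent to the model keeps the placeholder.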

Tool Actions

Available Browser Automation Actions

Action	Description	Primary Use Case
`screenshot`	Capture page screenshot	Visual verification
`extract`	Extract structured data	Data collection tasks
`click`	Click on element	Navigation/interaction
`type`	Type text into input	Form filling
`wait`	Wait for condition	Synchronization
`scroll`	Scroll the page	Content visibility
`ariaTree`	Get accessibility tree	Page structure understanding
`fillForm`	Fill form fields	Multi-field form completion

Response Parsing

Thoughts and Action Extraction

The provider implements parsing logic to extract model thoughts and function calls from responses:

private parseThoughtsAndAction(response: string): {
  thoughts: string;
  functionCall: FaraFunctionCall;
} {
  try {
    const parts = response.split("<tool_call>\n");
    const thoughts = parts[0].trim();
    const actionText = parts[1]?.trim() ?? "";
    // Parse JSON action from actionText
  }
}

Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:150-180
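The excerpt above stops before the JSON is parsed. A self-contained sketch of the complete step might look like the following. It is an assumption-laden illustration rather than the client's actual code: `FunctionCall` stands in for `FaraFunctionCall`, and the error messages and the tolerance for a missing closing tag are choices made here.

```typescript
// Hypothetical stand-in for the FaraFunctionCall type referenced above.
interface FunctionCall {
  name: string;
  arguments: Record<string, unknown>;
}

function parseThoughtsAndAction(response: string): {
  thoughts: string;
  functionCall: FunctionCall;
} {
  const open = response.indexOf("<tool_call>");
  if (open === -1) throw new Error("response contains no <tool_call> block");
  const close = response.indexOf("</tool_call>", open);
  // Everything before the tag is the model's free-form reasoning.
  const thoughts = response.slice(0, open).trim();
  const body = response
    .slice(open + "<tool_call>".length, close === -1 ? undefined : close)
    .trim();
  const parsed: unknown = JSON.parse(body); // throws SyntaxError on malformed JSON
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("tool call must be a JSON object");
  }
  const { name, arguments: args } = parsed as { name?: unknown; arguments?: unknown };
  if (typeof name !== "string") throw new Error("tool call is missing a string name");
  const safeArgs =
    typeof args === "object" && args !== null ? (args as Record<string, unknown>) : {};
  return { thoughts, functionCall: { name, arguments: safeArgs } };
}
```

Searching for the tag with `indexOf` instead of splitting on `"<tool_call>\n"` tolerates responses where the model omits the newline after the opening tag.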

Context Management

LLM Providers operate within a browser context that manages multiple pages and execution environments:

pages(): Page[] {
  const rows: Array<{ tid: TargetId; page: Page; created: number }> = [];
  for (const [tid, page] of this.pagesByTarget) {
    if (this.typeByTarget.get(tid) === "page") {
      rows.push({ tid, page, created: this.createdAtByTarget.get(tid) ?? 0 });
    }
  }
  rows.sort((a, b) => a.created - b.created);
  return rows.map((r) => r.page);
}

Sources: packages/core/lib/v3/understudy/context.ts:80-95

Security Considerations

Cookie parameters are validated before they are applied to the browser context:

export function normalizeCookieParams(cookies: CookieParam[]): CookieParam[] {
  return cookies.map((c) => {
    if (!c.url && !(c.domain && c.path)) {
      throw new CookieValidationError(
        `Cookie "${c.name}" must have a url or a domain/path pair`,
      );
    }
    // Validates secure flag for sameSite: "None"
  });
}

Sources: packages/core/lib/v3/understudy/cookies.ts:50-75
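The excerpt elides the `sameSite` check. As a rough, self-contained illustration of both validations, consider the sketch below. The trimmed `CookieParam` shape, the function name `validateCookies`, and the second error message are assumptions; only the first check and its message come from the excerpt above.

```typescript
// Hypothetical, trimmed-down shapes; the real CookieParam has more fields.
interface CookieParam {
  name: string;
  value: string;
  url?: string;
  domain?: string;
  path?: string;
  secure?: boolean;
  sameSite?: "Strict" | "Lax" | "None";
}

class CookieValidationError extends Error {}

function validateCookies(cookies: CookieParam[]): CookieParam[] {
  return cookies.map((c) => {
    if (!c.url && !(c.domain && c.path)) {
      throw new CookieValidationError(
        `Cookie "${c.name}" must have a url or a domain/path pair`,
      );
    }
    // Browsers reject SameSite=None cookies that are not marked Secure.
    if (c.sameSite === "None" && !c.secure) {
      throw new CookieValidationError(
        `Cookie "${c.name}" has sameSite "None" but is not secure`,
      );
    }
    return { ...c }; // return a copy so callers' objects are not shared
  });
}
```

Failing early with a descriptive error is preferable to letting the browser silently drop the cookie.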

Configuration Options

System Prompt Variables

Variable	Type	Description
`executionInstruction`	string	User's task description
`url`	string	Starting page URL
`isoDate`	string	Current ISO date
`localeDate`	string	Localized date string
`variables`	object	Key-value pairs for substitution
`captchasAutoSolve`	boolean	Enable CAPTCHA auto-solving

Strategy Components

Common Strategy Items

The system prompt includes standardized guidance for all providers:

const commonStrategyItems = `
  <item>CRITICAL: Use extract ONLY when the task explicitly requires structured data output.</item>
  <item>Keep actions atomic and verify outcomes before proceeding.</item>
  <item>For each action, provide clear reasoning about why you're taking that step.</item>
  <item>When you need to input text that could be entered character-by-character or through multiple separate inputs, prefer using the keys tool.</item>
`;

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:130-145

Frame and Context Handling

Execution Context Management

Providers handle frame navigation and context switching:

await parentSession
  .send("Page.setLifecycleEventsEnabled", { enabled: true })
  .catch(() => {});
await parentSession.send("Runtime.enable").catch(() => {});

Events monitored include:

DOMContentLoaded
load
networkIdle
networkidle

Sources: packages/core/lib/v3/understudy/frameLocator.ts:30-45

Agent System

Related topics: LLM Providers, Core Actions

Stagehand is an AI-powered browser automation framework that enables developers to control web browsers using natural language instructions combined with precise code control. The Agent System is the core intelligence layer that coordinates AI-driven decision making with browser operations.

Overview

The Agent System serves as the orchestration layer between Large Language Models (LLMs) and browser automation. It provides:

AI-driven navigation: Autonomous page understanding and navigation
Action execution: Intelligent element detection and interaction
Multi-modal page analysis: Combining visual screenshots with accessibility tree data
Self-healing capabilities: Automatic recovery from session failures
Variable substitution: Secure handling of sensitive data like passwords

Sources: packages/core/README.md

Architecture

graph TB
    subgraph "Agent System"
        A[User Instruction] --> B[AgentClient]
        B --> C[System Prompt Generator]
        C --> D[LLM Provider]
        D --> E[Action Planner]
    end
    
    subgraph "Browser Layer"
        E --> F[Act Tool]
        E --> G[Screenshot Tool]
        E --> H[AriaTree Tool]
        E --> I[Extract Tool]
    end
    
    subgraph "Execution"
        F --> J[Stagehand Context]
        G --> J
        H --> J
        I --> J
        J --> K[Browser Page]
    end

Core Components

AgentClient

The AgentClient is the main entry point for agent-based browser automation. It handles:

Initialization of AI providers (Anthropic, Google)
System prompt construction with context-aware instructions
Action execution loop with retry logic
Session state management

Sources: packages/core/README.md

AI Provider Clients

Stagehand supports multiple AI providers through specialized client implementations:

Provider	Client Class	Model Type
Anthropic	`AnthropicCUAClient`	Claude Computer Use
Google	`GoogleCUAClient`	Gemini Computer Use

Each client implements a unified interface for:

Action inference from LLM responses
Tool call execution
Error handling and recovery

Tool System

The Agent System exposes several tools for browser interaction:

Tool	Purpose	Primary Use Case
`act`	Execute browser actions	Clicking, typing, scrolling
`screenshot`	Capture visual page state	Visual confirmation
`ariaTree`	Extract accessibility tree	Page structure analysis
`extract`	Structured data extraction	Data collection tasks

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:1-50

System Prompt Architecture

The system prompt is dynamically generated based on configuration and execution mode. It uses XML-like tags to structure instructions for the LLM.

graph TD
    A[Base Prompt Template] --> B[Identity Section]
    A --> C[Task Section]
    A --> D[Page Section]
    A --> E[Mindset Section]
    A --> F[Guidelines Section]
    A --> G[Tools Section]
    A --> H[Strategy Section]
    A --> I[Roadblocks Section]
    A --> J[Variables Section]
    
    B --> K[Generated System Prompt]
    C --> K
    D --> K
    E --> K
    F --> K
    G --> K
    H --> K
    I --> K
    J --> K

Key Prompt Sections

Page Understanding Protocol

The agent uses different page understanding strategies based on the execution mode:

Hybrid Mode (default):

Primary tool: screenshot for visual confirmation
Secondary tool: ariaTree for complete page context

Standard Mode:

Primary tool: ariaTree for accessibility tree
Secondary tool: screenshot for visual confirmation

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:100-130

Strategy Guidelines

The system prompt includes critical guidelines:

Use extract ONLY when structured data output is explicitly required
Keep actions atomic and verify outcomes before proceeding
Use keys tool for text input that requires character-by-character entry
Prefer act for direct element interaction when available in ariaTree

Variables Handling

Sensitive data can be passed securely using variable substitution:

Variable Syntax: %variableName%

Tools that support variable substitution depend on the mode:

Hybrid mode: the value, text, or action fields of the type, fillFormVision, and act tools
Standard mode: the action fields of the act and fillForm tools

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:60-90

Execution Modes

Local Mode

Runs Chrome directly on the local machine. Best for:

Local debugging and development
Fast iteration cycles
Access to local network resources

Remote Mode

Runs a Browserbase session in the cloud. Best for:

Anti-bot hardening
Cloud deployments
Scalable agent execution

Sources: packages/cli/README.md

Context Management

The StagehandContext class manages the underlying browser state:

// Page retrieval - returns pages oldest to newest
pages(): Page[]

// Init script management
applyInitScriptsToPage(page: Page, opts?: { seedOnly?: boolean }): Promise<void>

// Session error tracking
handleSessionErrors(): void

Page targets are filtered to exclude OOPIF (out-of-process iframe) targets, so only top-level pages are returned.

Sources: packages/core/lib/v3/understudy/context.ts:1-80

Frame Handling

The Agent System handles multi-frame page scenarios through the frameLocator module:

Lifecycle event monitoring: Waits for DOMContentLoaded, load, or networkIdle events
Frame context synchronization: Ensures main world execution context is available on parent frame
Session ownership transfer: Handles frame ownership changes during page navigation

sequenceDiagram
    participant Parent as Parent Frame
    participant Child as Child Frame
    participant Page as Page Object
    
    Parent->>Child: Set Lifecycle Events Enabled
    Child->>Parent: Lifecycle Event (DOMContentLoaded)
    Parent->>Page: Get Session For Frame
    Page-->>Parent: Session Owner
    Parent->>Child: Wait For Main World
    Child-->>Parent: Execution Context Ready

Sources: packages/core/lib/v3/understudy/frameLocator.ts:1-60

CAPTCHA Handling

When captchasAutoSolve is enabled, the system prompt includes a roadblocks section:

<roadblocks>
  <note>{CAPTCHA_SYSTEM_PROMPT_NOTE}</note>
</roadblocks>

This informs the agent about automatic CAPTCHA resolution capabilities.

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:95-100

Session Recovery

The browse CLI implements automatic session recovery:

Detects daemon or Chrome crashes
Cleans up stale processes and files
Restarts the daemon automatically
Retries the failed command

Agents don't need explicit error handling for session failures.

Sources: packages/cli/README.md

CLI Integration

The Agent System is accessible via the browse CLI:

browse open <url>              # Navigate with auto-start
browse click <ref>            # Click by accessibility ref
browse type <text>            # Type text input
browse snapshot [-c|--compact] # Get accessibility tree
browse screenshot [path]       # Capture visual state
browse extract <schema>        # Structured data extraction

Network Capture

HTTP requests can be captured for debugging:

browse network on   # Start capturing
browse network off  # Stop capturing
browse network path # Get capture directory

Captured requests are saved as:

/tmp/browse-default-network/
  001-GET-api.github.com-repos/
    request.json
    response.json

Sources: packages/cli/CHANGELOG.md, packages/cli/README.md
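
The directory name in the listing above appears to combine a sequence number, the HTTP method, and the host plus path. The sketch below reproduces that pattern for the one example shown. It is inferred from that single example, so `captureDirName` and its rules (zero padding to three digits, joining path segments with hyphens, ignoring the query string) are assumptions and the CLI's real naming may differ.

```typescript
// Hypothetical naming helper inferred from the example directory above; the CLI's real rules may differ.
function captureDirName(index: number, method: string, url: string): string {
  // Capture the host and the path of an absolute URL, ignoring query and fragment.
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^/?#]+)([^?#]*)/i.exec(url);
  if (!match) throw new Error(`not an absolute URL: ${url}`);
  const host = match[1] ?? "";
  const path = match[2] ?? "";
  const segments = path.split("/").filter((s) => s.length > 0);
  const seq = String(index).padStart(3, "0");
  return [seq, method.toUpperCase(), [host, ...segments].join("-")].join("-");
}
```

Zero-padded sequence numbers keep the capture directories in request order when listed alphabetically.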

Configuration Options

Option	Type	Description
`captchasAutoSolve`	boolean	Enable automatic CAPTCHA solving
`systemInstructions`	string	Custom instructions for the agent
`variables`	Record	Key-value pairs for secure substitution
`isHybridMode`	boolean	Enable screenshot-first page understanding

Data Flow

graph LR
    A[User Instruction] --> B[AgentClient]
    B --> C[Prompt Generator]
    C --> D[LLM Inference]
    D --> E[Action Decision]
    E --> F[Tool Execution]
    F --> G[Browser Response]
    G --> H[Context Update]
    H --> B
    
    E -->|Page Understanding| I[screenshot]
    E -->|Page Understanding| J[ariaTree]
    I --> G
    J --> G

Summary

The Agent System provides a robust abstraction layer for AI-driven browser automation. Key characteristics:

Multi-provider support: Works with Anthropic Claude and Google Gemini computer use models
Adaptive page understanding: Automatically selects optimal page analysis strategies
Secure variable handling: Supports sensitive data substitution without exposure
Self-healing sessions: Automatically recovers from browser failures
Flexible deployment: Supports both local and cloud-based execution environments

This architecture enables developers to build reliable browser automation workflows that combine the flexibility of natural language instructions with the precision of programmatic control.

Sources: packages/core/README.md

MCP Integration

Note: The MCP (Model Context Protocol) integration files were referenced in this wiki task but are not present in the currently available repository context. This page is based on the overall Stagehand architecture and CLI capabilities documented in the provided source files.

Overview

The MCP (Model Context Protocol) integration in Stagehand enables the browser automation framework to communicate with external MCP servers, allowing AI agents to interact with specialized tools and services beyond built-in browser automation capabilities.

Architecture

MCP integration follows a client-server architecture where Stagehand acts as an MCP client that connects to external MCP servers providing additional functionality.

graph TD
    A[Stagehand Agent] --> B[MCP Connection Manager]
    B --> C[MCP Server 1]
    B --> D[MCP Server 2]
    B --> E[MCP Server N]
    C --> F[External Tools/Services]
    D --> G[External Tools/Services]
    E --> H[External Tools/Services]

Connection Management

The MCP connection module (connection.ts) handles the lifecycle of MCP server connections:

Method	Purpose
`connect()`	Establish connection to an MCP server
`disconnect()`	Close connection and cleanup resources
`sendRequest()`	Send request to connected MCP server
`receiveResponse()`	Handle responses from MCP server

Utility Functions

The MCP utilities module (utils.ts) provides helper functions for:

Message formatting and parsing
Error handling and retry logic
Connection state management
Tool call serialization/deserialization

Usage Example

Basic integration pattern (from packages/core/examples/mcp.ts):

import { Stagehand } from "@browserbasehq/stagehand";

// Initialize Stagehand with MCP configuration
const stagehand = new Stagehand({
  mcpServers: [
    {
      name: "my-mcp-server",
      command: "npx",
      args: ["-y", "@my-org/mcp-server"],
    },
  ],
});

await stagehand.init();

// Use Stagehand with MCP tools available
await stagehand.page.goto("https://example.com");

Configuration Options

Option	Type	Description
`name`	`string`	Display name for the MCP server
`command`	`string`	Executable to run the MCP server
`args`	`string[]`	Arguments passed to the MCP server command
`env`	`Record<string, string>`	Environment variables for the server
`timeout`	`number`	Connection timeout in milliseconds

Related Components

CLI Daemon (packages/cli): Handles MCP server spawning and lifecycle
Shutdown Supervisor (supervisor.ts): Manages cleanup of MCP connections on exit
Agent System Prompt (agentSystemPrompt.ts): Provides instructions for using MCP tools

References

CLI daemon management: packages/cli/README.md
Shutdown handling: packages/core/lib/v3/shutdown/supervisor.ts:1-50
Agent tool usage: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:1-100

Note: For complete MCP implementation details, refer to the source files listed at the top of this page once they are available in the repository context.

Server API

Related topics: Architecture Overview, CLI Tools

The Stagehand Server API provides a headless browser automation service designed to power AI agents and programmatic browser control. It exposes HTTP endpoints for launching, controlling, and managing browser sessions, serving as the backend infrastructure for the browse CLI command and SDK integrations.

Overview

The Server API is a RESTful service built in TypeScript that manages browser sessions using Playwright. It supports both local and remote execution modes, with the remote mode utilizing Browserbase's cloud infrastructure for scalable browser automation. The API handles session lifecycle management, CDP (Chrome DevTools Protocol) proxying, and persistent browser state across requests.

graph TD
    subgraph "Client Layer"
        CLI[CLI: browse]
        SDK[SDK Client]
    end
    
    subgraph "Server Layer"
        V3[Server v3]
        V4[Server v4]
    end
    
    subgraph "Session Management"
        SS[SessionStore]
        DB[(Database)]
    end
    
    subgraph "Browser Layer"
        Local[Local Playwright]
        Remote[Browserbase Cloud]
    end
    
    CLI --> V3
    CLI --> V4
    SDK --> V4
    V3 --> SS
    V4 --> SS
    SS --> DB
    V3 --> Local
    V3 --> Remote
    V4 --> Local
    V4 --> Remote

Architecture

Server Versions

Version	Package	Description
v3	`packages/server-v3`	Legacy server implementation with SessionStore
v4	`packages/server-v4`	Current production server with routing and database integration

Component Structure

The server infrastructure consists of three primary layers:

HTTP Server Layer (server.ts): Handles incoming requests, manages WebSocket upgrades for CDP connections, and orchestrates session routing
Session Management Layer (SessionStore.ts): Maintains in-memory and persisted session state, handles cleanup and timeout management
Browser Execution Layer: Interfaces with Playwright locally or Browserbase remotely for actual browser automation

Request Flow

sequenceDiagram
    participant Client
    participant Server
    participant SessionStore
    participant Browser
    
    Client->>Server: POST /browsersession (create)
    Server->>SessionStore: allocate session
    SessionStore->>Browser: launch/attach
    Browser-->>SessionStore: session ready
    SessionStore-->>Server: session handle
    Server-->>Client: session ID + CDP endpoint
    
    Client->>Server: WS /browsersession/:id/cdp
    Server->>Browser: proxy CDP messages
    Browser-->>Server: CDP responses
    Server-->>Client: WS stream

Session Management

SessionStore

The SessionStore class manages browser session lifecycle:

Method	Purpose
`create()`	Initialize new browser session
`get(id)`	Retrieve session by ID
`list()`	List all active sessions
`delete(id)`	Terminate and cleanup session
`cleanup()`	Remove expired/stale sessions

Sessions store metadata including:

Session identifier
Creation timestamp
Last activity timestamp
Browser type and configuration
CDP WebSocket URL
Associated project/API key
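
To illustrate how the methods in the table fit together, here is a minimal in-memory sketch. It is not the server's `SessionStore`: the class name `InMemorySessionStore`, the reduced `Session` shape, the explicit `now` arguments, and the idle-based expiry rule are all assumptions made for the example.

```typescript
// Hypothetical in-memory store; method names follow the table above, everything else is assumed.
interface Session {
  id: string;
  createdAt: number;
  lastActivityAt: number;
}

class InMemorySessionStore {
  private sessions = new Map<string, Session>();
  private counter = 0;

  create(now: number): Session {
    const session: Session = {
      id: `session-${++this.counter}`,
      createdAt: now,
      lastActivityAt: now,
    };
    this.sessions.set(session.id, session);
    return session;
  }

  get(id: string): Session | undefined {
    return this.sessions.get(id);
  }

  list(): Session[] {
    return Array.from(this.sessions.values());
  }

  delete(id: string): boolean {
    return this.sessions.delete(id);
  }

  // Removes sessions idle for longer than maxIdleMs and returns their ids.
  cleanup(now: number, maxIdleMs: number): string[] {
    const expired = this.list()
      .filter((s) => now - s.lastActivityAt > maxIdleMs)
      .map((s) => s.id);
    expired.forEach((id) => this.sessions.delete(id));
    return expired;
  }
}
```

Passing `now` in explicitly, instead of calling `Date.now()` inside the store, keeps expiry behavior deterministic in tests.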

Session Lifecycle

Sessions follow this state machine:

stateDiagram-v2
    [*] --> Created: allocate()
    Created --> Launching: spawn browser
    Launching --> Ready: browser connected
    Ready --> Active: first CDP command
    Active --> Idle: no activity timeout
    Idle --> Active: new CDP command
    Active --> Closing: close() or timeout
    Closing --> [*]: cleanup complete
    
    Launching --> Error: spawn failed
    Error --> [*]: cleanup
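
The diagram above can be expressed as a transition table. The sketch below is an illustration under assumptions: the event names (`spawn`, `connected`, and so on) and the `Done` state standing in for the terminal `[*]` node are invented for the example and are not taken from the server source.

```typescript
// Hypothetical encoding of the session state diagram; event names are illustrative.
type SessionState =
  | "Created" | "Launching" | "Ready" | "Active"
  | "Idle" | "Closing" | "Error" | "Done";

type SessionEvent =
  | "spawn" | "connected" | "spawnFailed" | "command"
  | "idleTimeout" | "close" | "cleanupComplete";

// Each state lists only the events it accepts; anything else is invalid.
const transitions: Record<SessionState, Partial<Record<SessionEvent, SessionState>>> = {
  Created: { spawn: "Launching" },
  Launching: { connected: "Ready", spawnFailed: "Error" },
  Ready: { command: "Active" },
  Active: { idleTimeout: "Idle", close: "Closing" },
  Idle: { command: "Active" },
  Closing: { cleanupComplete: "Done" },
  Error: { cleanupComplete: "Done" },
  Done: {},
};

function nextState(state: SessionState, event: SessionEvent): SessionState {
  const target = transitions[state][event];
  if (target === undefined) {
    throw new Error(`Invalid transition: ${state} on ${event}`);
  }
  return target;
}
```

Keeping the transitions in a table rather than in scattered conditionals makes it easy to reject events the diagram does not allow, such as closing a session that has never become active.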

API Endpoints

Browser Session Routes

The primary routing module at packages/server-v4/src/routes/v4/browsersession/routes.ts defines the session REST API:

Endpoint	Method	Description
`/browsersession`	POST	Create new browser session
`/browsersession/:id`	GET	Get session details
`/browsersession/:id`	DELETE	Close session
`/browsersession/:id/screenshot`	GET	Capture screenshot
`/browsersession/:id/snapshot`	GET	Get accessibility tree

Query Parameters

Parameter	Type	Description
`headless`	boolean	Run without visible window
`viewport.width`	number	Browser viewport width
`viewport.height`	number	Browser viewport height
`contextId`	string	Browserbase Context ID to resume
`persist`	boolean	Persist session across disconnects
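
As a rough sketch of how a client might assemble these parameters, the helper below builds the request path for session creation. The `CreateSessionOptions` type and `buildCreateSessionPath` function are hypothetical, and the dotted `viewport.width` / `viewport.height` keys are taken literally from the table; how the server actually parses them is not confirmed here.

```typescript
// Hypothetical client-side helper; option names mirror the parameter table above.
interface CreateSessionOptions {
  headless?: boolean;
  viewport?: { width: number; height: number };
  contextId?: string;
  persist?: boolean;
}

function buildCreateSessionPath(options: CreateSessionOptions): string {
  const pairs: Array<[string, string]> = [];
  if (options.headless !== undefined) pairs.push(["headless", String(options.headless)]);
  if (options.viewport !== undefined) {
    pairs.push(["viewport.width", String(options.viewport.width)]);
    pairs.push(["viewport.height", String(options.viewport.height)]);
  }
  if (options.contextId !== undefined) pairs.push(["contextId", options.contextId]);
  if (options.persist !== undefined) pairs.push(["persist", String(options.persist)]);

  // Encode keys and values so IDs with reserved characters stay intact.
  const query = pairs
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  return query === "" ? "/browsersession" : `/browsersession?${query}`;
}
```

Omitted options are left out of the query string entirely, so the server's own defaults apply.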

Database Integration

The database client at packages/server-v4/src/db/client.ts provides the persistence layer, which covers:

Session persistence: Store session metadata for recovery
Audit logging: Track session creation, commands, and errors
Usage metrics: Monitor API usage per project/key

Configuration

Environment Variables

Variable	Description
`BROWSERBASE_API_KEY`	Browserbase API key for remote execution
`BROWSERBASE_PROJECT_ID`	Browserbase project identifier
`DATABASE_URL`	PostgreSQL connection string for session storage
`PORT`	HTTP server port (default: 3000)

Execution Modes

Mode	Trigger	Browser Location
`local`	Default (no API key)	Local Playwright installation
`remote`	`BROWSERBASE_API_KEY` set	Browserbase cloud browsers
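
The two tables above suggest a simple configuration step: read the environment, default the port to 3000, and pick the execution mode from the presence of `BROWSERBASE_API_KEY`. The sketch below illustrates that logic; `ServerConfig` and `loadConfig` are hypothetical names, and the port validation is an added assumption rather than documented server behavior.

```typescript
// Hypothetical configuration loader; the variable names come from the tables above.
type ExecutionMode = "local" | "remote";

interface ServerConfig {
  port: number;
  mode: ExecutionMode;
  browserbaseProjectId: string | undefined;
  databaseUrl: string | undefined;
}

function loadConfig(env: Record<string, string | undefined>): ServerConfig {
  const rawPort = env["PORT"];
  // PORT defaults to 3000 when unset or empty.
  const port = rawPort === undefined || rawPort === "" ? 3000 : Number(rawPort);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${rawPort}`);
  }

  // Remote mode is triggered by the API key being set; otherwise run locally.
  const mode: ExecutionMode = env["BROWSERBASE_API_KEY"] ? "remote" : "local";

  return {
    port,
    mode,
    browserbaseProjectId: env["BROWSERBASE_PROJECT_ID"],
    databaseUrl: env["DATABASE_URL"],
  };
}
```

In a Node.js process the function would typically be called as `loadConfig(process.env)`; taking the environment as a parameter keeps it easy to test.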

CDP Proxying

The server acts as a WebSocket proxy for Chrome DevTools Protocol commands:

Client establishes WebSocket connection to /browsersession/:id/cdp
Server forwards messages to the underlying browser session
Responses and events stream back to client
Connection persists until client disconnects or session expires
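
The forwarding behavior described above can be sketched with two socket-like endpoints wired together. The `MessageSocket` interface, `proxyCdp` function, and `InMemorySocket` test double are all hypothetical; a real server would sit on top of a WebSocket library and add authentication and error handling.

```typescript
// Hypothetical socket abstraction; the real server presumably uses a WebSocket library.
interface MessageSocket {
  send(data: string): void;
  onMessage(handler: (data: string) => void): void;
  onClose(handler: () => void): void;
  close(): void;
}

// Wires two sockets together so messages and closure propagate in both directions.
function proxyCdp(client: MessageSocket, browser: MessageSocket): void {
  client.onMessage((data) => browser.send(data)); // commands: client -> browser
  browser.onMessage((data) => client.send(data)); // responses and events: browser -> client
  client.onClose(() => browser.close());
  browser.onClose(() => client.close());
}

// In-memory stand-in used to exercise the proxy without a network.
class InMemorySocket implements MessageSocket {
  sent: string[] = [];
  closed = false;
  private messageHandlers: Array<(data: string) => void> = [];
  private closeHandlers: Array<() => void> = [];

  send(data: string): void {
    if (!this.closed) this.sent.push(data);
  }

  onMessage(handler: (data: string) => void): void {
    this.messageHandlers.push(handler);
  }

  onClose(handler: () => void): void {
    this.closeHandlers.push(handler);
  }

  close(): void {
    if (this.closed) return; // guard so the mutual close handlers cannot loop
    this.closed = true;
    for (const handler of this.closeHandlers) handler();
  }

  // Simulates a message arriving from the remote peer.
  receive(data: string): void {
    for (const handler of this.messageHandlers) handler(data);
  }
}
```

CDP messages are JSON objects in which a command carries an `id`, a `method`, and optional `params`, and the matching response echoes the same `id`; the proxy forwards them as opaque strings and never needs to parse them.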

CLI Integration

The browse CLI commands connect to the server:

# Local execution (daemon auto-starts)
browse open https://example.com

# Remote execution via Browserbase
browse env remote
browse open https://example.com

# Specify server endpoint
browse --ws ws://localhost:3000 open https://example.com

The daemon architecture provides:

Automatic server startup/shutdown
Session persistence across CLI invocations
Graceful recovery on connection failure

Security Considerations

Session tokens are generated securely and scoped to individual sessions
CDP endpoints require valid session authentication
Remote execution uses Browserbase's authentication infrastructure
Sessions auto-expire after configurable inactivity timeout

References

CLI Daemon: packages/cli/README.md
Session Context: packages/core/lib/v3/understudy/context.ts
Frame Locator: packages/core/lib/v3/understudy/frameLocator.ts
Lifecycle Watcher: packages/core/lib/v3/understudy/lifecycleWatcher.ts

Source: https://github.com/browserbase/stagehand / Human Manual

CLI Tools

The Stagehand CLI (browse command) is a command-line interface for browser automation designed specifically for AI agents. It provides a text-based interface for controlling web browsers through CDP (Chrome DevTools Protocol), enabling programmatic browser control without requiring a graphical interface.

Overview

The CLI serves as the foundational tool layer for the Stagehand browser automation framework. It abstracts CDP operations into human-readable commands, making them accessible to shell scripts, AI agent integrations, and development workflows.

The CLI operates in two modes:

Mode	Description	Default Condition
`local`	Runs Chrome locally via bundled Playwright	When no `BROWSERBASE_API_KEY` is set
`remote`	Connects to Browserbase cloud infrastructure	When `BROWSERBASE_API_KEY` is set

Sources: packages/cli/README.md

Architecture

graph TD
    A[User / AI Agent] --> B[browse CLI]
    B --> C{browse env mode}
    C -->|local| D[Local Chrome CDP]
    C -->|remote| E[Browserbase Cloud]
    D --> F[Local Strategy]
    E --> G[Remote Strategy]
    F --> H[Chrome Instance]
    G --> I[Browserbase Infrastructure]
    H --> J[CDP WebSocket]
    I --> J

Daemon Architecture

The CLI uses a persistent daemon process for efficiency:

First invocation starts the daemon in the background
Subsequent commands reuse the existing daemon
The daemon manages Chrome lifecycle and CDP connections

Recovery behavior:

1. Detects stale daemon
2. Cleans up old files
3. Restarts the daemon
4. Retries the command

Sources: packages/cli/README.md
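
The recovery steps above amount to a retry-once wrapper. The sketch below is a simplified, synchronous illustration: the `DaemonControl` interface and `runWithRecovery` function are hypothetical, and the assumption that staleness is checked only after a command fails is a guess about ordering, not something the README states.

```typescript
// Hypothetical daemon hooks; the real CLI internals are not shown in this manual.
interface DaemonControl {
  isStale(): boolean; // e.g. the recorded process is gone or unresponsive
  cleanupFiles(): void; // remove leftover state files from the old daemon
  restart(): void; // start a fresh daemon
}

function runWithRecovery<T>(command: () => T, daemon: DaemonControl): T {
  try {
    return command();
  } catch (error) {
    // Only recover when the failure is explained by a stale daemon.
    if (!daemon.isStale()) throw error;
    daemon.cleanupFiles();
    daemon.restart();
    return command(); // retry exactly once; a second failure propagates
  }
}
```

Limiting the retry to a single attempt avoids restart loops when the failure has a cause other than a stale daemon.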

Navigation Commands

Command	Description
`browse open <url>`	Navigate to a URL
`browse reload`	Reload current page
`browse back`	Navigate back in history
`browse forward`	Navigate forward in history

Open Command Options

Option	Description	Default
`--wait <state>`	Wait for load state	`load`
`-t, --timeout <ms>`	Page load timeout	30000 ms
`--context-id <id>`	Load Browserbase Context	-
`--persist`	Persist Context across sessions	-

The --timeout flag controls how long to wait for the page load state. Use longer timeouts for slow-loading pages:

browse open https://slow-site.com --timeout 60000

Interaction Commands

Click Actions

Command	Description
`browse click <ref>`	Click by element ref (e.g., `@0-5`)
`browse click <ref> [-b button]`	Click with button (left/right/middle)
`browse click <ref> [-c count]`	Click multiple times
`browse click_xy <x> <y>`	Click at coordinates

Coordinate Actions

Command	Description
`browse hover <x> <y>`	Hover at coordinates
`browse scroll <x> <y> <dx> <dy>`	Scroll by (dx, dy) starting at position (x, y)
`browse drag <fx> <fy> <tx> <ty>`	Drag from (fx, fy) to (tx, ty)

Keyboard Commands

Command	Description
`browse type <text>`	Type text into focused element
`browse type <text> [-d delay]`	Type with character delay
`browse press <key>`	Press keyboard key

Supported keys include standard keys (Enter, Tab, Escape) and modifier combinations (Cmd+A, Cmd+C, etc.).

Form Handling

Command	Description
`browse fill <selector> <value>`	Fill form field
`browse select <selector> <values...>`	Select dropdown options
`browse highlight <selector>`	Highlight element

The fill command automatically presses Enter after typing. Use --no-press-enter to prevent this behavior.

Page Information

Command	Description
`browse get url`	Get current URL
`browse get title`	Get page title
`browse get text <selector>`	Get element text content
`browse get html <selector>`	Get element HTML
`browse get value <selector>`	Get form field value
`browse get box <selector>`	Get center coordinates
`browse get markdown [selector]`	Convert HTML to markdown

The snapshot command returns the accessibility tree with element references:

browse snapshot [-c|--compact]

Output format includes refs like [0-5], [1-2]:

RootWebArea "Example" url="https://example.com"
  [0-0] link "Home"
  [0-1] link "About"
  [0-2] button "Sign In"

Sources: packages/cli/README.md
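
For agents that consume the text form of a snapshot, the refs can be pulled out with a small parser. The sketch below is written against the sample output above only; the `SnapshotNode` type and `parseSnapshotRefs` function are hypothetical, and real snapshots may contain line shapes this pattern does not cover.

```typescript
// Hypothetical parser based solely on the sample snapshot shown above.
interface SnapshotNode {
  ref: string; // e.g. "0-2"
  role: string; // e.g. "button"
  name: string; // accessible name, e.g. "Sign In"
}

function parseSnapshotRefs(snapshot: string): SnapshotNode[] {
  // Matches lines such as:   [0-2] button "Sign In"
  const pattern = /^\s*\[(\d+-\d+)\]\s+(\S+)\s+"([^"]*)"/;
  const nodes: SnapshotNode[] = [];
  for (const line of snapshot.split("\n")) {
    const match = pattern.exec(line);
    if (match === null) continue; // skip the root line and anything unrecognized
    const [, ref = "", role = "", name = ""] = match;
    nodes.push({ ref, role, name });
  }
  return nodes;
}
```

The extracted `ref` values can then be passed to commands such as `browse click`.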

Waiting Commands

Command	Description
`browse wait load [state]`	Wait for page load state
`browse wait selector <selector>`	Wait for element
`browse wait timeout <ms>`	Wait for duration

Selector wait options:

Option	Description	Default
`-t timeout`	Maximum wait time	-
`-s visible|hidden|attached|detached`	Element state	visible

Multi-Tab Management

Command	Description
`browse pages`	List all open tabs
`browse newpage [url]`	Open new tab
`browse tab_switch <n>`	Switch to tab by index
`browse tab_close [n]`	Close tab (default: last)

Network Capture

Enable network request capture for debugging and inspection:

Command	Description
`browse network on`	Start capturing requests
`browse network off`	Stop capturing
`browse network path`	Get capture directory
`browse network clear`	Clear captured requests

Capture Directory Structure

/tmp/browse-default-network/
  001-GET-api.github.com-repos/
    request.json      # method, url, headers, body
    response.json     # status, headers, body, duration
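
Once the capture files have been read and parsed, they can be inspected with ordinary code. The sketch below shows one possible shape for the two files and a filter for failed requests. The field names come from the comments in the listing above, but their types, the `CapturedExchange` wrapper, and the `failedExchanges` helper are assumptions made for illustration.

```typescript
// Field names follow the listing above; the types are assumptions.
interface CapturedRequest {
  method: string;
  url: string;
  headers: Record<string, string>;
  body: string | null;
}

interface CapturedResponse {
  status: number;
  headers: Record<string, string>;
  body: string | null;
  duration: number; // unit not stated in the source
}

// Hypothetical pairing of one request.json with its response.json.
interface CapturedExchange {
  request: CapturedRequest;
  response: CapturedResponse;
}

// Returns a one-line summary for every exchange with an HTTP error status.
function failedExchanges(exchanges: CapturedExchange[]): string[] {
  return exchanges
    .filter((exchange) => exchange.response.status >= 400)
    .map((exchange) => `${exchange.request.method} ${exchange.request.url} -> ${exchange.response.status}`);
}
```

Loading the files themselves is left out; in Node.js each pair would typically be read from the directory reported by `browse network path` and parsed with `JSON.parse`.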

Session Management

Command	Description
`browse start`	Start daemon
`browse stop`	Stop daemon
`browse status`	Check daemon status
`browse env <target>`	Set environment mode

Session Options

Option	Description
`--session <name>`	Session name (default: "default")
`--headless`	Run Chrome headless
`--headed`	Run Chrome with visible window
`--ws <url|port>`	One-shot CDP connection

Environment Modes

# Start with specific mode
browse env local
browse env remote

# Attach to running Chrome
browse env local <port|url>

# Persist override and restart daemon
browse env <target> --session <name>

# Clear override (fallback to env-var detection)
browse stop

Auto-detection priority:

1. Check browse env persist setting
2. Check BROWSE_SESSION environment variable
3. Default: remote if BROWSERBASE_API_KEY is set, else local

Element References

After running browse snapshot, elements can be referenced by their ref ID:

# Get snapshot with refs
browse snapshot -c

# Click using ref (multiple formats)
browse click @0-2       # @ prefix
browse click 0-2        # Plain ref
browse click --ref 0-2 # --ref flag
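
A wrapper script that accepts refs in either the `@0-2` or the plain `0-2` form may want to normalize them first. The sketch below does that; `normalizeRef` is a hypothetical helper, and the `digits-digits` grammar is inferred from the examples in this section rather than from a documented specification.

```typescript
// Hypothetical helper; the ref grammar is inferred from the examples above.
function normalizeRef(input: string): string {
  const ref = input.startsWith("@") ? input.slice(1) : input;
  if (!/^\d+-\d+$/.test(ref)) {
    throw new Error(`Not a valid element ref: ${input}`);
  }
  return ref;
}
```

Rejecting malformed input early gives a clearer error than passing an unknown ref through to the CLI.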

Global Options

Option	Description
`--session <name>`	Session name for multiple browsers
`--headless`	Run Chrome in headless mode
`--headed`	Run Chrome with visible window
`--ws <url|port>`	One-shot CDP connection (bypasses daemon)
`--json`	Output as JSON

Environment Variables

Variable	Description
`BROWSE_SESSION`	Default session name
`BROWSERBASE_API_KEY`	Browserbase API key (required for remote mode)

Output Formats

The CLI supports JSON output for programmatic consumption:

browse get url --json
browse snapshot --json

This is particularly useful for AI agent integrations that need structured data.
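
One way an agent integration might consume the JSON output is sketched below. The `Exec` type and `browseJson` function are hypothetical, the command runner is injected so the example does not depend on the CLI being installed, and the result is typed as `unknown` because the shape of the JSON payload is not documented here.

```typescript
// `exec` is injected so the sketch stays testable; it must return the command's stdout.
type Exec = (file: string, args: string[]) => string;

// Runs a browse subcommand with --json appended and parses its output.
function browseJson(args: string[], exec: Exec): unknown {
  const stdout = exec("browse", [...args, "--json"]);
  try {
    return JSON.parse(stdout);
  } catch {
    throw new Error(`browse ${args.join(" ")} did not return valid JSON`);
  }
}
```

In Node.js, `exec` could wrap `execFileSync` from `node:child_process` with the `encoding` option set to `"utf8"`, so that stdout is returned as a string. Callers should validate the parsed value before relying on specific fields.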

Related Resources

Stagehand Core Package - Browser automation primitives
Browserbase Platform - Remote browser infrastructure
CDP Protocol - Chrome DevTools Protocol reference

Sources: [packages/cli/README.md](https://github.com/browserbase/stagehand/blob/main/packages/cli/README.md)

Doramagic Pitfall Log

Doramagic extracted 7 source-linked risk signals. Review them before installing or handing real data to the project.

1. Project risk: Project risk needs validation

Severity: medium
Finding: Project risk is backed by a source signal: Project risk needs validation. Treat it as a review item until the current version is checked.
User impact: The project should not be treated as fully validated until this signal is reviewed.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: identity.distribution | github_repo:776908852 | https://github.com/browserbase/stagehand | repo=stagehand; install=create-browser-app

2. Capability assumption: README/documentation is current enough for a first validation pass.

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: The project should not be treated as fully validated until this signal is reviewed.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: capability.assumptions | github_repo:776908852 | https://github.com/browserbase/stagehand | README/documentation is current enough for a first validation pass.

3. Maintenance risk: Maintainer activity is unknown

Severity: medium
Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: evidence.maintainer_signals | github_repo:776908852 | https://github.com/browserbase/stagehand | last_activity_observed missing

4. Security or permission risk: no_demo

Severity: medium
Finding: no_demo
User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: downstream_validation.risk_items | github_repo:776908852 | https://github.com/browserbase/stagehand | no_demo; severity=medium

5. Security or permission risk: no_demo

Severity: medium
Finding: no_demo
User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: risks.scoring_risks | github_repo:776908852 | https://github.com/browserbase/stagehand | no_demo; severity=medium

6. Maintenance risk: issue_or_pr_quality=unknown

Severity: low
Finding: issue_or_pr_quality=unknown.
User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: evidence.maintainer_signals | github_repo:776908852 | https://github.com/browserbase/stagehand | issue_or_pr_quality=unknown

7. Maintenance risk: release_recency=unknown

Severity: low
Finding: release_recency=unknown.
User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
Evidence: evidence.maintainer_signals | github_repo:776908852 | https://github.com/browserbase/stagehand | release_recency=unknown

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using stagehand with real data or production workflows.

Storage state persistence broken in Stagehand v3 - userDataDir doesn't w - github / github_issue
Reduce Bundle Size - github / github_issue
Feature: pay-per-call captcha unstuck via x402 (Onyx Actions) - github / github_issue
stagehand/server-v3 v3.6.6 - github / github_release
stagehand/server-v3 v3.6.5 - github / github_release
@browserbasehq/[email protected] - github / github_release
stagehand/server-v3 v3.6.3 - github / github_release
stagehand/server-v3 v3.6.2 - github / github_release
@browserbasehq/[email protected] - github / github_release
@browserbasehq/[email protected] - github / github_release
Project risk needs validation - GitHub / issue

Source: Project Pack community evidence and pitfall evidence
