Doramagic Project Pack · Human Manual

stagehand

Project Introduction

Related topics: Architecture Overview, CDP Engine, LLM Providers

Stagehand is an AI-powered browser automation framework developed by Browserbase that enables developers to control web browsers using natural language instructions combined with precise code-based control. The framework bridges the gap between high-level AI agents and low-level browser automation tools like Selenium, Playwright, or Puppeteer.

Project Overview

Stagehand lets developers decide, step by step, when to rely on AI and when to write explicit code. This hybrid approach keeps automation flexible while remaining reliable enough for production environments.

Key Value Propositions

| Feature | Description |
| --- | --- |
| Hybrid Control | Combine AI-driven navigation with deterministic code execution |
| Self-Healing | Auto-caching and action recovery when website changes occur |
| Action Preview | Preview AI-generated actions before execution |
| Repeatable Workflows | Cache and reuse actions to save time and tokens |
| Production Ready | Built for reliable automation in production systems |

Sources: README.md:1-50

Architecture Overview

Stagehand follows a monorepo architecture using pnpm workspaces and Turborepo for efficient build management and dependency handling.

graph TD
    A[stagehand Repository] --> B[packages/core]
    A --> C[packages/cli]
    B --> D[Agent System]
    B --> E[Inference Engine]
    B --> F[Browser Context Manager]
    D --> G[Microsoft CUA Client]
    D --> H[Action Executors]
    F --> I[Playwright Integration]
    F --> J[Frame Locator]
    E --> K[Inference Logging]
    E --> L[Cache Management]
    C --> M[Daemon Controller]
    C --> N[CLI Commands]
    M --> F

Core Packages

| Package | Purpose | Key Files |
| --- | --- | --- |
| packages/core | Main automation engine with AI agent capabilities | lib/v3/agent/*, lib/v3/understudy/* |
| packages/cli | Command-line interface for browser control | CHANGELOG.md, README commands |

Sources: packages/cli/CHANGELOG.md:1-15

Project Structure

stagehand/
├── packages/
│   ├── core/                    # Core automation package
│   │   ├── lib/
│   │   │   ├── v3/
│   │   │   │   ├── agent/       # AI agent implementation
│   │   │   │   │   ├── MicrosoftCUAClient.ts
│   │   │   │   │   └── prompts/
│   │   │   │   │       └── agentSystemPrompt.ts
│   │   │   │   └── understudy/  # Context and frame management
│   │   │   │       ├── context.ts
│   │   │   │       └── frameLocator.ts
│   │   │   └── inferenceLogUtils.ts
│   │   └── README.md
│   └── cli/                     # CLI tooling
│       ├── CHANGELOG.md
│       └── README.md
├── package.json
├── pnpm-workspace.yaml
└── turbo.json

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:1-50

Agent System Architecture

The agent system is the brain of Stagehand, responsible for interpreting user instructions and generating appropriate browser actions.

Microsoft CUA Integration

Stagehand implements a Microsoft CUA (Computer Use Agent) client that provides structured function calling capabilities:

graph LR
    A[User Instruction] --> B[MicrosoftCUAClient]
    B --> C[Action Generation]
    C --> D[left_click]
    C --> E[scroll]
    C --> F[visit_url]
    C --> G[type]
    C --> H[wait]
    C --> I[web_search]

#### Supported Actions

| Action | Parameters | Description |
| --- | --- | --- |
| left_click | coordinate: [x, y] | Click at specified coordinates |
| scroll | pixels: number | Scroll up (positive) or down (negative) |
| visit_url | url: string | Navigate to a URL |
| type | text, press_enter, delete_existing_text | Type text into input fields |
| wait | time: number | Wait for specified seconds |
| web_search | query: string | Perform a web search |
| history_back | - | Navigate back in browser history |
| pause_and_memorize_fact | fact: string | Store information for later use |
| terminate | status: success/failure | End the task execution |

Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:1-80
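The action table above maps naturally onto a discriminated union. The following is a minimal sketch under that assumption: the `AgentAction` type and the `describeAction` helper are hypothetical names introduced for illustration and are not taken from MicrosoftCUAClient.ts.

```typescript
// Hypothetical types mirroring the action table; not the library's own definitions.
type AgentAction =
  | { name: "left_click"; coordinate: [number, number] }
  | { name: "scroll"; pixels: number }
  | { name: "visit_url"; url: string }
  | { name: "type"; text: string; press_enter?: boolean; delete_existing_text?: boolean }
  | { name: "wait"; time: number }
  | { name: "web_search"; query: string }
  | { name: "history_back" }
  | { name: "pause_and_memorize_fact"; fact: string }
  | { name: "terminate"; status: "success" | "failure" };

// Build a short log line for an action. The switch is exhaustive, so adding
// a new member to the union forces this function to be updated.
function describeAction(action: AgentAction): string {
  switch (action.name) {
    case "left_click":
      return `click at (${action.coordinate[0]}, ${action.coordinate[1]})`;
    case "scroll":
      // Per the table: positive pixels scroll up, negative scroll down.
      return action.pixels >= 0
        ? `scroll up ${action.pixels}px`
        : `scroll down ${-action.pixels}px`;
    case "visit_url":
      return `navigate to ${action.url}`;
    case "type":
      return `type "${action.text}"${action.press_enter ? " then press Enter" : ""}`;
    case "wait":
      return `wait ${action.time}s`;
    case "web_search":
      return `search for "${action.query}"`;
    case "history_back":
      return "go back";
    case "pause_and_memorize_fact":
      return `remember: ${action.fact}`;
    case "terminate":
      return `terminate (${action.status})`;
  }
}
```

Modeling the actions this way lets the compiler reject a `left_click` without a coordinate or a `terminate` with an unknown status before anything reaches the browser.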

System Prompt Structure

The agent uses a structured XML-based system prompt that organizes instructions into distinct sections:

graph TD
    P[System Prompt] --> A[Identity]
    P --> T[Task Definition]
    P --> M[Mindset Guidelines]
    P --> G[Action Guidelines]
    P --> N[Navigation Rules]
    P --> S[Strategy]
    P --> R[Roadblocks]
    P --> V[Variables]
    P --> C[Completion]

The prompt templates include conditional sections for:

  • Page Understanding Protocol: Instructions for analyzing page state
  • Search Integration: Optional search tool usage when URL confidence is low
  • Variable Substitution: Support for %variableName% syntax in form interactions
  • Captcha Handling: Optional roadblocks section when auto-solving is enabled

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:50-150
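To make the %variableName% convention concrete, here is a minimal sketch of how such placeholders could be resolved before a value is typed into a form. The `substituteVariables` function is a hypothetical helper, and leaving unknown placeholders untouched is an assumption of this sketch rather than documented Stagehand behavior.

```typescript
// Replace %name% placeholders with values from `variables`.
// Placeholders with no matching variable are left unchanged (an assumption).
function substituteVariables(
  template: string,
  variables: Record<string, string>,
): string {
  return template.replace(
    /%([A-Za-z_][A-Za-z0-9_]*)%/g,
    (placeholder: string, name: string): string => {
      const value = Object.prototype.hasOwnProperty.call(variables, name)
        ? variables[name]
        : undefined;
      return value !== undefined ? value : placeholder;
    },
  );
}

// Usage: the secret never has to appear in the instruction text itself.
const typed = substituteVariables("Sign in as %username%", { username: "ada" });
```

Keeping the substitution at the last step means the instruction text can be logged or cached without containing the underlying values.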

Browser Context Management

Stagehand manages browser contexts through the understudy module, which handles complex scenarios like iframe interactions and multi-page workflows.

Frame Locator System

The frame locator handles cross-origin iframe communication and lifecycle events:

sequenceDiagram
    participant Parent as Parent Page
    participant Child as Child Frame
    participant Context as BrowserContext
    
    Parent->>Context: Create Frame
    Context->>Child: Register Lifecycle Events
    Child-->>Parent: DOMContentLoaded
    Parent->>Context: Wait for MainWorld
    Context-->>Parent: Execution Context Ready

Key responsibilities include:

  • Resolving owning Page by main frame ID
  • Applying initialization scripts to pages
  • Managing lifecycle events (DOMContentLoaded, load, networkIdle)
  • Handling OOPIF (Out-of-Process iframe) scenarios

Sources: packages/core/lib/v3/understudy/context.ts:1-50, packages/core/lib/v3/understudy/frameLocator.ts:1-40

Inference and Caching

Stagehand implements an inference logging system that enables self-healing automation by caching and reusing action results.

Summary File Structure

<inferenceType>_summary.json

Where inferenceType can be:

  • act (act_summary.json): Action execution results
  • observe (observe_summary.json): Page observation results
  • extract (extract_summary.json): Data extraction results

The system reads and writes JSON files containing arrays of inference results, enabling:

  • Action Replay: Re-execute previously successful actions
  • Self-Healing: Automatically recover from website changes
  • Token Optimization: Skip LLM inference when cached results are valid

Sources: packages/core/lib/inferenceLogUtils.ts:1-60
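The naming scheme and the "array of results" layout described above can be sketched as follows. This is an in-memory illustration only: the `InferenceType` union, both helper names, and the choice to store entries under a `<type>_summary` key are assumptions made for this example, not a description of inferenceLogUtils.ts.

```typescript
// Hypothetical set of inference types, taken from the list above.
type InferenceType = "act" | "observe" | "extract";

// File name for a given inference type, following <inferenceType>_summary.json.
function summaryFileName(inferenceType: InferenceType): string {
  return `${inferenceType}_summary.json`;
}

// Append one entry to the in-memory summary without mutating the input.
// Assumption: entries live in an array under the "<type>_summary" key.
function appendSummaryEntry<T>(
  summary: Record<string, T[]>,
  inferenceType: InferenceType,
  entry: T,
): Record<string, T[]> {
  const key = `${inferenceType}_summary`;
  const existing = summary[key] ?? [];
  return { ...summary, [key]: [...existing, entry] };
}

// Usage: start from an empty summary and record two act results.
const first = appendSummaryEntry({}, "act", { action: "click", ok: true });
const second = appendSummaryEntry(first, "act", { action: "type", ok: true });
```

A caller would serialize the returned object with `JSON.stringify` and write it to the path given by `summaryFileName`.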

CLI Architecture

The Stagehand CLI provides a command-line interface for browser automation with support for both local and remote browser execution.

Execution Modes

| Mode | Detection | Description |
| --- | --- | --- |
| remote | BROWSERBASE_API_KEY is set | Uses Browserbase cloud infrastructure |
| local | Default fallback | Uses local Playwright browser |

The daemon-based architecture ensures:

  • Persistent browser sessions
  • Automatic recovery on failures
  • Session state preservation across commands

Sources: packages/cli/README.md:1-100
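The detection rule in the Execution Modes table can be expressed as a small pure function. This is a sketch, not the CLI's actual code: `detectExecutionMode` is a hypothetical name, and treating an empty or whitespace-only key as "not set" is an assumption of this example.

```typescript
type ExecutionMode = "remote" | "local";

// Pick the mode from an environment-like object (for example process.env).
// Assumption: a blank BROWSERBASE_API_KEY counts as unset.
function detectExecutionMode(
  env: Record<string, string | undefined>,
): ExecutionMode {
  const apiKey = env["BROWSERBASE_API_KEY"];
  return apiKey !== undefined && apiKey.trim() !== "" ? "remote" : "local";
}
```

Passing the environment in as an argument, rather than reading a global, keeps the rule easy to test.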

CLI Command Categories

#### Navigation Commands

browse open <url> [--wait load|domcontentloaded|networkidle]
browse reload
browse back
browse forward

#### Interaction Commands

browse click <ref>
browse type <text>
browse press <key>
browse fill <selector> <value>

#### Information Commands

browse get url|title|text|html|value|box
browse snapshot [-c|--compact]
browse screenshot

#### Session Management

browse start|stop|restart
browse env local|remote
browse attach <port|url>

Sources: packages/cli/README.md:100-200

Getting Started

Installation from Source

git clone https://github.com/browserbase/stagehand.git
cd stagehand
pnpm install
pnpm run build
pnpm run example

Branch Installation

Using gitpkg, install directly from a branch:

"@browserbasehq/stagehand": "https://gitpkg.now.sh/browserbase/stagehand/packages/core?<branchName>"

Environment Configuration

Create a .env file from the example:

cp .env.example .env
# Add your API keys:
# BROWSERBASE_API_KEY=your_key
# OPENAI_API_KEY=your_key (or other LLM provider)

Sources: README.md:50-80, packages/core/README.md:50-80

Version History

Recent significant changes in the CLI package:

| Version | Change | PR |
| --- | --- | --- |
| 0.4.2 | Added browse get markdown command for HTML-to-markdown conversion | #1907 |
| 0.4.1 | Fixed invalid metadata key using underscore | #1911 |
| 0.4.0 | Added new feature | #1889 |

Sources: packages/cli/CHANGELOG.md:1-20

License and Community

Stagehand is released under the MIT License.

A Python implementation is also available at github.com/browserbase/stagehand-python.

Sources: README.md:1-30, packages/core/README.md:1-30

Architecture Overview

Related topics: Project Introduction, CDP Engine, Core Actions, Server API

Introduction

Stagehand is an AI-powered browser automation framework that enables developers to control web browsers using natural language instructions combined with precise code-based control. The architecture is designed to bridge the gap between high-level AI agents and low-level browser automation frameworks like Playwright, providing reliability, extensibility, and self-healing capabilities essential for production environments.

The framework operates on a multi-layered architecture that separates concerns between browser session management, agent reasoning, tool execution, and page interaction primitives. This design allows users to choose when to leverage AI for navigating unfamiliar pages and when to use explicit code for deterministic operations.

Core Architecture Layers

Stagehand's architecture consists of four primary layers working in concert to provide browser automation capabilities:

| Layer | Purpose | Key Components |
| --- | --- | --- |
| Agent Layer | High-level reasoning and decision making | MicrosoftCUAClient, System Prompts |
| Context Layer | Browser session and page lifecycle management | BrowserContext, Page management |
| Understudy Layer | Low-level browser primitives and frame handling | FrameLocator, Page interactions |
| Transport Layer | Local/Remote browser connectivity | CDP (Chrome DevTools Protocol) |

Browser Context Management

The BrowserContext class (defined in context.ts) serves as the central hub for managing browser sessions. It handles multiple pages, initialization scripts, and coordinate-based element interactions.

Page Management

The context maintains a mapping of pages organized by target ID, enabling multi-tab support:

pages(): Page[] {
  const rows: Array<{ tid: TargetId; page: Page; created: number }> = [];
  for (const [tid, page] of this.pagesByTarget) {
    if (this.typeByTarget.get(tid) === "page") {
      rows.push({ tid, page, created: this.createdAtByTarget.get(tid) ?? 0 });
    }
  }
  rows.sort((a, b) => a.created - b.created);
  return rows.map((r) => r.page);
}

Sources: context.ts:lines

Pages are returned in creation order (oldest to newest), and OOPIF (Out-of-Process iframe) targets are intentionally excluded from this listing to maintain a clean multi-tab abstraction.
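The filtering and ordering rule can be tried in isolation. The sketch below reproduces the same logic over a plain map. `TargetRecord`, `listTopLevelPages`, and the string labels are stand-ins invented for this example, since the real `Page` class is not shown here.

```typescript
type TargetId = string;

// Simplified stand-in for what the context tracks per target.
interface TargetRecord {
  type: "page" | "iframe"; // OOPIF targets are modeled here as "iframe"
  createdAt?: number;      // missing timestamps sort first, like `?? 0` above
  label: string;
}

// Return labels of top-level pages, oldest first, skipping iframe targets.
function listTopLevelPages(targets: Map<TargetId, TargetRecord>): string[] {
  const rows: Array<{ label: string; created: number }> = [];
  targets.forEach((target) => {
    if (target.type === "page") {
      rows.push({ label: target.label, created: target.createdAt ?? 0 });
    }
  });
  rows.sort((a, b) => a.created - b.created);
  return rows.map((row) => row.label);
}

// Usage: insertion order differs from creation order on purpose.
const demoTargets = new Map<TargetId, TargetRecord>([
  ["t3", { type: "page", createdAt: 300, label: "checkout" }],
  ["t2", { type: "iframe", createdAt: 200, label: "ad-frame" }],
  ["t1", { type: "page", createdAt: 100, label: "home" }],
]);
const ordered = listTopLevelPages(demoTargets);
```

Here `ordered` contains "home" followed by "checkout", and the iframe target does not appear.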

Initialization Scripts

The context supports both seeding and full registration of initialization scripts:

private async applyInitScriptsToPage(
  page: Page,
  opts?: { seedOnly?: boolean },
): Promise<void> {
  if (opts?.seedOnly) {
    for (const source of this.initScripts) {
      page.seedInitScript(source);
    }
    return;
  }
  for (const source of this.initScripts) {
    await page.registerInitScript(source);
  }
}

Sources: context.ts:lines

This dual-mode initialization allows scripts to be either eagerly registered or lazily seeded for later injection.

Frame Handling Architecture

Stagehand implements sophisticated frame management through the FrameLocator module, handling both same-process and cross-process frame scenarios including OOPIFs.

Lifecycle-Aware Frame Attachment

The frame attachment process waits for specific lifecycle events before exposing the main world context:

const onLifecycle = (evt: Protocol.Page.LifecycleEventEvent) => {
  if (
    evt.frameId !== childFrameId ||
    (evt.name !== "DOMContentLoaded" &&
      evt.name !== "load" &&
      evt.name !== "networkIdle" &&
      evt.name !== "networkidle")
  ) {
    return;
  }
  if (hasMainWorldOnParent()) return finish();
  // ... handle frame ownership transfer
};

Sources: frameLocator.ts:lines

This approach ensures that automation scripts don't attempt to interact with frames before they have reached a stable state.
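The guard at the top of that handler can be read as a single predicate: the event must belong to the child frame and must be one of the readiness events. A minimal sketch of that predicate follows. `LifecycleEvent` is a trimmed-down stand-in for the protocol type, and `isReadyEventForFrame` is a hypothetical helper name.

```typescript
// Only the two fields the guard inspects.
interface LifecycleEvent {
  frameId: string;
  name: string;
}

// Event names accepted by the guard, including both spellings of network idle.
const READY_EVENT_NAMES = new Set<string>([
  "DOMContentLoaded",
  "load",
  "networkIdle",
  "networkidle",
]);

// True when the event signals readiness for the given child frame.
function isReadyEventForFrame(evt: LifecycleEvent, childFrameId: string): boolean {
  return evt.frameId === childFrameId && READY_EVENT_NAMES.has(evt.name);
}
```

Written this way, the original early return corresponds to `if (!isReadyEventForFrame(evt, childFrameId)) return;`.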

Agent System Design

System Prompt Architecture

The agent receives structured prompts that organize its behavior into distinct sections:

<system>
  <identity>You are a web automation assistant...</identity>
  <task>
    <goal>${executionInstruction}</goal>
    <date display="local" iso="${isoDate}">${localeDate}</date>
  </task>
  <page>
    <startingUrl>...</startingUrl>
  </page>
  <mindset>...</mindset>
  <guidelines>...</guidelines>
  <page_understanding_protocol>...</page_understanding_protocol>
  <navigation>...</navigation>
  <tools>...</tools>
  <strategy>...</strategy>
  <roadblocks>...</roadblocks>
  <variables>...</variables>
  <completion>...</completion>
</system>

Sources: agentSystemPrompt.ts:lines

Hybrid Mode Page Understanding

The system supports two modes of page understanding based on the isHybridMode flag:

| Mode | Primary Tool | Secondary Tool | Use Case |
| --- | --- | --- | --- |
| Hybrid | screenshot | ariaTree | Visual confirmation + accessible content |
| Standard | ariaTree | screenshot | Text-focused accessibility tree |

Sources: agentSystemPrompt.ts:lines

The page understanding protocol ensures agents start by comprehending the page state before taking actions:

<page_understanding_protocol>
  <step_1>
    <title>UNDERSTAND THE PAGE</title>
    <primary_tool>
      <name>screenshot|ariaTree</name>
      <usage>Get complete page context before taking actions</usage>
    </primary_tool>
  </step_1>
</page_understanding_protocol>
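The mode table and the `screenshot|ariaTree` placeholder suggest that the primary tool is selected from the isHybridMode flag when the prompt is built. The sketch below shows one way this could look. Both function names are hypothetical, and the rendered fragment is simplified from the template shown above.

```typescript
type PageTool = "screenshot" | "ariaTree";

// Choose tool priority from the hybrid-mode flag, as in the mode table.
function pageUnderstandingTools(isHybridMode: boolean): {
  primary: PageTool;
  secondary: PageTool;
} {
  return isHybridMode
    ? { primary: "screenshot", secondary: "ariaTree" }
    : { primary: "ariaTree", secondary: "screenshot" };
}

// Render the <primary_tool> fragment of the protocol for the chosen mode.
function renderPrimaryTool(isHybridMode: boolean): string {
  const { primary } = pageUnderstandingTools(isHybridMode);
  return [
    "<primary_tool>",
    `  <name>${primary}</name>`,
    "  <usage>Get complete page context before taking actions</usage>",
    "</primary_tool>",
  ].join("\n");
}
```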

Microsoft CUA Agent Actions

The Microsoft CUA client defines a comprehensive set of agent actions:

| Action | Purpose | Parameters (? marks optional) |
| --- | --- | --- |
| left_click | Click at coordinate | coordinate: [x, y] |
| scroll | Scroll page | pixels: number, coordinate?: [x, y] |
| visit_url | Navigate to URL | url: string |
| web_search | Perform search | query: string |
| history_back | Navigate backward | None |
| pause_and_memorize_fact | Store context | fact: string |
| wait | Pause execution | time: number |
| terminate | End task | status: success/failure |
| type | Input text | text: string, press_enter?: boolean |
| key | Keyboard input | keys: string[] |

Sources: MicrosoftCUAClient.ts:lines

FARA Function Calling Protocol

The agent uses an XML-based function calling format:

<tools>
${toolSchema}
</tools>

<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

Sources: MicrosoftCUAClient.ts:lines

This format separates agent thoughts from function calls, enabling clear reasoning before action execution.
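On the receiving side, a response in this format has to be split into free-text reasoning and structured calls. The following is a minimal parsing sketch under the assumption that each `<tool_call>` block contains one JSON object with `name` and `arguments`. `parseToolCalls` and `isToolCall` are hypothetical helpers and not the client's actual parser.

```typescript
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Runtime check that a parsed JSON value has the expected shape.
function isToolCall(value: unknown): value is ToolCall {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate["name"] === "string" &&
    typeof candidate["arguments"] === "object" &&
    candidate["arguments"] !== null
  );
}

// Extract every well-formed <tool_call> block; malformed blocks are skipped.
function parseToolCalls(modelOutput: string): ToolCall[] {
  const calls: ToolCall[] = [];
  const pattern = /<tool_call>\s*([\s\S]*?)\s*<\/tool_call>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(modelOutput)) !== null) {
    const body = match[1];
    if (body === undefined) continue;
    try {
      const parsed: unknown = JSON.parse(body);
      if (isToolCall(parsed)) calls.push(parsed);
    } catch {
      // Not valid JSON: ignore this block rather than failing the whole turn.
    }
  }
  return calls;
}

// Usage: reasoning text before the block is simply ignored by the parser.
const sampleOutput =
  'The search box is at the top.\n<tool_call>\n' +
  '{"name": "left_click", "arguments": {"coordinate": [120, 48]}}\n</tool_call>';
const parsedCalls = parseToolCalls(sampleOutput);
```

Skipping malformed blocks is a design choice of this sketch. A stricter client might instead surface the error so the model can retry.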

Variable System

The framework supports variable substitution in tool calls using %variableName% syntax:

const variableToolsNote = isHybridMode
  ? "Use %variableName% syntax in the type, fillFormVision, or act tool's value/text/action fields."
  : "Use %variableName% syntax in the act or fillForm tool's action fields.";

Sources: agentSystemPrompt.ts:lines

Variables are defined with optional descriptions and rendered in XML format:

<variables>
  <variable name="password" />
  <variable name="username">The username for login</variable>
</variables>
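A block like the one above could be produced by a small renderer. The sketch below is an illustration only: `VariableSpec`, `escapeXml`, and `renderVariablesXml` are hypothetical names, and the decision to emit names and descriptions but never values simply mirrors the sample output.

```typescript
interface VariableSpec {
  value: string;        // kept out of the rendered prompt in this sketch
  description?: string; // optional hint shown to the agent
}

// Escape the characters that are significant in XML text and attributes.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render one <variable> element per entry, self-closing when no description is given.
function renderVariablesXml(variables: Record<string, VariableSpec>): string {
  const lines = Object.keys(variables).map((name) => {
    const description = variables[name]?.description;
    return description
      ? `  <variable name="${escapeXml(name)}">${escapeXml(description)}</variable>`
      : `  <variable name="${escapeXml(name)}" />`;
  });
  return ["<variables>", ...lines, "</variables>"].join("\n");
}

// Usage: reproduces the shape of the sample block.
const variablesXml = renderVariablesXml({
  password: { value: "s3cret" },
  username: { value: "ada", description: "The username for login" },
});
```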

Session Management

Daemon-Based Architecture

The CLI architecture uses a persistent daemon for browser session management:

graph TD
    A[CLI Command] --> B[Daemon Check]
    B -->|Running| C[Send Command]
    B -->|Not Running| D[Auto-restart Daemon]
    D --> C
    C --> E[Browser Instance]
    E --> F[Page Operations]

The daemon supports two execution modes:

| Mode | Trigger | Use Case |
| --- | --- | --- |
| remote | BROWSERBASE_API_KEY is set | Cloud browser infrastructure |
| local | No API key detected | Local browser instances |

Multi-Session Support

Sessions can be named for parallel browser instances:

browse --session <name>

The context stores metadata using session names, enabling clean separation of concurrent automation tasks.

Tool Categories

Navigation Tools

| Tool | Description |
| --- | --- |
| open | Navigate to URL with configurable wait states |
| reload | Refresh current page |
| back / forward | Browser history navigation |

Interaction Tools

| Tool | Description |
| --- | --- |
| click | Click element by reference or coordinates |
| type | Text input with optional delays |
| press | Keyboard shortcuts |
| hover | Mouse hover at coordinates |
| scroll | Scroll operations with delta support |

Page Information Tools

| Tool | Description |
| --- | --- |
| snapshot | Accessibility tree with element refs |
| screenshot | Visual capture (PNG/JPEG) |
| get | Retrieve URL, title, text, HTML, values |
| get markdown | Convert HTML to markdown |

Strategy Guidelines

The system embeds strategic guidelines for agent behavior:

<strategy>
  <item>CRITICAL: Use extract ONLY when the task explicitly requires structured data output</item>
  <item>Keep actions atomic and verify outcomes before proceeding</item>
  <item>For each action, provide clear reasoning about why you're taking that step</item>
  <item>When you need to input text, prefer using the keys tool to type the entire sequence at once</item>
</strategy>

Sources: agentSystemPrompt.ts:lines

Configuration Options

The agent system prompt accepts the following configuration parameters:

| Parameter | Type | Purpose |
| --- | --- | --- |
| executionInstruction | string | Task description for the agent |
| url | string | Starting page URL |
| systemInstructions | string | Custom system-level instructions (CDATA) |
| variables | Record | Variable name/value pairs |
| isHybridMode | boolean | Enable screenshot-first page understanding |
| captchasAutoSolve | boolean | Enable CAPTCHA auto-solving (shows roadblocks section) |
| hasSearch | boolean | Enable search tool usage |
| isLocal | boolean | Local vs remote browser context |

Error Handling

The context manages HTTP header configuration errors with detailed session reporting:

const failures = Array.from(result.errors.entries()).map(([target, entry]) => {
  const reason = entry.result.reason as Error;
  const sid = entry.session.id ?? "unknown";
  const message = reason?.message ?? String(reason);
  return `session=${sid} error=${message}`;
});

if (failures.length) {
  throw new StagehandSetExtraHTTPHeadersError(failures);
}

Sources: context.ts:lines
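The pattern in that excerpt (apply an operation to every session, then report all failures in one error) can be sketched independently of the Stagehand types. In the example below, `SessionOutcome` and `formatFailures` are hypothetical names, and the outcome shape is modeled on the result of `Promise.allSettled`.

```typescript
// One settled outcome per session, shaped like Promise.allSettled results.
interface SessionOutcome {
  sessionId?: string;
  result: { status: "fulfilled" } | { status: "rejected"; reason: unknown };
}

// Turn rejected outcomes into "session=<id> error=<message>" strings.
function formatFailures(outcomes: SessionOutcome[]): string[] {
  const failures: string[] = [];
  for (const outcome of outcomes) {
    if (outcome.result.status !== "rejected") continue;
    const reason = outcome.result.reason;
    const message = reason instanceof Error ? reason.message : String(reason);
    failures.push(`session=${outcome.sessionId ?? "unknown"} error=${message}`);
  }
  return failures;
}

// Usage: one success, one Error rejection, one non-Error rejection without an id.
const failureLines = formatFailures([
  { sessionId: "s1", result: { status: "fulfilled" } },
  { sessionId: "s2", result: { status: "rejected", reason: new Error("target closed") } },
  { result: { status: "rejected", reason: "timeout" } },
]);
```

Collecting every failure before throwing gives the caller a complete picture instead of only the first session that failed.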

Extension Points

Stagehand provides several extension mechanisms:

  1. Custom Instructions: Via systemInstructions parameter with CDATA wrapping for complex multi-line content
  2. Variables: For sensitive data injection (passwords, tokens)
  3. Init Scripts: JavaScript injection at page load time
  4. Tool Schema: Extensible action definitions in the Microsoft CUA client

CDP Engine

Related topics: Architecture Overview, DOM and Accessibility Tree, Core Actions

The CDP (Chrome DevTools Protocol) Engine is the low-level abstraction layer in Stagehand that provides direct communication between the framework and web browsers. It orchestrates browser launching, session management, frame handling, execution context management, and CDP command dispatching.

Architecture Overview

graph TD
    A[Stagehand API] --> B[CDP Engine]
    B --> C[Launch Module]
    B --> D[Session Manager]
    B --> E[Frame Registry]
    B --> F[Execution Context Registry]
    C --> G[Local Browser]
    C --> H[Browserbase Cloud]
    D --> I[CDP WebSocket Connections]
    E --> J[Frame Lifecycle Events]
    F --> K[Isolated JS Contexts]

The CDP Engine serves as the foundation layer that Stagehand's high-level browser automation primitives are built upon. It abstracts away the complexity of raw CDP WebSocket communication while providing type-safe interfaces for all browser operations.

Sources: packages/core/lib/v3/understudy/cdp.ts:1-50

Core Components

Launch Module

The launch module handles browser instantiation across different deployment environments.

| Launch Mode | Source File | Description |
| --- | --- | --- |
| Local Browser | packages/core/lib/v3/launch/local.ts | Launches Chromium locally using Playwright's browser management |
| Browserbase Cloud | packages/core/lib/v3/launch/browserbase.ts | Connects to remote browser infrastructure for scalable execution |

graph LR
    A[Launch Request] --> B{Local or Remote?}
    B -->|Local| C[local.ts]
    B -->|Remote| D[browserbase.ts]
    C --> E[Playwright Browser]
    D --> F[CDP Endpoint]

The local launch implementation uses Playwright's browser management system to spawn Chromium instances with the necessary debugging flags enabled. Remote launches establish WebSocket connections to Browserbase's cloud browser fleet.

Sources: packages/core/lib/v3/launch/local.ts:1-100

Session Manager

The Session Manager (context.ts) maintains active browser pages and their associated CDP sessions.

// Pages retrieval - returns top-level pages oldest to newest
pages(): Page[] {
  const rows: Array<{ tid: TargetId; page: Page; created: number }> = [];
  for (const [tid, page] of this.pagesByTarget) {
    if (this.typeByTarget.get(tid) === "page") {
      rows.push({ tid, page, created: this.createdAtByTarget.get(tid) ?? 0 });
    }
  }
  rows.sort((a, b) => a.created - b.created);
  return rows.map((r) => r.page);
}

The session manager intentionally excludes OOPIF (Out-of-Process iframe) targets from page listings, focusing on top-level browsing contexts.

Sources: packages/core/lib/v3/understudy/context.ts:60-75

Frame Registry

The Frame Registry (frameRegistry.ts) tracks all frame instances across page hierarchies. It maintains the relationship between parent and child frames, enabling accurate frame targeting for CDP commands.

| Method | Purpose |
| --- | --- |
| registerFrame | Track newly created frames |
| unregisterFrame | Clean up destroyed frames |
| getParentFrame | Resolve parent frame context |
| getChildFrames | List direct children of a frame |
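To illustrate the kind of bookkeeping these four methods imply, here is a minimal in-memory sketch. The class name, the parameter lists, and the choice to remove descendants along with a destroyed frame are all assumptions for this example. The actual signatures in frameRegistry.ts are not shown in this document.

```typescript
// Hypothetical registry keyed by frame id; stores each frame's parent id.
class FrameRegistrySketch {
  private readonly parentByFrame = new Map<string, string | null>();

  // Track a new frame; the main frame has a null parent.
  registerFrame(frameId: string, parentId: string | null): void {
    this.parentByFrame.set(frameId, parentId);
  }

  // Remove a frame and, by assumption, all of its descendants.
  unregisterFrame(frameId: string): void {
    for (const childId of this.getChildFrames(frameId)) {
      this.unregisterFrame(childId);
    }
    this.parentByFrame.delete(frameId);
  }

  // Parent id, or null for the main frame and for unknown frames.
  getParentFrame(frameId: string): string | null {
    return this.parentByFrame.get(frameId) ?? null;
  }

  // Direct children only, in registration order.
  getChildFrames(frameId: string): string[] {
    const children: string[] = [];
    this.parentByFrame.forEach((parentId, id) => {
      if (parentId === frameId) children.push(id);
    });
    return children;
  }
}

// Usage: main frame with one iframe that itself embeds another iframe.
const registry = new FrameRegistrySketch();
registry.registerFrame("main", null);
registry.registerFrame("ad", "main");
registry.registerFrame("ad-inner", "ad");
```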

Execution Context Registry

The Execution Context Registry (executionContextRegistry.ts) manages JavaScript execution contexts within frames.

graph TD
    A[Page Load] --> B[Create Main World Context]
    B --> C[Optional: Create Isolated World]
    C --> D[Register Context in Registry]
    D --> E[Frame Ready for Script Execution]
    
    F[iframe Navigation] --> G[Create New Context]
    G --> D

Each frame can have multiple execution contexts:

  • Main World: The default JavaScript context where page scripts run
  • Isolated World: Sandboxed contexts for extension scripts or injected code

Sources: packages/core/lib/v3/understudy/executionContextRegistry.ts:1-80

Frame Lifecycle Management

The frameLocator.ts module handles the complex state transitions of browser frames, particularly for iframes and cross-origin navigations.

const onLifecycle = (evt: Protocol.Page.LifecycleEventEvent) => {
  if (
    evt.frameId !== childFrameId ||
    (evt.name !== "DOMContentLoaded" &&
      evt.name !== "load" &&
      evt.name !== "networkIdle" &&
      evt.name !== "networkidle")
  ) {
    return;
  }
  // Handle frame initialization
};

Key lifecycle events monitored:

  • DOMContentLoaded: Frame DOM is parsed
  • load: All resources loaded
  • networkIdle / networkidle: No pending network requests

Sources: packages/core/lib/v3/understudy/frameLocator.ts:25-40

Cookie Management

The cookies.ts module provides CDP-compatible cookie handling with validation and normalization.

export function normalizeCookieParams(cookies: CookieParam[]): CookieParam[] {
  return cookies.map((c) => {
    // CDP needs either a url or an explicit domain/path pair to scope the cookie.
    if (!c.url && !(c.domain && c.path)) {
      throw new CookieValidationError(
        `Cookie "${c.name}" must have a url or a domain/path pair`
      );
    }
    // Additional validation for secure/sameSite pairing (elided in this excerpt)
    return c; // Assumed for this excerpt: return the validated cookie so map() yields CookieParam[]
  });
}

| Validation Rule | CDP Requirement |
| --- | --- |
| URL or Domain/Path | Cookies must have either url or both domain+path |
| Secure + SameSite=None | Browsers require secure: true when using sameSite: "None" |
| Host-only cookies | When url provided, domain and path are derived |
The module enforces these constraints before sending cookies to CDP, preventing silent failures from browser rejections.

Sources: packages/core/lib/v3/understudy/cookies.ts:30-60
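The first two validation rules can be checked with a small standalone function. The sketch below collects problems instead of throwing, which is a design choice of this example rather than the behavior of cookies.ts. `CookieInput` and `findCookieProblems` are hypothetical names.

```typescript
// Minimal cookie shape covering only the fields the rules mention.
interface CookieInput {
  name: string;
  value: string;
  url?: string;
  domain?: string;
  path?: string;
  secure?: boolean;
  sameSite?: "Strict" | "Lax" | "None";
}

// Return a list of rule violations; an empty list means the cookie passes.
function findCookieProblems(cookie: CookieInput): string[] {
  const problems: string[] = [];
  // Rule 1: a url, or both domain and path, must be present.
  if (!cookie.url && !(cookie.domain && cookie.path)) {
    problems.push(`Cookie "${cookie.name}" must have a url or a domain/path pair`);
  }
  // Rule 2: SameSite=None is only accepted together with secure: true.
  if (cookie.sameSite === "None" && cookie.secure !== true) {
    problems.push(`Cookie "${cookie.name}" uses sameSite "None" without secure: true`);
  }
  return problems;
}
```

Returning all problems at once is convenient when validating a batch of cookies, because the caller can report every issue in a single message.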

CDP Command Dispatch

The CDP Engine provides a unified interface for sending commands to browsers:

sequenceDiagram
    Client->>CDP Engine: Execute CDP Command
    CDP Engine->>Validation: Check Parameters
    Validation->>Session Manager: Route to Correct Session
    Session Manager->>CDP WebSocket: Send Command
    CDP WebSocket-->>Session Manager: CDP Response
    Session Manager-->>CDP Engine: Typed Response
    CDP Engine-->>Client: Result

Supported Command Categories

| Category | Capabilities |
| --- | --- |
| Page Operations | Navigation, reload, back/forward |
| Frame Management | Frame creation, destruction, activation |
| Script Execution | Evaluate expressions, call functions |
| Input Handling | Mouse, keyboard, touch events |
| Network Monitoring | Request/response capture |
| Runtime Inspection | Console access, breakpoints |

Initialization Scripts

The CDP Engine supports injecting initialization scripts into pages:

private async applyInitScriptsToPage(
  page: Page,
  opts?: { seedOnly?: boolean },
): Promise<void> {
  if (opts?.seedOnly) {
    for (const source of this.initScripts) {
      page.seedInitScript(source);
    }
    return;
  }
  for (const source of this.initScripts) {
    await page.registerInitScript(source);
  }
}

| Script Type | Timing | Use Case |
| --- | --- | --- |
| seedInitScript | Before page loads | Pre-inject polyfills, shims |
| registerInitScript | After page load | Runtime modifications |

Error Handling

The CDP Engine implements robust error handling for common CDP failure scenarios:

// Session-level error collection
const failures = pendingEntries
  .filter((entry) => entry.result.status === "failure")
  .map((entry) => {
    const reason = entry.result.reason as Error;
    const sid = entry.session.id ?? "unknown";
    const message = reason?.message ?? String(reason);
    return `session=${sid} error=${message}`;
  });

if (failures.length) {
  throw new StagehandSetExtraHTTPHeadersError(failures);
}

Custom error types:

  • StagehandSetExtraHTTPHeadersError: Failed header injection
  • CookieValidationError: Invalid cookie parameters
  • ExecutionContextError: Script execution failures

Sources: packages/core/lib/v3/understudy/context.ts:45-55

Integration with Agent System

The CDP Engine is used by higher-level agents (like MicrosoftCUAClient) to perform browser automation:

// Supported actions in Microsoft CUA integration
const supportedActions = [
  "left_click",
  "scroll",
  "visit_url",
  "web_search",
  "history_back",
  "pause_and_memorize_fact",
  "wait",
  "terminate",
];

Each action maps to specific CDP commands, with the CDP Engine abstracting the protocol-level details.

Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:50-70

Configuration

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| headless | boolean | auto | Run browser in headless mode |
| timeout | number | 30000 | Default command timeout (ms) |
| viewport | Viewport | 1280x720 | Browser viewport size |
| userAgent | string | auto | Custom user agent string |
| ignoreHTTPSErrors | boolean | false | Allow invalid certificates |

Summary

The CDP Engine is the foundational layer that enables Stagehand's browser automation capabilities. It provides:

  1. Abstraction: Type-safe interfaces over raw CDP WebSocket communication
  2. Multi-environment: Support for both local and cloud browser execution
  3. Lifecycle management: Frame and execution context tracking
  4. Validation: Cookie and parameter normalization before CDP commands
  5. Error handling: Structured error types and recovery mechanisms

All high-level browser automation features in Stagehand build upon the CDP Engine's primitives, making it essential for understanding the framework's internals.

Core Actions

Related topics: CDP Engine, DOM and Accessibility Tree, Agent System

Overview

Core Actions are the fundamental building blocks of browser automation in Stagehand. They represent the primitive operations that the AI agent can perform to interact with web pages, extract information, and accomplish user-defined tasks. These actions bridge the gap between natural language instructions and executable browser operations.

The Core Actions system is designed around a modular architecture where different handlers manage specific types of interactions. Each handler is responsible for a category of actions, implementing the logic to translate high-level agent decisions into low-level browser commands.

Sources: packages/core/lib/v3/handlers/v3AgentHandler.ts

Architecture

The Core Actions system follows a layered architecture:

graph TD
    A[User Instruction] --> B[Agent Handler]
    B --> C[Action Router]
    C --> D[actHandler]
    C --> E[extractHandler]
    C --> F[observeHandler]
    D --> G[Browser CDP Commands]
    E --> G
    F --> G
    H[Microsoft CUA Client] --> G

Handler Components

| Handler | Purpose | Key Operations |
| --- | --- | --- |
| actHandler | Execute user interaction actions | click, type, scroll, hover, drag |
| extractHandler | Extract structured data from pages | DOM parsing, content extraction |
| observeHandler | Analyze and understand page state | accessibility tree, element detection |
| v3AgentHandler | Orchestrate agent workflow | task planning, action coordination |

Sources: packages/core/lib/v3/handlers/actHandler.ts

Action Types

Browser Navigation Actions

| Action | Parameters | Description |
| --- | --- | --- |
| visit_url | url, wait | Navigate to a specified URL with optional wait state |
| history_back | - | Navigate back in browser history |
| history_forward | - | Navigate forward in browser history |
| reload | - | Reload the current page |

Interaction Actions

| Action | Parameters | Description |
| --- | --- | --- |
| left_click | coordinate, element | Click at coordinates or on an element |
| right_click | coordinate, element | Perform right-click context menu action |
| mouse_move | coordinate | Move mouse to specified position |
| scroll | pixels, coordinate | Scroll by pixel amount (positive=up, negative=down) |
| drag | from, to, steps | Perform drag operation with interpolation |
| type | text, press_enter, delete_existing_text | Type text into focused input fields |
typetext, press_enter, delete_existing_textType text into focused input fields

Information Retrieval Actions

| Action | Parameters | Description |
| --- | --- | --- |
| extract | schema | Extract structured data matching a Zod schema |
| screenshot | full_page | Capture visual representation of page |
| aria_tree | ref | Get accessibility tree for element inspection |

Utility Actions

| Action | Parameters | Description |
| --- | --- | --- |
| wait | time | Pause execution for specified seconds |
| pause_and_memorize_fact | fact | Store information for later retrieval |
| web_search | query | Execute a web search |
| terminate | status | End the task with success or failure status |

Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:1-100

Action Parameters

Common Parameters

interface ActionParameters {
  action: string;           // The action type to perform
  element?: string;          // Element reference (e.g., "@0-5")
  coordinate?: [number, number]; // [x, y] pixel coordinates
  ref?: string;             // Element reference in ariaTree
}

Type Action Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| text | string | Text to type into the input field |
| press_enter | boolean | Whether to press Enter after typing |
| delete_existing_text | boolean | Clear existing text before typing |

Scroll Action Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| pixels | number | Positive values scroll up, negative scroll down |
| coordinate | [number, number] | Optional target coordinates for viewport-relative scrolling |

Extract Action Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| schema | ZodSchema | Zod schema defining the extraction structure |
| prompt | string | Natural language description of data to extract |

Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:20-85
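The parameter tables above can be read as a set of per-action shapes. The following is a minimal sketch of that idea as a TypeScript discriminated union; the type and function names (`AgentAction`, `validateAction`, `scrollDirection`) and the `"success" | "failure"` status values are assumptions made for illustration and are not taken from the Stagehand source.

```typescript
// Hypothetical types: field names mirror the parameter tables, not the real source.
type Coordinate = [number, number];

type AgentAction =
  | { action: "left_click"; coordinate?: Coordinate; element?: string }
  | { action: "scroll"; pixels: number; coordinate?: Coordinate }
  | { action: "type"; text: string; press_enter?: boolean; delete_existing_text?: boolean }
  | { action: "wait"; time: number }
  | { action: "terminate"; status: "success" | "failure" };

// Returns a list of problems; an empty list means the action looks well-formed.
function validateAction(a: AgentAction): string[] {
  const problems: string[] = [];
  switch (a.action) {
    case "left_click":
      if (!a.coordinate && !a.element) {
        problems.push("left_click needs a coordinate or an element");
      }
      break;
    case "scroll":
      if (!Number.isFinite(a.pixels) || a.pixels === 0) {
        problems.push("scroll needs a non-zero pixel amount");
      }
      break;
    case "type":
      if (a.text.length === 0) problems.push("type needs non-empty text");
      break;
    case "wait":
      if (a.time < 0) problems.push("wait time must not be negative");
      break;
    case "terminate":
      break;
  }
  return problems;
}

// Applies the sign convention from the scroll table: positive = up, negative = down.
function scrollDirection(pixels: number): "up" | "down" {
  return pixels > 0 ? "up" : "down";
}
```

Modelling each action as its own variant lets the compiler reject, for example, a `scroll` without `pixels`, while `validateAction` covers the checks that types alone cannot express.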

Tool Schema

The agent uses the FARA function-calling template for tool execution:

You are provided with function signatures within <tools></tools> XML tags:
<tools>
${toolSchema}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{{"name": <function-name>, "arguments": <args-json-object>}}
</tool_call>

The tool schema defines the complete contract between the agent reasoning system and the browser automation layer:

{
  "name": "browser_automation",
  "description": "Primary tool for browser automation tasks",
  "parameters": {
    "type": "object",
    "properties": {
      "action": {
        "type": "string",
        "enum": ["left_click", "scroll", "visit_url", ...]
      },
      "coordinate": {
        "type": "array",
        "description": "(x, y): The x and y coordinates for mouse operations"
      },
      "pixels": {
        "type": "number",
        "description": "Scroll amount; positive = up, negative = down"
      }
    },
    "required": ["action"]
  }
}

Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:85-130
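To make the template concrete, here is a small sketch of how the `<tools>` prompt and a `<tool_call>` payload could be rendered as strings. The helper names `renderToolPrompt` and `renderToolCall` are hypothetical; only the tag layout is taken from the template above, and the doubled braces in that template are assumed to be format-string escapes for single braces.

```typescript
// Hypothetical helpers; only the <tools>/<tool_call> layout comes from the template.
interface ToolSchema {
  name: string;
  description: string;
  parameters: Record<string, unknown>;
}

// Builds the instruction text that advertises the available tool to the model.
function renderToolPrompt(schema: ToolSchema): string {
  return [
    "You are provided with function signatures within <tools></tools> XML tags:",
    "<tools>",
    JSON.stringify(schema),
    "</tools>",
  ].join("\n");
}

// Formats one call in the shape the model is expected to emit.
function renderToolCall(name: string, args: Record<string, unknown>): string {
  return `<tool_call>\n${JSON.stringify({ name, arguments: args })}\n</tool_call>`;
}
```

For example, `renderToolCall("browser_automation", { action: "scroll", pixels: -200 })` produces a single-line JSON object wrapped in `<tool_call>` tags.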

Execution Flow

sequenceDiagram
    participant User
    participant Agent
    participant Handler
    participant CDP as Chrome DevTools Protocol
    
    User->>Agent: Natural language instruction
    Agent->>Agent: Reason and plan action
    Agent->>Handler: Execute action with parameters
    Handler->>CDP: Translate to CDP command
    CDP-->>Handler: Operation result
    Handler-->>Agent: Action completion status
    Agent->>User: Task progress update

Page Context Management

The Context class manages page state and frame handling for multi-tab scenarios:

pages(): Page[] {
  const rows: Array<{ tid: TargetId; page: Page; created: number }> = [];
  for (const [tid, page] of this.pagesByTarget) {
    if (this.typeByTarget.get(tid) === "page") {
      rows.push({ tid, page, created: this.createdAtByTarget.get(tid) ?? 0 });
    }
  }
  rows.sort((a, b) => a.created - b.created);
  return rows.map((r) => r.page);
}

The context system maintains:

  • Active page instances by target ID
  • Frame lifecycle management
  • Execution context tracking per frame
  • Initialization script injection

Sources: packages/core/lib/v3/understudy/context.ts:1-50
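The ordering behaviour of `pages()` is easier to see in isolation. The sketch below reproduces the same filter-then-sort logic with simplified stand-ins: `MiniContext`, its `register` method, and the reduced `Page` type are hypothetical and exist only for this example.

```typescript
// Simplified stand-ins for the real Context fields; Page is reduced to a label.
type TargetId = string;
type TargetType = "page" | "iframe";
interface Page {
  label: string;
}

class MiniContext {
  private readonly pagesByTarget = new Map<TargetId, Page>();
  private readonly typeByTarget = new Map<TargetId, TargetType>();
  private readonly createdAtByTarget = new Map<TargetId, number>();

  // Hypothetical helper that records a target the way attach events might.
  register(tid: TargetId, type: TargetType, page: Page, created: number): void {
    this.pagesByTarget.set(tid, page);
    this.typeByTarget.set(tid, type);
    this.createdAtByTarget.set(tid, created);
  }

  // Same logic as the excerpt: keep top-level pages, order oldest to newest.
  pages(): Page[] {
    const rows: Array<{ page: Page; created: number }> = [];
    for (const [tid, page] of this.pagesByTarget) {
      if (this.typeByTarget.get(tid) === "page") {
        rows.push({ page, created: this.createdAtByTarget.get(tid) ?? 0 });
      }
    }
    rows.sort((a, b) => a.created - b.created);
    return rows.map((r) => r.page);
  }
}
```

Because the result is sorted by creation time rather than by insertion order, the first element is always the oldest open tab, and non-page targets such as iframes never appear.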

Frame Locator

For handling nested frames and iframes, the frame locator system waits for frame readiness:

graph TD
    A[Parent Page] --> B{Has Main World?}
    B -->|No| C[Enable Lifecycle Events]
    C --> D[Wait for DOMContentLoaded]
    D --> E{Has Main World?}
    E -->|Yes| F[Continue]
    E -->|No| G[Wait with timeout]
    G --> H[Get Session for Frame]
    H --> I[Wait for Main World]
    I --> F
    B -->|Yes| F

Sources: packages/core/lib/v3/understudy/frameLocator.ts

System Prompt Configuration

The agent's behavior is configured through system prompts that include:

  • Task Definition: The execution instruction and goal context
  • Page Understanding Protocol: When to use ariaTree vs screenshot
  • Strategy Guidelines: Best practices for action sequencing
  • Variable Substitution: Support for dynamic values via %variableName% syntax

const systemPrompt = `<system>
  <identity>You are a web automation assistant using browser automation tools</identity>
  <task>
    <goal>${cdata(executionInstruction)}</goal>
    <date display="local" iso="${isoDate}">${localeDate}</date>
  </task>
  ${customInstructionsBlock}
  <page_understanding_protocol>
    ${isHybridMode ? screenshot + ariaTree : ariaTree + screenshot}
  </page_understanding_protocol>
  ${variablesSection}
</system>`;

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts
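The prompt template relies on a `cdata` helper and a variables section whose implementations are not shown in the excerpt. The sketch below illustrates one plausible shape for both. Everything here is an assumption: `cdata` is written as a standard CDATA wrapper, and `buildVariablesSection` and `buildSystemPrompt` are hypothetical names that list variable names only.

```typescript
// Assumed behaviour: the real cdata helper is not shown in the source excerpt.
function cdata(text: string): string {
  // A literal "]]>" would end the CDATA section early, so split it across two sections.
  return `<![CDATA[${text.split("]]>").join("]]]]><![CDATA[>")}]]>`;
}

// Hypothetical: lists variable names so the model can reference them as %name%.
function buildVariablesSection(variables: Record<string, string>): string {
  const names = Object.keys(variables);
  if (names.length === 0) return "";
  // Only names are emitted; values are deliberately kept out of the prompt.
  const items = names.map((n) => `<variable name="${n}" />`).join("");
  return `<variables>${items}</variables>`;
}

// Hypothetical, heavily reduced version of the system prompt assembly.
function buildSystemPrompt(
  goal: string,
  variables: Record<string, string>,
  now: Date,
): string {
  return [
    "<system>",
    `<task><goal>${cdata(goal)}</goal><date iso="${now.toISOString()}" /></task>`,
    buildVariablesSection(variables),
    "</system>",
  ]
    .filter((line) => line.length > 0)
    .join("\n");
}
```

Wrapping the goal in CDATA keeps user-supplied text from being read as prompt markup, and emitting only variable names keeps secrets out of the text sent to the model.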

Error Handling

Actions can fail for several reasons:

| Error Type | Cause | Recovery Strategy |
| --- | --- | --- |
| Element not found | Selector invalid or element removed | Re-observe page, retry with updated selector |
| Action timeout | Page not responding | Wait and retry |
| Frame detached | Frame navigated away | Re-acquire frame context |
| CDP error | Protocol communication failure | Retry CDP command |

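Failed actions are logged with session context for debugging. The excerpt below is simplified: entries is a stand-in name for the per-session results being awaited, each carrying a session id (sid) and a pending result.

// Wait for every per-session call to settle, then keep only the rejections.
const results = await Promise.allSettled(entries.map(({ result }) => result));
const failures = results.flatMap((r, i) => {
  if (r.status !== "rejected") return [];
  const reason = r.reason as Error;
  const message = reason?.message ?? String(reason);
  return [`session=${entries[i].sid} error=${message}`];
});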

Sources: packages/core/lib/v3/understudy/context.ts:50-80

Best Practices

  1. Atomic Actions: Keep actions focused and single-purpose for better reliability
  2. Element References: Use ariaTree references (e.g., @0-5) when available for precise targeting
  3. Wait Appropriately: Allow page state to stabilize before subsequent actions
  4. Error Recovery: The system supports automatic retry and self-healing mechanisms
  5. Variable Usage: Use %variableName% for sensitive data like passwords

Sources: [packages/core/lib/v3/handlers/v3AgentHandler.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/handlers/v3AgentHandler.ts)

DOM and Accessibility Tree

Related topics: CDP Engine, Core Actions, DOM and Accessibility Tree

Overview

The DOM and Accessibility Tree system in Stagehand provides the foundational mechanism for understanding and interacting with web pages. This dual-layer approach combines native DOM manipulation with accessibility tree analysis to enable reliable browser automation through AI agents.

The system serves two primary purposes:

  1. DOM Interaction - Direct manipulation of page elements including clicking, typing, scrolling, and form handling
  2. Accessibility Tree (ariaTree) - Structured representation of page content optimized for AI comprehension and element discovery

This architecture allows Stagehand to balance the precision of programmatic DOM access with the semantic understanding provided by accessibility APIs.

Source: https://github.com/browserbase/stagehand / Human Manual

LLM Providers

Related topics: Project Introduction, Agent System

Overview

Stagehand implements a flexible LLM Provider abstraction that enables browser automation agents to interact with various large language model backends. The provider system is designed to standardize how agents execute actions, parse responses, and handle tool invocations across different AI service providers.

The LLM Provider architecture follows a client-adapter pattern where each provider (OpenAI, Anthropic, Google) implements a common interface while exposing provider-specific capabilities and response formats.

Architecture

High-Level Component Flow

graph TD
    A[Agent Client] --> B[LLM Provider Interface]
    B --> C[OpenAI Client]
    B --> D[Anthropic Client]
    B --> E[Google Client]
    C --> F[OpenAI API]
    D --> G[Claude API]
    E --> H[Gemini API]
    F --> I[Action Execution]
    G --> I
    H --> I

System Prompt Construction

The agent system prompt is constructed dynamically based on execution context and mode. The buildAgentSystemPrompt function in agentSystemPrompt.ts assembles the prompt from multiple modular sections:

return `<system>
  <identity>You are a web automation assistant using browser automation tools to accomplish the user's goal.</identity>
  ${customInstructionsBlock}<task>
    <goal>${cdata(executionInstruction)}</goal>
    <date display="local" iso="${isoDate}">${localeDate}</date>
  </task>
  ...
</system>`;

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:1-50

Provider Interface

Tool Schema Definition

LLM Providers use a standardized tool schema format for function calling. The schema defines available browser automation actions:

const toolSchema = {
  properties: {
    action: {
      type: "string",
      description: "The action to perform",
      enum: ["screenshot", "extract", "click", "type", "wait", "scroll", 
             "hover", "press", "goBack", "goForward", "executeJs", 
             "fillForm", "fillFormVision", "act", "ariaTree", 
             "pause_and_memorize_fact", "terminate"],
    },
    // Additional parameters...
  },
};

Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:1-100

Function Call Template

Providers implement XML-based tool calling using a standardized template:

<tools>
${toolDescs}
</tools>

<tool_call>
{{"name": <function-name>, "arguments": <args-json-object>}}
</tool_call>

This format enables reliable parsing of model responses across different LLM backends.

Supported Providers

Microsoft CUA (Computer Use Agent)

The Microsoft CUA client implements the FARA function-calling pattern with XML-based tool calling:

| Parameter | Type | Description |
| --- | --- | --- |
| action | string | Action type to execute |
| selector | string | Element selector for DOM operations |
| text | string | Text content for typing or extraction |
| reasoning | string | Model's reasoning for the action |
| time | number | Wait duration in seconds |
| status | string | Task completion status |

Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:80-120

Agent Modes

Stagehand supports different operational modes that affect how the agent interacts with LLM providers:

Hybrid Mode

In hybrid mode, the agent prioritizes visual understanding:

const pageUnderstandingProtocol = isHybridMode
  ? `<page_understanding_protocol>
    <step_1>
      <primary_tool>
        <name>screenshot</name>
        <usage>Visual confirmation when needed</usage>
      </primary_tool>
      <secondary_tool>
        <name>ariaTree</name>
        <usage>Get complete page context before taking actions</usage>
      </secondary_tool>
    </step_1>
  </page_understanding_protocol>`
  : `<page_understanding_protocol>
    <step_1>
      <primary_tool>
        <name>ariaTree</name>
        <usage>Get complete page context before taking actions</usage>
      </primary_tool>
      <secondary_tool>
        <name>screenshot</name>
        <usage>Visual confirmation when needed</usage>
      </secondary_tool>
    </step_1>
  </page_understanding_protocol>`;

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:100-130

Variable Substitution

Providers support variable substitution in tool parameters for sensitive data:

const variableToolsNote = isHybridMode
  ? "Use %variableName% syntax in the type, fillFormVision, or act tool's value/text/action fields."
  : "Use %variableName% syntax in the act or fillForm tool's action fields.";

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:60-65
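As an illustration of the %variableName% convention, the sketch below replaces placeholders in a tool argument just before it would be sent to the browser. The function `substituteVariables` and its rule for leaving unknown names untouched are assumptions; the actual substitution code is not part of the excerpt above.

```typescript
// Hypothetical helper: the real substitution code is not shown in the source.
function substituteVariables(
  input: string,
  variables: Record<string, string>,
): string {
  // Match %name% where name looks like an identifier; a lone "%" is ignored.
  return input.replace(
    /%([A-Za-z_][A-Za-z0-9_]*)%/g,
    (match: string, name: string): string => {
      // Own-property check so inherited keys such as "constructor" never match.
      if (!Object.prototype.hasOwnProperty.call(variables, name)) return match;
      return variables[name] ?? match;
    },
  );
}
```

With this approach the model only ever sees and emits the placeholder, for example `type %password% into the password field`, and the real value is filled in locally.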

Tool Actions

Available Browser Automation Actions

| Action | Description | Primary Use Case |
| --- | --- | --- |
| screenshot | Capture page screenshot | Visual verification |
| extract | Extract structured data | Data collection tasks |
| click | Click on element | Navigation/interaction |
| type | Type text into input | Form filling |
| wait | Wait for condition | Synchronization |
| scroll | Scroll the page | Content visibility |
| ariaTree | Get accessibility tree | Page structure understanding |
| fillForm | Fill form fields | Multi-field form completion |

Response Parsing

Thoughts and Action Extraction

The provider implements parsing logic to extract model thoughts and function calls from responses:

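private parseThoughtsAndAction(response: string): {
  thoughts: string;
  functionCall: FaraFunctionCall;
} {
  try {
    // Everything before the first <tool_call> tag is the model's free-text reasoning.
    const parts = response.split("<tool_call>\n");
    const thoughts = parts[0].trim();
    const actionText = parts[1]?.split("\n</tool_call>")[0].trim() ?? "";
    // Assumed completion: the excerpt elides how the JSON payload is parsed.
    const functionCall = JSON.parse(actionText) as FaraFunctionCall;
    return { thoughts, functionCall };
  } catch (error) {
    // Assumed error handling: surface malformed responses to the caller.
    throw new Error(`Failed to parse tool call: ${String(error)}`);
  }
}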

Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:150-180
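For experimenting with this response format outside the client class, a standalone parser can be sketched as follows. This is not the library's implementation: the `FunctionCall` interface is a hypothetical stand-in for `FaraFunctionCall`, and the validation rules are assumptions.

```typescript
// Hypothetical stand-in for FaraFunctionCall.
interface FunctionCall {
  name: string;
  arguments: Record<string, unknown>;
}

function parseThoughtsAndAction(response: string): {
  thoughts: string;
  functionCall: FunctionCall;
} {
  const open = "<tool_call>";
  const close = "</tool_call>";
  const start = response.indexOf(open);
  if (start === -1) throw new Error("response contains no <tool_call> block");

  // Tolerate a missing closing tag by reading to the end of the response.
  const end = response.indexOf(close, start);
  const body = response
    .slice(start + open.length, end === -1 ? undefined : end)
    .trim();

  const parsed: unknown = JSON.parse(body);
  if (
    typeof parsed !== "object" ||
    parsed === null ||
    typeof (parsed as { name?: unknown }).name !== "string"
  ) {
    throw new Error("tool call is missing a string name");
  }
  const call = parsed as { name: string; arguments?: Record<string, unknown> };

  return {
    // Text before the tag is treated as the model's reasoning.
    thoughts: response.slice(0, start).trim(),
    functionCall: { name: call.name, arguments: call.arguments ?? {} },
  };
}
```

Locating the tags with `indexOf` instead of splitting on an exact `"<tool_call>\n"` string makes the sketch tolerant of responses that omit the newline after the opening tag.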

Context Management

LLM Providers operate within a browser context that manages multiple pages and execution environments:

pages(): Page[] {
  const rows: Array<{ tid: TargetId; page: Page; created: number }> = [];
  for (const [tid, page] of this.pagesByTarget) {
    if (this.typeByTarget.get(tid) === "page") {
      rows.push({ tid, page, created: this.createdAtByTarget.get(tid) ?? 0 });
    }
  }
  rows.sort((a, b) => a.created - b.created);
  return rows.map((r) => r.page);
}

Sources: packages/core/lib/v3/understudy/context.ts:80-95

Security Considerations

Cookie parameters are validated before they are applied to the browser context:

export function normalizeCookieParams(cookies: CookieParam[]): CookieParam[] {
  return cookies.map((c) => {
    // A cookie must be addressable either by url or by a domain/path pair.
    if (!c.url && !(c.domain && c.path)) {
      throw new CookieValidationError(
        `Cookie "${c.name}" must have a url or a domain/path pair`,
      );
    }
    // Validates secure flag for sameSite: "None" (elided in this excerpt)
    return c;
  });
}

Sources: packages/core/lib/v3/understudy/cookies.ts:50-75
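The excerpt elides the sameSite check, so the following self-contained sketch fills it in under an assumption: a cookie with sameSite "None" that is not marked secure is rejected, which matches how browsers treat such cookies. The `CookieParam` fields and the error class shown here are simplified stand-ins rather than the library's definitions.

```typescript
// Simplified stand-ins for the library's types.
interface CookieParam {
  name: string;
  value: string;
  url?: string;
  domain?: string;
  path?: string;
  secure?: boolean;
  sameSite?: "Strict" | "Lax" | "None";
}

class CookieValidationError extends Error {}

function normalizeCookieParams(cookies: CookieParam[]): CookieParam[] {
  return cookies.map((c) => {
    // A cookie must be addressable either by url or by a domain/path pair.
    if (!c.url && !(c.domain && c.path)) {
      throw new CookieValidationError(
        `Cookie "${c.name}" must have a url or a domain/path pair`,
      );
    }
    // Assumed rule: sameSite "None" is only accepted together with secure.
    if (c.sameSite === "None" && !c.secure) {
      throw new CookieValidationError(
        `Cookie "${c.name}" uses sameSite "None" and must also be secure`,
      );
    }
    // Return a copy so callers cannot mutate the validated input later.
    return { ...c };
  });
}
```

Failing fast here gives a readable error message instead of a cookie that the browser silently drops.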

Configuration Options

System Prompt Variables

| Variable | Type | Description |
| --- | --- | --- |
| executionInstruction | string | User's task description |
| url | string | Starting page URL |
| isoDate | string | Current ISO date |
| localeDate | string | Localized date string |
| variables | object | Key-value pairs for substitution |
| captchasAutoSolve | boolean | Enable CAPTCHA auto-solving |

Strategy Components

Common Strategy Items

The system prompt includes standardized guidance for all providers:

const commonStrategyItems = `
  <item>CRITICAL: Use extract ONLY when the task explicitly requires structured data output.</item>
  <item>Keep actions atomic and verify outcomes before proceeding.</item>
  <item>For each action, provide clear reasoning about why you're taking that step.</item>
  <item>When you need to input text that could be entered character-by-character or through multiple separate inputs, prefer using the keys tool.</item>
`;

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:130-145

Frame and Context Handling

Execution Context Management

Providers handle frame navigation and context switching:

await parentSession
  .send("Page.setLifecycleEventsEnabled", { enabled: true })
  .catch(() => {});
await parentSession.send("Runtime.enable").catch(() => {});

Events monitored include:

  • DOMContentLoaded
  • load
  • networkIdle
  • networkidle

Sources: packages/core/lib/v3/understudy/frameLocator.ts:30-45

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:1-50

Agent System

Related topics: LLM Providers, Core Actions

The Agent System is the core intelligence layer of Stagehand: it coordinates AI-driven decision making with browser operations.

Overview

The Agent System serves as the orchestration layer between Large Language Models (LLMs) and browser automation. It provides:

  • AI-driven navigation: Autonomous page understanding and navigation
  • Action execution: Intelligent element detection and interaction
  • Multi-modal page analysis: Combining visual screenshots with accessibility tree data
  • Self-healing capabilities: Automatic recovery from session failures
  • Variable substitution: Secure handling of sensitive data like passwords

Sources: packages/core/README.md

Architecture

graph TB
    subgraph "Agent System"
        A[User Instruction] --> B[AgentClient]
        B --> C[System Prompt Generator]
        C --> D[LLM Provider]
        D --> E[Action Planner]
    end
    
    subgraph "Browser Layer"
        E --> F[Act Tool]
        E --> G[Screenshot Tool]
        E --> H[AriaTree Tool]
        E --> I[Extract Tool]
    end
    
    subgraph "Execution"
        F --> J[Stagehand Context]
        G --> J
        H --> J
        I --> J
        J --> K[Browser Page]
    end

Core Components

AgentClient

The AgentClient is the main entry point for agent-based browser automation. It handles:

  • Initialization of AI providers (Anthropic, Google)
  • System prompt construction with context-aware instructions
  • Action execution loop with retry logic
  • Session state management

Sources: packages/core/README.md

AI Provider Clients

Stagehand supports multiple AI providers through specialized client implementations:

| Provider | Client Class | Model Type |
| --- | --- | --- |
| Anthropic | AnthropicCUAClient | Claude Computer Use |
| Google | GoogleCUAClient | Gemini Computer Use |

Each client implements a unified interface for:

  • Action inference from LLM responses
  • Tool call execution
  • Error handling and recovery

Tool System

The Agent System exposes several tools for browser interaction:

| Tool | Purpose | Primary Use Case |
| --- | --- | --- |
| act | Execute browser actions | Clicking, typing, scrolling |
| screenshot | Capture visual page state | Visual confirmation |
| ariaTree | Extract accessibility tree | Page structure analysis |
| extract | Structured data extraction | Data collection tasks |

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:1-50

System Prompt Architecture

The system prompt is dynamically generated based on configuration and execution mode. It uses XML-like tags to structure instructions for the LLM.

graph TD
    A[Base Prompt Template] --> B[Identity Section]
    A --> C[Task Section]
    A --> D[Page Section]
    A --> E[Mindset Section]
    A --> F[Guidelines Section]
    A --> G[Tools Section]
    A --> H[Strategy Section]
    A --> I[Roadblocks Section]
    A --> J[Variables Section]
    
    B --> K[Generated System Prompt]
    C --> K
    D --> K
    E --> K
    F --> K
    G --> K
    H --> K
    I --> K
    J --> K

Key Prompt Sections

#### Page Understanding Protocol

The agent uses different page understanding strategies based on the execution mode:

Hybrid Mode (enabled via isHybridMode):

  • Primary tool: screenshot for visual confirmation
  • Secondary tool: ariaTree for complete page context

Standard Mode:

  • Primary tool: ariaTree for accessibility tree
  • Secondary tool: screenshot for visual confirmation

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:100-130

#### Strategy Guidelines

The system prompt includes critical guidelines:

  • Use extract ONLY when structured data output is explicitly required
  • Keep actions atomic and verify outcomes before proceeding
  • Use keys tool for text input that requires character-by-character entry
  • Prefer act for direct element interaction when available in ariaTree

#### Variables Handling

Sensitive data can be passed securely using variable substitution:

Variable Syntax: %variableName%

Supported tools for variable substitution:

  • Hybrid mode: the value, text, or action fields of the type, fillFormVision, and act tools
  • Standard mode: the action fields of the act and fillForm tools

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:60-90

Execution Modes

Local Mode

Runs Chrome directly on the local machine. Best for:

  • Local debugging and development
  • Fast iteration cycles
  • Access to local network resources

Remote Mode

Runs a Browserbase session in the cloud. Best for:

  • Anti-bot hardening
  • Cloud deployments
  • Scalable agent execution

Sources: packages/cli/README.md

Context Management

The StagehandContext class manages the underlying browser state:

// Page retrieval - returns pages oldest to newest
pages(): Page[]

// Init script management
applyInitScriptsToPage(page: Page, opts?: { seedOnly?: boolean }): Promise<void>

// Session error tracking
handleSessionErrors(): void

Page targets are filtered to exclude OOPIF (Out-of-Process iFrames) targets, ensuring only top-level pages are returned.

Sources: packages/core/lib/v3/understudy/context.ts:1-80

Frame Handling

The Agent System handles multi-frame page scenarios through the frameLocator module:

  1. Lifecycle event monitoring: Waits for DOMContentLoaded, load, or networkIdle events
  2. Frame context synchronization: Ensures main world execution context is available on parent frame
  3. Session ownership transfer: Handles frame ownership changes during page navigation

sequenceDiagram
    participant Parent as Parent Frame
    participant Child as Child Frame
    participant Page as Page Object
    
    Parent->>Child: Set Lifecycle Events Enabled
    Child->>Parent: Lifecycle Event (DOMContentLoaded)
    Parent->>Page: Get Session For Frame
    Page-->>Parent: Session Owner
    Parent->>Child: Wait For Main World
    Child-->>Parent: Execution Context Ready

Sources: packages/core/lib/v3/understudy/frameLocator.ts:1-60

CAPTCHA Handling

When captchasAutoSolve is enabled, the system prompt includes a roadblocks section:

<roadblocks>
  <note>{CAPTCHA_SYSTEM_PROMPT_NOTE}</note>
</roadblocks>

This informs the agent about automatic CAPTCHA resolution capabilities.

Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:95-100

Session Recovery

The browse CLI implements automatic session recovery:

  1. Detects daemon or Chrome crashes
  2. Cleans up stale processes and files
  3. Restarts the daemon automatically
  4. Retries the failed command

Agents don't need explicit error handling for session failures.

Sources: packages/cli/README.md
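The four recovery steps can be sketched as a small wrapper around a command. This is an illustration only: the `RecoveryHooks` interface and `runWithRecovery` function are hypothetical, the real CLI's crash detection and daemon control are not shown in the source, and the sketch is synchronous for brevity even though a real implementation would be asynchronous.

```typescript
// Hypothetical hooks standing in for the CLI's crash detection and daemon control.
interface RecoveryHooks {
  isCrash(error: unknown): boolean; // step 1: detect daemon or Chrome crashes
  cleanup(): void;                  // step 2: remove stale processes and files
  restartDaemon(): void;            // step 3: start a fresh daemon
}

// Runs a command and, if it fails because of a crash, recovers and retries it.
function runWithRecovery<T>(
  command: () => T,
  hooks: RecoveryHooks,
  maxRetries = 1,
): T {
  for (let attempt = 0; ; attempt++) {
    try {
      return command();
    } catch (error) {
      // Give up on ordinary errors or once the retry budget is spent.
      if (attempt >= maxRetries || !hooks.isCrash(error)) throw error;
      hooks.cleanup();
      hooks.restartDaemon();
      // step 4: the loop retries the failed command
    }
  }
}
```

Keeping the retry budget small avoids restart loops when the failure is not actually transient.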

CLI Integration

The Agent System is accessible via the browse CLI:

browse open <url>              # Navigate with auto-start
browse click <ref>            # Click by accessibility ref
browse type <text>            # Type text input
browse snapshot [-c|--compact] # Get accessibility tree
browse screenshot [path]       # Capture visual state
browse extract <schema>        # Structured data extraction

Network Capture

HTTP requests can be captured for debugging:

browse network on   # Start capturing
browse network off  # Stop capturing
browse network path # Get capture directory

Captured requests are saved as:

/tmp/browse-default-network/
  001-GET-api.github.com-repos/
    request.json
    response.json

Sources: packages/cli/CHANGELOG.md, packages/cli/README.md
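The capture directory name in the example appears to combine a sequence number, the HTTP method, the host, and the first path segment. The sketch below encodes that reading. The rule is inferred from the single example above and may not match the CLI's actual naming logic, and `captureDirName` is a hypothetical function.

```typescript
// Assumed naming rule inferred from one example:
// zero-padded sequence, HTTP method, host, then the first path segment.
function captureDirName(sequence: number, method: string, url: string): string {
  // Split an absolute URL into host and path without relying on a URL parser.
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^/?#]+)([^?#]*)/i.exec(url);
  if (!match) throw new Error(`not an absolute URL: ${url}`);

  const host = match[1] ?? "";
  const path = match[2] ?? "";
  const firstSegment = path.split("/").filter((s) => s.length > 0)[0];

  const parts = [String(sequence).padStart(3, "0"), method.toUpperCase(), host];
  if (firstSegment) parts.push(firstSegment);
  return parts.join("-");
}
```

Under this assumption, the first captured request to `https://api.github.com/repos/browserbase/stagehand` would be stored in a directory named `001-GET-api.github.com-repos`.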

Configuration Options

| Option | Type | Description |
| --- | --- | --- |
| captchasAutoSolve | boolean | Enable automatic CAPTCHA solving |
| systemInstructions | string | Custom instructions for the agent |
| variables | Record | Key-value pairs for secure substitution |
| isHybridMode | boolean | Enable screenshot-first page understanding |

Data Flow

graph LR
    A[User Instruction] --> B[AgentClient]
    B --> C[Prompt Generator]
    C --> D[LLM Inference]
    D --> E[Action Decision]
    E --> F[Tool Execution]
    F --> G[Browser Response]
    G --> H[Context Update]
    H --> B
    
    E -->|Page Understanding| I[screenshot]
    E -->|Page Understanding| J[ariaTree]
    I --> G
    J --> G

Summary

The Agent System provides a robust abstraction layer for AI-driven browser automation. Key characteristics:

  • Multi-provider support: Works with Anthropic Claude and Google Gemini computer use models
  • Adaptive page understanding: Automatically selects optimal page analysis strategies
  • Secure variable handling: Supports sensitive data substitution without exposure
  • Self-healing sessions: Automatically recovers from browser failures
  • Flexible deployment: Supports both local and cloud-based execution environments

This architecture enables developers to build reliable browser automation workflows that combine the flexibility of natural language instructions with the precision of programmatic control.

Sources: packages/core/README.md

MCP Integration

Related topics: Agent System

Note: The MCP (Model Context Protocol) integration source files were not present in the repository context available for this page. The page is based on the overall Stagehand architecture and CLI capabilities documented in the provided source files.

Overview

The MCP (Model Context Protocol) integration in Stagehand enables the browser automation framework to communicate with external MCP servers, allowing AI agents to interact with specialized tools and services beyond built-in browser automation capabilities.

Architecture

MCP integration follows a client-server architecture where Stagehand acts as an MCP client that connects to external MCP servers providing additional functionality.

graph TD
    A[Stagehand Agent] --> B[MCP Connection Manager]
    B --> C[MCP Server 1]
    B --> D[MCP Server 2]
    B --> E[MCP Server N]
    C --> F[External Tools/Services]
    D --> G[External Tools/Services]
    E --> H[External Tools/Services]

Connection Management

The MCP connection module (connection.ts) handles the lifecycle of MCP server connections:

| Method | Purpose |
| --- | --- |
| connect() | Establish connection to an MCP server |
| disconnect() | Close connection and cleanup resources |
| sendRequest() | Send request to connected MCP server |
| receiveResponse() | Handle responses from MCP server |

Utility Functions

The MCP utilities module (utils.ts) provides helper functions for:

  • Message formatting and parsing
  • Error handling and retry logic
  • Connection state management
  • Tool call serialization/deserialization

Usage Example

Basic integration pattern (from packages/core/examples/mcp.ts):

import { Stagehand } from "@browserbasehq/stagehand";

// Initialize Stagehand with MCP configuration
const stagehand = new Stagehand({
  mcpServers: [
    {
      name: "my-mcp-server",
      command: "npx",
      args: ["-y", "@my-org/mcp-server"],
    },
  ],
});

await stagehand.init();

// Use Stagehand with MCP tools available
await stagehand.page.goto("https://example.com");

Configuration Options

| Option | Type | Description |
| --- | --- | --- |
| name | string | Display name for the MCP server |
| command | string | Executable to run the MCP server |
| args | string[] | Arguments passed to the MCP server command |
| env | Record<string, string> | Environment variables for the server |
| timeout | number | Connection timeout in milliseconds |

Related components:

  • CLI Daemon (packages/cli): Handles MCP server spawning and lifecycle
  • Shutdown Supervisor (supervisor.ts): Manages cleanup of MCP connections on exit
  • Agent System Prompt (agentSystemPrompt.ts): Provides instructions for using MCP tools

References

Note: For complete MCP implementation details, refer to the source files listed at the top of this page once they are available in the repository context.


Server API

Related topics: Architecture Overview, CLI Tools

The Stagehand Server API provides a headless browser automation service designed to power AI agents and programmatic browser control. It exposes HTTP endpoints for launching, controlling, and managing browser sessions, serving as the backend infrastructure for the browse CLI command and SDK integrations.

Overview

The Server API is a RESTful service built in TypeScript that manages browser sessions using Playwright. It supports both local and remote execution modes, with the remote mode utilizing Browserbase's cloud infrastructure for scalable browser automation. The API handles session lifecycle management, CDP (Chrome DevTools Protocol) proxying, and persistent browser state across requests.

graph TD
    subgraph "Client Layer"
        CLI[CLI: browse]
        SDK[SDK Client]
    end
    
    subgraph "Server Layer"
        V3[Server v3]
        V4[Server v4]
    end
    
    subgraph "Session Management"
        SS[SessionStore]
        DB[(Database)]
    end
    
    subgraph "Browser Layer"
        Local[Local Playwright]
        Remote[Browserbase Cloud]
    end
    
    CLI --> V3
    CLI --> V4
    SDK --> V4
    V3 --> SS
    V4 --> SS
    SS --> DB
    V3 --> Local
    V3 --> Remote
    V4 --> Local
    V4 --> Remote

Architecture

Server Versions

| Version | Package | Description |
| --- | --- | --- |
| v3 | packages/server-v3 | Legacy server implementation with SessionStore |
| v4 | packages/server-v4 | Current production server with routing and database integration |

Component Structure

The server infrastructure consists of three primary layers:

  1. HTTP Server Layer (server.ts): Handles incoming requests, manages WebSocket upgrades for CDP connections, and orchestrates session routing
  2. Session Management Layer (SessionStore.ts): Maintains in-memory and persisted session state, handles cleanup and timeout management
  3. Browser Execution Layer: Interfaces with Playwright locally or Browserbase remotely for actual browser automation

Request Flow

sequenceDiagram
    participant Client
    participant Server
    participant SessionStore
    participant Browser
    
    Client->>Server: POST /browsersession (create)
    Server->>SessionStore: allocate session
    SessionStore->>Browser: launch/attach
    Browser-->>SessionStore: session ready
    SessionStore-->>Server: session handle
    Server-->>Client: session ID + CDP endpoint
    
    Client->>Server: WS /browsersession/:id/cdp
    Server->>Browser: proxy CDP messages
    Browser-->>Server: CDP responses
    Server-->>Client: WS stream

Session Management

SessionStore

The SessionStore class manages browser session lifecycle:

| Method | Purpose |
| --- | --- |
| create() | Initialize new browser session |
| get(id) | Retrieve session by ID |
| list() | List all active sessions |
| delete(id) | Terminate and cleanup session |
| cleanup() | Remove expired/stale sessions |

Sessions store metadata including:

  • Session identifier
  • Creation timestamp
  • Last activity timestamp
  • Browser type and configuration
  • CDP WebSocket URL
  • Associated project/API key

Session Lifecycle

Sessions follow this state machine:

stateDiagram-v2
    [*] --> Created: allocate()
    Created --> Launching: spawn browser
    Launching --> Ready: browser connected
    Ready --> Active: first CDP command
    Active --> Idle: no activity timeout
    Idle --> Active: new CDP command
    Active --> Closing: close() or timeout
    Closing --> [*]: cleanup complete
    
    Launching --> Error: spawn failed
    Error --> [*]: cleanup
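
The state machine above can be expressed as a transition table. The sketch below mirrors the diagram's edges exactly; the event names and the `Closed` state (standing in for the diagram's terminal `[*]`) are assumptions introduced for illustration.

```typescript
type SessionState =
  | "Created" | "Launching" | "Ready" | "Active"
  | "Idle" | "Closing" | "Error" | "Closed";

// Hypothetical event names, one per labelled edge in the diagram.
type SessionEvent =
  | "spawn" | "connected" | "spawnFailed" | "command"
  | "idleTimeout" | "close" | "cleanupDone";

const transitions: Record<SessionState, Partial<Record<SessionEvent, SessionState>>> = {
  Created: { spawn: "Launching" },
  Launching: { connected: "Ready", spawnFailed: "Error" },
  Ready: { command: "Active" },
  Active: { idleTimeout: "Idle", close: "Closing" },
  Idle: { command: "Active" },
  Closing: { cleanupDone: "Closed" },
  Error: { cleanupDone: "Closed" },
  Closed: {},
};

// Return the next state, or throw if the diagram has no such edge.
function nextState(state: SessionState, event: SessionEvent): SessionState {
  const target = transitions[state][event];
  if (target === undefined) {
    throw new Error(`Invalid transition: ${event} from ${state}`);
  }
  return target;
}

// Replay a sequence of events starting from Created.
function replay(events: SessionEvent[]): SessionState {
  return events.reduce<SessionState>((state, event) => nextState(state, event), "Created");
}
```

Note that, as drawn, the diagram has no direct edge from Idle to Closing, so this sketch rejects `close` while idle; whether the real server allows it is not stated here.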

API Endpoints

Browser Session Routes

The primary routing module at packages/server-v4/src/routes/v4/browsersession/routes.ts defines the session REST API:

| Endpoint | Method | Description |
| --- | --- | --- |
| /browsersession | POST | Create new browser session |
| /browsersession/:id | GET | Get session details |
| /browsersession/:id | DELETE | Close session |
| /browsersession/:id/screenshot | GET | Capture screenshot |
| /browsersession/:id/snapshot | GET | Get accessibility tree |

Query Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| headless | boolean | Run without visible window |
| viewport.width | number | Browser viewport width |
| viewport.height | number | Browser viewport height |
| contextId | string | Browserbase Context ID to resume |
| persist | boolean | Persist session across disconnects |
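
As an illustration of how the parameters above might be serialized, the following sketch builds a query string from a typed options object. The `CreateSessionOptions` type and `toQueryString` function are hypothetical, and the assumption that the server accepts dotted names such as `viewport.width` as flat keys is taken from the table rather than confirmed against the route code.

```typescript
// Hypothetical options type mirroring the parameter table above.
interface CreateSessionOptions {
  headless?: boolean;
  viewport?: { width: number; height: number };
  contextId?: string;
  persist?: boolean;
}

// Serialize only the options that were provided, in table order.
function toQueryString(options: CreateSessionOptions): string {
  const pairs: Array<[string, string]> = [];
  if (options.headless !== undefined) {
    pairs.push(["headless", String(options.headless)]);
  }
  if (options.viewport !== undefined) {
    pairs.push(["viewport.width", String(options.viewport.width)]);
    pairs.push(["viewport.height", String(options.viewport.height)]);
  }
  if (options.contextId !== undefined) {
    pairs.push(["contextId", options.contextId]);
  }
  if (options.persist !== undefined) {
    pairs.push(["persist", String(options.persist)]);
  }
  return pairs
    .map(([key, value]) => encodeURIComponent(key) + "=" + encodeURIComponent(value))
    .join("&");
}
```

Omitting unset options, rather than sending empty values, leaves the server free to apply its own defaults.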

Database Integration

The database client at packages/server-v4/src/db/client.ts provides the persistence layer:

  • Session persistence: Store session metadata for recovery
  • Audit logging: Track session creation, commands, and errors
  • Usage metrics: Monitor API usage per project/key

Configuration

Environment Variables

| Variable | Description |
| --- | --- |
| BROWSERBASE_API_KEY | Browserbase API key for remote execution |
| BROWSERBASE_PROJECT_ID | Browserbase project identifier |
| DATABASE_URL | PostgreSQL connection string for session storage |
| PORT | HTTP server port (default: 3000) |

Execution Modes

| Mode | Trigger | Browser Location |
| --- | --- | --- |
| local | Default (no API key) | Local Playwright installation |
| remote | BROWSERBASE_API_KEY set | Browserbase cloud browsers |
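
The configuration rules above (mode follows the presence of the API key, port defaults to 3000) can be sketched as a small loader. The `ServerEnv` and `ServerConfig` types and the `loadConfig` function are illustrative assumptions; treating an empty string as "not set" is also an assumption.

```typescript
type ExecutionMode = "local" | "remote";

// Subset of the environment described in the tables above.
interface ServerEnv {
  BROWSERBASE_API_KEY?: string;
  BROWSERBASE_PROJECT_ID?: string;
  DATABASE_URL?: string;
  PORT?: string;
}

interface ServerConfig {
  mode: ExecutionMode;
  port: number;
  databaseUrl: string | undefined;
}

function loadConfig(env: ServerEnv): ServerConfig {
  // Remote mode is triggered by a non-empty API key; otherwise run locally.
  const apiKey = env.BROWSERBASE_API_KEY;
  const mode: ExecutionMode = apiKey !== undefined && apiKey !== "" ? "remote" : "local";

  // PORT falls back to the documented default of 3000.
  const port = env.PORT !== undefined && env.PORT !== "" ? Number(env.PORT) : 3000;
  const isValidPort = port > 0 && port <= 65535 && Math.floor(port) === port;
  if (!isValidPort) {
    throw new Error(`Invalid PORT value: ${env.PORT}`);
  }

  return { mode, port, databaseUrl: env.DATABASE_URL };
}
```

Taking the environment as a parameter instead of reading it globally keeps the function easy to test; a caller would typically pass the process environment in.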

CDP Proxying

The server acts as a WebSocket proxy for Chrome DevTools Protocol commands:

  1. Client establishes WebSocket connection to /browsersession/:id/cdp
  2. Server forwards messages to the underlying browser session
  3. Responses and events stream back to client
  4. Connection persists until client disconnects or session expires
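
The four proxy steps above amount to a bidirectional relay. The sketch below shows that relay against a minimal, made-up endpoint interface; `RelayEndpoint`, `relay`, and `FakeEndpoint` are assumptions for illustration and do not correspond to the API of any WebSocket library or to the server's actual implementation.

```typescript
// Minimal stand-in for a WebSocket-like connection (hypothetical interface).
interface RelayEndpoint {
  send(message: string): void;
  onMessage(handler: (message: string) => void): void;
  onClose(handler: () => void): void;
  close(): void;
}

// Wire two endpoints together so traffic flows both ways.
function relay(client: RelayEndpoint, browser: RelayEndpoint): void {
  client.onMessage((message) => browser.send(message)); // step 2: forward commands
  browser.onMessage((message) => client.send(message)); // step 3: stream responses and events
  client.onClose(() => browser.close()); // step 4: tear down when the client leaves
  browser.onClose(() => client.close()); // ...or when the session ends
}

// In-memory endpoint used to exercise the relay without a network.
class FakeEndpoint implements RelayEndpoint {
  sent: string[] = [];
  closed = false;
  private messageHandlers: Array<(message: string) => void> = [];
  private closeHandlers: Array<() => void> = [];

  send(message: string): void {
    this.sent.push(message);
  }
  onMessage(handler: (message: string) => void): void {
    this.messageHandlers.push(handler);
  }
  onClose(handler: () => void): void {
    this.closeHandlers.push(handler);
  }
  close(): void {
    this.closed = true;
  }

  // Simulate an incoming message from the remote side.
  emit(message: string): void {
    this.messageHandlers.forEach((handler) => handler(message));
  }
  // Simulate the remote side disconnecting.
  emitClose(): void {
    this.closeHandlers.forEach((handler) => handler());
  }
}
```

The relay forwards messages unmodified, which matches a pure proxy; a real server would also need the session authentication described under Security Considerations before wiring the two sides together.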

CLI Integration

The browse CLI commands connect to the server:

# Local execution (daemon auto-starts)
browse open https://example.com

# Remote execution via Browserbase
browse env remote
browse open https://example.com

# Specify server endpoint
browse --ws ws://localhost:3000 open https://example.com

The daemon architecture provides:

  • Automatic server startup/shutdown
  • Session persistence across CLI invocations
  • Graceful recovery on connection failure

Security Considerations

  • Session tokens are generated securely and scoped to individual sessions
  • CDP endpoints require valid session authentication
  • Remote execution uses Browserbase's authentication infrastructure
  • Sessions auto-expire after configurable inactivity timeout

References

Source: https://github.com/browserbase/stagehand / Human Manual

CLI Tools

Related topics: Server API, CDP Engine

The Stagehand CLI (browse command) is a command-line interface for browser automation designed specifically for AI agents. It provides a text-based interface for controlling web browsers through CDP (Chrome DevTools Protocol), enabling programmatic browser control without requiring a graphical interface.

Overview

The CLI serves as the foundational tool layer for the Stagehand browser automation framework. It abstracts CDP operations into human-readable commands, making it accessible for shell scripts, AI agent integrations, and development workflows.

The CLI operates in two modes:

| Mode | Description | Default Condition |
| --- | --- | --- |
| local | Runs Chrome locally via bundled Playwright | When no BROWSERBASE_API_KEY is set |
| remote | Connects to Browserbase cloud infrastructure | When BROWSERBASE_API_KEY is set |

Sources: packages/cli/README.md

Architecture

graph TD
    A[User / AI Agent] --> B[browse CLI]
    B --> C{browse env mode}
    C -->|local| D[Local Chrome CDP]
    C -->|remote| E[Browserbase Cloud]
    D --> F[Local Strategy]
    E --> G[Remote Strategy]
    F --> H[Chrome Instance]
    G --> I[Browserbase Infrastructure]
    H --> J[CDP WebSocket]
    I --> J

Daemon Architecture

The CLI uses a persistent daemon process for efficiency:

  1. First invocation starts the daemon in the background
  2. Subsequent commands reuse the existing daemon
  3. The daemon manages Chrome lifecycle and CDP connections

Recovery behavior:

  1. Detects stale daemon
  2. Cleans up old files
  3. Restarts the daemon
  4. Retries the command
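
The recovery steps above follow a detect, clean up, restart, retry pattern. A minimal synchronous sketch of that pattern is shown below; the `DaemonControl` interface and `runWithRecovery` function are hypothetical, and the real CLI is likely asynchronous and may retry differently.

```typescript
// Hypothetical control surface for the background daemon.
interface DaemonControl {
  isStale(): boolean; // step 1: detect a stale daemon
  cleanUp(): void; // step 2: remove old files
  restart(): void; // step 3: start a fresh daemon
}

// Run a command; if it fails because the daemon is stale, recover and retry once.
function runWithRecovery<T>(daemon: DaemonControl, command: () => T): T {
  try {
    return command();
  } catch (error) {
    if (!daemon.isStale()) {
      throw error; // unrelated failure: do not mask it
    }
    daemon.cleanUp();
    daemon.restart();
    return command(); // step 4: retry the command
  }
}
```

Retrying only once avoids an endless restart loop when the daemon cannot be brought back; a second failure propagates to the caller.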

Sources: packages/cli/README.md

Navigation Commands

| Command | Description |
| --- | --- |
| browse open <url> | Navigate to a URL |
| browse reload | Reload current page |
| browse back | Navigate back in history |
| browse forward | Navigate forward in history |

Open Command Options

| Option | Description | Default |
| --- | --- | --- |
| --wait <state> | Wait for load state | load |
| -t, --timeout <ms> | Page load timeout | 30000ms |
| --context-id <id> | Load Browserbase Context | |
| --persist | Persist Context across sessions | |

The --timeout flag controls how long to wait for the page load state. Use longer timeouts for slow-loading pages:

browse open https://slow-site.com --timeout 60000
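
When the CLI is driven from a script or an agent, the open options above can be assembled programmatically. The sketch below only builds the argument list; `OpenOptions` and `buildOpenArgs` are hypothetical names, and the flags used are the ones listed in the options table.

```typescript
// Hypothetical options object mirroring the Open Command Options table.
interface OpenOptions {
  wait?: string;
  timeoutMs?: number;
  contextId?: string;
  persist?: boolean;
}

// Build the argument list for `browse open`, skipping unset options.
function buildOpenArgs(url: string, options: OpenOptions = {}): string[] {
  const args = ["open", url];
  if (options.wait !== undefined) {
    args.push("--wait", options.wait);
  }
  if (options.timeoutMs !== undefined) {
    args.push("--timeout", String(options.timeoutMs));
  }
  if (options.contextId !== undefined) {
    args.push("--context-id", options.contextId);
  }
  if (options.persist) {
    args.push("--persist");
  }
  return args;
}
```

Returning an array rather than a single string means the arguments can be handed to a process-spawning API without shell quoting concerns.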

Interaction Commands

Click Actions

| Command | Description |
| --- | --- |
| browse click <ref> | Click by element ref (e.g., @0-5) |
| browse click <ref> [-b button] | Click with button (left/right/middle) |
| browse click <ref> [-c count] | Click multiple times |
| browse click_xy <x> <y> | Click at coordinates |

Coordinate Actions

| Command | Description |
| --- | --- |
| browse hover <x> <y> | Hover at coordinates |
| browse scroll <x> <y> <dx> <dy> | Scroll by (dx, dy) from position (x, y) |
| browse drag <fx> <fy> <tx> <ty> | Drag from (fx, fy) to (tx, ty) |

Keyboard Commands

| Command | Description |
| --- | --- |
| browse type <text> | Type text into focused element |
| browse type <text> [-d delay] | Type with character delay |
| browse press <key> | Press keyboard key |

Supported keys include standard keys (Enter, Tab, Escape) and modifier combinations (Cmd+A, Cmd+C, etc.).

Form Handling

| Command | Description |
| --- | --- |
| browse fill <selector> <value> | Fill form field |
| browse select <selector> <values...> | Select dropdown options |
| browse highlight <selector> | Highlight element |

The fill command automatically presses Enter after typing. Use --no-press-enter to prevent this behavior.

Page Information

| Command | Description |
| --- | --- |
| browse get url | Get current URL |
| browse get title | Get page title |
| browse get text <selector> | Get element text content |
| browse get html <selector> | Get element HTML |
| browse get value <selector> | Get form field value |
| browse get box <selector> | Get center coordinates |
| browse get markdown [selector] | Convert HTML to markdown |

The snapshot command returns the accessibility tree with element references:

browse snapshot [-c|--compact]

Output format includes refs like [0-5], [1-2]:

RootWebArea "Example" url="https://example.com"
  [0-0] link "Home"
  [0-1] link "About"
  [0-2] button "Sign In"

Sources: packages/cli/README.md

Waiting Commands

| Command | Description |
| --- | --- |
| browse wait load [state] | Wait for page load state |
| browse wait selector <selector> | Wait for element |
| browse wait timeout <ms> | Wait for duration |

Selector wait options:

| Option | Description | Default |
| --- | --- | --- |
| -t timeout | Maximum wait time | - |
| `-s visible\|hidden\|attached\|detached` | Element state | visible |

Multi-Tab Management

CommandDescription
browse pagesList all open tabs
browse newpage [url]Open new tab
browse tab_switch <n>Switch to tab by index
browse tab_close [n]Close tab (default: last)

Network Capture

Enable network request capture for debugging and inspection:

| Command | Description |
| --- | --- |
| browse network on | Start capturing requests |
| browse network off | Stop capturing |
| browse network path | Get capture directory |
| browse network clear | Clear captured requests |

Capture Directory Structure

/tmp/browse-default-network/
  001-GET-api.github.com-repos/
    request.json      # method, url, headers, body
    response.json     # status, headers, body, duration
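
Tools that post-process a capture may want to split the per-request directory name into its parts. The sketch below assumes the naming scheme is `<sequence>-<METHOD>-<host and path slug>`, which is inferred from the single example above and not confirmed; `CaptureEntry` and `parseCaptureDirName` are hypothetical names.

```typescript
interface CaptureEntry {
  sequence: number; // 1 for "001"
  method: string; // e.g. "GET"
  target: string; // remaining host/path slug, left unparsed
}

// Parse a directory name such as "001-GET-api.github.com-repos".
// Returns null when the name does not follow the assumed scheme.
function parseCaptureDirName(name: string): CaptureEntry | null {
  const match = /^(\d+)-([A-Z]+)-(.+)$/.exec(name);
  if (!match) {
    return null;
  }
  return {
    sequence: parseInt(match[1], 10),
    method: match[2],
    target: match[3],
  };
}
```

The host and path are deliberately kept together in `target`, because the example alone does not show how the two are delimited; `request.json` holds the exact URL.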

Session Management

| Command | Description |
| --- | --- |
| browse start | Start daemon |
| browse stop | Stop daemon |
| browse status | Check daemon status |
| browse env <target> | Set environment mode |

Session Options

| Option | Description |
| --- | --- |
| --session <name> | Session name (default: "default") |
| --headless | Run Chrome headless |
| --headed | Run Chrome with visible window |
| `--ws <url\|port>` | One-shot CDP connection |

Environment Modes

# Start with specific mode
browse env local
browse env remote

# Attach to running Chrome
browse env local <port|url>

# Persist override and restart daemon
browse env <target> --session <name>

# Clear override (fallback to env-var detection)
browse stop

Auto-detection priority:

  1. Check browse env persist setting
  2. Check BROWSE_SESSION environment variable
  3. Default: remote if BROWSERBASE_API_KEY is set, else local

Element References

After running browse snapshot, elements can be referenced by their ref ID:

# Get snapshot with refs
browse snapshot -c

# Click using ref (multiple formats)
browse click @0-2       # @ prefix
browse click 0-2        # Plain ref
browse click --ref 0-2 # --ref flag
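
Since refs are accepted with or without the `@` prefix, code that stores or compares them may want one canonical form. The following sketch normalizes both spellings; `normalizeRef` is a hypothetical helper, and the assumption that a ref is always two dash-separated integers comes from the examples in this section.

```typescript
// Convert "@0-2" or "0-2" to the plain form "0-2"; reject anything else.
function normalizeRef(input: string): string {
  const ref = input.charAt(0) === "@" ? input.slice(1) : input;
  if (!/^\d+-\d+$/.test(ref)) {
    throw new Error(`Not an element ref: ${input}`);
  }
  return ref;
}
```

Validating early gives a clearer error than passing a malformed ref through to the CLI.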

Global Options

| Option | Description |
| --- | --- |
| --session <name> | Session name for multiple browsers |
| --headless | Run Chrome in headless mode |
| --headed | Run Chrome with visible window |
| `--ws <url\|port>` | One-shot CDP connection (bypasses daemon) |
| --json | Output as JSON |

Environment Variables

| Variable | Description |
| --- | --- |
| BROWSE_SESSION | Default session name |
| BROWSERBASE_API_KEY | Browserbase API key (required for remote mode) |

Output Formats

The CLI supports JSON output for programmatic consumption:

browse get url --json
browse snapshot --json

This is particularly useful for AI agent integrations that need structured data.

Sources: [packages/cli/README.md](https://github.com/browserbase/stagehand/blob/main/packages/cli/README.md)

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Doramagic extracted 7 source-linked risk signals. Review them before installing or handing real data to the project.

1. Project risk: Project risk needs validation

  • Severity: medium
  • Finding: Project risk is backed by a source signal: Project risk needs validation. Treat it as a review item until the current version is checked.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: identity.distribution | github_repo:776908852 | https://github.com/browserbase/stagehand | repo=stagehand; install=create-browser-app

2. Capability assumption: README/documentation is current enough for a first validation pass.

  • Severity: medium
  • Finding: README/documentation is current enough for a first validation pass.
  • User impact: The project should not be treated as fully validated until this signal is reviewed.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: capability.assumptions | github_repo:776908852 | https://github.com/browserbase/stagehand | README/documentation is current enough for a first validation pass.

3. Maintenance risk: Maintainer activity is unknown

  • Severity: medium
  • Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
  • User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: evidence.maintainer_signals | github_repo:776908852 | https://github.com/browserbase/stagehand | last_activity_observed missing

4. Security or permission risk: no_demo

  • Severity: medium
  • Finding: no_demo
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: downstream_validation.risk_items | github_repo:776908852 | https://github.com/browserbase/stagehand | no_demo; severity=medium

5. Security or permission risk: no_demo

  • Severity: medium
  • Finding: no_demo
  • User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: risks.scoring_risks | github_repo:776908852 | https://github.com/browserbase/stagehand | no_demo; severity=medium

6. Maintenance risk: issue_or_pr_quality=unknown

  • Severity: low
  • Finding: issue_or_pr_quality=unknown.
  • User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: evidence.maintainer_signals | github_repo:776908852 | https://github.com/browserbase/stagehand | issue_or_pr_quality=unknown

7. Maintenance risk: release_recency=unknown

  • Severity: low
  • Finding: release_recency=unknown.
  • User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
  • Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
  • Evidence: evidence.maintainer_signals | github_repo:776908852 | https://github.com/browserbase/stagehand | release_recency=unknown

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 11 (count of project-level external discussion links exposed on this manual page).

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using stagehand with real data or production workflows.

Source: Project Pack community evidence and pitfall evidence