Doramagic Project Pack · Human Manual
stagehand
Project Introduction
Related topics: Architecture Overview, CDP Engine, LLM Providers
Stagehand is an AI-powered browser automation framework developed by Browserbase that enables developers to control web browsers using natural language instructions combined with precise code-based control. The framework bridges the gap between high-level AI agents and low-level browser automation tools like Selenium, Playwright, or Puppeteer.
Project Overview
Stagehand lets developers choose when to use AI capabilities and when to write explicit code. This hybrid approach provides flexibility while maintaining reliability for production environments.
Key Value Propositions
| Feature | Description |
|---|---|
| Hybrid Control | Combine AI-driven navigation with deterministic code execution |
| Self-Healing | Auto-caching and action recovery when website changes occur |
| Action Preview | Preview AI-generated actions before execution |
| Repeatable Workflows | Cache and reuse actions to save time and tokens |
| Production Ready | Built for reliable automation in production systems |
Sources: README.md:1-50
Architecture Overview
Stagehand follows a monorepo architecture using pnpm workspaces and Turborepo for efficient build management and dependency handling.
graph TD
A[stagehand Repository] --> B[packages/core]
A --> C[packages/cli]
B --> D[Agent System]
B --> E[Inference Engine]
B --> F[Browser Context Manager]
D --> G[Microsoft CUA Client]
D --> H[Action Executors]
F --> I[Playwright Integration]
F --> J[Frame Locator]
E --> K[Inference Logging]
E --> L[Cache Management]
C --> M[Daemon Controller]
C --> N[CLI Commands]
M --> F
Core Packages
| Package | Purpose | Key Files |
|---|---|---|
| packages/core | Main automation engine with AI agent capabilities | lib/v3/agent/*, lib/v3/understudy/* |
| packages/cli | Command-line interface for browser control | CHANGELOG.md, README commands |
Sources: packages/cli/CHANGELOG.md:1-15
Project Structure
stagehand/
├── packages/
│ ├── core/ # Core automation package
│ │ ├── lib/
│ │ │ ├── v3/
│ │ │ │ ├── agent/ # AI agent implementation
│ │ │ │ │ ├── MicrosoftCUAClient.ts
│ │ │ │ │ └── prompts/
│ │ │ │ │ └── agentSystemPrompt.ts
│ │ │ │ └── understudy/ # Context and frame management
│ │ │ │ ├── context.ts
│ │ │ │ └── frameLocator.ts
│ │ │ └── inferenceLogUtils.ts
│ │ └── README.md
│ └── cli/ # CLI tooling
│ ├── CHANGELOG.md
│ └── README.md
├── package.json
├── pnpm-workspace.yaml
└── turbo.json
Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:1-50
Agent System Architecture
The agent system is the brain of Stagehand, responsible for interpreting user instructions and generating appropriate browser actions.
Microsoft CUA Integration
Stagehand implements a Microsoft CUA (Computer Use Agent) client that provides structured function calling capabilities:
graph LR
A[User Instruction] --> B[MicrosoftCUAClient]
B --> C[Action Generation]
C --> D[left_click]
C --> E[scroll]
C --> F[visit_url]
C --> G[type]
C --> H[wait]
C --> I[web_search]
#### Supported Actions
| Action | Parameters | Description |
|---|---|---|
| left_click | coordinate: [x, y] | Click at specified coordinates |
| scroll | pixels: number | Scroll up (positive) or down (negative) |
| visit_url | url: string | Navigate to a URL |
| type | text, press_enter, delete_existing_text | Type text into input fields |
| wait | time: number | Wait for specified seconds |
| web_search | query: string | Perform a web search |
| history_back | - | Navigate back in browser history |
| pause_and_memorize_fact | fact: string | Store information for later use |
| terminate | status: success/failure | End the task execution |
Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:1-80
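As a rough illustration of the Supported Actions table, the actions could be modeled as a TypeScript discriminated union. The names AgentAction and describeAction are hypothetical and are not taken from MicrosoftCUAClient.ts; only the action names and parameters come from the table.

```typescript
// Hypothetical model of the listed actions; not the library's own types.
type AgentAction =
  | { name: "left_click"; coordinate: [number, number] }
  | { name: "scroll"; pixels: number }
  | { name: "visit_url"; url: string }
  | { name: "type"; text: string; press_enter?: boolean; delete_existing_text?: boolean }
  | { name: "wait"; time: number }
  | { name: "web_search"; query: string }
  | { name: "history_back" }
  | { name: "pause_and_memorize_fact"; fact: string }
  | { name: "terminate"; status: "success" | "failure" };

// Produce a one-line, human-readable summary of an action (handy for logging).
function describeAction(action: AgentAction): string {
  switch (action.name) {
    case "left_click":
      return `click at (${action.coordinate[0]}, ${action.coordinate[1]})`;
    case "scroll":
      // Per the table: positive pixels scroll up, negative pixels scroll down.
      return `scroll ${action.pixels >= 0 ? "up" : "down"} by ${Math.abs(action.pixels)}px`;
    case "visit_url":
      return `navigate to ${action.url}`;
    case "type":
      return `type "${action.text}"${action.press_enter ? " and press Enter" : ""}`;
    case "wait":
      return `wait ${action.time}s`;
    case "web_search":
      return `search the web for "${action.query}"`;
    case "history_back":
      return "go back in history";
    case "pause_and_memorize_fact":
      return `memorize: ${action.fact}`;
    case "terminate":
      return `terminate with status ${action.status}`;
  }
}
```

Because the union is keyed on name, the compiler rejects an action with missing or misspelled parameters before it reaches the browser.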
System Prompt Structure
The agent uses a structured XML-based system prompt that organizes instructions into distinct sections:
graph TD
P[System Prompt] --> A[Identity]
P --> T[Task Definition]
P --> M[Mindset Guidelines]
P --> G[Action Guidelines]
P --> N[Navigation Rules]
P --> S[Strategy]
P --> R[Roadblocks]
P --> V[Variables]
P --> C[Completion]
The prompt templates include conditional sections for:
- Page Understanding Protocol: Instructions for analyzing page state
- Search Integration: Optional search tool usage when URL confidence is low
- Variable Substitution: Support for %variableName% syntax in form interactions
- Captcha Handling: Optional roadblocks section when auto-solving is enabled
Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:50-150
Browser Context Management
Stagehand manages browser contexts through the understudy module, which handles complex scenarios like iframe interactions and multi-page workflows.
Frame Locator System
The frame locator handles cross-origin iframe communication and lifecycle events:
sequenceDiagram
Participant Parent as Parent Page
Participant Child as Child Frame
Participant Context as BrowserContext
Parent->>Context: Create Frame
Context->>Child: Register Lifecycle Events
Child-->>Parent: DOMContentLoaded
Parent->>Context: Wait for MainWorld
Context-->>Parent: Execution Context Ready
Key responsibilities include:
- Resolving the owning Page by main frame ID
- Applying initialization scripts to pages
- Managing lifecycle events (DOMContentLoaded, load, networkIdle)
- Handling OOPIF (Out-of-Process iframe) scenarios
Sources: packages/core/lib/v3/understudy/context.ts:1-50, packages/core/lib/v3/understudy/frameLocator.ts:1-40
Inference and Caching
Stagehand implements an inference logging system that enables self-healing automation by caching and reusing action results.
Summary File Structure
<inferenceType>_summary.json
Where inferenceType is act, observe, or extract, which gives the following files:
- act_summary: Action execution results
- observe_summary: Page observation results
- extract_summary: Data extraction results
The system reads and writes JSON files containing arrays of inference results, enabling:
- Action Replay: Re-execute previously successful actions
- Self-Healing: Automatically recover from website changes
- Token Optimization: Skip LLM inference when cached results are valid
Sources: packages/core/lib/inferenceLogUtils.ts:1-60
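The read, look-up, and append cycle described above can be sketched in memory. This is a minimal illustration, not the code in inferenceLogUtils.ts: the SummaryEntry shape, the lookup by instruction text, and all function names are assumptions made for the example.

```typescript
// Hypothetical shapes; the real entry format may differ.
type InferenceType = "act" | "observe" | "extract";

interface SummaryEntry {
  instruction: string; // natural-language instruction that produced the result
  result: unknown;     // whatever the inference returned
}

// File name for a given inference type, following the <inferenceType>_summary.json pattern.
function summaryFileName(kind: InferenceType): string {
  return `${kind}_summary.json`;
}

// Parse the JSON array stored in a summary file; empty text is treated as an empty list.
function parseSummary(json: string): SummaryEntry[] {
  if (json.trim() === "") return [];
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) throw new Error("summary file must contain a JSON array");
  return parsed as SummaryEntry[];
}

// Return the cached entry for an instruction, or undefined on a cache miss.
function findCached(entries: SummaryEntry[], instruction: string): SummaryEntry | undefined {
  return entries.find((e) => e.instruction === instruction);
}

// Append an entry and serialize the whole array back to JSON text.
function appendEntry(entries: SummaryEntry[], entry: SummaryEntry): string {
  return JSON.stringify([...entries, entry], null, 2);
}
```

A caller would check findCached first and only fall back to LLM inference on a miss, which is where the token savings come from.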
CLI Architecture
The Stagehand CLI provides a command-line interface for browser automation with support for both local and remote browser execution.
Execution Modes
| Mode | Detection | Description |
|---|---|---|
| remote | BROWSERBASE_API_KEY is set | Uses Browserbase cloud infrastructure |
| local | Default fallback | Uses local Playwright browser |
The daemon-based architecture ensures:
- Persistent browser sessions
- Automatic recovery on failures
- Session state preservation across commands
Sources: packages/cli/README.md:1-100
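The detection rule in the Execution Modes table can be sketched as a small pure function. The function name and the choice to treat a blank value as unset are assumptions of this sketch, not documented CLI behavior.

```typescript
type ExecutionMode = "remote" | "local";

// The environment is passed in explicitly so the function is easy to test;
// a caller would typically pass process.env.
function detectExecutionMode(env: Record<string, string | undefined>): ExecutionMode {
  const key = env["BROWSERBASE_API_KEY"];
  // Treating an empty or whitespace-only value as "not set" is an assumption.
  return key !== undefined && key.trim() !== "" ? "remote" : "local";
}
```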
CLI Command Categories
#### Navigation Commands
browse open <url> [--wait load|domcontentloaded|networkidle]
browse reload
browse back
browse forward
#### Interaction Commands
browse click <ref>
browse type <text>
browse press <key>
browse fill <selector> <value>
#### Information Commands
browse get url|title|text|html|value|box
browse snapshot [-c|--compact]
browse screenshot
#### Session Management
browse start|stop|restart
browse env local|remote
browse attach <port|url>
Sources: packages/cli/README.md:100-200
Getting Started
Installation from Source
git clone https://github.com/browserbase/stagehand.git
cd stagehand
pnpm install
pnpm run build
pnpm run example
Branch Installation
Using gitpkg, install directly from a branch:
"@browserbasehq/stagehand": "https://gitpkg.now.sh/browserbase/stagehand/packages/core?<branchName>"
Environment Configuration
Create a .env file from the example:
cp .env.example .env
# Add your API keys:
# BROWSERBASE_API_KEY=your_key
# OPENAI_API_KEY=your_key (or other LLM provider)
Sources: README.md:50-80, packages/core/README.md:50-80
Version History
Recent significant changes in the CLI package:
| Version | Change | PR |
|---|---|---|
| 0.4.2 | Added browse get markdown command for HTML-to-markdown conversion | #1907 |
| 0.4.1 | Fixed invalid metadata key using underscore | #1911 |
| 0.4.0 | Added new feature | #1889 |
Sources: packages/cli/CHANGELOG.md:1-20
License and Community
Stagehand is released under the MIT License. The project maintains an active community through:
- Documentation: docs.stagehand.dev
- Discord Community: stagehand.dev/discord
- DeepWiki Integration: Ask questions at deepwiki.com/browserbase/stagehand
A Python implementation is also available at github.com/browserbase/stagehand-python.
Sources: README.md:1-30, packages/core/README.md:1-30
Architecture Overview
Related topics: Project Introduction, CDP Engine, Core Actions, Server API
Introduction
Stagehand is an AI-powered browser automation framework that enables developers to control web browsers using natural language instructions combined with precise code-based control. The architecture is designed to bridge the gap between high-level AI agents and low-level browser automation frameworks like Playwright, providing reliability, extensibility, and self-healing capabilities essential for production environments.
The framework operates on a multi-layered architecture that separates concerns between browser session management, agent reasoning, tool execution, and page interaction primitives. This design allows users to choose when to leverage AI for navigating unfamiliar pages and when to use explicit code for deterministic operations.
Core Architecture Layers
Stagehand's architecture consists of four primary layers working in concert to provide browser automation capabilities:
| Layer | Purpose | Key Components |
|---|---|---|
| Agent Layer | High-level reasoning and decision making | MicrosoftCUAClient, System Prompts |
| Context Layer | Browser session and page lifecycle management | BrowserContext, Page management |
| Understudy Layer | Low-level browser primitives and frame handling | FrameLocator, Page interactions |
| Transport Layer | Local/Remote browser connectivity | CDP (Chrome DevTools Protocol) |
Browser Context Management
The BrowserContext class (defined in context.ts) serves as the central hub for managing browser sessions. It handles multiple pages, initialization scripts, and coordinate-based element interactions.
Page Management
The context maintains a mapping of pages organized by target ID, enabling multi-tab support:
pages(): Page[] {
const rows: Array<{ tid: TargetId; page: Page; created: number }> = [];
for (const [tid, page] of this.pagesByTarget) {
if (this.typeByTarget.get(tid) === "page") {
rows.push({ tid, page, created: this.createdAtByTarget.get(tid) ?? 0 });
}
}
rows.sort((a, b) => a.created - b.created);
return rows.map((r) => r.page);
}
Sources: context.ts:lines
Pages are returned in creation order (oldest to newest), and OOPIF (Out-of-Process iframe) targets are intentionally excluded from this listing to maintain a clean multi-tab abstraction.
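To see the ordering and the OOPIF exclusion in isolation, here is a self-contained sketch that reuses the map names from the excerpt. ContextSketch, PageStub, and addTarget are stand-ins invented for this example; the real BrowserContext populates its maps from CDP target events.

```typescript
// Simplified stand-ins for the context's internal state.
type TargetId = string;
interface PageStub { title: string }

class ContextSketch {
  private pagesByTarget = new Map<TargetId, PageStub>();
  private typeByTarget = new Map<TargetId, string>();
  private createdAtByTarget = new Map<TargetId, number>();

  // Hypothetical helper: record a target, its type, and its creation time.
  addTarget(tid: TargetId, type: string, page: PageStub, created: number): void {
    this.pagesByTarget.set(tid, page);
    this.typeByTarget.set(tid, type);
    this.createdAtByTarget.set(tid, created);
  }

  // Same logic as the excerpt: keep only "page" targets, oldest first.
  pages(): PageStub[] {
    const rows: Array<{ page: PageStub; created: number }> = [];
    for (const [tid, page] of this.pagesByTarget) {
      if (this.typeByTarget.get(tid) === "page") {
        rows.push({ page, created: this.createdAtByTarget.get(tid) ?? 0 });
      }
    }
    rows.sort((a, b) => a.created - b.created);
    return rows.map((r) => r.page);
  }
}
```

Registering a newer tab, an older tab, and an iframe target (in that order) yields only the two tabs, older one first.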
Initialization Scripts
The context supports both seeding and full registration of initialization scripts:
private async applyInitScriptsToPage(
page: Page,
opts?: { seedOnly?: boolean },
): Promise<void> {
if (opts?.seedOnly) {
for (const source of this.initScripts) {
page.seedInitScript(source);
}
return;
}
for (const source of this.initScripts) {
await page.registerInitScript(source);
}
}
Sources: context.ts:lines
This dual-mode initialization allows scripts to be either eagerly registered or lazily seeded for later injection.
Frame Handling Architecture
Stagehand implements sophisticated frame management through the FrameLocator module, handling both same-process and cross-process frame scenarios including OOPIFs.
Lifecycle-Aware Frame Attachment
The frame attachment process waits for specific lifecycle events before exposing the main world context:
const onLifecycle = (evt: Protocol.Page.LifecycleEventEvent) => {
if (
evt.frameId !== childFrameId ||
(evt.name !== "DOMContentLoaded" &&
evt.name !== "load" &&
evt.name !== "networkIdle" &&
evt.name !== "networkidle")
) {
return;
}
if (hasMainWorldOnParent()) return finish();
// ... handle frame ownership transfer
};
Sources: frameLocator.ts:lines
This approach ensures that automation scripts don't attempt to interact with frames before they have reached a stable state.
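The filter at the top of the handler can be read as a single predicate. The sketch below restates it in positive form; the interface and function names are made up for illustration, and the event shape is reduced to the two fields the excerpt inspects.

```typescript
// Reduced event shape: only the fields the handler checks.
interface LifecycleEventSketch { frameId: string; name: string }

// Lifecycle names the excerpt accepts; both spellings of network idle are included.
const READY_EVENT_NAMES = ["DOMContentLoaded", "load", "networkIdle", "networkidle"];

// True when the event belongs to the awaited child frame and signals a ready state.
function isReadyEventFor(childFrameId: string, evt: LifecycleEventSketch): boolean {
  return evt.frameId === childFrameId && READY_EVENT_NAMES.indexOf(evt.name) !== -1;
}
```

Events for other frames, or events such as init that fire before the DOM is parsed, return false and are ignored.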
Agent System Design
System Prompt Architecture
The agent receives structured prompts that organize its behavior into distinct sections:
<system>
<identity>You are a web automation assistant...</identity>
<task>
<goal>${executionInstruction}</goal>
<date display="local" iso="${isoDate}">${localeDate}</date>
</task>
<page>
<startingUrl>...</startingUrl>
</page>
<mindset>...</mindset>
<guidelines>...</guidelines>
<page_understanding_protocol>...</page_understanding_protocol>
<navigation>...</navigation>
<tools>...</tools>
<strategy>...</strategy>
<roadblocks>...</roadblocks>
<variables>...</variables>
<completion>...</completion>
</system>
Sources: agentSystemPrompt.ts:lines
Hybrid Mode Page Understanding
The system supports two modes of page understanding based on the isHybridMode flag:
| Mode | Primary Tool | Secondary Tool | Use Case |
|---|---|---|---|
| Hybrid | screenshot | ariaTree | Visual confirmation + accessible content |
| Standard | ariaTree | screenshot | Text-focused accessibility tree |
Sources: agentSystemPrompt.ts:lines
The page understanding protocol ensures agents start by comprehending the page state before taking actions:
<page_understanding_protocol>
<step_1>
<title>UNDERSTAND THE PAGE</title>
<primary_tool>
<name>screenshot|ariaTree</name>
<usage>Get complete page context before taking actions</usage>
</primary_tool>
</step_1>
</page_understanding_protocol>
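One way to picture how isHybridMode selects the primary tool is a small renderer for the first protocol step. This is a sketch under assumptions: the function names are hypothetical and the real template in agentSystemPrompt.ts contains more content than shown here.

```typescript
interface PageUnderstandingTools { primary: string; secondary: string }

// Mirrors the mode table: hybrid mode is screenshot-first, standard mode is ariaTree-first.
function pageUnderstandingTools(isHybridMode: boolean): PageUnderstandingTools {
  return isHybridMode
    ? { primary: "screenshot", secondary: "ariaTree" }
    : { primary: "ariaTree", secondary: "screenshot" };
}

// Render step 1 of the protocol with the primary tool filled in.
function renderStepOne(isHybridMode: boolean): string {
  const { primary } = pageUnderstandingTools(isHybridMode);
  return [
    "<page_understanding_protocol>",
    "  <step_1>",
    "    <title>UNDERSTAND THE PAGE</title>",
    "    <primary_tool>",
    `      <name>${primary}</name>`,
    "      <usage>Get complete page context before taking actions</usage>",
    "    </primary_tool>",
    "  </step_1>",
    "</page_understanding_protocol>",
  ].join("\n");
}
```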
Microsoft CUA Agent Actions
The Microsoft CUA client defines a comprehensive set of agent actions:
| Action | Purpose | Parameters |
|---|---|---|
| left_click | Click at coordinate | coordinate: [x, y] |
| scroll | Scroll page | pixels: number, coordinate?: [x, y] |
| visit_url | Navigate to URL | url: string |
| web_search | Perform search | query: string |
| history_back | Navigate backward | None |
| pause_and_memorize_fact | Store context | fact: string |
| wait | Pause execution | time: number |
| terminate | End task | status: success/failure |
| type | Input text | text: string, press_enter?: boolean |
| key | Keyboard input | keys: string[] |
Sources: MicrosoftCUAClient.ts:lines
FARA Function Calling Protocol
The agent uses an XML-based function calling format:
<tools>
${toolSchema}
</tools>
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
Sources: MicrosoftCUAClient.ts:lines
This format separates agent thoughts from function calls, enabling clear reasoning before action execution.
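On the receiving side, the model's reply has to be scanned for tool_call blocks. The parser below is a minimal sketch of that step, not the client's actual implementation; parseToolCalls and the decision to skip malformed blocks silently are assumptions.

```typescript
interface ToolCall { name: string; arguments: Record<string, unknown> }

// Extract every <tool_call>...</tool_call> block and parse its JSON body.
// Blocks whose body is not valid JSON, or that lack a string "name", are skipped.
function parseToolCalls(modelOutput: string): ToolCall[] {
  const calls: ToolCall[] = [];
  const pattern = /<tool_call>([\s\S]*?)<\/tool_call>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(modelOutput)) !== null) {
    try {
      const parsed = JSON.parse(match[1].trim()) as { name?: unknown; arguments?: unknown };
      if (typeof parsed.name !== "string") continue;
      const args =
        typeof parsed.arguments === "object" && parsed.arguments !== null
          ? (parsed.arguments as Record<string, unknown>)
          : {};
      calls.push({ name: parsed.name, arguments: args });
    } catch {
      // Malformed JSON: ignore this block and keep scanning.
    }
  }
  return calls;
}
```

Any free text before the block (the agent's reasoning) is left untouched, which matches the separation of thoughts from calls described above.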
Variable System
The framework supports variable substitution in tool calls using %variableName% syntax:
const variableToolsNote = isHybridMode
? "Use %variableName% syntax in the type, fillFormVision, or act tool's value/text/action fields."
: "Use %variableName% syntax in the act or fillForm tool's action fields.";
Sources: agentSystemPrompt.ts:lines
Variables are defined with optional descriptions and rendered in XML format:
<variables>
<variable name="password" />
<variable name="username">The username for login</variable>
</variables>
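The two halves of the variable system, rendering names into the prompt and substituting values at execution time, can be sketched as follows. All names here are hypothetical, and keeping values out of the rendered XML is an assumption based on the example above, which shows only names and descriptions.

```typescript
// value is what gets substituted; description is optional, as in the XML example.
interface VariableSketch { value: string; description?: string }

// Escape characters that are significant in XML text and attribute values.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render variable names and descriptions (not values) into a <variables> block.
function renderVariables(vars: Record<string, VariableSketch>): string {
  const lines = Object.keys(vars).map((key) => {
    const description = vars[key].description;
    return description
      ? `  <variable name="${escapeXml(key)}">${escapeXml(description)}</variable>`
      : `  <variable name="${escapeXml(key)}" />`;
  });
  return ["<variables>", ...lines, "</variables>"].join("\n");
}

// Replace %name% placeholders with values; unknown names are left untouched.
function substituteVariables(text: string, vars: Record<string, VariableSketch>): string {
  return text.replace(/%([A-Za-z_][A-Za-z0-9_]*)%/g, (whole: string, key: string) =>
    Object.prototype.hasOwnProperty.call(vars, key) ? vars[key].value : whole,
  );
}
```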
Session Management
Daemon-Based Architecture
The CLI architecture uses a persistent daemon for browser session management:
graph TD
A[CLI Command] --> B[Daemon Check]
B -->|Running| C[Send Command]
B -->|Not Running| D[Auto-restart Daemon]
D --> C
C --> E[Browser Instance]
E --> F[Page Operations]
The daemon supports two execution modes:
| Mode | Trigger | Use Case |
|---|---|---|
| remote | BROWSERBASE_API_KEY is set | Cloud browser infrastructure |
| local | No API key detected | Local browser instances |
Multi-Session Support
Sessions can be named for parallel browser instances:
browse --session <name>
The context stores metadata using session names, enabling clean separation of concurrent automation tasks.
Tool Categories
Navigation Tools
| Tool | Description |
|---|---|
| open | Navigate to URL with configurable wait states |
| reload | Refresh current page |
| back / forward | Browser history navigation |
Interaction Tools
| Tool | Description |
|---|---|
| click | Click element by reference or coordinates |
| type | Text input with optional delays |
| press | Keyboard shortcuts |
| hover | Mouse hover at coordinates |
| scroll | Scroll operations with delta support |
Page Information Tools
| Tool | Description |
|---|---|
| snapshot | Accessibility tree with element refs |
| screenshot | Visual capture (PNG/JPEG) |
| get | Retrieve URL, title, text, HTML, values |
| get markdown | Convert HTML to markdown |
Strategy Guidelines
The system embeds strategic guidelines for agent behavior:
<strategy>
<item>CRITICAL: Use extract ONLY when the task explicitly requires structured data output</item>
<item>Keep actions atomic and verify outcomes before proceeding</item>
<item>For each action, provide clear reasoning about why you're taking that step</item>
<item>When you need to input text, prefer using the keys tool to type the entire sequence at once</item>
</strategy>
Sources: agentSystemPrompt.ts:lines
Configuration Options
The agent system prompt accepts the following configuration parameters:
| Parameter | Type | Purpose |
|---|---|---|
| executionInstruction | string | Task description for the agent |
| url | string | Starting page URL |
| systemInstructions | string | Custom system-level instructions (CDATA) |
| variables | Record | Variable name/value pairs |
| isHybridMode | boolean | Enable screenshot-first page understanding |
| captchasAutoSolve | boolean | Enable CAPTCHA auto-solving (shows roadblocks section) |
| hasSearch | boolean | Enable search tool usage |
| isLocal | boolean | Local vs remote browser context |
Error Handling
The context manages HTTP header configuration errors with detailed session reporting:
const failures = Array.from(result.errors.entries()).map(([target, entry]) => {
const reason = entry.result.reason as Error;
const sid = entry.session.id ?? "unknown";
const message = reason?.message ?? String(reason);
return `session=${sid} error=${message}`;
});
if (failures.length) {
throw new StagehandSetExtraHTTPHeadersError(failures);
}
Sources: context.ts:lines
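The aggregation pattern in the excerpt, one message per failed session followed by a single throw, can be tried out with the stand-in types below. HeaderUpdateError is a hypothetical substitute for StagehandSetExtraHTTPHeadersError, whose real constructor may differ, and the Settled type is a local equivalent of a settled promise result.

```typescript
// Local equivalent of a settled promise result.
type Settled = { status: "fulfilled" } | { status: "rejected"; reason: unknown };

interface SessionOutcome { sessionId?: string; result: Settled }

// Hypothetical error class used only in this sketch.
class HeaderUpdateError extends Error {
  constructor(public readonly failures: string[]) {
    super(`setExtraHTTPHeaders failed for ${failures.length} session(s): ${failures.join("; ")}`);
    this.name = "HeaderUpdateError";
  }
}

// Collect one message per failed session and throw a single aggregated error.
function throwIfAnyFailed(outcomes: SessionOutcome[]): void {
  const failures: string[] = [];
  for (const outcome of outcomes) {
    const settled = outcome.result;
    if (settled.status !== "rejected") continue;
    const reason = settled.reason;
    const message = reason instanceof Error ? reason.message : String(reason);
    failures.push(`session=${outcome.sessionId ?? "unknown"} error=${message}`);
  }
  if (failures.length) throw new HeaderUpdateError(failures);
}
```

Reporting every failed session in one error avoids the situation where fixing the first failure only reveals the next one.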
Extension Points
Stagehand provides several extension mechanisms:
- Custom Instructions: Via the systemInstructions parameter with CDATA wrapping for complex multi-line content
- Variables: For sensitive data injection (passwords, tokens)
- Init Scripts: JavaScript injection at page load time
- Tool Schema: Extensible action definitions in the Microsoft CUA client
CDP Engine
Related topics: Architecture Overview, DOM and Accessibility Tree, Core Actions
The CDP (Chrome DevTools Protocol) Engine is the low-level abstraction layer in Stagehand that provides direct communication between the framework and web browsers. It orchestrates browser launching, session management, frame handling, execution context management, and CDP command dispatching.
Architecture Overview
graph TD
A[Stagehand API] --> B[CDP Engine]
B --> C[Launch Module]
B --> D[Session Manager]
B --> E[Frame Registry]
B --> F[Execution Context Registry]
C --> G[Local Browser]
C --> H[Browserbase Cloud]
D --> I[CDP WebSocket Connections]
E --> J[Frame Lifecycle Events]
F --> K[Isolated JS Contexts]
The CDP Engine serves as the foundation layer that Stagehand's high-level browser automation primitives are built upon. It abstracts away the complexity of raw CDP WebSocket communication while providing type-safe interfaces for all browser operations.
Sources: packages/core/lib/v3/understudy/cdp.ts:1-50
Core Components
Launch Module
The launch module handles browser instantiation across different deployment environments.
| Launch Mode | Source File | Description |
|---|---|---|
| Local Browser | packages/core/lib/v3/launch/local.ts | Launches Chromium locally using Playwright's browser management |
| Browserbase Cloud | packages/core/lib/v3/launch/browserbase.ts | Connects to remote browser infrastructure for scalable execution |
graph LR
A[Launch Request] --> B{Local or Remote?}
B -->|Local| C[local.ts]
B -->|Remote| D[browserbase.ts]
C --> E[Playwright Browser]
D --> F[CDP Endpoint]
The local launch implementation uses Playwright's browser management system to spawn Chromium instances with the necessary debugging flags enabled. Remote launches establish WebSocket connections to Browserbase's cloud browser fleet.
Sources: packages/core/lib/v3/launch/local.ts:1-100
Session Manager
The Session Manager (context.ts) maintains active browser pages and their associated CDP sessions.
// Pages retrieval - returns top-level pages oldest to newest
pages(): Page[] {
const rows: Array<{ tid: TargetId; page: Page; created: number }> = [];
for (const [tid, page] of this.pagesByTarget) {
if (this.typeByTarget.get(tid) === "page") {
rows.push({ tid, page, created: this.createdAtByTarget.get(tid) ?? 0 });
}
}
rows.sort((a, b) => a.created - b.created);
return rows.map((r) => r.page);
}
The session manager intentionally excludes OOPIF (Out-of-Process iframe) targets from page listings, focusing on top-level browsing contexts.
Sources: packages/core/lib/v3/understudy/context.ts:60-75
Frame Registry
The Frame Registry (frameRegistry.ts) tracks all frame instances across page hierarchies. It maintains the relationship between parent and child frames, enabling accurate frame targeting for CDP commands.
| Method | Purpose |
|---|---|
| registerFrame | Track newly created frames |
| unregisterFrame | Clean up destroyed frames |
| getParentFrame | Resolve parent frame context |
| getChildFrames | List direct children of a frame |
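A minimal in-memory registry with these four methods might look like the sketch below. It is illustrative only: the class name, the string frame ids, and the recursive cleanup in unregisterFrame are assumptions, and the real frameRegistry.ts may track considerably more state.

```typescript
// Hypothetical registry keyed by frame id; each entry stores the parent id.
class FrameRegistrySketch {
  private parentOf = new Map<string, string | null>();

  // Track a new frame; parentId is null for a main frame.
  registerFrame(frameId: string, parentId: string | null): void {
    this.parentOf.set(frameId, parentId);
  }

  // Remove a frame and, recursively, every frame nested inside it.
  unregisterFrame(frameId: string): void {
    for (const child of this.getChildFrames(frameId)) this.unregisterFrame(child);
    this.parentOf.delete(frameId);
  }

  // Parent frame id, null for a main frame, undefined if the frame is unknown.
  getParentFrame(frameId: string): string | null | undefined {
    return this.parentOf.get(frameId);
  }

  // Direct children only, in registration order.
  getChildFrames(frameId: string): string[] {
    const children: string[] = [];
    this.parentOf.forEach((parentId, id) => {
      if (parentId === frameId) children.push(id);
    });
    return children;
  }
}
```

Storing only the parent link keeps registration cheap; child lists are derived on demand, which is adequate for the small frame counts of a typical page.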
Execution Context Registry
The Execution Context Registry (executionContextRegistry.ts) manages JavaScript execution contexts within frames.
graph TD
A[Page Load] --> B[Create Main World Context]
B --> C[Optional: Create Isolated World]
C --> D[Register Context in Registry]
D --> E[Frame Ready for Script Execution]
F[iframe Navigation] --> G[Create New Context]
G --> D
Each frame can have multiple execution contexts:
- Main World: The default JavaScript context where page scripts run
- Isolated World: Sandboxed contexts for extension scripts or injected code
Sources: packages/core/lib/v3/understudy/executionContextRegistry.ts:1-80
Frame Lifecycle Management
The frameLocator.ts module handles the complex state transitions of browser frames, particularly for iframes and cross-origin navigations.
const onLifecycle = (evt: Protocol.Page.LifecycleEventEvent) => {
if (
evt.frameId !== childFrameId ||
(evt.name !== "DOMContentLoaded" &&
evt.name !== "load" &&
evt.name !== "networkIdle" &&
evt.name !== "networkidle")
) {
return;
}
// Handle frame initialization
};
Key lifecycle events monitored:
- DOMContentLoaded: Frame DOM is parsed
- load: All resources loaded
- networkIdle / networkidle: No pending network requests
Sources: packages/core/lib/v3/understudy/frameLocator.ts:25-40
Cookie Management
The cookies.ts module provides CDP-compatible cookie handling with validation and normalization.
export function normalizeCookieParams(cookies: CookieParam[]): CookieParam[] {
return cookies.map((c) => {
if (!c.url && !(c.domain && c.path)) {
throw new CookieValidationError(
`Cookie "${c.name}" must have a url or a domain/path pair`
);
}
// Additional validation for secure/sameSite pairing (omitted in this excerpt)
return c; // the map callback must return the cookie
});
}
| Validation Rule | CDP Requirement |
|---|---|
| URL or Domain/Path | Cookies must have either url or both domain+path |
| Secure + SameSite=None | Browsers require secure: true when using sameSite: "None" |
| Host-only cookies | When url provided, domain and path are derived |
The module enforces these constraints before sending cookies to CDP, preventing silent failures from browser rejections.
Sources: packages/core/lib/v3/understudy/cookies.ts:30-60
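The first two rules in the table can be checked with a few lines of code. The sketch below collects violations instead of throwing so that both rules are visible at once; the CookieSketch interface and cookieProblems function are hypothetical and simplified relative to the CookieParam type used in cookies.ts.

```typescript
interface CookieSketch {
  name: string;
  value: string;
  url?: string;
  domain?: string;
  path?: string;
  secure?: boolean;
  sameSite?: "Strict" | "Lax" | "None";
}

// Return a list of rule violations; an empty list means the cookie passes.
function cookieProblems(c: CookieSketch): string[] {
  const problems: string[] = [];
  // Rule 1: either a url, or both domain and path.
  if (!c.url && !(c.domain && c.path)) {
    problems.push(`Cookie "${c.name}" must have a url or a domain/path pair`);
  }
  // Rule 2: sameSite "None" is only accepted together with secure: true.
  if (c.sameSite === "None" && c.secure !== true) {
    problems.push(`Cookie "${c.name}" uses sameSite "None" and therefore needs secure: true`);
  }
  return problems;
}
```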
CDP Command Dispatch
The CDP Engine provides a unified interface for sending commands to browsers:
sequenceDiagram
Client->>CDP Engine: Execute CDP Command
CDP Engine->>Validation: Check Parameters
Validation->>Session Manager: Route to Correct Session
Session Manager->>CDP WebSocket: Send Command
CDP WebSocket-->>Session Manager: CDP Response
Session Manager-->>CDP Engine: Typed Response
CDP Engine-->>Client: Result
Supported Command Categories
| Category | Capabilities |
|---|---|
| Page Operations | Navigation, reload, back/forward |
| Frame Management | Frame creation, destruction, activation |
| Script Execution | Evaluate expressions, call functions |
| Input Handling | Mouse, keyboard, touch events |
| Network Monitoring | Request/response capture |
| Runtime Inspection | Console access, breakpoints |
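At the protocol level, CDP commands are JSON messages carrying a numeric id, and each response echoes the id of the command it answers. The sketch below shows that request/response correlation with an injected send function; CdpDispatcherSketch is a hypothetical class written for this example and is not Stagehand's implementation.

```typescript
interface CdpRequest { id: number; method: string; params?: object; sessionId?: string }
interface CdpResponse { id: number; result?: unknown; error?: { code: number; message: string } }

class CdpDispatcherSketch {
  private nextId = 1;
  private pending = new Map<number, { resolve: (v: unknown) => void; reject: (e: Error) => void }>();

  // sendRaw stands in for WebSocket.send; it receives the serialized request.
  constructor(private readonly sendRaw: (json: string) => void) {}

  // Send a command and return a promise that settles when the matching response arrives.
  send(method: string, params?: object, sessionId?: string): Promise<unknown> {
    const id = this.nextId++;
    const request: CdpRequest = { id, method, params, sessionId };
    return new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
      this.sendRaw(JSON.stringify(request));
    });
  }

  // Call for each incoming message; events (messages without an id) are ignored here.
  handleMessage(json: string): void {
    const msg = JSON.parse(json) as Partial<CdpResponse>;
    if (typeof msg.id !== "number") return;
    const entry = this.pending.get(msg.id);
    if (!entry) return;
    this.pending.delete(msg.id);
    if (msg.error) entry.reject(new Error(`${msg.error.code}: ${msg.error.message}`));
    else entry.resolve(msg.result);
  }
}
```

Injecting the send function keeps the correlation logic independent of the transport, so the same class can be exercised in tests without opening a WebSocket.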
Initialization Scripts
The CDP Engine supports injecting initialization scripts into pages:
private async applyInitScriptsToPage(
page: Page,
opts?: { seedOnly?: boolean },
): Promise<void> {
if (opts?.seedOnly) {
for (const source of this.initScripts) {
page.seedInitScript(source);
}
return;
}
for (const source of this.initScripts) {
await page.registerInitScript(source);
}
}
| Script Type | Timing | Use Case |
|---|---|---|
| seedInitScript | Before page loads | Pre-inject polyfills, shims |
| registerInitScript | After page load | Runtime modifications |
Error Handling
The CDP Engine implements robust error handling for common CDP failure scenarios:
// Session-level error collection
const failures = pendingEntries
.filter((entry) => entry.result.status === "failure")
.map((entry) => {
const reason = entry.result.reason as Error;
const sid = entry.session.id ?? "unknown";
const message = reason?.message ?? String(reason);
return `session=${sid} error=${message}`;
});
if (failures.length) {
throw new StagehandSetExtraHTTPHeadersError(failures);
}
Custom error types:
- StagehandSetExtraHTTPHeadersError: Failed header injection
- CookieValidationError: Invalid cookie parameters
- ExecutionContextError: Script execution failures
Sources: packages/core/lib/v3/understudy/context.ts:45-55
Integration with Agent System
The CDP Engine is used by higher-level agents (like MicrosoftCUAClient) to perform browser automation:
// Supported actions in Microsoft CUA integration
const supportedActions = [
"left_click",
"scroll",
"visit_url",
"web_search",
"history_back",
"pause_and_memorize_fact",
"wait",
"terminate",
];
Each action maps to specific CDP commands, with the CDP Engine abstracting the protocol-level details.
Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:50-70
Configuration
| Option | Type | Default | Description |
|---|---|---|---|
| headless | boolean | auto | Run browser in headless mode |
| timeout | number | 30000 | Default command timeout (ms) |
| viewport | Viewport | 1280x720 | Browser viewport size |
| userAgent | string | auto | Custom user agent string |
| ignoreHTTPSErrors | boolean | false | Allow invalid certificates |
Summary
The CDP Engine is the foundational layer that enables Stagehand's browser automation capabilities. It provides:
- Abstraction: Type-safe interfaces over raw CDP WebSocket communication
- Multi-environment: Support for both local and cloud browser execution
- Lifecycle management: Frame and execution context tracking
- Validation: Cookie and parameter normalization before CDP commands
- Error handling: Structured error types and recovery mechanisms
All high-level browser automation features in Stagehand build upon the CDP Engine's primitives, making it essential for understanding the framework's internals.
Core Actions
Related topics: CDP Engine, DOM and Accessibility Tree, Agent System
Overview
Core Actions are the fundamental building blocks of browser automation in Stagehand. They represent the primitive operations that the AI agent can perform to interact with web pages, extract information, and accomplish user-defined tasks. These actions bridge the gap between natural language instructions and executable browser operations.
The Core Actions system is designed around a modular architecture where different handlers manage specific types of interactions. Each handler is responsible for a category of actions, implementing the logic to translate high-level agent decisions into low-level browser commands.
Sources: packages/core/lib/v3/handlers/v3AgentHandler.ts
Architecture
The Core Actions system follows a layered architecture:
graph TD
A[User Instruction] --> B[Agent Handler]
B --> C[Action Router]
C --> D[actHandler]
C --> E[extractHandler]
C --> F[observeHandler]
D --> G[Browser CDP Commands]
E --> G
F --> G
H[Microsoft CUA Client] --> G
Handler Components
| Handler | Purpose | Key Operations |
|---|---|---|
| actHandler | Execute user interaction actions | click, type, scroll, hover, drag |
| extractHandler | Extract structured data from pages | DOM parsing, content extraction |
| observeHandler | Analyze and understand page state | accessibility tree, element detection |
| v3AgentHandler | Orchestrate agent workflow | task planning, action coordination |
Sources: packages/core/lib/v3/handlers/actHandler.ts
Action Types
Browser Navigation Actions
| Action | Parameters | Description |
|---|---|---|
| visit_url | url, wait | Navigate to a specified URL with optional wait state |
| history_back | - | Navigate back in browser history |
| history_forward | - | Navigate forward in browser history |
| reload | - | Reload the current page |
Interaction Actions
| Action | Parameters | Description |
|---|---|---|
| left_click | coordinate, element | Click at coordinates or on an element |
| right_click | coordinate, element | Perform right-click context menu action |
| mouse_move | coordinate | Move mouse to specified position |
| scroll | pixels, coordinate | Scroll by pixel amount (positive=up, negative=down) |
| drag | from, to, steps | Perform drag operation with interpolation |
| type | text, press_enter, delete_existing_text | Type text into focused input fields |
Information Retrieval Actions
| Action | Parameters | Description |
|---|---|---|
| extract | schema | Extract structured data matching a Zod schema |
| screenshot | full_page | Capture visual representation of page |
| aria_tree | ref | Get accessibility tree for element inspection |
Utility Actions
| Action | Parameters | Description |
|---|---|---|
| wait | time | Pause execution for specified seconds |
| pause_and_memorize_fact | fact | Store information for later retrieval |
| web_search | query | Execute a web search |
| terminate | status | End the task with success or failure status |
Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:1-100
Action Parameters
Common Parameters
interface ActionParameters {
action: string; // The action type to perform
element?: string; // Element reference (e.g., "@0-5")
coordinate?: [number, number]; // [x, y] pixel coordinates
ref?: string; // Element reference in ariaTree
}
Type Action Parameters
| Parameter | Type | Description |
|---|---|---|
| text | string | Text to type into the input field |
| press_enter | boolean | Whether to press Enter after typing |
| delete_existing_text | boolean | Clear existing text before typing |
Scroll Action Parameters
| Parameter | Type | Description |
|---|---|---|
| pixels | number | Positive values scroll up, negative scroll down |
| coordinate | [number, number] | Optional target coordinates for viewport-relative scrolling |
Extract Action Parameters
| Parameter | Type | Description |
|---|---|---|
| schema | ZodSchema | Zod schema defining the extraction structure |
| prompt | string | Natural language description of data to extract |
Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:20-85
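The parameter tables above can be read as a tagged union keyed on the action name. The following is a minimal sketch under that reading; the type names `Coordinate` and `AgentAction` and the helper `describeAction` are hypothetical and only cover three of the documented actions.

```typescript
// Hypothetical types modelled on the parameter tables; not the library's own definitions.
type Coordinate = [number, number];

type AgentAction =
  | { action: "type"; text: string; press_enter?: boolean; delete_existing_text?: boolean }
  | { action: "scroll"; pixels: number; coordinate?: Coordinate }
  | { action: "left_click"; coordinate?: Coordinate; element?: string };

// Summarise an action in one line, using the documented scroll sign convention
// (positive pixels scroll up, negative pixels scroll down).
function describeAction(a: AgentAction): string {
  if (a.action === "type") {
    return `type "${a.text}"${a.press_enter ? " then press Enter" : ""}`;
  }
  if (a.action === "scroll") {
    const direction = a.pixels >= 0 ? "up" : "down";
    return `scroll ${direction} by ${Math.abs(a.pixels)}px`;
  }
  return a.element !== undefined
    ? `click element ${a.element}`
    : `click at ${String(a.coordinate)}`;
}
```

Modelling the actions this way lets the compiler reject, for example, a scroll action that is missing its pixels value.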
Tool Schema
The Microsoft CUA client uses a FARA function-calling template for tool execution:
You are provided with function signatures within <tools></tools> XML tags:
<tools>
${toolSchema}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{{"name": <function-name>, "arguments": <args-json-object>}}
</tool_call>
The tool schema defines the complete contract between the agent reasoning system and the browser automation layer:
{
"name": "browser_automation",
"description": "Primary tool for browser automation tasks",
"parameters": {
"type": "object",
"properties": {
"action": {
"type": "string",
"enum": ["left_click", "scroll", "visit_url", ...]
},
"coordinate": {
"type": "array",
"description": "(x, y): The x and y coordinates for mouse operations"
},
"pixels": {
"type": "number",
"description": "Scroll amount; positive = up, negative = down"
}
},
"required": ["action"]
}
}
Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:85-130
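To make the contract concrete, here is a small sketch of how a schema of this shape could be rendered into the `<tools>` section and used to check a proposed call. `ToolSchema`, `ToolCall`, `renderToolPrompt`, and `isValidCall` are hypothetical names introduced for illustration, and the enum in `exampleSchema` is shortened for the example.

```typescript
// Hypothetical types and helpers for illustration; not the library's own API.
interface ToolSchema {
  name: string;
  description: string;
  parameters: {
    type: "object";
    properties: Record<string, unknown>;
    required: string[];
  };
}

interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

const exampleSchema: ToolSchema = {
  name: "browser_automation",
  description: "Primary tool for browser automation tasks",
  parameters: {
    type: "object",
    properties: {
      // Shortened action list, for the example only.
      action: { type: "string", enum: ["left_click", "scroll", "visit_url"] },
      pixels: { type: "number" },
    },
    required: ["action"],
  },
};

// Fill the <tools> section of the template with the serialised schema.
function renderToolPrompt(schema: ToolSchema): string {
  return [
    "You are provided with function signatures within <tools></tools> XML tags:",
    "<tools>",
    JSON.stringify(schema),
    "</tools>",
  ].join("\n");
}

// A call is acceptable when it names the tool and supplies every required argument.
function isValidCall(schema: ToolSchema, call: ToolCall): boolean {
  return (
    call.name === schema.name &&
    schema.parameters.required.every((key) => key in call.arguments)
  );
}
```

A check like `isValidCall` only covers required keys; validating argument types against the schema would need a fuller JSON Schema validator.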
Execution Flow
sequenceDiagram
participant User
participant Agent
participant Handler
participant CDP as Chrome DevTools Protocol
User->>Agent: Natural language instruction
Agent->>Agent: Reason and plan action
Agent->>Handler: Execute action with parameters
Handler->>CDP: Translate to CDP command
CDP-->>Handler: Operation result
Handler-->>Agent: Action completion status
Agent->>User: Task progress update
Page Context Management
The Context class manages page state and frame handling for multi-tab scenarios:
pages(): Page[] {
const rows: Array<{ tid: TargetId; page: Page; created: number }> = [];
for (const [tid, page] of this.pagesByTarget) {
if (this.typeByTarget.get(tid) === "page") {
rows.push({ tid, page, created: this.createdAtByTarget.get(tid) ?? 0 });
}
}
rows.sort((a, b) => a.created - b.created);
return rows.map((r) => r.page);
}
The context system maintains:
- Active page instances by target ID
- Frame lifecycle management
- Execution context tracking per frame
- Initialization script injection
Sources: packages/core/lib/v3/understudy/context.ts:1-50
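The ordering logic in `pages()` can be tried out in isolation. The sketch below mirrors it with stand-in types (`PageLike` and a string `TargetId` are assumptions, not the library's types) and a free function `pagesOldestFirst` that takes the three maps as arguments.

```typescript
// Stand-in types; the real Page and TargetId types live in the library.
type TargetId = string;
interface PageLike {
  url: string;
}

// Keep only targets of type "page" and return them oldest first.
function pagesOldestFirst(
  pagesByTarget: Map<TargetId, PageLike>,
  typeByTarget: Map<TargetId, string>,
  createdAtByTarget: Map<TargetId, number>,
): PageLike[] {
  const rows: Array<{ page: PageLike; created: number }> = [];
  pagesByTarget.forEach((page, tid) => {
    if (typeByTarget.get(tid) === "page") {
      // Targets without a recorded creation time sort first (treated as 0).
      rows.push({ page, created: createdAtByTarget.get(tid) ?? 0 });
    }
  });
  rows.sort((a, b) => a.created - b.created);
  return rows.map((r) => r.page);
}
```

Sorting by creation time rather than map insertion order keeps the result stable even if targets are registered out of order.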
Frame Locator
For handling nested frames and iframes, the frame locator system waits for frame readiness:
graph TD
A[Parent Page] --> B{Has Main World?}
B -->|No| C[Enable Lifecycle Events]
C --> D[Wait for DOMContentLoaded]
D --> E{Has Main World?}
E -->|Yes| F[Continue]
E -->|No| G[Wait with timeout]
G --> H[Get Session for Frame]
H --> I[Wait for Main World]
I --> F
B -->|Yes| F
Sources: packages/core/lib/v3/understudy/frameLocator.ts
System Prompt Configuration
The agent's behavior is configured through system prompts that include:
- Task Definition: The execution instruction and goal context
- Page Understanding Protocol: When to use ariaTree vs screenshot
- Strategy Guidelines: Best practices for action sequencing
- Variable Substitution: Support for dynamic values via %variableName% syntax
const systemPrompt = `<system>
<identity>You are a web automation assistant using browser automation tools</identity>
<task>
<goal>${cdata(executionInstruction)}</goal>
<date display="local" iso="${isoDate}">${localeDate}</date>
</task>
${customInstructionsBlock}
<page_understanding_protocol>
${isHybridMode ? screenshot + ariaTree : ariaTree + screenshot}
</page_understanding_protocol>
${variablesSection}
</system>`;
Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts
Error Handling
Actions can fail for several reasons:
| Error Type | Cause | Recovery Strategy |
|---|---|---|
| Element not found | Selector invalid or element removed | Re-observe page, retry with updated selector |
| Action timeout | Page not responding | Wait and retry |
| Frame detached | Frame navigated away | Re-acquire frame context |
| CDP error | Protocol communication failure | Retry CDP command |
Failed actions are logged with session context for debugging:
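// Illustrative reconstruction: `sessions` is a hypothetical list of { sid, promise } pairs.
const settled = await Promise.allSettled(sessions.map(({ promise }) => promise));
const failures = settled.flatMap((result, i) => {
  // Only rejected promises carry a `reason`; fulfilled ones are skipped.
  if (result.status !== "rejected") return [];
  const reason = result.reason as Error;
  const message = reason?.message ?? String(reason);
  return [`session=${sessions[i].sid} error=${message}`];
});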
Sources: packages/core/lib/v3/understudy/context.ts:50-80
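The recovery strategies in the table share one shape: attempt the action, run a recovery step on failure, then try again. The sketch below shows that shape with a hypothetical `withRetry` helper. It is synchronous for brevity, whereas real browser actions would be asynchronous, and it is not the library's own retry mechanism.

```typescript
// Hypothetical helper, synchronous for brevity; real browser actions would be async.
function withRetry<T>(
  action: () => T,
  recover: (error: Error) => void,
  maxAttempts = 3,
): T {
  let lastError: Error = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return action();
    } catch (error) {
      lastError = error instanceof Error ? error : new Error(String(error));
      // Recovery hook, e.g. re-observe the page or re-acquire the frame
      // before the next attempt. Skipped after the final attempt.
      if (attempt < maxAttempts) recover(lastError);
    }
  }
  throw lastError;
}
```

Passing the recovery step as a callback keeps the retry loop independent of which error type occurred.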
Best Practices
- Atomic Actions: Keep actions focused and single-purpose for better reliability
- Element References: Use ariaTree references (e.g., @0-5) when available for precise targeting
- Wait Appropriately: Allow page state to stabilize before subsequent actions
- Error Recovery: The system supports automatic retry and self-healing mechanisms
- Variable Usage: Use %variableName% for sensitive data like passwords
Sources: [packages/core/lib/v3/handlers/v3AgentHandler.ts](https://github.com/browserbase/stagehand/blob/main/packages/core/lib/v3/handlers/v3AgentHandler.ts)
DOM and Accessibility Tree
Related topics: CDP Engine, Core Actions, DOM and Accessibility Tree
Overview
The DOM and Accessibility Tree system in Stagehand provides the foundational mechanism for understanding and interacting with web pages. This dual-layer approach combines native DOM manipulation with accessibility tree analysis to enable reliable browser automation through AI agents.
The system serves two primary purposes:
- DOM Interaction - Direct manipulation of page elements including clicking, typing, scrolling, and form handling
- Accessibility Tree (ariaTree) - Structured representation of page content optimized for AI comprehension and element discovery
This architecture allows Stagehand to balance the precision of programmatic DOM access with the semantic understanding provided by accessibility APIs.
Source: https://github.com/browserbase/stagehand / Human Manual
LLM Providers
Related topics: Project Introduction, Agent System
Overview
Stagehand implements a flexible LLM Provider abstraction that enables browser automation agents to interact with various large language model backends. The provider system is designed to standardize how agents execute actions, parse responses, and handle tool invocations across different AI service providers.
The LLM Provider architecture follows a client-adapter pattern where each provider (OpenAI, Anthropic, Google) implements a common interface while exposing provider-specific capabilities and response formats.
Architecture
High-Level Component Flow
graph TD
A[Agent Client] --> B[LLM Provider Interface]
B --> C[OpenAI Client]
B --> D[Anthropic Client]
B --> E[Google Client]
C --> F[OpenAI API]
D --> G[Claude API]
E --> H[Gemini API]
F --> I[Action Execution]
G --> I
H --> I
System Prompt Construction
The agent system prompt is constructed dynamically based on execution context and mode. The buildAgentSystemPrompt function in agentSystemPrompt.ts assembles the prompt from multiple modular sections:
return `<system>
<identity>You are a web automation assistant using browser automation tools to accomplish the user's goal.</identity>
${customInstructionsBlock}<task>
<goal>${cdata(executionInstruction)}</goal>
<date display="local" iso="${isoDate}">${localeDate}</date>
</task>
...
</system>`;
Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:1-50
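The excerpt passes the user's instruction through a `cdata` helper before embedding it in the XML-like prompt. The sketch below shows one plausible behaviour for such a helper, assuming it wraps text in a CDATA section; the actual implementation is not shown in the source. `buildMinimalSystemPrompt` is a hypothetical, reduced builder covering only the identity and task sections.

```typescript
// Assumed behaviour: wrap text in a CDATA section, splitting any "]]>" so the
// user's text cannot close the section early.
function cdata(text: string): string {
  return `<![CDATA[${text.split("]]>").join("]]]]><![CDATA[>")}]]>`;
}

// Hypothetical, reduced prompt builder: identity and task sections only.
function buildMinimalSystemPrompt(executionInstruction: string, now: Date): string {
  const isoDate = now.toISOString();
  return [
    "<system>",
    "<identity>You are a web automation assistant using browser automation tools to accomplish the user's goal.</identity>",
    "<task>",
    `<goal>${cdata(executionInstruction)}</goal>`,
    `<date iso="${isoDate}" />`,
    "</task>",
    "</system>",
  ].join("\n");
}
```

Wrapping the goal this way means angle brackets in a user instruction are treated as text rather than as prompt structure.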
Provider Interface
Tool Schema Definition
LLM Providers use a standardized tool schema format for function calling. The schema defines available browser automation actions:
const toolSchema = {
properties: {
action: {
type: "string",
description: "The action to perform",
enum: ["screenshot", "extract", "click", "type", "wait", "scroll",
"hover", "press", "goBack", "goForward", "executeJs",
"fillForm", "fillFormVision", "act", "ariaTree",
"pause_and_memorize_fact", "terminate"],
},
// Additional parameters...
},
};
Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:1-100
Function Call Template
Providers implement XML-based tool calling using a standardized template:
<tools>
${toolDescs}
</tools>
<tool_call>
{{"name": <function-name>, "arguments": <args-json-object>}}
</tool_call>
This format enables reliable parsing of model responses across different LLM backends.
Supported Providers
Microsoft Computer Use Agent (CUA)
The Microsoft CUA client uses the FARA function-calling pattern with XML-based tool calling:
| Parameter | Type | Description |
|---|---|---|
| action | string | Action type to execute |
| selector | string | Element selector for DOM operations |
| text | string | Text content for typing or extraction |
| reasoning | string | Model's reasoning for the action |
| time | number | Wait duration in seconds |
| status | string | Task completion status |
Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:80-120
Agent Modes
Stagehand supports different operational modes that affect how the agent interacts with LLM providers:
Hybrid Mode
In hybrid mode, the agent prioritizes visual understanding:
const pageUnderstandingProtocol = isHybridMode
? `<page_understanding_protocol>
<step_1>
<primary_tool>
<name>screenshot</name>
<usage>Visual confirmation when needed</usage>
</primary_tool>
<secondary_tool>
<name>ariaTree</name>
<usage>Get complete page context before taking actions</usage>
</secondary_tool>
</step_1>
</page_understanding_protocol>`
: `<page_understanding_protocol>
<step_1>
<primary_tool>
<name>ariaTree</name>
<usage>Get complete page context before taking actions</usage>
</primary_tool>
<secondary_tool>
<name>screenshot</name>
<usage>Visual confirmation when needed</usage>
</secondary_tool>
</step_1>
</page_understanding_protocol>`;
Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:100-130
Variable Substitution
Providers support variable substitution in tool parameters for sensitive data:
const variableToolsNote = isHybridMode
? "Use %variableName% syntax in the type, fillFormVision, or act tool's value/text/action fields."
: "Use %variableName% syntax in the act or fillForm tool's action fields.";
Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:60-65
Tool Actions
Available Browser Automation Actions
| Action | Description | Primary Use Case |
|---|---|---|
| screenshot | Capture page screenshot | Visual verification |
| extract | Extract structured data | Data collection tasks |
| click | Click on element | Navigation/interaction |
| type | Type text into input | Form filling |
| wait | Wait for condition | Synchronization |
| scroll | Scroll the page | Content visibility |
| ariaTree | Get accessibility tree | Page structure understanding |
| fillForm | Fill form fields | Multi-field form completion |
Response Parsing
Thoughts and Action Extraction
The provider implements parsing logic to extract model thoughts and function calls from responses:
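private parseThoughtsAndAction(response: string): {
  thoughts: string;
  functionCall: FaraFunctionCall;
} {
  // Text before the <tool_call> tag is the model's reasoning; the JSON after it is the action.
  const parts = response.split("<tool_call>\n");
  const thoughts = parts[0].trim();
  const actionText = (parts[1] ?? "").split("</tool_call>")[0].trim();
  try {
    const functionCall = JSON.parse(actionText) as FaraFunctionCall;
    return { thoughts, functionCall };
  } catch (error) {
    // Assumption: the original error handling is elided in this excerpt; rethrow with context.
    throw new Error(`Failed to parse tool call: ${String(error)}`);
  }
}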
Sources: packages/core/lib/v3/agent/MicrosoftCUAClient.ts:150-180
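For experimenting with this response format outside the client class, a standalone variant can be useful. The sketch below is an assumption-laden rewrite, not the library's method: `ParsedToolCall` and `parseToolCall` are hypothetical names, and unlike the excerpt it locates the tags by index and tolerates a missing closing tag.

```typescript
// Hypothetical shape of a parsed call; the library uses its own FaraFunctionCall type.
interface ParsedToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

function parseToolCall(response: string): { thoughts: string; call: ParsedToolCall } {
  const openTag = "<tool_call>";
  const start = response.indexOf(openTag);
  if (start === -1) throw new Error("No <tool_call> block in response");
  const end = response.indexOf("</tool_call>", start);
  // Everything before the tag is treated as the model's free-form reasoning.
  const thoughts = response.slice(0, start).trim();
  const body = response
    .slice(start + openTag.length, end === -1 ? undefined : end)
    .trim();
  const parsed = JSON.parse(body) as Partial<ParsedToolCall>;
  if (typeof parsed.name !== "string") throw new Error("Tool call is missing a name");
  return { thoughts, call: { name: parsed.name, arguments: parsed.arguments ?? {} } };
}
```

`JSON.parse` throws on malformed input, so callers would typically wrap this in the same retry path used for other action failures.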
Context Management
LLM Providers operate within a browser context that manages multiple pages and execution environments:
pages(): Page[] {
const rows: Array<{ tid: TargetId; page: Page; created: number }> = [];
for (const [tid, page] of this.pagesByTarget) {
if (this.typeByTarget.get(tid) === "page") {
rows.push({ tid, page, created: this.createdAtByTarget.get(tid) ?? 0 });
}
}
rows.sort((a, b) => a.created - b.created);
return rows.map((r) => r.page);
}
Sources: packages/core/lib/v3/understudy/context.ts:80-95
Security Considerations
Cookie Handling
Providers interact with secure cookie handling:
export function normalizeCookieParams(cookies: CookieParam[]): CookieParam[] {
  return cookies.map((c) => {
    // Every cookie needs either a url or a domain/path pair so it can be scoped.
    if (!c.url && !(c.domain && c.path)) {
      throw new CookieValidationError(
        `Cookie "${c.name}" must have a url or a domain/path pair`,
      );
    }
    // The original also validates the secure flag for sameSite: "None" (elided in this excerpt).
    return c;
  });
}
Sources: packages/core/lib/v3/understudy/cookies.ts:50-75
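The two checks mentioned in the excerpt (scoping, and the secure flag for sameSite "None") can be sketched as a self-contained validator. `CookieInput` and `validateCookie` are hypothetical stand-ins, and this version collects problems instead of throwing, which is a design choice of the sketch rather than the library's behaviour.

```typescript
// Stand-in cookie shape for illustration; the library's CookieParam may have more fields.
interface CookieInput {
  name: string;
  value: string;
  url?: string;
  domain?: string;
  path?: string;
  secure?: boolean;
  sameSite?: "Strict" | "Lax" | "None";
}

// Return a list of human-readable problems; an empty list means the cookie passed.
function validateCookie(c: CookieInput): string[] {
  const problems: string[] = [];
  if (!c.url && !(c.domain && c.path)) {
    problems.push(`Cookie "${c.name}" must have a url or a domain/path pair`);
  }
  // Modern browsers such as Chrome reject SameSite=None cookies that are not Secure.
  if (c.sameSite === "None" && !c.secure) {
    problems.push(`Cookie "${c.name}" has sameSite "None" and must also set secure`);
  }
  return problems;
}
```

Returning all problems at once makes it easier to report every invalid cookie in a batch rather than stopping at the first.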
Configuration Options
System Prompt Variables
| Variable | Type | Description |
|---|---|---|
| executionInstruction | string | User's task description |
| url | string | Starting page URL |
| isoDate | string | Current ISO date |
| localeDate | string | Localized date string |
| variables | object | Key-value pairs for substitution |
| captchasAutoSolve | boolean | Enable CAPTCHA auto-solving |
Strategy Components
Common Strategy Items
The system prompt includes standardized guidance for all providers:
const commonStrategyItems = `
<item>CRITICAL: Use extract ONLY when the task explicitly requires structured data output.</item>
<item>Keep actions atomic and verify outcomes before proceeding.</item>
<item>For each action, provide clear reasoning about why you're taking that step.</item>
<item>When you need to input text that could be entered character-by-character or through multiple separate inputs, prefer using the keys tool.</item>
`;
Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:130-145
Frame and Context Handling
Execution Context Management
Providers handle frame navigation and context switching:
await parentSession
.send("Page.setLifecycleEventsEnabled", { enabled: true })
.catch(() => {});
await parentSession.send("Runtime.enable").catch(() => {});
Events monitored include:
- DOMContentLoaded
- load
- networkIdle
- networkidle
Sources: packages/core/lib/v3/understudy/frameLocator.ts:30-45
Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:1-50
Agent System
Related topics: LLM Providers, Core Actions
Stagehand is an AI-powered browser automation framework that enables developers to control web browsers using natural language instructions combined with precise code control. The Agent System is the core intelligence layer that coordinates AI-driven decision making with browser operations.
Overview
The Agent System serves as the orchestration layer between Large Language Models (LLMs) and browser automation. It provides:
- AI-driven navigation: Autonomous page understanding and navigation
- Action execution: Intelligent element detection and interaction
- Multi-modal page analysis: Combining visual screenshots with accessibility tree data
- Self-healing capabilities: Automatic recovery from session failures
- Variable substitution: Secure handling of sensitive data like passwords
Sources: packages/core/README.md
Architecture
graph TB
subgraph "Agent System"
A[User Instruction] --> B[AgentClient]
B --> C[System Prompt Generator]
C --> D[LLM Provider]
D --> E[Action Planner]
end
subgraph "Browser Layer"
E --> F[Act Tool]
E --> G[Screenshot Tool]
E --> H[AriaTree Tool]
E --> I[Extract Tool]
end
subgraph "Execution"
F --> J[Stagehand Context]
G --> J
H --> J
I --> J
J --> K[Browser Page]
end
Core Components
AgentClient
The AgentClient is the main entry point for agent-based browser automation. It handles:
- Initialization of AI providers (Anthropic, Google)
- System prompt construction with context-aware instructions
- Action execution loop with retry logic
- Session state management
Sources: packages/core/README.md
AI Provider Clients
Stagehand supports multiple AI providers through specialized client implementations:
| Provider | Client Class | Model Type |
|---|---|---|
| Anthropic | AnthropicCUAClient | Claude Computer Use |
| Google | GoogleCUAClient | Gemini Computer Use |
Each client implements a unified interface for:
- Action inference from LLM responses
- Tool call execution
- Error handling and recovery
Tool System
The Agent System exposes several tools for browser interaction:
| Tool | Purpose | Primary Use Case |
|---|---|---|
| act | Execute browser actions | Clicking, typing, scrolling |
| screenshot | Capture visual page state | Visual confirmation |
| ariaTree | Extract accessibility tree | Page structure analysis |
| extract | Structured data extraction | Data collection tasks |
Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:1-50
System Prompt Architecture
The system prompt is dynamically generated based on configuration and execution mode. It uses XML-like tags to structure instructions for the LLM.
graph TD
A[Base Prompt Template] --> B[Identity Section]
A --> C[Task Section]
A --> D[Page Section]
A --> E[Mindset Section]
A --> F[Guidelines Section]
A --> G[Tools Section]
A --> H[Strategy Section]
A --> I[Roadblocks Section]
A --> J[Variables Section]
B --> K[Generated System Prompt]
C --> K
D --> K
E --> K
F --> K
G --> K
H --> K
I --> K
J --> K
Key Prompt Sections
#### Page Understanding Protocol
The agent uses different page understanding strategies based on the execution mode:
Hybrid Mode (default):
- Primary tool: screenshot for visual confirmation
- Secondary tool: ariaTree for complete page context
Standard Mode:
- Primary tool: ariaTree for accessibility tree
- Secondary tool: screenshot for visual confirmation
Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:100-130
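The mode-dependent ordering comes down to swapping which tool is primary. A minimal sketch of that selection, with hypothetical names `PageTool`, `pageUnderstandingOrder`, and `renderProtocol`, and a much shorter protocol block than the real prompt:

```typescript
type PageTool = "screenshot" | "ariaTree";

// Hybrid mode leads with the screenshot; standard mode leads with the accessibility tree.
function pageUnderstandingOrder(isHybridMode: boolean): {
  primary: PageTool;
  secondary: PageTool;
} {
  return isHybridMode
    ? { primary: "screenshot", secondary: "ariaTree" }
    : { primary: "ariaTree", secondary: "screenshot" };
}

// Render a reduced protocol block (usage notes omitted for brevity).
function renderProtocol(isHybridMode: boolean): string {
  const { primary, secondary } = pageUnderstandingOrder(isHybridMode);
  return [
    "<page_understanding_protocol>",
    `<primary_tool><name>${primary}</name></primary_tool>`,
    `<secondary_tool><name>${secondary}</name></secondary_tool>`,
    "</page_understanding_protocol>",
  ].join("\n");
}
```

Keeping the ordering decision in one function avoids duplicating the two near-identical template branches.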
#### Strategy Guidelines
The system prompt includes critical guidelines:
- Use extract ONLY when structured data output is explicitly required
- Keep actions atomic and verify outcomes before proceeding
- Use keys tool for text input that requires character-by-character entry
- Prefer act for direct element interaction when available in ariaTree
#### Variables Handling
Sensitive data can be passed securely using variable substitution:
Variable Syntax: %variableName%
Supported tools for variable substitution:
- act tool's action fields
- fillForm or fillFormVision in hybrid mode
- type tool's value fields
Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:60-90
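A sketch of how `%variableName%` placeholders might be resolved at execution time follows. `substituteVariables` is a hypothetical helper; the identifier pattern and the choice to leave unknown placeholders untouched are assumptions of this sketch, not documented behaviour.

```typescript
// Replace %name% placeholders with values from `variables`.
// Assumption: names look like identifiers (letters, digits, underscore).
function substituteVariables(
  template: string,
  variables: Record<string, string>,
): string {
  return template.replace(
    /%([A-Za-z_][A-Za-z0-9_]*)%/g,
    (match: string, name: string) => {
      const value = Object.prototype.hasOwnProperty.call(variables, name)
        ? variables[name]
        : undefined;
      // Unknown placeholders are left as written rather than replaced with an empty string.
      return value ?? match;
    },
  );
}
```

Under this approach the model only ever sees the placeholder text, and the real value is filled in just before the action runs.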
Execution Modes
Local Mode
Runs Chrome directly on the local machine. Best for:
- Local debugging and development
- Fast iteration cycles
- Access to local network resources
Remote Mode
Runs a Browserbase session in the cloud. Best for:
- Anti-bot hardening
- Cloud deployments
- Scalable agent execution
Sources: packages/cli/README.md
Context Management
The StagehandContext class manages the underlying browser state:
// Page retrieval - returns pages oldest to newest
pages(): Page[]
// Init script management
applyInitScriptsToPage(page: Page, opts?: { seedOnly?: boolean }): Promise<void>
// Session error tracking
handleSessionErrors(): void
Page targets are filtered to exclude OOPIF (Out-of-Process iFrames) targets, ensuring only top-level pages are returned.
Sources: packages/core/lib/v3/understudy/context.ts:1-80
Frame Handling
The Agent System handles multi-frame page scenarios through the frameLocator module:
- Lifecycle event monitoring: Waits for DOMContentLoaded, load, or networkIdle events
- Frame context synchronization: Ensures main world execution context is available on parent frame
- Session ownership transfer: Handles frame ownership changes during page navigation
sequenceDiagram
participant Parent as Parent Frame
participant Child as Child Frame
participant Page as Page Object
Parent->>Child: Set Lifecycle Events Enabled
Child->>Parent: Lifecycle Event (DOMContentLoaded)
Parent->>Page: Get Session For Frame
Page-->>Parent: Session Owner
Parent->>Child: Wait For Main World
Child-->>Parent: Execution Context Ready
Sources: packages/core/lib/v3/understudy/frameLocator.ts:1-60
CAPTCHA Handling
When captchasAutoSolve is enabled, the system prompt includes a roadblocks section:
<roadblocks>
<note>{CAPTCHA_SYSTEM_PROMPT_NOTE}</note>
</roadblocks>
This informs the agent about automatic CAPTCHA resolution capabilities.
Sources: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:95-100
Session Recovery
The browse CLI implements automatic session recovery:
- Detects daemon or Chrome crashes
- Cleans up stale processes and files
- Restarts the daemon automatically
- Retries the failed command
Agents don't need explicit error handling for session failures.
Sources: packages/cli/README.md
CLI Integration
The Agent System is accessible via the browse CLI:
browse open <url> # Navigate with auto-start
browse click <ref> # Click by accessibility ref
browse type <text> # Type text input
browse snapshot [-c|--compact] # Get accessibility tree
browse screenshot [path] # Capture visual state
browse extract <schema> # Structured data extraction
Network Capture
HTTP requests can be captured for debugging:
browse network on # Start capturing
browse network off # Stop capturing
browse network path # Get capture directory
Captured requests are saved as:
/tmp/browse-default-network/
001-GET-api.github.com-repos/
request.json
response.json
Sources: packages/cli/CHANGELOG.md, packages/cli/README.md
Configuration Options
| Option | Type | Description |
|---|---|---|
| captchasAutoSolve | boolean | Enable automatic CAPTCHA solving |
| systemInstructions | string | Custom instructions for the agent |
| variables | Record | Key-value pairs for secure substitution |
| isHybridMode | boolean | Enable screenshot-first page understanding |
Data Flow
graph LR
A[User Instruction] --> B[AgentClient]
B --> C[Prompt Generator]
C --> D[LLM Inference]
D --> E[Action Decision]
E --> F[Tool Execution]
F --> G[Browser Response]
G --> H[Context Update]
H --> B
E -->|Page Understanding| I[screenshot]
E -->|Page Understanding| J[ariaTree]
I --> G
J --> G
Summary
The Agent System provides a robust abstraction layer for AI-driven browser automation. Key characteristics:
- Multi-provider support: Works with Anthropic Claude and Google Gemini computer use models
- Adaptive page understanding: Automatically selects optimal page analysis strategies
- Secure variable handling: Supports sensitive data substitution without exposure
- Self-healing sessions: Automatically recovers from browser failures
- Flexible deployment: Supports both local and cloud-based execution environments
This architecture enables developers to build reliable browser automation workflows that combine the flexibility of natural language instructions with the precision of programmatic control.
Sources: packages/core/README.md
MCP Integration
Related topics: Agent System
Note: The MCP (Model Context Protocol) integration files were referenced in this wiki task but are not present in the currently available repository context. This page is based on the overall Stagehand architecture and CLI capabilities documented in the provided source files.
Overview
The MCP (Model Context Protocol) integration in Stagehand enables the browser automation framework to communicate with external MCP servers, allowing AI agents to interact with specialized tools and services beyond built-in browser automation capabilities.
Architecture
MCP integration follows a client-server architecture where Stagehand acts as an MCP client that connects to external MCP servers providing additional functionality.
graph TD
A[Stagehand Agent] --> B[MCP Connection Manager]
B --> C[MCP Server 1]
B --> D[MCP Server 2]
B --> E[MCP Server N]
C --> F[External Tools/Services]
D --> G[External Tools/Services]
E --> H[External Tools/Services]
Connection Management
The MCP connection module (connection.ts) handles the lifecycle of MCP server connections:
| Method | Purpose |
|---|---|
| connect() | Establish connection to an MCP server |
| disconnect() | Close connection and cleanup resources |
| sendRequest() | Send request to connected MCP server |
| receiveResponse() | Handle responses from MCP server |
Utility Functions
The MCP utilities module (utils.ts) provides helper functions for:
- Message formatting and parsing
- Error handling and retry logic
- Connection state management
- Tool call serialization/deserialization
Usage Example
Basic integration pattern (from packages/core/examples/mcp.ts):
import { Stagehand } from "@browserbasehq/stagehand";
// Initialize Stagehand with MCP configuration
const stagehand = new Stagehand({
mcpServers: [
{
name: "my-mcp-server",
command: "npx",
args: ["-y", "@my-org/mcp-server"],
},
],
});
await stagehand.init();
// Use Stagehand with MCP tools available
await stagehand.page.goto("https://example.com");
Configuration Options
| Option | Type | Description |
|---|---|---|
| name | string | Display name for the MCP server |
| command | string | Executable to run the MCP server |
| args | string[] | Arguments passed to the MCP server command |
| env | Record<string, string> | Environment variables for the server |
| timeout | number | Connection timeout in milliseconds |
Related Components
- CLI Daemon (packages/cli): Handles MCP server spawning and lifecycle
- Shutdown Supervisor (supervisor.ts): Manages cleanup of MCP connections on exit
- Agent System Prompt (agentSystemPrompt.ts): Provides instructions for using MCP tools
References
- CLI daemon management: packages/cli/README.md
- Shutdown handling: packages/core/lib/v3/shutdown/supervisor.ts:1-50
- Agent tool usage: packages/core/lib/v3/agent/prompts/agentSystemPrompt.ts:1-100
Note: For complete MCP implementation details, refer to the source files listed at the top of this page once they are available in the repository context.
Server API
Related topics: Architecture Overview, CLI Tools
The Stagehand Server API provides a headless browser automation service designed to power AI agents and programmatic browser control. It exposes HTTP endpoints for launching, controlling, and managing browser sessions, serving as the backend infrastructure for the browse CLI command and SDK integrations.
Overview
The Server API is a RESTful service built in TypeScript that manages browser sessions using Playwright. It supports both local and remote execution modes, with the remote mode utilizing Browserbase's cloud infrastructure for scalable browser automation. The API handles session lifecycle management, CDP (Chrome DevTools Protocol) proxying, and persistent browser state across requests.
graph TD
subgraph "Client Layer"
CLI[CLI: browse]
SDK[SDK Client]
end
subgraph "Server Layer"
V3[Server v3]
V4[Server v4]
end
subgraph "Session Management"
SS[SessionStore]
DB[(Database)]
end
subgraph "Browser Layer"
Local[Local Playwright]
Remote[Browserbase Cloud]
end
CLI --> V3
CLI --> V4
SDK --> V4
V3 --> SS
V4 --> SS
SS --> DB
V3 --> Local
V3 --> Remote
V4 --> Local
V4 --> Remote
Architecture
Server Versions
| Version | Package | Description |
|---|---|---|
| v3 | packages/server-v3 | Legacy server implementation with SessionStore |
| v4 | packages/server-v4 | Current production server with routing and database integration |
Component Structure
The server infrastructure consists of three primary layers:
- HTTP Server Layer (server.ts): Handles incoming requests, manages WebSocket upgrades for CDP connections, and orchestrates session routing
- Session Management Layer (SessionStore.ts): Maintains in-memory and persisted session state, handles cleanup and timeout management
- Browser Execution Layer: Interfaces with Playwright locally or Browserbase remotely for actual browser automation
Request Flow
sequenceDiagram
participant Client
participant Server
participant SessionStore
participant Browser
Client->>Server: POST /browsersession (create)
Server->>SessionStore: allocate session
SessionStore->>Browser: launch/attach
Browser-->>SessionStore: session ready
SessionStore-->>Server: session handle
Server-->>Client: session ID + CDP endpoint
Client->>Server: WS /browsersession/:id/cdp
Server->>Browser: proxy CDP messages
Browser-->>Server: CDP responses
Server-->>Client: WS stream
Session Management
SessionStore
The SessionStore class manages browser session lifecycle:
| Method | Purpose |
|---|---|
| create() | Initialize new browser session |
| get(id) | Retrieve session by ID |
| list() | List all active sessions |
| delete(id) | Terminate and cleanup session |
| cleanup() | Remove expired/stale sessions |
Sessions store metadata including:
- Session identifier
- Creation timestamp
- Last activity timestamp
- Browser type and configuration
- CDP WebSocket URL
- Associated project/API key
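The method table and metadata list above suggest a simple keyed store. The sketch below is an in-memory stand-in under that assumption; `InMemorySessionStore` and `SessionRecord` are hypothetical names, timestamps are passed in explicitly so the behaviour is deterministic, and browser launching and persistence are left out.

```typescript
// Hypothetical record holding a subset of the metadata listed above.
interface SessionRecord {
  id: string;
  createdAt: number;
  lastActivityAt: number;
  cdpUrl: string;
}

class InMemorySessionStore {
  private sessions = new Map<string, SessionRecord>();
  private nextId = 1;
  private idleTimeoutMs: number;

  constructor(idleTimeoutMs: number) {
    this.idleTimeoutMs = idleTimeoutMs;
  }

  create(now: number): SessionRecord {
    const id = `session-${this.nextId++}`;
    const record: SessionRecord = {
      id,
      createdAt: now,
      lastActivityAt: now,
      cdpUrl: `/browsersession/${id}/cdp`,
    };
    this.sessions.set(id, record);
    return record;
  }

  get(id: string): SessionRecord | undefined {
    return this.sessions.get(id);
  }

  list(): SessionRecord[] {
    return Array.from(this.sessions.values());
  }

  delete(id: string): boolean {
    return this.sessions.delete(id);
  }

  // Remove sessions idle for longer than the timeout; returns the removed ids.
  cleanup(now: number): string[] {
    const expired = this.list()
      .filter((s) => now - s.lastActivityAt > this.idleTimeoutMs)
      .map((s) => s.id);
    expired.forEach((id) => this.sessions.delete(id));
    return expired;
  }
}
```

A production store would also close the underlying browser when a session is deleted or expires, which this sketch does not attempt.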
Session Lifecycle
Sessions follow this state machine:
stateDiagram-v2
[*] --> Created: allocate()
Created --> Launching: spawn browser
Launching --> Ready: browser connected
Ready --> Active: first CDP command
Active --> Idle: no activity timeout
Idle --> Active: new CDP command
Active --> Closing: close() or timeout
Closing --> [*]: cleanup complete
Launching --> Error: spawn failed
Error --> [*]: cleanup
API Endpoints
Browser Session Routes
The primary routing module at packages/server-v4/src/routes/v4/browsersession/routes.ts defines the session REST API:
| Endpoint | Method | Description |
|---|---|---|
| /browsersession | POST | Create new browser session |
| /browsersession/:id | GET | Get session details |
| /browsersession/:id | DELETE | Close session |
| /browsersession/:id/screenshot | GET | Capture screenshot |
| /browsersession/:id/snapshot | GET | Get accessibility tree |
Query Parameters
| Parameter | Type | Description |
|---|---|---|
| headless | boolean | Run without visible window |
| viewport.width | number | Browser viewport width |
| viewport.height | number | Browser viewport height |
| contextId | string | Browserbase Context ID to resume |
| persist | boolean | Persist session across disconnects |
Database Integration
The database client at packages/server-v4/src/db/client.ts provides the persistence layer:
- Session persistence: Store session metadata for recovery
- Audit logging: Track session creation, commands, and errors
- Usage metrics: Monitor API usage per project/key
Configuration
Environment Variables
| Variable | Description |
|---|---|
| `BROWSERBASE_API_KEY` | Browserbase API key for remote execution |
| `BROWSERBASE_PROJECT_ID` | Browserbase project identifier |
| `DATABASE_URL` | PostgreSQL connection string for session storage |
| `PORT` | HTTP server port (default: 3000) |
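A configuration loader for the variables above might look like the following. This is only a sketch: `ServerConfig` and `loadServerConfig` are hypothetical names, the port validation is an added assumption, and the environment is passed in as a plain object so the function stays easy to test.

```typescript
// Hypothetical typed view of the environment variables in the table above.
interface ServerConfig {
  port: number;
  databaseUrl: string | undefined;
  browserbaseApiKey: string | undefined;
  browserbaseProjectId: string | undefined;
}

type Env = { [name: string]: string | undefined };

function loadServerConfig(env: Env): ServerConfig {
  const rawPort = env["PORT"];
  // PORT falls back to 3000 when unset or empty, as documented above.
  const port = rawPort === undefined || rawPort === "" ? 3000 : Number(rawPort);
  // Assumption: reject anything that is not an integer TCP port.
  if (!(port > 0 && port < 65536 && Math.floor(port) === port)) {
    throw new Error("Invalid PORT: " + rawPort);
  }
  return {
    port: port,
    databaseUrl: env["DATABASE_URL"],
    browserbaseApiKey: env["BROWSERBASE_API_KEY"],
    browserbaseProjectId: env["BROWSERBASE_PROJECT_ID"],
  };
}
```

In a Node.js process this would typically be called once at startup as `loadServerConfig(process.env)`.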
Execution Modes
| Mode | Trigger | Browser Location |
|---|---|---|
| `local` | Default (no API key) | Local Playwright installation |
| `remote` | `BROWSERBASE_API_KEY` set | Browserbase cloud browsers |
CDP Proxying
The server acts as a WebSocket proxy for Chrome DevTools Protocol commands:
- Client establishes a WebSocket connection to /browsersession/:id/cdp
- Server forwards messages to the underlying browser session
- Responses and events stream back to the client
- Connection persists until the client disconnects or the session expires
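The forwarding described above can be sketched with a transport-agnostic relay. The `Channel` interface, `relay`, and `classifyCdpMessage` are hypothetical names introduced for illustration; a real server would wrap actual WebSocket objects and handle connection close and session expiry, which this sketch deliberately leaves out. The classification relies on the CDP wire format, where commands carry an `id` and a `method`, responses carry an `id` only, and events carry a `method` only.

```typescript
// Hypothetical minimal transport abstraction (not a real WebSocket API).
interface Channel {
  send(data: string): void;
  onMessage(handler: (data: string) => void): void;
}

type CdpKind = "command" | "response" | "event" | "invalid";

// Classifies a raw CDP frame by the presence of `id` and `method`.
function classifyCdpMessage(raw: string): CdpKind {
  let msg: unknown;
  try {
    msg = JSON.parse(raw);
  } catch (e) {
    return "invalid";
  }
  if (typeof msg !== "object" || msg === null) return "invalid";
  const m = msg as { id?: unknown; method?: unknown };
  const hasId = typeof m.id === "number";
  const hasMethod = typeof m.method === "string";
  if (hasId && hasMethod) return "command";
  if (hasId) return "response";
  if (hasMethod) return "event";
  return "invalid";
}

// Forwards client commands to the browser and streams everything back.
function relay(client: Channel, browser: Channel): void {
  client.onMessage((data) => {
    // Assumption: only well-formed commands are forwarded upstream.
    if (classifyCdpMessage(data) === "command") browser.send(data);
  });
  browser.onMessage((data) => {
    // Responses and events both flow back to the client unchanged.
    client.send(data);
  });
}
```

Filtering on the client side is a design choice of this sketch; a pure pass-through proxy would forward every frame in both directions.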
CLI Integration
The browse CLI commands connect to the server:
# Local execution (daemon auto-starts)
browse open https://example.com
# Remote execution via Browserbase
browse env remote
browse open https://example.com
# Specify server endpoint
browse --ws ws://localhost:3000 open https://example.com
The daemon architecture provides:
- Automatic server startup/shutdown
- Session persistence across CLI invocations
- Graceful recovery on connection failure
Security Considerations
- Session tokens are generated securely and scoped to individual sessions
- CDP endpoints require valid session authentication
- Remote execution uses Browserbase's authentication infrastructure
- Sessions auto-expire after configurable inactivity timeout
References
- CLI Daemon: packages/cli/README.md
- Session Context: packages/core/lib/v3/understudy/context.ts
- Frame Locator: packages/core/lib/v3/understudy/frameLocator.ts
- Lifecycle Watcher: packages/core/lib/v3/understudy/lifecycleWatcher.ts
Source: https://github.com/browserbase/stagehand / Human Manual
CLI Tools
Related topics: Server API, CDP Engine
The Stagehand CLI (browse command) is a command-line interface for browser automation designed specifically for AI agents. It provides a text-based interface for controlling web browsers through CDP (Chrome DevTools Protocol), enabling programmatic browser control without requiring a graphical interface.
Overview
The CLI serves as the foundational tool layer for the Stagehand browser automation framework. It abstracts CDP operations into human-readable commands, making it accessible for shell scripts, AI agent integrations, and development workflows.
The CLI operates in two modes:
| Mode | Description | Default Condition |
|---|---|---|
| `local` | Runs Chrome locally via bundled Playwright | When no `BROWSERBASE_API_KEY` is set |
| `remote` | Connects to Browserbase cloud infrastructure | When `BROWSERBASE_API_KEY` is set |
Sources: packages/cli/README.md
Architecture
graph TD
A[User / AI Agent] --> B[browse CLI]
B --> C{browse env mode}
C -->|local| D[Local Chrome CDP]
C -->|remote| E[Browserbase Cloud]
D --> F[Local Strategy]
E --> G[Remote Strategy]
F --> H[Chrome Instance]
G --> I[Browserbase Infrastructure]
H --> J[CDP WebSocket]
I --> J
Daemon Architecture
The CLI uses a persistent daemon process for efficiency:
- First invocation starts the daemon in the background
- Subsequent commands reuse the existing daemon
- The daemon manages Chrome lifecycle and CDP connections
Recovery behavior when a command cannot reach the daemon:
- Detects stale daemon
- Cleans up old files
- Restarts the daemon
- Retries the command
Sources: packages/cli/README.md
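The recovery steps listed above (detect a stale daemon, clean up, restart, retry) can be sketched as a single-retry wrapper. The `DaemonControl` interface and `runWithRecovery` are hypothetical and synchronous for brevity; how the actual CLI detects staleness and which files it removes are not specified here.

```typescript
// Hypothetical hooks; the real CLI's daemon internals are not shown in this manual.
interface DaemonControl {
  isStale(): boolean;             // assumption: e.g. a leftover PID or socket file with no live process
  cleanupFiles(): void;           // remove old daemon files
  start(): void;                  // start a fresh daemon
  run(command: string[]): string; // send one command, return its output
}

function runWithRecovery(daemon: DaemonControl, command: string[]): string {
  try {
    return daemon.run(command);
  } catch (err) {
    // Only recover when the failure is explained by a stale daemon.
    if (!daemon.isStale()) throw err;
    daemon.cleanupFiles();
    daemon.start();
    // Retry exactly once; a second failure propagates to the caller.
    return daemon.run(command);
  }
}
```

Limiting recovery to one retry avoids restart loops when the failure is not caused by the daemon at all.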
Navigation Commands
| Command | Description |
|---|---|
| `browse open <url>` | Navigate to a URL |
| `browse reload` | Reload current page |
| `browse back` | Navigate back in history |
| `browse forward` | Navigate forward in history |
Open Command Options
| Option | Description | Default |
|---|---|---|
| `--wait <state>` | Wait for load state | `load` |
| `-t, --timeout <ms>` | Page load timeout | 30000 ms |
| `--context-id <id>` | Load Browserbase Context | |
| `--persist` | Persist Context across sessions | |
The --timeout flag controls how long to wait for the page load state. Use longer timeouts for slow-loading pages:
browse open https://slow-site.com --timeout 60000
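When driving the CLI from a script or agent, the option table above maps naturally onto an argument builder. This is a sketch under assumptions: `OpenOptions` and `buildOpenArgs` are hypothetical helper names, and only the flags listed in the table are emitted.

```typescript
// Hypothetical option bag mirroring the Open Command Options table.
interface OpenOptions {
  wait?: string;      // load state passed to --wait
  timeoutMs?: number; // page load timeout in milliseconds
  contextId?: string; // Browserbase Context to load
  persist?: boolean;  // persist the Context across sessions
}

// Returns the argument vector to pass after the `browse` executable.
function buildOpenArgs(url: string, opts: OpenOptions = {}): string[] {
  const args = ["open", url];
  if (opts.wait !== undefined) args.push("--wait", opts.wait);
  if (opts.timeoutMs !== undefined) args.push("--timeout", String(opts.timeoutMs));
  if (opts.contextId !== undefined) args.push("--context-id", opts.contextId);
  if (opts.persist) args.push("--persist");
  return args;
}
```

The resulting array can be handed to a process launcher such as Node's `child_process.spawn("browse", args)`, which avoids shell quoting issues with URLs.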
Interaction Commands
Click Actions
| Command | Description |
|---|---|
| `browse click <ref>` | Click by element ref (e.g., @0-5) |
| `browse click <ref> [-b button]` | Click with button (left/right/middle) |
| `browse click <ref> [-c count]` | Click multiple times |
| `browse click_xy <x> <y>` | Click at coordinates |
Coordinate Actions
| Command | Description |
|---|---|
| `browse hover <x> <y>` | Hover at coordinates |
| `browse scroll <x> <y> <dx> <dy>` | Scroll by (dx, dy) from position (x, y) |
| `browse drag <fx> <fy> <tx> <ty>` | Drag from (fx, fy) to (tx, ty) |
Keyboard Commands
| Command | Description |
|---|---|
| `browse type <text>` | Type text into focused element |
| `browse type <text> [-d delay]` | Type with character delay |
| `browse press <key>` | Press keyboard key |
Supported keys include standard keys (Enter, Tab, Escape) and modifier combinations (Cmd+A, Cmd+C, etc.).
Form Handling
| Command | Description |
|---|---|
| `browse fill <selector> <value>` | Fill form field |
| `browse select <selector> <values...>` | Select dropdown options |
| `browse highlight <selector>` | Highlight element |
The fill command automatically presses Enter after typing. Use --no-press-enter to prevent this behavior.
Page Information
| Command | Description |
|---|---|
| `browse get url` | Get current URL |
| `browse get title` | Get page title |
| `browse get text <selector>` | Get element text content |
| `browse get html <selector>` | Get element HTML |
| `browse get value <selector>` | Get form field value |
| `browse get box <selector>` | Get center coordinates |
| `browse get markdown [selector]` | Convert HTML to markdown |
The snapshot command returns the accessibility tree with element references:
browse snapshot [-c|--compact]
Output format includes refs like [0-5], [1-2]:
RootWebArea "Example" url="https://example.com"
[0-0] link "Home"
[0-1] link "About"
[0-2] button "Sign In"
Sources: packages/cli/README.md
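An agent consuming the text snapshot usually needs the refs as structured data. The parser below is a minimal sketch that assumes each element line looks like the sample output above (a bracketed ref, a role, and an optional quoted name); `SnapshotNode` and `parseSnapshot` are hypothetical names, and real snapshots may contain additional attributes that this regex ignores.

```typescript
interface SnapshotNode {
  ref: string;  // e.g. "0-2"
  role: string; // e.g. "button"
  name: string; // accessible name, empty when absent
}

// Extracts ref, role, and name from lines such as: [0-2] button "Sign In"
function parseSnapshot(text: string): SnapshotNode[] {
  const nodes: SnapshotNode[] = [];
  const pattern = /^\s*\[(\d+-\d+)\]\s+(\S+)(?:\s+"([^"]*)")?/;
  for (const line of text.split("\n")) {
    const m = pattern.exec(line);
    // Lines without a bracketed ref (such as the root line) are skipped.
    if (!m) continue;
    nodes.push({ ref: m[1], role: m[2], name: m[3] !== undefined ? m[3] : "" });
  }
  return nodes;
}
```

With the sample output above, the third parsed node has ref `0-2`, which can then be passed to `browse click`.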
Waiting Commands
| Command | Description |
|---|---|
| `browse wait load [state]` | Wait for page load state |
| `browse wait selector <selector>` | Wait for element |
| `browse wait timeout <ms>` | Wait for duration |
Selector wait options:
| Option | Description | Default |
|---|---|---|
| `-t timeout` | Maximum wait time | - |
| `-s visible\|hidden\|attached\|detached` | Element state | visible |
Multi-Tab Management
| Command | Description |
|---|---|
| `browse pages` | List all open tabs |
| `browse newpage [url]` | Open new tab |
| `browse tab_switch <n>` | Switch to tab by index |
| `browse tab_close [n]` | Close tab (default: last) |
Network Capture
Enable network request capture for debugging and inspection:
| Command | Description |
|---|---|
| `browse network on` | Start capturing requests |
| `browse network off` | Stop capturing |
| `browse network path` | Get capture directory |
| `browse network clear` | Clear captured requests |
Capture Directory Structure
/tmp/browse-default-network/
001-GET-api.github.com-repos/
request.json # method, url, headers, body
response.json # status, headers, body, duration
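To post-process a capture, a script might first decode the per-request directory names. The sketch below infers the naming scheme from the single example above (sequence number, HTTP method, then a host-and-path slug); that scheme, and the names `CaptureEntry` and `parseCaptureDirName`, are assumptions rather than documented behavior.

```typescript
interface CaptureEntry {
  sequence: number; // e.g. 1 for "001"
  method: string;   // e.g. "GET"
  target: string;   // remaining host/path slug, left unsplit
}

// Parses names like "001-GET-api.github.com-repos"; returns null when the shape differs.
function parseCaptureDirName(name: string): CaptureEntry | null {
  const m = /^(\d+)-([A-Z]+)-(.+)$/.exec(name);
  if (!m) return null;
  return { sequence: parseInt(m[1], 10), method: m[2], target: m[3] };
}
```

The slug is kept whole because hostnames may themselves contain hyphens, so host and path cannot be separated reliably from the name alone; `request.json` holds the full URL.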
Session Management
| Command | Description |
|---|---|
| `browse start` | Start daemon |
| `browse stop` | Stop daemon |
| `browse status` | Check daemon status |
| `browse env <target>` | Set environment mode |
Session Options
| Option | Description |
|---|---|
| `--session <name>` | Session name (default: "default") |
| `--headless` | Run Chrome headless |
| `--headed` | Run Chrome with visible window |
| `--ws <url\|port>` | One-shot CDP connection |
Environment Modes
# Start with specific mode
browse env local
browse env remote
# Attach to running Chrome
browse env local <port|url>
# Persist override and restart daemon
browse env <target> --session <name>
# Clear override (fallback to env-var detection)
browse stop
Auto-detection priority:
- Check the persisted `browse env` setting
- Check the `BROWSE_SESSION` environment variable
- Default: `remote` if `BROWSERBASE_API_KEY` is set, else `local`
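One possible reading of this priority order is sketched below. It assumes that persisted overrides are stored per session and that `BROWSE_SESSION` only selects which session's override is consulted; the manual does not spell this out, so treat `resolveMode` and the `overrides` map as hypothetical.

```typescript
type Mode = "local" | "remote";
type Env = { [name: string]: string | undefined };

// overrides: hypothetical map from session name to a mode persisted by `browse env`.
function resolveMode(env: Env, overrides: { [session: string]: Mode }): Mode {
  // The session name falls back to "default" when BROWSE_SESSION is unset.
  const session = env["BROWSE_SESSION"] || "default";
  // 1. A persisted override for the session wins.
  if (Object.prototype.hasOwnProperty.call(overrides, session)) {
    return overrides[session];
  }
  // 2. Otherwise fall back to env-var detection; an empty key counts as unset.
  return env["BROWSERBASE_API_KEY"] ? "remote" : "local";
}
```

Keeping the function pure (environment and overrides passed in) makes the fallback chain easy to verify in isolation.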
Element References
After running browse snapshot, elements can be referenced by their ref ID:
# Get snapshot with refs
browse snapshot -c
# Click using ref (multiple formats)
browse click @0-2 # @ prefix
browse click 0-2 # Plain ref
browse click --ref 0-2 # --ref flag
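Tooling that wraps the CLI may want to normalize refs before passing them on. The helper below is a sketch: `normalizeRef` is a hypothetical name, and accepting the bracketed form copied from snapshot output is a convenience of this example, not a documented CLI input format.

```typescript
// Accepts "@0-2", "0-2", or "[0-2]" and returns the plain ref "0-2".
function normalizeRef(input: string): string {
  const m = /^(?:@?(\d+-\d+)|\[(\d+-\d+)\])$/.exec(input.trim());
  if (!m) throw new Error("Not an element ref: " + input);
  return m[1] !== undefined ? m[1] : m[2];
}
```

Normalizing to the plain form means the caller can always emit `browse click --ref <ref>` regardless of how the ref was written.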
Global Options
| Option | Description |
|---|---|
| `--session <name>` | Session name for multiple browsers |
| `--headless` | Run Chrome in headless mode |
| `--headed` | Run Chrome with visible window |
| `--ws <url\|port>` | One-shot CDP connection (bypasses daemon) |
| `--json` | Output as JSON |
Environment Variables
| Variable | Description |
|---|---|
| `BROWSE_SESSION` | Default session name |
| `BROWSERBASE_API_KEY` | Browserbase API key (required for remote mode) |
Output Formats
The CLI supports JSON output for programmatic consumption:
browse get url --json
browse snapshot --json
This is particularly useful for AI agent integrations that need structured data.
Related Documentation
- Stagehand Core Package - Browser automation primitives
- Browserbase Platform - Remote browser infrastructure
- CDP Protocol - Chrome DevTools Protocol reference
Sources: [packages/cli/README.md](https://github.com/browserbase/stagehand/blob/main/packages/cli/README.md)
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Doramagic extracted 7 source-linked risk signals. Review them before installing or handing real data to the project.
1. Project risk: Project risk needs validation
- Severity: medium
- Finding: Project risk is backed by a source signal: Project risk needs validation. Treat it as a review item until the current version is checked.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: identity.distribution | github_repo:776908852 | https://github.com/browserbase/stagehand | repo=stagehand; install=create-browser-app
2. Capability assumption: README/documentation is current enough for a first validation pass.
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: The project should not be treated as fully validated until this signal is reviewed.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: capability.assumptions | github_repo:776908852 | https://github.com/browserbase/stagehand | README/documentation is current enough for a first validation pass.
3. Maintenance risk: Maintainer activity is unknown
- Severity: medium
- Finding: Maintenance risk is backed by a source signal: Maintainer activity is unknown. Treat it as a review item until the current version is checked.
- User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: evidence.maintainer_signals | github_repo:776908852 | https://github.com/browserbase/stagehand | last_activity_observed missing
4. Security or permission risk: no_demo
- Severity: medium
- Finding: no_demo
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: downstream_validation.risk_items | github_repo:776908852 | https://github.com/browserbase/stagehand | no_demo; severity=medium
5. Security or permission risk: no_demo
- Severity: medium
- Finding: no_demo
- User impact: The project may affect permissions, credentials, data exposure, or host boundaries.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: risks.scoring_risks | github_repo:776908852 | https://github.com/browserbase/stagehand | no_demo; severity=medium
6. Maintenance risk: issue_or_pr_quality=unknown
- Severity: low
- Finding: issue_or_pr_quality=unknown.
- User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: evidence.maintainer_signals | github_repo:776908852 | https://github.com/browserbase/stagehand | issue_or_pr_quality=unknown
7. Maintenance risk: release_recency=unknown
- Severity: low
- Finding: release_recency=unknown.
- User impact: Users cannot judge support quality until recent activity, releases, and issue response are checked.
- Recommended check: Open the linked source, confirm whether it still applies to the current version, and keep the first run isolated.
- Evidence: evidence.maintainer_signals | github_repo:776908852 | https://github.com/browserbase/stagehand | release_recency=unknown
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using stagehand with real data or production workflows.
- Storage state persistence broken in Stagehand v3 - userDataDir doesn't w - github / github_issue
- Reduce Bundle Size - github / github_issue
- Feature: pay-per-call captcha unstuck via x402 (Onyx Actions) - github / github_issue
- stagehand/server-v3 v3.6.6 - github / github_release
- stagehand/server-v3 v3.6.5 - github / github_release
- @browserbasehq/[email protected] - github / github_release
- stagehand/server-v3 v3.6.3 - github / github_release
- stagehand/server-v3 v3.6.2 - github / github_release
- @browserbasehq/[email protected] - github / github_release
- @browserbasehq/[email protected] - github / github_release
- Project risk needs validation - GitHub / issue
Source: Project Pack community evidence and pitfall evidence