midscene Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

midscene


Introduction & System Architecture

Related topics: Core AI Engine, Planning & Model Strategy, Platform Drivers & Device Abstraction, Developer Tools: CLI, MCP, Skills, Studio, Recorder & Visualizer


Overview and Purpose

Midscene.js is an AI-powered UI automation SDK that controls applications, performs assertions, and extracts structured data using natural-language instructions. The project is positioned as a "vision-driven" engine: it localizes elements from screenshots rather than from DOM or accessibility trees, which allows it to target icon-only buttons, <canvas> elements, custom controls, cross-origin iframes, and native mobile or desktop surfaces that DOM-based automation cannot reach (README.md:35-55). The same engine powers both test automation and general-purpose UI scripting, exposed via a JavaScript/TypeScript SDK, a Chrome extension, and YAML-based scripting (packages/core/README.md:1-5).

Package versions move in lockstep across the workspace: the current published version is 1.9.7 for every official package, including @midscene/core, @midscene/web, @midscene/android, @midscene/ios, @midscene/cli, and @midscene/playground (packages/core/package.json:4, packages/web-integration/package.json:5, packages/android/package.json:4, packages/ios/package.json:4, packages/cli/package.json:4, packages/playground/package.json:4).

Repository Topology and Package Layout

The repository is a pnpm workspace containing device-specific automation packages, a shared abstraction layer, developer tools, and example applications. Every package advertises the same description and homepage (https://midscenejs.com/), confirming that they are coordinated releases of a single product (packages/core/package.json:2-3, packages/shared/package.json, packages/cli/package.json:2-3).

| Package | Role | Source |
| --- | --- | --- |
| `@midscene/core` | Vision-driven engine, agent orchestration, AI model adapters | packages/core/package.json:2-3 |
| `@midscene/shared` | Cross-package utilities (recorder types, helpers) | packages/shared/src/recorder.ts:1-20 |
| `@midscene/web` | Browser integration (Playwright, Puppeteer, bridge mode) | packages/web-integration/README.md:1-10 |
| `@midscene/android` | Android device automation (scrcpy + yadb) | packages/android/package.json:2-9 |
| `@midscene/ios` | iOS simulator/device automation | packages/ios/package.json:2-10 |
| `@midscene/cli` | Unified CLI entry point (`midscene` bin) | packages/cli/package.json:11-13 |
| `@midscene/playground` | Web playground utilities (Express + CORS) | packages/playground/package.json:2-7 |
| `@midscene/visualizer` | Playback UI for recorded/AI runs | packages/visualizer/README.md:1-3 |
| `apps/chrome-extension` | Recorder + runner shipped as a browser extension | apps/chrome-extension/src/extension/recorder/utils.ts |
| `@midscene/harmony-mcp` | HarmonyOS MCP server bridge | packages/harmony-mcp/package.json |

The CLI package is the integration hub: it depends on every platform adapter (@midscene/web, @midscene/android, @midscene/ios, @midscene/computer, @midscene/harmony) and on @midscene/core and @midscene/shared, exposing them through a single midscene binary (packages/cli/package.json:17-25). Similarly, @midscene/core consumes @midscene/shared and UI-TARS action parsing, which makes it the engine that all platform adapters share (packages/core/package.json:64-72).

System Architecture and Data Flow

Midscene separates three concerns: the platform adapter (which drives real input/output on a device), the core agent (which plans and verifies), and the AI model (which sees screenshots and returns structured instructions). The same core agent is reused across every platform, while each @midscene/<platform> package supplies the adapter that knows how to capture screenshots and dispatch taps, swipes, and text input on that surface (packages/core/package.json:1-5, packages/web-integration/README.md:1-5, packages/android/package.json:2-9).

flowchart LR
  user[User / Test Runner] --> api[Platform SDK<br/>@midscene/web · android · ios · harmony · computer]
  api --> agent[Core Agent<br/>@midscene/core]
  agent --> screenshot[Screenshot + optional DOM]
  agent --> llm[Multimodal Model<br/>Qwen3.x · Doubao · Gemini · UI-TARS]
  llm --> agent
  agent --> adapter[Action Dispatch<br/>tap · swipe · type · assert]
  adapter --> device[Target UI<br/>Web · Android · iOS · Harmony · Desktop]
  recorder[Recorder / Visualizer] -. replay .-> api
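
The capture, plan, dispatch loop in the diagram can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the @midscene/core implementation: `SketchAction`, `SketchAdapter`, `SketchModel`, `runInstruction`, `createFakeAdapter`, and `scriptedModel` are all hypothetical names, and the loop is kept synchronous for brevity where a real adapter and model call would be asynchronous.

```typescript
// Hypothetical shapes for illustration only; these are not the @midscene/core types.
type SketchAction =
  | { kind: "tap"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "done" };

interface SketchAdapter {
  screenshot(): string; // stand-in for an encoded image of the current UI
  dispatch(action: SketchAction): void; // performs input on the target UI
}

// Stand-in for the multimodal model: screenshot + instruction in, one action out.
type SketchModel = (screenshot: string, instruction: string) => SketchAction;

// Core loop: capture, ask the model, dispatch, repeat until done or out of steps.
function runInstruction(
  adapter: SketchAdapter,
  model: SketchModel,
  instruction: string,
  maxSteps = 5,
): SketchAction[] {
  const performed: SketchAction[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const action = model(adapter.screenshot(), instruction);
    if (action.kind === "done") break;
    adapter.dispatch(action);
    performed.push(action);
  }
  return performed;
}

// In-memory fake adapter whose "screenshot" encodes how many actions were dispatched.
function createFakeAdapter(): SketchAdapter & { log: SketchAction[] } {
  const log: SketchAction[] = [];
  return {
    log,
    screenshot: () => `frame-${log.length}`,
    dispatch: (action: SketchAction) => {
      log.push(action);
    },
  };
}

// Scripted fake model: tap first, then type, then report completion.
const scriptedModel: SketchModel = (screenshot) => {
  if (screenshot === "frame-0") return { kind: "tap", x: 10, y: 20 };
  if (screenshot === "frame-1") return { kind: "type", text: "hello" };
  return { kind: "done" };
};
```

Because the agent only sees the adapter through two methods, swapping the fake for a browser or device adapter would not change the loop, which is the property the architecture relies on.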

The core package also exports skill and MCP entry points (./skill, ./mcp), so external agents (for example, OpenClaw via midscene-skills) can drive Midscene through a model-context-protocol interface rather than the JS SDK directly (packages/core/package.json:24-34). The Chrome extension and the shared recorder module share the same event model: MidsceneRecorderTarget, MidsceneRecorderGeneratedCode, and MidsceneRecorderMarkdownScreenshotAsset describe platform IDs, generated artifacts (Markdown, YAML, Playwright), and screenshot assets, respectively (packages/shared/src/recorder.ts:1-25). This recorder contract is what powers the community-requested ability to export AI execution steps as reusable Playwright scripts (issue #2240) and to handle elements outside the current viewport via recorded event playback (issue #179).

Platform Support, Extensibility, and Community

The official adapter matrix covers browser (Playwright/Puppeteer), Android, iOS, HarmonyOS, and desktop via libnut. The apps/site/docs getting-started matrix lists all five platforms (README.md:65-75), and the CLI exposes HarmonyOS as a first-class target by depending on @midscene/harmony (packages/cli/package.json:17-21). HarmonyOS support is still maturing — community members continue to ask for richer UI automation on that platform (issue #1594), and the @midscene/harmony-mcp package is the experimental MCP bridge for it (packages/harmony-mcp/package.json).

Community extensions sit alongside the official packages. The README curates an "Awesome Midscene" list including midscene-pc (a Windows/macOS/Linux agent, issue #1389), midscene-ios (iOS mirror automation), Python and Java SDK ports, and Docker images for the PC server. These projects all consume the same core abstractions, which is why a third-party device adapter can be added by implementing the platform interface that @midscene/core expects.

For teams that want to plug domain knowledge into planning, the project tracks a long-running discussion about Retrieval-Augmented Generation support so that arbitrary LLMs can interpret product-specific instructions (issue #426). Until that lands, the recommended mitigation is to opt in to DOM context for data-extraction and page-understanding tasks while keeping action planning strictly visual, as described in the project README's model strategy section (README.md:80-90).

Core AI Engine, Planning & Model Strategy

Related topics: Introduction & System Architecture, Platform Drivers & Device Abstraction


Overview and Scope

Midscene's core AI engine is the vision-driven automation brain shared across every platform integration in the monorepo. Unlike traditional automation tools that read the DOM or the accessibility tree, the engine is built around multimodal models that localize elements from screenshots alone, with natural language used to describe each step. As stated in the main README, "Midscene is all-in on pure vision for UI actions: element localization is based on screenshots only" (README.md). This design removes the dependency on fragile selectors and makes the same engine usable for web, Android, iOS, HarmonyOS, and desktop surfaces.

The engine lives in the @midscene/core package and is consumed by every downstream integration. The package's package.json describes its role as "Automate browser actions, extract data, and perform assertions using AI. It offers JavaScript SDK, Chrome extension, and support for scripting in YAML" (packages/core/package.json). The exported entry points (e.g., ./ai-model, ./agent, ./device, ./tree) reveal the engine's internal layering and provide the surface area used by the web, iOS, Android, computer, and CLI packages.

Architecture and Layering

The engine is delivered as a workspace of cooperating packages rather than a single monolithic module. The core package owns the planning and AI model logic, while thin integration packages translate the abstract "device" interface into platform-specific input mechanisms.

flowchart TB
  User[User / Test Author] -->|natural language| SDK[Platform SDK<br/>@midscene/web, /android, /ios, /computer, /harmony]
  SDK --> Agent[Agent + Task Builder<br/>@midscene/core]
  Agent --> Planning[Planning & Action Parsing<br/>llm-planning + @ui-tars/action-parser]
  Agent --> Inspect[Inspect / Extract<br/>llm-inspect]
  Planning --> Model[Multimodal Model<br/>Qwen / Doubao / GLM / Gemini / UI-TARS]
  Inspect --> Model
  Model -->|structured actions| Planning
  Planning --> Executor[Execution Session]
  Executor --> Device[AbstractDevice<br/>web / android / ios / computer]
  Device -->|screenshot| Inspect
  Device -->|input events| Target[Target Platform]

The @midscene/core package declares a dependency on @ui-tars/action-parser (version 1.2.3), which is the parser that converts raw model output into typed, executable actions (packages/core/package.json). This is the bridge between free-form LLM responses and the deterministic command set that the device layer expects.
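
To make the idea of "typed, executable actions" concrete, the sketch below validates a model reply into a discriminated union. It is an assumption-laden illustration of the general technique: it does not use or reproduce the @ui-tars/action-parser API, and the JSON reply format, the `ParsedAction` type, and `parseModelAction` are hypothetical.

```typescript
// Hypothetical action set; the real parser's output types are not shown in this manual.
type ParsedAction =
  | { type: "tap"; x: number; y: number }
  | { type: "input"; text: string }
  | { type: "scroll"; direction: "up" | "down" };

// Turn a free-form model reply into a validated action, or throw.
function parseModelAction(raw: string): ParsedAction {
  // Models often wrap JSON in prose, so take the outermost {...} span.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end < start) {
    throw new Error("no JSON object found in model output");
  }
  const data: unknown = JSON.parse(raw.slice(start, end + 1));
  if (typeof data !== "object" || data === null) {
    throw new Error("model output is not a JSON object");
  }
  const fields = data as Record<string, unknown>;
  const x = fields["x"];
  const y = fields["y"];
  const text = fields["text"];
  const direction = fields["direction"];

  switch (fields["type"]) {
    case "tap":
      if (typeof x === "number" && typeof y === "number") {
        return { type: "tap", x, y };
      }
      break;
    case "input":
      if (typeof text === "string") return { type: "input", text };
      break;
    case "scroll":
      if (direction === "up" || direction === "down") {
        return { type: "scroll", direction };
      }
      break;
  }
  throw new Error("unsupported or malformed action in model output");
}
```

Rejecting anything outside the known action set at this boundary keeps malformed model replies from reaching the device layer.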

The @midscene/web package builds on the core and adds JavaScript SDK, Chrome extension, and YAML scripting surfaces (packages/web-integration/package.json). It also exposes bridge-mode subpaths, allowing an external UI to control a running browser instance. The same pattern is mirrored for iOS via @midscene/ios, which ships its own bin/midscene-ios CLI and a dedicated mcp-server subpath (packages/ios/package.json). The @midscene/android package is the Android counterpart (packages/android/README.md), and the CLI package pulls every platform integration together into a single command-line entry point (packages/cli/package.json).

The recorder in the Chrome extension shows how the engine's outputs are surfaced to humans. The utility module builds structured "session → page → event" mind maps from captured sequences, preserving input values, element descriptions, and page context for later review (apps/chrome-extension/src/extension/recorder/utils.ts). This same execution trace is what the visualizer package renders for debugging (packages/visualizer/README.md).

Model Strategy

Midscene is explicitly model-agnostic and is driven by multimodal models with strong UI grounding. The README enumerates the supported families: Qwen3.x, Doubao-Seed-2.0, GLM-4.6V, gemini-3.5-flash, and UI-TARS, "including open-source options you can self-host" (README.md). Because localization is screenshot-only, the strategy is to keep the model contract narrow: send a screenshot plus a natural-language instruction, and receive a structured action back.

| Concern | Approach in Midscene |
| --- | --- |
| Element localization | Pure vision on screenshots |
| Action parsing | `@ui-tars/action-parser` post-processor (packages/core/package.json) |
| Data extraction / assertions | Optional DOM-augmented mode for richer understanding (README.md) |
| Model choice | Any multimodal model with UI grounding, including self-hosted OSS variants (README.md) |
| Execution control | AbstractDevice abstraction, one driver per platform |

This strategy is what makes the same engine work for browser, native mobile, and desktop. It also explains why the community has been able to build out a PC controller (midscene-pc) on top of the device interface — issue #1389 specifically requests adding this to the official "Awesome Midscene" list, noting that the project's "integration with any interface" feature is what made the PC device possible.

Planning, Inspection, and Community Friction Points

The engine splits the LLM call into two complementary paths: a planning path that produces action sequences for aiAct-style flows, and an inspect path that produces structured data and assertions. The ai-model and agent subpath exports in @midscene/core are the public seams for these paths (packages/core/package.json). The shared package, described as the home for "AI-powered automation SDK" primitives (packages/shared/README.md), holds cross-cutting types used by both paths.

Several recurring community requests highlight the limits of the current planning strategy and where it is being extended:

RAG over product knowledge. Issue #426 asks for retrieval-augmented generation so that LLMs can interpret high-level, domain-specific instructions before planning concrete steps. This sits directly on top of the planning path.
Elements outside the current viewport. Issue #179 reports that the planner can only target what is visible. Real pages often require scrolling, which is a planning concern (when to issue a scroll action) as well as an executor concern (the device must support the input).
Exporting to Playwright scripts. Issue #2240 proposes converting recorded AI execution steps into reusable Playwright code. The Chrome extension's recorder already captures the structured event sequence (apps/chrome-extension/src/extension/recorder/utils.ts), so the missing piece is a deterministic transpiler from that trace to Playwright.
HarmonyOS support. Issue #1594 asks for HarmonyOS automation, which would round out the platform set. The @midscene/harmony and @midscene/harmony-mcp packages are already present in the workspace, which the main README points to via the "HarmonyOS" getting-started guide (README.md).
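
The viewport limitation in the list above is easiest to see as a loop: look at what is visible, and scroll only when the target is absent. The sketch below is a hypothetical workaround pattern, not Midscene's planner; `ScrollableSurface`, `findByScrolling`, and `createFakePage` are invented names, and a list of visible strings stands in for a screenshot plus model-based localization.

```typescript
// Hypothetical surface: visibleText() stands in for "screenshot + localization".
interface ScrollableSurface {
  visibleText(): string[];
  scrollDown(): boolean; // false once the end of the page is reached
}

// Returns how many scrolls were needed to bring the target into view, or null.
function findByScrolling(
  surface: ScrollableSurface,
  target: string,
  maxScrolls = 10,
): number | null {
  for (let scrolls = 0; scrolls <= maxScrolls; scrolls++) {
    if (surface.visibleText().indexOf(target) !== -1) return scrolls;
    // Stop when the scroll budget is spent or the page cannot scroll further.
    if (scrolls === maxScrolls || !surface.scrollDown()) return null;
  }
  return null;
}

// In-memory page that shows `viewportSize` rows at a time.
function createFakePage(rows: string[], viewportSize: number): ScrollableSurface {
  let offset = 0;
  return {
    visibleText: () => rows.slice(offset, offset + viewportSize),
    scrollDown: () => {
      if (offset + viewportSize >= rows.length) return false;
      offset += viewportSize;
      return true;
    },
  };
}
```

The split mirrors the two concerns named in the viewport item: deciding to scroll is a planning decision, while `scrollDown` is something the device must support.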

For a technical reader, the practical takeaway is that every new capability — whether RAG, viewport expansion, code export, or a new platform — plugs into the same three layers: the planner that talks to a multimodal model, the parser that turns model output into typed actions, and the AbstractDevice that executes them. The core package's export map is the contract to learn first; the platform-specific packages are mostly device implementations plus thin SDK ergonomics.

Platform Drivers & Device Abstraction

Related topics: Introduction & System Architecture, Core AI Engine, Planning & Model Strategy, Developer Tools: CLI, MCP, Skills, Studio, Recorder & Visualizer


Overview

Midscene is a multimodal, vision-driven UI automation SDK that ships separate platform driver packages for each runtime environment it targets, all of which plug into a shared device abstraction layer exposed by @midscene/core. The repository is organized as a pnpm monorepo with one package per driver surface — web, Android, iOS, HarmonyOS, and desktop ("computer") — plus shared utilities, a CLI, and a visualizer (README.md).

The drivers share a vision-first philosophy: instead of relying on DOM selectors or accessibility trees, every driver captures a screenshot of the current viewport and lets a multimodal model localize the target element. This is what allows the same aiAct, aiQuery, and aiAssert surface API to work uniformly across browsers, native mobile apps, and desktop applications (README.md).

The community context for this page reflects this multi-platform design:

Issue #1389 is a request to formally add midscene-pc, a community-built PC controller, to the curated "Awesome Midscene" list, reflecting community interest in desktop automation.
Issue #1594 asks about HarmonyOS support, which already exists as @midscene/harmony and @midscene/harmony-mcp.
Issue #179 requests out-of-viewport element targeting — a limitation tied to the screenshot-based device abstraction shared by every driver.

Package Layout and Driver Inventory

The monorepo's packages/ directory contains one driver per platform. Each is published independently and, where applicable, ships its own CLI binary and MCP server entry point.

| Package | Target | CLI Binary | MCP Server | Source |
| --- | --- | --- | --- | --- |
| `@midscene/web` | Browsers (Playwright, Puppeteer, Chrome extension, bridge mode) | none | none | packages/web-integration/package.json |
| `@midscene/android` | Android devices via adb/scrcpy/yadb | `midscene-android` | `./mcp-server` | packages/android/package.json |
| `@midscene/ios` | iOS simulator and devices | `midscene-ios` | `./mcp-server` | packages/ios/package.json |
| `@midscene/harmony` / `@midscene/harmony-mcp` | HarmonyOS devices | none | yes (separate MCP package) | packages/harmony-mcp/package.json |
| `@midscene/computer` | Desktop OS (Windows/macOS/Linux) | consumed by playground | none | packages/computer-playground/README.md |
| `@midscene/core` | Engine, agent, device abstraction | none | `./skill` | packages/core/package.json |
| `@midscene/shared` | Cross-driver types (recorder events, key layout) | none | none | packages/shared/README.md |
| `@midscene/cli` | Unified CLI that bundles all drivers | `midscene` | none | packages/cli/package.json |

@midscene/core is the only package that exports the ./device subpath (alongside ./agent, ./skill, ./ai-model, ./tree, and ./yaml), and every platform driver depends on it via the workspace protocol. The CLI shows the same wiring from the consumer side: @midscene/cli lists @midscene/core, @midscene/web, @midscene/android, @midscene/computer, @midscene/harmony, and @midscene/ios as workspace dependencies (packages/cli/package.json). This composition makes core the canonical host of the device abstraction that all drivers implement against.

The Web Driver: A Driver-of-Drivers

The web package is unusual because it does not bind to a single browser automation backend. @midscene/web ships several sub-entry points, most of which front a different browser control surface, plus shared helper modules (packages/web-integration/package.json, packages/web-integration/README.md):

. — the default export for direct Playwright/Puppeteer usage.
./bridge-mode — a server-side entry used when an automation host runs in Node and drives a browser elsewhere.
./bridge-mode-browser — the browser-side counterpart loaded into the page being driven.
./utils and ./ui-utils — helper modules shared by all four drivers.

flowchart LR
    User[Caller Code] --> WebPkg["@midscene/web"]
    WebPkg --> PW[Playwright driver]
    WebPkg --> PP[Puppeteer driver]
    WebPkg --> CE[Chrome extension driver]
    WebPkg --> BM[Bridge mode]
    PW --> Core["@midscene/core<br/>(device abstraction + agent)"]
    PP --> Core
    CE --> Core
    BM --> Core
    Core --> LLM[Multimodal model]

The bridge-mode split exists so that an MCP server (such as @midscene/android's ./mcp-server) can live on a developer machine while the browser is on another host, with JSON messages flowing across the bridge — a pattern reused by the mobile drivers (packages/web-integration/README.md).

Mobile and Desktop Drivers

The Android and iOS drivers each expose their own CLI binary and a ./mcp-server sub-entry, so they can be invoked either programmatically or surfaced to an external agent via the Model Context Protocol (packages/android/package.json, packages/ios/package.json). The Android package notably runs a prebuild step that downloads the scrcpy-server and yadb binaries, the native components that stream the device screen and inject input events (packages/android/package.json). HarmonyOS is laid out differently: its MCP support is split into a separate harmony-mcp package, which depends on @modelcontextprotocol/sdk and the MCP inspector for testing (packages/harmony-mcp/package.json).

The desktop driver is layered differently. @midscene/computer provides the core automation primitives, @midscene/computer-mac adds macOS-specific packaging (packages/computer-mac/package.json), and @midscene/computer-playground is a runnable playground app for Windows, macOS, and Linux (packages/computer-playground/README.md). This split lets the same engine run on any host OS while the playground provides a UI for manual exploration.

Shared Device Layer

All drivers converge on @midscene/core, which exports a ./device subpath alongside ./agent, ./skill, ./ai-model, and ./tree (packages/core/package.json). The device module is what gives every platform driver a uniform shape: a way to obtain the current screenshot, perform pointer/keyboard input, navigate, and observe the result, regardless of whether the underlying transport is Playwright, adb, libnut, or a HarmonyOS-specific protocol.

The @midscene/shared package carries the cross-driver data types that travel across that abstraction. recorder.ts defines MidsceneRecorderEvent, MidsceneRecorderTarget, and MidsceneRecorderGeneratedCode — the canonical event stream used by every platform's recording feature, with platformId discriminators that tell consumers whether an event came from a web, Android, iOS, Harmony, or computer session (packages/shared/src/recorder.ts). The us-keyboard-layout.ts module, adapted from the Puppeteer keyboard layout, provides a normalized KeyInput union plus KeyDefinition records so that the web, Android, and computer drivers can describe keystrokes using the same identifier space (packages/shared/src/us-keyboard-layout.ts).
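
A discriminator field such as platformId lets a consumer narrow an event to its platform-specific shape. The sketch below shows that pattern with a hypothetical `SketchRecorderEvent` union; the literal platform IDs and every field other than `platformId` are assumptions for illustration and are not taken from packages/shared/src/recorder.ts.

```typescript
// Hypothetical event union; only the platformId discriminator idea comes from the text.
type SketchRecorderEvent =
  | { platformId: "web"; action: string; url: string }
  | { platformId: "android" | "ios" | "harmony"; action: string; deviceId: string }
  | { platformId: "computer"; action: string; displayId: number };

// Narrow on platformId so each branch sees only the fields valid for that platform.
function describeRecorderEvent(event: SketchRecorderEvent): string {
  switch (event.platformId) {
    case "web":
      return `${event.action} on ${event.url}`;
    case "android":
    case "ios":
    case "harmony":
      return `${event.action} on ${event.platformId} device ${event.deviceId}`;
    case "computer":
      return `${event.action} on display ${event.displayId}`;
    default: {
      // Compile-time exhaustiveness check: adding a platform breaks this assignment.
      const unreachable: never = event;
      return unreachable;
    }
  }
}
```

The `never` assignment in the default branch is a design choice worth copying: a new platform ID added to the union fails to compile until every consumer handles it.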

Together these pieces implement the same vision-driven loop the README advertises: capture a screenshot from whichever device the driver controls, ask the multimodal model to localize the target, perform the input through the driver, and repeat until the assertion or query succeeds (README.md).

Developer Tools: CLI, MCP, Skills, Studio, Recorder & Visualizer

Related topics: Introduction & System Architecture, Core AI Engine, Planning & Model Strategy, Platform Drivers & Device Abstraction


Midscene ships a family of developer-facing tools that wrap the core vision-driven automation engine in different interfaces: a runnable command-line tool, an MCP server for AI agents, a Skill layer for autonomous execution, a Studio/Visualizer for inspecting runs, and a Chrome-extension Recorder that captures user actions and converts them into reusable artifacts. Together they let users operate Midscene from terminal sessions, IDE-embedded agents, in-browser debuggers, or recorder-style authoring flows. Source: README.md.

High-Level Architecture

The following diagram shows how the developer tools relate to the core automation engine.

flowchart LR
    User[User / Agent]
    CLI["@midscene/cli<br/>(midscene binary)"]
    MCP["@midscene/mcp<br/>&amp; @midscene/harmony-mcp"]
    Skill["@midscene/core/skill"]
    Recorder["Chrome Extension<br/>Recorder"]
    Studio["Studio / Visualizer"]
    Core["@midscene/core"]
    Engines["Web / Android / iOS /<br/>Harmony / Computer"]

    User --> CLI
    User --> MCP
    Agent[AI Agent] --> MCP
    Agent --> Skill
    User --> Recorder
    Recorder --> Studio
    CLI --> Core
    MCP --> Core
    Skill --> Core
    Core --> Engines
    Studio -.inspect.-> Core

Source: packages/cli/package.json, packages/core/package.json, packages/harmony-mcp/package.json.

Command-Line Interface (`@midscene/cli`)

The CLI is the terminal entry point for running YAML scripts, batch tests, and bridge-mode sessions. It is published as @midscene/cli at version 1.9.7 and exposes a single binary named midscene (bin.midscene: ./bin/midscene). Source: packages/cli/package.json.

The CLI aggregates every platform adapter as workspace dependencies, so a single install supports web, Android, iOS, HarmonyOS, and computer-control automation:

| Dependency | Purpose |
| --- | --- |
| `@midscene/web` | Browser automation via Playwright/Puppeteer |
| `@midscene/android` | Android device automation |
| `@midscene/ios` | iOS simulator automation |
| `@midscene/harmony` | HarmonyOS automation |
| `@midscene/computer` | Desktop OS control (mouse/keyboard via libnut) |
| `@midscene/core` / `@midscene/shared` | Engine and shared types |

Source: packages/cli/package.json.

Common npm scripts include build (rslib), test (vitest), and test:ai (AITEST=true) for running AI-evaluated suites. A bridge-mode test variant is exposed as test:ai:temp (AITEST=true BRIDGE_MODE=true), indicating the CLI can run the Web integration in bridge mode where the page under test is driven by an already-open browser. Source: packages/cli/package.json, packages/web-integration/README.md.

CLI argument handling is centralized in packages/shared/src/cli/cli-args.ts. It normalizes option spellings (canonicalizeCliArgKeys), detects disallowed aliases (buildDisallowedCliSpellings), and renders Zod validation failures with formatCliValidationError. This means every tool in the monorepo — including midscene, midscene-ios, and the MCP servers — shares one consistent schema-driven argument layer. Source: packages/shared/src/cli/cli-args.ts.
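
The normalize-then-reject flow can be sketched without the real module. The code below is an assumption-based illustration, not the contents of cli-args.ts: `toKebabCase` and `normalizeArgKeys` are hypothetical helpers, kebab-case as the canonical spelling is an assumption, and a plain allow-list stands in for the Zod schema the source describes.

```typescript
// Convert camelCase or snake_case spellings to kebab-case (assumed canonical form).
function toKebabCase(key: string): string {
  return key
    .replace(/_/g, "-")
    .replace(/([a-z0-9])([A-Z])/g, "$1-$2")
    .toLowerCase();
}

// Rewrite every key to its canonical spelling and reject anything not allowed.
function normalizeArgKeys(
  args: Record<string, unknown>,
  allowedKeys: readonly string[],
): Record<string, unknown> {
  const normalized: Record<string, unknown> = {};
  for (const key of Object.keys(args)) {
    const canonical = toKebabCase(key);
    if (allowedKeys.indexOf(canonical) === -1) {
      // Report both spellings so the caller can see what was rejected and why.
      throw new Error(`Unknown option "${key}" (normalized to "${canonical}")`);
    }
    normalized[canonical] = args[key];
  }
  return normalized;
}
```

For example, `normalizeArgKeys({ headedMode: true }, ["headed-mode"])` yields an object keyed by `headed-mode`, while an unlisted key throws with both spellings in the message.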

Model Context Protocol Server (`@midscene/mcp` and `@midscene/harmony-mcp`)

Midscene exposes its automation as MCP tools so external AI agents (Claude Desktop, MCP Inspector, custom agents) can call actions like aiAct, aiAssert, and aiQuery. The HarmonyOS-specific MCP server (@midscene/harmony-mcp) depends on @modelcontextprotocol/sdk and on @modelcontextprotocol/inspector for local debugging. Source: packages/harmony-mcp/package.json.

The MCP layer is published as a subpath export of @midscene/core (./mcp resolves to ./dist/types/skill/index.d.ts and ./dist/es/skill/index.mjs), which means the same Skill module is reused for both standalone scripts and MCP-exposed tools. Source: packages/core/package.json. Community interest in this area is high: a HarmonyOS UI automation request (issue #1594) was raised explicitly to complete the cross-OS surface, and the MCP integration itself is highlighted as a showcase in the README. Source: README.md.

Skills, Studio, Recorder, and Visualizer

Skills

Skills are the high-level, agent-friendly layer exported from @midscene/core/skill. They wrap the underlying action planner so an AI agent can issue multi-step instructions rather than per-element commands. The README frames Skills as one of two testing modes: add Midscene to Playwright/Vitest, or let an AI agent test autonomously via Skills and MCP. Source: README.md, packages/core/package.json.

Chrome Extension Recorder

The Recorder is implemented inside apps/chrome-extension/src/extension/recorder. It captures pointer, input, and navigation events, then summarizes them with an LLM call that includes screenshots of the session. The summarizer prompt explicitly requests a hierarchical Mermaid mindmap and a short title/description JSON, preserving the exact sequence from the recording. Source: apps/chrome-extension/src/extension/recorder/utils.ts.

The recorder emits three reusable artifact formats declared in the shared recorder types:

| Artifact | Field | Use |
| --- | --- | --- |
| Markdown report | `MidsceneRecorderGeneratedCode.markdown` | Human-readable doc with embedded screenshots |
| YAML script | `MidsceneRecorderGeneratedCode.yaml` | Re-runnable script for the CLI |
| Playwright code | `MidsceneRecorderGeneratedCode.playwright` | Conventional Playwright test source |

Source: packages/shared/src/recorder.ts. Community issue #2240 explicitly requests exporting AI execution steps as Playwright scripts, which aligns with this playwright output format. Source: apps/chrome-extension/src/extension/recorder/utils.ts.
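
As a rough illustration of how the three artifact fields might be consumed, the sketch below maps each present field to an output file. The field names (markdown, yaml, playwright) come from the table above; treating them as optional strings, the `SketchGeneratedCode` and `ArtifactFile` types, the `planArtifactFiles` helper, and the file extensions are all assumptions.

```typescript
// Assumed shape: the manual names the fields but not their types.
interface SketchGeneratedCode {
  markdown?: string;
  yaml?: string;
  playwright?: string;
}

interface ArtifactFile {
  path: string;
  content: string;
}

// Plan one output file per artifact that is actually present.
function planArtifactFiles(baseName: string, code: SketchGeneratedCode): ArtifactFile[] {
  const files: ArtifactFile[] = [];
  if (code.markdown !== undefined) {
    files.push({ path: `${baseName}.md`, content: code.markdown });
  }
  if (code.yaml !== undefined) {
    files.push({ path: `${baseName}.yaml`, content: code.yaml });
  }
  if (code.playwright !== undefined) {
    files.push({ path: `${baseName}.spec.ts`, content: code.playwright });
  }
  return files;
}
```

Returning a plan instead of writing to disk keeps the helper free of file-system dependencies, so the caller decides where the artifacts go.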

Pending descriptions in the recorder UI are detected by checking the literal string AI is analyzing element..., which is a deliberate sentinel so the UI can show a spinner while the model plans the next action. Source: packages/shared/src/recorder.ts.
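
The sentinel check amounts to a string comparison. The sketch below shows the idea; the sentinel text is quoted from the paragraph above, while the constant name `PENDING_ELEMENT_DESCRIPTION` and the helpers `isPendingDescription` and `countPendingDescriptions` are hypothetical and not taken from the recorder source.

```typescript
// Sentinel text quoted from the manual; the constant name is an assumption.
const PENDING_ELEMENT_DESCRIPTION = "AI is analyzing element...";

// True while the model has not yet replaced the placeholder description.
function isPendingDescription(description: string | undefined): boolean {
  return description === PENDING_ELEMENT_DESCRIPTION;
}

// Count how many recorded rows would still show a spinner.
function countPendingDescriptions(descriptions: readonly string[]): number {
  let pending = 0;
  for (const description of descriptions) {
    if (isPendingDescription(description)) pending += 1;
  }
  return pending;
}
```

An exact-match sentinel is simple, but it couples the UI to one literal string, so any change to that text has to be made in both the producer and the check.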

Studio / Visualizer

The Visualizer (@midscene/visualizer) is a standalone package that renders the screenshots, plans, and assertions produced during a run for human inspection. It shares the same description used across the SDK: "An AI-powered automation SDK can control the page, perform assertions, and extract data in JSON format using natural language." Source: packages/visualizer/README.md, packages/shared/README.md.

Studio is the natural companion to the Recorder: a recorded session can be replayed through the CLI and then opened in the Visualizer to step through each AI decision. This addresses the inspection need behind community requests to debug AI-driven tests and verify the rendered UI state, not just DOM presence. Source: README.md.

Common Failure Modes and Notes

Unrecognized CLI flags — Calling the CLI with a deprecated spelling raises a CLIError listing both the canonical and the rejected name. Pass option spellings defined in the tool's def.schema only. Source: packages/shared/src/cli/cli-args.ts.
Bridge-mode flakiness — Tests launched with BRIDGE_MODE=true assume a browser is already attached via @midscene/web/bridge-mode; running them in plain Node will hang waiting for the bridge. Source: packages/cli/package.json, packages/web-integration/package.json.
Viewport-only targeting — Issue #179 documents that off-viewport elements currently require manual scrolling. The Recorder and Skills rely on the same viewport-restricted planner, so long pages or staged overlays may need an explicit scroll instruction. Source: apps/chrome-extension/src/extension/recorder/utils.ts.
AI-driven planning quality — Issue #426 highlights that generic LLMs may misinterpret domain-specific instructions; Skills and Recorder summaries both depend on the model's ability to map natural language to UI semantics, so prompt quality and screenshot context are critical. Source: apps/chrome-extension/src/extension/recorder/utils.ts.

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.



Found 10 structured pitfall item(s), including 1 high/blocking item(s). Top priority: Configuration risk - Configuration risk requires verification.

1. Configuration risk: Configuration risk requires verification

Severity: high
Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/web-infra-dev/midscene/issues/2063

2. Identity risk: Identity risk requires verification

Severity: medium
Finding: Project evidence flags an identity risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: identity.distribution | https://github.com/web-infra-dev/midscene

3. Installation risk: Installation risk requires verification

Severity: medium
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/web-infra-dev/midscene/issues/2689

4. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/web-infra-dev/midscene/issues/2691

5. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | https://github.com/web-infra-dev/midscene

6. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/web-infra-dev/midscene

7. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: downstream_validation.risk_items | https://github.com/web-infra-dev/midscene

8. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: risks.scoring_risks | https://github.com/web-infra-dev/midscene

9. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: issue_or_pr_quality=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/web-infra-dev/midscene

10. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: release_recency=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/web-infra-dev/midscene

Source: Doramagic discovery, validation, and Project Pack records
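The log above is regular enough to be treated as structured data. The following is a minimal sketch, assuming a simple record shape, of how the entries might be modelled and how the counts in the summary line could be derived from them. The `Pitfall` type, its field names, and `summarize` are assumptions made for illustration and do not describe Doramagic's actual schema or tooling.

```typescript
type Severity = "high" | "medium" | "low";

// Assumed record shape; field names are illustrative.
interface Pitfall {
  category: string;
  severity: Severity;
  evidence: string; // evidence URL as listed in the log
}

const repo = "https://github.com/web-infra-dev/midscene";

// The ten entries from the log, in their listed order.
const pitfalls: Pitfall[] = [
  { category: "Configuration risk", severity: "high", evidence: `${repo}/issues/2063` },
  { category: "Identity risk", severity: "medium", evidence: repo },
  { category: "Installation risk", severity: "medium", evidence: `${repo}/issues/2689` },
  { category: "Configuration risk", severity: "medium", evidence: `${repo}/issues/2691` },
  { category: "Capability evidence risk", severity: "medium", evidence: repo },
  { category: "Maintenance risk", severity: "medium", evidence: repo },
  { category: "Security or permission risk", severity: "medium", evidence: repo },
  { category: "Security or permission risk", severity: "medium", evidence: repo },
  { category: "Maintenance risk", severity: "low", evidence: repo },
  { category: "Maintenance risk", severity: "low", evidence: repo },
];

// Lower rank means higher priority.
const rank: Record<Severity, number> = { high: 0, medium: 1, low: 2 };

function summarize(items: Pitfall[]): string {
  // Copy before sorting so the caller's array order is preserved.
  const sorted = [...items].sort((a, b) => rank[a.severity] - rank[b.severity]);
  const highCount = items.filter((p) => p.severity === "high").length;
  const top = sorted[0];
  const topText = top ? ` Top priority: ${top.category}.` : "";
  return (
    `Found ${items.length} structured pitfall item(s), ` +
    `including ${highCount} high/blocking item(s).${topText}`
  );
}

console.log(summarize(pitfalls));
// Found 10 structured pitfall item(s), including 1 high/blocking item(s). Top priority: Configuration risk.
```

Sorting by severity rank rather than by list position means the top-priority entry stays correct even if the log is reordered.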

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12 (count of project-level external discussion links exposed on this manual page)

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using midscene with real data or production workflows.

Community source 1 - github / github_issue
Community source 2 - github / github_issue
[Bug]: Raw \x1b1A\x1b[2K ANSI escape sequences polluting logs in non-TT - github / github_issue
[[Feature]: Let MIDSCENE_MODEL_REASONING_ENABLED disable thinking for sel](https://github.com/web-infra-dev/midscene/issues/2063) - github / github_issue
v1.9.7 - github / github_release
v1.9.6 - github / github_release
v1.9.5 - github / github_release
v1.9.4 - github / github_release
v1.9.3 - github / github_release
v1.9.2 - github / github_release
v1.9.1 - github / github_release
v1.9.0 - github / github_release

Source: Project Pack community evidence and pitfall evidence
