Doramagic Project Pack · Human Manual
midscene
Midscene.js is an AI-powered UI automation SDK that controls applications, performs assertions, and extracts structured data using natural-language instructions. The project is positioned ...
Introduction & System Architecture
Related topics: Core AI Engine, Planning & Model Strategy, Platform Drivers & Device Abstraction, Developer Tools: CLI, MCP, Skills, Studio, Recorder & Visualizer
Overview and Purpose
Midscene.js is an AI-powered UI automation SDK that controls applications, performs assertions, and extracts structured data using natural-language instructions. The project is positioned as a "vision-driven" engine: it localizes elements from screenshots rather than from DOM or accessibility trees, which allows it to target icon-only buttons, <canvas> elements, custom controls, cross-origin iframes, and native mobile or desktop surfaces that DOM-based automation cannot reach (README.md:35-55). The same engine powers both test automation and general-purpose UI scripting, exposed via a JavaScript/TypeScript SDK, a Chrome extension, and YAML-based scripting (packages/core/README.md:1-5).
The release cadence is tracked in the workspace: the current published version is 1.9.7 across every official package, including @midscene/core, @midscene/web, @midscene/android, @midscene/ios, @midscene/cli, and @midscene/playground (packages/core/package.json:4, packages/web-integration/package.json:5, packages/android/package.json:4, packages/ios/package.json:4, packages/cli/package.json:4, packages/playground/package.json:4).
Repository Topology and Package Layout
The repository is a pnpm workspace containing device-specific automation packages, a shared abstraction layer, developer tools, and example applications. Every package advertises the same description and homepage (https://midscenejs.com/), confirming that they are coordinated releases of a single product (packages/core/package.json:2-3, packages/shared/package.json, packages/cli/package.json:2-3).
| Package | Role | Source |
|---|---|---|
| @midscene/core | Vision-driven engine, agent orchestration, AI model adapters | packages/core/package.json:2-3 |
| @midscene/shared | Cross-package utilities (recorder types, helpers) | packages/shared/src/recorder.ts:1-20 |
| @midscene/web | Browser integration (Playwright, Puppeteer, bridge mode) | packages/web-integration/README.md:1-10 |
| @midscene/android | Android device automation (scrcpy + yadb) | packages/android/package.json:2-9 |
| @midscene/ios | iOS simulator/device automation | packages/ios/package.json:2-10 |
| @midscene/cli | Unified CLI entry point (midscene bin) | packages/cli/package.json:11-13 |
| @midscene/playground | Web playground utilities (Express + CORS) | packages/playground/package.json:2-7 |
| @midscene/visualizer | Playback UI for recorded/AI runs | packages/visualizer/README.md:1-3 |
| apps/chrome-extension | Recorder + runner shipped as a browser extension | apps/chrome-extension/src/extension/recorder/utils.ts |
| @midscene/harmony-mcp | HarmonyOS MCP server bridge | packages/harmony-mcp/package.json |
The CLI package is the integration hub: it depends on every platform adapter (@midscene/web, @midscene/android, @midscene/ios, @midscene/computer, @midscene/harmony) and on @midscene/core + @midscene/shared, exposing them through a single midscene binary (packages/cli/package.json:17-25). Similarly, @midscene/core consumes @midscene/shared and UI-TARS action parsing, which makes it the engine that all platform adapters share (packages/core/package.json:64-72).
System Architecture and Data Flow
Midscene separates three concerns: the platform adapter (which drives real input/output on a device), the core agent (which plans and verifies), and the AI model (which sees screenshots and returns structured instructions). The same core agent is reused across every platform, while each @midscene/<platform> package supplies the adapter that knows how to capture screenshots and dispatch taps, swipes, and text input on that surface (packages/core/package.json:1-5, packages/web-integration/README.md:1-5, packages/android/package.json:2-9).
flowchart LR
user[User / Test Runner] --> api[Platform SDK<br/>@midscene/web · android · ios · harmony · computer]
api --> agent[Core Agent<br/>@midscene/core]
agent --> screenshot[Screenshot + optional DOM]
agent --> llm[Multimodal Model<br/>Qwen3.x · Doubao · Gemini · UI-TARS]
llm --> agent
agent --> adapter[Action Dispatch<br/>tap · swipe · type · assert]
adapter --> device[Target UI<br/>Web · Android · iOS · Harmony · Desktop]
recorder[Recorder / Visualizer] -. replay .-> api
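The separation between adapter, agent, and model can be sketched in a few lines of TypeScript. This is a minimal illustration under stated assumptions, not Midscene's implementation: the names `PlatformAdapter`, `VisionModel`, `Action`, and `runInstruction` are hypothetical and do not come from @midscene/core, and the loop is synchronous only to keep the example short (a real adapter and model call would be asynchronous).

```typescript
// Hypothetical types: none of these names come from @midscene/core.
type Action =
  | { kind: "tap"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "done" };

interface PlatformAdapter {
  screenshot(): string; // stand-in for an encoded image
  dispatch(action: Action): void;
}

interface VisionModel {
  nextAction(instruction: string, screenshot: string): Action;
}

// Core loop: capture, ask the model, dispatch, repeat until "done"
// or until the step budget runs out.
function runInstruction(
  instruction: string,
  adapter: PlatformAdapter,
  model: VisionModel,
  maxSteps = 10,
): Action[] {
  const executed: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const action = model.nextAction(instruction, adapter.screenshot());
    if (action.kind === "done") break;
    adapter.dispatch(action);
    executed.push(action);
  }
  return executed;
}

// In-memory stand-ins so the sketch runs without a device or a model.
class FakeAdapter implements PlatformAdapter {
  log: string[] = [];
  screenshot(): string {
    return `frame-${this.log.length}`;
  }
  dispatch(action: Action): void {
    this.log.push(action.kind);
  }
}

const scriptedModel: VisionModel = {
  nextAction(_instruction, screenshot) {
    if (screenshot === "frame-0") return { kind: "tap", x: 120, y: 48 };
    if (screenshot === "frame-1") return { kind: "type", text: "midscene" };
    return { kind: "done" };
  },
};

const demoAdapter = new FakeAdapter();
const demoActions = runInstruction("search for midscene", demoAdapter, scriptedModel);
console.log(demoAdapter.log, demoActions.length);
```

Because the loop only touches the two interfaces, swapping `FakeAdapter` for a different surface would not change the agent code, which is the property the architecture relies on.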
The core package also exports skill and MCP entry points (./skill, ./mcp), so external agents (for example, OpenClaw via midscene-skills) can drive Midscene through a model-context-protocol interface rather than the JS SDK directly (packages/core/package.json:24-34). The Chrome extension and the shared recorder module share the same event model: MidsceneRecorderTarget, MidsceneRecorderGeneratedCode, and MidsceneRecorderMarkdownScreenshotAsset describe platform IDs, generated artifacts (Markdown, YAML, Playwright), and screenshot assets, respectively (packages/shared/src/recorder.ts:1-25). This recorder contract is what powers the community-requested ability to export AI execution steps as reusable Playwright scripts (issue #2240) and to handle elements outside the current viewport via recorded event playback (issue #179).
Platform Support, Extensibility, and Community
The official adapter matrix covers browser (Playwright/Puppeteer), Android, iOS, HarmonyOS, and desktop via libnut. The apps/site/docs getting-started matrix lists all five platforms (README.md:65-75), and the CLI exposes HarmonyOS as a first-class target by depending on @midscene/harmony (packages/cli/package.json:17-21). HarmonyOS support is still maturing — community members continue to ask for richer UI automation on that platform (issue #1594), and the @midscene/harmony-mcp package is the experimental MCP bridge for it (packages/harmony-mcp/package.json).
Community extensions sit alongside the official packages. The README curates an "Awesome Midscene" list including midscene-pc (a Windows/macOS/Linux agent, issue #1389), midscene-ios (iOS mirror automation), Python and Java SDK ports, and Docker images for the PC server. These projects all consume the same core abstractions, which is why a third-party device adapter can be added by implementing the platform interface that @midscene/core expects.
For teams that want to plug domain knowledge into planning, the project tracks a long-running discussion about Retrieval-Augmented Generation support so that arbitrary LLMs can interpret product-specific instructions (issue #426). Until that lands, the recommended mitigation is to opt in to DOM context for data-extraction and page-understanding tasks while keeping action planning strictly visual, as described in the project README's model strategy section (README.md:80-90).
See Also
- Quick Start & Installation
- Core API Reference
- Platform Adapters
- Model Strategy & Configuration
- MCP & Skills Integration
Source: https://github.com/web-infra-dev/midscene / Human Manual
Core AI Engine, Planning & Model Strategy
Related topics: Introduction & System Architecture, Platform Drivers & Device Abstraction
Overview and Scope
Midscene's core AI engine is the vision-driven automation brain shared across every platform integration in the monorepo. Unlike traditional automation tools that read the DOM or the accessibility tree, the engine is built around multimodal models that localize elements from screenshots alone, with natural language used to describe each step. As stated in the main README, "Midscene is all-in on pure vision for UI actions: element localization is based on screenshots only" (README.md). This design removes the dependency on fragile selectors and makes the same engine usable for web, Android, iOS, HarmonyOS, and desktop surfaces.
The engine lives in the @midscene/core package and is consumed by every downstream integration. The package's package.json describes its role as "Automate browser actions, extract data, and perform assertions using AI. It offers JavaScript SDK, Chrome extension, and support for scripting in YAML" (packages/core/package.json). The exported entry points (e.g., ./ai-model, ./agent, ./device, ./tree) reveal the engine's internal layering and provide the surface area used by the web, iOS, Android, computer, and CLI packages.
Architecture and Layering
The engine is delivered as a workspace of cooperating packages rather than a single monolithic module. The core package owns the planning and AI model logic, while thin integration packages translate the abstract "device" interface into platform-specific input mechanisms.
flowchart TB
User[User / Test Author] -->|natural language| SDK[Platform SDK<br/>@midscene/web, /android, /ios, /computer, /harmony]
SDK --> Agent[Agent + Task Builder<br/>@midscene/core]
Agent --> Planning[Planning & Action Parsing<br/>llm-planning + @ui-tars/action-parser]
Agent --> Inspect[Inspect / Extract<br/>llm-inspect]
Planning --> Model[Multimodal Model<br/>Qwen / Doubao / GLM / Gemini / UI-TARS]
Inspect --> Model
Model -->|structured actions| Planning
Planning --> Executor[Execution Session]
Executor --> Device[AbstractDevice<br/>web / android / ios / computer]
Device -->|screenshot| Inspect
Device -->|input events| Target[Target Platform]
The @midscene/core package declares a dependency on @ui-tars/action-parser (version 1.2.3), which is the parser that converts raw model output into typed, executable actions (packages/core/package.json). This is the bridge between free-form LLM responses and the deterministic command set that the device layer expects.
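To make the idea of turning free-form model output into typed actions concrete, here is a small sketch. It does not use or reproduce the @ui-tars/action-parser API: the JSON wire format, the `ParsedAction` union, and `parseModelOutput` are all assumptions introduced for illustration.

```typescript
// Hypothetical action set and wire format; not the @ui-tars/action-parser API.
type ParsedAction =
  | { type: "tap"; x: number; y: number }
  | { type: "input"; text: string }
  | { type: "scroll"; direction: "up" | "down" };

type ParseResult =
  | { ok: true; action: ParsedAction }
  | { ok: false; error: string };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Validate raw model output and narrow it to a typed, executable action.
function parseModelOutput(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: "output is not valid JSON" };
  }
  if (!isRecord(data)) return { ok: false, error: "output is not an object" };

  const { type, x, y, text, direction } = data;
  switch (type) {
    case "tap":
      if (typeof x === "number" && typeof y === "number") {
        return { ok: true, action: { type: "tap", x, y } };
      }
      return { ok: false, error: "tap requires numeric x and y" };
    case "input":
      if (typeof text === "string") {
        return { ok: true, action: { type: "input", text } };
      }
      return { ok: false, error: "input requires a text string" };
    case "scroll":
      if (direction === "up" || direction === "down") {
        return { ok: true, action: { type: "scroll", direction } };
      }
      return { ok: false, error: "scroll requires direction up or down" };
    default:
      return { ok: false, error: "unknown action type" };
  }
}

console.log(parseModelOutput('{"type":"tap","x":10,"y":20}'));
console.log(parseModelOutput("click the blue button"));
```

Returning a result object instead of throwing lets the caller decide whether to retry the model call or fail the step when the output cannot be parsed.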
The @midscene/web package builds on the core and adds JavaScript SDK, Chrome extension, and YAML scripting surfaces (packages/web-integration/package.json). It also exposes bridge-mode subpaths, allowing an external UI to control a running browser instance. The same pattern is mirrored for iOS via @midscene/ios, which ships its own bin/midscene-ios CLI and a dedicated mcp-server subpath (packages/ios/package.json). The @midscene/android package is the Android counterpart (packages/android/README.md), and the CLI package pulls every platform integration together into a single command-line entry point (packages/cli/package.json).
The recorder in the Chrome extension shows how the engine's outputs are surfaced to humans. The utility module builds structured "session → page → event" mind maps from captured sequences, preserving input values, element descriptions, and page context for later review (apps/chrome-extension/src/extension/recorder/utils.ts). This same execution trace is what the visualizer package renders for debugging (packages/visualizer/README.md).
Model Strategy
Midscene is explicitly model-agnostic and is driven by multimodal models with strong UI grounding. The README enumerates the supported families: Qwen3.x, Doubao-Seed-2.0, GLM-4.6V, gemini-3.5-flash, and UI-TARS, "including open-source options you can self-host" (README.md). Because localization is screenshot-only, the strategy is to keep the model contract narrow: send a screenshot plus a natural-language instruction, and receive a structured action back.
| Concern | Approach in Midscene |
|---|---|
| Element localization | Pure vision on screenshots |
| Action parsing | @ui-tars/action-parser post-processor (packages/core/package.json) |
| Data extraction / assertions | Optional DOM-augmented mode for richer understanding (README.md) |
| Model choice | Any multimodal model with UI grounding, including self-hosted OSS variants (README.md) |
| Execution control | AbstractDevice abstraction, one driver per platform |
This strategy is what makes the same engine work for browser, native mobile, and desktop. It also explains why the community has been able to build out a PC controller (midscene-pc) on top of the device interface — issue #1389 specifically requests adding this to the official "Awesome Midscene" list, noting that the project's "integration with any interface" feature is what made the PC device possible.
Planning, Inspection, and Community Friction Points
The engine splits the LLM call into two complementary paths: a planning path that produces action sequences for aiAct-style flows, and an inspect path that produces structured data and assertions. The ai-model and agent subpath exports in @midscene/core are the public seams for these paths (packages/core/package.json). The shared package, described as the home for "AI-powered automation SDK" primitives (packages/shared/README.md), holds cross-cutting types used by both paths.
Several recurring community requests highlight the limits of the current planning strategy and where it is being extended:
- RAG over product knowledge. Issue #426 asks for retrieval-augmented generation so that LLMs can interpret high-level, domain-specific instructions before planning concrete steps. This sits directly on top of the planning path.
- Elements outside the current viewport. Issue #179 reports that the planner can only target what is visible. Real pages often require scrolling, which is a planning concern (when to issue a scroll action) as well as an executor concern (the device must support the input).
- Exporting to Playwright scripts. Issue #2240 proposes converting recorded AI execution steps into reusable Playwright code. The Chrome extension's recorder already captures the structured event sequence (apps/chrome-extension/src/extension/recorder/utils.ts), so the missing piece is a deterministic transpiler from that trace to Playwright.
- HarmonyOS support. Issue #1594 asks for HarmonyOS automation, which would round out the platform set. The @midscene/harmony and @midscene/harmony-mcp packages are already present in the workspace, which the main README points to via the "HarmonyOS" getting-started guide (README.md).
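The viewport limitation in the list above is often worked around with a "scroll, then look again" loop. The sketch below shows that loop under stated assumptions: `ScrollablePage`, `Locator`, and `scrollUntilVisible` are hypothetical names, and the locator is a plain function standing in for a screenshot-based model call. It is not how Midscene's planner is implemented.

```typescript
// Hypothetical page model; a real driver would capture screenshots instead.
interface ScrollablePage {
  scrollY: number;
  readonly viewportHeight: number;
  readonly pageHeight: number;
}

// Stand-in for a vision model call: reports whether the target is visible.
type Locator = (page: ScrollablePage) => boolean;

// Scroll down one step at a time until the locator succeeds
// or the bottom of the page is reached.
function scrollUntilVisible(
  page: ScrollablePage,
  locate: Locator,
  stepPx: number,
): { found: boolean; scrolls: number } {
  if (stepPx <= 0) throw new RangeError("stepPx must be positive");
  const maxScrollY = Math.max(0, page.pageHeight - page.viewportHeight);
  let scrolls = 0;
  while (!locate(page)) {
    if (page.scrollY >= maxScrollY) return { found: false, scrolls };
    page.scrollY = Math.min(maxScrollY, page.scrollY + stepPx);
    scrolls++;
  }
  return { found: true, scrolls };
}

// Demo: a 3000px page, an 800px viewport, and a target spanning y = 2100..2140.
const demoPage: ScrollablePage = { scrollY: 0, viewportHeight: 800, pageHeight: 3000 };
const targetTop = 2100;
const targetBottom = 2140;
const isTargetVisible: Locator = (p) =>
  targetTop >= p.scrollY && targetBottom <= p.scrollY + p.viewportHeight;

const scrollResult = scrollUntilVisible(demoPage, isTargetVisible, 600);
console.log(scrollResult, demoPage.scrollY); // { found: true, scrolls: 3 } 1800
```

Each iteration corresponds to one extra screenshot and one extra model call, so the step size trades latency against the risk of scrolling past a small target.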
For a technical reader, the practical takeaway is that every new capability — whether RAG, viewport expansion, code export, or a new platform — plugs into the same three layers: the planner that talks to a multimodal model, the parser that turns model output into typed actions, and the AbstractDevice that executes them. The core package's export map is the contract to learn first; the platform-specific packages are mostly device implementations plus thin SDK ergonomics.
See Also
- Web Integration (@midscene/web)
- CLI and YAML Scripting
- Platform Integrations (iOS, Android, HarmonyOS, Computer)
- MCP and Skills
Platform Drivers & Device Abstraction
Related topics: Introduction & System Architecture, Core AI Engine, Planning & Model Strategy, Developer Tools: CLI, MCP, Skills, Studio, Recorder & Visualizer
Overview
Midscene is a multimodal, vision-driven UI automation SDK that ships separate platform driver packages for each runtime environment it targets, all of which plug into a shared device abstraction layer exposed by @midscene/core. The repository is organized as a pnpm monorepo with one package per driver surface — web, Android, iOS, HarmonyOS, and desktop ("computer") — plus shared utilities, a CLI, and a visualizer (README.md).
The drivers share a vision-first philosophy: instead of relying on DOM selectors or accessibility trees, every driver captures a screenshot of the current viewport and lets a multimodal model localize the target element. This is what allows the same aiAct, aiQuery, and aiAssert surface API to work uniformly across browsers, native mobile apps, and desktop applications (README.md).
The community context for this page reflects this multi-platform design:
- Issue #1389 is a request to formally add midscene-pc (the computer driver) to the curated list of integrations, reflecting strong community adoption of the desktop driver.
- Issue #1594 asks about HarmonyOS support, which already exists as @midscene/harmony and @midscene/harmony-mcp.
- Issue #179 requests out-of-viewport element targeting, a limitation tied to the screenshot-based device abstraction shared by every driver.
Package Layout and Driver Inventory
The monorepo's packages/ directory contains one driver per platform, each published independently with its own CLI and (where applicable) MCP server entry point.
| Package | Target | CLI Binary | MCP Server | Source |
|---|---|---|---|---|
| @midscene/web | Browsers (Playwright, Puppeteer, Chrome extension, bridge mode) | - | - | packages/web-integration/package.json |
| @midscene/android | Android devices via adb/scrcpy/yadb | midscene-android | ./mcp-server | packages/android/package.json |
| @midscene/ios | iOS simulator and devices | midscene-ios | ./mcp-server | packages/ios/package.json |
| @midscene/harmony / @midscene/harmony-mcp | HarmonyOS devices | - | yes (separate MCP package) | packages/harmony-mcp/package.json |
| @midscene/computer | Desktop OS (Windows/macOS/Linux) | consumed by playground | - | packages/computer-playground/README.md |
| @midscene/core | Engine, agent, device abstraction | - | ./skill | packages/core/package.json |
| @midscene/shared | Cross-driver types (recorder events, key layout) | - | - | packages/shared/README.md |
| @midscene/cli | Unified CLI that bundles all drivers | midscene | - | packages/cli/package.json |
@midscene/core is the only package that exports the device subpath (./device, ./agent, ./skill, ./ai-model, ./tree, ./yaml), and every platform driver depends on it via the workspace protocol — for example @midscene/cli lists @midscene/core, @midscene/web, @midscene/android, @midscene/computer, @midscene/harmony, and @midscene/ios as workspace dependencies (packages/cli/package.json). This composition makes core the canonical host of the device abstraction that all drivers implement against.
The Web Driver: A Driver-of-Drivers
The web package is unusual because it does not bind to a single browser automation backend. @midscene/web ships four sub-entry points, each of which is itself a thin driver that wraps a different browser control surface (packages/web-integration/package.json, packages/web-integration/README.md):
- . (package root): the default export for direct Playwright/Puppeteer usage.
- ./bridge-mode: a server-side entry used when an automation host runs in Node and drives a browser elsewhere.
- ./bridge-mode-browser: the browser-side counterpart loaded into the page being driven.
- ./utils and ./ui-utils: helper modules shared by all four drivers.
flowchart LR
User[Caller Code] --> WebPkg["@midscene/web"]
WebPkg --> PW[Playwright driver]
WebPkg --> PP[Puppeteer driver]
WebPkg --> CE[Chrome extension driver]
WebPkg --> BM[Bridge mode]
PW --> Core["@midscene/core<br/>(device abstraction + agent)"]
PP --> Core
CE --> Core
BM --> Core
Core --> LLM[Multimodal model]
The bridge-mode split exists so that an MCP server (such as @midscene/android's ./mcp-server) can live on a developer machine while the browser is on another host, with JSON messages flowing across the bridge. The mobile drivers reuse this pattern (packages/web-integration/README.md).
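As a rough picture of JSON messages flowing across such a bridge, the sketch below pairs requests and responses by id. The message shapes and the names `BridgeRequest`, `BridgeResponse`, `encodeBridgeRequest`, and `handleBridgeMessage` are assumptions made for this example; the actual bridge-mode wire format is not described in this manual.

```typescript
// Hypothetical message shapes; the real bridge-mode wire format may differ.
interface BridgeRequest {
  id: number;
  method: string;
  params: unknown[];
}

interface BridgeResponse {
  id: number;
  result?: unknown;
  error?: string;
}

type Handler = (...params: unknown[]) => unknown;

// Node-side half: build a request with a fresh id and serialize it.
let nextRequestId = 1;
function encodeBridgeRequest(method: string, ...params: unknown[]): string {
  const request: BridgeRequest = { id: nextRequestId++, method, params };
  return JSON.stringify(request);
}

// Browser-side half: decode a request, run the matching handler,
// and serialize a response that carries the same id.
function handleBridgeMessage(rawRequest: string, handlers: Record<string, Handler>): string {
  const request = JSON.parse(rawRequest) as BridgeRequest;
  // Own-property check so inherited names such as "toString" are not callable.
  const known = Object.prototype.hasOwnProperty.call(handlers, request.method);
  let response: BridgeResponse;
  if (!known) {
    response = { id: request.id, error: `unknown method: ${request.method}` };
  } else {
    try {
      response = { id: request.id, result: handlers[request.method](...request.params) };
    } catch (err) {
      response = { id: request.id, error: String(err) };
    }
  }
  return JSON.stringify(response);
}

const pageHandlers: Record<string, Handler> = {
  tap: (x, y) => `tapped ${String(x)},${String(y)}`,
};

const wireRequest = encodeBridgeRequest("tap", 40, 60);
const wireResponse = JSON.parse(handleBridgeMessage(wireRequest, pageHandlers)) as BridgeResponse;
console.log(wireResponse);
```

Carrying the id through the response is what lets the Node side match replies to calls when several requests are in flight over the same channel.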
Mobile and Desktop Drivers
The Android and iOS drivers each expose their own CLI binary and a ./mcp-server sub-entry, so they can be invoked either programmatically or surfaced to an external agent via the Model Context Protocol (packages/android/package.json, packages/ios/package.json). The Android package notably runs a prebuild step that downloads the scrcpy-server and yadb binaries, the native components that stream the device screen and inject input events (packages/android/package.json). HarmonyOS MCP support is instead split into its own harmony-mcp package, which depends on @modelcontextprotocol/sdk and the MCP inspector for testing (packages/harmony-mcp/package.json).
The desktop driver is layered differently. @midscene/computer provides the core automation primitives, @midscene/computer-mac adds macOS-specific packaging (packages/computer-mac/package.json), and @midscene/computer-playground is a runnable playground app for Windows, macOS, and Linux (packages/computer-playground/README.md). This split lets the same engine run on any host OS while the playground provides a UI for manual exploration.
Shared Device Layer
All drivers converge on @midscene/core, which exports a ./device subpath alongside ./agent, ./skill, ./ai-model, and ./tree (packages/core/package.json). The device module is what gives every platform driver a uniform shape: a way to obtain the current screenshot, perform pointer/keyboard input, navigate, and observe the result, regardless of whether the underlying transport is Playwright, adb, libnut, or a HarmonyOS-specific protocol.
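The "uniform shape" described above can be illustrated with one interface and two fake transports. This is a sketch only: `SketchDevice`, `FakeBrowserDevice`, and `FakeAdbDevice` are hypothetical names, the interface is not the device type exported by @midscene/core, and the command strings are recorded rather than executed.

```typescript
// Hypothetical interface; not the actual device type exported by @midscene/core.
interface SketchDevice {
  readonly platform: string;
  screenshot(): string;
  tap(x: number, y: number): void;
  typeText(text: string): void;
}

// Records the calls a browser-style transport might make.
class FakeBrowserDevice implements SketchDevice {
  readonly platform = "web";
  readonly commands: string[] = [];
  screenshot(): string {
    return "web-frame";
  }
  tap(x: number, y: number): void {
    this.commands.push(`mouse.click(${x}, ${y})`);
  }
  typeText(text: string): void {
    this.commands.push(`keyboard.type(${JSON.stringify(text)})`);
  }
}

// Records the shell-style commands an adb-based transport might issue.
class FakeAdbDevice implements SketchDevice {
  readonly platform = "android";
  readonly commands: string[] = [];
  screenshot(): string {
    return "android-frame";
  }
  tap(x: number, y: number): void {
    this.commands.push(`input tap ${x} ${y}`);
  }
  typeText(text: string): void {
    this.commands.push(`input text ${JSON.stringify(text)}`);
  }
}

// Caller code depends only on the shared interface.
function fillSearchBox(device: SketchDevice, query: string): string {
  const before = device.screenshot();
  device.tap(200, 80);
  device.typeText(query);
  return before;
}

const browserDevice = new FakeBrowserDevice();
const adbDevice = new FakeAdbDevice();
for (const device of [browserDevice, adbDevice]) {
  fillSearchBox(device, "hello");
}
console.log(browserDevice.commands, adbDevice.commands);
```

The same `fillSearchBox` call produces transport-specific commands on each device, which mirrors how one agent can drive several platforms through a single abstraction.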
The @midscene/shared package carries the cross-driver data types that travel across that abstraction. recorder.ts defines MidsceneRecorderEvent, MidsceneRecorderTarget, and MidsceneRecorderGeneratedCode — the canonical event stream used by every platform's recording feature, with platformId discriminators that tell consumers whether an event came from a web, Android, iOS, Harmony, or computer session (packages/shared/src/recorder.ts). The us-keyboard-layout.ts module, adapted from the Puppeteer keyboard layout, provides a normalized KeyInput union plus KeyDefinition records so that the web, Android, and computer drivers can describe keystrokes using the same identifier space (packages/shared/src/us-keyboard-layout.ts).
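A shared key identifier space can be pictured as a lookup table keyed by a string union. The sketch below uses a four-key subset; the field names follow the Puppeteer keyboard layout that the module is said to be adapted from, but the exact `KeyDefinition` shape in @midscene/shared is an assumption here, and `resolveKeys` is a hypothetical helper.

```typescript
// Illustrative subset. The exact shape in @midscene/shared is an assumption.
interface KeyDefinition {
  key: string;
  code: string;
  keyCode: number;
  text?: string;
}

type KeyInput = "Enter" | "Tab" | "a" | "A";

const keyDefinitions: Record<KeyInput, KeyDefinition> = {
  Enter: { key: "Enter", code: "Enter", keyCode: 13, text: "\r" },
  Tab: { key: "Tab", code: "Tab", keyCode: 9 },
  a: { key: "a", code: "KeyA", keyCode: 65 },
  A: { key: "A", code: "KeyA", keyCode: 65 },
};

// Type guard: true only for names present in the layout table itself.
function isKeyInput(value: string): value is KeyInput {
  return Object.prototype.hasOwnProperty.call(keyDefinitions, value);
}

// Resolve a sequence of key names, rejecting anything outside
// the shared identifier space.
function resolveKeys(names: string[]): KeyDefinition[] {
  return names.map((name) => {
    if (!isKeyInput(name)) throw new Error(`unknown key: ${name}`);
    return keyDefinitions[name];
  });
}

const resolvedKeys = resolveKeys(["a", "A", "Enter"]);
console.log(resolvedKeys.map((k) => `${k.code}:${k.keyCode}`));
```

Note that "a" and "A" share a physical key (`KeyA`) but differ in the produced `key` value, which is why a driver needs the full record rather than just the name.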
Together these pieces implement the same vision-driven loop the README advertises: capture a screenshot from whichever device the driver controls, ask the multimodal model to localize the target, perform the input through the driver, and replay until the assertion or query succeeds (README.md).
See Also
- README.md — project overview and supported platforms
- Model Strategy — multimodal models used by every driver
- Bridge Mode — the web driver's bridge split
- Playwright Integration — web driver usage
- Puppeteer Integration — web driver usage
- Android Getting Started
- iOS Getting Started
- HarmonyOS Getting Started
- Computer (Desktop) Getting Started
Developer Tools: CLI, MCP, Skills, Studio, Recorder & Visualizer
Related topics: Introduction & System Architecture, Core AI Engine, Planning & Model Strategy, Platform Drivers & Device Abstraction
Midscene ships a family of developer-facing tools that wrap the core vision-driven automation engine in different interfaces: a runnable command-line tool, an MCP server for AI agents, a Skill layer for autonomous execution, a Studio/Visualizer for inspecting runs, and a Chrome-extension Recorder that captures user actions and converts them into reusable artifacts. Together they let users operate Midscene from terminal sessions, IDE-embedded agents, in-browser debuggers, or recorder-style authoring flows. Source: README.md.
High-Level Architecture
The following diagram shows how the developer tools relate to the core automation engine.
flowchart LR
User[User / Agent]
CLI["@midscene/cli<br/>(midscene binary)"]
MCP["@midscene/mcp<br/>& @midscene/harmony-mcp"]
Skill["@midscene/core/skill"]
Recorder["Chrome Extension<br/>Recorder"]
Studio["Studio / Visualizer"]
Core["@midscene/core"]
Engines["Web / Android / iOS /<br/>Harmony / Computer"]
User --> CLI
User --> MCP
Agent[AI Agent] --> MCP
Agent --> Skill
User --> Recorder
Recorder --> Studio
CLI --> Core
MCP --> Core
Skill --> Core
Core --> Engines
Studio -.inspect.-> Core
Source: packages/cli/package.json, packages/core/package.json, packages/harmony-mcp/package.json.
Command-Line Interface (`@midscene/cli`)
The CLI is the terminal entry point for running YAML scripts, batch tests, and bridge-mode sessions. It is published as @midscene/cli at version 1.9.7 and exposes a single binary named midscene (bin.midscene: ./bin/midscene). Source: packages/cli/package.json.
The CLI aggregates every platform adapter as workspace dependencies, so a single install supports web, Android, iOS, HarmonyOS, and computer-control automation:
| Dependency | Purpose |
|---|---|
| @midscene/web | Browser automation via Playwright/Puppeteer |
| @midscene/android | Android device automation |
| @midscene/ios | iOS simulator automation |
| @midscene/harmony | HarmonyOS automation |
| @midscene/computer | Desktop OS control (mouse/keyboard via libnut) |
| @midscene/core / @midscene/shared | Engine and shared types |
Source: packages/cli/package.json.
Common npm scripts include build (rslib), test (vitest), and test:ai (AITEST=true) for running AI-evaluated suites. A bridge-mode test variant is exposed as test:ai:temp (AITEST=true BRIDGE_MODE=true), indicating the CLI can run the Web integration in bridge mode where the page under test is driven by an already-open browser. Source: packages/cli/package.json, packages/web-integration/README.md.
CLI argument handling is centralized in packages/shared/src/cli/cli-args.ts. It normalizes option spellings (canonicalizeCliArgKeys), detects disallowed aliases (buildDisallowedCliSpellings), and renders Zod validation failures with formatCliValidationError. This means every tool in the monorepo — including midscene, midscene-ios, and the MCP servers — shares one consistent schema-driven argument layer. Source: packages/shared/src/cli/cli-args.ts.
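The general idea of canonicalizing option spellings and rejecting disallowed aliases can be sketched as follows. These helpers are hypothetical and are not the functions in cli-args.ts: `toCamelCase` and `normalizeArgs` are names invented for this example, the kebab-case to camelCase rule is an assumption, and the real module validates with Zod rather than a plain key list.

```typescript
// Hypothetical helpers for illustration; not the functions in cli-args.ts.
function toCamelCase(key: string): string {
  return key.replace(/-([a-z0-9])/g, (_match, ch: string) => ch.toUpperCase());
}

// Map each accepted spelling to one canonical key and reject disallowed aliases.
// `disallowed` maps a rejected spelling to the spelling the user should use.
function normalizeArgs(
  args: Record<string, unknown>,
  canonicalKeys: readonly string[],
  disallowed: Record<string, string>,
): Record<string, unknown> {
  const normalized: Record<string, unknown> = {};
  for (const key of Object.keys(args)) {
    if (Object.prototype.hasOwnProperty.call(disallowed, key)) {
      throw new Error(`Option "${key}" is not accepted; use "${disallowed[key]}" instead`);
    }
    const canonical = toCamelCase(key);
    if (canonicalKeys.indexOf(canonical) === -1) {
      throw new Error(`Unknown option "${key}"`);
    }
    normalized[canonical] = args[key];
  }
  return normalized;
}

const demoArgs = normalizeArgs(
  { "device-id": "emulator-5554", headed: true },
  ["deviceId", "headed"],
  { deviceid: "device-id" },
);
console.log(demoArgs);
```

Naming both the rejected and the canonical spelling in the error message matches the behavior the manual attributes to the shared argument layer, where a failure tells the user what to type instead.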
Model Context Protocol Server (`@midscene/mcp` and `@midscene/harmony-mcp`)
Midscene exposes its automation as MCP tools so external AI agents (Claude Desktop, MCP Inspector, custom agents) can call actions like aiAct, aiAssert, and aiQuery. The HarmonyOS-specific MCP server (@midscene/harmony-mcp) depends on @modelcontextprotocol/sdk and on @modelcontextprotocol/inspector for local debugging. Source: packages/harmony-mcp/package.json.
The MCP layer is published under subpath exports from @midscene/core (./mcp resolves to ./dist/types/skill/index.d.ts and ./dist/es/skill/index.mjs), which means the same Skill module is reused for both standalone scripts and MCP-exposed tools. Source: packages/core/package.json. Community interest in this area is high: a HarmonyOS UI automation request (issue #1594) was raised explicitly to complete the cross-OS surface, and the MCP integration itself is highlighted as a showcase in the README. Source: README.md.
Skills, Studio, Recorder, and Visualizer
Skills
Skills are the high-level, agent-friendly layer exported from @midscene/core/skill. They wrap the underlying action planner so an AI agent can issue multi-step instructions rather than per-element commands. The README frames Skills as one of two testing modes: add Midscene to Playwright/Vitest, or let an AI agent test autonomously via Skills and MCP. Source: README.md, packages/core/package.json.
Chrome Extension Recorder
The Recorder is implemented inside apps/chrome-extension/src/extension/recorder. It captures pointer, input, and navigation events, then summarizes them with an LLM call that includes screenshots of the session. The summarizer prompt explicitly requests a hierarchical Mermaid mindmap and a short title/description JSON, preserving the exact sequence from the recording. Source: apps/chrome-extension/src/extension/recorder/utils.ts.
The recorder emits three reusable artifact formats declared in the shared recorder types:
| Artifact | Field | Use |
|---|---|---|
| Markdown report | MidsceneRecorderGeneratedCode.markdown | Human-readable doc with embedded screenshots |
| YAML script | MidsceneRecorderGeneratedCode.yaml | Re-runnable script for the CLI |
| Playwright code | MidsceneRecorderGeneratedCode.playwright | Conventional Playwright test source |
Source: packages/shared/src/recorder.ts. Community issue #2240 explicitly requests exporting AI execution steps as Playwright scripts, which aligns with this playwright output format. Source: apps/chrome-extension/src/extension/recorder/utils.ts.
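A consumer of the three artifact fields might select whichever formats were generated and give each a file name. The sketch below mirrors only the field names listed in the table; `GeneratedCodeSketch`, the optional typing of its fields, the file extensions, and `listArtifacts` are assumptions for illustration and not the definitions in recorder.ts.

```typescript
// Hypothetical mirror of the three fields named in the table above;
// the real type may carry more fields or different optionality.
interface GeneratedCodeSketch {
  markdown?: string;
  yaml?: string;
  playwright?: string;
}

type ArtifactFormat = keyof GeneratedCodeSketch;

// Assumed file extensions, chosen only for this example.
const fileExtensions: Record<ArtifactFormat, string> = {
  markdown: "md",
  yaml: "yaml",
  playwright: "spec.ts",
};

// List the artifacts that were actually generated,
// with a suggested file name for each.
function listArtifacts(
  sessionName: string,
  code: GeneratedCodeSketch,
): { fileName: string; content: string }[] {
  const formats: ArtifactFormat[] = ["markdown", "yaml", "playwright"];
  const artifacts: { fileName: string; content: string }[] = [];
  for (const format of formats) {
    const content = code[format];
    if (content !== undefined) {
      artifacts.push({ fileName: `${sessionName}.${fileExtensions[format]}`, content });
    }
  }
  return artifacts;
}

const demoArtifacts = listArtifacts("login-flow", {
  markdown: "# Login flow",
  playwright: "// generated test body",
});
console.log(demoArtifacts.map((a) => a.fileName));
```

Keeping the format list in one place means a new output format only needs a new field and a new extension entry, and the compiler flags any format that lacks an extension.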
Pending descriptions in the recorder UI are detected by checking the literal string AI is analyzing element..., which is a deliberate sentinel so the UI can show a spinner while the model plans the next action. Source: packages/shared/src/recorder.ts.
Studio / Visualizer
The Visualizer (@midscene/visualizer) is a standalone package that renders the screenshots, plans, and assertions produced during a run for human inspection. It shares the same description used across the SDK: "An AI-powered automation SDK can control the page, perform assertions, and extract data in JSON format using natural language." Source: packages/visualizer/README.md, packages/shared/README.md.
Studio is the natural companion to the Recorder: a recorded session can be replayed through the CLI and then opened in the Visualizer to step through each AI decision. This addresses the inspection need behind community requests to debug AI-driven tests and verify the rendered UI state, not just DOM presence. Source: README.md.
Common Failure Modes and Notes
- Unrecognized CLI flags: Calling the CLI with a deprecated spelling raises a CLIError listing both the canonical and the rejected name. Pass only the option spellings defined in the tool's def.schema. Source: packages/shared/src/cli/cli-args.ts.
- Bridge-mode flakiness: Tests launched with BRIDGE_MODE=true assume a browser is already attached via @midscene/web/bridge-mode; running them in plain Node will hang waiting for the bridge. Source: packages/cli/package.json, packages/web-integration/package.json.
- Viewport-only targeting: Issue #179 documents that off-viewport elements currently require manual scrolling. The Recorder and Skills rely on the same viewport-restricted planner, so long pages or staged overlays may need an explicit scroll instruction. Source: apps/chrome-extension/src/extension/recorder/utils.ts.
- AI-driven planning quality: Issue #426 highlights that generic LLMs may misinterpret domain-specific instructions; Skills and Recorder summaries both depend on the model's ability to map natural language to UI semantics, so prompt quality and screenshot context are critical. Source: apps/chrome-extension/src/extension/recorder/utils.ts.
See Also
- Core engine: @midscene/core (packages/core/README.md)
- Web integration & bridge mode: packages/web-integration/README.md
- Shared recorder types: packages/shared/src/recorder.ts
- Project README: README.md
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 10 structured pitfall item(s), including 1 high/blocking item(s). Top priority: Configuration risk - Configuration risk requires verification.
1. Configuration risk: Configuration risk requires verification
- Severity: high
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/web-infra-dev/midscene/issues/2063
2. Identity risk: Identity risk requires verification
- Severity: medium
- Finding: Project evidence flags an identity risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: identity.distribution | https://github.com/web-infra-dev/midscene
3. Installation risk: Installation risk requires verification
- Severity: medium
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/web-infra-dev/midscene/issues/2689
4. Configuration risk: Configuration risk requires verification
- Severity: medium
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/web-infra-dev/midscene/issues/2691
5. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | https://github.com/web-infra-dev/midscene
6. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/web-infra-dev/midscene
7. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | https://github.com/web-infra-dev/midscene
8. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: risks.scoring_risks | https://github.com/web-infra-dev/midscene
9. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/web-infra-dev/midscene
10. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | https://github.com/web-infra-dev/midscene
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using midscene with real data or production workflows.
- Community source 1 - github / github_issue
- Community source 2 - github / github_issue
- [Bug]: Raw \x1b1A\x1b[2K ANSI escape sequences polluting logs in non-TT - github / github_issue
- [[Feature]: Let MIDSCENE_MODEL_REASONING_ENABLED disable thinking for sel](https://github.com/web-infra-dev/midscene/issues/2063) - github / github_issue
- v1.9.7 - github / github_release
- v1.9.6 - github / github_release
- v1.9.5 - github / github_release
- v1.9.4 - github / github_release
- v1.9.3 - github / github_release
- v1.9.2 - github / github_release
- v1.9.1 - github / github_release
- v1.9.0 - github / github_release
Source: Project Pack community evidence and pitfall evidence