# https://github.com/web-infra-dev/midscene Project Manual

Generated at: 2026-06-19 05:33:20 UTC

## Table of Contents

- [Introduction & System Architecture](#page-1)
- [Core AI Engine, Planning & Model Strategy](#page-2)
- [Platform Drivers & Device Abstraction](#page-3)
- [Developer Tools: CLI, MCP, Skills, Studio, Recorder & Visualizer](#page-4)

<a id='page-1'></a>

## Introduction & System Architecture

### Related Pages

Related topics: [Core AI Engine, Planning & Model Strategy](#page-2), [Platform Drivers & Device Abstraction](#page-3), [Developer Tools: CLI, MCP, Skills, Studio, Recorder & Visualizer](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md)
- [packages/core/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/core/package.json)
- [packages/core/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/core/README.md)
- [packages/web-integration/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/web-integration/package.json)
- [packages/web-integration/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/web-integration/README.md)
- [packages/android/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/android/package.json)
- [packages/ios/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/ios/package.json)
- [packages/cli/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/cli/package.json)
- [packages/shared/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/package.json)
- [packages/visualizer/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/visualizer/README.md)
- [packages/playground/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/playground/package.json)
- [packages/harmony-mcp/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/harmony-mcp/package.json)
- [packages/shared/src/recorder.ts](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/src/recorder.ts)
- [apps/chrome-extension/src/extension/recorder/utils.ts](https://github.com/web-infra-dev/midscene/blob/main/apps/chrome-extension/src/extension/recorder/utils.ts)
</details>

# Introduction & System Architecture

## Overview and Purpose

Midscene.js is an AI-powered UI automation SDK that controls applications, performs assertions, and extracts structured data using natural-language instructions. The project is positioned as a "vision-driven" engine: it localizes elements from screenshots rather than from DOM or accessibility trees, which allows it to target icon-only buttons, `<canvas>` elements, custom controls, cross-origin iframes, and native mobile or desktop surfaces that DOM-based automation cannot reach ([README.md:35-55]()). The same engine powers both test automation and general-purpose UI scripting, exposed via a JavaScript/TypeScript SDK, a Chrome extension, and YAML-based scripting ([packages/core/README.md:1-5]()).

Packages are released in lockstep: the current published version is `1.9.7` across every official package, including `@midscene/core`, `@midscene/web`, `@midscene/android`, `@midscene/ios`, `@midscene/cli`, and `@midscene/playground` ([packages/core/package.json:4](), [packages/web-integration/package.json:5](), [packages/android/package.json:4](), [packages/ios/package.json:4](), [packages/cli/package.json:4](), [packages/playground/package.json:4]()).

## Repository Topology and Package Layout

The repository is a pnpm workspace containing device-specific automation packages, a shared abstraction layer, developer tools, and example applications. Every package advertises the same description and homepage (`https://midscenejs.com/`), confirming that they are coordinated releases of a single product ([packages/core/package.json:2-3](), [packages/shared/package.json](), [packages/cli/package.json:2-3]()).

| Package | Role | Source |
| --- | --- | --- |
| `@midscene/core` | Vision-driven engine, agent orchestration, AI model adapters | [packages/core/package.json:2-3]() |
| `@midscene/shared` | Cross-package utilities (recorder types, helpers) | [packages/shared/src/recorder.ts:1-20]() |
| `@midscene/web` | Browser integration (Playwright, Puppeteer, bridge mode) | [packages/web-integration/README.md:1-10]() |
| `@midscene/android` | Android device automation (scrcpy + yadb) | [packages/android/package.json:2-9]() |
| `@midscene/ios` | iOS simulator/device automation | [packages/ios/package.json:2-10]() |
| `@midscene/cli` | Unified CLI entry point (`midscene` bin) | [packages/cli/package.json:11-13]() |
| `@midscene/playground` | Web playground utilities (Express + CORS) | [packages/playground/package.json:2-7]() |
| `@midscene/visualizer` | Playback UI for recorded/AI runs | [packages/visualizer/README.md:1-3]() |
| `apps/chrome-extension` | Recorder + runner shipped as a browser extension | [apps/chrome-extension/src/extension/recorder/utils.ts]() |
| `@midscene/harmony-mcp` | HarmonyOS MCP server bridge | [packages/harmony-mcp/package.json]() |

The CLI package is the integration hub: it depends on every platform adapter (`@midscene/web`, `@midscene/android`, `@midscene/ios`, `@midscene/computer`, `@midscene/harmony`) and on `@midscene/core` + `@midscene/shared`, exposing them through a single `midscene` binary ([packages/cli/package.json:17-25]()). Similarly, `@midscene/core` consumes `@midscene/shared` and UI-TARS action parsing, which makes it the engine that all platform adapters share ([packages/core/package.json:64-72]()).

## System Architecture and Data Flow

Midscene separates three concerns: the platform adapter (which drives real input/output on a device), the core agent (which plans and verifies), and the AI model (which sees screenshots and returns structured instructions). The same `core` agent is reused across every platform, while each `@midscene/<platform>` package supplies the adapter that knows how to capture screenshots and dispatch taps, swipes, and text input on that surface ([packages/core/package.json:1-5](), [packages/web-integration/README.md:1-5](), [packages/android/package.json:2-9]()).

```mermaid
flowchart LR
  user[User / Test Runner] --> api[Platform SDK<br/>@midscene/web · android · ios · harmony · computer]
  api --> agent[Core Agent<br/>@midscene/core]
  agent --> screenshot[Screenshot + optional DOM]
  agent --> llm[Multimodal Model<br/>Qwen3.x · Doubao · Gemini · UI-TARS]
  llm --> agent
  agent --> adapter[Action Dispatch<br/>tap · swipe · type · assert]
  adapter --> device[Target UI<br/>Web · Android · iOS · Harmony · Desktop]
  recorder[Recorder / Visualizer] -. replay .-> api
```
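
The separation of concerns in the diagram can be sketched as three small interfaces. This is a minimal illustration and not Midscene's actual API: `DeviceAdapter`, `VisionModel`, `runStep`, and the `Action` shape are hypothetical names chosen for this example, and a real adapter would be asynchronous.

```typescript
// Hypothetical shapes for illustration only; not the real @midscene/core API.
type Action =
  | { kind: "tap"; x: number; y: number }
  | { kind: "type"; text: string };

// Platform adapter: captures screenshots and dispatches input on one surface.
interface DeviceAdapter {
  screenshot(): string; // e.g. a base64-encoded image
  dispatch(action: Action): void;
}

// AI model: sees a screenshot plus an instruction, returns a structured action.
interface VisionModel {
  plan(screenshot: string, instruction: string): Action;
}

// Core agent: platform-agnostic glue between the model and the adapter.
function runStep(device: DeviceAdapter, model: VisionModel, instruction: string): Action {
  const shot = device.screenshot();
  const action = model.plan(shot, instruction);
  device.dispatch(action);
  return action;
}

// In-memory fakes so the sketch runs without a browser or a model endpoint.
class FakeDevice implements DeviceAdapter {
  readonly log: Action[] = [];
  screenshot(): string {
    return "fake-screenshot";
  }
  dispatch(action: Action): void {
    this.log.push(action);
  }
}

const fakeModel: VisionModel = {
  plan: (_shot, instruction) =>
    instruction.startsWith("type ")
      ? { kind: "type", text: instruction.slice(5) }
      : { kind: "tap", x: 10, y: 20 },
};

const fakeDevice = new FakeDevice();
runStep(fakeDevice, fakeModel, "click the login button");
runStep(fakeDevice, fakeModel, "type hello");
```

Because `runStep` only depends on the two interfaces, swapping `FakeDevice` for another adapter would not change the agent code, which is the property the architecture relies on.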

The core package also exports skill and MCP entry points (`./skill`, `./mcp`), so external agents (for example, OpenClaw via `midscene-skills`) can drive Midscene through a Model Context Protocol interface rather than the JS SDK directly ([packages/core/package.json:24-34]()).

The Chrome extension and the shared recorder module share the same event model: `MidsceneRecorderTarget`, `MidsceneRecorderGeneratedCode`, and `MidsceneRecorderMarkdownScreenshotAsset` describe platform IDs, generated artifacts (Markdown, YAML, Playwright), and screenshot assets, respectively ([packages/shared/src/recorder.ts:1-25]()). This recorder contract is the foundation for two community requests: exporting AI execution steps as reusable Playwright scripts ([issue #2240]()) and handling elements outside the current viewport through recorded event playback ([issue #179]()).
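
To make the export idea concrete, the sketch below turns a recorded event list into Playwright source text. The `RecordedEvent` shape and `toPlaywrightScript` are hypothetical and are not taken from `recorder.ts`; only `page.goto`, `page.mouse.click`, and `page.keyboard.type` in the generated text are real Playwright calls.

```typescript
// Hypothetical event shape; the real recorder types may differ.
type RecordedEvent =
  | { type: "navigate"; url: string }
  | { type: "click"; x: number; y: number }
  | { type: "input"; text: string };

// Emit one line of Playwright code per recorded event.
function toPlaywrightScript(events: RecordedEvent[]): string {
  const lines = events.map((ev): string => {
    switch (ev.type) {
      case "navigate":
        return `await page.goto(${JSON.stringify(ev.url)});`;
      case "click":
        return `await page.mouse.click(${ev.x}, ${ev.y});`;
      case "input":
        // JSON.stringify escapes quotes and newlines in user-entered text.
        return `await page.keyboard.type(${JSON.stringify(ev.text)});`;
    }
  });
  return lines.join("\n");
}

const generatedScript = toPlaywrightScript([
  { type: "navigate", url: "https://example.com" },
  { type: "click", x: 120, y: 48 },
  { type: "input", text: "hello" },
]);
```

Coordinate-based replay is used here only because it needs no selectors; it would be sensitive to viewport size, so a production exporter would likely need a more robust targeting strategy.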

## Platform Support, Extensibility, and Community

The official adapter matrix covers browser (Playwright/Puppeteer), Android, iOS, HarmonyOS, and desktop via `libnut`. The `apps/site/docs` getting-started matrix lists all five platforms ([README.md:65-75]()), and the CLI exposes HarmonyOS as a first-class target by depending on `@midscene/harmony` ([packages/cli/package.json:17-21]()). HarmonyOS support is still maturing — community members continue to ask for richer UI automation on that platform ([issue #1594]()), and the `@midscene/harmony-mcp` package is the experimental MCP bridge for it ([packages/harmony-mcp/package.json]()).

Community extensions sit alongside the official packages. The README curates an "Awesome Midscene" list including `midscene-pc` (a Windows/macOS/Linux agent, [issue #1389]()), `midscene-ios` (iOS mirror automation), Python and Java SDK ports, and Docker images for the PC server. These projects all consume the same `core` abstractions, which is why a third-party device adapter can be added by implementing the platform interface that `@midscene/core` expects.

For teams that want to plug domain knowledge into planning, the project tracks a long-running discussion about Retrieval-Augmented Generation support so that arbitrary LLMs can interpret product-specific instructions ([issue #426]()). Until that lands, the recommended mitigation is to opt in to DOM context for data-extraction and page-understanding tasks while keeping action planning strictly visual, as described in the project README's model strategy section ([README.md:80-90]()).

## See Also

- [Quick Start & Installation](quick-start.md)
- [Core API Reference](api-reference.md)
- [Platform Adapters](platform-adapters.md)
- [Model Strategy & Configuration](model-strategy.md)
- [MCP & Skills Integration](mcp-skills.md)

---

<a id='page-2'></a>

## Core AI Engine, Planning & Model Strategy

### Related Pages

Related topics: [Introduction & System Architecture](#page-1), [Platform Drivers & Device Abstraction](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md)
- [packages/core/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/core/README.md)
- [packages/core/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/core/package.json)
- [packages/web-integration/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/web-integration/README.md)
- [packages/web-integration/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/web-integration/package.json)
- [packages/ios/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/ios/README.md)
- [packages/ios/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/ios/package.json)
- [packages/android/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/android/README.md)
- [packages/cli/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/cli/package.json)
- [packages/shared/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/README.md)
- [apps/chrome-extension/src/extension/recorder/utils.ts](https://github.com/web-infra-dev/midscene/blob/main/apps/chrome-extension/src/extension/recorder/utils.ts)
- [packages/visualizer/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/visualizer/README.md)

</details>

# Core AI Engine, Planning & Model Strategy

## Overview and Scope

Midscene's core AI engine is the vision-driven automation brain shared across every platform integration in the monorepo. Unlike traditional automation tools that read the DOM or the accessibility tree, the engine is built around multimodal models that localize elements from screenshots alone, with natural language used to describe each step. As stated in the main README, "Midscene is all-in on pure vision for UI actions: element localization is based on screenshots only" ([README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md)). This design removes the dependency on fragile selectors and makes the same engine usable for web, Android, iOS, HarmonyOS, and desktop surfaces.

The engine lives in the `@midscene/core` package and is consumed by every downstream integration. The package's package.json describes its role as "Automate browser actions, extract data, and perform assertions using AI. It offers JavaScript SDK, Chrome extension, and support for scripting in YAML" ([packages/core/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/core/package.json)). The exported entry points (e.g., `./ai-model`, `./agent`, `./device`, `./tree`) reveal the engine's internal layering and provide the surface area used by the web, iOS, Android, computer, and CLI packages.

## Architecture and Layering

The engine is delivered as a workspace of cooperating packages rather than a single monolithic module. The core package owns the planning and AI model logic, while thin integration packages translate the abstract "device" interface into platform-specific input mechanisms.

```mermaid
flowchart TB
  User[User / Test Author] -->|natural language| SDK[Platform SDK<br/>@midscene/web, /android, /ios, /computer, /harmony]
  SDK --> Agent[Agent + Task Builder<br/>@midscene/core]
  Agent --> Planning[Planning & Action Parsing<br/>llm-planning + @ui-tars/action-parser]
  Agent --> Inspect[Inspect / Extract<br/>llm-inspect]
  Planning --> Model[Multimodal Model<br/>Qwen / Doubao / GLM / Gemini / UI-TARS]
  Inspect --> Model
  Model -->|structured actions| Planning
  Planning --> Executor[Execution Session]
  Executor --> Device[AbstractDevice<br/>web / android / ios / computer]
  Device -->|screenshot| Inspect
  Device -->|input events| Target[Target Platform]
```

The `@midscene/core` package declares a dependency on `@ui-tars/action-parser` (version 1.2.3), which is the parser that converts raw model output into typed, executable actions ([packages/core/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/core/package.json)). This is the bridge between free-form LLM responses and the deterministic command set that the device layer expects.
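
The idea of turning free-form model output into a deterministic command set can be sketched as follows. This does not reproduce the `@ui-tars/action-parser` API or its input grammar; the JSON format, `ParsedAction`, and `parseModelOutput` are assumptions made purely for illustration.

```typescript
// Hypothetical typed action set; the real parser's output types may differ.
type ParsedAction =
  | { type: "click"; x: number; y: number }
  | { type: "type"; content: string }
  | { type: "scroll"; direction: "up" | "down" };

// Validate untrusted model text and narrow it to a typed action, or throw.
function parseModelOutput(raw: string): ParsedAction {
  // Models often wrap JSON in prose; take the span from the first "{" to the last "}".
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("no JSON object in model output");
  }
  const data: unknown = JSON.parse(raw.slice(start, end + 1));
  if (typeof data !== "object" || data === null) {
    throw new Error("model output is not an object");
  }
  const { type, x, y, content, direction } = data as Record<string, unknown>;

  if (type === "click" && typeof x === "number" && typeof y === "number") {
    return { type: "click", x, y };
  }
  if (type === "type" && typeof content === "string") {
    return { type: "type", content };
  }
  if (type === "scroll" && typeof direction === "string") {
    if (direction === "up" || direction === "down") {
      return { type: "scroll", direction };
    }
  }
  throw new Error("unsupported or malformed action");
}

const parsedAction = parseModelOutput('I will click it. {"type":"click","x":42,"y":7}');
```

Rejecting anything that does not match a known action keeps malformed model responses from reaching the device layer.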

The `@midscene/web` package builds on the core and adds JavaScript SDK, Chrome extension, and YAML scripting surfaces ([packages/web-integration/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/web-integration/package.json)). It also exposes bridge-mode subpaths, allowing an external UI to control a running browser instance. The same pattern is mirrored for iOS via `@midscene/ios`, which ships its own `bin/midscene-ios` CLI and a dedicated `mcp-server` subpath ([packages/ios/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/ios/package.json)). The `@midscene/android` package is the Android counterpart ([packages/android/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/android/README.md)), and the CLI package pulls every platform integration together into a single command-line entry point ([packages/cli/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/cli/package.json)).

The recorder in the Chrome extension shows how the engine's outputs are surfaced to humans. The utility module builds structured "session → page → event" mind maps from captured sequences, preserving input values, element descriptions, and page context for later review ([apps/chrome-extension/src/extension/recorder/utils.ts](https://github.com/web-infra-dev/midscene/blob/main/apps/chrome-extension/src/extension/recorder/utils.ts)). This same execution trace is what the visualizer package renders for debugging ([packages/visualizer/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/visualizer/README.md)).

## Model Strategy

Midscene is explicitly model-agnostic and is driven by multimodal models with strong UI grounding. The README enumerates the supported families: `Qwen3.x`, `Doubao-Seed-2.0`, `GLM-4.6V`, `gemini-3.5-flash`, and `UI-TARS`, "including open-source options you can self-host" ([README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md)). Because localization is screenshot-only, the strategy is to keep the model contract narrow: send a screenshot plus a natural-language instruction, and receive a structured action back.

| Concern | Approach in Midscene |
| --- | --- |
| Element localization | Pure vision on screenshots |
| Action parsing | `@ui-tars/action-parser` post-processor ([packages/core/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/core/package.json)) |
| Data extraction / assertions | Optional DOM-augmented mode for richer understanding ([README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md)) |
| Model choice | Any multimodal model with UI grounding, including self-hosted OSS variants ([README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md)) |
| Execution control | AbstractDevice abstraction, one driver per platform |
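
One way to picture the split between visual actions and optional DOM-augmented understanding is a request builder that attaches DOM context only for extraction-style tasks. `ModelRequest`, `buildModelRequest`, and the `includeDom` flag are hypothetical names for this sketch, not Midscene options.

```typescript
type TaskKind = "act" | "query" | "assert";

interface ModelRequest {
  instruction: string;
  screenshot: string; // always present: localization is screenshot-based
  domSnapshot?: string; // attached only on explicit opt-in
}

function buildModelRequest(
  kind: TaskKind,
  instruction: string,
  screenshot: string,
  domSnapshot?: string,
  includeDom = false, // hypothetical opt-in flag
): ModelRequest {
  const request: ModelRequest = { instruction, screenshot };
  // Actions stay strictly visual; DOM context is only added for query/assert.
  if (kind !== "act" && includeDom && domSnapshot !== undefined) {
    request.domSnapshot = domSnapshot;
  }
  return request;
}

const actRequest = buildModelRequest("act", "click the cart icon", "shot", "dom-text", true);
const queryRequest = buildModelRequest("query", "list product prices", "shot", "dom-text", true);
```

Under these assumptions, `actRequest` never carries DOM text even when the flag is set, while `queryRequest` does.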

This strategy is what makes the same engine work for browser, native mobile, and desktop. It also explains why the community has been able to build out a PC controller (`midscene-pc`) on top of the device interface — issue #1389 specifically requests adding this to the official "Awesome Midscene" list, noting that the project's "integration with any interface" feature is what made the PC device possible.

## Planning, Inspection, and Community Friction Points

The engine splits the LLM call into two complementary paths: a planning path that produces action sequences for `aiAct`-style flows, and an inspect path that produces structured data and assertions. The `ai-model` and `agent` subpath exports in `@midscene/core` are the public seams for these paths ([packages/core/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/core/package.json)). The shared package, described as the home for "AI-powered automation SDK" primitives ([packages/shared/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/README.md)), holds cross-cutting types used by both paths.

Several recurring community requests highlight the limits of the current planning strategy and where it is being extended:

- **RAG over product knowledge.** Issue #426 asks for retrieval-augmented generation so that LLMs can interpret high-level, domain-specific instructions before planning concrete steps. This sits directly on top of the planning path.
- **Elements outside the current viewport.** Issue #179 reports that the planner can only target what is visible. Real pages often require scrolling, which is a planning concern (when to issue a scroll action) as well as an executor concern (the device must support the input).
- **Exporting to Playwright scripts.** Issue #2240 proposes converting recorded AI execution steps into reusable Playwright code. The Chrome extension's recorder already captures the structured event sequence ([apps/chrome-extension/src/extension/recorder/utils.ts](https://github.com/web-infra-dev/midscene/blob/main/apps/chrome-extension/src/extension/recorder/utils.ts)), so the missing piece is a deterministic transpiler from that trace to Playwright.
- **HarmonyOS support.** Issue #1594 asks for HarmonyOS automation, which would round out the platform set. The `@midscene/harmony` and `@midscene/harmony-mcp` packages are already present in the workspace, which the main README points to via the "HarmonyOS" getting-started guide ([README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md)).
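
For the out-of-viewport case in particular, the planning and executor concerns can be illustrated with a bounded scroll-and-retry loop. This is a sketch of the general technique, not Midscene's planner; `ScrollableDevice`, `locateWithScroll`, and the fake page below are hypothetical.

```typescript
interface Point {
  x: number;
  y: number;
}

// Hypothetical device surface: locate() only sees the current viewport.
interface ScrollableDevice {
  locate(description: string): Point | null;
  scrollDown(): boolean; // returns false when the page cannot scroll further
}

// Planner-side policy: scroll until the target is visible or the budget runs out.
function locateWithScroll(
  device: ScrollableDevice,
  description: string,
  maxScrolls = 5,
): Point | null {
  for (let attempt = 0; attempt <= maxScrolls; attempt++) {
    const hit = device.locate(description);
    if (hit) return hit;
    if (attempt === maxScrolls || !device.scrollDown()) break;
  }
  return null;
}

// Fake page made of "screens", each listing the element descriptions it shows.
class FakePage implements ScrollableDevice {
  private screenIndex = 0;
  private readonly screens: string[][];
  constructor(screens: string[][]) {
    this.screens = screens;
  }
  locate(description: string): Point | null {
    const visible = this.screens[this.screenIndex] ?? [];
    return visible.includes(description) ? { x: 50, y: 100 } : null;
  }
  scrollDown(): boolean {
    if (this.screenIndex >= this.screens.length - 1) return false;
    this.screenIndex += 1;
    return true;
  }
}

const longPage = new FakePage([["header"], ["article"], ["footer link"]]);
const foundPoint = locateWithScroll(longPage, "footer link");
const missingPoint = locateWithScroll(new FakePage([["header"]]), "footer link");
```

The scroll budget keeps the loop from running forever on pages that load content endlessly, at the cost of giving up on very deep targets.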

For a technical reader, the practical takeaway is that every new capability — whether RAG, viewport expansion, code export, or a new platform — plugs into the same three layers: the planner that talks to a multimodal model, the parser that turns model output into typed actions, and the `AbstractDevice` that executes them. The core package's export map is the contract to learn first; the platform-specific packages are mostly device implementations plus thin SDK ergonomics.

## See Also

- [Web Integration (`@midscene/web`)](./web-integration.md)
- [CLI and YAML Scripting](./cli.md)
- [Platform Integrations (iOS, Android, HarmonyOS, Computer)](./platform-integrations.md)
- [MCP and Skills](./mcp-and-skills.md)

---

<a id='page-3'></a>

## Platform Drivers & Device Abstraction

### Related Pages

Related topics: [Introduction & System Architecture](#page-1), [Core AI Engine, Planning & Model Strategy](#page-2), [Developer Tools: CLI, MCP, Skills, Studio, Recorder & Visualizer](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md)
- [packages/core/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/core/package.json)
- [packages/core/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/core/README.md)
- [packages/shared/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/README.md)
- [packages/shared/src/recorder.ts](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/src/recorder.ts)
- [packages/shared/src/us-keyboard-layout.ts](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/src/us-keyboard-layout.ts)
- [packages/web-integration/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/web-integration/package.json)
- [packages/web-integration/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/web-integration/README.md)
- [packages/android/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/android/package.json)
- [packages/android/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/android/README.md)
- [packages/ios/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/ios/package.json)
- [packages/harmony-mcp/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/harmony-mcp/package.json)
- [packages/computer-playground/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/computer-playground/README.md)
- [packages/computer-mac/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/computer-mac/package.json)
- [packages/cli/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/cli/package.json)
- [packages/visualizer/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/visualizer/README.md)
</details>

# Platform Drivers & Device Abstraction

## Overview

Midscene is a multimodal, vision-driven UI automation SDK that ships separate **platform driver packages** for each runtime environment it targets, all of which plug into a shared device abstraction layer exposed by `@midscene/core`. The repository is organized as a pnpm monorepo with one package per driver surface — web, Android, iOS, HarmonyOS, and desktop ("computer") — plus shared utilities, a CLI, and a visualizer ([README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md)).

The drivers share a vision-first philosophy: instead of relying on DOM selectors or accessibility trees, every driver captures a screenshot of the current viewport and lets a multimodal model localize the target element. This is what allows the same `aiAct`, `aiQuery`, and `aiAssert` surface API to work uniformly across browsers, native mobile apps, and desktop applications ([README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md)).

The community context for this page reflects this multi-platform design:

- Issue [#1389](https://github.com/web-infra-dev/midscene/issues/1389) is a request to formally add `midscene-pc` (the `computer` driver) to the curated list of integrations, reflecting strong community adoption of the desktop driver.
- Issue [#1594](https://github.com/web-infra-dev/midscene/issues/1594) asks about HarmonyOS support, which already exists as `@midscene/harmony` and `@midscene/harmony-mcp`.
- Issue [#179](https://github.com/web-infra-dev/midscene/issues/179) requests out-of-viewport element targeting, a limitation tied to the screenshot-based device abstraction shared by every driver.

## Package Layout and Driver Inventory

The monorepo's `packages/` directory contains one driver per platform, each published independently with its own CLI and (where applicable) MCP server entry point.

| Package | Target | CLI Binary | MCP Server | Source |
|---|---|---|---|---|
| `@midscene/web` | Browsers (Playwright, Puppeteer, Chrome extension, bridge mode) | — | — | [packages/web-integration/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/web-integration/package.json) |
| `@midscene/android` | Android devices via adb/scrcpy/yadb | `midscene-android` | `./mcp-server` | [packages/android/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/android/package.json) |
| `@midscene/ios` | iOS simulator and devices | `midscene-ios` | `./mcp-server` | [packages/ios/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/ios/package.json) |
| `@midscene/harmony` / `@midscene/harmony-mcp` | HarmonyOS devices | — | yes (separate MCP package) | [packages/harmony-mcp/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/harmony-mcp/package.json) |
| `@midscene/computer` | Desktop OS (Windows/macOS/Linux) | consumed by playground | — | [packages/computer-playground/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/computer-playground/README.md) |
| `@midscene/core` | Engine, agent, device abstraction | — | `./skill` | [packages/core/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/core/package.json) |
| `@midscene/shared` | Cross-driver types (recorder events, key layout) | — | — | [packages/shared/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/README.md) |
| `@midscene/cli` | Unified CLI that bundles all drivers | `midscene` | — | [packages/cli/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/cli/package.json) |

`@midscene/core` is the only package that exports the `./device` subpath (alongside `./agent`, `./skill`, `./ai-model`, `./tree`, and `./yaml`), and every platform driver depends on it via the workspace protocol. For example, `@midscene/cli` lists `@midscene/core`, `@midscene/web`, `@midscene/android`, `@midscene/computer`, `@midscene/harmony`, and `@midscene/ios` as workspace dependencies ([packages/cli/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/cli/package.json)). This composition makes `core` the canonical host of the device abstraction that all drivers implement against.

## The Web Driver: A Driver-of-Drivers

The web package is unusual because it does not bind to a single browser automation backend. `@midscene/web` ships several sub-entry points: driver entry points that each wrap a different browser control surface, plus shared helper modules ([packages/web-integration/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/web-integration/package.json), [packages/web-integration/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/web-integration/README.md)):

- `.` — the default export for direct Playwright/Puppeteer usage.
- `./bridge-mode` — a server-side entry used when an automation host runs in Node and drives a browser elsewhere.
- `./bridge-mode-browser` — the browser-side counterpart loaded into the page being driven.
- `./utils` and `./ui-utils` — helper modules shared by all four drivers.

```mermaid
flowchart LR
    User[Caller Code] --> WebPkg["@midscene/web"]
    WebPkg --> PW[Playwright driver]
    WebPkg --> PP[Puppeteer driver]
    WebPkg --> CE[Chrome extension driver]
    WebPkg --> BM[Bridge mode]
    PW --> Core["@midscene/core<br/>(device abstraction + agent)"]
    PP --> Core
    CE --> Core
    BM --> Core
    Core --> LLM[Multimodal model]
```

The `bridge-mode` split exists so that an MCP server (such as `@midscene/android`'s `./mcp-server`) can live on a developer machine while the browser is on another host, with JSON messages flowing across the bridge — a pattern reused by the mobile drivers ([packages/web-integration/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/web-integration/README.md)).
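
The message flow across such a bridge can be sketched as a JSON request/response envelope with an ID for correlation. The envelope fields, `encodeCall`, and `handleCall` are hypothetical and do not describe the actual bridge-mode wire format.

```typescript
// Hypothetical wire format; the real bridge protocol may differ.
interface BridgeCall {
  id: number;
  method: string;
  args: unknown[];
}

interface BridgeReply {
  id: number;
  ok: boolean;
  result?: unknown;
  error?: string;
}

type Handler = (...args: unknown[]) => unknown;

// Host side: serialize a call so it can cross a socket or message channel.
function encodeCall(id: number, method: string, args: unknown[]): string {
  const call: BridgeCall = { id, method, args };
  return JSON.stringify(call);
}

// Browser side: decode, dispatch to a registered handler, and encode the reply.
function handleCall(wire: string, handlers: Map<string, Handler>): string {
  const call = JSON.parse(wire) as BridgeCall;
  const handler = handlers.get(call.method);
  let reply: BridgeReply;
  if (!handler) {
    reply = { id: call.id, ok: false, error: `unknown method: ${call.method}` };
  } else {
    try {
      reply = { id: call.id, ok: true, result: handler(...call.args) };
    } catch (err) {
      reply = { id: call.id, ok: false, error: String(err) };
    }
  }
  return JSON.stringify(reply);
}

const bridgeHandlers = new Map<string, Handler>();
bridgeHandlers.set("tap", (x, y) => `tapped ${String(x)},${String(y)}`);

const okReply = JSON.parse(handleCall(encodeCall(1, "tap", [10, 20]), bridgeHandlers)) as BridgeReply;
const badReply = JSON.parse(handleCall(encodeCall(2, "fly", []), bridgeHandlers)) as BridgeReply;
```

Echoing the request `id` in every reply lets the host match responses to pending calls even when several are in flight.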

## Mobile and Desktop Drivers

The Android and iOS drivers each expose their own CLI binary and a `./mcp-server` sub-entry, so they can be invoked either programmatically or surfaced to an external agent via the Model Context Protocol ([packages/android/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/android/package.json), [packages/ios/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/ios/package.json)). The Android package notably runs a prebuild step that downloads `scrcpy-server` and `yadb` binaries, the native components that stream the device screen and inject input events ([packages/android/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/android/package.json)). HarmonyOS differs: its MCP support is split into a separate `harmony-mcp` package, which depends on `@modelcontextprotocol/sdk` and the MCP inspector for testing ([packages/harmony-mcp/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/harmony-mcp/package.json)).

The desktop driver is layered differently. `@midscene/computer` provides the core automation primitives, `@midscene/computer-mac` adds macOS-specific packaging ([packages/computer-mac/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/computer-mac/package.json)), and `@midscene/computer-playground` is a runnable playground app for Windows, macOS, and Linux ([packages/computer-playground/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/computer-playground/README.md)). This split lets the same engine run on any host OS while the playground provides a UI for manual exploration.

## Shared Device Layer

All drivers converge on `@midscene/core`, which exports a `./device` subpath alongside `./agent`, `./skill`, `./ai-model`, and `./tree` ([packages/core/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/core/package.json)). The device module is what gives every platform driver a uniform shape: a way to obtain the current screenshot, perform pointer/keyboard input, navigate, and observe the result, regardless of whether the underlying transport is Playwright, adb, libnut, or a HarmonyOS-specific protocol.

The `@midscene/shared` package carries the cross-driver data types that travel across that abstraction. `recorder.ts` defines `MidsceneRecorderEvent`, `MidsceneRecorderTarget`, and `MidsceneRecorderGeneratedCode` — the canonical event stream used by every platform's recording feature, with `platformId` discriminators that tell consumers whether an event came from a web, Android, iOS, Harmony, or computer session ([packages/shared/src/recorder.ts](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/src/recorder.ts)). The `us-keyboard-layout.ts` module, adapted from the Puppeteer keyboard layout, provides a normalized `KeyInput` union plus `KeyDefinition` records so that the web, Android, and computer drivers can describe keystrokes using the same identifier space ([packages/shared/src/us-keyboard-layout.ts](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/src/us-keyboard-layout.ts)).
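
The shared identifier space for keystrokes can be pictured with a trimmed-down layout table. The field names below (`key`, `code`, `keyCode`, `shiftKey`) follow Puppeteer's US keyboard layout, but the three-entry subset and the `resolveKey` helper are illustrative assumptions rather than the contents of `us-keyboard-layout.ts`.

```typescript
// Illustrative subset; a full layout covers every key on a US keyboard.
type KeyInput = "Enter" | "KeyA" | "Digit1";

interface KeyDefinition {
  key: string; // value produced without Shift
  code: string; // physical key identifier
  keyCode: number; // legacy numeric code
  shiftKey?: string; // value produced with Shift, if different
}

const keyLayout: Record<KeyInput, KeyDefinition> = {
  Enter: { key: "Enter", code: "Enter", keyCode: 13 },
  KeyA: { key: "a", code: "KeyA", keyCode: 65, shiftKey: "A" },
  Digit1: { key: "1", code: "Digit1", keyCode: 49, shiftKey: "!" },
};

// Resolve an identifier to the concrete values a driver would send.
function resolveKey(input: KeyInput, shift = false): { key: string; code: string; keyCode: number } {
  const def = keyLayout[input];
  const key = shift && def.shiftKey ? def.shiftKey : def.key;
  return { key, code: def.code, keyCode: def.keyCode };
}

const shiftedA = resolveKey("KeyA", true);
const plainOne = resolveKey("Digit1");
```

Because `KeyInput` is a closed union, a typo such as `"KeyAA"` would be rejected at compile time instead of failing on the device.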

Together these pieces implement the same vision-driven loop the README advertises: capture a screenshot from whichever device the driver controls, ask the multimodal model to localize the target, perform the input through the driver, and repeat until the assertion or query succeeds ([README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md)).

## See Also

- [README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md) — project overview and supported platforms
- [Model Strategy](https://midscenejs.com/model-strategy) — multimodal models used by every driver
- [Bridge Mode](https://midscenejs.com/bridge-mode.html) — the web driver's bridge split
- [Playwright Integration](https://midscenejs.com/integrate-with-playwright.html) — web driver usage
- [Puppeteer Integration](https://midscenejs.com/integrate-with-puppeteer.html) — web driver usage
- [Android Getting Started](https://midscenejs.com/android-getting-started.html)
- [iOS Getting Started](https://midscenejs.com/ios-getting-started.html)
- [HarmonyOS Getting Started](https://midscenejs.com/harmony-getting-started.html)
- [Computer (Desktop) Getting Started](https://midscenejs.com/computer-getting-started.html)

---

<a id='page-4'></a>

## Developer Tools: CLI, MCP, Skills, Studio, Recorder & Visualizer

### Related Pages

Related topics: [Introduction & System Architecture](#page-1), [Core AI Engine, Planning & Model Strategy](#page-2), [Platform Drivers & Device Abstraction](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md)
- [packages/cli/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/cli/package.json)
- [packages/visualizer/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/visualizer/README.md)
- [packages/core/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/core/package.json)
- [packages/core/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/core/README.md)
- [packages/harmony-mcp/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/harmony-mcp/package.json)
- [packages/web-integration/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/web-integration/package.json)
- [packages/web-integration/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/web-integration/README.md)
- [packages/shared/src/recorder.ts](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/src/recorder.ts)
- [packages/shared/src/cli/cli-args.ts](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/src/cli/cli-args.ts)
- [packages/shared/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/README.md)
- [apps/chrome-extension/src/extension/recorder/utils.ts](https://github.com/web-infra-dev/midscene/blob/main/apps/chrome-extension/src/extension/recorder/utils.ts)
</details>

Midscene ships a family of developer-facing tools that wrap the core vision-driven automation engine in different interfaces: a runnable command-line tool, an MCP server for AI agents, a Skill layer for autonomous execution, a Studio/Visualizer for inspecting runs, and a Chrome-extension Recorder that captures user actions and converts them into reusable artifacts. Together they let users operate Midscene from terminal sessions, IDE-embedded agents, in-browser debuggers, or recorder-style authoring flows. Source: [README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md).

## High-Level Architecture

The following diagram shows how the developer tools relate to the core automation engine.

```mermaid
flowchart LR
    User[User / Agent]
    CLI["@midscene/cli<br/>(midscene binary)"]
    MCP["@midscene/mcp<br/>&amp; @midscene/harmony-mcp"]
    Skill["@midscene/core/skill"]
    Recorder["Chrome Extension<br/>Recorder"]
    Studio["Studio / Visualizer"]
    Core["@midscene/core"]
    Engines["Web / Android / iOS /<br/>Harmony / Computer"]

    User --> CLI
    User --> MCP
    Agent[AI Agent] --> MCP
    Agent --> Skill
    User --> Recorder
    Recorder --> Studio
    CLI --> Core
    MCP --> Core
    Skill --> Core
    Core --> Engines
    Studio -.inspect.-> Core
```

Source: [packages/cli/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/cli/package.json), [packages/core/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/core/package.json), [packages/harmony-mcp/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/harmony-mcp/package.json).

## Command-Line Interface (`@midscene/cli`)

The CLI is the terminal entry point for running YAML scripts, batch tests, and bridge-mode sessions. It is published as `@midscene/cli` at version `1.9.7` and exposes a single binary named `midscene` (`bin.midscene: ./bin/midscene`). Source: [packages/cli/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/cli/package.json).

The CLI aggregates every platform adapter as workspace dependencies, so a single install supports web, Android, iOS, HarmonyOS, and computer-control automation:

| Dependency | Purpose |
| --- | --- |
| `@midscene/web` | Browser automation via Playwright/Puppeteer |
| `@midscene/android` | Android device automation |
| `@midscene/ios` | iOS simulator automation |
| `@midscene/harmony` | HarmonyOS automation |
| `@midscene/computer` | Desktop OS control (mouse/keyboard via libnut) |
| `@midscene/core` / `@midscene/shared` | Engine and shared types |

Source: [packages/cli/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/cli/package.json).

Common npm scripts include `build` (rslib), `test` (vitest), and `test:ai` (`AITEST=true`) for running AI-evaluated suites. A `bridge-mode` test variant is exposed as `test:ai:temp` (`AITEST=true BRIDGE_MODE=true`), indicating the CLI can run the Web integration in bridge mode where the page under test is driven by an already-open browser. Source: [packages/cli/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/cli/package.json), [packages/web-integration/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/web-integration/README.md).

CLI argument handling is centralized in `packages/shared/src/cli/cli-args.ts`. It normalizes option spellings (`canonicalizeCliArgKeys`), detects disallowed aliases (`buildDisallowedCliSpellings`), and renders Zod validation failures with `formatCliValidationError`. This means every tool in the monorepo — including `midscene`, `midscene-ios`, and the MCP servers — shares one consistent schema-driven argument layer. Source: [packages/shared/src/cli/cli-args.ts](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/src/cli/cli-args.ts).
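The normalize-then-reject flow can be approximated in a few lines. This is not the code in `cli-args.ts`: the helper names (`kebabToCamel`, `canonicalizeArgs`, `SketchCliError`), the kebab-case to camelCase rule, and the example option names are assumptions made for illustration, and schema validation with Zod is left out.

```typescript
// Hypothetical error type; the real CLI error class may differ.
class SketchCliError extends Error {}

// Assumed normalization rule: "device-id" becomes "deviceId".
function kebabToCamel(key: string): string {
  return key.replace(/-([a-z0-9])/g, (_match: string, ch: string) => ch.toUpperCase());
}

// rejectedSpellings maps a disallowed spelling to the spelling users should pass instead.
function canonicalizeArgs(
  rawArgs: Record<string, unknown>,
  rejectedSpellings: Map<string, string>,
): Record<string, unknown> {
  const canonical: Record<string, unknown> = {};
  for (const rawKey of Object.keys(rawArgs)) {
    const preferred = rejectedSpellings.get(rawKey);
    if (preferred !== undefined) {
      // Name both spellings so the user can fix the call quickly.
      throw new SketchCliError(`Unsupported option "${rawKey}"; use "${preferred}" instead.`);
    }
    canonical[kebabToCamel(rawKey)] = rawArgs[rawKey];
  }
  return canonical;
}

const sketchRejected = new Map([["deviceid", "device-id"]]);
const parsedArgs = canonicalizeArgs(
  { "device-id": "emulator-5554", headed: true },
  sketchRejected,
);
```

Keeping the rejected spellings in a lookup table, rather than silently accepting them, means every tool that shares the helper fails the same way for the same typo.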

## Model Context Protocol Server (`@midscene/mcp` and `@midscene/harmony-mcp`)

Midscene exposes its automation as MCP tools so external AI agents (Claude Desktop, MCP Inspector, custom agents) can call actions like `aiAct`, `aiAssert`, and `aiQuery`. The HarmonyOS-specific MCP server (`@midscene/harmony-mcp`) depends on `@modelcontextprotocol/sdk@1.10.2` and the `@modelcontextprotocol/inspector` for local debugging. Source: [packages/harmony-mcp/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/harmony-mcp/package.json).
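Conceptually, exposing actions as tools means mapping a tool name from the agent's request onto a method of the automation agent. The sketch below shows that dispatch step only. It does not use `@modelcontextprotocol/sdk`, the names `SketchAutomationAgent`, `sketchToolTable`, and `callSketchTool` are hypothetical, and the stub methods are synchronous for brevity, whereas real automation calls would be asynchronous.

```typescript
// Hypothetical agent surface; only the method names aiAct/aiAssert/aiQuery come from the text.
interface SketchAutomationAgent {
  aiAct(prompt: string): string;
  aiAssert(prompt: string): boolean;
  aiQuery(prompt: string): unknown;
}

type SketchToolHandler = (agent: SketchAutomationAgent, prompt: string) => unknown;

// Tool name -> handler. A Map avoids accidental hits on inherited object keys.
const sketchToolTable = new Map<string, SketchToolHandler>([
  ["aiAct", (agent, prompt) => agent.aiAct(prompt)],
  ["aiAssert", (agent, prompt) => agent.aiAssert(prompt)],
  ["aiQuery", (agent, prompt) => agent.aiQuery(prompt)],
]);

function callSketchTool(
  agent: SketchAutomationAgent,
  toolName: string,
  prompt: string,
): unknown {
  const handler = sketchToolTable.get(toolName);
  if (handler === undefined) {
    throw new Error(`Unknown tool: ${toolName}`);
  }
  return handler(agent, prompt);
}

// Stub agent so the dispatch can be exercised without a browser or a model.
const stubAgent: SketchAutomationAgent = {
  aiAct: (prompt) => `performed: ${prompt}`,
  aiAssert: (prompt) => prompt.length > 0,
  aiQuery: (prompt) => ({ question: prompt, answer: "stub" }),
};

const actResult = callSketchTool(stubAgent, "aiAct", "open the settings page");
```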

The MCP layer is published under subpath exports from `@midscene/core` (`./mcp` resolves to `./dist/types/skill/index.d.ts` and `./dist/es/skill/index.mjs`), which means the same Skill module is reused for both standalone scripts and MCP-exposed tools. Source: [packages/core/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/core/package.json). Community interest in this area is high: a HarmonyOS UI automation request (issue #1594) was raised explicitly to complete the cross-OS surface, and the MCP integration itself is highlighted as a showcase in the README. Source: [README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md).

## Skills, Studio, Recorder, and Visualizer

### Skills

Skills are the high-level, agent-friendly layer exported from `@midscene/core/skill`. They wrap the underlying action planner so an AI agent can issue multi-step instructions rather than per-element commands. The README frames Skills as one of two testing modes: add Midscene to Playwright/Vitest, or let an AI agent test autonomously via Skills and MCP. Source: [README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md), [packages/core/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/core/package.json).

### Chrome Extension Recorder

The Recorder is implemented inside `apps/chrome-extension/src/extension/recorder`. It captures pointer, input, and navigation events, then summarizes them with an LLM call that includes screenshots of the session. The summarizer prompt explicitly requests a hierarchical Mermaid mindmap and a short title/description JSON, preserving the exact sequence from the recording. Source: [apps/chrome-extension/src/extension/recorder/utils.ts](https://github.com/web-infra-dev/midscene/blob/main/apps/chrome-extension/src/extension/recorder/utils.ts).

The recorder emits three reusable artifact formats declared in the shared recorder types:

| Artifact | Field | Use |
| --- | --- | --- |
| Markdown report | `MidsceneRecorderGeneratedCode.markdown` | Human-readable doc with embedded screenshots |
| YAML script | `MidsceneRecorderGeneratedCode.yaml` | Re-runnable script for the CLI |
| Playwright code | `MidsceneRecorderGeneratedCode.playwright` | Conventional Playwright test source |

Source: [packages/shared/src/recorder.ts](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/src/recorder.ts). Community issue #2240 explicitly requests exporting AI execution steps as Playwright scripts, which aligns with this `playwright` output format. Source: [apps/chrome-extension/src/extension/recorder/utils.ts](https://github.com/web-infra-dev/midscene/blob/main/apps/chrome-extension/src/extension/recorder/utils.ts).

Pending descriptions in the recorder UI are detected by checking the literal string `AI is analyzing element...`, which is a deliberate sentinel so the UI can show a spinner while the model plans the next action. Source: [packages/shared/src/recorder.ts](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/src/recorder.ts).
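The artifact fields and the pending sentinel described above can be combined in a short sketch. The interface `GeneratedCodeSketch` assumes the three fields are optional strings, which the source does not confirm, and the helper names `isPendingDescription` and `readyFormats` are hypothetical. Only the field names and the sentinel text are taken from the description above.

```typescript
// Assumed shape: each artifact is an optional string that is filled in once generated.
interface GeneratedCodeSketch {
  markdown?: string;
  yaml?: string;
  playwright?: string;
}

type ArtifactFormat = keyof GeneratedCodeSketch;

// Sentinel text quoted in the paragraph above.
const PENDING_DESCRIPTION = "AI is analyzing element...";

// True while the description is still the placeholder, so a UI could show a spinner.
function isPendingDescription(description: string): boolean {
  return description === PENDING_DESCRIPTION;
}

// Lists the artifact formats that already contain non-empty content.
function readyFormats(code: GeneratedCodeSketch): ArtifactFormat[] {
  const formats: ArtifactFormat[] = ["markdown", "yaml", "playwright"];
  return formats.filter((format) => {
    const content = code[format];
    return typeof content === "string" && content !== "";
  });
}

const partialArtifacts: GeneratedCodeSketch = { yaml: "tasks: []", playwright: "" };
const availableFormats = readyFormats(partialArtifacts);
```

An exact string comparison is brittle if the placeholder is ever localized or reworded, so a caller relying on this pattern may prefer to import the constant from one shared place rather than repeat the literal.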

### Studio / Visualizer

The Visualizer (`@midscene/visualizer`) is a standalone package that renders the screenshots, plans, and assertions produced during a run for human inspection. It shares the same description used across the SDK: "An AI-powered automation SDK can control the page, perform assertions, and extract data in JSON format using natural language." Source: [packages/visualizer/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/visualizer/README.md), [packages/shared/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/README.md).

Studio is the natural companion to the Recorder: a recorded session can be replayed through the CLI and then opened in the Visualizer to step through each AI decision. This addresses the inspection need behind community requests to debug AI-driven tests and verify the rendered UI state, not just DOM presence. Source: [README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md).

## Common Failure Modes and Notes

- **Unrecognized CLI flags** — Calling the CLI with a deprecated spelling raises a `CLIError` listing both the canonical and the rejected name. Pass option spellings defined in the tool's `def.schema` only. Source: [packages/shared/src/cli/cli-args.ts](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/src/cli/cli-args.ts).
- **Bridge-mode flakiness** — Tests launched with `BRIDGE_MODE=true` assume a browser is already attached via `@midscene/web/bridge-mode`; running them in plain Node will hang waiting for the bridge. Source: [packages/cli/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/cli/package.json), [packages/web-integration/package.json](https://github.com/web-infra-dev/midscene/blob/main/packages/web-integration/package.json).
- **Viewport-only targeting** — Issue #179 documents that off-viewport elements currently require manual scrolling. The Recorder and Skills rely on the same viewport-restricted planner, so long pages or staged overlays may need an explicit scroll instruction. Source: [apps/chrome-extension/src/extension/recorder/utils.ts](https://github.com/web-infra-dev/midscene/blob/main/apps/chrome-extension/src/extension/recorder/utils.ts).
- **AI-driven planning quality** — Issue #426 highlights that generic LLMs may misinterpret domain-specific instructions; Skills and Recorder summaries both depend on the model's ability to map natural language to UI semantics, so prompt quality and screenshot context are critical. Source: [apps/chrome-extension/src/extension/recorder/utils.ts](https://github.com/web-infra-dev/midscene/blob/main/apps/chrome-extension/src/extension/recorder/utils.ts).
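One way to avoid the bridge-mode hang is to decide which suites to run from the environment before any test starts. The sketch below is hypothetical: `selectSuites` and the suite labels are invented for illustration, and only the `AITEST` and `BRIDGE_MODE` variable names come from the npm scripts described earlier.

```typescript
// Environment is passed in explicitly so the function is easy to test.
type EnvSketch = Record<string, string | undefined>;

function selectSuites(env: EnvSketch): string[] {
  const suites = ["unit"];
  if (env["AITEST"] === "true") {
    suites.push("ai");
    // Bridge tests need an already-attached browser, so they are opt-in.
    if (env["BRIDGE_MODE"] === "true") {
      suites.push("bridge");
    }
  }
  return suites;
}

const defaultSuites = selectSuites({});
const bridgeSuites = selectSuites({ AITEST: "true", BRIDGE_MODE: "true" });
```

In a Node script, the same function could be called with `process.env`, so bridge tests are skipped rather than left waiting when no browser is attached.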

## See Also

- Core engine: `@midscene/core` — [packages/core/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/core/README.md)
- Web integration & bridge mode: [packages/web-integration/README.md](https://github.com/web-infra-dev/midscene/blob/main/packages/web-integration/README.md)
- Shared recorder types: [packages/shared/src/recorder.ts](https://github.com/web-infra-dev/midscene/blob/main/packages/shared/src/recorder.ts)
- Project README: [README.md](https://github.com/web-infra-dev/midscene/blob/main/README.md)

---


## Pitfall Log

Project: web-infra-dev/midscene

Summary: Found 10 structured pitfall items, including 1 high/blocking item. Top priority: Configuration risk - Configuration risk requires verification.

## 1. Configuration risk - Configuration risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/web-infra-dev/midscene/issues/2063

## 2. Identity risk - Identity risk requires verification

- Severity: medium
- Evidence strength: runtime_trace
- Finding: Project evidence flags an identity risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Repro command: `npm install @midscene/web`
- Evidence: identity.distribution | https://github.com/web-infra-dev/midscene

## 3. Installation risk - Installation risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/web-infra-dev/midscene/issues/2689

## 4. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/web-infra-dev/midscene/issues/2691

## 5. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | https://github.com/web-infra-dev/midscene

## 6. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/web-infra-dev/midscene

## 7. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | https://github.com/web-infra-dev/midscene

## 8. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | https://github.com/web-infra-dev/midscene

## 9. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/web-infra-dev/midscene

## 10. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | https://github.com/web-infra-dev/midscene
