OSWorld Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

OSWorld

[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

OSWorld Overview & System Architecture

Related topics: VM Providers, Desktop Environment & Server, Agent Implementations, Evaluators & Benchmark Tasks, Deployment, Workflows & Common Failure Modes

Purpose and Scope

OSWorld is a scalable, real computer-environment benchmark for multimodal agents. It provides a controlled desktop environment (Ubuntu/Windows/macOS guests hosted on VMware, VirtualBox, Docker, or AWS) in which an agent receives a natural-language task, observes the screen (and optionally the accessibility tree), executes actions, and is then scored by deterministic evaluators against an expected end state.

The project defines four pillars, all of which are implemented as runnable Python in this repository:

Environment providers: pluggable back-ends that boot and reset the guest OS (VMware, VirtualBox, Docker, AWS).
Agents: multiple reference implementations (vlaa_gui, os_symphony, coact, maestro) that translate observations into actions.
Evaluators: per-task scoring functions that grade the final state of the guest.
Monitor: a Flask-based web dashboard that visualises running tasks and trajectories.

The README states that credentials default to user / password for local providers and osworld-public-evaluation for AWS, and that some evaluators require sudo, so the client must know the password. Source: README.md:1-200.

The current public release is v0.1.16, which adds VirtualBox snapshots, AWS support with optimized interface mappings, fixes for annotation bugs, and broader model support (Gemini 1.5-Pro, Llama-3, Qwen). Source: release notes for v0.1.16.

High-Level Architecture

OSWorld separates environment control (the guest desktop) from agent reasoning (a Python process that talks to the guest via RPC and screenshots) and from evaluation (a deterministic post-run grader). All three are coordinated by a top-level runner that streams results into the Monitor.

flowchart LR
    A[Runner / quickstart.py] --> B[Agent Process]
    B -- screenshot / a11y tree --> A
    A -- action code --> C[Guest VM / Container]
    C --> D[desktop_env Server]
    D -- clipboard / file ops --> B
    A --> E[Evaluator]
    E --> F[Result JSON]
    F --> G[Monitor Dashboard]

The desktop side is mediated by a Python server embedded in the guest that exposes clipboard, file, and accessibility-tree RPCs to the host agent. desktop_env/server/main.py defines platform-specific accessibility namespaces for ubuntu, windows, and macos, with MAX_DEPTH and MAX_WIDTH bounds for tree traversal; this is the protocol agents consume when they request an a11y tree observation. Source: desktop_env/server/main.py:1-60.
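To make the MAX_DEPTH and MAX_WIDTH bounds concrete, here is a minimal sketch of a bounded tree traversal. It is not the code in desktop_env/server/main.py: the dict-based node shape, the helper names, and the bound values (3 and 2) are assumptions chosen only for illustration.

```python
from typing import Any, Dict

# Hypothetical bounds for the demo; the real values live in desktop_env/server/main.py.
MAX_DEPTH = 3
MAX_WIDTH = 2


def bounded_tree(node: Dict[str, Any], depth: int = 0) -> Dict[str, Any]:
    """Copy `node`, keeping at most MAX_WIDTH children per node and MAX_DEPTH levels below the root."""
    copy: Dict[str, Any] = {"name": node["name"], "children": []}
    if depth >= MAX_DEPTH:
        return copy  # depth budget exhausted: keep the node, drop its subtree
    for child in node.get("children", [])[:MAX_WIDTH]:
        copy["children"].append(bounded_tree(child, depth + 1))
    return copy


def count_nodes(node: Dict[str, Any]) -> int:
    """Count a node plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node["children"])


# Toy tree: a window with three buttons, the first of which has a nested label.
toy = {
    "name": "window",
    "children": [
        {"name": "button-1", "children": [{"name": "label", "children": []}]},
        {"name": "button-2", "children": []},
        {"name": "button-3", "children": []},
    ],
}
pruned = bounded_tree(toy)  # button-3 is dropped by the width bound
```

Bounding both dimensions keeps the serialized tree small enough to fit in a model prompt, at the cost of hiding deeply nested or late-listed widgets.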

Core Components

Environment Providers and the Desktop Server

The guest runs a server (desktop_env/server/main.py) that brokers interactions between the host agent and the operating system. It handles:

Accessibility tree construction with per-platform namespaces (e.g., st, attr, cp, doc, docattr, txt, val, act, class for Ubuntu). Source: desktop_env/server/main.py:1-40.
Per-application setup: evaluators require libraries such as python-pptx, python-docx, odfpy, openpyxl, pandas, lxml, and xmltodict to be pre-installed inside the guest. Source: desktop_env/evaluators/README.md:1-80.
LibreOffice headless conversion — evaluators rely on libreoffice --convert-to "csv:Text - txt - csv (StarCalc):44,34,UTF8,..." to materialise intermediate files. Source: desktop_env/evaluators/README.md:30-50.

Community reports highlight two recurring operational issues: (1) the published Ubuntu.qcow2 snapshot can boot with a Snap Store "software updates available" popup that derails screenshot-based agents (#515), and (2) Chrome DevTools port forwarding through a host proxy can return 400 even with a verified clean image (#495).

Agent Implementations

OSWorld ships several reference agents that differ in how they plan and act. All of them share the same observation/action contract with desktop_env, but differ in prompting and tooling.

vlaa_gui Worker — a generator/grounding split. The Worker class wires a generator agent to a grounding ACI agent, optionally invokes a GateAgent to propose completion criteria on step 0, and injects a recon_context block on the first turn so the agent is aware of read-only pre-task inspection. Source: mm_agents/vlaa_gui/agents/worker.py:1-120.
os_symphony Worker — an orchestrator that delegates intent to a SearcherAgent (which itself wraps pyautogui calls such as click and type) and feeds search results back into the generator as a one-shot prompt injection (then resets last_search_agent_result to avoid re-injection). Source: mm_agents/os_symphony/agents/worker.py:1-80 and mm_agents/os_symphony/agents/searcher_agent.py:1-80.
os_symphony ACI — the Search Agent wrapper that issues "How to …" queries scoped to a single application, then injects returned tutorials into the generator system prompt. Source: mm_agents/os_symphony/agents/os_aci.py:1-40.
coact — an AutoGen-derived multi-agent stack; it pulls in Teachability (ChromaDB-backed long-term memory), VisionCapability, ToolsCapability, TransformMessages (history/token limiters), LLMLingua compression, and ImageGenerator. Source: mm_agents/coact/autogen/agentchat/contrib/capabilities/teachability.py:1-60 and mm_agents/coact/autogen/agentchat/contrib/capabilities/vision_capability.py:1-60.
maestro — shares utility helpers (safe_write_json, locked file locks, generate_uuid) used by other agents. Source: mm_agents/maestro/utils/README.md:1-40.

Evaluators

Evaluators are deterministic functions that compare the final guest state to a per-task spec. The setup guide explicitly warns that the LibreOffice "no popup on Ctrl+S" flag must be enabled inside the guest, and that system crash reports must be disabled via apport so evaluators are not derailed by background dialogs. Source: desktop_env/evaluators/README.md:1-20.

Community issue #518 flags a real correctness gap: several feasible-task evaluators return reward=1 on loose substring matches without verifying that the agent *caused* the change, leading to false positives. Issue #430 shows the inverse problem — a feasible task annotated against a non-existent target page.

Monitor Dashboard

The Monitor is a Flask app that reads the runner's results directory and renders per-task pages with screenshots, step counts, and final scores. Configuration is via a .env file with the following key variables:

Variable	Purpose	Default
`TASK_CONFIG_PATH`	Path to the task configuration JSON	`../evaluation_examples/test.json`
`EXAMPLES_BASE_PATH`	Base directory for per-example assets	`../evaluation_examples/examples`
`RESULTS_BASE_PATH`	Directory of run result JSONs	`../results`
`ACTION_SPACE`	Action vocabulary (`pyautogui`, `keyboard`)	`pyautogui`
`OBSERVATION_TYPE`	Observation mode (`screenshot`, `video`)	`screenshot`
`MODEL_NAME`	Identifier of the model under test	`computer-use-preview`
`MAX_STEPS`	Maximum step count to display	`150`
`FLASK_PORT` / `FLASK_HOST`	Server bind	`80` / `0.0.0.0`

Source: monitor/README.md:1-60. The README explicitly warns: *"Make sure you run the monitor after the main runner has started executing tasks. Otherwise, it may cause issues when executing tasks."* Source: monitor/README.md:1-20.
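As an illustration of how the variables in the table fit together, the sketch below reads them from the process environment and falls back to the documented defaults. It is not the Monitor's own loader: the MonitorSettings class and load_settings function are hypothetical names, and the sketch assumes the .env file has already been exported into the environment by whatever tool you use.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class MonitorSettings:
    """Hypothetical container for the Monitor variables listed in the table."""
    task_config_path: str
    examples_base_path: str
    results_base_path: str
    action_space: str
    observation_type: str
    model_name: str
    max_steps: int
    flask_port: int
    flask_host: str


def load_settings(env: Mapping[str, str] = os.environ) -> MonitorSettings:
    """Read each variable, falling back to the default documented in the table."""
    return MonitorSettings(
        task_config_path=env.get("TASK_CONFIG_PATH", "../evaluation_examples/test.json"),
        examples_base_path=env.get("EXAMPLES_BASE_PATH", "../evaluation_examples/examples"),
        results_base_path=env.get("RESULTS_BASE_PATH", "../results"),
        action_space=env.get("ACTION_SPACE", "pyautogui"),
        observation_type=env.get("OBSERVATION_TYPE", "screenshot"),
        model_name=env.get("MODEL_NAME", "computer-use-preview"),
        max_steps=int(env.get("MAX_STEPS", "150")),    # numeric values arrive as strings
        flask_port=int(env.get("FLASK_PORT", "80")),
        flask_host=env.get("FLASK_HOST", "0.0.0.0"),
    )


defaults = load_settings({})  # an empty mapping yields the documented defaults
```

Passing an explicit mapping, as in the last line, makes it easy to check that RESULTS_BASE_PATH points at the runner's real output directory before Flask is started.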

Common Usage Patterns and Known Failure Modes

Quickstart. The canonical entry point is python quickstart.py --provider_name <vmware|virtualbox|docker|aws> .... The README documents the credentials and proxy flow required when running behind the GFW or against cloud VMs. Source: README.md:1-200.

Recurring community topics. Reported friction includes: benchmarking Qwen3.5-VL with very low success rates (#441), display-config breakage on Aliyun (#480), VMware Fusion setup on Apple Silicon (#407), and a proposal to add trace-level failure attribution to surface hidden failure modes (#514). There is also evidence that a pixel-blind CLI agent can outperform a vision agent on test_all (77.9% vs 64.3%), motivating alternative observation modes (#517).

Operational checklist. Before each run, the project recommends: (1) installing the per-evaluator libraries inside the guest (desktop_env/evaluators/README.md:1-80), (2) disabling apport so crash dialogs cannot surface in the agent's screenshot (desktop_env/evaluators/README.md:1-20), and (3) ensuring the Monitor's .env points at the runner's actual results directory before starting Flask (monitor/README.md:1-60).

VM Providers, Desktop Environment & Server

Related topics: OSWorld Overview & System Architecture, Agent Implementations, Evaluators & Benchmark Tasks, Deployment, Workflows & Common Failure Modes

1. Overview

OSWorld is a benchmark for evaluating multimodal agents on real desktop operating systems. The execution substrate is built around three coupled layers:

VM Provider layer — abstractions for VMware, VirtualBox, Docker, and AWS that boot and reset a guest OS.
Desktop Environment (DesktopEnv) — the Python orchestrator that wires provider, controller, and registry into a Gym-like API consumed by agents.
In-VM Server — a long-lived service running inside the guest that exposes the accessibility tree, screenshots, and application control ports back to the host.

Together, these layers let a host-side agent observe pixels/accessibility nodes, execute keystrokes/clicks, and trigger evaluations deterministically. Source: README.md:1-120.

2. VM Providers

OSWorld supports four providers, each selected via --provider_name on run.py / run_multienv.py:

Provider	Use Case	Notes
`vmware`	Local x86 / Apple silicon (Fusion)	Free for personal use; recommended baseline. Source: desktop_env/providers/vmware/INSTALL_VMWARE.md:1-40.
`virtualbox`	Free snapshots	Added in v0.1.16 for users without VMware licensing. Source: README.md:90-110.
`docker`	Cloud / CI / parallel runs	Uses `happysixd/osworld-docker`; supports `--vm_secret_mount` for host-side credentials. Source: README.md:60-90.
`aws`	Cloud benchmarks	Optimized instance types and interface mappings ship with v0.1.16. Source: README.md:90-110.

VMware 17.5.1 is the reference version. On Linux hosts, install via sudo sh VMware-Workstation-xxxx-xxxxxxx.<arch>.bundle --console; on Apple-silicon Macs use VMware Fusion with an ARM Ubuntu image. Source: desktop_env/providers/vmware/INSTALL_VMWARE.md:5-40.

All providers expose the same lifecycle methods (start, reset, stop, get_ip, screenshot), so an agent implementation is provider-agnostic. The default VM credentials are user / password for local providers and osworld-public-evaluation for AWS. Source: README.md:110-140.
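The provider-agnostic idea can be sketched with a structural interface. The method names below are copied from the sentence above and have not been checked against the repository, so treat the Provider protocol, the FakeProvider stand-in, and run_episode as hypothetical illustrations rather than the project's API.

```python
from typing import List, Protocol


class Provider(Protocol):
    """Hypothetical interface built from the lifecycle names listed in the text."""

    def start(self) -> None: ...
    def reset(self) -> None: ...
    def stop(self) -> None: ...
    def get_ip(self) -> str: ...
    def screenshot(self) -> bytes: ...


class FakeProvider:
    """In-memory stand-in so the sketch runs without any VM or container."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def start(self) -> None:
        self.log.append("start")

    def reset(self) -> None:
        self.log.append("reset")

    def stop(self) -> None:
        self.log.append("stop")

    def get_ip(self) -> str:
        return "127.0.0.1"

    def screenshot(self) -> bytes:
        return b""


def run_episode(provider: Provider) -> str:
    """Boot, reset, and always stop the guest; works with any conforming provider."""
    provider.start()
    try:
        provider.reset()
        return provider.get_ip()
    finally:
        provider.stop()  # runs even if reset() or get_ip() raises


fake = FakeProvider()
guest_ip = run_episode(fake)
```

Because run_episode depends only on the protocol, swapping a local back-end for a cloud one would not require touching agent code, which is the property the paragraph above describes.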

Common Provider Pitfalls (Community)

VMware Fusion on M-series Macs can fail to start the Ubuntu ARM image — see Issue #407. Verify the .vmx path and that Fusion has helper-tool permissions. Source: Issue #407.
Docker + Clash proxy can yield a 400 from the in-VM Chrome DevTools port even when the proxy is reachable from the host — see Issue #495. Setting http_proxy/https_proxy in the container env and matching the host's 172.17.0.1:7897 Clash listener is the standard fix. Source: Issue #495.
Aliyun (and other minimal Linux display configs) using the dummy X screen snippet from the README can corrupt the graphical session if applied outside a fresh image — see Issue #480. Source: Issue #480.

3. Desktop Environment Orchestration

The DesktopEnv Python class (in desktop_env/) is the agent-facing entry point. It composes:

A provider (above) that owns the VM lifecycle.
A controller that issues pyautogui / shell / file actions.
An evaluator registry that grades terminal states.

A typical invocation looks like:

python scripts/python/run_multienv.py \
    --provider_name docker \
    --headless \
    --observation_type screenshot \
    --model gpt-4o \
    --max_steps 15 \
    --num_envs 10 \
    --client_password password

Source: README.md:50-90. The same flags work for vmware, virtualbox, and aws. Outputs (screenshots, action logs, video) land in --result_dir.
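The Gym-like loop that the runner drives on top of DesktopEnv can be sketched with a stand-in environment. StubDesktopEnv below is hypothetical and touches no VM; its reset/step/evaluate shape is an assumption modelled on the Gym-like description in this manual, not a copy of the real class.

```python
from typing import Any, Callable, Dict, List, Tuple

Observation = Dict[str, Any]


class StubDesktopEnv:
    """Hypothetical stand-in with a Gym-like reset/step/evaluate surface."""

    def __init__(self, max_steps: int = 15) -> None:
        self.max_steps = max_steps
        self.steps = 0
        self.actions: List[str] = []

    def reset(self, task_config: Dict[str, Any]) -> Observation:
        self.steps = 0
        self.actions = []
        return {"screenshot": b"", "instruction": task_config["instruction"]}

    def step(self, action: str) -> Tuple[Observation, float, bool, Dict[str, Any]]:
        self.steps += 1
        self.actions.append(action)
        # The episode ends on an explicit DONE or when the step budget runs out.
        done = action == "DONE" or self.steps >= self.max_steps
        return {"screenshot": b""}, 0.0, done, {}

    def evaluate(self) -> float:
        # Toy grader: success means the agent declared completion.
        return 1.0 if "DONE" in self.actions else 0.0


def run_task(env: StubDesktopEnv, policy: Callable[[Observation], str],
             task_config: Dict[str, Any]) -> float:
    """Observe, act, repeat until done, then ask the environment for a score."""
    obs = env.reset(task_config)
    done = False
    while not done:
        obs, _reward, done, _info = env.step(policy(obs))
    return env.evaluate()


scripted = iter(["pyautogui.click(100, 200)", "DONE"])
score = run_task(StubDesktopEnv(), lambda obs: next(scripted), {"instruction": "demo"})
```

The --max_steps flag in the command above plays the role of max_steps here: it bounds the loop even when the agent never signals completion.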

Multiple agent implementations consume this environment, including mm_agents/aworldguiagent (CUA-style), mm_agents/vlaa_gui (multi-agent Worker / Gate / Verifier pipeline), and mm_agents/os_symphony (orchestrator + search-agent tutorial retrieval). Source: mm_agents/aworldguiagent/README.md:1-20, mm_agents/vlaa_gui/README.md:1-40, mm_agents/os_symphony/agents/worker.py:1-60.

The vlaa_gui/agents/worker.py worker injects a "recon_context" pre-task inspection string on turn 0, and uses a GateAgent to propose success criteria before the first action — this requires the server's accessibility tree to be available at boot. Source: mm_agents/vlaa_gui/agents/worker.py:1-90.

4. In-VM Server and Evaluator Setup

The server is a Python service that starts at boot inside the guest and exposes the channels the host needs: screenshot streaming, xdotool-based input, file upload/download, and an AT-SPI bridge. The accessibility namespace map is hard-coded per platform (ubuntu, windows, macos) in desktop_env/server/main.py, supporting the agent's SoM/AXTree observations. Source: desktop_env/server/main.py:1-80.

The desktop_env/server/README.md enumerates eight configuration categories that the gold image must satisfy:

Account credentials (user / password).
Auto-start of the OSWorld service.
Accessibility tree packages installed.
Disabling of interfering services (auto-update, notifications).
Required software installation (LibreOffice, GIMP, Chrome, VS Code, etc.).
Per-app configuration (e.g., LibreOffice "do not show popup on Ctrl+S").
Port configuration for monitoring and control.
Display / resolution / desktop environment settings.

Source: desktop_env/server/README.md:1-60.

Evaluator-specific setup is documented in desktop_env/evaluators/README.md. Highlights:

Disable the system crash reporter: set enabled=0 in /etc/default/apport. Source: desktop_env/evaluators/README.md:1-15.
LibreOffice requires pre-installed Python libraries: python-pptx, python-docx, odfpy, and (for Calc) openpyxl, pandas, lxml, xmltodict. Source: desktop_env/evaluators/README.md:30-60.
The LibreOffice Calc evaluator converts XLSX to CSV via libreoffice --convert-to "csv:Text - txt - csv (StarCalc):44,34,UTF8,..." with a trailing sheet index. Source: desktop_env/evaluators/README.md:60-80.
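The apport requirement in the list above is easy to verify before a run. The following sketch parses the text of /etc/default/apport and reports whether crash reporting is disabled; the function name apport_disabled is hypothetical, and the parsing assumes the usual shell-style KEY=value layout with # comments.

```python
def apport_disabled(config_text: str) -> bool:
    """Return True if the last uncommented `enabled=` assignment sets it to 0."""
    value = None
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if line.startswith("enabled="):
            value = line.split("=", 1)[1].strip()  # later assignments override earlier ones
    return value == "0"


sample = "# set this to 0 to disable apport\nenabled=0\n"
ok = apport_disabled(sample)
```

Inside the guest, the same function could be fed the contents of /etc/default/apport read with open(...).read(), so a preflight script can refuse to start when crash dialogs might still appear.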

Known Server-Side Failure Modes (Community)

The published Ubuntu.qcow2 snapshot can boot with a Snap Store "software updates available" popup that derails screenshot-only agents on the first observation — see Issue #515. Pinning the guest to a specific Snap refresh schedule or scripting the popup's dismissal at first boot is the usual mitigation. Source: Issue #515.
Several feasible-task evaluators were reported to return reward=1 on loose substring matches without verifying a state delta — see Issue #518. When debugging grader behavior, check the evaluate() implementation for the specific example rather than trusting the score alone. Source: Issue #518.
A community proposal (#514) suggests adding per-step trace diagnostics so failure attribution can distinguish grounding errors from planning errors — useful when the server's screenshot/AXTree look correct but the agent still fails. Source: Issue #514.

5. Operational Tips

Always pass --client_password matching the image you are using (password for local, osworld-public-evaluation for AWS). Source: README.md:110-140.
For Docker runs that need host-side secrets, prefer --vm_secret_mount over baking credentials into the image. Source: README.md:70-90.
Use --headless for CI / cloud; remove it for local debugging of GUI issues.
After installing a new model, verify it works on a small subset of test_all.json before scaling to --num_envs parallel runs.

Agent Implementations, Evaluators & Benchmark Tasks

Related topics: OSWorld Overview & System Architecture, VM Providers, Desktop Environment & Server, Deployment, Workflows & Common Failure Modes

1. Overview and Purpose

OSWorld is a benchmark for evaluating multimodal agents on real desktop tasks. The framework is split into three cooperating layers: a set of agent implementations under mm_agents/, a task and evaluator corpus under evaluation_examples/, and a desktop environment under desktop_env/. Together they let researchers plug a new agent into the same VM, run the same set of tasks, and receive a comparable reward score.

The high-level loop is: an agent receives a natural-language instruction and an observation (screenshot and/or accessibility tree), emits a code-style action (typically agent.click(...), agent.type(...), agent.execute(...)), the environment executes it on the VM, and the evaluator inspects the resulting state to assign a score. Source: README.md and mm_agents/README.md.

Community context: the leaderboard discussion in issue #489 (Hugging Face native benchmark integration) and the methodology critique in #517 (a pixel-blind CLI agent scoring 77.9% vs 64.3% for a vision agent on test_all) both depend on this same agent–evaluator–task pipeline.

2. Agent Implementations

OSWorld ships multiple agent implementations in mm_agents/, each using a different reasoning strategy. All expose a common predict(instruction, obs) interface (or the equivalent for multi-agent variants).

Agent	Location	Key idea
`PromptAgent`	`mm_agents/agent.py`	Baseline prompt-based agent; supports `screenshot`, `a11y_tree`, and `screenshot+a11y_tree` observations. Source: mm_agents/README.md
`VLAA-GUI` Worker	`mm_agents/vlaa_gui/agents/worker.py`	Multi-agent design: Worker, Grounding (ACI), Gate, Verifier, Code, and Searcher. Source: mm_agents/vlaa_gui/README.md
`OS-Symphony` Worker	`mm_agents/os_symphony/agents/worker.py`	Orchestrator + coder-agent; injects search results and recon context. Source: mm_agents/os_symphony/agents/worker.py
`CoAct`	`mm_agents/coact/...`	Builds on AG2 `ConversableAgent` with composable `AgentCapability` and `ToolsCapability`. Source: tools_capability.py
`UiPath`	`mm_agents/uipath/`	Two-stage Action Planner + Grounder (UI-TARS-1.5) using a crop-and-refine coordinate prediction. Source: mm_agents/uipath/README.md
`aworldGUIAgent-v1`	`mm_agents/aworldguiagent/`	Built on AWorld framework, extends Agent-S perception with new executable tools. Source: mm_agents/aworldguiagent/README.md

2.1 Action Space and Observation

The action space is enforced as a single agent.<method>(...) call per turn. The formatter in mm_agents/os_symphony/utils/formatters.py validates that the response contains exactly one action and that the method name appears in the loaded tool config (SINGLE_ACTION_FORMATTER, CODE_VALID_FORMATTER). Source: mm_agents/os_symphony/utils/formatters.py.

In mm_agents/vlaa_gui/utils/formatters.py the same single-action guarantee is enforced, with a small bypass list (_CODE_VALIDATION_BYPASS_ACTIONS) for side-effecting methods such as call_search_agent and call_code_agent, and for click which is validated later by the grounding step. Source: mm_agents/vlaa_gui/utils/formatters.py.
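The single-action rule described above can be illustrated with Python's ast module. This is a sketch of the idea only, not the SINGLE_ACTION_FORMATTER or CODE_VALID_FORMATTER implementations; the function names and the allowed-method set are assumptions.

```python
import ast
from typing import Optional, Set


def single_agent_method(code: str) -> Optional[str]:
    """Return the method name if `code` is exactly one `agent.<method>(...)` call, else None."""
    try:
        body = ast.parse(code.strip()).body
    except SyntaxError:
        return None  # unparsable responses are rejected outright
    if len(body) != 1 or not isinstance(body[0], ast.Expr):
        return None  # zero statements, several statements, or not a bare expression
    call = body[0].value
    if (
        isinstance(call, ast.Call)
        and isinstance(call.func, ast.Attribute)
        and isinstance(call.func.value, ast.Name)
        and call.func.value.id == "agent"
    ):
        return call.func.attr
    return None


def is_valid_action(code: str, allowed: Set[str]) -> bool:
    """Accept only a single agent call whose method is in the loaded tool set."""
    return single_agent_method(code) in allowed


ALLOWED = {"click", "type", "done", "fail"}  # hypothetical tool config
accepted = is_valid_action('agent.click("OK button")', ALLOWED)
```

Parsing instead of pattern-matching means that text mentioning agent.click inside a string or comment is not mistaken for an action.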

2.2 VLAA-GUI Worker Flow

The Worker in mm_agents/vlaa_gui/agents/worker.py injects task instructions, optional search tutorials, and a recon context on turn 0; it then asks the GateAgent to propose 1–3 UI-observable success criteria that are checked every step. Source: mm_agents/vlaa_gui/agents/worker.py. The ACI grounding module coordinates clicks via OCR/Tesseract fallback, and grounding.py wraps the CodeAgent with constraint-aware failure tracking. Source: mm_agents/vlaa_gui/agents/grounding.py.

2.3 OS-Symphony Flow

mm_agents/os_symphony/agents/worker.py shows the orchestrator appending search results, gating their use to the current turn (resetting last_search_agent_result after injection), and then asking the orchestrator LLM for a plan via call_llm_formatted with the two format checkers listed above. Source: mm_agents/os_symphony/agents/worker.py. The dedicated CoderAgent summarizes the multi-step code execution with a 150-word cap. Source: mm_agents/os_symphony/agents/coder_agent.py.
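The one-shot gating of search results can be sketched as follows. Only the attribute name last_search_agent_result comes from the text; the class, the method, and the message strings are hypothetical and do not reproduce the worker's real prompt.

```python
from typing import List, Optional


class OneShotSearchContext:
    """Holds a search result that should reach the prompt exactly once."""

    def __init__(self) -> None:
        self.last_search_agent_result: Optional[str] = None

    def build_messages(self, instruction: str) -> List[str]:
        messages = [f"Task: {instruction}"]
        if self.last_search_agent_result is not None:
            messages.append(f"Search result: {self.last_search_agent_result}")
            # Reset immediately so the next turn does not re-inject the same text.
            self.last_search_agent_result = None
        return messages


ctx = OneShotSearchContext()
ctx.last_search_agent_result = "Open Format > Cells to change the number format."
first_turn = ctx.build_messages("Format column B as currency")   # includes the result
second_turn = ctx.build_messages("Format column B as currency")  # result is gone
```

Clearing the field at injection time keeps stale tutorials from crowding later prompts while still letting a fresh search repopulate it.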

3. Benchmark Tasks and Evaluators

3.1 Task Schema

Each task lives at evaluation_examples/examples/<domain>/<id>.json and is documented in evaluation_examples/README.md. Required fields are:

id — unique task identifier
snapshot — VM snapshot id that fixes the initial state
instruction — natural-language task description
source — provenance URL
config — setup scripts (file downloads, app launches) executed before the agent starts
related_apps — applications the task will touch
evaluator — directory containing the per-task evaluate.py

The ./trajectories directory holds annotated gold trajectories and recordings.
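The required fields listed above can be checked mechanically before a run. The sketch below is illustrative: the field names come from the list, while the helper name missing_fields and every value in the example task are placeholders, not a real task from evaluation_examples/.

```python
import json
from typing import Any, Dict, List

# Field names as listed in the task schema above.
REQUIRED_FIELDS = (
    "id", "snapshot", "instruction", "source",
    "config", "related_apps", "evaluator",
)


def missing_fields(task: Dict[str, Any]) -> List[str]:
    """Return the required field names that are absent from a task definition."""
    return [name for name in REQUIRED_FIELDS if name not in task]


# Placeholder task: every value here is invented for the example.
example_task = json.loads("""
{
    "id": "example-id",
    "snapshot": "snapshot_id",
    "instruction": "Rename the open document to report.docx",
    "source": "https://example.com/how-to",
    "config": [],
    "related_apps": ["libreoffice_writer"],
    "evaluator": "evaluation_dir"
}
""")
problems = missing_fields(example_task)  # empty list when the task is complete
```

Running such a check over every JSON file in a domain directory catches truncated or hand-edited task files before they consume VM time.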

3.2 Evaluator Design

desktop_env/evaluators/README.md documents per-app evaluator setup, including disabling apport crash reports, suppressing LibreOffice save popups, and installing domain libraries (python-pptx, python-docx, odfpy, openpyxl, pandas, lxml, xmltodict). For LibreOffice Calc the recommended XLSX→CSV conversion uses the Text - txt - csv (StarCalc):44,34,UTF8,... filter. Source: desktop_env/evaluators/README.md.

A persistent community concern is loose grading: issue #518 reports that several feasible-task evaluators return reward=1 via substring matching on the final state, with no delta/causation check, which can inflate success rates when an agent happens to leave the right text in place without performing the requested action.
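The difference between a loose substring grader and one that also checks for a state change can be shown in a few lines. Neither function is taken from desktop_env/evaluators/; both are hypothetical, and the delta check is only one possible mitigation for the concern raised in #518.

```python
def loose_reward(final_text: str, expected: str) -> int:
    """Substring grading: passes whenever the expected text is present at the end."""
    return int(expected in final_text)


def delta_reward(initial_text: str, final_text: str, expected: str) -> int:
    """Pass only if the expected text is present at the end and was absent at the start."""
    return int(expected in final_text and expected not in initial_text)


# The expected text was already in the file before the agent did anything.
initial = "Total: 42"
final = "Total: 42"
false_positive = loose_reward(final, "Total: 42")        # the loose grader passes
with_delta = delta_reward(initial, final, "Total: 42")   # the delta-aware grader does not
```

A delta check needs a snapshot of the initial state, which is why it cannot be added to a grader that only ever sees the final file.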

3.3 Accessibility and Observation Plumbing

The desktop-side server in desktop_env/server/main.py exposes per-OS accessibility namespace maps (_accessibility_ns_map for ubuntu, windows, macos) that are used to build a11y-tree observations for a11y_tree-mode agents. Source: desktop_env/server/main.py.

4. Common Failure Modes and Community Issues

Task design flaws. Issue #430 points to a Chrome task (f3b19d1e-…) whose target page does not exist, yet the task is marked feasible.
First-observation drift. Issue #515 shows that a Snap Store "software updates available" popup on the Ubuntu.qcow2 reset can derail screenshot-only agents on their very first observation.
Container / Chrome DevTools 400. Issue #495 reports a 400 on the Chrome DevTools port from the published Docker image even with a verified host proxy.
Vendor-specific X server configs. Issue #480 shows that the DummyScreen/DummyDevice X config used for Aliyun can break the GUI when adapted carelessly.
Trace diagnostics proposal. Issue #514 suggests adding per-step failure attribution beyond the single reward signal.
VMware Fusion on Apple Silicon. Issue #407 documents an ami/provider mismatch on M-series Macs.
Qwen3.5-VL tuning. Issue #441 notes very low success rates when running Qwen3.5-VL through the qwen3vl pipeline and asks for an official config.

For manual debugging the helper script scripts/bash/run_manual_examine.sh runs a single task end-to-end with screenshots and a video, with domain-specific example IDs. Source: scripts/README.md.

Deployment, Workflows & Common Failure Modes

Related topics: OSWorld Overview & System Architecture, VM Providers, Desktop Environment & Server, Agent Implementations, Evaluators & Benchmark Tasks

1. Overview

OSWorld is a benchmark for evaluating multimodal agents on real desktop operating systems. The repository ships a Docker/VM/cloud runtime (desktop_env), multiple pluggable agent implementations (mm_agents/*), example tasks (evaluation_examples/), and runner scripts (scripts/python/*). This page documents how the pieces are deployed, the end-to-end workflow that produces a benchmark number, and the failure modes that users repeatedly hit in the wild.

The main entry point for first-time users is quickstart.py plus the scripts/python/run_multienv*.py runners, while power users orchestrate their own agent through the DesktopEnv interface described in desktop_env/README.md and the agent interface in mm_agents/README.md. Source: README.md.

2. Deployment Topologies

OSWorld supports four runtime providers, each with its own failure surface:

Provider	Backend	Typical failure mode	Reference
`docker`	`happysixd/osworld-docker` image	Chrome DevTools port returning 400 when host proxy is mis-routed (#495)	`desktop_env/providers/docker/`
`vmware`	`Ubuntu.vmx` on host	VMware Fusion on Apple Silicon fails to start the VM (#407)	`desktop_env/providers/vmware/`
`virtualbox`	Local snapshot	None cited in the community issues referenced here; snapshot support was added in v0.1.16	v0.1.16 release notes
`aws`	EC2 instance, AMI `ami-0b505e9d0d99ba88c`	Region-specific or X11 dummy-screen breakage (#480)	`mm_agents/vlaa_gui/README.md`

Default credentials for local providers are user / password; for cloud providers OSWorld defaults to osworld-public-evaluation to avoid trivially guessable logins. Source: README.md.

The desktop server is started by desktop_env/server/main.py, which exposes an x11vnc-backed stream and an AT-SPI accessibility-tree endpoint. The server's platform-specific namespaces (_accessibility_ns_map_ubuntu, _accessibility_ns_map_windows, _accessibility_ns_map_macos) tell the agent how to interpret node attributes for each guest OS. Source: desktop_env/server/main.py.

3. End-to-End Workflow

flowchart LR
    A[Task JSON in evaluation_examples/] --> B[runner script: run_multienv_*.py]
    B --> G[reset to snapshot]
    G --> C[Agent e.g. os_symphony / vlaa_gui]
    C --> D[Screenshot + a11y tree]
    D --> E[Worker next_action]
    E -->|click/type/code/search| F[DesktopEnv step]
    F --> H[Evaluators in desktop_env/evaluators/]
    H --> I[reward 0/1]

The runner instantiates one DesktopEnv per worker process. For each task it calls env.reset(task_config), which reverts the VM to the task's golden snapshot and returns the first observation that the worker consumes. Source: scripts/README.md.

Inside the agent, the Worker (mm_agents/os_symphony/agents/worker.py and mm_agents/vlaa_gui/agents/worker.py) builds the per-step prompt by stitching together:

The system prompt with TASK_DESCRIPTION interpolated.
Optional tutorials produced by the SearcherAgent (mm_agents/os_symphony/agents/searcher_agent.py), appended under ### Tutorials Found by Search Agent.
The previous search/code-agent result (with DONE / FAIL reason).
A ⚠️ FINAL STEP warning when obs["is_last_step"] is true, forcing a terminal agent.done() or agent.fail() decision. Source: mm_agents/os_symphony/agents/worker.py.
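The prompt stitching listed above can be sketched as a small function. The TASK_DESCRIPTION placeholder, the tutorials heading, the is_last_step key, and the "⚠️ FINAL STEP" marker come from the text; the function name, the remaining wording, and the plain string replacement are assumptions, not the worker's actual template code.

```python
from typing import Any, Dict, Optional

# Only the "⚠️ FINAL STEP" marker is from the text; the rest of the sentence is invented.
FINAL_STEP_WARNING = "⚠️ FINAL STEP: respond with agent.done() or agent.fail()."


def build_step_prompt(
    system_template: str,
    task: str,
    obs: Dict[str, Any],
    tutorials: Optional[str] = None,
    previous_result: Optional[str] = None,
) -> str:
    """Assemble the per-step prompt from the four ingredients in the list."""
    parts = [system_template.replace("TASK_DESCRIPTION", task)]
    if tutorials:
        parts.append("### Tutorials Found by Search Agent\n" + tutorials)
    if previous_result:
        parts.append("Previous result: " + previous_result)
    if obs.get("is_last_step"):
        parts.append(FINAL_STEP_WARNING)  # force a terminal decision on the last turn
    return "\n\n".join(parts)


prompt = build_step_prompt(
    "You are a desktop agent. Task: TASK_DESCRIPTION",
    "Export the sheet as CSV",
    {"is_last_step": True},
    tutorials="Use File > Save As and pick the CSV type.",
)
```

Keeping the warning as the final part places it closest to the model's answer position, where a last-turn instruction is least likely to be overlooked.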

When the worker hands control to the CoderAgent, the agent emits multi-step pyautogui scripts. After execution, the coder agent generates a Summary: ... block via PROCEDURAL_MEMORY.CODE_SUMMARY_AGENT_PROMPT, which is fed back into the next worker turn. Source: mm_agents/os_symphony/agents/coder_agent.py.

The VLAA GUI variant adds two extra components on top of the basic loop: a GateAgent that proposes 1–3 UI-observable success criteria on the first step (step 0), and a VerifierAgent that independently confirms completion before the worker is allowed to call done(). Source: mm_agents/vlaa_gui/README.md.

Once the agent signals completion, evaluate() runs the per-task evaluators bundled under desktop_env/evaluators/. The evaluator setup README details which libraries each application needs (e.g. python-pptx for LibreOffice Impress, openpyxl/pandas/lxml/xmltodict for Calc). Source: desktop_env/evaluators/README.md.

4. Common Failure Modes

4.1 VM Reset Pollution

After a snapshot reset, the Ubuntu guest frequently boots into a Snap Store "software updates available" popup, which is then captured in the agent's first screenshot and derails screenshot-only models. Source: Issue #515.

4.2 Loose Evaluator Grading

Several feasible-task evaluators return reward=1 purely on a substring match against the final file state, with no causation or delta check. Issue #430 (the Ticket-Delivery-FAQs page does not exist, yet the task is marked feasible) shows that feasibility annotation is not bulletproof either. Source: Issue #518 and Issue #430.

4.3 Final-Step Hang

Without an explicit FINAL STEP prompt, workers keep emitting non-terminal actions until the step budget is exhausted. The OS-Symphony worker now injects ⚠️ FINAL STEP exactly when obs.get("is_last_step") is true. Source: mm_agents/os_symphony/agents/worker.py.

4.4 Network & Proxy Failures

Chrome-based tasks fail when DevTools returns 400 because the host proxy is not propagated into the container (172.17.0.1:7897 style Clash configs are a common culprit). The proxy setup instructions in SETUP_GUIDELINE.md and the v0.1.16 release notes both flag this. Source: Issue #495.

4.5 Model-Specific Pipeline Misconfiguration

Qwen3.5-VL users report near-zero success rates because they run the model through the wrong pipeline variant (#441). The qwen3vl runner script assumes a specific prompt format; using it with the chat-completions endpoint silently downgrades accuracy. Source: scripts/python/run_multienv_qwen3vl.py.

5. Mitigations and Recommended Practices

Pin the provider image: Use a digest-pinned Docker image (happysixd/osworld-docker@sha256:...) rather than :latest to avoid silent VM-state drift. Source: Issue #495.
Disable apport and suppress save pop-ups inside the guest: desktop_env/evaluators/README.md recommends setting enabled=0 in /etc/default/apport and enabling the LibreOffice "no pop-up when ctrl+s" option to keep the observation stable. Source: desktop_env/evaluators/README.md.
Always run scripts from the repo root: The scripts/README.md documents that sys.path is mutated at import time, so running python scripts/python/run.py from the wrong directory breaks module resolution. Source: scripts/README.md.
Treat trace-level diagnostics as first-class: When reward is 0, inspect the per-step screenshots and the worker history. A CLI-only baseline (#517) reports that a pixel-blind agent can outperform a vision agent on test_all (77.9% vs 64.3%), and #514 proposes per-step failure attribution to surface such hidden failure modes. Source: Issue #517 and Issue #514.
Verify evaluator libraries: python-pptx, python-docx, odfpy, openpyxl, and pandas must be installed inside the guest for the corresponding evaluator to run; their absence silently degrades evaluation to the substring-match fallback. Source: desktop_env/evaluators/README.md.
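For the image-pinning recommendation in the list above, a preflight check can reject references that are not pinned by digest. This is a minimal sketch: the function name is hypothetical, and it only tests for a trailing sha256 digest of 64 lowercase hex characters rather than validating the whole image reference grammar.

```python
import re

# A digest-pinned reference ends with "@sha256:" followed by 64 lowercase hex characters.
_DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference is pinned to a sha256 digest."""
    return _DIGEST_SUFFIX.search(image_ref) is not None


floating = is_digest_pinned("happysixd/osworld-docker:latest")  # tag only, not pinned
```

A tag such as :latest can be repointed to a new image at any time, whereas a digest identifies one immutable image, which is what makes benchmark runs comparable across dates.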

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.

Found 11 structured pitfall items, including 5 high/blocking items. Top priority: Installation risk - Installation risk requires verification.

1. Installation risk: Installation risk requires verification

Severity: high
Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/xlang-ai/OSWorld/issues/515

2. Maintenance risk: Maintenance risk requires verification

Severity: high
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/xlang-ai/OSWorld/issues/514

3. Security or permission risk: Security or permission risk requires verification

Severity: high
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/xlang-ai/OSWorld/issues/495

4. Security or permission risk: Security or permission risk requires verification

Severity: high
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/xlang-ai/OSWorld/issues/518

5. Security or permission risk: Security or permission risk requires verification

Severity: high
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/xlang-ai/OSWorld/issues/517

6. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | github_repo:705433049 | https://github.com/xlang-ai/OSWorld

7. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | github_repo:705433049 | https://github.com/xlang-ai/OSWorld

8. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: downstream_validation.risk_items | github_repo:705433049 | https://github.com/xlang-ai/OSWorld

9. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: no_demo
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: risks.scoring_risks | github_repo:705433049 | https://github.com/xlang-ai/OSWorld

10. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: issue_or_pr_quality=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | github_repo:705433049 | https://github.com/xlang-ai/OSWorld

11. Maintenance risk: Maintenance risk requires verification

Severity: low
Finding: release_recency=unknown.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | github_repo:705433049 | https://github.com/xlang-ai/OSWorld

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using OSWorld with real data or production workflows.

Pixel-blind CLI agent scores 77.9% on OSWorld test_all (vs 64.3% vision) - github / github_issue
Container starts but Chrome DevTools port returns 400, even with clean h - github / github_issue
Proposal: trace diagnostics for computer-use agent failures - github / github_issue
Guest VM shows a Snap Store "software updates available" popup on reset, - github / github_issue
Feasible-task evaluators return reward=1 without verifying the task was - github / github_issue
v0.1.16 - github / github_release
v0.1.0 - github / github_release
Capability evidence risk requires verification - GitHub / issue

Source: Project Pack community evidence and pitfall evidence
