Doramagic Project Pack · Human Manual
OSWorld
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld Overview & System Architecture
Related topics: VM Providers, Desktop Environment & Server, Agent Implementations, Evaluators & Benchmark Tasks, Deployment, Workflows & Common Failure Modes
Purpose and Scope
OSWorld is a scalable, real computer-environment benchmark for multimodal agents. It provides a controlled desktop environment (Ubuntu/Windows/macOS guests hosted on VMware, VirtualBox, Docker, or AWS) in which an agent receives a natural-language task, observes the screen (and optionally the accessibility tree), executes actions, and is then scored by deterministic evaluators against an expected end state.
The project defines four pillars, all of which are implemented as runnable Python in this repository:
- Environment providers: pluggable back-ends that boot and reset the guest OS (VMware, VirtualBox, Docker, AWS).
- Agents: multiple reference implementations (vlaa_gui, os_symphony, coact, maestro) that translate observations into actions.
- Evaluators: per-task scoring functions that grade the final state of the guest.
- Monitor: a Flask-based web dashboard that visualises running tasks and trajectories.
The README states that credentials default to user / password for local providers and osworld-public-evaluation for AWS, and that some evaluators require sudo, so the client must know the password. Source: README.md:1-200.
The current public release is v0.1.16, which adds VirtualBox snapshots, AWS support with optimized interface mappings, fixes for annotation bugs, and broader model support (Gemini 1.5-Pro, Llama-3, Qwen). Source: release notes for v0.1.16.
High-Level Architecture
OSWorld separates environment control (the guest desktop) from agent reasoning (a Python process that talks to the guest via RPC and screenshots) and from evaluation (a deterministic post-run grader). All three are coordinated by a top-level runner that streams results into the Monitor.
flowchart LR
A[Runner / quickstart.py] --> B[Agent Process]
B -- screenshot / a11y tree --> A
A -- action code --> C[Guest VM / Container]
C --> D[desktop_env Server]
D -- clipboard / file ops --> B
A --> E[Evaluator]
E --> F[Result JSON]
F --> G[Monitor Dashboard]
The desktop side is mediated by a Python server embedded in the guest that exposes clipboard, file, and accessibility-tree RPCs to the host agent. Source: desktop_env/server/main.py:1-60 defines platform-specific accessibility namespaces for ubuntu, windows, and macos, with MAX_DEPTH and MAX_WIDTH bounds for tree traversal. This is the protocol agents consume when they request an a11y tree observation.
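The effect of depth and width bounds on an accessibility tree can be sketched as follows. This is a minimal illustration, not the server's code: the `Node` type, the `prune`, `count`, and `build` helpers, and the numeric limits are all hypothetical. Only the names MAX_DEPTH and MAX_WIDTH come from the source.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical limits; the real values live in desktop_env/server/main.py.
MAX_DEPTH = 3
MAX_WIDTH = 2


@dataclass
class Node:
    """Toy stand-in for an accessibility node (not the server's type)."""
    name: str
    children: List["Node"] = field(default_factory=list)


def prune(node: Node, depth: int = 0) -> Node:
    """Copy the tree, keeping at most MAX_DEPTH levels and MAX_WIDTH children per node."""
    if depth + 1 >= MAX_DEPTH:
        return Node(node.name)  # deepest allowed level: drop all children
    kept = [prune(child, depth + 1) for child in node.children[:MAX_WIDTH]]
    return Node(node.name, kept)


def count(node: Node) -> int:
    """Number of nodes in the tree rooted at `node`."""
    return 1 + sum(count(child) for child in node.children)


def build(levels: int, fanout: int, name: str = "root") -> Node:
    """Build a complete tree with the given number of levels and fanout."""
    if levels == 1:
        return Node(name)
    return Node(name, [build(levels - 1, fanout, f"{name}.{i}") for i in range(fanout)])


full = build(levels=5, fanout=4)
small = prune(full)
```

Bounding the traversal keeps the observation size predictable even when an application exposes a very deep or very wide widget tree.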
Core Components
Environment Providers and the Desktop Server
The guest runs a server (desktop_env/server/main.py) that brokers interactions between the host agent and the operating system. It handles:
- Accessibility tree construction with per-platform namespaces (e.g., st, attr, cp, doc, docattr, txt, val, act, class for Ubuntu). Source: desktop_env/server/main.py:1-40.
- Per-application setup: evaluators require libraries such as python-pptx, python-docx, odfpy, openpyxl, pandas, lxml, and xmltodict to be pre-installed inside the guest. Source: desktop_env/evaluators/README.md:1-80.
- LibreOffice headless conversion: evaluators rely on libreoffice --convert-to "csv:Text - txt - csv (StarCalc):44,34,UTF8,..." to materialise intermediate files. Source: desktop_env/evaluators/README.md:30-50.
Community reports highlight two recurring operational issues: (1) the published Ubuntu.qcow2 snapshot can boot with a Snap Store "software updates available" popup that derails screenshot-based agents (#515), and (2) Chrome DevTools port forwarding through a host proxy can return 400 even with a verified clean image (#495).
Agent Implementations
OSWorld ships several reference agents that differ in how they plan and act. All of them share the same observation/action contract with desktop_env, but differ in prompting and tooling.
- vlaa_gui Worker: a generator/grounding split. The Worker class wires a generator agent to a grounding ACI agent, optionally invokes a GateAgent to propose completion criteria on step 0, and injects a recon_context block on the first turn so the agent is aware of read-only pre-task inspection. Source: mm_agents/vlaa_gui/agents/worker.py:1-120.
- os_symphony Worker: an orchestrator that delegates intent to a SearcherAgent (which itself wraps pyautogui calls such as click and type) and feeds search results back into the generator as a one-shot prompt injection (then resets last_search_agent_result to avoid re-injection). Source: mm_agents/os_symphony/agents/worker.py:1-80 and mm_agents/os_symphony/agents/searcher_agent.py:1-80.
- os_symphony ACI: the Search Agent wrapper that issues "How to …" queries scoped to a single application, then injects returned tutorials into the generator system prompt. Source: mm_agents/os_symphony/agents/os_aci.py:1-40.
- coact: an AutoGen-derived multi-agent stack; it pulls in Teachability (ChromaDB-backed long-term memory), VisionCapability, ToolsCapability, TransformMessages (history/token limiters), LLMLingua compression, and ImageGenerator. Source: mm_agents/coact/autogen/agentchat/contrib/capabilities/teachability.py:1-60 and mm_agents/coact/autogen/agentchat/contrib/capabilities/vision_capability.py:1-60.
- maestro: shares utility helpers (safe_write_json, locked file locks, generate_uuid) used by other agents. Source: mm_agents/maestro/utils/README.md:1-40.
Evaluators
Evaluators are deterministic functions that compare the final guest state to a per-task spec. The setup guide explicitly warns that the LibreOffice "no popup on Ctrl+S" flag must be enabled inside the guest, and that system crash reports must be disabled via apport so evaluators are not derailed by background dialogs. Source: desktop_env/evaluators/README.md:1-20.
Community issue #518 flags a real correctness gap: several feasible-task evaluators return reward=1 on loose substring matches without verifying that the agent *caused* the change, leading to false positives. Issue #430 shows the inverse problem — a feasible task annotated against a non-existent target page.
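The difference between a loose substring check and a check that also requires a state change can be sketched as follows. Both functions are hypothetical illustrations of the grading styles discussed in issue #518; neither is taken from desktop_env/evaluators/.

```python
def substring_reward(final_text: str, expected: str) -> float:
    """Loose grading: passes whenever the expected text is present at the end."""
    return 1.0 if expected in final_text else 0.0


def delta_reward(initial_text: str, final_text: str, expected: str) -> float:
    """Stricter grading: the expected text must be present at the end
    and must not already have been present before the agent acted."""
    appeared = expected in final_text and expected not in initial_text
    return 1.0 if appeared else 0.0


# A guest whose file already contained the target string before the run.
before = "Title: Quarterly Report"
after = "Title: Quarterly Report"  # the agent changed nothing

loose = substring_reward(after, "Quarterly Report")
strict = delta_reward(before, after, "Quarterly Report")
```

A delta check only shows that the state changed during the run; it is weaker than proving the agent caused the change, but it removes the false positive where nothing happened at all.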
Monitor Dashboard
The Monitor is a Flask app that reads the runner's results directory and renders per-task pages with screenshots, step counts, and final scores. Configuration is via a .env file with the following key variables:
| Variable | Purpose | Default |
|---|---|---|
| TASK_CONFIG_PATH | Path to the task configuration JSON | ../evaluation_examples/test.json |
| EXAMPLES_BASE_PATH | Base directory for per-example assets | ../evaluation_examples/examples |
| RESULTS_BASE_PATH | Directory of run result JSONs | ../results |
| ACTION_SPACE | Action vocabulary (pyautogui, keyboard) | pyautogui |
| OBSERVATION_TYPE | Observation mode (screenshot, video) | screenshot |
| MODEL_NAME | Identifier of the model under test | computer-use-preview |
| MAX_STEPS | Maximum step count to display | 150 |
| FLASK_PORT / FLASK_HOST | Server bind | 80 / 0.0.0.0 |
Source: monitor/README.md:1-60. The README explicitly warns: *"Make sure you run the monitor after the main runner has started executing tasks. Otherwise, it may cause issues when executing tasks."* Source: monitor/README.md:1-20.
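One way to read these variables with their documented defaults is sketched below. The `load_monitor_config` helper is hypothetical and is not the Monitor's actual loader; it only shows how the table maps to a typed configuration, assuming the variables are already present in the process environment (for example after a .env loader has run).

```python
import os
from typing import Mapping

# Defaults copied from the table above.
DEFAULTS = {
    "TASK_CONFIG_PATH": "../evaluation_examples/test.json",
    "EXAMPLES_BASE_PATH": "../evaluation_examples/examples",
    "RESULTS_BASE_PATH": "../results",
    "ACTION_SPACE": "pyautogui",
    "OBSERVATION_TYPE": "screenshot",
    "MODEL_NAME": "computer-use-preview",
    "MAX_STEPS": "150",
    "FLASK_PORT": "80",
    "FLASK_HOST": "0.0.0.0",
}


def load_monitor_config(env: Mapping[str, str] = os.environ) -> dict:
    """Merge environment overrides with the documented defaults."""
    cfg = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    # Numeric settings are stored as strings in the environment.
    cfg["MAX_STEPS"] = int(cfg["MAX_STEPS"])
    cfg["FLASK_PORT"] = int(cfg["FLASK_PORT"])
    return cfg


# Example: override only the port, keep every other default.
cfg = load_monitor_config({"FLASK_PORT": "8080"})
```

Binding to port 80 usually requires elevated privileges on Linux, so overriding FLASK_PORT as in the example is a common adjustment for local use.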
Common Usage Patterns and Known Failure Modes
Quickstart. The canonical entry point is python quickstart.py --provider_name <vmware|virtualbox|docker|aws> .... The README documents the credentials and proxy flow required when running behind the GFW or against cloud VMs. Source: README.md:1-200.
Recurring community topics. Open community threads cover: very low success rates when benchmarking Qwen3.5-VL (#441), display-config breakage on Aliyun (#480), VMware Fusion setup on Apple Silicon (#407), and a proposal to add trace-level failure attribution to surface hidden failure modes (#514). Issue #517 also reports that a pixel-blind CLI agent can outperform a vision agent on test_all (77.9% vs 64.3%), motivating alternative observation modes.
Operational checklist. Before each run, the project recommends: (1) installing the per-evaluator libraries inside the guest (desktop_env/evaluators/README.md:1-80), (2) disabling apport so crash dialogs cannot surface in the agent's screenshot (desktop_env/evaluators/README.md:1-20), and (3) ensuring the Monitor's .env points at the runner's actual results directory before starting Flask (monitor/README.md:1-60).
See Also
- Quickstart & Setup Guide — VM/Cloud provisioning, proxy, and Google account setup.
- Evaluation Examples — task definitions and per-app evaluators.
- Data Viewer — browser UI for browsing task trajectories.
- v0.1.16 Release Notes — VirtualBox/AWS support and model additions.
Source: https://github.com/xlang-ai/OSWorld / Human Manual
VM Providers, Desktop Environment & Server
Related topics: OSWorld Overview & System Architecture, Agent Implementations, Evaluators & Benchmark Tasks, Deployment, Workflows & Common Failure Modes
1. Overview
OSWorld is a benchmark for evaluating multimodal agents on real desktop operating systems. The execution substrate is built around three coupled layers:
- VM Provider layer: abstractions for VMware, VirtualBox, Docker, and AWS that boot and reset a guest OS.
- Desktop Environment (DesktopEnv): the Python orchestrator that wires provider, controller, and registry into a Gym-like API consumed by agents.
- In-VM Server: a long-lived service running inside the guest that exposes the accessibility tree, screenshots, and application control ports back to the host.
Together, these layers let a host-side agent observe pixels/accessibility nodes, execute keystrokes/clicks, and trigger evaluations deterministically. Source: README.md:1-120.
2. VM Providers
OSWorld supports four providers, each selected via --provider_name on run.py / run_multienv.py:
| Provider | Use Case | Notes |
|---|---|---|
| vmware | Local x86 / Apple silicon (Fusion) | Free for personal use; recommended baseline. Source: desktop_env/providers/vmware/INSTALL_VMWARE.md:1-40. |
| virtualbox | Free snapshots | Added in v0.1.16 for users without VMware licensing. Source: README.md:90-110. |
| docker | Cloud / CI / parallel runs | Uses happysixd/osworld-docker; supports --vm_secret_mount for host-side credentials. Source: README.md:60-90. |
| aws | Cloud benchmarks | Optimized instance types and interface mappings ship with v0.1.16. Source: README.md:90-110. |
VMware 17.5.1 is the reference version. On Linux hosts, install via sudo sh VMware-Workstation-xxxx-xxxxxxx.<arch>.bundle --console; on Apple-silicon Macs use VMware Fusion with an ARM Ubuntu image. Source: desktop_env/providers/vmware/INSTALL_VMWARE.md:5-40.
All providers expose the same lifecycle methods (start, reset, stop, get_ip, screenshot), so an agent implementation is provider-agnostic. The default VM credentials are user / password for local providers and osworld-public-evaluation for AWS. Source: README.md:110-140.
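A shared lifecycle of this kind can be expressed as an abstract base class. The sketch below is illustrative only: the `Provider` and `FakeProvider` classes and the `run_episode` helper are hypothetical, and the real provider classes in desktop_env/providers/ may use different signatures. Only the five method names come from the text above.

```python
from abc import ABC, abstractmethod


class Provider(ABC):
    """Hypothetical interface; method names follow the lifecycle list above."""

    @abstractmethod
    def start(self) -> None: ...

    @abstractmethod
    def reset(self) -> None: ...

    @abstractmethod
    def stop(self) -> None: ...

    @abstractmethod
    def get_ip(self) -> str: ...

    @abstractmethod
    def screenshot(self) -> bytes: ...


class FakeProvider(Provider):
    """In-memory stand-in used only to exercise the interface."""

    def __init__(self) -> None:
        self.running = False
        self.resets = 0

    def start(self) -> None:
        self.running = True

    def reset(self) -> None:
        self.resets += 1

    def stop(self) -> None:
        self.running = False

    def get_ip(self) -> str:
        return "127.0.0.1"

    def screenshot(self) -> bytes:
        return b"fake-png"


def run_episode(provider: Provider) -> bytes:
    """Provider-agnostic episode: boot, reset, observe, and always shut down."""
    provider.start()
    try:
        provider.reset()
        return provider.screenshot()
    finally:
        provider.stop()


fake = FakeProvider()
shot = run_episode(fake)
```

Because `run_episode` depends only on the abstract interface, swapping the back-end does not require changes to agent code.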
Common Provider Pitfalls (Community)
- VMware Fusion on M-series Macs can fail to start the Ubuntu ARM image (see Issue #407). Verify the .vmx path and that Fusion has helper-tool permissions. Source: Issue #407.
- Docker + Clash proxy can yield a 400 from the in-VM Chrome DevTools port even when the proxy is reachable from the host (see Issue #495). Setting http_proxy / https_proxy in the container env and matching the host's 172.17.0.1:7897 Clash listener is the standard fix. Source: Issue #495.
- Aliyun (and other minimal Linux display configs) using the dummy X screen snippet from the README can corrupt the graphical session if applied outside a fresh image (see Issue #480). Source: Issue #480.
3. Desktop Environment Orchestration
The DesktopEnv Python class (in desktop_env/) is the agent-facing entry point. It composes:
- A provider (above) that owns the VM lifecycle.
- A controller that issues pyautogui / shell / file actions.
- An evaluator registry that grades terminal states.
A typical invocation looks like:
python scripts/python/run_multienv.py \
--provider_name docker \
--headless \
--observation_type screenshot \
--model gpt-4o \
--max_steps 15 \
--num_envs 10 \
--client_password password
Source: README.md:50-90. The same flags work for vmware, virtualbox, and aws. Outputs (screenshots, action logs, video) land in --result_dir.
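When launching many runs from a script, the same command line can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of the repository; it uses only the flags shown in the invocation above.

```python
from typing import List


def build_run_command(
    provider: str,
    model: str,
    num_envs: int,
    max_steps: int,
    client_password: str,
    headless: bool = True,
) -> List[str]:
    """Assemble the argument list for run_multienv.py from the documented flags."""
    cmd = [
        "python", "scripts/python/run_multienv.py",
        "--provider_name", provider,
        "--observation_type", "screenshot",
        "--model", model,
        "--max_steps", str(max_steps),
        "--num_envs", str(num_envs),
        "--client_password", client_password,
    ]
    if headless:
        cmd.append("--headless")
    return cmd


cmd = build_run_command("docker", "gpt-4o", num_envs=10, max_steps=15,
                        client_password="password")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root; passing a list rather than a shell string avoids quoting problems in the password.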
Multiple agent implementations consume this environment, including mm_agents/aworldguiagent (CUA-style), mm_agents/vlaa_gui (multi-agent Worker / Gate / Verifier pipeline), and mm_agents/os_symphony (orchestrator + search-agent tutorial retrieval). Source: mm_agents/aworldguiagent/README.md:1-20, mm_agents/vlaa_gui/README.md:1-40, mm_agents/os_symphony/agents/worker.py:1-60.
The vlaa_gui/agents/worker.py worker injects a "recon_context" pre-task inspection string on turn 0, and uses a GateAgent to propose success criteria before the first action — this requires the server's accessibility tree to be available at boot. Source: mm_agents/vlaa_gui/agents/worker.py:1-90.
4. In-VM Server and Evaluator Setup
The server is a Python service that starts at boot inside the guest and exposes the channels the host needs: screenshot streaming, xdotool-based input, file upload/download, and an AT-SPI bridge. The accessibility namespace map is hard-coded per platform (ubuntu, windows, macos) in desktop_env/server/main.py, supporting the agent's SoM/AXTree observations. Source: desktop_env/server/main.py:1-80.
The desktop_env/server/README.md enumerates eight configuration categories that the gold image must satisfy:
- Account credentials (user / password).
- Auto-start of the OSWorld service.
- Accessibility tree packages installed.
- Disabling of interfering services (auto-update, notifications).
- Required software installation (LibreOffice, GIMP, Chrome, VS Code, etc.).
- Per-app configuration (e.g., LibreOffice "do not show popup on Ctrl+S").
- Port configuration for monitoring and control.
- Display / resolution / desktop environment settings.
Source: desktop_env/server/README.md:1-60.
Evaluator-specific setup is documented in desktop_env/evaluators/README.md. Highlights:
- Disable the system crash reporter: set enabled=0 in /etc/default/apport. Source: desktop_env/evaluators/README.md:1-15.
- LibreOffice requires pre-installed Python libraries: python-pptx, python-docx, odfpy, and (for Calc) openpyxl, pandas, lxml, xmltodict. Source: desktop_env/evaluators/README.md:30-60.
- The LibreOffice Calc evaluator converts XLSX to CSV via libreoffice --convert-to "csv:Text - txt - csv (StarCalc):44,34,UTF8,..." with a trailing sheet index. Source: desktop_env/evaluators/README.md:60-80.
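The apport setting above can be verified with a small check run inside the guest. The `apport_disabled` function is a hypothetical helper, not part of the repository; it assumes the file uses plain shell-style `enabled=<value>` assignments with optional `#` comments.

```python
def apport_disabled(config_text: str) -> bool:
    """Return True if the last `enabled=` assignment in the text sets it to 0."""
    value = None
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.startswith("enabled="):
            value = line.split("=", 1)[1].strip()
    return value == "0"


sample = "# set this to 0 to disable apport\nenabled=0\n"
ok = apport_disabled(sample)
```

Inside the guest, the text would come from reading /etc/default/apport, for example with `pathlib.Path("/etc/default/apport").read_text()`.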
Known Server-Side Failure Modes (Community)
- The published Ubuntu.qcow2 snapshot can boot with a Snap Store "software updates available" popup that derails screenshot-only agents on the first observation (see Issue #515). Pinning the guest to a specific Snap refresh schedule or scripting the popup's dismissal at first boot is the usual mitigation. Source: Issue #515.
- Several feasible-task evaluators were reported to return reward=1 on loose substring matches without verifying a state delta (see Issue #518). When debugging grader behavior, check the evaluate() implementation for the specific example rather than trusting the score alone. Source: Issue #518.
- A community proposal (#514) suggests adding per-step trace diagnostics so failure attribution can distinguish grounding errors from planning errors, which is useful when the server's screenshot/AXTree look correct but the agent still fails. Source: Issue #514.
5. Operational Tips
- Always pass --client_password matching the image you are using (password for local, osworld-public-evaluation for AWS). Source: README.md:110-140.
- For Docker runs that need host-side secrets, prefer --vm_secret_mount over baking credentials into the image. Source: README.md:70-90.
- Use --headless for CI / cloud; remove it for local debugging of GUI issues.
- After installing a new model, verify it works on a small subset of test_all.json before scaling to --num_envs parallel runs.
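Drawing a small, reproducible subset for the smoke test in the last tip might look like the sketch below. It assumes, without confirmation from the text above, that the task file maps each domain name to a list of example IDs; the `sample_subset` helper and the sample data are hypothetical.

```python
import random
from typing import Dict, List


def sample_subset(
    tasks: Dict[str, List[str]], per_domain: int, seed: int = 0
) -> Dict[str, List[str]]:
    """Pick up to `per_domain` example IDs from every domain, deterministically."""
    rng = random.Random(seed)  # fixed seed so reruns pick the same tasks
    subset = {}
    for domain, ids in tasks.items():
        k = min(per_domain, len(ids))
        subset[domain] = sorted(rng.sample(ids, k))
    return subset


# Assumed layout: {"<domain>": ["<example id>", ...]}
tasks = {"chrome": ["a", "b", "c"], "gimp": ["x"]}
subset = sample_subset(tasks, per_domain=2)
```

Writing the result with `json.dump` produces a file with the same assumed layout that can stand in for the full task list during the smoke test.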
See Also
- Agent implementations: mm_agents/aworldguiagent, mm_agents/vlaa_gui, mm_agents/os_symphony
- Evaluator library: desktop_env/evaluators/
- Task examples: evaluation_examples/
- Setup guideline: SETUP_GUIDELINE.md
Agent Implementations, Evaluators & Benchmark Tasks
Related topics: OSWorld Overview & System Architecture, VM Providers, Desktop Environment & Server, Deployment, Workflows & Common Failure Modes
1. Overview and Purpose
OSWorld is a benchmark for evaluating multimodal agents on real desktop tasks. The framework is split into three cooperating layers: a set of agent implementations under mm_agents/, a task and evaluator corpus under evaluation_examples/, and a desktop environment under desktop_env/. Together they let researchers plug a new agent into the same VM, run the same set of tasks, and receive a comparable reward score.
The high-level loop is: an agent receives a natural-language instruction and an observation (screenshot and/or accessibility tree), emits a code-style action (typically agent.click(...), agent.type(...), agent.execute(...)), the environment executes it on the VM, and the evaluator inspects the resulting state to assign a score. Source: README.md and mm_agents/README.md.
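The loop described above can be sketched with stub classes. `StubEnv`, `StubAgent`, and `run_task` are hypothetical stand-ins written for illustration; the real DesktopEnv and agent classes have richer signatures, and only the `predict(instruction, obs)` shape is taken from the agent interface described in this manual.

```python
from typing import Dict, Tuple


class StubEnv:
    """Toy environment: the task counts as solved once enough steps were taken."""

    def __init__(self, goal_steps: int) -> None:
        self.goal_steps = goal_steps
        self.steps = 0

    def reset(self) -> Dict[str, str]:
        self.steps = 0
        return {"screenshot": "frame-0"}

    def step(self, action: str) -> Tuple[Dict[str, str], bool]:
        self.steps += 1
        done = action == "agent.done()"
        return {"screenshot": f"frame-{self.steps}"}, done

    def evaluate(self) -> float:
        return 1.0 if self.steps >= self.goal_steps else 0.0


class StubAgent:
    """Toy policy: click twice, then declare the task done."""

    def predict(self, instruction: str, obs: Dict[str, str]) -> str:
        frame = int(obs["screenshot"].split("-")[1])
        return "agent.click(10, 20)" if frame < 2 else "agent.done()"


def run_task(env: StubEnv, agent: StubAgent, instruction: str, max_steps: int) -> float:
    """Observe, act, repeat until the agent finishes or the budget runs out, then score."""
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.predict(instruction, obs)
        obs, done = env.step(action)
        if done:
            break
    return env.evaluate()


reward_full = run_task(StubEnv(goal_steps=3), StubAgent(), "open the file", max_steps=15)
reward_cut = run_task(StubEnv(goal_steps=3), StubAgent(), "open the file", max_steps=2)
```

The second call shows how a step budget that is too small turns an otherwise successful policy into a failure, which is worth keeping in mind when comparing runs with different --max_steps values.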
Community context: the leaderboard discussion in issue #489 (Hugging Face native benchmark integration) and the methodology critique in #517 (a pixel-blind CLI agent scoring 77.9% vs 64.3% for a vision agent on test_all) both depend on this same agent–evaluator–task pipeline.
2. Agent Implementations
OSWorld ships multiple agent implementations in mm_agents/, each using a different reasoning strategy. All expose a common predict(instruction, obs) interface (or the equivalent for multi-agent variants).
| Agent | Location | Key idea |
|---|---|---|
| PromptAgent | mm_agents/agent.py | Baseline prompt-based agent; supports screenshot, a11y_tree, and screenshot+a11y_tree observations. Source: mm_agents/README.md |
| VLAA-GUI Worker | mm_agents/vlaa_gui/agents/worker.py | Multi-agent design: Worker, Grounding (ACI), Gate, Verifier, Code, and Searcher. Source: mm_agents/vlaa_gui/README.md |
| OS-Symphony Worker | mm_agents/os_symphony/agents/worker.py | Orchestrator + coder-agent; injects search results and recon context. Source: mm_agents/os_symphony/agents/worker.py |
| CoAct | mm_agents/coact/... | Builds on AG2 ConversableAgent with composable AgentCapability and ToolsCapability. Source: tools_capability.py |
| UiPath | mm_agents/uipath/ | Two-stage Action Planner + Grounder (UI-TARS-1.5) using a crop-and-refine coordinate prediction. Source: mm_agents/uipath/README.md |
| aworldGUIAgent-v1 | mm_agents/aworldguiagent/ | Built on AWorld framework, extends Agent-S perception with new executable tools. Source: mm_agents/aworldguiagent/README.md |
2.1 Action Space and Observation
The action space is enforced as a single agent.<method>(...) call per turn. The formatter in mm_agents/os_symphony/utils/formatters.py validates that the response contains exactly one action and that the method name appears in the loaded tool config (SINGLE_ACTION_FORMATTER, CODE_VALID_FORMATTER). Source: mm_agents/os_symphony/utils/formatters.py.
In mm_agents/vlaa_gui/utils/formatters.py the same single-action guarantee is enforced, with a small bypass list (_CODE_VALIDATION_BYPASS_ACTIONS) for side-effecting methods such as call_search_agent and call_code_agent, and for click which is validated later by the grounding step. Source: mm_agents/vlaa_gui/utils/formatters.py.
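A single-action check of this kind can be approximated with Python's `ast` module. The sketch below is not the repository's formatter: `extract_agent_calls`, `is_valid_single_action`, and the allowed-method set are hypothetical, and the bypass list mentioned above is not modelled.

```python
import ast
from typing import List, Set


def extract_agent_calls(code: str) -> List[str]:
    """Return the method name of every `agent.<method>(...)` call in the code."""
    names = []
    for node in ast.walk(ast.parse(code)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "agent"
        ):
            names.append(node.func.attr)
    return names


def is_valid_single_action(code: str, allowed: Set[str]) -> bool:
    """Accept only code with exactly one agent call whose method is allowed."""
    try:
        calls = extract_agent_calls(code)
    except SyntaxError:
        return False  # unparseable responses are rejected outright
    return len(calls) == 1 and calls[0] in allowed


allowed = {"click", "type", "done"}  # hypothetical tool config
```

Parsing instead of pattern matching avoids counting `agent.` text that appears inside string arguments as an extra action.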
2.2 VLAA-GUI Worker Flow
The Worker in mm_agents/vlaa_gui/agents/worker.py injects task instructions, optional search tutorials, and a recon context on turn 0; it then asks the GateAgent to propose 1–3 UI-observable success criteria that are checked every step. Source: mm_agents/vlaa_gui/agents/worker.py. The ACI grounding module coordinates clicks via OCR/Tesseract fallback, and grounding.py wraps the CodeAgent with constraint-aware failure tracking. Source: mm_agents/vlaa_gui/agents/grounding.py.
2.3 OS-Symphony Flow
mm_agents/os_symphony/agents/worker.py shows the orchestrator appending search results, gating their use to the current turn (resetting last_search_agent_result after injection), and then asking the orchestrator LLM for a plan via call_llm_formatted with the two format checkers listed above. Source: mm_agents/os_symphony/agents/worker.py. The dedicated CoderAgent summarizes the multi-step code execution with a 150-word cap. Source: mm_agents/os_symphony/agents/coder_agent.py.
3. Benchmark Tasks and Evaluators
3.1 Task Schema
Each task lives at evaluation_examples/examples/<domain>/<id>.json and is documented in evaluation_examples/README.md. Required fields are:
- id: unique task identifier
- snapshot: VM snapshot id that fixes the initial state
- instruction: natural-language task description
- source: provenance URL
- config: setup scripts (file downloads, app launches) executed before the agent starts
- related_apps: applications the task will touch
- evaluator: directory containing the per-task evaluate.py
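A quick structural check for the fields listed above could look like this. The `missing_fields` helper and the example task are hypothetical; the field values are placeholders, not real benchmark data.

```python
from typing import List

REQUIRED_FIELDS = (
    "id", "snapshot", "instruction", "source",
    "config", "related_apps", "evaluator",
)


def missing_fields(task: dict) -> List[str]:
    """List the required fields that are absent from a task definition."""
    return [name for name in REQUIRED_FIELDS if name not in task]


# Placeholder task with every required field except `evaluator`.
task = {
    "id": "example-id",
    "snapshot": "example-snapshot",
    "instruction": "Rename the open document.",
    "source": "https://example.com",
    "config": [],
    "related_apps": ["libreoffice_writer"],
}
gaps = missing_fields(task)
```

Running such a check over a task directory before a long evaluation catches malformed definitions early, before any VM is booted.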
The ./trajectories directory holds annotated gold trajectories and recordings.
3.2 Evaluator Design
desktop_env/evaluators/README.md documents per-app evaluator setup, including disabling apport crash reports, suppressing LibreOffice save popups, and installing domain libraries (python-pptx, python-docx, odfpy, openpyxl, pandas, lxml, xmltodict). For LibreOffice Calc the recommended XLSX→CSV conversion uses the Text - txt - csv (StarCalc):44,34,UTF8,... filter. Source: desktop_env/evaluators/README.md.
A persistent community concern is loose grading: issue #518 reports that several feasible-task evaluators return reward=1 via substring matching on the final state, with no delta/causation check, which can inflate success rates when an agent happens to leave the right text in place without performing the requested action.
3.3 Accessibility and Observation Plumbing
The desktop-side server in desktop_env/server/main.py exposes per-OS accessibility namespace maps (_accessibility_ns_map for ubuntu, windows, macos) that are used to build a11y-tree observations for a11y_tree-mode agents. Source: desktop_env/server/main.py.
4. Common Failure Modes and Community Issues
- Task design flaws. Issue #430 points to a Chrome task (f3b19d1e-…) whose target page does not exist, yet the task is marked feasible.
- First-observation drift. Issue #515 shows that a Snap Store "software updates available" popup on the Ubuntu.qcow2 reset can derail screenshot-only agents on their very first observation.
- Container / Chrome DevTools 400. Issue #495 reports a 400 on the Chrome DevTools port from the published Docker image even with a verified host proxy.
- Vendor-specific X server configs. Issue #480 shows that the DummyScreen/DummyDevice X config used for Aliyun can break the GUI when adapted carelessly.
- Trace diagnostics proposal. Issue #514 suggests adding per-step failure attribution beyond the single reward signal.
- VMware Fusion on Apple Silicon. Issue #407 documents an ami/provider mismatch on M-series Macs.
- Qwen3.5-VL tuning. Issue #441 notes very low success rates when running Qwen3.5-VL through the qwen3vl pipeline and asks for an official config.
For manual debugging the helper script scripts/bash/run_manual_examine.sh runs a single task end-to-end with screenshots and a video, with domain-specific example IDs. Source: scripts/README.md.
See Also
- Desktop Environment & Providers (separate page)
- Setup Guideline (proxy, Google account, AWS) (separate page)
- Public Evaluation Platform (separate page)
Deployment, Workflows & Common Failure Modes
Related topics: OSWorld Overview & System Architecture, VM Providers, Desktop Environment & Server, Agent Implementations, Evaluators & Benchmark Tasks
1. Overview
OSWorld is a benchmark for evaluating multimodal agents on real desktop operating systems. The repository ships a Docker/VM/cloud runtime (desktop_env), multiple pluggable agent implementations (mm_agents/*), example tasks (evaluation_examples/), and runner scripts (scripts/python/*). This page documents how the pieces are deployed, the end-to-end workflow that produces a benchmark number, and the failure modes that users repeatedly hit in the wild.
The main entry point for first-time users is quickstart.py plus the scripts/python/run_multienv*.py runners, while power users orchestrate their own agent through the DesktopEnv interface described in desktop_env/README.md and the agent interface in mm_agents/README.md. Source: README.md.
2. Deployment Topologies
OSWorld supports four runtime providers, each with its own failure surface:
| Provider | Backend | Typical failure mode | Reference |
|---|---|---|---|
| docker | happysixd/osworld-docker image | Chrome DevTools port returning 400 when host proxy is mis-routed (#495) | desktop_env/providers/docker/ |
| vmware | Ubuntu.vmx on host | VMware Fusion on Apple Silicon fails to start the VM (#407) | desktop_env/providers/vmware/ |
| virtualbox | Local snapshot | None cited on this page; snapshot support was added in v0.1.16 | v0.1.16 release notes |
| aws | EC2 instance, AMI ami-0b505e9d0d99ba88c | Region-specific or X11 dummy-screen breakage (#480) | mm_agents/vlaa_gui/README.md |
Default credentials for local providers are user / password; for cloud providers OSWorld defaults to osworld-public-evaluation to avoid trivially guessable logins. Source: README.md.
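The credential rule above can be captured in a small lookup so that runner scripts do not hard-code the password. The `default_client_password` helper is hypothetical; the provider names and passwords are the ones stated in this manual.

```python
LOCAL_PROVIDERS = {"vmware", "virtualbox", "docker"}


def default_client_password(provider_name: str) -> str:
    """Return the documented default guest password for a provider."""
    if provider_name == "aws":
        return "osworld-public-evaluation"
    if provider_name in LOCAL_PROVIDERS:
        return "password"
    raise ValueError(f"unknown provider: {provider_name!r}")


aws_pw = default_client_password("aws")
local_pw = default_client_password("docker")
```

Custom images with a different account should bypass this lookup and pass their own value to --client_password.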
The desktop server is started by desktop_env/server/main.py, which exposes an x11vnc-backed stream and an AT-SPI accessibility-tree endpoint. The server's platform-specific namespaces (_accessibility_ns_map_ubuntu, _accessibility_ns_map_windows, _accessibility_ns_map_macos) tell the agent how to interpret node attributes for each guest OS. Source: desktop_env/server/main.py.
3. End-to-End Workflow
flowchart LR
A[Task JSON in evaluation_examples/] --> B[runner script: run_multienv_*.py]
B --> C[Agent e.g. os_symphony / vlaa_gui]
C --> D[Screenshot + a11y tree]
D --> E[Worker next_action]
E -->|click/type/code/search| F[DesktopEnv step]
F --> G[reset to snapshot]
G --> H[Evaluators in desktop_env/evaluators/]
H --> I[reward 0/1]
The runner instantiates one DesktopEnv per worker process. After resetting the VM to the task's golden snapshot, it calls env.reset(task_config), which the worker turns into the first observation. Source: scripts/README.md.
Inside the agent, the Worker (mm_agents/os_symphony/agents/worker.py and mm_agents/vlaa_gui/agents/worker.py) builds the per-step prompt by stitching together:
- The system prompt with TASK_DESCRIPTION interpolated.
- Optional tutorials produced by the SearcherAgent (mm_agents/os_symphony/agents/searcher_agent.py), appended under ### Tutorials Found by Search Agent.
- The previous search/code-agent result (with DONE / FAIL reason).
- A ⚠️ FINAL STEP warning when obs["is_last_step"] is true, forcing a terminal agent.done() or agent.fail() decision. Source: mm_agents/os_symphony/agents/worker.py.
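The four ingredients listed above can be stitched together roughly as follows. This is a simplified sketch, not the worker's code: the `PromptBuilder` class, the template text, and the exact warning wording are hypothetical. The names TASK_DESCRIPTION, last_search_agent_result, and is_last_step, and the tutorials heading, come from the description above.

```python
from typing import Dict, List, Optional

# Hypothetical wording; only the "FINAL STEP" marker is taken from the source.
FINAL_STEP_WARNING = "⚠️ FINAL STEP: respond with agent.done() or agent.fail()."


class PromptBuilder:
    def __init__(self, system_template: str) -> None:
        self.system_template = system_template
        self.last_search_agent_result: Optional[str] = None

    def build(self, task: str, obs: Dict[str, object], tutorials: List[str]) -> str:
        """Assemble the per-step prompt from its optional parts."""
        parts = [self.system_template.replace("TASK_DESCRIPTION", task)]
        if tutorials:
            parts.append("### Tutorials Found by Search Agent\n" + "\n".join(tutorials))
        if self.last_search_agent_result is not None:
            parts.append(self.last_search_agent_result)
            self.last_search_agent_result = None  # inject once, then clear
        if obs.get("is_last_step"):
            parts.append(FINAL_STEP_WARNING)
        return "\n\n".join(parts)


builder = PromptBuilder("Task: TASK_DESCRIPTION")
builder.last_search_agent_result = "Search result: DONE (tutorial found)"
first = builder.build("open the file", {}, ["Use File > Open."])
last = builder.build("open the file", {"is_last_step": True}, [])
```

Clearing the stored result after use keeps a stale search outcome from being repeated on every later turn.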
When the worker hands control to the CoderAgent, the agent emits multi-step pyautogui scripts. After execution, the coder agent generates a Summary: ... block via PROCEDURAL_MEMORY.CODE_SUMMARY_AGENT_PROMPT, which is fed back into the next worker turn. Source: mm_agents/os_symphony/agents/coder_agent.py.
The VLAA GUI variant adds two extra components on top of the basic loop: a GateAgent that proposes 1–3 UI-observable success criteria before the first action (step 0), and a VerifierAgent that independently confirms completion before the worker is allowed to call done(). Source: mm_agents/vlaa_gui/README.md.
Once the agent signals completion, evaluate() runs the per-task evaluators bundled under desktop_env/evaluators/. The evaluator setup README details which libraries each application needs (e.g. python-pptx for LibreOffice Impress, openpyxl/pandas/lxml/xmltodict for Calc). Source: desktop_env/evaluators/README.md.
4. Common Failure Modes
4.1 VM Reset Pollution
After a snapshot reset, the Ubuntu guest frequently boots into a Snap Store "software updates available" popup, which is then captured in the agent's first screenshot and derails screenshot-only models. Source: Issue #515.
4.2 Loose Evaluator Grading
Several feasible-task evaluators return reward=1 purely on a substring match against the final file state, with no causation or delta check. Issue #430, where a task is marked feasible although its target Ticket-Delivery-FAQs page does not exist, shows that feasibility annotation is not bulletproof either. Source: Issue #518 and Issue #430.
4.3 Final-Step Hang
Without an explicit FINAL STEP prompt, workers keep emitting non-terminal actions until the step budget is exhausted. The OS-Symphony worker now injects ⚠️ FINAL STEP exactly when obs.get("is_last_step") is true. Source: mm_agents/os_symphony/agents/worker.py.
4.4 Network & Proxy Failures
Chrome-based tasks fail when DevTools returns 400 because the host proxy is not propagated into the container (172.17.0.1:7897 style Clash configs are a common culprit). The proxy setup instructions in SETUP_GUIDELINE.md and the v0.1.16 release notes both flag this. Source: Issue #495.
4.5 Model-Specific Pipeline Misconfiguration
Qwen3.5-VL users report near-zero success rates because they run the model through the wrong pipeline variant (#441). The qwen3vl runner script assumes a specific prompt format; using it with the chat-completions endpoint silently downgrades accuracy. Source: scripts/python/run_multienv_qwen3vl.py.
5. Mitigations and Recommended Practices
- Pin the provider image: Use a digest-pinned Docker image (happysixd/osworld-docker@sha256:...) rather than :latest to avoid silent VM-state drift. Source: Issue #495.
- Disable apport and version pop-ups inside the guest: desktop_env/evaluators/README.md recommends setting enabled=0 in /etc/default/apport and turning off LibreOffice "no pop-up when ctrl+s" warnings to keep the observation stable. Source: desktop_env/evaluators/README.md.
- Always run scripts from the repo root: scripts/README.md documents that sys.path is mutated at import time, so running python scripts/python/run.py from the wrong directory breaks module resolution. Source: scripts/README.md.
- Treat trace-level diagnostics as first-class: When reward is 0, inspect the per-step screenshots and the worker history; a CLI-only baseline (#517) demonstrates that pixel-blind policies can outperform vision agents once fine-grained traces are added. Source: Issue #517 and Issue #514.
- Verify evaluator libraries: python-pptx, python-docx, odfpy, openpyxl, and pandas must be installed inside the guest for the corresponding evaluator to run; their absence silently degrades evaluation to the substring-match fallback. Source: desktop_env/evaluators/README.md.
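The library check in the last item can be automated with a small script run inside the guest. The following is a minimal sketch: the helper name missing_packages is hypothetical, and the mapping reflects the usual import names of these PyPI packages rather than anything defined by OSWorld.

```python
import importlib.util

# PyPI package name -> top-level module name used in import statements.
EVALUATOR_LIBS = {
    "python-pptx": "pptx",
    "python-docx": "docx",
    "odfpy": "odf",
    "openpyxl": "openpyxl",
    "pandas": "pandas",
}


def missing_packages(libs: dict = EVALUATOR_LIBS) -> list:
    """Return the sorted package names whose module cannot be found.

    find_spec only locates the module; it does not import it, so the
    check is cheap and has no side effects.
    """
    return sorted(
        pkg for pkg, module in libs.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_packages()
    print("missing:", ", ".join(missing) if missing else "none")
```

Running this before an evaluation run surfaces a missing dependency up front instead of letting it show up as a quietly degraded score.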
See Also
- Quick Start & FAQ
- Agent Interface
- DesktopEnv Interface
- Setup Guideline (proxy, Google accounts, public evaluation)
- VLAA GUI Architecture
Source: https://github.com/xlang-ai/OSWorld / Human Manual
Doramagic Pitfall Log
Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.
Found 11 structured pitfall item(s), including 5 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.
1. Installation risk: Installation risk requires verification
- Severity: high
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/xlang-ai/OSWorld/issues/515
2. Maintenance risk: Maintenance risk requires verification
- Severity: high
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/xlang-ai/OSWorld/issues/514
3. Security or permission risk: Security or permission risk requires verification
- Severity: high
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/xlang-ai/OSWorld/issues/495
4. Security or permission risk: Security or permission risk requires verification
- Severity: high
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/xlang-ai/OSWorld/issues/518
5. Security or permission risk: Security or permission risk requires verification
- Severity: high
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: community_evidence:github | https://github.com/xlang-ai/OSWorld/issues/517
6. Capability evidence risk: Capability evidence risk requires verification
- Severity: medium
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: capability.assumptions | github_repo:705433049 | https://github.com/xlang-ai/OSWorld
7. Maintenance risk: Maintenance risk requires verification
- Severity: medium
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | github_repo:705433049 | https://github.com/xlang-ai/OSWorld
8. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: downstream_validation.risk_items | github_repo:705433049 | https://github.com/xlang-ai/OSWorld
9. Security or permission risk: Security or permission risk requires verification
- Severity: medium
- Finding: no_demo
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: risks.scoring_risks | github_repo:705433049 | https://github.com/xlang-ai/OSWorld
10. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: issue_or_pr_quality=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | github_repo:705433049 | https://github.com/xlang-ai/OSWorld
11. Maintenance risk: Maintenance risk requires verification
- Severity: low
- Finding: release_recency=unknown.
- User impact: May increase setup, validation, or first-run risk for the user.
- Recommended check: Reproduce the official install and quickstart path in an isolated environment.
- Evidence: evidence.maintainer_signals | github_repo:705433049 | https://github.com/xlang-ai/OSWorld
Source: Doramagic discovery, validation, and Project Pack records
Community Discussion Evidence
These external discussion links are review inputs, not standalone proof that the project is production-ready.
Open the linked issues or discussions before treating the pack as ready for your environment.
Doramagic exposes project-level community discussion separately from official documentation. Review these links before using OSWorld with real data or production workflows.
- Pixel-blind CLI agent scores 77.9% on OSWorld test_all (vs 64.3% vision) - github / github_issue
- Container starts but Chrome DevTools port returns 400, even with clean h - github / github_issue
- Proposal: trace diagnostics for computer-use agent failures - github / github_issue
- Guest VM shows a Snap Store "software updates available" popup on reset, - github / github_issue
- Feasible-task evaluators return reward=1 without verifying the task was - github / github_issue
- v0.1.16 - github / github_release
- v0.1.0 - github / github_release
- Capability evidence risk requires verification - GitHub / issue
Source: Project Pack community evidence and pitfall evidence