# https://github.com/xlang-ai/OSWorld Project Manual

Generated at: 2026-06-13 16:11:48 UTC

## Table of Contents

- [OSWorld Overview & System Architecture](#page-1)
- [VM Providers, Desktop Environment & Server](#page-2)
- [Agent Implementations, Evaluators & Benchmark Tasks](#page-3)
- [Deployment, Workflows & Common Failure Modes](#page-4)

<a id='page-1'></a>

## OSWorld Overview & System Architecture

### Related Pages

Related topics: [VM Providers, Desktop Environment & Server](#page-2), [Agent Implementations, Evaluators & Benchmark Tasks](#page-3), [Deployment, Workflows & Common Failure Modes](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/xlang-ai/OSWorld/blob/main/README.md)
- [desktop_env/server/main.py](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/server/main.py)
- [desktop_env/evaluators/README.md](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/evaluators/README.md)
- [monitor/README.md](https://github.com/xlang-ai/OSWorld/blob/main/monitor/README.md)
- [mm_agents/vlaa_gui/agents/worker.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/vlaa_gui/agents/worker.py)
- [mm_agents/os_symphony/agents/worker.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/agents/worker.py)
- [mm_agents/os_symphony/agents/searcher_agent.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/agents/searcher_agent.py)
- [mm_agents/os_symphony/agents/os_aci.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/agents/os_aci.py)
- [mm_agents/coact/autogen/agentchat/contrib/capabilities/teachability.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/coact/autogen/agentchat/contrib/capabilities/teachability.py)
- [mm_agents/maestro/utils/README.md](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/maestro/utils/README.md)
</details>


## Purpose and Scope

OSWorld is a scalable, real computer-environment benchmark for multimodal agents. It provides a controlled desktop environment (Ubuntu, Windows, or macOS guests hosted on VMware, VirtualBox, Docker, or AWS) in which an agent receives a natural-language task, observes the screen (and optionally the accessibility tree), executes actions, and is then scored by deterministic evaluators against an expected end state.

The project defines four pillars, all of which are implemented as runnable Python in this repository:

- **Environment providers**: pluggable back-ends that boot and reset the guest OS (VMware, VirtualBox, Docker, AWS).
- **Agents**: multiple reference implementations (`vlaa_gui`, `os_symphony`, `coact`, `maestro`) that translate observations into actions.
- **Evaluators**: per-task scoring functions that grade the final state of the guest.
- **Monitor**: a Flask-based web dashboard that visualises running tasks and trajectories.

The README states that the default guest credentials are username `user` with password `password` for local providers, and that the password is `osworld-public-evaluation` for AWS. Some evaluators require `sudo`, so the client must know the password. Source: [README.md:1-200]().

The current public release is **v0.1.16**, which adds VirtualBox snapshots, AWS support with optimized interface mappings, fixes for annotation bugs, and broader model support (Gemini 1.5-Pro, Llama-3, Qwen). Source: [release notes for v0.1.16](https://github.com/xlang-ai/OSWorld/releases/tag/v0.1.16).

## High-Level Architecture

OSWorld separates **environment control** (the guest desktop) from **agent reasoning** (a Python process that talks to the guest via RPC and screenshots) and from **evaluation** (a deterministic post-run grader). All three are coordinated by a top-level runner that streams results into the Monitor.

```mermaid
flowchart LR
    A[Runner / quickstart.py] --> B[Agent Process]
    B -- screenshot / a11y tree --> A
    A -- action code --> C[Guest VM / Container]
    C --> D[desktop_env Server]
    D -- clipboard / file ops --> B
    A --> E[Evaluator]
    E --> F[Result JSON]
    F --> G[Monitor Dashboard]
```

The desktop side is mediated by a Python server embedded in the guest that exposes clipboard, file, and accessibility-tree RPCs to the host agent. The server module defines platform-specific accessibility namespaces for `ubuntu`, `windows`, and `macos`, along with `MAX_DEPTH` and `MAX_WIDTH` bounds for tree traversal; this is the protocol agents consume when they request an a11y tree observation. Source: [desktop_env/server/main.py:1-60]().
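To illustrate why depth and width bounds keep an accessibility-tree dump manageable, here is a minimal sketch. The nested-dict node format, the function name `serialize_bounded`, and the limit values are assumptions made for illustration; the real server walks platform accessibility APIs and defines its own constants.

```python
import xml.etree.ElementTree as ET

# Illustrative limits only; the actual MAX_DEPTH / MAX_WIDTH values are
# defined in desktop_env/server/main.py and are not reproduced here.
MAX_DEPTH = 3
MAX_WIDTH = 2


def serialize_bounded(node: dict, depth: int = 0) -> ET.Element:
    """Convert a nested dict ({'role', 'name', 'children'}) into XML.

    Children beyond MAX_WIDTH and levels beyond MAX_DEPTH are pruned.
    """
    elem = ET.Element(node["role"], {"name": node.get("name", "")})
    if depth + 1 >= MAX_DEPTH:
        return elem  # deepest allowed level reached: do not descend further
    for child in node.get("children", [])[:MAX_WIDTH]:
        elem.append(serialize_bounded(child, depth + 1))
    return elem


# Hypothetical UI tree: one window with three children.
demo_tree = {
    "role": "frame",
    "name": "Editor",
    "children": [
        {"role": "panel", "name": "toolbar", "children": [
            {"role": "button", "name": "Save", "children": [
                {"role": "label", "name": "too deep"},
            ]},
        ]},
        {"role": "button", "name": "OK"},
        {"role": "button", "name": "Cancel"},
    ],
}

xml_text = ET.tostring(serialize_bounded(demo_tree), encoding="unicode")
```

With these limits the third top-level child and the fourth nesting level are dropped, which bounds the size of the observation handed to the agent.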

## Core Components

### Environment Providers and the Desktop Server

The guest runs a server (`desktop_env/server/main.py`) that brokers interactions between the host agent and the operating system. The server, together with the guest image it runs in, is responsible for:

- **Accessibility tree construction** with per-platform namespaces (e.g., `st`, `attr`, `cp`, `doc`, `docattr`, `txt`, `val`, `act`, `class` for Ubuntu). Source: [desktop_env/server/main.py:1-40]().
- **Per-application setup**: evaluators require libraries such as `python-pptx`, `python-docx`, `odfpy`, `openpyxl`, `pandas`, `lxml`, and `xmltodict` to be pre-installed inside the guest. Source: [desktop_env/evaluators/README.md:1-80]().
- **LibreOffice headless conversion**: evaluators rely on `libreoffice --convert-to "csv:Text - txt - csv (StarCalc):44,34,UTF8,..."` to materialise intermediate files. Source: [desktop_env/evaluators/README.md:30-50]().

Community reports highlight two recurring operational issues: (1) the published `Ubuntu.qcow2` snapshot can boot with a Snap Store "software updates available" popup that derails screenshot-based agents ([#515](https://github.com/xlang-ai/OSWorld/issues/515)), and (2) Chrome DevTools port forwarding through a host proxy can return `400` even with a verified clean image ([#495](https://github.com/xlang-ai/OSWorld/issues/495)).

### Agent Implementations

OSWorld ships several reference agents that differ in how they plan and act. All of them share the same observation/action contract with `desktop_env`, but differ in prompting and tooling.

- **`vlaa_gui` Worker** — a generator/grounding split. The `Worker` class wires a generator agent to a grounding `ACI` agent, optionally invokes a `GateAgent` to propose completion criteria on step 0, and injects a `recon_context` block on the first turn so the agent is aware of read-only pre-task inspection. Source: [mm_agents/vlaa_gui/agents/worker.py:1-120]().
- **`os_symphony` Worker** — an orchestrator that delegates intent to a `SearcherAgent` (which itself wraps `pyautogui` calls such as `click` and `type`) and feeds search results back into the generator as a one-shot prompt injection (then resets `last_search_agent_result` to avoid re-injection). Source: [mm_agents/os_symphony/agents/worker.py:1-80]() and [mm_agents/os_symphony/agents/searcher_agent.py:1-80]().
- **`os_symphony` ACI** — the Search Agent wrapper that issues "How to …" queries scoped to a single application, then injects returned tutorials into the generator system prompt. Source: [mm_agents/os_symphony/agents/os_aci.py:1-40]().
- **`coact`** — an AutoGen-derived multi-agent stack; it pulls in `Teachability` (ChromaDB-backed long-term memory), `VisionCapability`, `ToolsCapability`, `TransformMessages` (history/token limiters), `LLMLingua` compression, and `ImageGenerator`. Source: [mm_agents/coact/autogen/agentchat/contrib/capabilities/teachability.py:1-60]() and [mm_agents/coact/autogen/agentchat/contrib/capabilities/vision_capability.py:1-60]().
- **`maestro`** — shares utility helpers (`safe_write_json`, `locked` file locks, `generate_uuid`) used by other agents. Source: [mm_agents/maestro/utils/README.md:1-40]().

### Evaluators

Evaluators are deterministic functions that compare the final guest state to a per-task spec. The setup guide explicitly warns that the LibreOffice "no popup on Ctrl+S" flag must be enabled inside the guest, and that system crash reports must be disabled via `apport` so evaluators are not derailed by background dialogs. Source: [desktop_env/evaluators/README.md:1-20]().

Community issue [#518](https://github.com/xlang-ai/OSWorld/issues/518) flags a real correctness gap: several feasible-task evaluators return `reward=1` on loose substring matches without verifying that the agent *caused* the change, leading to false positives. Issue [#430](https://github.com/xlang-ai/OSWorld/issues/430) shows the inverse problem — a feasible task annotated against a non-existent target page.

### Monitor Dashboard

The Monitor is a Flask app that reads the runner's results directory and renders per-task pages with screenshots, step counts, and final scores. Configuration is via a `.env` file with the following key variables:

| Variable | Purpose | Default |
|----------|---------|---------|
| `TASK_CONFIG_PATH` | Path to the task configuration JSON | `../evaluation_examples/test.json` |
| `EXAMPLES_BASE_PATH` | Base directory for per-example assets | `../evaluation_examples/examples` |
| `RESULTS_BASE_PATH` | Directory of run result JSONs | `../results` |
| `ACTION_SPACE` | Action vocabulary (`pyautogui`, `keyboard`) | `pyautogui` |
| `OBSERVATION_TYPE` | Observation mode (`screenshot`, `video`) | `screenshot` |
| `MODEL_NAME` | Identifier of the model under test | `computer-use-preview` |
| `MAX_STEPS` | Maximum step count to display | `150` |
| `FLASK_PORT` / `FLASK_HOST` | Server bind | `80` / `0.0.0.0` |

Source: [monitor/README.md:1-60](). The README explicitly warns: *"Make sure you run the monitor after the main runner has started executing tasks. Otherwise, it may cause issues when executing tasks."* Source: [monitor/README.md:1-20]().
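As a rough sketch of how such settings are typically consumed, the snippet below merges environment values over the defaults listed in the table. `load_monitor_config` is a hypothetical helper, not the Monitor's actual code, and it assumes the `.env` file has already been loaded into the process environment.

```python
import os
from typing import Mapping

# Defaults copied from the table above.
DEFAULTS = {
    "TASK_CONFIG_PATH": "../evaluation_examples/test.json",
    "EXAMPLES_BASE_PATH": "../evaluation_examples/examples",
    "RESULTS_BASE_PATH": "../results",
    "ACTION_SPACE": "pyautogui",
    "OBSERVATION_TYPE": "screenshot",
    "MODEL_NAME": "computer-use-preview",
    "MAX_STEPS": "150",
    "FLASK_PORT": "80",
    "FLASK_HOST": "0.0.0.0",
}


def load_monitor_config(env: Mapping[str, str] = os.environ) -> dict:
    """Return the documented defaults, overridden by any values in `env`."""
    cfg = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    # .env values are strings; convert the numeric settings.
    cfg["MAX_STEPS"] = int(cfg["MAX_STEPS"])
    cfg["FLASK_PORT"] = int(cfg["FLASK_PORT"])
    return cfg


# Example: point the dashboard at a specific results directory.
config = load_monitor_config({"RESULTS_BASE_PATH": "/data/osworld/results"})
```

Checking `config["RESULTS_BASE_PATH"]` against the runner's output directory before starting Flask is a cheap way to catch a mismatched path.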

## Common Usage Patterns and Known Failure Modes

**Quickstart.** The canonical entry point is `python quickstart.py --provider_name <vmware|virtualbox|docker|aws> ...`. The README documents the credentials and the proxy flow required when running behind the Great Firewall (GFW) or against cloud VMs. Source: [README.md:1-200]().

**Recurring community topics.** User-reported friction includes very low success rates when benchmarking Qwen3.5-VL ([#441](https://github.com/xlang-ai/OSWorld/issues/441)), display-config breakage on Aliyun ([#480](https://github.com/xlang-ai/OSWorld/issues/480)), VMware Fusion setup on Apple Silicon ([#407](https://github.com/xlang-ai/OSWorld/issues/407)), and a proposal to add trace-level failure attribution to surface hidden failure modes ([#514](https://github.com/xlang-ai/OSWorld/issues/514)). Issue [#517](https://github.com/xlang-ai/OSWorld/issues/517) additionally reports a pixel-blind CLI agent outperforming a vision agent on `test_all` (77.9% vs 64.3%), which motivates alternative observation modes.

**Operational checklist.** Before each run, the project recommends: (1) installing the per-evaluator libraries inside the guest ([desktop_env/evaluators/README.md:1-80]()), (2) disabling `apport` so crash dialogs cannot surface in the agent's screenshot ([desktop_env/evaluators/README.md:1-20]()), and (3) ensuring the Monitor's `.env` points at the runner's actual results directory before starting Flask ([monitor/README.md:1-60]()).

## See Also

- [Quickstart & Setup Guide](https://github.com/xlang-ai/OSWorld/blob/main/SETUP_GUIDELINE.md) — VM/Cloud provisioning, proxy, and Google account setup.
- [Evaluation Examples](https://github.com/xlang-ai/OSWorld/tree/main/evaluation_examples) — task definitions and per-app evaluators.
- [Data Viewer](https://os-world.github.io/explorer.html) — browser UI for browsing task trajectories.
- [v0.1.16 Release Notes](https://github.com/xlang-ai/OSWorld/releases/tag/v0.1.16) — VirtualBox/AWS support and model additions.

---

<a id='page-2'></a>

## VM Providers, Desktop Environment & Server

### Related Pages

Related topics: [OSWorld Overview & System Architecture](#page-1), [Agent Implementations, Evaluators & Benchmark Tasks](#page-3), [Deployment, Workflows & Common Failure Modes](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/xlang-ai/OSWorld/blob/main/README.md)
- [desktop_env/server/main.py](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/server/main.py)
- [desktop_env/server/README.md](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/server/README.md)
- [desktop_env/providers/vmware/INSTALL_VMWARE.md](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/providers/vmware/INSTALL_VMWARE.md)
- [desktop_env/evaluators/README.md](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/evaluators/README.md)
- [mm_agents/aworldguiagent/README.md](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/aworldguiagent/README.md)
- [mm_agents/vlaa_gui/README.md](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/vlaa_gui/README.md)
- [mm_agents/os_symphony/agents/worker.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/agents/worker.py)
- [mm_agents/vlaa_gui/agents/worker.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/vlaa_gui/agents/worker.py)
</details>


## 1. Overview

OSWorld is a benchmark for evaluating multimodal agents on real desktop operating systems. The execution substrate is built around three coupled layers:

- **VM Provider layer** — abstractions for VMware, VirtualBox, Docker, and AWS that boot and reset a guest OS.
- **Desktop Environment (`DesktopEnv`)** — the Python orchestrator that wires provider, controller, and registry into a Gym-like API consumed by agents.
- **In-VM Server** — a long-lived service running inside the guest that exposes the accessibility tree, screenshots, and application control ports back to the host.

Together, these layers let a host-side agent observe pixels/accessibility nodes, execute keystrokes/clicks, and trigger evaluations deterministically. Source: [README.md:1-120]().

## 2. VM Providers

OSWorld supports four providers, each selected via `--provider_name` on `run.py` / `run_multienv.py`:

| Provider | Use Case | Notes |
|----------|----------|-------|
| `vmware` | Local x86 / Apple silicon (Fusion) | Free for personal use; recommended baseline. Source: [desktop_env/providers/vmware/INSTALL_VMWARE.md:1-40](). |
| `virtualbox` | Free local alternative to VMware | Snapshot support added in v0.1.16 for users without VMware licensing. Source: [README.md:90-110](). |
| `docker` | Cloud / CI / parallel runs | Uses `happysixd/osworld-docker`; supports `--vm_secret_mount` for host-side credentials. Source: [README.md:60-90](). |
| `aws` | Cloud benchmarks | Optimized instance types and interface mappings ship with v0.1.16. Source: [README.md:90-110](). |

VMware 17.5.1 is the reference version. On Linux hosts, install via `sudo sh VMware-Workstation-xxxx-xxxxxxx.<arch>.bundle --console`; on Apple-silicon Macs use VMware Fusion with an ARM Ubuntu image. Source: [desktop_env/providers/vmware/INSTALL_VMWARE.md:5-40]().

All providers implement the same lifecycle operations (start the guest, revert it to a snapshot, stop it, and look up its IP address), so an agent implementation is provider-agnostic. The default guest credentials are username `user` with password `password` for local providers; for AWS the password is `osworld-public-evaluation`. Source: [README.md:110-140]().
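The provider-agnostic contract can be sketched as an abstract interface. The class and method names below (`GuestProvider`, `revert`, `InMemoryProvider`, `reset_guest`) are hypothetical and chosen for illustration; the repository's provider classes define their own names and signatures.

```python
from abc import ABC, abstractmethod


class GuestProvider(ABC):
    """Hypothetical provider interface; names are illustrative only."""

    @abstractmethod
    def start(self) -> None: ...

    @abstractmethod
    def revert(self, snapshot: str) -> None: ...

    @abstractmethod
    def stop(self) -> None: ...

    @abstractmethod
    def get_ip(self) -> str: ...


class InMemoryProvider(GuestProvider):
    """Stand-in that records calls instead of booting a real guest."""

    def __init__(self) -> None:
        self.calls: list = []

    def start(self) -> None:
        self.calls.append("start")

    def revert(self, snapshot: str) -> None:
        self.calls.append(f"revert:{snapshot}")

    def stop(self) -> None:
        self.calls.append("stop")

    def get_ip(self) -> str:
        return "127.0.0.1"


def reset_guest(provider: GuestProvider, snapshot: str) -> str:
    """Bring any provider back to a known state and return the guest IP."""
    provider.stop()
    provider.revert(snapshot)
    provider.start()
    return provider.get_ip()
```

Because `reset_guest` depends only on the abstract interface, the same caller code would work with a VMware, VirtualBox, Docker, or AWS back-end that implements it.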

### Common Provider Pitfalls (Community)

- **VMware Fusion on M-series Macs** can fail to start the Ubuntu ARM image (see [#407](https://github.com/xlang-ai/OSWorld/issues/407)). Verify the `.vmx` path and that Fusion has helper-tool permissions.
- **Docker + Clash proxy** can yield a 400 from the in-VM Chrome DevTools port even when the proxy is reachable from the host (see [#495](https://github.com/xlang-ai/OSWorld/issues/495)). A commonly suggested workaround is to set `http_proxy`/`https_proxy` in the container environment so that they point at the host's Clash listener (for example `172.17.0.1:7897`).
- **Aliyun (and other minimal Linux display configs)**: the dummy X screen snippet from the README can corrupt the graphical session if applied outside a fresh image (see [#480](https://github.com/xlang-ai/OSWorld/issues/480)).

## 3. Desktop Environment Orchestration

The `DesktopEnv` Python class (in `desktop_env/`) is the agent-facing entry point. It composes:

- A **provider** (above) that owns the VM lifecycle.
- A **controller** that issues pyautogui / shell / file actions.
- An **evaluator registry** that grades terminal states.

A typical invocation looks like:

```bash
python scripts/python/run_multienv.py \
    --provider_name docker \
    --headless \
    --observation_type screenshot \
    --model gpt-4o \
    --max_steps 15 \
    --num_envs 10 \
    --client_password password
```

Source: [README.md:50-90](). The same flags work for `vmware`, `virtualbox`, and `aws`. Outputs (screenshots, action logs, video) land in `--result_dir`.

Multiple agent implementations consume this environment, including `mm_agents/aworldguiagent` (CUA-style), `mm_agents/vlaa_gui` (multi-agent Worker / Gate / Verifier pipeline), and `mm_agents/os_symphony` (orchestrator + search-agent tutorial retrieval). Source: [mm_agents/aworldguiagent/README.md:1-20](), [mm_agents/vlaa_gui/README.md:1-40](), [mm_agents/os_symphony/agents/worker.py:1-60]().

The `vlaa_gui/agents/worker.py` worker injects a "recon_context" pre-task inspection string on turn 0, and uses a `GateAgent` to propose success criteria before the first action — this requires the server's accessibility tree to be available at boot. Source: [mm_agents/vlaa_gui/agents/worker.py:1-90]().

## 4. In-VM Server and Evaluator Setup

The server is a Python service that starts at boot inside the guest and exposes the channels the host needs: screenshot streaming, xdotool-based input, file upload/download, and an AT-SPI bridge. The accessibility namespace map is hard-coded per platform (`ubuntu`, `windows`, `macos`) in `desktop_env/server/main.py`, supporting the agent's SoM/AXTree observations. Source: [desktop_env/server/main.py:1-80]().

The `desktop_env/server/README.md` enumerates eight configuration categories that the gold image must satisfy:

1. Account credentials (`user` / `password`).
2. Auto-start of the OSWorld service.
3. Accessibility tree packages installed.
4. Disabling of interfering services (auto-update, notifications).
5. Required software installation (LibreOffice, GIMP, Chrome, VS Code, etc.).
6. Per-app configuration (e.g., LibreOffice "do not show popup on Ctrl+S").
7. Port configuration for monitoring and control.
8. Display / resolution / desktop environment settings.

Source: [desktop_env/server/README.md:1-60]().

Evaluator-specific setup is documented in `desktop_env/evaluators/README.md`. Highlights:

- Disable the system crash reporter: set `enabled=0` in `/etc/default/apport`. Source: [desktop_env/evaluators/README.md:1-15]().
- The LibreOffice evaluators require pre-installed Python libraries: `python-pptx`, `python-docx`, `odfpy`, and (for Calc) `openpyxl`, `pandas`, `lxml`, `xmltodict`. Source: [desktop_env/evaluators/README.md:30-60]().
- The LibreOffice Calc evaluator converts XLSX to CSV via `libreoffice --convert-to "csv:Text - txt - csv (StarCalc):44,34,UTF8,..."` with a trailing sheet index. Source: [desktop_env/evaluators/README.md:60-80]().
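The conversion command above can be assembled programmatically. The sketch below only builds the argument list and does not run LibreOffice; `build_calc_to_csv_cmd` is a hypothetical helper, and `extra_tokens` stands for the filter-option tail that the evaluator README elides as `...`, which should be copied from that README rather than guessed.

```python
def build_calc_to_csv_cmd(src: str, out_dir: str, extra_tokens: str) -> list:
    """Build (but do not execute) a LibreOffice headless XLSX-to-CSV command.

    In the filter options, 44 is the field separator (comma), 34 is the
    text delimiter (double quote), and UTF8 is the output character set.
    """
    filter_spec = f"csv:Text - txt - csv (StarCalc):44,34,UTF8,{extra_tokens}"
    return [
        "libreoffice",
        "--headless",
        "--convert-to", filter_spec,
        "--out-dir", out_dir,
        src,
    ]


# "<tail-from-README>" is a placeholder, not a valid option string.
cmd = build_calc_to_csv_cmd("/home/user/book.xlsx", "/home/user", "<tail-from-README>")
```

Inside the guest, the resulting list could be passed to `subprocess.run(cmd, check=True)`; passing a list instead of a shell string avoids quoting problems with the spaces in the filter name.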

### Known Server-Side Failure Modes (Community)

- The published `Ubuntu.qcow2` snapshot can boot with a Snap Store "software updates available" popup that derails screenshot-only agents on the first observation (see [#515](https://github.com/xlang-ai/OSWorld/issues/515)). Pinning the guest to a specific Snap refresh schedule or scripting the popup's dismissal at first boot is the usual mitigation.
- Several feasible-task evaluators were reported to return `reward=1` on loose substring matches without verifying a state delta (see [#518](https://github.com/xlang-ai/OSWorld/issues/518)). When debugging grader behavior, check the `evaluate()` implementation for the specific example rather than trusting the score alone.
- A community proposal ([#514](https://github.com/xlang-ai/OSWorld/issues/514)) suggests adding per-step trace diagnostics so that failure attribution can distinguish grounding errors from planning errors. This is useful when the server's screenshot/AXTree look correct but the agent still fails.

## 5. Operational Tips

- Always pass `--client_password` matching the image you are using (`password` for local, `osworld-public-evaluation` for AWS). Source: [README.md:110-140]().
- For Docker runs that need host-side secrets, prefer `--vm_secret_mount` over baking credentials into the image. Source: [README.md:70-90]().
- Use `--headless` for CI / cloud; remove it for local debugging of GUI issues.
- After adding a new model, verify that it works on a small subset of `test_all.json` before scaling up to `--num_envs` parallel runs.

## See Also

- Agent implementations: [mm_agents/aworldguiagent](https://github.com/xlang-ai/OSWorld/tree/main/mm_agents/aworldguiagent), [mm_agents/vlaa_gui](https://github.com/xlang-ai/OSWorld/tree/main/mm_agents/vlaa_gui), [mm_agents/os_symphony](https://github.com/xlang-ai/OSWorld/tree/main/mm_agents/os_symphony)
- Evaluator library: [desktop_env/evaluators/](https://github.com/xlang-ai/OSWorld/tree/main/desktop_env/evaluators)
- Task examples: [evaluation_examples/](https://github.com/xlang-ai/OSWorld/tree/main/evaluation_examples)
- Setup guideline: [SETUP_GUIDELINE.md](https://github.com/xlang-ai/OSWorld/blob/main/SETUP_GUIDELINE.md)

---

<a id='page-3'></a>

## Agent Implementations, Evaluators & Benchmark Tasks

### Related Pages

Related topics: [OSWorld Overview & System Architecture](#page-1), [VM Providers, Desktop Environment & Server](#page-2), [Deployment, Workflows & Common Failure Modes](#page-4)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/xlang-ai/OSWorld/blob/main/README.md)
- [mm_agents/README.md](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/README.md)
- [mm_agents/vlaa_gui/README.md](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/vlaa_gui/README.md)
- [mm_agents/vlaa_gui/agents/worker.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/vlaa_gui/agents/worker.py)
- [mm_agents/vlaa_gui/agents/grounding.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/vlaa_gui/agents/grounding.py)
- [mm_agents/vlaa_gui/utils/formatters.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/vlaa_gui/utils/formatters.py)
- [mm_agents/os_symphony/agents/worker.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/agents/worker.py)
- [mm_agents/os_symphony/agents/os_aci.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/agents/os_aci.py)
- [mm_agents/os_symphony/agents/coder_agent.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/agents/coder_agent.py)
- [mm_agents/os_symphony/utils/formatters.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/utils/formatters.py)
- [mm_agents/coact/autogen/agentchat/contrib/capabilities/tools_capability.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/coact/autogen/agentchat/contrib/capabilities/tools_capability.py)
- [mm_agents/coact/autogen/agentchat/contrib/capabilities/agent_capability.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/coact/autogen/agentchat/contrib/capabilities/agent_capability.py)
- [mm_agents/uipath/README.md](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/uipath/README.md)
- [mm_agents/aworldguiagent/README.md](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/aworldguiagent/README.md)
- [evaluation_examples/README.md](https://github.com/xlang-ai/OSWorld/blob/main/evaluation_examples/README.md)
- [desktop_env/evaluators/README.md](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/evaluators/README.md)
- [desktop_env/server/main.py](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/server/main.py)
- [scripts/README.md](https://github.com/xlang-ai/OSWorld/blob/main/scripts/README.md)
</details>


## 1. Overview and Purpose

OSWorld is a benchmark for evaluating multimodal agents on real desktop tasks. The framework is split into three cooperating layers: a set of **agent implementations** under `mm_agents/`, a **task and evaluator corpus** under `evaluation_examples/`, and a **desktop environment** under `desktop_env/`. Together they let researchers plug a new agent into the same VM, run the same set of tasks, and receive a comparable `reward` score.

The high-level loop is: an agent receives a natural-language instruction and an observation (screenshot and/or accessibility tree), emits a code-style action (typically `agent.click(...)`, `agent.type(...)`, `agent.execute(...)`), the environment executes it on the VM, and the evaluator inspects the resulting state to assign a score. Source: [README.md](https://github.com/xlang-ai/OSWorld/blob/main/README.md) and [mm_agents/README.md](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/README.md).
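The loop described above can be sketched with stand-in classes. `StubEnv`, `StubAgent`, `run_episode`, and the `"DONE"` sentinel are assumptions made for illustration; only the `predict(instruction, obs)` shape is taken from this manual, and the real environment executes actions inside a VM rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass
class StubEnv:
    """In-memory stand-in for the desktop environment."""
    executed: list = field(default_factory=list)

    def reset(self) -> dict:
        self.executed.clear()
        return {"screenshot": b"", "a11y_tree": "<desktop/>"}

    def step(self, action: str) -> dict:
        # A real environment would run the action inside the guest.
        self.executed.append(action)
        return {"screenshot": b"", "a11y_tree": "<desktop/>"}

    def evaluate(self) -> float:
        # Toy grader: success if any executed action mentions 'hello'.
        return 1.0 if any("hello" in a for a in self.executed) else 0.0


class StubAgent:
    """Emits one typing action, then signals completion."""

    def __init__(self) -> None:
        self.turn = 0

    def predict(self, instruction: str, obs: dict) -> str:
        self.turn += 1
        return "agent.type('hello')" if self.turn == 1 else "DONE"


def run_episode(env, agent, instruction: str, max_steps: int = 15) -> float:
    """Observe, act, repeat until the agent stops or the step budget ends."""
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.predict(instruction, obs)
        if action == "DONE":
            break
        obs = env.step(action)
    return env.evaluate()


reward = run_episode(StubEnv(), StubAgent(), "Type hello into the editor")
```

Keeping scoring in `evaluate()` rather than in the agent mirrors the separation between agent reasoning and deterministic grading.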

Community context: the leaderboard discussion in issue [#489](https://github.com/xlang-ai/OSWorld/issues/489) (Hugging Face native benchmark integration) and the methodology critique in [#517](https://github.com/xlang-ai/OSWorld/issues/517) (a pixel-blind CLI agent scoring 77.9% vs 64.3% for a vision agent on `test_all`) both depend on this same agent–evaluator–task pipeline.

## 2. Agent Implementations

OSWorld ships multiple agent implementations in `mm_agents/`, each using a different reasoning strategy. All expose a common `predict(instruction, obs)` interface (or the equivalent for multi-agent variants).

| Agent | Location | Key idea |
|---|---|---|
| `PromptAgent` | `mm_agents/agent.py` | Baseline prompt-based agent; supports `screenshot`, `a11y_tree`, and `screenshot+a11y_tree` observations. Source: [mm_agents/README.md](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/README.md) |
| `VLAA-GUI` Worker | `mm_agents/vlaa_gui/agents/worker.py` | Multi-agent design: Worker, Grounding (ACI), Gate, Verifier, Code, and Searcher. Source: [mm_agents/vlaa_gui/README.md](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/vlaa_gui/README.md) |
| `OS-Symphony` Worker | `mm_agents/os_symphony/agents/worker.py` | Orchestrator + coder-agent; injects search results and recon context. Source: [mm_agents/os_symphony/agents/worker.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/agents/worker.py) |
| `CoAct` | `mm_agents/coact/...` | Builds on AG2 `ConversableAgent` with composable `AgentCapability` and `ToolsCapability`. Source: [tools_capability.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/coact/autogen/agentchat/contrib/capabilities/tools_capability.py) |
| `UiPath` | `mm_agents/uipath/` | Two-stage Action Planner + Grounder (UI-TARS-1.5) using a crop-and-refine coordinate prediction. Source: [mm_agents/uipath/README.md](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/uipath/README.md) |
| `aworldGUIAgent-v1` | `mm_agents/aworldguiagent/` | Built on AWorld framework, extends Agent-S perception with new executable tools. Source: [mm_agents/aworldguiagent/README.md](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/aworldguiagent/README.md) |

### 2.1 Action Space and Observation

The action space is enforced as a single `agent.<method>(...)` call per turn. The formatter in `mm_agents/os_symphony/utils/formatters.py` validates that the response contains exactly one action and that the method name appears in the loaded tool config (`SINGLE_ACTION_FORMATTER`, `CODE_VALID_FORMATTER`). Source: [mm_agents/os_symphony/utils/formatters.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/utils/formatters.py).

In `mm_agents/vlaa_gui/utils/formatters.py` the same single-action guarantee is enforced, with a small bypass list (`_CODE_VALIDATION_BYPASS_ACTIONS`) for side-effecting methods such as `call_search_agent` and `call_code_agent`, and for `click` which is validated later by the grounding step. Source: [mm_agents/vlaa_gui/utils/formatters.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/vlaa_gui/utils/formatters.py).
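A minimal sketch of the single-action check, assuming a regex-based scan is sufficient. The names `check_single_action`, `needs_code_validation`, `ALLOWED_METHODS`, and `DEFERRED_METHODS` are hypothetical; the repository's formatters load the allowed methods from a tool config and may validate responses differently.

```python
import re

# Matches calls of the form agent.<method>( ... ).
ACTION_CALL = re.compile(r"\bagent\.([A-Za-z_]\w*)\s*\(")

# Hypothetical tool config: methods the agent may call.
ALLOWED_METHODS = {"click", "type", "execute", "call_search_agent", "call_code_agent"}

# Methods whose arguments are checked later or not at all (the bypass idea).
DEFERRED_METHODS = {"click", "call_search_agent", "call_code_agent"}


def check_single_action(response: str) -> tuple:
    """Return (ok, detail): the method name on success, an error otherwise."""
    methods = ACTION_CALL.findall(response)
    if len(methods) != 1:
        return False, f"expected exactly one agent action, found {len(methods)}"
    if methods[0] not in ALLOWED_METHODS:
        return False, f"unknown action: {methods[0]}"
    return True, methods[0]


def needs_code_validation(method: str) -> bool:
    """Deferred methods skip the up-front code validity check."""
    return method not in DEFERRED_METHODS
```

A caller would typically feed the error detail back to the model and ask it to regenerate a response that contains a single valid action.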

### 2.2 VLAA-GUI Worker Flow

The `Worker` in `mm_agents/vlaa_gui/agents/worker.py` injects task instructions, optional search tutorials, and a recon context on turn 0; it then asks the `GateAgent` to propose 1–3 UI-observable success criteria that are checked every step. Source: [mm_agents/vlaa_gui/agents/worker.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/vlaa_gui/agents/worker.py). The `ACI` grounding module resolves click coordinates, with OCR (Tesseract) as a fallback, and `grounding.py` wraps the `CodeAgent` with constraint-aware failure tracking. Source: [mm_agents/vlaa_gui/agents/grounding.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/vlaa_gui/agents/grounding.py).

### 2.3 OS-Symphony Flow

`mm_agents/os_symphony/agents/worker.py` shows the orchestrator appending search results, gating their use to the current turn (resetting `last_search_agent_result` after injection), and then asking the orchestrator LLM for a plan via `call_llm_formatted` with the two format checkers listed above. Source: [mm_agents/os_symphony/agents/worker.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/agents/worker.py). The dedicated `CoderAgent` summarizes the multi-step code execution with a 150-word cap. Source: [mm_agents/os_symphony/agents/coder_agent.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/agents/coder_agent.py).
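The one-shot injection pattern can be sketched as follows. `OrchestratorState` and `build_prompt` are hypothetical names; only the attribute `last_search_agent_result` and the reset-after-use behaviour come from the description above.

```python
from typing import Optional


class OrchestratorState:
    """Hypothetical holder for a pending search-agent result."""

    def __init__(self) -> None:
        self.last_search_agent_result: Optional[str] = None

    def build_prompt(self, base_prompt: str) -> str:
        """Append the pending search result once, then clear it."""
        if self.last_search_agent_result is None:
            return base_prompt
        prompt = f"{base_prompt}\n\n[Search result]\n{self.last_search_agent_result}"
        # Consume the result so later turns do not see it again.
        self.last_search_agent_result = None
        return prompt


state = OrchestratorState()
state.last_search_agent_result = "Use Format > Cells to change the number format."
first_turn = state.build_prompt("Next action?")
second_turn = state.build_prompt("Next action?")
```

Here `first_turn` carries the tutorial text while `second_turn` is the bare prompt, which keeps stale search output from accumulating in the context.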

## 3. Benchmark Tasks and Evaluators

### 3.1 Task Schema

Each task lives at `evaluation_examples/examples/<domain>/<id>.json` and is documented in [evaluation_examples/README.md](https://github.com/xlang-ai/OSWorld/blob/main/evaluation_examples/README.md). Required fields are:

- `id` — unique task identifier
- `snapshot` — VM snapshot id that fixes the initial state
- `instruction` — natural-language task description
- `source` — provenance URL
- `config` — setup scripts (file downloads, app launches) executed before the agent starts
- `related_apps` — applications the task will touch
- `evaluator` — directory containing the per-task `evaluate.py`
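A small sketch of checking a task file against the fields listed above. `missing_fields` is a hypothetical helper, and every value in `example_task` is a placeholder rather than a real benchmark task.

```python
REQUIRED_FIELDS = (
    "id",
    "snapshot",
    "instruction",
    "source",
    "config",
    "related_apps",
    "evaluator",
)


def missing_fields(task: dict) -> list:
    """Return the required field names that are absent from `task`."""
    return [name for name in REQUIRED_FIELDS if name not in task]


# Placeholder values only; this is not a real task definition.
example_task = {
    "id": "example-id",
    "snapshot": "snapshot_id",
    "instruction": "Describe the task in natural language.",
    "source": "https://example.com/where-the-task-was-found",
    "config": {},
    "related_apps": ["app1"],
    "evaluator": "evaluation_dir",
}

problems = missing_fields(example_task)
```

In practice the dict would come from `json.load` on a file under `evaluation_examples/examples/<domain>/`, and a non-empty result would flag a malformed task before any VM is started.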

The `./trajectories` directory holds annotated gold trajectories and recordings.

### 3.2 Evaluator Design

`desktop_env/evaluators/README.md` documents per-app evaluator setup, including disabling `apport` crash reports, suppressing LibreOffice save popups, and installing domain libraries (`python-pptx`, `python-docx`, `odfpy`, `openpyxl`, `pandas`, `lxml`, `xmltodict`). For LibreOffice Calc the recommended XLSX→CSV conversion uses the `Text - txt - csv (StarCalc):44,34,UTF8,...` filter. Source: [desktop_env/evaluators/README.md](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/evaluators/README.md).

A persistent community concern is loose grading: issue [#518](https://github.com/xlang-ai/OSWorld/issues/518) reports that several feasible-task evaluators return `reward=1` via substring matching on the final state, with no delta/causation check, which can inflate success rates when an agent happens to leave the right text in place without performing the requested action.
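The difference between loose and delta-aware grading can be shown with a toy example. Both functions are illustrative sketches, not OSWorld evaluators, and the delta check is only one possible mitigation rather than something the issue or the maintainers prescribe.

```python
def loose_reward(final_text: str, expected: str) -> float:
    """Substring match on the final state only."""
    return 1.0 if expected in final_text else 0.0


def delta_reward(initial_text: str, final_text: str, expected: str) -> float:
    """Credit only when `expected` is absent before the run and present after."""
    appeared = expected not in initial_text and expected in final_text
    return 1.0 if appeared else 0.0


# Case 1: the expected text was already there and the agent changed nothing.
unchanged = "Total: 42"
false_positive = loose_reward(unchanged, "42")          # scores as success
delta_unchanged = delta_reward(unchanged, unchanged, "42")  # scores as failure

# Case 2: the agent actually produced the expected text.
delta_changed = delta_reward("Total:", "Total: 42", "42")
```

The first case is the false positive described above: the final state matches, but nothing the agent did caused it.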

### 3.3 Accessibility and Observation Plumbing

The desktop-side server in `desktop_env/server/main.py` exposes per-OS accessibility namespace maps (`_accessibility_ns_map` for `ubuntu`, `windows`, `macos`) that are used to build a11y-tree observations for `a11y_tree`-mode agents. Source: [desktop_env/server/main.py](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/server/main.py).

## 4. Common Failure Modes and Community Issues

- **Task design flaws.** Issue [#430](https://github.com/xlang-ai/OSWorld/issues/430) points to a Chrome task (`f3b19d1e-…`) whose target page does not exist, yet the task is marked feasible.
- **First-observation drift.** Issue [#515](https://github.com/xlang-ai/OSWorld/issues/515) shows that a Snap Store "software updates available" popup on the `Ubuntu.qcow2` reset can derail screenshot-only agents on their very first observation.
- **Container / Chrome DevTools 400.** Issue [#495](https://github.com/xlang-ai/OSWorld/issues/495) reports a 400 on the Chrome DevTools port from the published Docker image even with a verified host proxy.
- **Vendor-specific X server configs.** Issue [#480](https://github.com/xlang-ai/OSWorld/issues/480) shows that the `DummyScreen`/`DummyDevice` X config used for Aliyun can break the GUI when adapted carelessly.
- **Trace diagnostics proposal.** Issue [#514](https://github.com/xlang-ai/OSWorld/issues/514) suggests adding per-step failure attribution beyond the single `reward` signal.
- **VMware Fusion on Apple Silicon.** Issue [#407](https://github.com/xlang-ai/OSWorld/issues/407) documents an `ami`/provider mismatch on M-series Macs.
- **Qwen3.5-VL tuning.** Issue [#441](https://github.com/xlang-ai/OSWorld/issues/441) notes very low success rates when running Qwen3.5-VL through the `qwen3vl` pipeline and asks for an official config.

For manual debugging the helper script `scripts/bash/run_manual_examine.sh` runs a single task end-to-end with screenshots and a video, with domain-specific example IDs. Source: [scripts/README.md](https://github.com/xlang-ai/OSWorld/blob/main/scripts/README.md).

## See Also

- Desktop Environment & Providers (separate page)
- Setup Guideline (proxy, Google account, AWS) (separate page)
- Public Evaluation Platform (separate page)

---

<a id='page-4'></a>

## Deployment, Workflows & Common Failure Modes

### Related Pages

Related topics: [OSWorld Overview & System Architecture](#page-1), [VM Providers, Desktop Environment & Server](#page-2), [Agent Implementations, Evaluators & Benchmark Tasks](#page-3)

<details>
<summary>Related Source Files</summary>

The following source files were used to generate this page:

- [README.md](https://github.com/xlang-ai/OSWorld/blob/main/README.md)
- [scripts/README.md](https://github.com/xlang-ai/OSWorld/blob/main/scripts/README.md)
- [desktop_env/evaluators/README.md](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/evaluators/README.md)
- [desktop_env/server/main.py](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/server/main.py)
- [mm_agents/vlaa_gui/README.md](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/vlaa_gui/README.md)
- [mm_agents/os_symphony/agents/worker.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/agents/worker.py)
- [mm_agents/os_symphony/agents/coder_agent.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/agents/coder_agent.py)
- [mm_agents/os_symphony/agents/searcher_agent.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/agents/searcher_agent.py)
</details>

# Deployment, Workflows & Common Failure Modes

## 1. Overview

OSWorld is a benchmark for evaluating multimodal agents on real desktop operating systems. The repository ships a Docker/VM/cloud runtime (`desktop_env`), multiple pluggable agent implementations (`mm_agents/*`), example tasks (`evaluation_examples/`), and runner scripts (`scripts/python/*`). This page documents how the pieces are deployed, the end-to-end workflow that produces a benchmark number, and the failure modes that users repeatedly hit in the wild.

The main entry point for first-time users is `quickstart.py` plus the `scripts/python/run_multienv*.py` runners, while power users orchestrate their own agent through the `DesktopEnv` interface described in `desktop_env/README.md` and the agent interface in `mm_agents/README.md`. Source: [README.md](https://github.com/xlang-ai/OSWorld/blob/main/README.md).

## 2. Deployment Topologies

This page covers four runtime providers, each with its own failure surface:

| Provider | Backend | Typical failure mode or note | Reference |
|---|---|---|---|
| `docker` | `happysixd/osworld-docker` image | Chrome DevTools port returning 400 when host proxy is mis-routed ([#495](https://github.com/xlang-ai/OSWorld/issues/495)) | `desktop_env/providers/docker/` |
| `vmware` | `Ubuntu.vmx` on host | VMware Fusion on Apple Silicon fails to start the VM ([#407](https://github.com/xlang-ai/OSWorld/issues/407)) | `desktop_env/providers/vmware/` |
| `virtualbox` | Local snapshot | Free for snapshots since v0.1.16 | [v0.1.16 release notes](https://github.com/xlang-ai/OSWorld/releases/tag/v0.1.16) |
| `aws` | EC2 instance, AMI `ami-0b505e9d0d99ba88c` | Region-specific or X11 dummy-screen breakage ([#480](https://github.com/xlang-ai/OSWorld/issues/480)) | `mm_agents/vlaa_gui/README.md` |

Default credentials for local providers are `user` / `password`; for cloud providers the default password is `osworld-public-evaluation`, chosen to avoid a trivially guessable login. Source: [README.md](https://github.com/xlang-ai/OSWorld/blob/main/README.md).

The desktop server is started by `desktop_env/server/main.py`, which exposes an x11vnc-backed stream and an AT-SPI accessibility-tree endpoint. The server's platform-specific namespaces (`_accessibility_ns_map_ubuntu`, `_accessibility_ns_map_windows`, `_accessibility_ns_map_macos`) tell the agent how to interpret node attributes for each guest OS. Source: [desktop_env/server/main.py](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/server/main.py).

## 3. End-to-End Workflow

```mermaid
flowchart LR
    A[Task JSON in evaluation_examples/] --> B[runner script: run_multienv_*.py]
    B --> G[reset to snapshot]
    G --> D[Screenshot + a11y tree]
    D --> C[Agent e.g. os_symphony / vlaa_gui]
    C --> E[Worker next_action]
    E -->|click/type/code/search| F[DesktopEnv step]
    F -->|next observation| D
    E -->|done/fail| H[Evaluators in desktop_env/evaluators/]
    H --> I[reward 0/1]
```

The runner instantiates one `DesktopEnv` per worker process. After resetting the VM to the task's golden snapshot, it calls `env.reset(task_config)`; the observation this call returns is the first one the worker sees. Source: [scripts/README.md](https://github.com/xlang-ai/OSWorld/blob/main/scripts/README.md).
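The reset, step, and evaluate cycle can be sketched with a stand-in environment so that it runs without a VM. `StubEnv`, `run_episode`, and `agent_fn` are hypothetical names for this example. The four-value return from `step` and the runner setting `is_last_step` on the observation are assumptions, not a description of the actual runner scripts.

```python
class StubEnv:
    """Stand-in for DesktopEnv so the loop can run without a VM (hypothetical)."""

    def __init__(self):
        self.steps = 0

    def reset(self, task_config):
        self.steps = 0
        return {"instruction": task_config["instruction"], "screenshot": b""}

    def step(self, action):
        self.steps += 1
        done = action == "DONE"
        # Assumed Gym-style return: observation, reward, done flag, info dict.
        return {"screenshot": b""}, 0.0, done, {}

    def evaluate(self):
        # A real evaluator would inspect the guest state; the stub always passes.
        return 1.0


def run_episode(env, agent_fn, task_config, max_steps=15):
    """Drive one task: reset, loop until done or budget exhausted, then evaluate."""
    obs = env.reset(task_config)
    for step_idx in range(max_steps):
        # Assumption: the runner marks the last step so the agent can terminate.
        obs["is_last_step"] = step_idx == max_steps - 1
        action = agent_fn(task_config["instruction"], obs)
        obs, _reward, done, _info = env.step(action)
        if done:
            break
    return env.evaluate()
```

A scripted agent such as `lambda instruction, obs: "DONE"` is enough to exercise the loop; swapping `StubEnv` for a real environment requires a configured provider.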

Inside the agent, the **Worker** (`mm_agents/os_symphony/agents/worker.py` and `mm_agents/vlaa_gui/agents/worker.py`) builds the per-step prompt by stitching together:

1. The system prompt with `TASK_DESCRIPTION` interpolated.
2. Optional tutorials produced by the **SearcherAgent** (`mm_agents/os_symphony/agents/searcher_agent.py`), appended under `### Tutorials Found by Search Agent`.
3. The previous search/code-agent result (with `DONE` / `FAIL` reason).
4. A `⚠️ FINAL STEP` warning when `obs["is_last_step"]` is true, forcing a terminal `agent.done()` or `agent.fail()` decision. Source: [mm_agents/os_symphony/agents/worker.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/agents/worker.py).
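The four ingredients above can be assembled roughly as follows. This is a sketch under stated assumptions: `build_worker_prompt`, its parameters, and the exact wording of the warning and of the "previous result" line are invented for illustration. Only the `TASK_DESCRIPTION` placeholder, the tutorials heading, and the `is_last_step` key come from the description above.

```python
FINAL_STEP_WARNING = "⚠️ FINAL STEP: respond with agent.done() or agent.fail()."


def build_worker_prompt(system_template, task, obs, tutorials=None, last_subagent_result=None):
    """Stitch the per-step prompt from its optional parts (hypothetical helper)."""
    # 1. System prompt with the task description interpolated.
    parts = [system_template.replace("TASK_DESCRIPTION", task)]
    # 2. Optional tutorials from the search agent.
    if tutorials:
        parts.append("### Tutorials Found by Search Agent\n" + "\n".join(tutorials))
    # 3. Result of the previous search/code sub-agent call, if any.
    if last_subagent_result:
        parts.append(f"Previous sub-agent result: {last_subagent_result}")
    # 4. Terminal-decision warning on the last allowed step.
    if obs.get("is_last_step"):
        parts.append(FINAL_STEP_WARNING)
    return "\n\n".join(parts)
```

Using `obs.get("is_last_step")` rather than indexing means observations that lack the key are treated as non-final steps.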

When the worker hands control to the **CoderAgent**, that agent emits multi-step pyautogui scripts. After execution, it generates a `Summary: ...` block via `PROCEDURAL_MEMORY.CODE_SUMMARY_AGENT_PROMPT`, which is fed back into the next worker turn. Source: [mm_agents/os_symphony/agents/coder_agent.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/agents/coder_agent.py).

The **VLAA GUI** variant adds two extra components on top of the basic loop: a `GateAgent` that proposes 1–3 UI-observable success criteria at step 1, and a `VerifierAgent` that independently confirms completion before the worker is allowed to call `done()`. Source: [mm_agents/vlaa_gui/README.md](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/vlaa_gui/README.md).
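The gating idea can be reduced to a few lines. This is a schematic only: the real `GateAgent` and `VerifierAgent` interfaces are not shown in this page, so `allow_done` and the `verify` callback are hypothetical stand-ins.

```python
from typing import Callable, List


def allow_done(criteria: List[str], verify: Callable[[str], bool]) -> bool:
    """Return True only if the verifier confirms every success criterion.

    `criteria` stands in for the 1-3 UI-observable criteria proposed at step 1;
    `verify` stands in for an independent check of a single criterion.
    """
    if not 1 <= len(criteria) <= 3:
        raise ValueError("expected 1-3 success criteria")
    return all(verify(criterion) for criterion in criteria)
```

The point of the design is that the worker's own belief that it has finished is not sufficient; a separate check must agree before `done()` is accepted.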

Once the agent signals completion, `evaluate()` runs the per-task evaluators bundled under `desktop_env/evaluators/`. The evaluator setup README details which libraries each application needs (e.g. `python-pptx` for LibreOffice Impress, `openpyxl`/`pandas`/`lxml`/`xmltodict` for Calc). Source: [desktop_env/evaluators/README.md](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/evaluators/README.md).

## 4. Common Failure Modes

### 4.1 VM Reset Pollution

After a snapshot reset, the Ubuntu guest frequently boots into a Snap Store "software updates available" popup, which is then captured in the agent's first screenshot and derails screenshot-only models. Source: [Issue #515](https://github.com/xlang-ai/OSWorld/issues/515).

### 4.2 Loose Evaluator Grading

Several feasible-task evaluators return `reward=1` purely on a substring match against the final file state, with no causation or delta check. Separately, issue `#430` (a task whose Ticket-Delivery-FAQs target page does not exist, yet which is marked feasible) shows that even feasibility filtering is not bulletproof. Source: [Issue #518](https://github.com/xlang-ai/OSWorld/issues/518) and [Issue #430](https://github.com/xlang-ai/OSWorld/issues/430).

### 4.3 Final-Step Hang

Without an explicit `FINAL STEP` prompt, workers keep emitting non-terminal actions until the step budget is exhausted. The OS-Symphony worker now injects `⚠️ FINAL STEP` exactly when `obs.get("is_last_step")` is true. Source: [mm_agents/os_symphony/agents/worker.py](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/os_symphony/agents/worker.py).

### 4.4 Network & Proxy Failures

Chrome-based tasks fail when DevTools returns 400 because the host proxy is not propagated into the container (`172.17.0.1:7897` style Clash configs are a common culprit). The proxy setup instructions in `SETUP_GUIDELINE.md` and the v0.1.16 release notes both flag this. Source: [Issue #495](https://github.com/xlang-ai/OSWorld/issues/495).
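One way to separate a proxy problem from a Chrome problem is to probe the DevTools HTTP endpoint directly while bypassing any proxy environment variables. The sketch below uses Chrome's `/json/version` endpoint and the standard library only. `probe_devtools` is a hypothetical helper, and the host and port are whatever your deployment maps, not values taken from OSWorld.

```python
import urllib.error
import urllib.request


def probe_devtools(host, port, opener=None, timeout=5.0):
    """Return the HTTP status of the DevTools /json/version endpoint."""
    url = f"http://{host}:{port}/json/version"
    if opener is None:
        # An empty ProxyHandler ignores proxy environment variables, so the
        # probe talks to the port directly rather than through a host proxy.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; report their status code.
        return err.code
```

If the direct probe returns 200 but the same request through the configured proxy returns 400, the proxy routing is the likelier culprit. Connection-level failures (refused, timed out) are deliberately left to propagate as `URLError`.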

### 4.5 Model-Specific Pipeline Misconfiguration

Qwen3.5-VL users report near-zero success rates because they run the model through the wrong pipeline variant ([#441](https://github.com/xlang-ai/OSWorld/issues/441)). The `qwen3vl` runner script assumes a specific prompt format; using it with the chat-completions endpoint silently downgrades accuracy. Source: [scripts/python/run_multienv_qwen3vl.py](https://github.com/xlang-ai/OSWorld/blob/main/scripts/python/run_multienv_qwen3vl.py).

## 5. Mitigations and Recommended Practices

- **Pin the provider image**: Use a digest-pinned Docker image (`happysixd/osworld-docker@sha256:...`) rather than `:latest` to avoid silent VM-state drift. Source: [Issue #495](https://github.com/xlang-ai/OSWorld/issues/495).
- **Disable apport and save pop-ups inside the guest**: `desktop_env/evaluators/README.md` recommends setting `enabled=0` in `/etc/default/apport` and configuring LibreOffice so that no pop-up appears on Ctrl+S, which keeps the observation stable. Source: [desktop_env/evaluators/README.md](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/evaluators/README.md).
- **Always run scripts from the repo root**: The `scripts/README.md` documents that `sys.path` is mutated at import time, so running `python scripts/python/run.py` from the wrong directory breaks module resolution. Source: [scripts/README.md](https://github.com/xlang-ai/OSWorld/blob/main/scripts/README.md).
- **Treat trace-level diagnostics as first-class**: When reward is 0, inspect the per-step screenshots and the worker history; a CLI-only baseline (`#517`) demonstrates that pixel-blind policies can outperform vision agents once fine-grained traces are added. Source: [Issue #517](https://github.com/xlang-ai/OSWorld/issues/517) and [Issue #514](https://github.com/xlang-ai/OSWorld/issues/514).
- **Verify evaluator libraries**: `python-pptx`, `python-docx`, `odfpy`, `openpyxl`, and `pandas` must be installed inside the guest for the corresponding evaluator to run; their absence silently degrades evaluation to the substring-match fallback. Source: [desktop_env/evaluators/README.md](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/evaluators/README.md).
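The library check in the last item can be automated with a short preflight script. This sketch uses `importlib.util.find_spec` to test for each package without importing it. The helper name `missing_packages` is hypothetical, and the mapping covers only the packages named above.

```python
import importlib.util

# Distribution name -> import name (these differ for several packages).
EVALUATOR_MODULES = {
    "python-pptx": "pptx",
    "python-docx": "docx",
    "odfpy": "odf",
    "openpyxl": "openpyxl",
    "pandas": "pandas",
}


def missing_packages(modules=EVALUATOR_MODULES):
    """Return the distribution names whose top-level module cannot be found."""
    return sorted(
        dist for dist, mod in modules.items()
        if importlib.util.find_spec(mod) is None
    )
```

Running `missing_packages()` in the environment where the evaluators execute and failing fast on a non-empty result turns a silent degradation into an explicit setup error.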

## See Also

- [Quick Start & FAQ](../README.md)
- [Agent Interface](../mm_agents/README.md)
- [DesktopEnv Interface](../desktop_env/README.md)
- [Setup Guideline (proxy, Google accounts, public evaluation)](../SETUP_GUIDELINE.md)
- [VLAA GUI Architecture](../mm_agents/vlaa_gui/README.md)

---


## Pitfall Log

Project: xlang-ai/OSWorld

Summary: Found 11 structured pitfall item(s), including 5 high/blocking item(s). Top priority: Installation risk - Installation risk requires verification.

## 1. Installation risk - Installation risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/xlang-ai/OSWorld/issues/515

## 2. Maintenance risk - Maintenance risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/xlang-ai/OSWorld/issues/514

## 3. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/xlang-ai/OSWorld/issues/495

## 4. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/xlang-ai/OSWorld/issues/518

## 5. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/xlang-ai/OSWorld/issues/517

## 6. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: README/documentation is current enough for a first validation pass.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | github_repo:705433049 | https://github.com/xlang-ai/OSWorld

## 7. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | github_repo:705433049 | https://github.com/xlang-ai/OSWorld

## 8. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: No runnable demo was found (`no_demo`).
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | github_repo:705433049 | https://github.com/xlang-ai/OSWorld

## 9. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: No runnable demo was found (`no_demo`).
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | github_repo:705433049 | https://github.com/xlang-ai/OSWorld

## 10. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: `issue_or_pr_quality=unknown`.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | github_repo:705433049 | https://github.com/xlang-ai/OSWorld

## 11. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: `release_recency=unknown`.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | github_repo:705433049 | https://github.com/xlang-ai/OSWorld

<!-- canonical_name: xlang-ai/OSWorld; human_manual_source: deepwiki_human_wiki -->
