The Synthetic Environment Factory: How Holo3 Cracked the Computer Use Training Data Problem
Source: huggingface
When Anthropic shipped Claude’s computer use capability in late 2024, the immediate developer reaction was excitement followed quickly by friction. The models could see a screen, reason about it, and issue mouse clicks and keystrokes, but real-world reliability was low enough that most production applications kept humans tightly in the loop. H Company’s Holo3, published April 1, 2026 (the date raised eyebrows, but the model weights and inference API access are genuine), claims 78.85% on OSWorld-Verified and releases weights under Apache 2.0. Both details are worth examining carefully.
What OSWorld-Verified Actually Measures
OSWorld is a benchmark introduced in a 2024 paper that evaluates computer use agents on real desktop tasks inside actual virtual machines running Linux, Windows, and macOS. Tasks span file management, web browsing, terminal operations, document editing, and cross-application workflows. Completion is verified programmatically through test scripts, not by a secondary model scoring screenshots.
The “Verified” suffix refers to a curated subset where task completion criteria have been audited for reliability. In a benchmark where an agent must resize a window, save a spreadsheet in a specific format, or drag files into the correct directory, small bugs in the verification scripts produce noisy signals. The Verified variant filters those out, making scores more reproducible and meaningful for comparison.
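To make "verified programmatically" concrete, here is a minimal sketch of what a state-based checker could look like for a "save the spreadsheet as CSV" task. The function name check_csv_export, the file name, and the header are hypothetical and are not taken from OSWorld's actual test scripts. The point is only that the check inspects the resulting file rather than a screenshot.

```python
import csv
import tempfile
from pathlib import Path


def check_csv_export(workdir: Path, filename: str, expected_header: list) -> bool:
    """Hypothetical checker: pass only if the file exists with the expected header row."""
    target = workdir / filename
    if not target.is_file():
        return False
    with target.open(newline="") as f:
        first_row = next(csv.reader(f), None)
    return first_row == expected_header


def demo() -> tuple:
    """Run the checker before and after a simulated agent saves the file."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        before = check_csv_export(workdir, "report.csv", ["name", "total"])
        # Stand-in for the agent completing the task in the VM.
        (workdir / "report.csv").write_text("name,total\nalice,3\n")
        after = check_csv_export(workdir, "report.csv", ["name", "total"])
        return before, after


print(demo())
```

A checker like this is only as reliable as its assumptions. If the task allowed a different but valid header order, the script would report a false failure. That is the kind of noise an audited subset is meant to remove.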
When OSWorld launched, GPT-4V scored around 8% on the full benchmark. By late 2024, frontier models with dedicated computer use training were reaching 30 to 40% on the standard evaluation. A score of 78.85% on the Verified subset is a substantial jump. Most of the failures that hold models near 40% are failures of grounding rather than reasoning: the model knows what to do but misidentifies the target element, or fails to register that a dialog box has appeared over the window it was targeting. Closing that grounding gap, at scale, is the hard engineering problem H Company appears to have addressed.
Mixture-of-Experts at Computer Use Speed
Holo3 uses a sparse mixture-of-experts architecture with 122 billion total parameters and 10 billion active per forward pass. For general language tasks, MoE primarily reduces training and inference cost. For computer use, the active parameter count has a more direct and compounding consequence: latency.
A desktop automation loop where the model sees a screenshot, decides on an action, and executes it repeats that inference cycle many times per task. A multi-step workflow that involves parsing a PDF, cross-referencing a spreadsheet, and sending personalized emails might require fifty or more inference calls. With 10B active parameters rather than a dense 70B model, each cycle is substantially faster, and that difference compounds across the full task length.
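As a rough illustration of how per-step latency compounds over a task, the sketch below assumes that the cost of one inference call scales linearly with active parameters. That linearity, the helper name task_latency, and the 0.05 seconds-per-billion constant are illustrative assumptions, not measurements of Holo3 or of any real deployment.

```python
def task_latency(steps: int, active_params_b: float, seconds_per_b: float) -> float:
    """Total model time for one task under a crude linear cost model (assumption)."""
    return steps * active_params_b * seconds_per_b


STEPS = 50            # "fifty or more inference calls" from the text
SECONDS_PER_B = 0.05  # hypothetical constant, not a benchmark result

sparse = task_latency(STEPS, 10, SECONDS_PER_B)  # 10B active parameters
dense = task_latency(STEPS, 70, SECONDS_PER_B)   # dense 70B model

print(f"sparse: {sparse:.0f}s, dense: {dense:.0f}s, saved: {dense - sparse:.0f}s")
```

Under this model the ratio between the two totals depends only on the active parameter counts, while the absolute time saved grows with every additional step. Real latency also depends on hardware, batching, image token counts, and expert routing overhead, none of which are modeled here.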
The comparison to H Company’s previous release is instructive. Holo2-235B-A22B used 235 billion total parameters with 22 billion active. Holo3 reaches higher benchmark scores with 10B active versus 22B. Reaching higher scores with fewer than half the active parameters suggests the gain comes from the curated training pipeline rather than from raw capacity.
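The parameter counts quoted above translate into the following active fractions. Only the helper name active_fraction is introduced here; the numbers come straight from the text.

```python
def active_fraction(active_b: float, total_b: float) -> float:
    """Share of total parameters used in one forward pass."""
    return active_b / total_b


# (active, total) in billions of parameters, as quoted in the text
models = {"Holo3": (10, 122), "Holo2-235B-A22B": (22, 235)}

for name, (active, total) in models.items():
    print(f"{name}: {active_fraction(active, total):.1%} of parameters active per pass")
```

Both models activate under a tenth of their parameters per pass, so the sparsity ratio is similar. What changed is the absolute size: Holo3 is roughly half the scale of its predecessor on both counts.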
The Real Innovation: Generating the Training Environments
The component of Holo3’s development that deserves the most attention is the synthetic environment factory, because the training data problem for computer use differs fundamentally from the training data problem for code generation or chat.
For a model that writes code, you can mine GitHub, Stack Overflow, and documentation. The artifacts exist at scale and are largely self-contained. For a model that uses a computer, you need something harder: a running software environment, a task specification with a clear and programmatically checkable success condition, and an execution trace through that environment. You cannot scrape this from the web, and text generation alone cannot produce it. You need to spin up actual GUIs, define tasks that can be verified automatically, and generate or record demonstrations of correct behavior. At the scale required to train a model of this size, that demands an automated pipeline.
H Company’s approach uses coding agents to generate the environments themselves. Given a scenario specification, a coding agent programs a website or application from scratch, producing a running environment that matches the scenario’s requirements. Verification scripts are generated alongside each environment, so task completion can be checked automatically without human review of individual examples. Task difficulty can be varied systematically across the same scenario family, and the entire pipeline runs continuously, producing new training environments as the model’s capabilities improve.
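One way to picture the pairing of environment, task, and verifier is as a small record with a callable success check. The sketch below is an assumption about the shape of such a record, not H Company’s actual schema. TaskSpec, make_cart_task, and the storefront scenario are hypothetical names, and the environment is reduced to a plain state dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, object]  # stand-in for the final state of a generated app


@dataclass(frozen=True)
class TaskSpec:
    scenario: str
    instruction: str
    difficulty: int
    verify: Callable[[State], bool]  # programmatic success condition


def make_cart_task(n_items: int) -> TaskSpec:
    """Build one member of a scenario family; difficulty grows with n_items."""
    def verify(state: State) -> bool:
        return state.get("cart_count") == n_items and state.get("checked_out") is True

    return TaskSpec(
        scenario="storefront",
        instruction=f"Add {n_items} items to the cart and check out.",
        difficulty=n_items,
        verify=verify,
    )


# The same scenario at three difficulty levels.
family = [make_cart_task(n) for n in (1, 3, 5)]
print([t.difficulty for t in family])
```

Because the verifier is generated together with the task, any trace that ends in a final state can be scored without a human looking at it.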
This is structurally analogous to how robotics researchers use physics simulation to generate training data for manipulation policies, or how AlphaZero generated its own games through self-play to train on. The core insight is the same: if real-world demonstration data is too sparse or too expensive to collect at the required scale, build a generator for synthetic data that preserves the structure of the real problem. In robotics, the challenge is simulation-to-reality transfer; physical accuracy matters for policy transfer. For computer use, the analogous challenge is GUI realism and diversity, making synthetic applications varied and representative enough that policies trained on them generalize to production software.
The out-of-domain augmentation component addresses this generalization gap directly. Programmatic scenario extension creates variants of training environments that push agents into edge cases: unusual dialog flows, unexpected application states, non-standard UI configurations. This is the part of the pipeline most likely to determine whether a model that scores well on OSWorld also handles the specific combination of tools that a real enterprise deploys, a stack that includes Jira, Confluence, Salesforce, and a bespoke internal tool that no benchmark has ever seen.
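Programmatic scenario extension can be sketched as a cross product of perturbations applied to a base environment configuration. The configuration keys and values below are hypothetical, chosen only to mirror the kinds of edge cases named above. The source does not describe how H Company’s pipeline actually enumerates variants.

```python
from itertools import product

# Hypothetical base configuration for one generated environment.
BASE = {"app": "storefront", "modal_on_load": False, "locale": "en", "layout": "standard"}

# Hypothetical perturbation axes: unexpected dialogs, locale changes, odd layouts.
PERTURBATIONS = {
    "modal_on_load": [False, True],
    "locale": ["en", "de"],
    "layout": ["standard", "compact"],
}


def extend_scenario(base: dict, perturbations: dict) -> list:
    """Return every combination of perturbations except the unmodified base."""
    keys = list(perturbations)
    variants = []
    for values in product(*(perturbations[k] for k in keys)):
        variant = {**base, **dict(zip(keys, values))}
        if variant != base:
            variants.append(variant)
    return variants


variants = extend_scenario(BASE, PERTURBATIONS)
print(len(variants))
```

The number of variants grows multiplicatively with each new axis, so a real pipeline would presumably sample from this space rather than enumerate it in full.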
Three Training Pillars
The training methodology organizes into three layers. Synthetic navigation data provides scenario-specific demonstrations from both human recordings and model-generated traces; this supplies the dense, task-specific behavioral signal. Out-of-domain augmentation extends coverage to edge cases through programmatic generation. Curated reinforcement learning then filters and weights the resulting data, using a quality-first pipeline to prioritize samples that produce genuine performance improvement.
The RL component is where the most significant gains in recent model training have originated. The emphasis on curation reflects a meaningful shift away from scale-first thinking. For computer use specifically, noisy training traces are actively harmful rather than merely uninformative. A demonstration that teaches the model to click the wrong element under a particular condition, or to abandon a task after the wrong trigger, propagates systematic errors. Filtering those out is not a minor detail; it is load-bearing infrastructure for everything above it.
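The source does not specify the filtering criteria, so the following is only a minimal sketch of one plausible curation rule. It keeps traces whose final state passed the verifier, drops ones that exceed a step budget, and retains the shortest surviving trace per task. The Trace fields, the curate function, and the step budget are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    task_id: str
    actions: List[str]
    verified: bool  # result of the task's programmatic verifier


def curate(traces: List[Trace], max_steps: int) -> List[Trace]:
    """Keep the shortest verified trace per task, within a step budget (assumed rule)."""
    best = {}
    for t in traces:
        if not t.verified or len(t.actions) > max_steps:
            continue  # failed or wandering traces are dropped, not down-weighted
        current = best.get(t.task_id)
        if current is None or len(t.actions) < len(current.actions):
            best[t.task_id] = t
    return list(best.values())


raw = [
    Trace("export", ["click", "type", "save"], True),
    Trace("export", ["click", "click", "type", "save"], True),
    Trace("export", ["click"], False),          # gave up early
    Trace("rename", ["click"] * 40, True),      # succeeded, but far over budget
]
kept = curate(raw, max_steps=10)
print([(t.task_id, len(t.actions)) for t in kept])
```

A hard filter like this discards the early-abandonment and wrong-click demonstrations described above, at the cost of throwing away some data that a softer weighting scheme might still use.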
Open Weights and What They Enable
Holo3 is released under Apache 2.0 with weights available on Hugging Face. This matters in the computer use context for a concrete reason that is easy to overlook.
Enterprise computer use deployments involve sensitive data: financial documents, HR systems, internal tooling, proprietary workflows. Many organizations cannot route screenshots of their internal systems through a third-party API regardless of contractual data handling terms. An open-weight model that runs on-premises or in a private cloud removes that constraint entirely. The organization controls the inference stack, the data stays on the internal network, and the model can be fine-tuned on internal workflows under the same data governance policies that apply to everything else.
Proprietary computer use products from OpenAI (Operator) and Anthropic (Claude Computer Use) have made real progress on capability, but both remain hosted offerings with closed weights. H Company is making a bet that for serious enterprise deployment, local execution is not a preference but a requirement, and that matching closed-model benchmark performance while offering that option is a durable competitive position.
What Adaptive Agency Would Actually Require
The roadmap item H Company calls “adaptive agency” is the most technically ambitious part of the announcement. Holo3, like every current computer use agent, is trained on a distribution of known software environments. It generalizes across applications it has seen during training and handles variations within that distribution.
The unsolved problem is navigating entirely new software with no prior exposure: a bespoke internal CRM built a decade ago, a proprietary trading platform, an industry-specific tool with no public presence. A human employee encountering such a system for the first time reads UI labels, infers button behavior from visual context and positioning, asks a colleague, tries something and observes the result. The agent equivalent of that process, working reliably in real time without a prebuilt training environment, remains an open research problem.
The synthetic environment factory points in the right direction. If you can generate training environments automatically, the gap between “seen during training” and “seen for the first time” narrows. But narrowing it to zero requires the agent to reason about novel interfaces from first principles rather than pattern-matching to the training distribution. That is a different capability from what current benchmarks measure, and it is the capability that would make computer use agents transformative for enterprise operations rather than merely useful for a bounded set of known workflows.
Holo3 is a clear advance on a hard benchmark, with an architectural and training approach that is worth studying closely. The synthetic environment factory in particular represents a solution to a problem that any serious computer use agent has to solve, and H Company has made the model itself open for anyone building in this space.