3B Active Parameters, State-of-the-Art Computer Use: What Holo3 Reveals About Agent Training
Source: huggingface
H Company dropped Holo3 on April 1, 2026, which is either a bold release date choice or the most confidence a team has ever shown in their own results. The headline number is 78.85% on OSWorld-Verified, currently the leading score on the most rigorous desktop computer use benchmark available. The more interesting number is 3B: the active parameter count doing that work.
The gap between total and active parameters is the whole story. Holo3-35B-A3B has 35 billion total parameters but uses a sparse Mixture-of-Experts architecture built on top of Qwen3.5-35B-A3B, meaning only about 3 billion parameters are active during any given forward pass. At inference time, its per-token compute cost is roughly that of a dense 3B model, not a 35B one, although all 35 billion parameters still have to sit in memory. That matters enormously if you are actually running these agents at scale, because computer use is inherently sequential and latency-sensitive in a way that batch text generation is not.
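As a back-of-the-envelope check on that compute claim, the sketch below uses the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token. The rule of thumb is an assumption (it ignores attention over the context and routing overhead), and the parameter counts are simply read off the model name.

```python
# Rough forward-pass cost: about 2 FLOPs per active parameter per token
# (one multiply and one add). This is an approximation that ignores
# attention over the context window and MoE routing overhead.
TOTAL_PARAMS = 35e9   # "35B" in the model name
ACTIVE_PARAMS = 3e9   # "A3B" in the model name: ~3B active per token


def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs needed to process one token."""
    return 2.0 * active_params


moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)
compute_ratio = dense_flops / moe_flops

print(f"MoE, 3B active: {moe_flops:.1e} FLOPs/token")
print(f"Dense 35B:      {dense_flops:.1e} FLOPs/token")
print(f"A dense 35B model needs about {compute_ratio:.1f}x more compute per token")
```

Because an agent trajectory is a chain of dependent steps, that per-token saving repeats at every step rather than being paid once.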
Why MoE Makes Sense for Computer Use
Mixture-of-Experts has had a complicated reputation. The architecture routes each token through a subset of specialized “expert” feed-forward layers rather than the full network, which gives you more total capacity without a proportional increase in compute per forward pass. The tradeoff has historically been training instability, load balancing headaches, and memory pressure from having to keep all those parameters resident even when most are idle.
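To make the routing idea concrete, here is a minimal top-k routing sketch in plain Python. It is not the router used by Holo3 or Qwen3.5: the four scalar "experts", the logits, and k=2 are all hypothetical, and real implementations work on tensors with many more experts and extra load-balancing terms.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}


def moe_layer(token, experts, router_logits, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    weights = route_top_k(router_logits, k)
    return sum(w * experts[i](token) for i, w in weights.items())


# Toy experts (hypothetical): each one scales its scalar input differently.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 1.5]  # made-up router scores for one token

weights = route_top_k(router_logits, k=2)
output = moe_layer(1.0, experts, router_logits, k=2)
print(weights)  # only two of the four experts receive any weight
print(output)
```

Only the selected experts are evaluated, which is where the compute saving comes from. All four still have to exist in memory, which is the memory pressure mentioned above.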
For computer use specifically, the case for MoE is stronger than for general language modeling. A GUI agent is doing a genuinely heterogeneous set of sub-tasks within a single trajectory: reading text from a screenshot, identifying UI elements by visual position, reasoning about multi-step task state, generating action sequences, and recovering from unexpected interface states. These sub-tasks have different computational signatures. Routing different aspects of perception and planning through different expert subsets is at least a plausible architectural fit, and the Qwen3.5 base already came with a stable MoE foundation to fine-tune from.
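The model card lists the tensor format as BF16 safetensors, which suggests standard loading via transformers with the usual quantization options available. If you want to run this locally you are looking at roughly 70GB for the weights alone in BF16 (35 billion parameters at 2 bytes each), before the KV cache and activations. That is beyond a two-GPU consumer setup without quantization, since two 24GB cards hold only 48GB, so running unquantized means a single card with 80GB or more, or a larger multi-GPU rig.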
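The weight footprint is simple arithmetic, sketched here for three precisions. The figures cover weights only and assume ideal packing. Real quantized checkpoints carry extra scale and zero-point data, and the KV cache and activations come on top.

```python
# Every expert must be resident in memory, so the footprint scales with
# the 35B total parameter count, not the ~3B that are active per token.
TOTAL_PARAMS = 35e9

# Idealized storage cost per parameter; quantized formats add some overhead.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Weight-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


footprints = {dtype: weight_memory_gb(TOTAL_PARAMS, dtype) for dtype in BYTES_PER_PARAM}
for dtype, gb in footprints.items():
    print(f"{dtype}: {gb:.1f} GB")
```

Under these assumptions the weights come to 70 GB in BF16, 35 GB in INT8, and 17.5 GB in INT4, which is why quantization decides whether consumer hardware is an option.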
What OSWorld Actually Measures
OSWorld-Verified is worth understanding before treating the benchmark number as a simple score. The original OSWorld benchmark was designed to test agents on realistic computer tasks inside live virtual machines, covering applications like LibreOffice, Chrome, VS Code, and system-level operations. Tasks require multi-step execution and the environments are dynamically initialized, so there is no static answer key.
The “Verified” variant adds an additional validation layer on top of the base benchmark, filtering out tasks where automated verification was ambiguous and adding stricter functional checks. This makes the scores harder to game through pattern matching or approximate completion. A score in the high 70s on this variant represents genuine task completion across a wide spread of application types, not just success on well-represented categories.
For reference, twelve months ago the strongest proprietary computer use agents were scoring in the 30-40% range. The jump to near 80% in the verified setting is substantial, and Holo3 sits at the top of that curve alongside models with significantly higher inference costs.
The Training Pipeline Is the Real Innovation
The architecture is the delivery mechanism; the training approach is where H Company appears to have made their actual technical bets.
They describe what they call an “Agentic Learning Flywheel” built on three components. The first is synthetic navigation data: a mix of human-annotated trajectories and programmatically generated instruction-scenario pairs. The second is out-of-domain augmentation, where they extend scenarios programmatically to cover unexpected interface states and edge cases that would be expensive to annotate manually. The third is curated reinforcement learning with explicit data filtering before the RL stage.
The piece that stands out is the Synthetic Environment Factory. Rather than relying on recordings of real software or carefully hand-crafted test environments, they built a system where coding agents generate entire websites and application environments from scratch, then verify those environments end-to-end with automated scripts before using them as training scenarios. This closes a loop that has been a persistent problem in agent training: you need diverse, realistic environments to train on, but curating them by hand does not scale, and scraping the web gives you demonstrations but not verifiable task completions.
Generation plus verification is a pattern that has shown up across recent AI training pipelines. For math reasoning it looks like generating candidate proofs and checking them against a formal verifier. For computer use it looks like generating candidate environments and checking that they behave as specified. The common thread is that you can scale training data by automating the checking step, which is usually easier than automating the generation step.
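The generate-then-verify loop can be sketched in a few lines. This is a toy illustration, not H Company's pipeline: simple arithmetic tasks stand in for generated environments, and `generate_candidate`, `verify`, and `build_dataset` are hypothetical names invented for the example.

```python
import random


def generate_candidate(rng):
    """Stand-in generator: proposes a task plus a claimed answer, sometimes wrong."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    # Roughly 30% of candidates are deliberately faulty.
    claimed = a + b if rng.random() < 0.7 else a + b + 1
    return {"task": (a, b), "claimed": claimed}


def verify(candidate):
    """Automated check: recompute the answer independently of the generator."""
    a, b = candidate["task"]
    return candidate["claimed"] == a + b


def build_dataset(n_candidates, seed=0):
    """Generate many candidates and keep only those that pass verification."""
    rng = random.Random(seed)
    candidates = [generate_candidate(rng) for _ in range(n_candidates)]
    return [c for c in candidates if verify(c)]


dataset = build_dataset(1000)
print(f"kept {len(dataset)} of 1000 candidates")
```

The generator is allowed to be unreliable because nothing reaches the training set without passing the check. That only works if the verifier is independent of the generator and cheap to run.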
Two Pillars: Perception and Planning
Holo3 is trained with an explicit separation between two agentic capabilities that H Company calls Perception and Decision-Making. Perception covers visual grounding: finding UI elements, reading text from screenshots, understanding spatial layout. Decision-Making covers the planning layer: maintaining task state across steps, selecting actions given current interface state, and recovering when an action produces an unexpected result.
This decomposition is not just organizational. It reflects a real asymmetry in how these capabilities fail. Perception errors tend to be local: the model misidentifies a button or misreads a label. Planning errors tend to be global: the model loses track of where it is in a multi-step task or fails to account for a dependency between actions. Training for these two modes explicitly, rather than hoping a single objective covers both, is a reasonable engineering decision.
The benchmark performance on ScreenSpot-Pro, which specifically tests UI localization accuracy, suggests the perception side is working well. The 486-task H Corporate Benchmark, which includes long-horizon multi-application workflows like cross-referencing a PDF against a spreadsheet and then sending personalized emails based on the result, tests the planning side under more realistic conditions than most academic benchmarks.
Open Weights at This Performance Level
The Apache 2.0 license is not a small detail. Most competitive computer use systems at this performance level are proprietary APIs. Running them in production means accepting vendor pricing, rate limits, data retention policies, and the risk that the API changes or disappears. For enterprise deployments involving sensitive documents and internal systems, sending screenshots of your applications to a third-party API is often a non-starter regardless of performance.
Open weights at 78%+ on a verified benchmark change that calculation. An organization can deploy Holo3 on their own infrastructure, control what data leaves their network, and fine-tune on their own workflows without depending on a vendor’s fine-tuning offering. The 3B active parameter footprint keeps per-token compute low, so they can do this at reasonable inference cost rather than needing datacenter-scale hardware, provided the full 35B parameters fit in memory.
This is the recurring pattern of the past two years: a capability appears first in proprietary systems, then open-weight models reach parity within several months, then the open-weight versions start winning on efficiency because they benefit from the broader ecosystem of quantization, optimization, and tooling. Computer use has taken longer than text reasoning to follow this curve, partly because the training data problem is harder and partly because the evaluation infrastructure is more complex. Holo3 looks like the inflection point.
What Remains Unproven
The H Corporate Benchmark is worth treating with some skepticism until independent evaluations replicate it. The 486-task set was designed by H Company for their own product, which means it likely reflects the kinds of tasks their training pipeline was optimized for. This does not make the benchmark meaningless, but it does mean the number should be read as “performs well on realistic enterprise workflows as H Company defines them” rather than as a universal claim.
The roadmap mention of “Adaptive Agency,” described as real-time learning to navigate previously unseen enterprise software, is ambitious enough to treat as a future product promise rather than a current capability. Adapting on the fly to a novel interface without prior training is a substantially harder problem than performing well on a fixed distribution of known applications.
There is also the question of robustness. Benchmark performance and deployment reliability are different things. A model that completes 78% of benchmark tasks successfully might have a very different error distribution than one that is actually pleasant to use in production, where partial completions and graceful failures matter as much as success rate.
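A small tally shows how one headline number can hide very different behavior. The two outcome logs below are invented purely for illustration. They are not measurements of Holo3 or any other model, and the outcome labels are assumptions about what a deployment might track.

```python
from collections import Counter

# Hypothetical outcome logs for two imaginary agents over 100 tasks each.
# These numbers are made up for illustration only.
agent_a = ["success"] * 78 + ["partial"] * 15 + ["clean_abort"] * 5 + ["harmful"] * 2
agent_b = ["success"] * 78 + ["partial"] * 2 + ["clean_abort"] * 2 + ["harmful"] * 18


def summarize(outcomes):
    """Return the fraction of tasks that ended in each outcome."""
    counts = Counter(outcomes)
    return {label: counts[label] / len(outcomes) for label in counts}


summary_a = summarize(agent_a)
summary_b = summarize(agent_b)
print(summary_a)
print(summary_b)
```

Both imaginary agents would report the same 78% success rate, yet the second fails in a damaging way nine times as often. That gap is what a single benchmark score cannot show.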
Where This Points
H Company’s inference API offers a free tier alongside the open weights, which gives developers a path to prototype before committing to self-hosting. The combination of accessible weights and a hosted option mirrors what has worked for text models: let researchers and small teams experiment with the open version, then convert production users to the API, where the operational overhead is handled for them.
The broader implication is that the compute cost of computer use agents is about to drop significantly. When the leading open model has 3B active parameters, optimization efforts from the community, including quantization to INT4 or INT8, speculative decoding, and batched action generation, become directly applicable. The same infrastructure improvements that made 7B text models fast enough for interactive use will apply here.
Building with this kind of agent has always had an obvious appeal for bot and automation work. The question has been whether the models were reliable enough to be worth the integration complexity. At 78% on a verified benchmark with an open license and a sub-4B active parameter footprint, the reliability argument is getting harder to dismiss.