The Throughput Bet: Holotron-12B and the Case for Hybrid SSM in GUI Agents
Source: huggingface
Most GUI agent papers optimize for one thing: benchmark score. HCompany’s Holotron-12B is notable because it optimizes for two things simultaneously, and the second one — inference throughput at high concurrency — is arguably more important for anyone trying to run these agents in production.
The headline numbers: 80.5% on WebVoyager (up from 35.1% for the base model), and 8,900 tokens per second on a single H100 with 100 concurrent sessions. That throughput figure is 75% higher than their previous Holo2-8B model, despite Holotron being 50% larger by parameter count. That combination is not obvious, and the reason for it sits entirely in the architecture.
Why KV Cache Is the Real Bottleneck for Computer Use Agents
A computer use agent processes screenshots. Lots of them. Each screenshot, when encoded by a vision-language model, expands into hundreds or thousands of tokens depending on the resolution and encoding scheme. A multi-step task — filling out a form, navigating a booking site, operating a desktop application — might require twenty or thirty screenshots across a single session. That’s potentially 30,000 to 60,000 vision tokens in a single context, on top of all the text.
In a standard transformer, every one of those tokens needs a KV cache entry that persists for the duration of generation. At batch size 1, this is manageable. At 100 concurrent sessions on one GPU, the KV cache alone can fill the device’s memory before you’ve generated a single output token. The usual response to this is to limit batch size, which directly tanks throughput.
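The arithmetic behind this is worth making concrete. The sketch below sizes the KV cache for a hypothetical pure-transformer 12B-class model; the layer count, KV head count, head dimension, and fp16 dtype are illustrative assumptions, not Holotron's actual configuration.

```python
# Back-of-envelope KV cache sizing for a hypothetical pure-transformer
# 12B-class model. All shape parameters are illustrative assumptions.

def kv_cache_bytes(tokens, n_layers=40, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of KV cache for one sequence: K and V per layer per token."""
    return tokens * n_layers * 2 * n_kv_heads * head_dim * dtype_bytes

per_session = kv_cache_bytes(50_000)      # a ~50k-token GUI agent session
total = 100 * per_session                 # 100 concurrent sessions

print(f"per session: {per_session / 1e9:.1f} GB")   # ~8.2 GB
print(f"100 sessions: {total / 1e9:.0f} GB")        # ~819 GB, >> an H100's 80 GB
```

Under these assumptions, 100 concurrent long-context sessions would need roughly ten H100s' worth of memory for cache alone, before weights or activations.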
This is the specific problem that Holotron-12B’s architecture is designed to address.
The Hybrid SSM Architecture
Holotron-12B is built on NVIDIA’s Nemotron-Nano-12B-v2 as a base, which is a hybrid model combining State-Space Model (SSM) layers with traditional attention layers. The distinction matters enormously for inference.
State-Space Models, particularly the selective variant introduced by Mamba (Gu and Dao, 2023), process sequences by compressing history into a fixed-size recurrent state. The state doesn’t grow as the sequence gets longer. At inference time, SSM layers don’t read or write a KV buffer — they update a constant-size hidden state. Memory per sequence is essentially flat regardless of context length.
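A toy recurrence makes the memory behavior visible. Real selective SSMs like Mamba make the transition parameters input-dependent; this sketch uses constants purely to show that the per-sequence state never grows.

```python
# Toy illustration of why an SSM layer's per-sequence memory is flat:
# a diagonal linear recurrence keeps one fixed-size state vector and
# updates it in place, no matter how long the input stream gets.

d_state = 16                 # fixed state size, chosen arbitrarily
a = [0.9] * d_state          # decay per state channel (constant here; Mamba
                             # computes these from the input)
h = [0.0] * d_state          # the ONLY per-sequence memory this layer keeps

def ssm_step(h, x):
    """One recurrence step: h <- a*h + x. State size never changes."""
    return [a_i * h_i + x for a_i, h_i in zip(a, h)]

for t in range(100_000):     # 100k tokens, same 16-float state throughout
    h = ssm_step(h, 1.0)

print(len(h))                # still 16: memory is independent of length
```

Contrast with attention, where step 100,000 requires the keys and values of all 99,999 prior steps to still be resident.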
Pure SSMs have one known weakness: precise retrieval of specific earlier tokens. Attention handles this well because every token can directly attend to every other token in the window. SSMs, because they compress history, can lose precise detail. The hybrid approach keeps a fraction of attention layers for retrieval while delegating most sequence processing to SSM layers. The Nemotron-Nano family, along with similar architectures like Jamba from AI21 Labs and Zamba from Zyphra, all sit in this design space: the SSM-to-attention layer ratio becomes a key tunable hyperparameter, trading retrieval quality against memory footprint.
For a computer use agent, the practical outcome is that you can pack far more concurrent sessions onto a single GPU without blowing up memory. Larger effective batch sizes translate directly to higher tokens per second when you’re serving many users or running large-scale data generation pipelines.
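To see how the hybrid split changes concurrency, consider the sketch below. The layer split (6 attention, 34 SSM of 40 total), the SSM state size, and the free-memory budget are all illustrative assumptions, not the published Nemotron-Nano configuration.

```python
# Sketch of the concurrency math for a hybrid stack. All sizes and the
# attention/SSM layer split are illustrative assumptions.

KV_PER_TOKEN_PER_LAYER = 2 * 8 * 128 * 2   # K+V, 8 KV heads, dim 128, fp16
SSM_STATE_PER_LAYER = 8 * 128 * 16 * 2     # fixed-size state, length-independent

def session_bytes(tokens, attn_layers, ssm_layers):
    attn = tokens * attn_layers * KV_PER_TOKEN_PER_LAYER  # grows with context
    ssm = ssm_layers * SSM_STATE_PER_LAYER                # note: no tokens factor
    return attn + ssm

budget = 40e9                              # HBM left after weights (assumed)
pure = session_bytes(50_000, attn_layers=40, ssm_layers=0)
hybrid = session_bytes(50_000, attn_layers=6, ssm_layers=34)

print(int(budget // pure), int(budget // hybrid))   # sessions that fit: 4 vs 32
```

Under these assumptions the hybrid fits roughly 8x more concurrent 50k-token sessions in the same memory budget, which is exactly the lever behind the throughput numbers.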
Training: From 35% to 80% on WebVoyager
Starting from the Nemotron-Nano base, HCompany ran a two-stage post-training process on roughly 14 billion tokens of proprietary data focused on two things: localization (where on the screen is this element) and navigation (how do I sequence actions to complete this task). The result moves the WebVoyager score from 35.1% to 80.5%.
WebVoyager is a live web navigation benchmark where the agent operates a real browser to complete tasks like searching for information, filling forms, and extracting data from pages. It’s more end-to-end than pure grounding benchmarks like ScreenSpot, which only tests whether a model can point at the right pixel given a description. WebVoyager success requires both accurate UI understanding and coherent multi-step planning.
An 80.5% score is competitive with significantly larger models. The comparison worth watching is against ByteDance's UI-TARS, which ships in a 72B variant and posted strong numbers on OSWorld and ScreenSpot Pro in early 2025. The current post doesn't publish ScreenSpot Pro numbers for Holotron-12B, so a direct comparison on that benchmark isn't possible yet, but the WebVoyager result at 12B parameters puts it firmly in contention.
HCompany also reports improvements on OS-World-G (the grounding sub-task from OSWorld), GroundUI, and WebClick, though the post focuses on WebVoyager as the primary end-to-end signal.
The Production Economics Argument
The throughput-first framing in this release is deliberate, and it points at something the research community doesn’t always foreground: building a computer use agent that works is a different problem from building one that’s economically viable to run at scale.
Consider what’s required to train a capable GUI agent beyond the initial supervised fine-tuning stage. Online reinforcement learning, where the agent generates its own training data by actually completing tasks, requires running thousands of agent sessions concurrently, evaluating outcomes, and updating on the results. Data annotation pipelines at scale require the same. At 8,900 tokens per second with 100 concurrent workers on one H100, Holotron-12B substantially reduces the GPU-hours required to generate a given volume of training data compared to a model that produces 5,100 tokens per second with the same hardware allocation.
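The cost difference compounds over any fixed data-generation budget. The arithmetic below uses the two throughput figures above; the 14B-token budget is an illustrative choice (picked to match the post-training corpus size mentioned earlier), not a reported pipeline cost.

```python
# Hedged arithmetic on the throughput claim: GPU-hours to generate a
# fixed token budget at the two reported per-H100 rates. The token
# budget is an illustrative assumption.

def gpu_hours(total_tokens, tokens_per_sec):
    """Wall-clock GPU-hours to emit total_tokens at a given rate."""
    return total_tokens / tokens_per_sec / 3600

token_budget = 14e9                        # e.g. one 14B-token data pass
fast = gpu_hours(token_budget, 8_900)      # Holotron-12B's reported rate
slow = gpu_hours(token_budget, 5_100)      # the slower comparison rate

print(f"{fast:.0f} vs {slow:.0f} H100-hours")   # ~437 vs ~763
```

That gap of roughly 325 H100-hours per pass recurs on every iteration of an online-learning or annotation loop, which is why throughput behaves as a first-class constraint rather than a nicety.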
HCompany frames this explicitly: the throughput advantage is about making the data generation and online learning pipelines cheaper to run, which creates a compounding advantage as the model improves through iteration.
This is a real distinction from how research groups typically evaluate computer use agents. Anthropic’s computer use capability, released in October 2024, is a general-purpose adaptation of Claude rather than a purpose-trained policy model. OpenAI’s Operator uses a similar adaptation approach. These systems optimize for capability and safety in a general-purpose context. HCompany’s Holo line is optimizing for the specific policy-training trajectory, where throughput is a first-class constraint alongside accuracy.
What the Architecture Tradeoffs Actually Cost
Hybrid SSM-attention isn’t free. The fixed-size recurrent state in SSM layers means the model can lose information from very distant context in ways that pure attention cannot. For a web navigation task, this might matter when the relevant screenshot from ten steps ago contains a piece of information needed to complete the current step. Attention would retrieve it precisely; SSM layers might have compressed it away.
How much this matters in practice depends heavily on the task distribution. For most GUI agent tasks, the relevant context is recent: the current screenshot, the last few actions, the original task description. Precise retrieval of a token from 50,000 positions back is rare. HCompany’s 80.5% WebVoyager score suggests the hybrid architecture doesn’t hurt on the benchmark tasks where this precision matters, but it’s worth being cautious about extrapolating to tasks with very long dependency chains.
The other constraint is the base model’s license. Nemotron-Nano is released under the NVIDIA Open Model License, which permits commercial use with conditions including attribution and certain use restrictions. For anyone building on top of Holotron-12B, the licensing chain matters. This is different from a model built on a fully permissive base like Mistral or Llama, so it’s worth reading the license carefully before committing to it for a production system.
Where This Fits in the GUI Agent Landscape
The GUI agent space in early 2026 has stratified into roughly three clusters. General-purpose models adapted for computer use (Claude, GPT-4o) bring broad capability but weren’t trained with GUI tasks as a primary objective. Large purpose-trained models (UI-TARS 72B) push accuracy as high as possible at significant inference cost. Smaller purpose-trained models (Holotron-12B, prior Holo2-8B) target the specific point where accuracy is good enough for most tasks and inference cost enables the scale needed to keep improving.
Holotron-12B occupies that third tier and advances it meaningfully. The combination of 80.5% WebVoyager performance with 8,900 tokens per second throughput establishes a new Pareto frontier for models in this class. NVIDIA’s roadmap to build Nemotron 3 Omni with a hybrid SSM and Mixture of Experts (MoE) combination suggests this architectural direction will get more competitive over time, not less.
The model is available on Hugging Face and runs on standard vLLM v0.14.1 infrastructure. For teams building agentic workflows that need to run many sessions concurrently — automated testing, data collection, enterprise task automation — the throughput numbers alone make it worth evaluating against whatever you’re currently using.