Throughput as a First-Class Concern: What Holotron-12B Gets Right About Computer Use Agents
Source: huggingface
The distinction between a model that can use a computer and a model that can run hundreds of computer-use sessions simultaneously is not just a matter of scaling hardware. It requires rethinking what architectural properties matter for the workload. Holotron-12B, released by H Company in March 2026, makes that argument explicitly through its architecture choices.
The Memory Problem at the Core of Computer Use
A computer use agent processes sequences of screenshots. Every action the agent takes (click, scroll, type, navigate) produces a new observation: a high-resolution image of the current screen state. Over a full task trajectory, this compounds. A 50-step web navigation might accumulate 50 separate screenshots in context, each encoded into hundreds of vision tokens before the language model even begins reasoning about what to do next.
Transformer-based vision-language models handle this through the KV cache. During autoregressive inference, the key and value projections for every token in the current context are stored in GPU memory so subsequent forward passes do not need to recompute them for the full sequence. The memory cost grows linearly with context length. Compute grows as well: processing a context of N tokens from scratch requires O(N²) attention work, and even with the cache each newly generated token must attend over all N cached entries. For a single session, this is manageable. For 100 concurrent sessions on the same GPU, each with its own growing KV cache full of encoded screenshots, it becomes the binding constraint on batch size and therefore on throughput.
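To make the scale concrete, here is a back-of-the-envelope sketch of KV cache size for a pure transformer. The layer count, head configuration, and tokens-per-screenshot figure are hypothetical values chosen for illustration; they are not taken from Holotron-12B or any other specific model.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache held by one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical configuration, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128
TOKENS_PER_SCREENSHOT = 500  # assumed; the text only says "hundreds"
STEPS = 50                   # the 50-step trajectory from the text
SESSIONS = 100               # concurrent sessions sharing one GPU

tokens = STEPS * TOKENS_PER_SCREENSHOT
per_session = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM)
total = per_session * SESSIONS

print(f"{per_session / 2**30:.2f} GiB per session")
print(f"{total / 2**30:.1f} GiB for {SESSIONS} sessions")
```

Under these assumed numbers a single session is comfortable, but 100 of them would need several times the memory of one 80 GB accelerator, which is why the cache rather than the weights ends up limiting batch size.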
This is the problem that State Space Models were designed to address. SSMs process sequences through a fixed-size recurrent state rather than a cache that grows with every token. Each new token updates the state, and that state stays the same size regardless of how many tokens have passed through it. The memory footprint for a running SSM sequence is bounded, which means you can fit far more concurrent sequences into the same VRAM budget. The tradeoff is that SSMs are less precise at selective retrieval from long context, the kind of reasoning that requires attending back to something specific seen hundreds of tokens ago, which transformers handle cleanly through direct attention.
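The fixed-size state can be illustrated with a deliberately simplified recurrence. This is a minimal sketch of a diagonal linear state update with constant, hand-picked coefficients; it is not Mamba's selective scan, in which the coefficients depend on the input, and the names `ssm_scan`, `a`, `b`, and `c` are illustrative.

```python
def ssm_scan(inputs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t (elementwise) and emit y_t = c . h_t.

    The state has len(a) entries no matter how many inputs are consumed.
    """
    state = [0.0] * len(a)
    outputs = []
    for x in inputs:
        # Update every state channel from its previous value and the new input.
        state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
        # Read out a scalar as a weighted sum of the state channels.
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs, state


# Illustrative coefficients for a three-channel state.
a = [0.9, 0.5, 0.1]
b = [1.0, 1.0, 1.0]
c = [0.2, 0.3, 0.5]

short_out, short_state = ssm_scan([1.0] * 10, a, b, c)
long_out, long_state = ssm_scan([1.0] * 10_000, a, b, c)

# The sequence is 1000x longer, but the carried state is the same size.
print(len(short_state), len(long_state))
```

Everything the model remembers about earlier tokens has to fit in that state, which is also why precise recall of a specific earlier token is harder than with attention.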
Mamba, introduced in late 2023, demonstrated that selective SSMs could close much of the quality gap with transformers on language tasks while scaling linearly in sequence length and keeping a constant-size state at inference time. The subsequent research trajectory has focused on hybrid architectures that intersperse SSM layers with a small number of attention layers, capturing the memory efficiency of recurrence while retaining precise retrieval capability where it matters most.
Nemotron-Nano as a Foundation
Holotron-12B is post-trained from NVIDIA’s Nemotron-Nano-2 VL model, which is built on exactly this hybrid SSM-attention design. The architecture replaces most standard multi-head attention layers with SSM-based recurrent layers, keeping a subset of attention layers for tasks that require long-range precise retrieval. For a computer use agent, this maps reasonably onto the task structure: most of the visual processing is local and compressive, encoding what the current screen looks like right now, while planning decisions benefit from attending back to the original task specification and recent action history.
The practical consequence is that Nemotron-Nano’s inference memory profile looks quite different from a pure transformer at the same parameter count. Because only the remaining attention layers keep a KV cache, per-sequence memory grows far more slowly as context lengthens, which means the GPU can handle larger effective batch sizes as sequences accumulate. For a deployment serving 100 concurrent agents, each processing a long trajectory of screenshots, this difference becomes measurable at the system level.
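A rough sketch of how this affects the number of sessions that fit in memory follows. The layer split, per-layer state size, and cache budget are all hypothetical; the actual Nemotron-Nano configuration is not given here and may differ substantially.

```python
def per_session_bytes(tokens, attn_layers, ssm_layers,
                      kv_bytes_per_token_per_layer, ssm_state_bytes_per_layer):
    """Memory one running sequence needs for its cache and recurrent state."""
    kv = attn_layers * kv_bytes_per_token_per_layer * tokens  # grows with context
    ssm = ssm_layers * ssm_state_bytes_per_layer              # constant per sequence
    return kv + ssm


def max_sessions(budget_bytes, session_bytes):
    """How many sequences of this size fit in the given memory budget."""
    return budget_bytes // session_bytes


# All values below are assumptions for illustration.
KV_PER_TOKEN_PER_LAYER = 2 * 8 * 128 * 2  # keys+values, 8 KV heads, dim 128, 16-bit
SSM_STATE_PER_LAYER = 2 * 2**20           # 2 MiB of recurrent state per layer
BUDGET = 40 * 2**30                       # 40 GiB left over after model weights
TOKENS = 25_000                           # 50 screenshots at 500 tokens each

pure = per_session_bytes(TOKENS, 40, 0, KV_PER_TOKEN_PER_LAYER, SSM_STATE_PER_LAYER)
hybrid = per_session_bytes(TOKENS, 6, 34, KV_PER_TOKEN_PER_LAYER, SSM_STATE_PER_LAYER)

print("pure transformer sessions:", max_sessions(BUDGET, pure))
print("hybrid sessions:", max_sessions(BUDGET, hybrid))
```

The exact ratio depends entirely on the assumed numbers, but the shape of the result does not: the fewer layers that carry a per-token cache, the more long-context sequences can be batched together.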
Measured Throughput
H Company benchmarked inference on a single H100 GPU using vLLM v0.14.1. At 100 concurrent requests, Holotron-12B reaches 8.9k tokens per second. The previous H Company model, Holo2-8B, reaches 5.1k tokens per second at peak concurrency and plateaus quickly as concurrent request count increases. Holotron-12B continues scaling with additional concurrent requests because its bounded state size keeps effective batch sizes large even as individual sequences grow longer.
At the 100-request benchmark point, that is roughly 75% more output throughput on identical hardware. For workloads that generate large numbers of agent trajectories, this translates directly to iteration cost. If you are running an online RL loop where each training cycle requires thousands of rollouts from a live computer use environment, the throughput of your policy model determines how fast the loop runs. More rollouts per unit time means more gradient signal per hour of GPU compute.
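The arithmetic behind that comparison, and what it could mean for rollout volume, can be checked in a few lines. The two throughput figures come from the benchmark above; the tokens-per-rollout value is an assumption for illustration.

```python
HOLOTRON_TPS = 8_900  # tokens/s at 100 concurrent requests, from the benchmark
HOLO2_TPS = 5_100     # tokens/s at peak concurrency, from the benchmark


def relative_gain(new, old):
    """Fractional improvement of `new` over `old`."""
    return new / old - 1.0


def rollouts_per_hour(tokens_per_second, tokens_per_rollout):
    """Rollouts one GPU can generate per hour at a given output throughput."""
    return tokens_per_second * 3600 / tokens_per_rollout


gain = relative_gain(HOLOTRON_TPS, HOLO2_TPS)
print(f"throughput gain: {gain:.1%}")

TOKENS_PER_ROLLOUT = 2_000  # assumed generated tokens per trajectory
for name, tps in [("Holotron-12B", HOLOTRON_TPS), ("Holo2-8B", HOLO2_TPS)]:
    print(name, round(rollouts_per_hour(tps, TOKENS_PER_ROLLOUT)), "rollouts/hour")
```

This counts generated tokens only. In a live environment, browser latency and screenshot encoding also take time, so real rollout rates would be lower for both models.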
This is not the metric most computer use agent research prioritizes. Most benchmarking focuses on task success rate: what percentage of tasks does the model complete correctly. That metric matters, but it is not the only thing that matters in production.
Benchmark Performance
On WebVoyager, a web navigation benchmark requiring real browser interaction across a range of task types, Holotron-12B scores 80.5%. The Nemotron base model without agent-specific post-training scores 35.1% on the same benchmark. The gap reflects two stages of training: the base model provides strong visual understanding and language capabilities, while H Company’s post-training pipeline adds the specific behaviors needed for accurate UI grounding and navigation.
Grounding is one of the harder sub-problems in computer use. A click target might be a small icon in a dense toolbar. Being off by 20 pixels means missing it entirely. Models trained on general visual data often lack the coordinate precision required. Holotron-12B shows improvement over the Nemotron base on GroundUI, OS-World-G, and WebClick, three benchmarks that test the ability to identify and localize UI elements accurately in screenshots. The post-training data mixture covers approximately 14 billion tokens focused on screen understanding and navigation tasks drawn from H Company’s proprietary data pipeline.
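One simple way to score grounding is to check whether the predicted click lands inside the target element's bounding box. The sketch below assumes that criterion; the benchmarks named above may use different or additional rules, and the `Box` type and `click_accuracy` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixels; right and bottom edges are exclusive."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x, y):
        return self.left <= x < self.right and self.top <= y < self.bottom


def click_accuracy(predictions, targets):
    """Fraction of predicted (x, y) clicks that land inside their target box."""
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(targets)


# A 24x24 pixel toolbar icon.
icon = Box(left=100, top=100, right=124, bottom=124)

center_click = (112, 112)  # dead center of the icon
offset_click = (132, 112)  # the same click shifted 20 pixels to the right

print(icon.contains(*center_click))
print(icon.contains(*offset_click))
print(click_accuracy([center_click, offset_click], [icon, icon]))
```

For an icon this small, a 20-pixel error from the center is enough to leave the box entirely, which is the failure mode the paragraph above describes.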
What the Design Priorities Reveal
H Company uses computer use agents internally for data generation and annotation, running thousands of agent trajectories to produce training data for subsequent model generations. These are throughput-bound workloads by nature. The accuracy of any individual trajectory matters less than the aggregate quality of thousands of trajectories, and the speed at which those trajectories can be generated determines the pace of the development cycle.
This shapes the model design in a way that differs from an agent built primarily for interactive single-session use. An interactive computer use tool serves one user at a time and is latency-sensitive: the user is waiting for the next action. A data-generation agent runs in parallel across hundreds of tasks and is throughput-sensitive: the system is processing as many trajectories per GPU-hour as possible. These two use cases have different bottlenecks, and Holotron-12B is optimized around the second.
The online RL angle is worth dwelling on. Training a computer use agent with reinforcement learning on live environment interactions requires running many rollouts in parallel, scoring them against a reward signal, and using those scores to update model weights. The rollout generation step is typically the bottleneck because it requires the agent to actually interact with an environment in real time. A model with higher concurrent throughput means a faster rollout pipeline, which means faster RL iteration. The architecture choice here is not incidental to the training methodology; it enables it.
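The structure of such a loop can be sketched with stubbed components. Everything here is a hypothetical stand-in: `run_rollout` fakes both the policy and the environment, the reward is random, and no weights are updated. It is meant only to show where rollout concurrency enters the cycle, not how H Company's pipeline works.

```python
import asyncio
import random


async def run_rollout(task_id, max_steps=5):
    """Hypothetical stand-in for one agent trajectory in a live environment."""
    rng = random.Random(task_id)  # seeded per task so runs are reproducible
    trajectory = []
    for step in range(max_steps):
        await asyncio.sleep(0)  # placeholder for a policy call plus an environment step
        trajectory.append((step, rng.choice(["click", "scroll", "type"])))
    reward = 1.0 if rng.random() < 0.5 else 0.0  # placeholder reward signal
    return trajectory, reward


async def collect_rollouts(task_ids, max_concurrent):
    """Run rollouts concurrently, capped at what the policy server can batch."""
    limit = asyncio.Semaphore(max_concurrent)

    async def bounded(task_id):
        async with limit:
            return await run_rollout(task_id)

    return await asyncio.gather(*(bounded(t) for t in task_ids))


def training_cycle(task_ids, max_concurrent=100):
    """Collect one batch of rollouts and summarize their rewards."""
    rollouts = asyncio.run(collect_rollouts(task_ids, max_concurrent))
    mean_reward = sum(reward for _, reward in rollouts) / len(rollouts)
    # A real loop would compute a policy update from `rollouts` at this point.
    return rollouts, mean_reward


rollouts, mean_reward = training_cycle(range(200))
print(len(rollouts), "rollouts collected, mean reward", mean_reward)
```

In this shape, `max_concurrent` is the knob that model throughput controls: a policy that keeps scaling at 100 concurrent requests lets the collection step finish sooner, and the update step cannot start until it does.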
The Broader Framing
The computer use agent space has been defined largely by accuracy benchmarks and demonstrations. Single-session task completion rates are what get reported and compared. The research focus has been on getting the right answer: navigate to the correct page, fill out the correct form, extract the correct information from a cluttered interface.
Holotron-12B’s emphasis on concurrent throughput and production-scale data generation represents a different framing of what maturity in this space looks like. If computer use agents are going to become infrastructure rather than demos, running continuously to automate knowledge work at scale, then the throughput properties of the model matter alongside per-task accuracy. An agent that completes 80% of tasks but can process twice as many tasks per GPU-hour has a very different cost structure than one that completes 85% of tasks at half the throughput.
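The cost comparison in that last sentence can be made explicit. This sketch uses the success rates and the 2x throughput ratio from the text; the GPU price and the absolute tasks-per-hour figures are assumptions for illustration.

```python
def cost_per_success(gpu_hour_cost, tasks_per_gpu_hour, success_rate):
    """GPU cost attributed to each successfully completed task."""
    return gpu_hour_cost / (tasks_per_gpu_hour * success_rate)


GPU_HOUR_COST = 2.00  # assumed price per GPU-hour, in dollars

# Twice the throughput at 80% success versus half the throughput at 85%.
fast = cost_per_success(GPU_HOUR_COST, tasks_per_gpu_hour=200, success_rate=0.80)
accurate = cost_per_success(GPU_HOUR_COST, tasks_per_gpu_hour=100, success_rate=0.85)

print(f"high-throughput agent: ${fast:.4f} per successful task")
print(f"high-accuracy agent:   ${accurate:.4f} per successful task")
print(f"cost ratio: {accurate / fast:.2f}x")
```

The ratio is independent of the assumed price and absolute task counts. It does ignore any downstream cost of failed trajectories, such as filtering or human review, which would narrow the gap in settings where failures are expensive.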
NVIDIA has announced a next-generation Nemotron architecture combining further improvements to the hybrid SSM-attention design with mixture-of-experts routing, and H Company has indicated plans to post-train computer use models on that foundation. The pattern suggests a compounding advantage: each generation of the base model improves the fundamental memory efficiency of the SSM layers, and the post-training pipeline converts that architectural efficiency into deployed agent capability at scale. Whether the rest of the field converges on similar architecture choices will depend on how production computer use workloads actually develop, but the case being made here is coherent and the numbers behind it are concrete.