The Throughput Problem in Computer Use Agents, and How Holotron-12B Approaches It
Computer use agents have a throughput problem that most benchmark discussions ignore. The capability question (can a model correctly identify and click a UI element, navigate a web form, or coordinate a multi-step desktop workflow?) gets most of the attention. The infrastructure question (how many concurrent agent sessions can you run on a given GPU budget?) almost never does. Holotron-12B from H Company is interesting precisely because it treats throughput as a design constraint from the start, not an afterthought.
What Holotron-12B Actually Is
Holotron-12B is a fine-tuned multimodal model built on top of NVIDIA’s Nemotron-Nano-12B-v2-VL, released on March 16, 2026. H Company took that base model, applied supervised fine-tuning on a proprietary data mixture focused on screen understanding, UI grounding, localization, and navigation, and trained for approximately 14 billion tokens. The result is a model that scores 80.5% on the WebVoyager benchmark, up from 35.1% on the base Nemotron model, and achieves 8.9k tokens per second at 100 concurrent workers on a single H100 GPU running vLLM v0.14.1.
The comparison point H Company provides is their previous model, Holo2-8B, which tops out at around 5.1k tokens per second before throughput plateaus. The gain is substantial, especially given that Holotron-12B is larger.
The Architecture Decision That Drives the Throughput
The throughput story is inseparable from the architecture. Nemotron-Nano-12B-v2-VL is not a standard transformer. It uses a hybrid design combining State Space Models with attention layers, an approach rooted in the Mamba architecture from Albert Gu and Tri Dao. The installation dependencies for Holotron-12B make this explicit:
```shell
pip install torch "transformers>4.53,<4.54" causal_conv1d timm "mamba-ssm==2.2.5" accelerate open_clip_torch numpy pillow
```
causal_conv1d and mamba-ssm are the Mamba-specific packages: they provide the fused causal-convolution and selective-scan kernels the SSM layers run on. These are not incidental dependencies.
The practical consequence of SSMs concerns the KV cache. In a standard transformer, the key-value cache grows linearly with sequence length. A long agent session with many screenshots, navigation history, and accumulated context will fill VRAM quickly, which directly limits how many concurrent sessions you can run on a fixed GPU. More sessions means more VRAM for KV caches, which creates a hard ceiling on concurrency unless you start doing aggressive quantization or offloading.
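A back-of-envelope calculation makes the scaling concrete. The layer count, head dimensions, and precision below are illustrative assumptions, not Nemotron's actual configuration:

```python
def kv_cache_bytes(seq_len, n_layers=40, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Per-session KV cache for a standard transformer: two tensors (K and V)
    per layer, each of shape (seq_len, n_kv_heads, head_dim), in fp16/bf16.
    All dimensions here are assumed for illustration."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * dtype_bytes

# A long agent session: ~50 screenshots at ~2k vision tokens each (assumed)
session_tokens = 50 * 2000
per_session_gb = kv_cache_bytes(session_tokens) / 1e9
print(f"{per_session_gb:.1f} GB per session")   # 16.4 GB per session
# 100 concurrent sessions of that shape would need well over a terabyte of
# cache -- far beyond a single H100's 80 GB without eviction or offloading.
```

Real servers mitigate this with paged attention, prefix sharing, and quantized caches, but the linear growth per session remains; the ceiling moves, it does not disappear.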
SSMs sidestep this by maintaining a fixed-size recurrent state per layer. The state is constant regardless of how many tokens have been processed. When you process a new token, you update the state; you do not append to a growing cache. The computation is linear in sequence length rather than quadratic, and the memory footprint per active session stays bounded.
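A toy linear SSM shows the shape of the trick. This is a minimal sketch of the recurrence, not Mamba's selective scan, and all dimensions are made up:

```python
import numpy as np

def ssm_step(state, x, A, B, C):
    """One recurrent update: the hidden state has a fixed size no matter
    how many tokens have already been processed."""
    state = A @ state + B @ x   # fold the new token into the state
    y = C @ state               # emit this token's output
    return state, y

d_state, d_model = 16, 4        # illustrative sizes
rng = np.random.default_rng(0)
A = 0.9 * np.eye(d_state)       # simple decay dynamics (toy choice)
B = rng.standard_normal((d_state, d_model)) * 0.1
C = rng.standard_normal((d_model, d_state)) * 0.1

state = np.zeros(d_state)
for token in rng.standard_normal((10_000, d_model)):  # 10k tokens in...
    state, y = ssm_step(state, token, A, B, C)
print(state.shape)   # (16,) -- memory stays constant, unlike a growing KV cache
```

The contrast with the transformer calculation above is the whole point: per-session memory here is O(1) in sequence length, so concurrency is bounded by compute rather than by cache VRAM.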
For short single-turn interactions this difference is minimal. For agentic workloads where a session might span dozens of UI interactions across hundreds of high-resolution screenshots, the difference becomes structurally important.
Why Concurrency Matters So Much for Agentic Systems
The benchmark that H Company uses to measure throughput is WebVoyager running at 100 concurrent workers. That choice of benchmark is telling. WebVoyager is a real-world web navigation evaluation that involves actual browser interactions, not synthetic question-answering, and running 100 agents simultaneously represents a plausible production workload for any company deploying computer use at scale.
But H Company’s stated use cases for Holotron-12B go further than production serving. They specifically call out data generation, annotation, and online reinforcement learning as target workloads. Online RL is the key one. Training a computer use agent using reinforcement learning from actual environment interactions requires generating large volumes of rollouts, essentially running the agent many times across diverse tasks and collecting the results as training signal. The throughput of your inference infrastructure directly determines how fast you can generate that training data and how quickly your RL loop can iterate.
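Structurally, rollout collection is a bounded-concurrency fan-out over tasks. The sketch below simulates the agent loop; in a real setup `run_episode` would call an inference endpoint (for example a vLLM server) at each step, and the server's throughput determines how fast the pool drains:

```python
import asyncio, random

async def run_episode(task_id: str) -> dict:
    """Placeholder for one agent rollout: observe screen, choose action, repeat.
    The inference call is simulated here with a no-op await."""
    steps = random.randint(3, 8)
    for _ in range(steps):
        await asyncio.sleep(0)   # stand-in for a model inference call
    return {"task": task_id, "steps": steps, "success": random.random() < 0.8}

async def collect_rollouts(tasks, concurrency=100):
    """Run episodes with at most `concurrency` in flight, mirroring the
    100-worker setting used in the WebVoyager throughput measurement."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(t):
        async with sem:
            return await run_episode(t)

    return await asyncio.gather(*(bounded(t) for t in tasks))

results = asyncio.run(collect_rollouts([f"task-{i}" for i in range(500)]))
print(len(results))   # 500
```

The results list is exactly the raw material an online RL loop consumes as training signal, which is why tokens-per-second at high concurrency translates directly into iteration speed.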
This creates a compounding dynamic: a model that achieves higher throughput can generate more RL rollouts per unit time, which means faster iteration on training, which potentially means better models sooner. Capability and infrastructure are not separate concerns; they feed each other.
WebVoyager 80.5%: Context for the Number
The WebVoyager benchmark tests an agent’s ability to complete real web navigation tasks using a live browser. The original WebVoyager paper reported GPT-4V achieving around 55.7% with human evaluation, and various subsequent models have pushed that number higher. An 80.5% score from a 12B parameter model fine-tuned for roughly 14 billion tokens on proprietary data represents a meaningful position in that landscape, though direct comparisons are complicated by differences in evaluation protocols.
What the number does confirm is that the base Nemotron model’s 35.1% score is genuinely transformed by H Company’s fine-tuning. The grounding and localization work, teaching the model to accurately identify and interact with specific UI elements rather than just describe them, accounts for most of that gap. This is consistent with what other computer use model developers have found: raw vision-language capability transfers poorly to the pixel-level grounding required for reliable UI interaction.
The Hybrid SSM Trade-offs
SSMs are not without costs. The recurrent state is a fixed bottleneck: information that doesn’t fit within the state is lost, and the model cannot attend back to arbitrary earlier positions the way a full attention mechanism can. Hybrid architectures like the one in Nemotron-Nano try to get the best of both by interspersing attention layers with SSM layers, preserving some direct access to long-range context while capturing the memory and compute advantages of the recurrent approach for most of the processing.
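The interleaving can be pictured as a layer schedule. The 1-in-7 ratio below is purely illustrative, not Nemotron-Nano's actual layout:

```python
def hybrid_schedule(n_layers=28, attn_every=7):
    """Toy hybrid stack: mostly SSM (Mamba-style) blocks, with a periodic
    full-attention layer that restores direct access to arbitrary earlier
    positions. Both numbers are assumptions for illustration."""
    return ["attn" if (i + 1) % attn_every == 0 else "ssm"
            for i in range(n_layers)]

schedule = hybrid_schedule()
print(schedule.count("ssm"), schedule.count("attn"))   # 24 4
```

Because the handful of attention layers still keep per-session KV caches, a hybrid model's memory footprint grows with sequence length, just far more slowly than a pure transformer's.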
In practice, for computer use specifically, this trade-off seems favorable. A UI agent primarily needs to understand the current screen state and maintain a compact representation of recent navigation history. It does not typically need to cross-reference content from fifty steps ago with precision. The recurrent state is a reasonable fit for that access pattern.
H Company’s roadmap mentions post-training on the Nemotron 3 Omni architecture with higher-resolution vision training as the next step. Omni architectures in this context typically refer to models that handle multiple modalities, including audio and structured data, in addition to vision and text, which would extend the agent’s ability to interact with a broader range of interfaces.
What This Points Toward
The broader trend Holotron-12B represents is the specialization of model architecture for agent-specific constraints. General-purpose VLMs optimized for visual question answering or document understanding carry assumptions that don’t fit agentic workloads well: long context with many high-resolution images, high concurrency for parallel rollouts, and tight latency requirements for interactive sessions.
Building on top of a hybrid SSM base rather than a standard transformer is one way to address those constraints at the architecture level rather than through inference-time engineering alone. Whether SSMs ultimately dominate this space or whether transformer-based models close the gap through better KV cache management, quantization, and hardware-aware optimization is not settled. But the fact that a serious computer use developer is choosing a hybrid SSM foundation says something about where the production pressures in this space are pointing.
The model is available on Hugging Face under the NVIDIA Open Model License, and the specific dependency pinning (mamba-ssm==2.2.5, transformers>4.53,<4.54) suggests the stack is still maturing. Anyone running this in production should expect some operational overhead around dependency management that would not exist with a standard transformer-based VLM.