High Concurrency Computer Use: The Architecture Decision Behind Holotron-12B
Source: huggingface
The computer use agent benchmarks everyone publishes measure task success rate: how often the model clicked the right button, filled the right form, completed the workflow. That’s a legitimate metric for evaluating capability, but it tells you nothing about whether you can run these models in production at any meaningful scale.
H Company’s Holotron-12B makes throughput the lead story, and that framing is worth taking seriously.
What Makes Computer Use Inference Expensive
A standard LLM inference session has a relatively predictable memory profile. You load the model weights, allocate a KV cache proportional to your sequence length, and generate tokens. For chat applications, sequences stay short and the arithmetic is manageable.
Computer use agents break that model. A single task might involve 20 to 50 steps, and each step involves a screenshot. High-resolution screenshots converted to image tokens consume substantial context. By step 30, you’re managing a sequence that includes dozens of images alongside the full text history of actions and observations. The KV cache, which in standard transformer architectures grows linearly with sequence length, balloons accordingly.
When you’re running 100 concurrent sessions for a data generation pipeline or an online reinforcement learning loop, that memory pressure compounds. You can’t pack as many requests into a single GPU because each request is consuming VRAM proportional to its context length. Batch sizes shrink, throughput drops, and the GPU sits underutilized relative to its actual compute capacity.
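The growth described above can be put into rough numbers. The sketch below uses the standard per-sequence KV cache estimate for a transformer (keys plus values, per layer, per KV head, per token). The layer count, head shape, and token counts per step are hypothetical values picked for illustration, not figures for Holotron-12B or Holo2-8B.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one sequence holds in a standard transformer.

    The factor 2 covers the separate key and value tensors in each layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len


# Hypothetical model shape and token counts, chosen only for illustration.
N_LAYERS = 40
N_KV_HEADS = 8
HEAD_DIM = 128
TOKENS_PER_SCREENSHOT = 1000
TEXT_TOKENS_PER_STEP = 100


def session_kv_gib(steps):
    """KV cache in GiB after `steps` agent steps, one screenshot per step."""
    seq_len = steps * (TOKENS_PER_SCREENSHOT + TEXT_TOKENS_PER_STEP)
    return kv_cache_bytes(seq_len, N_LAYERS, N_KV_HEADS, HEAD_DIM) / 2**30


if __name__ == "__main__":
    for steps in (5, 30, 50):
        print(f"{steps:>2} steps: {session_kv_gib(steps):.2f} GiB per session")
```

Whatever the real constants are, the estimate is linear in sequence length, so a session at step 50 holds ten times the cache of a session at step 5.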
This is the specific problem Holotron-12B is built to address.
The Architecture Decision
Holotron-12B is post-trained on NVIDIA’s Nemotron-Nano-12B-v2-VL, which uses a hybrid State Space Model (SSM) and attention architecture. SSMs have been generating research interest since the Mamba paper demonstrated that recurrent sequence models could match transformer perplexity while offering fundamentally different inference characteristics.
The core difference is memory. Transformers maintain a KV cache that grows with sequence length. SSM layers maintain a fixed-size state, constant regardless of how long the sequence becomes. For computer use specifically, that constant state changes the operational arithmetic. In the SSM layers, a 50-step task with 50 screenshots requires no more state memory per request than a 5-step task: each new observation updates the fixed recurrent state rather than appending to an ever-growing cache.
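A toy recurrence makes the contrast concrete. This is a minimal sketch of a diagonal linear state update, not the Mamba-style layer used in Nemotron-Nano. The state dimension and the `decay` and `gain` values are made up. The only point is that the state keeps its size while an attention-style cache gains one entry per input.

```python
def ssm_step(state, x, decay, gain):
    """One step of a toy diagonal linear recurrence: h <- decay * h + gain * x."""
    return [a * h + b * x for h, a, b in zip(state, decay, gain)]


def run(inputs, state_dim=4):
    # Toy parameters, not taken from any real model.
    decay = [0.9] * state_dim
    gain = [0.1] * state_dim
    state = [0.0] * state_dim  # recurrent state: size never changes
    cache = []                 # attention-style cache: one entry per input
    for x in inputs:
        state = ssm_step(state, x, decay, gain)
        cache.append(x)
    return state, cache


if __name__ == "__main__":
    for n in (5, 50, 500):
        state, cache = run([1.0] * n)
        print(f"{n:>3} inputs -> state size {len(state)}, cache size {len(cache)}")
```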
The hybrid approach in Nemotron-Nano interleaves SSM layers with a smaller number of attention layers. Pure SSMs can struggle with tasks requiring precise recall of specific tokens seen earlier in the sequence, and the attention layers address that weakness. Because those attention layers still keep a KV cache, you trade away some of the memory efficiency of a pure SSM, but you retain most of it while preserving the model’s ability to look back precisely when needed. For production agent workloads, that’s a reasonable tradeoff.
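The tradeoff can be sketched as a per-request memory estimate with two terms: a KV cache term that grows with sequence length for the attention layers, and a constant term for the SSM layers. The layer split, the bytes per token, and the 1 MiB state size below are hypothetical placeholders, not the actual Nemotron-Nano configuration.

```python
def per_request_bytes(seq_len, n_attn_layers, n_ssm_layers,
                      kv_bytes_per_token_per_layer, ssm_state_bytes_per_layer):
    """Per-request inference memory for a stack mixing attention and SSM layers."""
    attn = n_attn_layers * kv_bytes_per_token_per_layer * seq_len  # grows with length
    ssm = n_ssm_layers * ssm_state_bytes_per_layer                 # constant
    return attn + ssm


# Hypothetical shapes for illustration only.
KV_PER_TOKEN = 2 * 8 * 128 * 2  # keys + values, 8 KV heads, head dim 128, 2-byte dtype
SSM_STATE = 1 << 20             # assume 1 MiB of recurrent state per SSM layer


def pure_transformer(seq_len):
    return per_request_bytes(seq_len, 40, 0, KV_PER_TOKEN, SSM_STATE)


def hybrid(seq_len):
    return per_request_bytes(seq_len, 4, 36, KV_PER_TOKEN, SSM_STATE)


if __name__ == "__main__":
    for seq_len in (1_000, 10_000, 55_000):
        ratio = pure_transformer(seq_len) / hybrid(seq_len)
        print(f"{seq_len:>6} tokens: hybrid uses 1/{ratio:.1f} of the memory")
```

Under these assumptions the hybrid still grows with sequence length, just at a tenth of the rate, and at very short sequences the fixed SSM state can cost more than the cache it replaces.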
The Numbers
H Company benchmarks Holotron-12B on a single H100 GPU using vLLM v0.14.1 with SSM-specific optimizations. Peak throughput reaches 8.9k tokens per second, compared to 5.1k tokens per second for their previous Holo2-8B model.
The model size comparison matters here. Holo2-8B is the smaller model; Holotron-12B has roughly 50% more parameters. Under normal circumstances you’d expect throughput to drop as model size increases, since larger models require more compute per forward pass. A roughly 1.75x throughput improvement despite the size increase points to the efficiency of the hybrid SSM architecture rather than to extra hardware headroom.
The concurrency scaling behavior matters as much as the peak figure. Throughput scales linearly with concurrency up to 100 workers. Holo2-8B plateaus before that point. Linear scaling means the model’s memory footprint isn’t creating a hard ceiling at moderate concurrency levels, which is essential when running large parallel batches for trajectory generation or RL rollouts.
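One way to see why per-session memory sets the ceiling is a toy capacity model: count how many sessions fit in the VRAM left after the weights, then cap throughput at that number of resident sessions. Everything here is an assumption for illustration: the 80 GiB card, the 24 GiB of weights, the per-session footprints, and the per-session token rate. The model also ignores compute saturation and scheduler behavior, so it is not a description of how vLLM schedules requests.

```python
GIB = 2**30


def max_sessions(vram_bytes, weight_bytes, per_session_bytes):
    """How many sessions fit in the memory left after loading the weights."""
    free = vram_bytes - weight_bytes
    return max(0, free // per_session_bytes)


def modelled_throughput(concurrency, resident_limit, tokens_per_sec_per_session):
    """Toy model: throughput rises with concurrency until memory caps residency."""
    return min(concurrency, resident_limit) * tokens_per_sec_per_session


if __name__ == "__main__":
    # Hypothetical footprints: a small mostly-constant state vs. a large KV cache.
    small = max_sessions(80 * GIB, 24 * GIB, GIB // 2)
    large = max_sessions(80 * GIB, 24 * GIB, 4 * GIB)
    for workers in (10, 50, 100):
        print(workers,
              modelled_throughput(workers, small, 50),
              modelled_throughput(workers, large, 50))
```

In this toy setup the large-footprint configuration stops gaining throughput once its resident limit is reached, while the small-footprint one keeps scaling through 100 workers.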
On task capability, Holotron-12B reaches 80.5% on WebVoyager, a benchmark covering real-world web navigation across live websites. The base Nemotron model scores 35.1% on the same benchmark before post-training. Closing that gap required approximately 14 billion tokens of two-stage supervised fine-tuning on H Company’s proprietary localization and navigation data, focused on screen understanding, UI grounding, and interaction sequencing.
Why Throughput Is the Right Target Here
The groups evaluating computer use models on task benchmarks mostly run individual sessions to measure success rate. The groups building infrastructure (annotation pipelines, synthetic trajectory systems, online RL environments) need to run thousands of parallel sessions continuously.
For those workloads, the SSM memory profile provides a concrete operational advantage. Fitting more concurrent sessions per GPU directly reduces cost per task at scale. H Company explicitly lists data generation, annotation, and online RL as primary target applications. These aren’t edge cases; they’re the core use cases for organizations building computer use agents, because generating trajectory data at scale is how you train the next version of the model.
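The cost argument reduces to simple arithmetic. The sketch below takes the two peak throughput figures reported above (8.9k and 5.1k tokens per second on one H100) and converts them to cost per task. The GPU price and the tokens per task are hypothetical, and the calculation assumes both models need the same number of tokens per task and that the reported rate is sustained, neither of which the source states.

```python
def cost_per_task(gpu_dollars_per_hour, tokens_per_second, tokens_per_task):
    """GPU cost of one task at a sustained aggregate token rate."""
    tasks_per_hour = tokens_per_second * 3600 / tokens_per_task
    return gpu_dollars_per_hour / tasks_per_hour


# Hypothetical inputs for illustration only.
GPU_PRICE = 3.00          # dollars per GPU-hour
TOKENS_PER_TASK = 50_000  # tokens processed per completed task

if __name__ == "__main__":
    holotron = cost_per_task(GPU_PRICE, 8_900, TOKENS_PER_TASK)
    holo2 = cost_per_task(GPU_PRICE, 5_100, TOKENS_PER_TASK)
    print(f"Holotron-12B: ${holotron:.4f} per task")
    print(f"Holo2-8B:     ${holo2:.4f} per task")
    print(f"cost ratio:   {holotron / holo2:.2f}")
```

The price and task length cancel out of the ratio, so under these assumptions the relative cost depends only on the throughput figures.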
There’s a compounding dynamic in this. Higher inference throughput means more training data generated per unit of compute. More training data enables a better model in the next training cycle. H Company notes that Holotron-12B’s successor is already being post-trained on NVIDIA’s Nemotron 3 Omni architecture, which adds Mixture-of-Experts efficiency on top of the SSM memory benefits. If that pattern holds, each generation of the model funds the data generation capacity for the next.
Real Tradeoffs
SSMs aren’t uniformly better than transformers. The fixed-size state is a lossy compression of the input history. For tasks requiring exact recall of specific tokens seen many steps earlier, the SSM state may not preserve that information reliably. The attention layers in the hybrid mitigate this, but the tradeoff doesn’t disappear entirely.
Long-horizon computer use tasks can have dependencies that span many steps. If the model needs to recall at step 47 an exact value entered at step 3, a pure SSM is more likely to have lost it than a transformer that still holds the full KV cache. How much this affects real task performance depends heavily on the task distribution. The WebVoyager score suggests it isn’t prohibitive for general web navigation. Enterprise workflows with high-precision recall requirements would need direct evaluation rather than reliance on general benchmarks.
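A direct evaluation of that kind can start as a small recall probe: plant an exact value early in a synthetic history, ask for it at the end, and score exact matches. The harness below is a sketch under stated assumptions. The two agents are stubs, not real models, and `recent_window_agent` is a crude stand-in for lossy memory rather than a model of how an SSM forgets. In practice the stub would be replaced with a call to the model under test.

```python
import random


def make_trial(rng, n_steps=47, plant_step=3):
    """Build a synthetic action history with one exact value planted early."""
    value = f"{rng.randrange(10**8):08d}"
    history = [f"step {i}: clicked element {rng.randrange(100)}"
               for i in range(1, n_steps + 1)]
    history[plant_step - 1] = f"step {plant_step}: entered order id {value}"
    return history, value


def full_history_agent(history):
    """Stub that scans every step, standing in for exact recall."""
    for line in history:
        if "entered order id" in line:
            return line.rsplit(" ", 1)[-1]
    return ""


def recent_window_agent(history, window=10):
    """Stub that only sees the last `window` steps."""
    return full_history_agent(history[-window:])


def recall_accuracy(agent, n_trials=100, seed=0):
    """Fraction of trials in which the agent returns the planted value exactly."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        history, value = make_trial(rng)
        hits += agent(history) == value
    return hits / n_trials


if __name__ == "__main__":
    print("full history:", recall_accuracy(full_history_agent))
    print("recent window:", recall_accuracy(recent_window_agent))
```

Varying `plant_step` and `n_steps` gives a recall-versus-distance curve, which says more about a specific workflow than a general benchmark score.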
The infrastructure dependency is also worth noting. SSM models have historically required custom CUDA kernels and non-standard inference tooling. vLLM’s growing native SSM support simplifies deployment, but teams with existing inference pipelines built around transformer assumptions should expect integration work before capturing the throughput gains.
Where This Fits in the Ecosystem
Anthropic’s Claude computer use and OpenAI’s analogous capabilities attract more attention as consumer-facing products, but the open model ecosystem needs efficient options for the research and infrastructure layer, where the economics of scale matter directly and model licensing determines what you can build on.
Holotron-12B is a specific architectural bet: that a hybrid SSM architecture is the right foundation for high-concurrency agent inference. The throughput numbers on a single H100 support that bet for the workload profile H Company is targeting. Whether the advantage holds as tasks grow longer-horizon, require more precise memory retrieval across many steps, or move into specialized domains will determine how much value this approach delivers beyond current benchmarks.
The model is available on Hugging Face under the NVIDIA Open Model License.