
Throughput as a First-Class Concern: What Holotron-12B Gets Right About Computer Use Agents

Source: huggingface

The throughput problem in computer-use agents is different from the throughput problem in chat applications. In a chat system, latency dominates: users wait for each response, and slow generation is immediately felt. In a computer-use agent, the model may issue dozens or hundreds of actions in a single session, often with many parallel workers running simultaneously for data collection, RL rollouts, or batch annotation jobs. What matters in that context is how many agent-steps per second you can push through a given piece of hardware.

This is the framing behind Holotron-12B, released by H Company on March 17, 2026. The model is a 12B-parameter multimodal vision-language model built for computer-use workloads, post-trained on top of NVIDIA’s Nemotron-Nano-12B-v2-VL-BF16. The headline number is straightforward: roughly 8,900 tokens per second on a single H100 at concurrency=100 using vLLM v0.14.1, compared to around 5,100 for the previous Holo2-8B. That is roughly 1.75x the throughput from a model that is 50% larger in parameter count.

Why the Architecture Explains the Throughput

The throughput gain does not come from better hardware or more aggressive batching. It comes from the base model’s architecture. Nemotron-Nano is built on a hybrid State-Space Model (SSM) and attention design, drawing from the same lineage as Mamba and related work on linear-complexity sequence models.

Pure transformer attention has quadratic complexity with respect to sequence length: processing N tokens requires O(N²) computation and O(N) KV cache memory. For a computer-use agent processing high-resolution screenshots with long action histories, sequence lengths can be substantial. A single annotated screenshot, tokenized at the resolutions common in recent VLMs, can consume thousands of tokens before any prior context is added.
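To make that concrete, here is a back-of-envelope KV cache estimate for a pure-attention decoder. The layer count, head counts, and dimensions below are illustrative assumptions for a model in this size class, not Holotron's actual configuration:

```python
# Per-request KV cache for a pure-attention decoder: two tensors (K and V)
# per layer, each sized [n_kv_heads, seq_len, head_dim]. Sizes illustrative.

def kv_cache_bytes(seq_len, n_layers=40, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

# One screenshot-heavy request: ~4,000 tokens of context.
per_request = kv_cache_bytes(4_000)
print(f"{per_request / 2**20:.0f} MiB per request")               # 625 MiB
print(f"{100 * per_request / 2**30:.1f} GiB at concurrency=100")  # 61.0 GiB
```

At these illustrative sizes, the cache alone approaches an H100's 80 GB at concurrency=100 before weights are even counted, which is exactly why KV memory, not compute, tends to cap concurrency.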

The SSM layers replace much of this with a recurrent state that has constant memory per layer regardless of sequence length. Each SSM layer maintains a fixed-size hidden state, rather than growing a KV cache proportional to the amount of context processed. The practical effect on throughput is significant: at high concurrency, memory bandwidth and KV cache size often become the bottleneck before compute does. A model that eliminates most of the KV cache can serve more concurrent requests on the same hardware.
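The contrast can be sketched in a few lines; again, all dimensions here are illustrative assumptions rather than the model's real configuration:

```python
# Attention KV memory grows linearly with context; an SSM layer's recurrent
# state is a fixed-size tensor no matter how much context it has consumed.

def attn_kv_bytes_per_layer(seq_len, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * n_kv_heads * head_dim * dtype_bytes * seq_len   # O(seq_len)

def ssm_state_bytes_per_layer(d_inner=4096, d_state=128, dtype_bytes=2):
    return d_inner * d_state * dtype_bytes                     # O(1)

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: attn {attn_kv_bytes_per_layer(n):>11,} B   "
          f"ssm {ssm_state_bytes_per_layer():>9,} B")
```

Run over a long agent session, the attention columns keep growing while the SSM column stays flat; that flat column is the memory headroom that gets converted into extra concurrent requests.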

The installation requirements reveal exactly what is happening under the hood:

```shell
pip install torch "transformers>4.53,<4.54" causal_conv1d timm "mamba-ssm==2.2.5" accelerate open_clip_torch numpy pillow
```

The mamba-ssm==2.2.5 and causal_conv1d dependencies are the SSM infrastructure. These are CUDA-level custom kernels, not pure PyTorch. The Mamba paper’s selective scan is implemented in optimized CUDA to maintain efficiency; without those kernels, the SSM layers would be slower than attention in practice. This matters for deployment planning: the model requires a CUDA-capable GPU, and the specific kernel versions are not interchangeable.
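A minimal preflight check, using only the standard library, can confirm those packages are importable before committing to a long model load. The package list mirrors the install command above:

```python
import importlib.util

def has_module(name: str) -> bool:
    """True if the package is importable in the current environment."""
    return importlib.util.find_spec(name) is not None

# Import names for the SSM kernels and VLM dependencies installed above.
required = ["mamba_ssm", "causal_conv1d", "transformers", "timm"]
missing = [m for m in required if not has_module(m)]
if missing:
    print("install before loading the model:", ", ".join(missing))
else:
    print("all SSM/VLM dependencies present")
```

This catches the common failure mode where a CPU-only or mismatched-CUDA image builds successfully but cannot import the compiled kernels at inference time.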

Training Pipeline

Holotron-12B is not trained from scratch. H Company started from Nemotron-Nano-12B-v2-VL-BF16, NVIDIA’s base multimodal model, and applied supervised fine-tuning on approximately 14 billion tokens of proprietary localization and navigation data.

The two-stage post-training approach is fairly standard in the computer-use agent space. The base VLM provides general visual reasoning and language understanding. The fine-tuning stage specializes the model for the specific perceptual and decisional demands of screen interaction: identifying clickable elements, understanding UI affordances, predicting coordinates for mouse actions, and decomposing multi-step workflows.

The localization benchmarks reflect this. On OS-World-G, GroundUI, and WebClick, Holotron-12B substantially outperforms the Nemotron base model, which is expected: Nemotron-Nano-12B-v2-VL-BF16 was not designed for screen grounding tasks. The fine-tuning data teaches the model where buttons are, how scroll indicators work, and what “close this dialog and proceed” means in the context of a UI interaction sequence.

The WebVoyager result deserves closer examination. WebVoyager is a web navigation benchmark that tests an agent’s ability to complete realistic browsing tasks: finding information, filling forms, navigating multi-page flows. The base Nemotron model scores 35.1%; Holotron-12B scores 80.5%. The gap is substantial, and it suggests the fine-tuning data is doing significant work beyond localization alone. The model appears to have internalized something about planning web navigation sequences, not just identifying where to click.

For comparison, evaluations of general desktop computer-use agents on OSWorld tend to land in a much lower range, though those benchmarks test broader and more open-ended tasks than WebVoyager’s structured web navigation. WebVoyager is a more constrained evaluation domain, which makes 80.5% more achievable, but it remains a meaningful number for the specific class of web-based workflows Holotron-12B targets.

What High Throughput Actually Enables

The use cases H Company lists for Holotron-12B are data generation and annotation, online reinforcement learning, and throughput-bound agentic workloads. From an infrastructure perspective, the first two are closely related.

Training a computer-use agent with reinforcement learning requires rolling out many trajectories. Each rollout is an agent session: the model perceives a screen state, decides on an action, the environment executes the action, and the cycle repeats. A single RL training step may require hundreds or thousands of such rollouts for stable gradient estimates. If each rollout involves 20 to 50 model calls and you are running 100 parallel workers, the model server’s throughput becomes the rate-limiting factor for the entire training pipeline.
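That bottleneck is easy to quantify. The sketch below uses the article's aggregate throughput figure; the tokens generated per agent step is an assumption, since it depends on action format and reasoning verbosity:

```python
def rollouts_per_hour(agg_tokens_per_sec, tokens_per_call, calls_per_rollout):
    """Complete rollouts per hour an inference server can sustain,
    assuming generation throughput is the only bottleneck."""
    calls_per_sec = agg_tokens_per_sec / tokens_per_call
    return 3600 * calls_per_sec / calls_per_rollout

# ~8,900 tok/s aggregate (the article's figure), an assumed ~300 generated
# tokens per agent step, 35 steps per rollout (mid-range of 20-50).
print(round(rollouts_per_hour(8_900, 300, 35)))  # 3051
```

Under the same assumptions, the predecessor's ~5,100 tok/s sustains roughly 1,750 rollouts per hour, so the throughput gain translates directly into rollouts per GPU-hour.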

At 8,900 tokens per second with 100 concurrent workers, Holotron-12B can sustain a meaningful RL training loop on a single H100. Many organizations doing computer-use RL currently operate with multi-GPU inference setups and complex scheduling infrastructure precisely because their models cannot serve rollouts fast enough from a single card. Consolidating that onto one GPU changes the economics of experimentation significantly.

The vLLM choice (v0.14.1) is worth noting. vLLM provides paged attention and continuous batching, which maximizes throughput for concurrent inference. The SSM architecture’s reduced KV cache demand makes it particularly compatible with this batching approach: when KV cache is not the bottleneck, vLLM’s dynamic scheduling can pack more requests per batch and use memory more efficiently. The two design decisions reinforce each other.
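For reference, a single-GPU deployment in this style might look like the following vLLM launch. The model identifier is a placeholder for the actual Hugging Face repo, and the flags shown are standard vLLM options, not values H Company has published:

```shell
# Serve on one GPU with up to 100 concurrent sequences, matching the
# concurrency used in the reported benchmark. Model id is a placeholder.
vllm serve Hcompany/Holotron-12B \
    --max-num-seqs 100 \
    --port 8000
```

With the SSM layers' reduced cache footprint, raising `--max-num-seqs` further is mostly a question of how much memory the remaining attention layers consume, which is worth measuring empirically on the target hardware.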

Placement in the 2026 Computer-Use Landscape

The computer-use agent model landscape has grown substantially. Anthropic, OpenAI, and Google have all shipped computer-use capabilities in their respective products. Most of these are API-only or involve proprietary weights. Holotron-12B is notable because it is openly available under the NVIDIA Open Model License, and because it is specifically optimized for the infrastructure requirements of running agents at scale rather than for single-user interaction.

H Company’s previous model, Holo2-8B, established the foundation. Holotron-12B improves on it in both benchmark performance and throughput despite the larger parameter count. The fact that the larger model achieves higher throughput is counterintuitive until you account for the SSM architecture’s memory efficiency: the same hardware can handle more concurrent requests when each request requires less KV cache, and the throughput-per-request efficiency of the SSM layers offsets the added parameters.

The NVIDIA Inception Program membership and the choice of Nemotron as a base model suggest H Company is working in close coordination with NVIDIA’s research direction. The next generation they reference, built on Nemotron 3 Omni with higher-resolution vision training and enhanced reasoning, would presumably push both accuracy and throughput further in this direction.

Deployment Considerations

The version constraint on transformers (>4.53,<4.54) warrants attention for production deployments. Pinning to a narrow range indicates the model depends on behavior that changed between versions, which creates maintenance burden over time. The model card also notes an invalid config.json; this is a minor issue but worth verifying before integrating into an inference pipeline.
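A lightweight runtime guard can catch image drift before the model loads. This sketch approximates PEP 440 comparison with plain tuples and is not a substitute for proper constraint handling (e.g. `packaging.specifiers`):

```python
def _vtuple(ver: str):
    """Leading numeric release segments of a version string, as a tuple."""
    parts = []
    for piece in ver.split("."):
        if piece.isdigit():
            parts.append(int(piece))
        else:
            break
    return tuple(parts)

def satisfies_pin(ver: str) -> bool:
    """Approximate check for the documented pin '>4.53,<4.54'."""
    return (4, 53) < _vtuple(ver) < (4, 54)

# e.g. satisfies_pin(transformers.__version__) at service startup
print(satisfies_pin("4.53.3"), satisfies_pin("4.54.0"))  # True False
```

Failing fast at startup is cheaper than debugging subtly different generation behavior after a base image silently picks up a newer transformers release.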

For teams already running vLLM-based inference infrastructure, Holotron-12B slots in without major changes to the serving stack. The Mamba-SSM custom kernels require CUDA, so CPU inference is not viable. The NVIDIA Open Model License permits commercial use for most purposes but has restrictions; reviewing it before integrating into a production product is the right move.

Takeaway

Holotron-12B is a specific architectural bet: that SSM-attention hybrid models are the right foundation for computer-use agents where throughput is the primary constraint. The benchmark results support it. The roughly 1.75x throughput gain over a smaller predecessor, combined with the large improvement on WebVoyager, makes this worth evaluating for anyone running RL training pipelines or large-scale annotation workflows on screen interaction data.

The broader architectural point is that the KV cache problem motivating SSM research is more acute for computer-use agents than for chat systems. Concurrent rollouts under RL training impose a very different workload than sequential user conversations. A model designed specifically for that workload profile, rather than adapted from one that was not, should win on that metric. Holotron-12B is currently one of the clearest examples of that design philosophy applied end-to-end.
