Building Computer Use Infrastructure: The Architecture Choices Inside Holotron-12B
Source: huggingface
Throughput as a design constraint
When computer use agents get discussed, accuracy tends to dominate the conversation: can the model click the right button, fill the right form, navigate multi-step workflows without getting stuck? That framing makes sense when the primary use case is a human occasionally delegating a task to an AI. It breaks down when you start thinking about running hundreds of sessions concurrently, generating synthetic training data at scale, or building online reinforcement learning loops.
H Company’s Holotron-12B is built around a different premise: that for computer use agents to move from demo to infrastructure, throughput is a first-class design constraint rather than a post-hoc optimization. The decision that makes this possible sits in the model’s architecture, and it is worth understanding in some detail.
Where the memory pressure comes from
A standard transformer-based vision-language model maintains a key-value (KV) cache during inference. The cache stores the key and value projections of every previous token at every attention layer, so the model does not have to recompute them at each decoding step. The problem is that this cache grows linearly with sequence length.
Computer use agent sessions are unusually long-context workloads. Each step includes a full screenshot, which is encoded as a significant number of image tokens. Add the text history of actions and observations across 20 to 50 steps, and a single session accumulates tens of thousands of tokens. At 100 concurrent sessions, the memory pressure multiplies accordingly. For standard transformers, this makes dense concurrency expensive: you either run fewer sessions per GPU or accept that much of the GPU's memory and memory bandwidth goes to holding and reading cache rather than to useful work.
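To make the scaling concrete, here is a back-of-the-envelope sketch of KV cache size for a dense transformer. The layer count, KV head count, and head dimension are hypothetical placeholders chosen for illustration, not the published configuration of any model discussed here; the 30,000-token session length is an assumed point within the "tens of thousands" range above.

```python
def kv_cache_bytes(num_tokens, num_attn_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one session.

    Every token stores one key vector and one value vector (the factor of 2)
    per KV head in every attention layer. bytes_per_value=2 corresponds to BF16.
    """
    return 2 * num_tokens * num_attn_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical dense-transformer configuration (an assumption for illustration only).
per_session = kv_cache_bytes(
    num_tokens=30_000,      # assumed session length
    num_attn_layers=40,     # hypothetical
    num_kv_heads=8,         # hypothetical (grouped-query attention)
    head_dim=128,           # hypothetical
)
per_100_sessions = 100 * per_session

print(f"{per_session / 2**30:.2f} GiB per session")
print(f"{per_100_sessions / 2**30:.1f} GiB for 100 concurrent sessions")
```

The formula is linear in both token count and session count, which is why long screenshot-heavy sessions and high concurrency compound each other.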
Holotron-12B is post-trained from NVIDIA’s Nemotron-Nano-12B-v2-VL-BF16, a hybrid model that combines State Space Model (SSM) layers with a smaller number of attention layers. The SSM layers, based on the Mamba architecture developed by Gu and Dao, maintain a fixed-size recurrent state per layer regardless of sequence length. Instead of appending to a growing KV cache, each new token updates a compressed recurrent state. For those layers, the memory consumed per session is constant whether the session is 5 steps or 50; only the attention layers still carry a cache that grows with context.
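The difference between the two memory models can be seen in a toy sketch. This is not Mamba: the recurrence below uses a fixed decay constant, whereas Mamba's parameters depend on the input, and the `state_dim`, `decay`, and stand-in token values are all invented for illustration. It only shows that a recurrent state keeps its size while a cache grows with every token.

```python
def ssm_step(state, x, decay=0.9):
    """One update of a toy linear recurrence: h <- decay * h + (1 - decay) * x."""
    return [decay * h + (1.0 - decay) * x for h in state]


def run_session(num_tokens, state_dim=4):
    """Process num_tokens and return (recurrent state size, cache length)."""
    state = [0.0] * state_dim   # fixed-size recurrent state
    kv_cache = []               # grows by one entry per token
    for t in range(num_tokens):
        x = float(t % 7)        # stand-in for a token embedding
        state = ssm_step(state, x)
        kv_cache.append((x, x)) # stand-in for that token's (key, value) pair
    return len(state), len(kv_cache)


for n in (5, 50):
    state_size, cache_len = run_session(n)
    print(f"{n} tokens: state size = {state_size}, cache entries = {cache_len}")
```

The state size is the same after 5 tokens and after 50, while the cache length tracks the token count exactly. The cost of the fixed state is that older inputs are blended together rather than stored individually.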
The practical consequence: you can run more concurrent sessions on the same hardware. H Company reports 8,900 tokens per second on a single H100 under vLLM v0.14.1, scaling linearly to 100 concurrent workers without plateauing. Their prior model, Holo2-8B, hit roughly 5,100 tokens per second and plateaued before reaching 100 workers. Holotron-12B is 50% larger by parameter count and still delivers about 74% more throughput. The improvement comes from architectural efficiency, not from having a smaller model.
The trade-off SSMs make
Pure SSMs are not uniformly better than transformers. The fixed state size means information is compressed, and compression loses detail. A session where the model needs to precisely recall a value it observed at step 3 when it is now at step 47 is harder for an SSM than for a transformer with a full KV cache. The Jamba paper from AI21 in 2024 explored this same design space and reached a similar conclusion: a hybrid approach, interleaving SSM layers with selective attention layers, captures most of the memory efficiency benefit while using the attention layers to compensate for cases where exact recall matters.
Holotron-12B uses this hybrid design. The WebVoyager benchmark, which tests end-to-end web navigation across live websites including multi-step search, form submission, and goal verification, offers empirical grounding for how the trade-off plays out in practice. The base Nemotron-Nano-12B-v2-VL scores 35.1% on WebVoyager. Holotron-12B scores 80.5% after two stages of supervised fine-tuning on H Company’s proprietary data, covering roughly 14 billion tokens focused on screen understanding, UI grounding, and navigation. GPT-4V with human evaluation scored around 55.7% on the same benchmark when the WebVoyager paper was originally published.
The caveat is that WebVoyager represents a general web navigation workload. Workflows that require precise long-range recall deserve direct evaluation before assuming the headline number transfers to them. Examples include financial form entry, where a specific value from 30 steps back must be reproduced exactly, and multi-stage processes with complex state dependencies.
Why throughput shapes what you can build
H Company is explicit about the three workloads they are optimizing for: generating synthetic trajectory data for training, running online reinforcement learning loops, and serving concurrent production sessions. Each has a different relationship to throughput.
Synthetic data generation is direct: more inference throughput means more training data per unit compute, which means faster model iteration cycles.
Online RL is more interesting. The standard RL training loop alternates between collecting experience by running the model to generate rollouts, and updating model weights from those rollouts. If inference is slow, the experience collection phase becomes the bottleneck and the training loop idles. At 8,900 tokens per second across 100 concurrent workers, the data collection side of the loop is fast enough to actually feed a training process without becoming the constraint. This changes the economics of online RL substantially, making it feasible to run rollout collection and policy updates without either side waiting on the other.
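A rough way to check which side of the loop is the constraint is to convert token throughput into rollouts per hour and compare it with what the trainer consumes. The sketch below uses the reported 8,900 tokens per second; the tokens per rollout, rollouts per update, and updates per hour are hypothetical values, and it assumes the reported throughput can be divided directly by the tokens a rollout consumes.

```python
def rollouts_per_hour(tokens_per_second, tokens_per_rollout):
    """Rollouts the inference side can complete in one hour."""
    return tokens_per_second * 3600 / tokens_per_rollout


def collection_is_bottleneck(available_per_hour, rollouts_per_update, updates_per_hour):
    """True if the trainer would consume rollouts faster than inference produces them."""
    return available_per_hour < rollouts_per_update * updates_per_hour


# 8,900 tokens/s is the figure reported for a single H100; everything else is assumed.
available = rollouts_per_hour(tokens_per_second=8_900, tokens_per_rollout=2_000)
starved = collection_is_bottleneck(
    available,
    rollouts_per_update=512,  # hypothetical batch of rollouts per policy update
    updates_per_hour=20,      # hypothetical trainer speed
)
print(f"{available:.0f} rollouts/hour available, trainer starved: {starved}")
```

Plugging in your own trajectory lengths and trainer speed shows quickly whether one GPU of inference is enough or whether the rollout fleet needs to grow.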
The production serving case is about session density. If a computer use agent session requires an H100 to run one session at a time, deployment costs are prohibitive for most applications. At 50 to 100 concurrent sessions on the same GPU, the cost structure changes in ways that open up use cases that were previously not worth pursuing.
There is also a compounding dynamic here: higher inference throughput enables more RL rollouts, more rollouts produce better models, and better models can potentially achieve higher throughput at the next generation. H Company’s roadmap mentions a Nemotron 3 Omni architecture currently in post-training, which adds Mixture-of-Experts on top of the SSM-Attention hybrid, along with higher-resolution vision training and enhanced multimodal support. Whether the efficiency compounding holds through that architecture change is an open question, but the direction of travel is clear.
The open weights angle
Anthropic’s Claude computer use and OpenAI’s computer use agent capabilities have gotten significant attention, but both route through external APIs. For enterprise deployments where screenshots of internal applications cannot leave the network, on-premise deployment is a requirement rather than a preference. Holotron-12B is released under the NVIDIA Open Model License with weights in BF16 Safetensors format, and inference is supported through vLLM, TRT-LLM, and SGLang on H100, A100, L40S, and B200 hardware.
The practical integration note: the SSM stack requires specific pinned dependencies.
pip install torch "transformers>4.53,<4.54" causal_conv1d timm "mamba-ssm==2.2.5" accelerate open_clip_torch numpy pillow
The exact pin on mamba-ssm==2.2.5 and the narrow range transformers>4.53,<4.54 indicate the inference tooling is still stabilizing around the SSM architecture. Teams with existing transformer-based inference pipelines should expect some integration work. vLLM’s native SSM support has been improving, but this is not a drop-in replacement for a standard VLM endpoint. The causal_conv1d package is specific to Mamba-style models and is a non-obvious addition to a standard serving environment.
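Because the pins are tight, it can be worth checking an environment before loading the model. The sketch below is one way to do that with the standard library only. The helper names are hypothetical, and the version parser is a deliberate simplification that handles plain numeric versions rather than the full PEP 440 grammar.

```python
from importlib import metadata


def parse_version(text):
    """Turn '4.53.2' into (4, 53, 2), padded to three parts.

    Simplified: reads leading digits of each dot-separated piece and stops at
    the first non-numeric suffix (e.g. 'rc1', '+cu12'). Not full PEP 440.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:
            break
    return (tuple(parts) + (0, 0, 0))[:3]


def transformers_ok(version_text):
    """Mirror the pin transformers>4.53,<4.54."""
    return (4, 53, 0) < parse_version(version_text) < (4, 54, 0)


def mamba_ssm_ok(version_text):
    """Mirror the pin mamba-ssm==2.2.5."""
    return parse_version(version_text) == (2, 2, 5)


def check_environment():
    """Report whether installed packages satisfy the pins; None means not installed."""
    checks = {"transformers": transformers_ok, "mamba-ssm": mamba_ssm_ok}
    report = {}
    for name, is_ok in checks.items():
        try:
            report[name] = is_ok(metadata.version(name))
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    print(check_environment())
```

Running this in a serving image's build step turns a silent version drift into an explicit failure before any weights are loaded.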
Where this fits in the broader picture
H Company has been releasing models since mid-2025, moving through the Holo1, Holo1.5, and Holo2 families with increasing scale and capability. Holotron-12B is the first time they have made the architectural rationale for throughput into the central framing of a release, rather than leading with benchmark scores and treating deployment properties as secondary.
The broader computer use space is still sorting out what the right infrastructure abstraction looks like. Proprietary APIs cover the accuracy side reasonably well for tasks that can tolerate external routing. The open infrastructure layer (models that can run at volume for data generation pipelines, training loops, and cost-efficient production serving) has lagged behind. Holotron-12B addresses that gap directly, and the SSM architecture is the structural reason it can.
For data generation and online RL workloads, near-constant memory per session is a meaningful advantage that is difficult to engineer around in a transformer-based system without resorting to aggressive context truncation or other approximations. For workflows requiring precise long-range recall, direct evaluation is worth doing before committing to the architecture. The next generation, built on Nemotron 3 Omni with MoE layers on top, will show whether the throughput advantage compounds further as the model scales, or whether the architectural trade-offs become more pronounced at higher capability levels.