
The Memory Architecture Behind Holotron-12B

Source: huggingface

Why Computer Use Breaks the Transformer Serving Model

Most serving benchmarks for large vision-language models measure single-turn or short-context performance. A user sends a prompt with an image, the model responds, and the session ends. KV cache pressure exists but stays manageable.

Computer use agents work differently. At every step of a task, the agent captures a screenshot, appends it to the context, reasons about what to do, and executes an action. The session is stateful and sequential by design. A typical desktop screenshot at reasonable resolution tiles into roughly 1,000 to 2,000 vision tokens via the model’s patch encoder. Over a 30-step task, that accumulates 30,000 to 60,000 vision tokens in context before you have counted the text.

Now scale that to 100 concurrent sessions, which is a realistic target for any production deployment or training rollout. With a standard transformer, the KV cache memory requirement scales as:

KV cache bytes = sessions × context_length × num_layers × 2 × num_kv_heads × head_dim × dtype_bytes

At 100 concurrent sessions each carrying 16K tokens, on a model with a typical layer and head configuration, you are looking at KV cache requirements that exceed 200 GB. A single H100 has 80 GB of HBM. You cannot fit that workload on one card with a pure transformer. You either shard across multiple GPUs, reduce concurrency, or accept that your serving cost is dominated by memory rather than compute.
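
To make the formula concrete, here is a minimal sketch that plugs in numbers. The article does not specify the "typical" configuration, so the 40 layers, 8 KV heads, head dimension of 128, and 2-byte (BF16) cache entries below are assumptions chosen for illustration only.

```python
import math

def kv_cache_bytes(sessions, context_length, num_layers,
                   num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache size: one key and one value vector per token, layer, and KV head."""
    return (sessions * context_length * num_layers
            * 2 * num_kv_heads * head_dim * dtype_bytes)

# Assumed "typical" transformer configuration (hypothetical, not from the article).
total_bytes = kv_cache_bytes(
    sessions=100,
    context_length=16_384,   # 16K tokens per session
    num_layers=40,
    num_kv_heads=8,
    head_dim=128,
    dtype_bytes=2,           # BF16
)

H100_HBM_BYTES = 80e9
cards_for_cache_alone = math.ceil(total_bytes / H100_HBM_BYTES)

print(f"KV cache: {total_bytes / 1e9:.1f} GB")
print(f"H100s needed for the cache alone: {cards_for_cache_alone}")
```

Under these assumptions the cache alone spans several 80 GB cards before model weights or activations are counted. Different layer counts or head configurations shift the total, but the linear dependence on sessions and context length stays.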

This is the specific problem Holotron-12B was built to address.

The SSM Memory Model

State space models like Mamba and Mamba-2 replace the attention KV cache with a fixed-size recurrent state. Each layer maintains a state vector whose size depends on the model configuration, not the sequence length. For Holotron-12B, that configuration is ssm_state_size=128, mamba_num_heads=128, mamba_head_dim=80.

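The key property is that state size does not grow with sequence length. With the configuration above, each Mamba-2 layer holds 128 × 80 × 128 ≈ 1.3 million state values per session, a few megabytes in BF16, whether the session has seen 1,000 tokens or 100,000. Running 100 concurrent sessions therefore costs a fixed, predictable amount of state memory, while the same concurrency on a pure transformer costs hundreds of gigabytes of KV cache at 16K tokens and keeps growing with every screenshot. That gap is not marginal; it is the difference between fitting the workload on one H100 and needing a multi-GPU cluster.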
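
The contrast between a fixed recurrent state and a growing KV cache can be sketched per layer and per session. The SSM numbers come from the configuration quoted above. The attention layer's 8 KV heads come from the model description, while its head dimension of 128 and the 2-byte storage for both state and cache are assumptions. The sketch also ignores the small convolution state that Mamba-style layers keep, and real kernels may store the state at higher precision.

```python
def ssm_state_bytes_per_layer(num_heads, head_dim, state_size, dtype_bytes=2):
    """Recurrent state of one Mamba-2 layer for one session (no sequence-length term)."""
    return num_heads * head_dim * state_size * dtype_bytes

def kv_bytes_per_layer(context_length, num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache of one attention layer for one session (grows with context length)."""
    return context_length * 2 * num_kv_heads * head_dim * dtype_bytes

# SSM configuration quoted in the article.
ssm_bytes = ssm_state_bytes_per_layer(num_heads=128, head_dim=80, state_size=128)

for context_length in (4_096, 16_384, 65_536):
    # head_dim=128 is an assumed value for the attention layers.
    kv_bytes = kv_bytes_per_layer(context_length, num_kv_heads=8, head_dim=128)
    print(f"context={context_length:>6}  "
          f"SSM state={ssm_bytes / 2**20:.1f} MiB  "
          f"KV cache={kv_bytes / 2**20:.1f} MiB")
```

The SSM column stays constant across rows while the KV column scales linearly with context length, which is the whole argument in miniature.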

The tradeoff is that SSMs compress history lossily. The recurrent state summarizes everything seen so far, but precise retrieval of a specific token from many steps back is unreliable with a pure SSM. For most reasoning tasks that is acceptable, but for computer use you occasionally need exact recall of a UI element or a URL seen several steps ago.

Why the Hybrid Architecture Makes Sense

Holotron-12B is built on NVIDIA Nemotron-Nano-12B-v2-VL-BF16 (the Nemotron-H architecture), which has 62 hidden layers: approximately 56 are Mamba-2 SSM layers and 6 are full attention layers. The attention layers use grouped query attention with 8 KV heads against 40 query heads, which reduces their own KV cache footprint significantly compared to multi-head attention.

The sparsely interleaved attention layers provide the precise retrieval that pure SSMs cannot reliably guarantee. The SSM layers handle the bulk of the sequence compression. The result is a model where KV cache exists only for the 6 attention layers rather than all 62, and those 6 layers already use a compressed head configuration.
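
A rough way to size that saving is to compare against a hypothetical baseline: a transformer of the same depth in which all 62 layers use attention with one KV head per query head. That baseline is an assumption for illustration, not a model the article measures. Head dimension, context length, and dtype cancel out of the ratio.

```python
def kv_cache_fraction(attn_layers, total_layers, kv_heads, query_heads):
    """Fraction of a full multi-head-attention KV cache that the hybrid still needs."""
    layer_fraction = attn_layers / total_layers   # only attention layers hold KV
    head_fraction = kv_heads / query_heads        # grouped query attention shares KV heads
    return layer_fraction, head_fraction, layer_fraction * head_fraction

# Layer and head counts as quoted in the article.
layer_fraction, head_fraction, combined = kv_cache_fraction(
    attn_layers=6, total_layers=62, kv_heads=8, query_heads=40
)

print(f"from fewer attention layers: {layer_fraction:.1%}")
print(f"from grouped query attention: {head_fraction:.1%}")
print(f"combined KV cache vs. baseline: {combined:.1%}")
```

The two reductions multiply, so the hybrid keeps only a small fraction of the baseline's per-token KV cache while still having exact-recall layers in the stack.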

This is the same design logic that AI21 used with Jamba, which demonstrated roughly 3x higher throughput than Mixtral-8x7B at long contexts. The throughput advantage of hybrid SSM architectures at high concurrency is well-established at this point; Holotron-12B applies that established advantage to a deployment problem that transformers handle poorly.

Measured Throughput

H Company measured throughput on a single H100 using vLLM v0.14.1 at 100 concurrent workers with a real WebVoyager workload: multimodal, long context, sequential screenshots.

Model | Parameters | Throughput at 100 workers
Holo2-8B (transformer) | 8B | 5,100 tokens/sec
Holotron-12B (hybrid SSM) | 12B | 8,900 tokens/sec

Holotron-12B has 50% more parameters than Holo2-8B and delivers roughly 75% higher throughput at this concurrency (8,900 vs. 5,100 tokens/sec). On a standard transformer, the larger parameter count would predict lower throughput; the SSM architecture inverts that expectation at this concurrency level.

The important caveat is where SSMs lose their advantage: prefill. Transformers process all input tokens in parallel via self-attention, so digesting a long context from scratch is fast. SSM layers carry a recurrence along the sequence; Mamba-2 evaluates it with chunked parallel scans rather than strictly one token at a time, but the fixed-size state that saves so much memory during decoding offers no comparable benefit here. For workloads where prefill dominates, the hybrid can be slower. For autoregressive generation at high concurrency, which is the actual bottleneck in serving a production computer use agent, SSMs win clearly.

The Vision Encoder

The visual grounding layer uses RADIOv2-H, a ViT trained with multi-teacher distillation from CLIP, SigLIP, DINOv2, and SAM simultaneously. The encoder supports up to 12 tiles at 512x512 pixels per tile, enabling input resolutions up to 2048x1536 pixels. That tile budget is what drives the token accumulation math: 12 tiles per screenshot, hundreds of tokens per tile, 30 or more steps per task.
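
The token accumulation math can be sketched as follows. The 512-pixel tile size and 12-tile cap come from the paragraph above; the plain grid tiling, the 1280x720 screenshot size, and the 256 tokens per tile are assumptions for illustration. The model's actual preprocessing may choose tiles differently, for example by aspect ratio or with an extra thumbnail tile.

```python
import math

TILE_SIZE = 512   # pixels per tile side, from the article
MAX_TILES = 12    # tile budget, from the article

def tiles_for(width, height, tile=TILE_SIZE, max_tiles=MAX_TILES):
    """Number of tiles for a simple grid tiling, capped at the tile budget."""
    grid = math.ceil(width / tile) * math.ceil(height / tile)
    return min(grid, max_tiles)

def vision_tokens(width, height, steps, tokens_per_tile):
    """Vision tokens accumulated in context after `steps` screenshots."""
    return tiles_for(width, height) * tokens_per_tile * steps

# Assumed values: 1280x720 screenshots and 256 tokens per tile (hypothetical).
per_screenshot = vision_tokens(1280, 720, steps=1, tokens_per_tile=256)
per_task = vision_tokens(1280, 720, steps=30, tokens_per_tile=256)

print(f"tiles at max resolution: {tiles_for(2048, 1536)}")
print(f"tokens per screenshot: {per_screenshot}")
print(f"tokens after 30 steps: {per_task}")
```

Larger screenshots hit the 12-tile cap and roughly double the per-step cost, which is why the resolution setting matters as much as the step count.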

RADIOv2-H’s multi-teacher training gives the encoder representations that generalize across the different types of visual grounding a computer use agent needs. CLIP and SigLIP provide semantic alignment with text. DINOv2 provides spatial and structural feature quality. SAM provides segmentation-quality boundary awareness. A UI agent benefits from all four: it needs to identify icons semantically, locate them spatially, and click them precisely.

The Online RL Training Angle

H Company is explicit that Holotron-12B is positioned as a policy model for three use cases: synthetic training data generation, online RL rollout collection, and high-concurrency production serving. The first two are directly related, and they are the less-discussed reason the throughput number matters.

Online reinforcement learning for agent training requires running the agent many thousands of times across tasks and collecting (state, action, reward) trajectories. Rollout throughput directly determines how fast the RL training loop iterates. If your rollout model generates 5,100 tokens/sec at 100 concurrent workers, your data collection pipeline is slower than if it generates 8,900 tokens/sec. The math compounds quickly across the number of rollouts needed to train a capable policy.

Agent RL training at scale consistently runs into the same wall: rollout throughput becomes the limiting factor before compute does. A more memory-efficient rollout model means more trajectories per GPU-hour, which means faster iteration on the policy. The architectural choice in Holotron-12B is not just about inference economics; it feeds directly into training economics.
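
A back-of-the-envelope sketch shows how the two measured throughput figures translate into rollout time. The trajectory count and tokens per trajectory are hypothetical placeholders, and the sketch assumes throughput stays at the measured rate for the whole run.

```python
def gpu_hours(total_tokens, tokens_per_sec):
    """Single-GPU hours needed to generate `total_tokens` at a steady rate."""
    return total_tokens / tokens_per_sec / 3600

# Hypothetical rollout budget: 50,000 trajectories of 20,000 generated tokens each.
NUM_TRAJECTORIES = 50_000
TOKENS_PER_TRAJECTORY = 20_000
budget_tokens = NUM_TRAJECTORIES * TOKENS_PER_TRAJECTORY

# Throughput figures measured at 100 concurrent workers, from the article.
hours_holo2 = gpu_hours(budget_tokens, tokens_per_sec=5_100)
hours_holotron = gpu_hours(budget_tokens, tokens_per_sec=8_900)

print(f"Holo2-8B:     {hours_holo2:.1f} GPU-hours")
print(f"Holotron-12B: {hours_holotron:.1f} GPU-hours")
print(f"saved:        {hours_holo2 - hours_holotron:.1f} GPU-hours per collection pass")
```

The saving repeats on every collection pass, so it scales with the number of policy iterations rather than being a one-time gain.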

This framing also clarifies why H Company is releasing this model openly. The throughput advantage makes it viable as a rollout generator for their internal training pipeline. Releasing it publicly also builds community adoption and positions it as standard infrastructure for anyone else doing computer use RL research.

What the Benchmarks Say and Don’t Say

WebVoyager performance: Holotron-12B scores 80.5% compared to the Nemotron base model at 35.1% and Holo2-8B at 80.2%. The improvement over base Nemotron is substantial, which validates that the fine-tuning on proprietary screen interaction data is doing real work across approximately 14 billion training tokens. The improvement over Holo2-8B is marginal in accuracy terms while being large in throughput terms. Holotron-12B is not a step-change in task success rate over H Company’s previous model; it is a step-change in the efficiency with which you can deploy and train at that accuracy level.

H Company did not publish OSWorld results at launch. OSWorld measures full desktop task completion, where the human baseline is 72.4% and leading models have historically scored in the 22-27% range since Claude 3.5 Sonnet’s computer use launch in late 2024. The absence of an OSWorld number at launch suggests the model’s advantage is primarily in throughput and grounding rather than end-to-end task completion on difficult desktop benchmarks. That is a reasonable trade-off for a model explicitly positioned as a policy model rather than a frontier capability demo.

Deployment Requirements

Serving Holotron-12B requires custom CUDA kernels: causal_conv1d and mamba-ssm==2.2.5. The full install:

pip install torch "transformers>4.53,<4.54" causal_conv1d timm "mamba-ssm==2.2.5" \
    accelerate open_clip_torch numpy pillow

The model serves via vLLM, TRT-LLM, and SGLang on H100, A100, L40S, and B200 hardware. The dependency on custom SSM kernels is real operational overhead compared to a standard transformer that runs on any recent version of transformers or vLLM without extras.
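
Because the kernel packages are the part most likely to fail on a fresh machine, a small preflight check before loading the model can save a confusing stack trace. The sketch below only tests that each package is importable; it does not verify versions or that the CUDA extensions were built for the installed GPU. The import names are the usual ones for the packages in the install command (`mamba_ssm` for mamba-ssm, `open_clip` for open_clip_torch, `PIL` for pillow), and the helper name is made up for this example.

```python
import importlib.util

# Import names corresponding to the packages in the install command.
REQUIRED_MODULES = [
    "torch", "transformers", "causal_conv1d", "mamba_ssm",
    "timm", "accelerate", "open_clip", "numpy", "PIL",
]

def missing_modules(names):
    """Return the subset of top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Running this in the serving image's build step surfaces a missing kernel package early, before any model weights are downloaded.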

The license is the NVIDIA Open Model License, which is more restrictive than the Apache 2.0 license that Holo2-8B used. That is a real constraint for commercial use cases and worth reading carefully before building a production dependency on it. H Company’s prior Holo2 series remains cleaner for unrestricted commercial deployment.

The Broader Pattern

The roadmap points toward NVIDIA Nemotron 3 Omni as the next base, which adds mixture-of-experts on top of the hybrid SSM-Attention design. Combining MoE’s parameter efficiency with SSM’s memory efficiency would push the throughput advantage further, assuming the training complexity is manageable.

Holotron-12B fits a trend that has been building since Mamba’s release in late 2023: SSM and hybrid architectures carving out specific deployment niches where transformer KV cache scaling is a hard constraint. Long-context serving, high-concurrency inference, and now agentic computer use are all workloads where the O(n) KV cache growth of transformers creates real infrastructure problems.

The computer use case is particularly clear-cut because the token accumulation is mechanical and predictable: screenshots, every step, for the duration of the task. You can calculate exactly how much KV cache you need before you run a single session. That predictability makes the architectural argument for SSMs straightforward to evaluate, and the throughput numbers H Company measured on a single H100 make the production case concrete.
