The Memory Argument for SSM in Computer Use Inside Holotron-12B

When H Company released Holotron-12B last week, the headline number was a throughput figure, not an accuracy score. That is an unusual ordering of priorities for a model release, and it tells you something important about how the company is thinking about where computer use agents actually get deployed.

Holotron-12B scores 80.5% on WebVoyager as a standalone agent. Their previous Holo2-8B scored 80.2%. The accuracy difference is negligible. But on a single H100 with vLLM v0.14.1 at 100 concurrent workers, Holotron-12B sustains roughly 8,900 tokens per second against Holo2-8B’s 5,100. That is a 75% throughput gain from a model that is 50% larger by parameter count, running on identical hardware.

The reason for this is architectural, and it is the more interesting story.

The KV Cache Problem at Scale

Transformer-based vision-language models accumulate a key-value cache as they process sequences. Each token in the context requires stored attention states for every layer, so the cache grows linearly with sequence length. At moderate concurrency this is manageable. At 100 concurrent computer use sessions, each potentially thousands of tokens deep with screenshots encoded at every step, the KV cache becomes the primary constraint on batch size. You cannot fit more sequences without running out of VRAM, regardless of how fast the compute itself would be.

State space models work differently. An SSM compresses the entire history of a sequence into a fixed-size recurrent state. The memory footprint per sequence does not grow with context length. This property is what makes SSMs attractive for agentic workloads: each additional concurrent session costs a constant amount of memory rather than a variable one proportional to how long the session has been running.

Holotron-12B is built on NVIDIA’s Nemotron-Nano-12B-v2-VL, which uses a hybrid architecture that interleaves SSM layers with conventional attention layers rather than replacing attention entirely. The vision encoder is CRadioV2-H. The hybrid approach preserves attention for the kinds of precise token-to-token relationships where transformers remain superior while using SSM layers to handle the long-range state accumulation that dominates agent trajectories.

Running this in practice requires mamba-ssm==2.2.5 and causal_conv1d alongside the usual transformer dependencies. The context window is 128K tokens shared across input and output, with support for up to four images at a 12-tile layout.

What H Company Has Built So Far

H Company is a French AI startup with around 34 people, including Laurent Sifre, who previously worked on AlphaGo at DeepMind. Their model lineage has moved quickly: Holo1 in June 2025 on Qwen2.5-VL, Holo1.5 in September, the Holo2 family spanning November 2025 through February 2026, and now Holotron-12B in March 2026.

Holo1/1.5 and Holo2 were all Apache 2.0. Holotron-12B ships under the NVIDIA Open Model License, which is more restrictive. That change comes with the territory of using NVIDIA’s base model.

Alongside the Holo model family, they released Surfer-H, a three-module web agent combining a policy model, a localizer, and a validator. Surfer-H using Holo1-7B with 10 attempts per task reaches 92.2% on WebVoyager at $0.13 per task, edging out a GPT-4.1 configuration at $0.54 per task. Holotron-12B as a standalone agent is below both of those numbers. It is positioned as a deployment primitive for high-concurrency workloads, not as a top-of-leaderboard single-session solver.

Where Throughput Actually Matters

The cases where you need hundreds of concurrent computer use sessions are not end-user products. They are infrastructure: synthetic data generation, UI annotation pipelines, automated quality assurance across browser configurations, and online reinforcement learning loops where the agent generates its own training signal by interacting with real environments.

Each of these requires running large batches of agent sessions continuously. For RL in particular, the throughput of the policy model directly constrains how fast you can generate rollouts, which constrains how fast you can improve the model. A 75% throughput gain translates fairly directly into a 75% reduction in wall-clock time for one full training iteration of that loop, assuming the policy model is the bottleneck.

H Company explicitly calls out this use case. They built Holotron-12B to serve as what they describe as a high-throughput workhorse for these pipelines while they continue developing higher-capability models. The SSM architecture is a production choice, not a research preference.

The Accuracy Trade-off

Holotron-12B’s 80.5% on WebVoyager is respectable but not competitive with the best agentic systems. For reference, OpenAI’s Operator scores around 87% and Google’s Project Mariner around 83.5% on the same benchmark. Surfer-H itself reaches 92.2% with multi-attempt sampling.

H Company did not publish detailed localization benchmark numbers for Holotron-12B specifically, beyond noting improvements over the Nemotron base (which scored 35.1% on WebVoyager before fine-tuning, illustrating how much the post-training contributes). Their Holo2-8B has been more thoroughly benchmarked: 58.9% on ScreenSpot-Pro, 70.1% on OSWorld-G, 83.8% on Ground-UI-1K, and 89.5% on their own WebClick benchmark.

The honest read is that Holotron-12B trades some accuracy headroom for production efficiency. For workloads where you need a good-enough computer use model at scale, this is the right trade. For workloads where you need the most capable agent for a single high-stakes session, the Holo2-235B or a Surfer-H configuration is still the better option.

The Competitive Landscape for Computer Use

The computer use agent field has organized itself into two rough tiers. There are API-based agents from Anthropic, OpenAI, and Google, which are black boxes you pay per call, and there are open-weight models you can run on your own infrastructure. The open-weight tier started thin and has filled in quickly over the past year.

H Company sits in an interesting position within the open-weight tier because they have invested in both the model (the Holo family) and the scaffolding (Surfer-H). Most open-weight computer use models assume you will bring your own agentic loop. Surfer-H provides a validated agentic framework with a documented cost per task, which is useful for teams trying to project operational costs before committing to a deployment.

The license situation is worth watching. Apache 2.0 was a deliberate permissive stance from H Company across all their prior work. Moving to the NVIDIA Open Model License for Holotron-12B limits some commercial use cases. Whether future releases continue on NVIDIA bases or return to more permissive licensing will depend partly on what model architectures end up being most useful.

H Company has announced they are continuing post-training on NVIDIA Nemotron 3 Omni, which adds a Mixture of Experts component on top of the hybrid SSM-Attention architecture. MoE reduces active parameter count per forward pass while maintaining total parameter count, which combined with the SSM constant-state property could push throughput further while also improving capability through scale. That is the expected trajectory for the next Holotron release.

What This Signals for the Field

For most computer use agent benchmarks, accuracy is the only axis that gets reported. Holotron-12B is one of the clearer examples of a production-oriented release where throughput is explicitly treated as a first-class metric.

This matters for how the field evaluates models. A model that scores 80% at 8,900 tokens per second is genuinely more useful for certain workloads than a model that scores 83% at 4,000 tokens per second, even though it would appear inferior on a standard leaderboard. The evaluation infrastructure for computer use agents has not caught up to production deployment realities, where cost per task, latency at concurrency, and memory efficiency are often more constraining than peak accuracy on a single-session benchmark.

Holotron-12B is an argument, made in hardware benchmarks rather than papers, that SSM architectures deserve serious consideration for agentic deployment. The argument is well-supported by the throughput numbers. Whether it holds as model capabilities scale further is the open question, and H Company’s roadmap suggests they intend to find out.