The Production Case for SSM-Based Computer Use Agents

Source: huggingface

Computer use agents have spent most of the last year competing on task success rate. Can the model click the right button? Can it navigate a multi-step form? The OSWorld and WebVoyager benchmarks have become the common currency for these comparisons, and most of the published work treats throughput as secondary, something you optimize once the agent is capable enough to be worth deploying.

H Company’s Holotron-12B inverts that priority. The model is a 12B parameter hybrid SSM-attention architecture built on NVIDIA’s Nemotron-Nano-12B-v2-VL-BF16 foundation, and the design choices that make it interesting are almost entirely about production-scale serving, not just benchmark scores. The throughput number they lead with, 8,900 tokens per second at 100 concurrent workers on a single H100 via vLLM 0.14.1, is not incidental. It is the point.

What Makes Computer Use Contexts Expensive

To understand why the architecture matters, it helps to think about what a computer use agent actually processes. Each step in a task involves at minimum one screenshot, often a high-resolution one. Multi-step tasks chain dozens of these images together along with the full action history. By the time an agent is halfway through booking a flight or filling out a government form, the context contains tens of thousands of tokens of visual and text data.

For a standard transformer, the cost of processing all that context grows quadratically with sequence length. More critically for deployment, the key-value cache that transformers maintain during inference grows linearly with each new token generated, across every layer, for every concurrent session. At 100 parallel agent sessions, each maintaining a long context, the KV cache alone can saturate GPU memory before you have generated a single useful action.
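The arithmetic behind that claim can be sketched in a few lines. The layer count, head count, head dimension, and context length below are hypothetical values chosen only for illustration; they are not the published dimensions of any model discussed here.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one session.

    One key vector and one value vector (hence the factor of 2) are stored
    per token, per layer, per KV head. bytes_per_value=2 assumes BF16/FP16.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical dimensions, for illustration only.
per_session = kv_cache_bytes(tokens=32_000, layers=40, kv_heads=8, head_dim=128)
all_sessions = per_session * 100  # 100 concurrent agent sessions

print(f"per session:  {per_session / 2**30:.2f} GiB")
print(f"100 sessions: {all_sessions / 2**30:.1f} GiB")
```

Under these assumed dimensions a single 32,000-token session needs roughly 4.9 GiB of cache, so 100 such sessions would need several times the 80 GB available on an H100. Real serving stacks reduce this with paging and prefix sharing, but the linear growth per token remains.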

This is the constraint that SSMs are designed to remove. A state space model maintains a fixed-size recurrent state per layer regardless of how many tokens it has already processed, so the memory footprint of its layers stays constant as context grows. Two screenshots or two hundred screenshots leave the same amount of recurrent state to carry into each generation step. In a hybrid model the remaining attention layers still keep a cache, but it is a fraction of what a full transformer would need. That property is what makes the 100-worker concurrency figure achievable.
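A toy recurrence makes the fixed-state property concrete. This is a minimal sketch of a diagonal linear state space update with constant parameters; it is not Mamba itself, whose parameters depend on the input, and the names `ssm_step` and `run` are hypothetical.

```python
def ssm_step(state, x, decay, b, c):
    """One recurrent update: h <- decay * h + b * x (elementwise), y = <c, h>."""
    new_state = [a * h + bi * x for a, h, bi in zip(decay, state, b)]
    y = sum(ci * h for ci, h in zip(c, new_state))
    return new_state, y


def run(inputs, decay, b, c):
    """Process a whole sequence, keeping only the current state between steps."""
    state = [0.0] * len(decay)
    outputs = []
    for x in inputs:
        state, y = ssm_step(state, x, decay, b, c)
        outputs.append(y)
    return state, outputs


# Toy parameters with a state of size 4 (illustrative values only).
decay = [0.9, 0.5, 0.1, 0.0]
b = [1.0, 1.0, 1.0, 1.0]
c = [0.25, 0.25, 0.25, 0.25]

short_state, _ = run([1.0] * 2, decay, b, c)
long_state, _ = run([1.0] * 200, decay, b, c)
print(len(short_state), len(long_state))  # state size is independent of sequence length
```

The state after 2 inputs and after 200 inputs has the same size; only its contents differ. That is also where the cost shows up: everything the layer remembers about the earlier inputs has to fit into those few numbers.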

The Hybrid Design

Pure SSMs trade some expressiveness for that memory efficiency. The Mamba architecture, which Holotron-12B relies on (the model requires mamba-ssm==2.2.5 and causal_conv1d as dependencies), compresses the entire prior context into a fixed-size state rather than attending over the full token sequence. Because of that compression, the model cannot, even in principle, retrieve an arbitrary past token with full precision the way a transformer can.

H Company addresses this by keeping the model hybrid: some transformer attention layers remain, providing precise retrieval where it matters most, while SSM layers handle the bulk of the sequence processing with linear complexity. This is the same design philosophy behind models like Jamba from AI21 Labs and the Zamba series. The bet is that for most of what a computer use agent needs to do, fixed-state recurrence is sufficient, and the memory savings at inference time are worth the loss in expressiveness.

Throughput Numbers in Context

The benchmark H Company publishes is throughput under load. At 100 concurrent workers on a single H100, Holotron-12B sustains 8,900 tokens per second. Their prior model, Holo2-8B, reaches 5,100 tokens per second at the same concurrency level and then plateaus. Holotron-12B continues scaling past that point.
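As a rough way to read those aggregate figures, the sketch below divides each by the worker count. It assumes throughput is shared evenly across the 100 workers, which the published numbers do not state; the helper name `per_worker` is hypothetical.

```python
def per_worker(total_tokens_per_s, workers):
    """Average tokens per second available to each worker, assuming an even split."""
    return total_tokens_per_s / workers


WORKERS = 100
holotron_total = 8_900  # tokens/s, Holotron-12B at 100 workers (from the article)
holo2_total = 5_100     # tokens/s, Holo2-8B at 100 workers (from the article)

holotron_each = per_worker(holotron_total, WORKERS)
holo2_each = per_worker(holo2_total, WORKERS)
speedup = holotron_total / holo2_total

print(f"Holotron-12B: {holotron_each:.0f} tokens/s per worker")
print(f"Holo2-8B:     {holo2_each:.0f} tokens/s per worker")
print(f"aggregate speedup: {speedup:.2f}x")
```

On that even-split assumption, each session sees about 89 tokens per second on Holotron-12B against 51 on Holo2-8B, an aggregate ratio of roughly 1.75x.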

The comparison is worth pausing on. The older model has fewer parameters (8B vs 12B), which you would normally expect to mean higher throughput at equivalent concurrency. That Holotron-12B outperforms it by this margin at high concurrency is best explained by the SSM layers removing most of the KV cache pressure. At low concurrency, a well-optimized transformer will often match or beat an SSM of the same parameter count. At high concurrency with long contexts, the KV cache becomes the bottleneck, and the SSM layers sidestep it; only the few remaining attention layers still grow a cache with context length.
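To see why memory rather than parameter count sets the concurrency ceiling, the sketch below counts how many sessions fit in a fixed cache budget. Every number in it is a hypothetical assumption: the 40 GiB budget, the layer split, the head dimensions, and the per-layer SSM state size are not published figures for Holotron-12B or Holo2-8B.

```python
GIB = 2**30
MIB = 2**20


def kv_bytes(tokens, attn_layers, kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV cache bytes for one session; grows linearly with the token count."""
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_value * tokens


def max_sessions(budget_bytes, per_session_bytes):
    """How many sessions fit in the cache budget."""
    return budget_bytes // per_session_bytes


budget = 40 * GIB   # hypothetical memory left over after loading weights
tokens = 32_000     # hypothetical context length per session

# Hypothetical all-attention model: 40 attention layers.
full_attention = kv_bytes(tokens, attn_layers=40)

# Hypothetical hybrid: 6 attention layers plus 34 SSM layers,
# each SSM layer holding a fixed 4 MiB state regardless of context length.
hybrid = kv_bytes(tokens, attn_layers=6) + 34 * 4 * MIB

print("all-attention sessions:", max_sessions(budget, full_attention))
print("hybrid sessions:       ", max_sessions(budget, hybrid))
```

With these made-up dimensions the hybrid fits several times as many sessions in the same budget. The exact ratio depends entirely on the assumed layer split; the shape of the result is what carries over.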

For anyone deploying a computer use agent as a service, where the goal is to run as many simultaneous sessions as possible on a fixed hardware budget, that scaling behavior is the figure that matters most.

Task Performance

The task success numbers are also worth examining. On WebVoyager, a benchmark of web navigation tasks requiring multi-step browser interaction, Holotron-12B scores 80.5%. The base NVIDIA Nemotron model scores 35.1% on the same benchmark. That 45.4-point gap is large for supervised fine-tuning alone, with no reinforcement learning involved. H Company fine-tuned on approximately 14 billion tokens of proprietary localization and navigation data, covering screen understanding, UI element grounding, and interactive environment navigation.

The WebVoyager score is competitive. Recent work on GPT-4V-based agents reported WebVoyager scores in the 55-60% range, and more recent frontier models with full computer use capabilities have pushed higher. H Company does not publish direct comparisons to Anthropic’s Claude computer use or OpenAI’s CUA, which makes it difficult to place Holotron-12B precisely on the capability axis. The OSWorld grounding subtasks (OS-World-G, GroundUI, WebClick) are mentioned as showing substantial improvement over the base model, but full OSWorld task success rate numbers are absent from the published results.

That absence is notable. OSWorld full task success is the standard leaderboard metric for comparing computer use systems across vendors. The decision to omit it, whether intentional or because the evaluation is still running, means Holotron-12B currently cannot be placed on the same axis as Claude 3.5 Sonnet or other systems that have published OSWorld results.

Policy Models vs. General VLMs

The framing H Company uses is worth taking seriously: Holotron-12B is described as a policy model for computer use, not a general-purpose vision-language model. That distinction matters architecturally and in terms of what the model is actually good at.

General VLMs are trained to handle a broad distribution of image-text tasks. Policy models are trained to output actions in a specific environment. The 14 billion tokens of localization and navigation data that went into fine-tuning Holotron-12B are domain-specific enough that the model likely underperforms on general VQA or captioning benchmarks relative to its parameter count. What it gains in exchange is precise grounding for interactive UI environments: the ability to identify a specific button in a cluttered interface or track state across many screenshots.

This specialization-throughput combination is arguably the right design for production computer use infrastructure. A model that is exceptionally good at clicking the right thing and can serve 100 concurrent sessions on a single GPU is more deployable than a larger, more general model that does both worse.

The vLLM Dependency

Holotron-12B’s throughput figures are measured specifically with vLLM 0.14.1. That version was released in early 2026 and includes improved SSM kernel support alongside the existing PagedAttention infrastructure. The coupling to a specific vLLM version signals that H Company’s production stack assumes continuous batching through vLLM rather than naive batched inference.

For anyone evaluating Holotron-12B for deployment, this is the practical reality: you get the published throughput numbers if you run vLLM 0.14.1 on an H100. Different hardware or inference backends will produce different results. The vLLM SSM backend has improved substantially since Mamba support was first added, but SSM throughput is still more sensitive to the serving infrastructure than standard transformer inference, because the recurrent state management requires careful kernel implementation to avoid unnecessary memory copies between steps.
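Since the published numbers are tied to specific versions, a small pre-flight check can confirm that an environment matches them. This is a minimal sketch using the standard library's `importlib.metadata`; the `check_pins` helper is hypothetical, and the distribution names `vllm` and `mamba-ssm` are assumed to be how those packages are installed in your environment.

```python
from importlib import metadata

# Versions named in the article; distribution names are assumptions.
PINS = {"vllm": "0.14.1", "mamba-ssm": "2.2.5"}


def check_pins(pins, lookup=metadata.version):
    """Return {package: (expected, found)} for each missing or mismatched package."""
    problems = {}
    for name, expected in pins.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means the installed versions match the pins. It says nothing about hardware, so throughput on anything other than an H100 still needs to be measured directly.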

What This Points To

H Company has been building toward a production computer use service, not just a research demo. The throughput-first design, the policy model framing, the explicit concurrency benchmarking at 100 workers, and the vLLM integration all point to infrastructure that is meant to run at scale. The model weights are public on Hugging Face under the NVIDIA Open Model License, which permits research and certain commercial uses.

The gap in their published benchmarks, the missing OSWorld full task success rate and the absence of comparisons to Anthropic and OpenAI systems, leaves the capability picture incomplete. But the architecture argument for SSMs in high-concurrency agentic workloads is sound, and the throughput numbers back it up. Whether an 80.5% WebVoyager score and a near-constant-memory serving model are sufficient for the tasks you need to automate depends entirely on what those tasks are.

For deployments where the bottleneck is GPU memory at high concurrency rather than per-task accuracy, Holotron-12B is a serious engineering option. That is a narrower claim than the source article implies, but it is a more useful one.