
The Memory Argument for SSM-Based Computer Use Agents

Source: huggingface

H Company released Holotron-12B on March 16, 2026, and the headline claim is straightforward: on a single H100 running 100 concurrent workers, it delivers around 8,900 tokens per second against WebVoyager-style workloads. Their previous 8B model tops out near 5,100 tokens per second under the same conditions. The 12B model outperforms the 8B model on throughput despite having more parameters, and that is the whole story worth unpacking.

What the KV Cache Actually Costs at Scale

When you run a transformer-based model at high batch sizes, the KV cache dominates your VRAM budget. For each sequence in the batch, the model stores key and value tensors for every token, for every attention layer. The element count grows as batch_size × sequence_length × num_layers × 2 × hidden_dim, multiplied by the bytes per element (grouped-query attention shrinks the hidden_dim term but does not change the scaling). At short sequences and small batch sizes, this is manageable. At 100 concurrent agent sessions each running multi-turn conversations with screenshots attached, it becomes the binding constraint long before compute does.
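To make the scaling concrete, here is a back-of-the-envelope sketch of that formula. The layer count, hidden size, and per-session token count are illustrative assumptions, not the actual configuration of Holo2-8B or any other model, and the estimate assumes full multi-head attention with 2-byte (bfloat16) elements.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, hidden_dim, bytes_per_elem=2):
    """Estimate KV cache size: one key and one value vector per token per layer."""
    return batch_size * seq_len * num_layers * 2 * hidden_dim * bytes_per_elem


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 2**30


# Hypothetical values chosen only for illustration.
ASSUMED_LAYERS = 32
ASSUMED_HIDDEN_DIM = 4096
ASSUMED_TOKENS_PER_SESSION = 16_000

one_session = kv_cache_bytes(1, ASSUMED_TOKENS_PER_SESSION, ASSUMED_LAYERS, ASSUMED_HIDDEN_DIM)
hundred_sessions = kv_cache_bytes(100, ASSUMED_TOKENS_PER_SESSION, ASSUMED_LAYERS, ASSUMED_HIDDEN_DIM)

print(f"1 session:    {to_gib(one_session):.2f} GiB")
print(f"100 sessions: {to_gib(hundred_sessions):.2f} GiB")
```

Under these assumed numbers a single long session already needs several GiB of cache, and 100 of them exceed the memory of any single GPU, which is why real deployments either cap concurrency or rely on tricks such as grouped-query attention and paging.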

For computer use specifically, the problem is worse than it sounds. Agents generate long observation-action traces, and each step may include one or more screenshot images tokenized into hundreds of vision tokens. Holotron-12B supports up to 4 images at a 12-tile layout with 512×512 tiles, which means a single observation step can inject thousands of tokens into the context. Multiply that by concurrent sessions and the KV cache inflates fast.
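A rough count shows how quickly screenshots add up. The image and tile limits below come from the figures above; the number of vision tokens produced per 512×512 tile is an assumed placeholder, since it is not stated here and depends on the encoder's patch size and any downsampling.

```python
def vision_tokens_per_step(num_images, tiles_per_image, tokens_per_tile):
    """Upper bound on vision tokens added to the context by one observation step."""
    return num_images * tiles_per_image * tokens_per_tile


MAX_IMAGES = 4            # stated limit
TILES_PER_IMAGE = 12      # stated 12-tile layout
ASSUMED_TOKENS_PER_TILE = 256  # hypothetical value, not a published figure

per_step = vision_tokens_per_step(MAX_IMAGES, TILES_PER_IMAGE, ASSUMED_TOKENS_PER_TILE)
per_episode = per_step * 10  # a hypothetical 10-step trajectory at the maximum layout

print(f"vision tokens per step (upper bound): {per_step}")
print(f"vision tokens over 10 such steps:     {per_episode}")
```

Most steps will use fewer images and tiles than the maximum, but even a fraction of this bound per step is enough to push a multi-turn trajectory into tens of thousands of tokens.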

A Mamba state-space model handles this differently. The recurrence is linear, and each layer maintains a fixed-size hidden state per sequence, independent of how many tokens have been processed. The state does not grow with sequence length, so each additional session costs a small, fixed amount of memory rather than a cache that keeps expanding as the trajectory gets longer. VRAM consumption therefore rises far more slowly as you add concurrent sessions, which is why the throughput curve for Holotron-12B keeps rising to 100 concurrent workers while the transformer model plateaus and then falls behind.
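The difference in memory behavior can be seen in a toy example. The sketch below is a simplified diagonal linear recurrence, not Mamba's actual selective scan (where the parameters depend on the input), and the function names and values are made up for illustration. It only demonstrates that the recurrent state keeps the same size regardless of sequence length, while a KV-style cache grows by one entry per token.

```python
def run_linear_recurrence(tokens, decay, in_proj):
    """Apply h_t = decay * h_{t-1} + in_proj * x_t, keeping only the latest state."""
    state = [0.0] * len(decay)
    for x in tokens:
        state = [a * h + b * x for a, h, b in zip(decay, state, in_proj)]
    return state


def run_kv_cache(tokens):
    """Append one (key, value) stand-in per token, as an attention layer would."""
    cache = []
    for x in tokens:
        cache.append((x, x))
    return cache


decay = [0.9, 0.5, 0.1, 0.99]   # hypothetical per-channel decay factors
in_proj = [1.0, 1.0, 1.0, 1.0]  # hypothetical input projection

short_seq = [1.0] * 10
long_seq = [1.0] * 10_000

print(len(run_linear_recurrence(short_seq, decay, in_proj)),
      len(run_linear_recurrence(long_seq, decay, in_proj)))
print(len(run_kv_cache(short_seq)), len(run_kv_cache(long_seq)))
```

The first line prints the same state size for both sequences; the second prints cache lengths equal to the token counts. The price of the fixed-size state is that older inputs are blended together rather than stored individually.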

Holotron-12B’s Architectural Choices

The model uses NVIDIA’s Nemotron-Nano-12B-v2-VL as its base, which is a hybrid architecture: some layers use standard multi-head attention and some use Mamba SSM blocks. The vision encoder is CRadioV2-H, a vision transformer that processes screen images into patch embeddings before the language backbone handles planning and action prediction.

This hybrid design is a meaningful trade-off. Pure SSM models tend to struggle with tasks that require precise retrieval of specific earlier tokens, because the fixed-size state compresses the full history lossily. Hybrid models keep attention layers for the cases where exact token-level recall matters and use SSM layers everywhere else. The result is a model that behaves well on long contexts while retaining the memory scaling properties of SSMs for serving.

H Company fine-tuned this base on approximately 14 billion tokens of proprietary screen understanding, UI grounding, and interaction trajectory data. Their training is two-stage supervised fine-tuning without the GRPO reinforcement learning pass they applied to the Holo2 series. The fine-tuning brought WebVoyager accuracy from 35.1% (base model) to 80.5%, which puts it slightly above their transformer-based Holo2-8B (80.2%) and below Holo2-30B-A3B (83.0%).

Throughput vs. Capability Trade-offs

The obvious question is what accuracy you give up to get the throughput gain. Based on the published numbers, the answer for WebVoyager is essentially nothing: Holotron-12B matches or slightly exceeds Holo2-8B on the one benchmark for which H Company publishes exact numbers. The localization benchmarks (ScreenSpot, OSWorld-G, GroundUI, WebClick) show clear improvements over the untuned base, though H Company presents those as bar charts without exact figures.

Comparing against the broader field: UI-TARS-1.5-7B from ByteDance averages around 70.35% across H Company’s localization benchmark suite, Qwen2.5-VL-72B sits at 70.93%, and Holo2-8B reaches 78.00%. Those are localization averages rather than WebVoyager scores, so they are not directly comparable to the 80.5% figure, but they show where the Holo2 family sits in the field. Holotron-12B edges out Holo2-8B on WebVoyager while fitting in 12B parameters and providing substantially better throughput characteristics. For teams running data generation pipelines or online RL rollout loops, that combination matters considerably more than a few percentage points on a single benchmark.

The intended use cases H Company lists are revealing: data generation, annotation at scale, and online reinforcement learning. These are batch inference workloads. When you train the next generation of your own model using synthetic trajectory data, or when you run GRPO with a large rollout batch, you need to run thousands of agent episodes quickly and cheaply. An architecture that can saturate an H100 with 100 concurrent sessions instead of 50 can cut rollout wall-clock time by up to half, provided throughput keeps scaling with concurrency.
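Using the single-H100 throughput figures quoted earlier (about 8,900 versus 5,100 tokens per second), the effect on a rollout job can be estimated with simple arithmetic. The episode count and tokens per episode below are hypothetical, and the sketch assumes the workload is purely throughput-bound and counts tokens the same way the published figures do.

```python
HOLOTRON_12B_TPS = 8_900  # tokens/s at 100 concurrent workers, as quoted above
HOLO2_8B_TPS = 5_100      # tokens/s under the same conditions, as quoted above


def rollout_hours(num_episodes, tokens_per_episode, tokens_per_second):
    """Wall-clock hours to process a batch of episodes at a sustained token rate."""
    return num_episodes * tokens_per_episode / tokens_per_second / 3600


ASSUMED_EPISODES = 10_000           # hypothetical job size
ASSUMED_TOKENS_PER_EPISODE = 2_000  # hypothetical trajectory length

hours_12b = rollout_hours(ASSUMED_EPISODES, ASSUMED_TOKENS_PER_EPISODE, HOLOTRON_12B_TPS)
hours_8b = rollout_hours(ASSUMED_EPISODES, ASSUMED_TOKENS_PER_EPISODE, HOLO2_8B_TPS)
speedup = hours_8b / hours_12b

print(f"Holotron-12B: {hours_12b:.2f} h, Holo2-8B: {hours_8b:.2f} h, speedup: {speedup:.2f}x")
```

The speedup is simply the ratio of the two throughput figures, roughly 1.75×, and it is independent of the assumed job size.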

The Serving Stack

Holotron-12B requires vLLM v0.14.1 or later for the SSM optimizations to be effective. SGLang supports it via the --max-mamba-cache-size flag, and TRT-LLM is also listed. The model runs on H100, H200, A100, L40S, B200, RTX PRO 6000, and GB200.

The dependency constraints are strict. The transformers version is pinned to >4.53,<4.54, mamba-ssm must be exactly 2.2.5, and causal_conv1d is required. If you run a mixed model zoo in a shared vLLM deployment, this creates potential version conflicts you will need to work around, likely through containerization.
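A small preflight check can catch a mismatched environment before a model load fails halfway through. The sketch below encodes the pins described above using only the standard library; the distribution names passed to importlib.metadata ("mamba-ssm", "causal-conv1d") are assumptions about how the packages are published, and the version parsing is deliberately simplified (it ignores pre-release and post-release suffixes).

```python
import re
from importlib import metadata


def version_tuple(version, width=3):
    """Parse leading numeric release segments, e.g. '4.53.2' -> (4, 53, 2)."""
    parts = []
    for piece in version.split(".")[:width]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts + [0] * (width - len(parts)))


def check_pins(installed):
    """Return a list of problems; an empty list means the pins are satisfied."""
    problems = []
    tf = installed.get("transformers")
    if tf is None or not (version_tuple("4.53") < version_tuple(tf) < version_tuple("4.54")):
        problems.append(f"transformers must satisfy >4.53,<4.54 (found {tf})")
    mamba = installed.get("mamba-ssm")
    if mamba is None or version_tuple(mamba) != (2, 2, 5):
        problems.append(f"mamba-ssm must be 2.2.5 (found {mamba})")
    if installed.get("causal-conv1d") is None:
        problems.append("causal-conv1d is not installed")
    return problems


def installed_versions(names):
    """Look up installed versions, mapping missing packages to None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    current = installed_versions(["transformers", "mamba-ssm", "causal-conv1d"])
    for problem in check_pins(current) or ["all pins satisfied"]:
        print(problem)
```

Running this inside each container image at build time is one way to keep a shared deployment from silently drifting off the pinned versions.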

A minimal inference setup looks roughly like:

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model = AutoModelForImageTextToText.from_pretrained(
    "Hcompany/Holotron-12B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Hcompany/Holotron-12B")

For production throughput, you want a serving engine with SSM support enabled, such as vLLM. Because the Mamba state has a fixed size per sequence, SGLang’s --max-mamba-cache-size flag lets you cap the memory reserved for cached sequence states, which gives you a knob to balance between maximum concurrency and cache hit rate.

Licensing and What It Means

Holo2-4B and Holo2-8B ship under Apache 2.0. Holotron-12B ships under the NVIDIA Open Model License, which permits commercial use but requires attribution, prohibits modification and redistribution of weights, and restricts use to NVIDIA hardware. If you build a product on top of Holotron-12B, you are tied to NVIDIA GPUs. If your infrastructure runs AMD or Google TPUs, the Apache-licensed Holo2 models remain your options.

The license difference also reflects the dependency on Nemotron-Nano as the base, which carries its own licensing terms from NVIDIA. This is a material constraint for anyone evaluating this model for long-term production use. The throughput story is compelling, but the combination of NVIDIA-only hardware and the restrictive redistribution terms narrows the addressable deployment environments.

Context in the Computer Use Agent Landscape

H Company sits in a crowded space. Claude’s computer use API has been in public beta since late 2024. OpenAI’s Operator targets browser automation. Microsoft’s Copilot integrations cover desktop GUI control. All of these rely on large transformer-based models with general capabilities.

H Company’s bet is specialization. Rather than a general-purpose model capable of computer use among many things, they ship a model trained specifically on screen interaction data, at a scale that enables high-concurrency batch workloads. The Surfer-H product, built on their Holo1 family, published a cost of $0.13 per task on WebVoyager. At 80.5% accuracy on the same benchmark, Holotron-12B positions itself as the engine for pipelines where you need to run many agent tasks in parallel, not for interactive single-user workflows.

The roadmap points toward Nemotron 3 Omni as the next base, which would add Mixture of Experts and enhanced hybrid SSM-Attention blocks. If that lands while maintaining the constant-memory properties, the throughput ceiling on fixed hardware should increase further.

For the immediate question of whether to use Holotron-12B today: if you are building a data flywheel that depends on running large volumes of browser or desktop automation tasks and you already run NVIDIA H100s, the throughput argument is concrete and the benchmark accuracy is competitive. If you need Apache-licensed weights or hardware flexibility, Holo2-8B gives you most of the capability on a less constrained license. The architectural insight, that memory footprint rather than FLOPs is the scaling constraint for concurrent agentic inference, is well-founded and likely to shape the next generation of computer use model designs regardless of which model family you use.
