
Holotron-12B Is an Architecture Argument, Not Just a Benchmark

Source: huggingface

H Company’s Holotron-12B earns attention not because it tops a benchmark leaderboard but because it represents a deliberate architectural divergence from every model H Company has shipped before it. Every prior model in the Holo lineage, from the original Holo1-3B through the Holo2-235B-A22B MoE, was fine-tuned from a Qwen base. Holotron-12B breaks that pattern entirely by starting from NVIDIA’s Nemotron-Nano-12B-v2-VL, which is itself built on the Nemotron-H architecture described in NVIDIA’s technical report. The reason for the shift has everything to do with what computer-use agents actually are as a computational process.

GUI Agents Are State Machines, Not Document Retrievers

Standard transformer attention was designed around a retrieval intuition: given a query, attend over the full context to surface relevant information. That framing works well for tasks like question answering, document summarization, and code completion, where the model needs to pull from arbitrary positions in the input.

A GUI agent running a multi-step browser task is doing something structurally different. It maintains a sequential process: observe the screen, decide on an action, execute it, observe the resulting screen, decide on the next action. The agent is not retrieving from a static document; it is updating a running state. Each screenshot reflects the world after the previous action. The relevant context for the current decision is almost always recent, and the accumulated history of prior screenshots exists mainly to inform the model of what has already been attempted.
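That loop is simple enough to sketch as code. Nothing below is H Company’s API; the observe/decide/execute callables and the bounded observation window are illustrative assumptions about how such an agent harness is typically structured:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Running state of a GUI agent session: the log of actions already
    attempted, plus only the last few observations (the relevant context
    for the current decision is almost always recent)."""
    history: list = field(default_factory=list)
    recent_obs: list = field(default_factory=list)

    def update(self, obs, window=4):
        # Keep a bounded window of screenshots rather than the full trace.
        self.recent_obs = (self.recent_obs + [obs])[-window:]

def run_episode(observe, decide, execute, max_steps=50):
    """Observe -> decide -> execute until the policy signals completion."""
    state = AgentState()
    for _ in range(max_steps):
        state.update(observe())
        action = decide(state)   # conditioned mostly on recent state
        if action == "done":
            break
        execute(action)
        state.history.append(action)
    return state.history
```

The point of the sketch is the access pattern: each decision reads a small, recent slice of state, and the long history exists only as a record of what was already tried.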

This access pattern matches a recurrent architecture more naturally than it matches full attention. The Mamba SSM paper from Gu and Dao formalized the argument that structured state space models can capture long-range dependencies without the quadratic memory cost of attention, precisely because they compress history into a fixed-size state vector rather than retaining every past token explicitly.
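A scalar toy version of that argument makes the memory contrast explicit. This is not the actual Mamba-2 selective-scan update, which is gated and input-dependent; it is the simplest possible linear recurrence versus the simplest possible cache:

```python
def ssm_scan(xs, a=0.9, b=0.1):
    """h_t = a * h_{t-1} + b * x_t: the entire history is compressed
    into a single fixed-size state. Memory is O(1) in sequence length."""
    h = 0.0
    for x in xs:
        h = a * h + b * x
    return h

def attention_cache(xs):
    """Attention retains every past token explicitly so it can be
    retrieved later. Memory is O(T) in sequence length."""
    cache = []
    for x in xs:
        cache.append(x)
    return cache
```

The fixed-size state is lossy by construction, which is exactly the trade the hybrid architecture below is designed to manage.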

Holotron-12B operationalizes that argument at production scale.

What the Nemotron-H Architecture Actually Looks Like

Nemotron-H is not a pure SSM. It is a hybrid that places Mamba-2 SSM layers and standard attention layers in a fixed interleaved pattern. The Holotron-12B configuration has 62 hidden layers total. Based on the model’s hybrid_override_pattern field in its config, roughly 6 of those 62 layers are attention layers, with the remaining 56 being Mamba-2 SSM layers. Attention layers appear approximately every 8 to 10 positions, giving a ratio of about 9:1 SSM to attention.

The parameters involved are:

{
  "model_type": "nemotron_h",
  "num_hidden_layers": 62,
  "num_attention_heads": 40,
  "num_key_value_heads": 8,
  "hidden_size": 5120,
  "mamba_num_heads": 128,
  "mamba_head_dim": 80,
  "ssm_state_size": 128,
  "max_position_embeddings": 131072
}
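The layer layout itself lives in the hybrid_override_pattern string, which is not shown above. The pattern below is a hypothetical reconstruction consistent with the counts in this article, assuming Nemotron-H’s convention of 'M' for a Mamba-2 layer and '*' for an attention layer; the real string is in the model’s config.json:

```python
from collections import Counter

# Hypothetical 62-character pattern: an attention layer ('*') roughly
# every 9 positions, Mamba-2 ('M') everywhere else.
pattern = ("M" * 8 + "*") * 6 + "M" * 8  # 6*9 + 8 = 62 layers

counts = Counter(pattern)
print(counts["M"], counts["*"])  # 56 6
```

Counting characters in the pattern is how the 56-to-6 split quoted above is recovered from the config.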

The ssm_state_size of 128 is the critical number. Every Mamba-2 layer maintains a 128-dimensional recurrent state per head, and that state size does not change with sequence length. A standard attention layer in the same model accumulates a KV cache that grows by 2 × hidden_size per token per layer. At 128K context length with 40 attention heads across 6 attention layers, the KV cache for a single session reaches several gigabytes. The SSM layers contribute a constant memory footprint regardless of how long the session runs.

The grouped query attention (8 key-value heads vs 40 query heads) on the attention layers also reduces the KV cache relative to standard multi-head attention, compressing those layers further. But the dominant effect is the SSM layers: 56 layers with bounded memory versus 6 layers that grow.
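Plugging the config values in makes the comparison concrete. The arithmetic below assumes bfloat16 (2 bytes per element), and the Mamba-2 state layout (heads × head_dim × state_size per layer) is a simplification of the real kernel’s buffers:

```python
BYTES = 2  # bfloat16
seq_len = 131_072
hidden = 5120
n_heads, n_kv_heads = 40, 8
head_dim = hidden // n_heads  # 128
attn_layers, ssm_layers = 6, 56

# KV cache grows linearly with sequence length.
kv_mha = 2 * hidden * seq_len * attn_layers * BYTES               # full MHA
kv_gqa = 2 * n_kv_heads * head_dim * seq_len * attn_layers * BYTES  # GQA

# Mamba-2 recurrent state is constant regardless of sequence length.
ssm_state = 128 * 80 * 128 * ssm_layers * BYTES  # heads * head_dim * state

print(f"MHA KV cache: {kv_mha / 2**30:.1f} GiB")    # 15.0 GiB
print(f"GQA KV cache: {kv_gqa / 2**30:.1f} GiB")    # 3.0 GiB
print(f"SSM state:    {ssm_state / 2**20:.0f} MiB")  # 140 MiB
```

GQA cuts the attention layers’ cache by 5x, but the SSM layers’ roughly constant footprint of a hundred-odd megabytes is what keeps the per-session total bounded.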

The reason you still need attention layers at all is that pure recurrent models lose some capacity for selective retrieval from earlier context. If the agent needs to reference the exact content of a UI element observed 50 steps ago, the SSM state may not preserve that with full fidelity. The sparse attention layers preserve random-access capability for cases where it matters, while the SSM layers handle the sequential state accumulation that constitutes the bulk of the workload.

What Changes at Concurrency 100

H Company’s benchmark measures throughput on the WebVoyager task with 100 concurrent sessions on a single H100, using vLLM v0.14.1 with SSM-aware optimizations. The numbers are 8,900 tokens per second for Holotron-12B versus 5,100 tokens per second for Holo2-8B.

The interesting part of that comparison is the asymmetry in model size. Holotron-12B is larger than Holo2-8B by roughly 50% in parameter count, yet it achieves 75% higher throughput at high concurrency. The SSM architecture narrows the gap that parameter count would otherwise create, and at 100 concurrent sessions, it reverses the relationship.

The reason is straightforward: at 100 concurrent sessions with long contexts, the H100’s 80GB of HBM is largely occupied by KV caches for the transformer model, leaving less room for batching. The SSM model’s bounded per-session memory footprint allows larger effective batch sizes, which in turn lets the GPU’s compute throughput be utilized more fully. GPU inference at scale, and autoregressive decoding in particular, is typically memory-bandwidth bound rather than compute bound, so reducing memory pressure yields disproportionate throughput gains.
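A back-of-envelope batching calculation illustrates the effect. Every number here is an assumption for illustration (bf16 weights, roughly 3 GiB of KV cache per 128K transformer session, roughly 0.3 GiB per hybrid session), not a measured figure:

```python
GIB = 2**30
hbm = 80 * GIB
weights_bf16 = 12e9 * 2  # ~24 GB of weights for a 12B bf16 model
free = hbm - weights_bf16

per_session_transformer = 3.0 * GIB  # KV cache grows with context
per_session_hybrid = 0.3 * GIB       # small attention KV + constant SSM state

print(int(free // per_session_transformer))  # 19 sessions fit
print(int(free // per_session_hybrid))       # 192 sessions fit
```

The absolute numbers are rough, but the order-of-magnitude gap in how many concurrent sessions fit in HBM is the mechanism behind the throughput crossover.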

For serving infrastructure, this is the difference between needing 4 GPUs to run 100 concurrent agents and needing 2. At any realistic serving budget, that matters more than a few percentage points on a benchmark.

Deployment Requirements

The model does not drop into a standard Transformers or vLLM setup without additional dependencies:

pip install torch \
  "transformers>4.53,<4.54" \
  causal_conv1d \
  timm \
  "mamba-ssm==2.2.5" \
  accelerate \
  open_clip_torch \
  numpy \
  pillow

The mamba-ssm==2.2.5 package provides the CUDA-optimized Mamba-2 kernels. Without them, the SSM layers fall back to a pure-Python path, which eliminates most of the throughput advantage. The tight version pin on transformers reflects that Nemotron-H support landed in that release window; later versions may not be compatible without code changes.

Serving via vLLM follows the standard pattern:

vllm serve Hcompany/Holotron-12B \
  --trust-remote-code \
  --dtype bfloat16 \
  --video-pruning-rate 0

For programmatic use, the inference API is consistent with other Transformers vision-language models:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "Hcompany/Holotron-12B",
    trust_remote_code=True,
    device_map="cuda:0",
    torch_dtype=torch.bfloat16,
).eval()

processor = AutoProcessor.from_pretrained(
    "Hcompany/Holotron-12B",
    trust_remote_code=True,
)

H Company’s hai-cookbook repository contains integration examples that apply to Holotron-12B as well as the Holo2 family.

The Vision Side

The language backbone is only part of the architecture. Holotron-12B uses the RADIOv2-H encoder for vision, a multi-teacher distilled ViT trained from CLIP, SigLIP, DINOv2, and SAM simultaneously. The result is an encoder that handles multiple visual domains without specialized fine-tuning for each one.

The input resolution ceiling is high: up to 12 tiles at 512×512 pixels each, supporting full images up to 2048×1536 or 1536×2048 pixels. That matters for computer use because UI elements at standard desktop resolutions are often small, and a low-resolution encoder would lose the precise localization information needed to click the right button or read a small label.
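The tile budget follows from ceiling division on each axis. This is a sketch of standard dynamic-tiling arithmetic, not Holotron-12B’s actual preprocessing code:

```python
import math

TILE = 512
MAX_TILES = 12

def tile_count(width, height, tile=TILE):
    """Number of tile x tile crops needed to cover an image."""
    return math.ceil(width / tile) * math.ceil(height / tile)

# 2048x1536 is exactly a 4x3 grid of 512px tiles: the 12-tile ceiling.
print(tile_count(2048, 1536))  # 12
print(tile_count(1536, 2048))  # 12
# A standard 1920x1080 desktop capture also fits: 4 x 3 = 12 tiles.
print(tile_count(1920, 1080) <= MAX_TILES)  # True
```

At that tiling, common desktop resolutions are covered without downscaling, which is what preserves the pixel-level localization the article describes.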

Up to four images can be passed per request, which covers the typical pattern of providing the current screenshot alongside a few prior states for context.

Accuracy and What Is Not Published

Holotron-12B scores 80.5% on WebVoyager, compared to 80.2% for Holo2-8B and 83.0% for Holo2-30B-A3B. The model is competitive with the smaller prior-generation model and behind the larger one. That is a reasonable result for a model being optimized primarily for throughput rather than benchmark rank.

What H Company chose not to publish at launch is notable: there are no localization benchmark scores. The Holo2 models were benchmarked extensively on ScreenSpot-Pro, OSWorld-G, GroundUI-1K, WebClick, and ScreenSpot-v2. Holo2-8B scores 58.9% on ScreenSpot-Pro, and Holo2-30B-A3B scores 66.1%. Whether Holotron-12B is competitive on those tasks is not documented.

Localization, the ability to correctly identify and click specific UI elements on a screen, is arguably more important for real-world GUI automation than navigation success rate, because it determines whether the model can reliably execute the individual actions that compose a task. The gap in published numbers suggests either that Holotron-12B is still being evaluated on those benchmarks or that the results are not favorable enough to highlight alongside the throughput claims.

The License Constraint

Holo2 models are Apache 2.0, which places no meaningful restrictions on commercial use. Holotron-12B is governed by the NVIDIA Open Model License, which has conditions on commercial deployment at scale. For any production system that will be offered as a commercial service, those terms need review before Holotron-12B becomes the core model. H Company’s own prior models remain the cleaner option for unrestricted commercial use.

The Broader Pattern

Holotron-12B fits into a small but growing set of hybrid SSM models being built for production inference at scale. Falcon Mamba from the Technology Innovation Institute explored pure SSM architectures for language modeling. Jamba from AI21 Labs used a similar Mamba-Transformer hybrid approach for general-purpose language tasks. NVIDIA’s Nemotron-H is the first such architecture specifically packaged as a base for multimodal agent training.

The question the field is working through is whether the constant-state recurrence of SSM layers sacrifices enough retrieval capability to matter in practice, or whether the sparse attention layers in a hybrid are sufficient to cover the cases where precise recall is needed. Holotron-12B’s WebVoyager score, matching an 8B transformer at 12B parameters, suggests the hybrid arrangement holds up reasonably well on navigation tasks. The localization benchmark data, when published, will be more informative about whether the architecture generalizes across the full range of GUI agent demands.

H Company has stated that their next model will post-train NVIDIA Nemotron 3 Omni, which is expected to incorporate a Mixture-of-Experts design on top of the hybrid SSM-attention architecture. That combination (SSM layers for sequential state, sparse attention for retrieval, MoE for parameter efficiency) would address each of the main constraints separately. Whether those design bets compound well in practice is the question the next release cycle will answer.
