The Base Model Switch That Defines Holotron-12B

H Company has released five Holo models before Holotron-12B, and every one of them was fine-tuned from a Qwen base. Holo1-3B through Holo2-235B-A22B all follow that pattern. Holotron-12B breaks it by building on NVIDIA’s Nemotron-Nano-12B-v2-VL-BF16, a model that uses a fundamentally different architecture, and the switch carries consequences beyond the throughput benchmarks that lead the release.

The Nemotron-H Architecture

Nemotron-Nano-12B-v2-VL implements what NVIDIA calls Nemotron-H: a hybrid design that interleaves Mamba-2 state space model layers with standard grouped-query attention at roughly a 9:1 ratio across 62 total layers. The SSM layers maintain a fixed-size recurrent state, 128-dimensional per head across 128 Mamba heads, rather than a key-value cache that grows with context length. A small number of attention layers are distributed through the stack at regular intervals.

The memory implication for computer use is significant. A GUI automation session accumulates context at a rate that few other LLM workloads match: every step includes a screenshot encoded as hundreds to thousands of image tokens, plus the text history of prior actions and observations. After 30 steps, a session commonly holds 15,000 to 40,000 tokens. In a pure-transformer system, KV cache requirements scale linearly with that context, multiplied across concurrent sessions. In Holotron-12B’s hybrid stack, the SSM layers carry constant per-session state regardless of session length, which is why H Company can report 8,900 tokens per second at 100 concurrent workers on a single H100, versus roughly 5,100 for their prior Holo2-8B.

The attention layers interspersed through the SSM stack are not decorative. Pure SSM architectures compress session history into a fixed state, which makes precise recall of specific information from many steps ago unreliable. Occasional attention layers give the model access to exact token-level recall when it matters: recovering a field value observed early in a session, verifying that a confirmation message appeared at a specific step. Jamba from AI21 Labs explored the same design tradeoff in 2024 and reached the same conclusion, that hybrid designs outperform pure SSM for practical task completion while retaining most of the memory efficiency benefit.

The Vision Encoder

What has received less attention in coverage of Holotron-12B is the vision component. The model uses RADIOv2-H as its vision encoder, a multi-teacher distilled vision transformer trained to simultaneously satisfy the objectives of CLIP, SigLIP, DINOv2, and SAM. Multi-teacher distillation produces a single encoder whose features work well across tasks that each of those models specializes in: CLIP for semantic image-text alignment, SigLIP for fine-grained vision-language matching, DINOv2 for dense visual features, and SAM for spatial segmentation boundaries.

For GUI grounding specifically, that combination matters. Clicking the right element on a screen is not a semantic matching problem in the usual sense. It requires recognizing a UI control as a distinct object with spatial extent, matching it to a textual description, and producing accurate pixel coordinates. SAM-like spatial awareness helps identify where a UI element ends and the surrounding interface begins. DINOv2-like dense features help distinguish visually similar controls, such as two buttons at similar positions with different labels. The multi-teacher approach means the encoder was not trained to optimize for any single aspect of that problem at the expense of others.

Holotron-12B supports up to 12 image tiles of 512x512 pixels per request, accommodating screenshots up to approximately 2048x1536 resolution, with a maximum of 4 images per request. That resolution budget is generous for GUI work, where the fine detail of small button labels and form fields is routinely what determines whether a grounding prediction lands on the right element or misses by a few pixels.

Throughput Numbers in Context

H Company reports 80.5% on WebVoyager versus 35.1% for the base Nemotron model, after 14 billion tokens of fine-tuning on proprietary screen understanding, UI localization, and navigation data across two training stages. GPT-4V with human evaluation scored around 55.7% when the WebVoyager paper was originally published. The 45-point improvement from fine-tuning reflects that raw vision-language capability does not transfer cleanly to pixel-level GUI interaction without substantial domain adaptation.

What the release does not publish is an OSWorld full-task success rate. OSWorld covers desktop task completion across browsers, spreadsheets, file managers, and terminal emulators, and it is the benchmark where cross-vendor comparisons are made. Claude 3.5 Sonnet scored roughly 22 to 27% on full OSWorld desktop tasks in late 2024. UI-TARS 72B from ByteDance pushes significantly higher but at six times the parameter count. H Company’s ScreenSpot-Pro scores for prior Holo2 models are published (58.9% for Holo2-8B, 66.1% for Holo2-30B-A3B); comparable numbers for Holotron-12B are not. The pattern of published benchmarks emphasizes web navigation over desktop task diversity, consistent with a model trained primarily on browser-based environments.

This is worth noting because Holotron-12B sits in a different competitive category from Claude computer use or Operator. Those systems adapt general-purpose vision-language models for computer use without purpose-training a dedicated policy model. They optimize for breadth across desktop environments. Holotron-12B optimizes for throughput and concurrency within a narrower deployment context, closer in spirit to ByteDance’s UI-TARS than to Anthropic’s approach.

License Change

Every prior Holo2 model was released under Apache 2.0. Holotron-12B uses the NVIDIA Open Model License, which permits commercial use but imposes different conditions. The NVIDIA Open Model License requires disclosure that derivative models are based on a Nemotron model and prohibits certain classes of competitive use. For teams operating under legal review for open model deployments, this is not a drop-in replacement for the prior Holo2 licenses, and the difference warrants careful reading before committing to a production deployment.

The license change is a direct consequence of the base model choice. NVIDIA’s Nemotron-Nano-12B-v2-VL carries its own terms, and downstream fine-tunes must respect them. H Company presumably weighed this when deciding to move from Qwen. The throughput advantages and architectural alignment with NVIDIA’s inference stack were apparently sufficient to accept the more constrained license.

Deployment Specifics

Running Holotron-12B requires a pinned dependency set:

pip install torch "transformers>4.53,<4.54" causal_conv1d timm "mamba-ssm==2.2.5" accelerate open_clip_torch numpy pillow

The mamba-ssm==2.2.5 pin provides CUDA-optimized kernels for Mamba-2 layers. Without those kernels, SSM computation falls back to slow Python code and most of the throughput gain disappears. The tight transformers bound reflects that Nemotron-H support was added at a specific release; broader version ranges will fail to load the model correctly. causal_conv1d is a Mamba-specific dependency that handles the 1D convolution component of SSM layer computation.

vLLM serving is supported at version 0.14.1, which added SSM-aware memory management. Earlier versions handle only transformer-style KV cache management and do not correctly preserve per-session SSM recurrent state across decoding steps. SGLang and TRT-LLM are also listed as supported serving frameworks, both maintained within NVIDIA’s inference software ecosystem.

vllm serve Hcompany/Holotron-12B --trust-remote-code --dtype bfloat16 --video-pruning-rate 0

Supported hardware includes H100, H200, A100, L40S, B200, and GB200.

What the Roadmap Signals

H Company’s stated next model will post-train on Nemotron 3 Omni, which adds Mixture-of-Experts routing on top of the hybrid SSM-Attention design. MoE reduces active compute per forward pass by routing each token through a subset of experts rather than the full parameter set. The combination of constant-state SSM for memory efficiency, selective attention for precise recall, and MoE for compute efficiency would address three distinct costs simultaneously.

The Qwen-to-Nemotron switch is the most concrete signal about where H Company sees the architectural frontier for computer use, and it was not made lightly: it brings license constraints, a new dependency chain, and a departure from a well-understood fine-tuning base. The throughput gains at scale and the alignment with NVIDIA’s roadmap toward hybrid SSM plus MoE architectures make the case for why those tradeoffs were worth accepting.