Granite 4.1: IBM Walks Back the Hybrid Experiment

IBM released Granite 4.1 this week, and the most interesting thing about it is what is missing. Granite 4.0, released a few months ago, was IBM’s big bet on a hybrid Mamba-2/Transformer architecture, with a Mixture-of-Experts variant called Granite 4.0-H-Small running 32B total parameters and 9B active. It was the headline feature: linear-time state-space blocks interleaved with classic softmax attention, marketed as the path to lower memory and longer context.

Granite 4.1 throws all of that out. The new 3B, 8B, and 30B models are plain dense decoder-only transformers. GQA, RoPE, SwiGLU, RMSNorm, shared input/output embeddings. There is nothing here architecturally that you would not have seen in a Llama 2 paper. And yet IBM claims the 8B dense model matches or beats the older 32B MoE hybrid across most benchmarks.

That is the story worth chewing on.

What changed under the hood

The spec sheet for the new models is short. The 8B has 40 layers, 32 attention heads with 8 KV heads (so GQA at a 4:1 ratio), 128-dim heads, and a 12800 MLP hidden size. The 30B scales up to 64 layers and a 32768 MLP hidden, but keeps the same 8 KV heads. Context extends to 512K tokens through staged training, going 32K to 128K to 512K in the final phase. License is Apache 2.0, which keeps the line consistent with the rest of the Granite family.
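As a sanity check on those numbers, the per-layer weights can be tallied by hand. The sketch below assumes the model width equals heads times head size (32 x 128 = 4096), which the spec sheet implies but does not state, and it leaves out embeddings and norm weights because the vocabulary size is not given here. The function name is made up for illustration.

```python
def dense_block_params(n_layers, n_heads, n_kv_heads, head_dim, mlp_hidden):
    """Count attention + SwiGLU MLP weights in a GQA decoder stack.

    Assumes d_model = n_heads * head_dim and no biases; embeddings and
    RMSNorm weights are deliberately left out.
    """
    d_model = n_heads * head_dim
    kv_dim = n_kv_heads * head_dim
    # Q and output projections are d_model x d_model; K and V are d_model x kv_dim.
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim
    # SwiGLU uses three matrices: gate, up, and down.
    mlp = 3 * d_model * mlp_hidden
    return n_layers * (attn + mlp)


params_8b = dense_block_params(n_layers=40, n_heads=32, n_kv_heads=8,
                               head_dim=128, mlp_hidden=12800)
print(f"{params_8b / 1e9:.2f}B")  # 7.97B
```

Under those assumptions the transformer blocks alone land just under 8B, so the layer count and widths quoted above are at least consistent with the model's name.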

The training recipe is more involved than the architecture. Roughly 15 trillion tokens across five phases: a 10T general pretrain dominated by CommonCrawl with 20% code and 7% math, then a 2T phase that flips the ratio to 35% math and 30% code, then a 2T annealing phase with chain-of-thought and instruction data baked in, a 0.5T refinement pass, and finally the long-context extension. Supervised finetuning runs on about 4.1M curated samples with an LLM-as-judge gate. RL is a four-stage pipeline ending in math-specific RL.
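The token budget is easier to reason about once the percentages are turned into absolute counts. A small sketch using only the figures quoted above, in billions of tokens; the dictionary keys are my own labels, not IBM's phase names.

```python
# Phase sizes in billions of tokens, as quoted above. The long-context
# extension is left out because its token count is not given.
phases = {
    "general pretrain": 10_000,
    "math/code heavy": 2_000,
    "annealing": 2_000,
    "refinement": 500,
}

# Percent of each phase devoted to code and math, where the text states it.
mix = {
    "general pretrain": {"code": 20, "math": 7},
    "math/code heavy": {"code": 30, "math": 35},
}

total = sum(phases.values())
tokens = {
    phase: {domain: phases[phase] * pct // 100 for domain, pct in shares.items()}
    for phase, shares in mix.items()
}

print(total)   # 14500
print(tokens)
```

The four sized phases sum to 14.5T, which squares with "roughly 15 trillion" once the long-context stage is added. The 2T second phase also carries about as many math tokens (0.7T) as the entire 10T first phase.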

Compare this to the Granite 4.0 technical paper, which leaned heavily on the hybrid story: Mamba-2 SSM blocks for most layers, a few standard attention layers interleaved at intervals, MoE routing for the H-Small variant. IBM was explicit that the hybrid was supposed to fix the quadratic-attention problem at long context while keeping transformer quality.

Why dense came back

The Mamba and hybrid-Mamba family has had a strange run. Mamba and Mamba-2 showed real wins on synthetic long-context tasks and inference throughput. Jamba from AI21 and Zamba from Zyphra both shipped hybrid architectures with strong throughput numbers. But the recall-heavy benchmarks have been less kind: MQAR and several long-context retrieval evals show pure SSMs degrading on associative recall, which is why hybrids exist in the first place.

The trade always was: you accept some recall loss and some training complexity for inference efficiency. If your benchmark numbers do not actually clear the dense baseline at the same parameter count, the trade evaporates. IBM’s own table shows the 8B dense hitting 73.84 on MMLU, 92.49 on GSM8K, 87.20 on HumanEval, and 68.27 on BFCL v3 for tool calling. Those numbers compete with Llama 3.1 8B Instruct and beat it on math and coding. The 32B Granite 4.0-H-Small MoE, with 9B active params, is similar or slightly behind on most of those.

So IBM is implicitly arguing that for the parameter scales and budgets they care about, the architectural cleverness was not paying for itself. A dense 8B with a better data pipeline and a longer training schedule did the job that an MoE hybrid was supposed to do with 4x the parameters.

This is not a unique conclusion. Qwen3 ships both dense and MoE variants, and the dense models remain competitive. DeepSeek-V3 goes all in on MoE, but DeepSeek’s earlier dense models, trained on similar data, did surprisingly well per active parameter. The pattern across the open-weights ecosystem in 2025 has been that data quality, RL post-training, and long-context staged extension matter more than whether you have an exotic attention variant.

The 512K context claim

The context-length claim deserves a closer look. Many models advertise large context windows but degrade sharply past 32K or 64K on needle-in-a-haystack and the RULER benchmark. IBM gets to 512K through staged extension: pretrain at the base length, then raise the RoPE base and continue training at 32K, 128K, and finally 512K. This is the same general recipe Llama 3.1 used to reach 128K, and the result there was usable but uneven recall toward the upper end of that window.
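To see why the RoPE base matters at each stage, it helps to look at the slowest-rotating frequency pair. A minimal sketch, assuming the standard RoPE frequency schedule and 128-dim heads; the base values are illustrative placeholders, not IBM's published settings, and the function name is hypothetical.

```python
import math


def longest_rope_wavelength(base: float, head_dim: int = 128) -> float:
    """Wavelength, in tokens, of the slowest RoPE frequency pair.

    Standard RoPE uses inverse frequencies base ** (-2i / head_dim) for
    i = 0 .. head_dim/2 - 1, so the slowest pair has i = head_dim/2 - 1.
    """
    slowest_inv_freq = base ** (-(head_dim - 2) / head_dim)
    return 2 * math.pi / slowest_inv_freq


# Hypothetical bases for illustration only.
for base in (1e4, 1e6, 1e8):
    wavelength = longest_rope_wavelength(base)
    print(f"base={base:.0e}  longest wavelength ~ {wavelength:,.0f} tokens")
```

With the classic base of 10,000 the slowest pair completes a full rotation in well under 128K tokens, so distant positions start to look alike. A larger base stretches that wavelength, and the continued training at each stage gives the model a chance to adapt to the new frequencies.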

IBM has not published RULER scores in the launch material I can see. That is the number I would want before believing the 512K claim in production. A dense 30B serving 512K context will also be brutal on KV cache memory, since the cache still grows linearly with context length even with GQA at 8 KV heads. Back of the envelope: 64 layers, 8 KV heads, 128-dim heads (assuming the 30B keeps the 8B’s head size), keys plus values in bfloat16, gives about 256KB per token, so 512K tokens is roughly 130GB of KV cache. That is two 80GB H100s’ worth of memory for the cache of a single full-length sequence, before counting the weights. The hybrid Mamba approach in Granite 4.0 was supposed to dodge most of this cost, since SSM state is fixed-size and only the few attention layers keep a growing cache.
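The back-of-the-envelope figure is easy to reproduce. The sketch below assumes the 30B uses 128-dim heads like the 8B, an unquantized bfloat16 cache (2 bytes per value), and a single sequence; the function name is my own.

```python
def kv_cache_bytes(n_tokens, n_layers=64, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV cache size for one sequence: keys and values for every layer and KV head."""
    # Leading factor of 2 covers the separate key and value tensors.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token


per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(512 * 1024)
print(per_token // 1024, "KiB per token")        # 256 KiB per token
print(full_context // 1024**3, "GiB at 512K")    # 128 GiB at 512K
```

Note that 128 GiB is about 137 GB in decimal units, so the cache for one full-length sequence does not fit on a single 80GB card under these assumptions.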

Which is why the architectural backtrack is interesting rather than purely a win. Granite 4.0’s hybrid still makes sense if you are doing long-context inference at scale. Granite 4.1’s dense models will be cheaper to train, cheaper to fine-tune, and easier for the open-source community to extend, but the inference economics at the advertised context window are not free.

What it looks like to use

The HF integration is unremarkable in the good sense. Standard AutoModelForCausalLM load, standard chat template, and tool calling exposed through apply_chat_template with a tools argument:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-4.1-8b"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda")

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }
}]

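chat = [{"role": "user", "content": "Weather in London?"}]
# Render the conversation plus tool schema into the model's prompt format.
prompt = tokenizer.apply_chat_template(chat, tools=tools, add_generation_prompt=True, tokenize=False)

# Generate, then decode only the newly produced tokens.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0, inputs["input_ids"].shape[1]:]))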

BFCL v3 at 68.27 puts it in the same neighborhood as Hermes 3 for tool calling at the 8B scale, which is fine for agent work without being class-leading. For Discord bot use cases, where I usually want a model that follows JSON schemas reliably and does not refuse to call functions, that score is the relevant one.

The bigger lesson

IBM is one of the few labs willing to ship two consecutive generations with such different architectures, and to be honest about the result. The implicit message of Granite 4.1 is that the hybrid-Mamba story did not generalize as well as the literature suggested, at least not at this scale and this budget. The data and training pipeline did most of the lifting.

That is a useful data point for anyone evaluating which architectural papers to actually care about. The benchmarks that move are still the ones driven by data, post-training, and RL. The architectural variants are interesting research, but the production wins keep coming from the boring parts of the stack.