When Matrix Multiplication Becomes Addition: The Engineering Behind BitNet
Source: hackernews
The headline for Microsoft’s BitNet is that a 100-billion-parameter language model can run on a CPU. That is true, and the math behind it is worth understanding: it rests on a fundamental change to how transformer computations work, not an incremental compression improvement.
The Difference Between Compression and Constraint
Most approaches to making LLMs smaller apply quantization after training. You train a model at full FP16 or BF16 precision, then run it through a quantizer that rounds weights to 4-bit integers, choosing which rounding strategy best preserves the original weight distribution. This is the approach behind GPTQ, AWQ, and the GGUF format used by llama.cpp and Ollama. These tools can take any model that has ever been trained and compress it; the compression trades some quality for speed and memory.
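To make the post-training approach concrete, here is a minimal round-to-nearest symmetric 4-bit quantizer in plain Python. This is a sketch of the basic idea only, with invented function names; production quantizers like GPTQ and AWQ add per-group scales, activation-aware calibration, and error compensation on top of this.

```python
def quantize_rtn_4bit(weights):
    """Round-to-nearest symmetric 4-bit quantization (illustrative sketch).

    Maps each float weight to an integer in [-7, 7] plus a shared scale,
    so the weights can be stored in 4 bits each and reconstructed
    approximately at load time.
    """
    scale = max(abs(w) for w in weights) / 7  # symmetric int4 range: [-7, 7]
    if scale == 0:
        return [0] * len(weights), 0.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from the quantized form."""
    return [v * scale for v in q]

w = [0.42, -1.3, 0.07, 0.9]
q, s = quantize_rtn_4bit(w)   # q == [2, -7, 0, 5]
w_hat = dequantize(q, s)      # approximate reconstruction of w
```

The key property of this family of methods is visible in the signature: the input is any finished set of trained weights, which is exactly why the GGUF ecosystem can compress arbitrary models.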
BitNet takes a different path. The b1.58 scheme, developed by Microsoft Research in 2024, constrains model weights to {-1, 0, +1} during training. Not after training. The ternary constraint is enforced from the first forward pass, via a custom layer called BitLinear that replaces standard nn.Linear. During training, the optimizer maintains full-precision weight copies for gradient updates, and a straight-through estimator lets gradients flow through the quantization step. At inference, only three weight values exist for any given parameter.
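The b1.58 paper quantizes weights with an "absmean" scheme: scale by the mean absolute weight, then round and clip into {-1, 0, +1}. A minimal sketch of that forward-pass step follows (the function name is mine, and real BitLinear operates on tensors, also quantizes activations to int8, and pairs this with a straight-through estimator in the backward pass):

```python
def ternarize_absmean(weights, eps=1e-8):
    """BitNet b1.58-style absmean quantization (forward pass sketch).

    gamma is the mean absolute weight; dividing by it and rounding with a
    clip to [-1, 1] forces every weight into {-1, 0, +1}.  During training
    the optimizer still updates full-precision shadow weights, and this
    function is re-applied on every forward pass.
    """
    gamma = sum(abs(w) for w in weights) / len(weights)
    return [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]

w = [0.8, -0.05, -1.2, 0.3]
# gamma = (0.8 + 0.05 + 1.2 + 0.3) / 4 = 0.5875
print(ternarize_absmean(w))  # -> [1, 0, -1, 1]
```

Note how the near-zero weight (-0.05) snaps to 0 rather than being forced to pick a sign; that third state is what distinguishes b1.58 from pure binary schemes.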
The “1.58 bits” naming comes from information theory: log₂(3) ≈ 1.58, which is the minimum bits required to represent three states. In storage, ternary values still require 2 bits to encode, so the real storage cost lands slightly above the theoretical minimum, but the computational picture is where the architecture earns its advantages.
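The arithmetic behind the name takes two lines to check (variable names are mine):

```python
import math

bits_per_ternary_value = math.log2(3)   # ~1.585 bits: minimum to encode 3 states
overhead = 2 / bits_per_ternary_value   # 2-bit storage is ~1.26x the minimum
```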
What Happens to Matrix Multiplication
The bottleneck in running a transformer is matrix multiplication. Each attention and feed-forward layer multiplies large activation matrices against weight matrices, requiring one floating-point multiply per weight-activation pair. On CPU, a floating-point multiply costs noticeably more energy and silicon than an integer add, and multiplication dominates power consumption across hardware classes.
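To put a rough number on that bottleneck, a back-of-envelope count for one linear layer (the dimension is assumed here for illustration, as a typical hidden size in a ~7B model):

```python
# One multiply per weight-activation pair: a d x d linear layer costs
# d^2 floating-point multiplies per token, before counting the dozens of
# such layers a transformer stacks.
d = 4096
multiplies_per_token = d * d   # 16,777,216 multiplies per layer per token
```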
When weights can only be {-1, 0, +1}, that multiplication collapses:
- Weight = 1: the activation passes through unchanged
- Weight = -1: the activation is negated
- Weight = 0: the term contributes nothing
A dot product across ternary weights becomes a series of conditional additions and subtractions. The inference kernel needs integer addition, negation, and zero-skipping. Floating-point multiply units are never engaged.
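A few lines of Python make the collapse concrete. This is an illustrative sketch with an invented function name; the real kernels operate on packed 2-bit weights with SIMD integer instructions and int8-quantized activations:

```python
def ternary_dot(weights, activations):
    """Dot product against ternary weights using only addition, negation,
    and zero-skipping -- no multiply instruction is ever issued."""
    acc = 0.0
    for w, a in zip(weights, activations):
        if w == 1:
            acc += a   # pass through unchanged
        elif w == -1:
            acc -= a   # negated
        # w == 0: the term contributes nothing, skip it
    return acc

ws = [1, -1, 0, 1]
xs = [0.5, 2.0, 3.0, -1.0]
print(ternary_dot(ws, xs))  # 0.5 - 2.0 - 1.0 = -2.5
```

The zero branch also hints at a second win: with enough zeros, the kernel touches fewer activations entirely, which matters on memory-bandwidth-limited CPUs.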
The BitNet repository implements three kernel families around this property:
| Kernel | Approach | Hardware |
|---|---|---|
| I2_S | Pack 4 ternary weights per byte (2 bits each), SIMD integer additions | Baseline x86 |
| TL1 | Lookup tables to accelerate conditional-add patterns | x86 with AVX2 |
| TL2 | Improved memory access over TL1 | x86 with AVX-512 VNNI |
On ARM, NEON and SVE kernels implement the equivalent approach for Apple Silicon and Graviton. Models ship as GGUF files using a custom 2-bit quantization type embedded in the container format, which means the distribution infrastructure overlaps with the existing llama.cpp ecosystem even though the computational path is different.
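The 2-bit packing that I2_S-style kernels rely on can be sketched as follows. The encoding chosen here (-1 → 0b00, 0 → 0b01, +1 → 0b10, little-endian within the byte) is an illustrative assumption; the actual bit layout in the BitNet repository may differ:

```python
def pack_ternary(weights):
    """Pack 4 ternary weights into each byte, 2 bits per weight."""
    assert len(weights) % 4 == 0
    out = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= (w + 1) << (2 * j)   # map {-1, 0, +1} to {0, 1, 2}
        out.append(byte)
    return bytes(out)

def unpack_ternary(packed, n):
    """Recover the first n ternary weights from the packed bytes."""
    return [((b >> (2 * j)) & 0b11) - 1
            for b in packed for j in range(4)][:n]

ws = [1, -1, 0, 1, 0, 0, -1, 1]
packed = pack_ternary(ws)            # 2 bytes instead of 8 floats
assert unpack_ternary(packed, 8) == ws
```

Four weights per byte is the source of the 0.25 bytes-per-parameter storage figure that the memory numbers below depend on.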
The Performance Numbers
For the 3B-parameter model on an Apple M2, CPU-only inference with the TL2 kernel lands around 40-60 tokens per second single-threaded. A comparably sized GGUF model at Q4_K_M quantization in stock llama.cpp on the same hardware runs at roughly 25-35 tokens per second. Memory at inference: BitNet 3B needs approximately 0.8 GB, compared to 1.9 GB for Q4_K_M GGUF and 6 GB for FP16.
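Those memory numbers check out on the back of an envelope. The ~5 effective bits per weight for Q4_K_M is an assumption on my part (the 4-bit values carry per-block scales, so the effective rate sits above 4); embeddings, activations, and KV cache add overhead on top of all three figures:

```python
params = 3e9                     # 3B parameters
GB = 1e9

ternary = params * 2 / 8 / GB    # 2 bits/weight   -> 0.75 GB
q4_k_m  = params * 5 / 8 / GB    # ~5 bits/weight  -> ~1.9 GB
fp16    = params * 16 / 8 / GB   # 16 bits/weight  -> 6.0 GB
```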
Perplexity comparisons from the paper show BitNet b1.58 at 7B parameters achieving 9.17 on WikiText-2 against 9.34 for full FP16 LLaMA at the same scale. The ternary model slightly outperforms the uncompressed original. Energy consumption is down roughly 71% at 7B scale compared to FP16, attributed almost entirely to eliminating floating-point multiplication from the matmul operations.
The BitNet-b1.58-2B-4T technical report from April 2025 documents a 2B model trained on 4 trillion tokens performing competitively with LLaMA 3.2 and Gemma 3 at the same parameter count. Raspberry Pi 5 benchmarks in that report demonstrate usable inference speeds with no GPU, which is the intended deployment target.
Why Earlier Binarization Failed
Binary and ternary neural networks are not new. Courbariaux et al. showed binary weight networks working for small image classification tasks in 2015. The persistent problem was that aggressive weight binarization at production LLM scales caused quality degradation that no post-processing recovered.
What changed with BitNet b1.58 is the interaction between scale, training data, and the choice of ternary over pure binary weights. The zero value turns out to matter substantially: it gives the model a way to ignore irrelevant activations entirely, which {-1, +1} binary networks cannot do. The original BitNet paper from October 2023 showed that the quality gap between 1-bit and FP16 narrows as model size increases. Larger models trained on more data compensate for the ternary constraint in ways that smaller models cannot, which is why the 2B-4T model, not the 700M-parameter base model, is the most compelling published result.
The Ecosystem Gap
You cannot convert an existing model to BitNet. Every GGUF file on Hugging Face is a full-precision model that was compressed after training. BitNet models must be trained with the ternary constraint active from the beginning. There is no post-training path.
This is the practical limitation the headline numbers do not communicate. The local inference ecosystem runs on post-training quantization precisely because it operates on any trained model as input. Ollama and LM Studio give access to thousands of fine-tuned models, from coding assistants to domain-specific variants, because GPTQ, AWQ, and GGUF quantization require nothing from the original training process. BitNet’s publicly released models are a handful of general-purpose pretrained checkpoints with no instruction tuning.
Fine-tuning a BitNet model also does not map onto standard workflows. LoRA and full fine-tune pipelines for GGUF models assume full-precision weights and standard linear layers. Ternary-aware fine-tuning is possible but requires tooling that has not reached the maturity of frameworks like axolotl or unsloth. The software integration path exists: the BitNet repository builds on llama.cpp’s tokenization, KV cache, sampling, and GGUF container infrastructure. The model availability gap is the binding constraint, and it closes through training compute investment, not through software changes.
The Longer View
A 71% energy reduction is marginal in a datacenter and decisive on a device with a battery. The Raspberry Pi 5 benchmark is not a curiosity; it points at the deployment environments where BitNet’s advantages compound: smartphones, embedded hardware, edge systems where memory bandwidth is limited and there is no GPU to fall back on.
The quality gap between ternary and full-precision also narrows as model size increases, which aligns with the general direction of the field toward larger base models. The technical foundation of BitNet is solid, the published benchmarks are credible, and the hardware trajectory favors low-energy, low-memory inference over time.
The timeline for BitNet becoming practically competitive with the GGUF ecosystem depends on how much training compute moves toward native ternary models. That shift has not happened yet at the scale that would produce a BitNet model library comparable in diversity to what Ollama’s catalog offers today. But the architectural case is clear enough that the question is when, not whether, the gap starts to close.