
BitNet's Ternary Weights and the Limits of Post-Training Quantization

Source: hackernews

Every GGUF file on Hugging Face represents a compromise. Someone trained a model at full precision, then someone else ran it through a quantizer that rounded weights from 16-bit floats down to 4 or 8 bits, accepting a quality penalty in exchange for a smaller file that fits in less RAM. The entire local-inference ecosystem, from llama.cpp to Ollama to LM Studio, is built around consuming these compressed artifacts. Microsoft’s BitNet takes a different approach, and the distinction matters more than the headline numbers suggest.

What Ternary Actually Means

BitNet b1.58, described in the February 2024 paper from Microsoft Research, trains transformer models with weights constrained to three values: {-1, 0, +1}. The name comes from the information-theoretic minimum to represent three states, which is log₂(3) ≈ 1.58 bits per weight. The models are not compressed from a higher-precision baseline; the ternary constraint is enforced during training from the first parameter update.

The mechanism is a custom layer called BitLinear that replaces the standard nn.Linear. During the forward pass, weights are quantized using what the paper calls absmean quantization:

W_q = RoundClip(W / (mean(|W|) + ε), -1, 1)

The gradient flows through this as though the quantization did not happen, via a straight-through estimator, so the optimizer can still update the underlying full-precision weights during training. At inference time, only the ternary values matter. A scale factor per output channel, stored in FP16, handles rescaling after the matmul. Activations are quantized separately to INT8 using per-token absmax quantization.
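A minimal NumPy sketch of the absmean step makes the mechanics concrete. The function name and toy weights here are illustrative, not from the BitNet codebase; the straight-through estimator is noted in a comment rather than implemented with autograd:

```python
import numpy as np

def absmean_quantize(W, eps=1e-6):
    """Ternarize a weight matrix with absmean quantization:
    scale by the mean absolute value, round, clip to {-1, 0, +1}.
    Returns ternary weights plus the scale used to rescale outputs."""
    gamma = np.mean(np.abs(W)) + eps            # absmean scale
    W_q = np.clip(np.round(W / gamma), -1, 1)   # RoundClip to {-1, 0, +1}
    return W_q.astype(np.int8), gamma

# Toy latent full-precision weights, as kept by the optimizer
W = np.array([[0.4, -0.05, -0.7],
              [0.02, 0.9, -0.3]])
W_q, gamma = absmean_quantize(W)
# W_q is now ternary; during training the straight-through estimator
# treats the rounding as identity so gradients update the latent W.
```

At inference only `W_q` and the scale survive; the latent full-precision copy is discarded.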

What Ternary Weights Do to Matrix Multiplication

The computational payoff is significant. A standard matmul multiplies each activation by each weight. When weights can only be {-1, 0, +1}, that multiplication degenerates: multiplying by 1 is a no-op, multiplying by -1 is a negation, and multiplying by 0 contributes nothing. The entire dot product becomes an accumulation of additions and subtractions conditional on weight values.
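A toy scalar version of that degenerate dot product illustrates the point, assuming nothing about the real SIMD kernels:

```python
def ternary_dot(x, w):
    """Dot product where every w is in {-1, 0, +1}: no multiplications,
    only conditional additions and subtractions."""
    acc = 0
    for xi, wi in zip(x, w):
        if wi == 1:
            acc += xi      # multiply by +1 is a no-op
        elif wi == -1:
            acc -= xi      # multiply by -1 is a negation
        # wi == 0 contributes nothing
    return acc

print(ternary_dot([3, 5, 2, 7], [1, -1, 0, 1]))  # 3 - 5 + 7 = 5
```

The production kernels vectorize this pattern rather than branching per element, but the arithmetic they avoid is the same.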

This is what the BitNet inference kernels exploit. The repository includes three kernel families targeting different hardware:

  • I2_S: INT2 symmetric, packs four ternary weights per byte using 2 bits each, uses standard SIMD integer addition
  • TL1: Ternary Lookup 1, uses lookup tables on x86 CPUs (AVX2 and above) to accelerate the conditional-add pattern
  • TL2: Ternary Lookup 2, an improved variant with better memory access patterns; requires AVX-512 VNNI for full performance

On ARM, equivalent NEON and SVE kernels handle the same operation for Apple Silicon and Graviton-class servers.
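The I2_S-style packing can be sketched in a few lines, under the assumption of a simple little-endian 2-bit layout; the actual kernel's bit order and code assignment may differ:

```python
def pack_i2(weights):
    """Pack ternary weights four per byte, 2 bits each.
    Hypothetical encoding: w + 1 maps {-1, 0, +1} to {0, 1, 2}."""
    assert len(weights) % 4 == 0
    out = bytearray()
    for i in range(0, len(weights), 4):
        b = 0
        for j, w in enumerate(weights[i:i + 4]):
            b |= (w + 1) << (2 * j)   # each weight gets its own 2-bit field
        out.append(b)
    return bytes(out)

def unpack_i2(data, n):
    """Recover the first n ternary weights from packed bytes."""
    return [((byte >> (2 * j)) & 0b11) - 1
            for byte in data for j in range(4)][:n]

ws = [1, -1, 0, 1, 0, 0, -1, 1]
assert unpack_i2(pack_i2(ws), len(ws)) == ws
```

Eight weights fit in two bytes, which is where the 2-bits-per-weight storage figure later in the piece comes from.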

The theoretical basis connects to work on binary neural networks going back to Courbariaux et al. (2015), but what has changed is scale. That earlier binarization work showed clear quality degradation at the model sizes people actually deploy. BitNet b1.58 avoids that by training larger models for longer, a regime where the ternary constraint acts more like a regularizer than a hard quality ceiling.

The Numbers

Reported performance for the 3B parameter model on Apple M2, CPU-only inference with the TL2 kernel, comes out to roughly 40-60 tokens per second single-threaded. A comparably-sized GGUF model with Q4_K_M quantization on the same hardware runs at around 25-35 tokens per second. The memory story is more striking: the BitNet 3B model weighs approximately 0.8 GB at inference time, compared to about 1.9 GB for a Q4_K_M GGUF of the same parameter count and roughly 6 GB in FP16.
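The memory figures follow directly from bits per weight. A back-of-the-envelope calculation, treating Q4_K_M as roughly 4.5 effective bits per weight (an approximation that folds in its per-block scales, not an official figure):

```python
params = 3_000_000_000  # 3B parameters

# Approximate effective bits per weight per format; Q4_K_M's 4.5
# is an estimate, and real files carry extra metadata on top.
formats = {"FP16": 16, "Q4_K_M (approx.)": 4.5, "BitNet 2-bit packed": 2}

for name, bits in formats.items():
    gb = params * bits / 8 / 1e9
    print(f"{name}: {gb:.2f} GB")
```

That yields 6.00, 1.69, and 0.75 GB respectively, which lands near the quoted ~6 GB, ~1.9 GB, and ~0.8 GB once file-format overhead and the FP16 scale factors are included.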

The perplexity comparison from the paper puts BitNet b1.58 at 3B parameters very close to full-precision LLaMA at the same scale (9.91 vs 10.0 on WikiText-2), and slightly ahead at 7B (9.17 vs 9.34). The energy reduction reported is around 71% at 7B scale compared to FP16, attributed almost entirely to eliminating the floating-point multiplications that dominate matmul energy consumption.

The April 2025 technical report on BitNet-b1.58-2B-4T showed a 2B model trained on 4 trillion tokens performing competitively with LLaMA 3.2 and Gemma 3 at the same parameter count. The Raspberry Pi 5 can run it at usable speeds, which is a legitimate benchmark for “runs on CPU-only hardware with no discrete GPU required.”

The Ecosystem Problem

All of this comes with a constraint the headline numbers do not communicate clearly. You cannot take an existing model and convert it into BitNet. The ternary weight constraint must be present during training; there is no post-training path. The GGUF ecosystem that gives users access to thousands of fine-tuned LLaMA and Mistral variants, from coding assistants to domain-specific models, is incompatible with BitNet by design.

This is the core tension. GGUF Q4 quantization does degrade quality relative to FP16, sometimes noticeably at smaller scales. But it operates on any model ever trained, including the entire Hugging Face catalog. BitNet matches or exceeds Q4 quality for natively-trained models, at better speed and smaller size, but requires the model to be trained with BitNet from scratch.

The distinction between training-time and post-training quantization also matters for fine-tuning. A BitNet model targeted at a specific domain needs to be fine-tuned with ternary-aware training. Standard LoRA or full fine-tuning workflows for GGUF models do not translate directly, and the tooling to do this well is less mature.

What the Repository Actually Gives You

The BitNet repository is built on top of llama.cpp, sharing its tokenization, KV cache management, and sampling infrastructure. Models are distributed in GGUF format using a custom 2-bit quantization type within the GGUF container. The build system is CMake, and setup scripts automate environment preparation and model downloads.

The publicly released models, including bitnet_b1_58-large at ~700M parameters, bitnet_b1_58-3B, and BitNet-b1.58-2B-4T, are general-purpose pretrained models demonstrating the quality ceiling. They are not fine-tuned for instruction following or specific tasks, which limits their immediate utility for most end-user applications.

For BitNet to displace GGUF in practice, the community needs to develop BitNet-native fine-tunes across the same variety of use cases that GGUF models currently cover. That work is underway but slow relative to the existing ecosystem, which benefits from years of tooling and collective effort.

Where This Points

The hardware trajectory favors BitNet’s approach over time. As LLM inference moves toward edge devices, phones, and embedded systems with limited or no GPU, the energy and memory reduction from eliminating floating-point multiplication becomes more valuable than it is on a workstation. A 71% energy reduction is a rounding error on a cloud GPU cluster; it is the difference between a battery-powered deployment being feasible or not.

The “1.58-bit” naming is technically defensible on information-theoretic grounds but worth clarifying: storage in practice is 2 bits per weight, because the common kernels give each ternary value its own 2-bit field rather than packing weights in base 3. An ideally-packed ternary representation could approach the 1.58-bit figure, though the gap between 1.58 and 2 is narrow enough that it barely changes the memory math.
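The gap can be made concrete. Five ternary values fit in a single byte in base 3 (since 3⁵ = 243 ≤ 256), giving 1.6 bits per weight against the 2.0 of the straightforward encoding; this is a worked sketch, not any shipped kernel's format:

```python
import math

# Ideal information content of one ternary weight
print(math.log2(3))            # ≈ 1.585 bits

def pack5(trits):
    """Encode five ternary values (-1, 0, +1) into one byte in base 3.
    Hypothetical packing: achieves 8/5 = 1.6 bits per weight."""
    assert len(trits) == 5
    code = 0
    for t in reversed(trits):
        code = code * 3 + (t + 1)   # map {-1, 0, +1} to base-3 digits
    return code                     # always in 0..242, fits one byte

assert pack5([1, -1, 0, 1, 0]) < 256
```

Real kernels take the 2-bit layout anyway because byte-aligned fields are far cheaper to decode with SIMD than base-3 arithmetic.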

Whether BitNet’s approach reshapes local inference depends on whether enough training compute gets directed at producing well-trained, task-specific BitNet models. The technical argument for ternary-from-training is solid. The practical argument depends on collective investment that has not yet materialized at the scale the GGUF ecosystem represents. The gap between the number of usable GGUF models and usable BitNet models today is orders of magnitude, and no kernel optimization closes that.
