Why 100 Billion Parameters on a CPU Finally Makes Sense

Source: hackernews

Microsoft’s BitNet has been circulating for a while, but the 100B parameter benchmark landed with enough force to get the Hacker News crowd’s attention again. The headline is that a 100-billion-parameter language model can run on a CPU, in roughly 12 to 15 GB of RAM, at speeds that are actually usable. Before writing that off as marketing, it’s worth understanding why the claim holds up, and what makes BitNet architecturally different from every other quantized model you’ve run locally.

The Standard Quantization Story

Most local LLM deployment today runs on some variant of post-training quantization. You take a trained FP16 model, run a tool like GPTQ, AWQ, or llama.cpp’s GGUF conversion, and compress the weights to 4-bit or 8-bit integers. The quality loss is manageable at 4-bit and above. Below that, things degrade quickly.

The problem is structural. Post-training quantization (PTQ) treats a trained model as a fixed artifact and tries to approximate its weights with fewer bits. The model never learned to be a quantized model; you’re forcing it into a representation it wasn’t designed for. At 4-bit you lose some quality. At 2-bit, with Q2_K in llama.cpp, the degradation is significant enough that most people avoid it for anything serious. Below that, PTQ falls apart entirely.
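The cliff is easy to demonstrate. The toy sketch below round-trips Gaussian weights through symmetric per-tensor absmax quantization at several bit widths and compares the relative reconstruction error; it is a crude stand-in for GPTQ/AWQ, which are smarter in practice but hit the same representational wall at very low bit counts.

```python
import numpy as np

def ptq_relative_error(W, bits):
    """Round-trip W through symmetric absmax quantization at `bits` bits
    and return the relative reconstruction error."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    W_hat = np.clip(np.round(W / scale), -qmax, qmax) * scale
    return np.linalg.norm(W - W_hat) / np.linalg.norm(W)

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))          # Gaussian-ish stand-in for trained weights

err8 = ptq_relative_error(W, 8)
err4 = ptq_relative_error(W, 4)
err2 = ptq_relative_error(W, 2)          # error jumps sharply below 4 bits
```

At 2 bits the quantizer has only {-1, 0, +1} levels per scale, so almost all of the weight mass collapses to zero and the reconstruction error approaches the norm of the tensor itself.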

BitNet sidesteps this entirely by doing something different: it trains with quantization baked in from the start. Every forward pass during training uses ternary weights. The model learns, from scratch, to function with those constraints.

How the Ternary Scheme Works

The paper is titled “1.58 bits” because log₂(3) ≈ 1.58, and the weights are constrained to exactly three values: {-1, 0, +1}. During training, full-precision weights are maintained as a latent copy for gradient computation. On each forward pass, those weights are quantized via absmean quantization:

α = mean(|W|)
W_q = RoundClip(W / α, -1, 1)

where RoundClip rounds to the nearest integer and clips to {-1, 0, +1}. The scalar α is a per-tensor scale factor stored alongside the weights. Activations are quantized separately to 8-bit integers using per-token absmax scaling.
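The absmean step above is small enough to sketch directly. This is an illustrative NumPy version, not the training code, with a companion per-token absmax activation quantizer as described:

```python
import numpy as np

def absmean_quantize(W):
    """Ternarize a weight tensor with absmean scaling: alpha = mean(|W|),
    W_q = RoundClip(W / alpha, -1, 1). Returns W_q in {-1, 0, +1} and alpha,
    so that alpha * W_q approximates W."""
    alpha = np.mean(np.abs(W))                      # per-tensor scale
    W_q = np.clip(np.round(W / alpha), -1, 1)       # RoundClip to {-1, 0, +1}
    return W_q.astype(np.int8), alpha

def absmax_quantize_activations(x, bits=8):
    """Per-token absmax quantization of activations to signed 8-bit."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit
    beta = np.max(np.abs(x), axis=-1, keepdims=True) / qmax
    x_q = np.clip(np.round(x / beta), -qmax, qmax).astype(np.int8)
    return x_q, beta

# Small weights round to 0, larger ones saturate to +/-1
W = np.array([[0.8, -0.05, -1.2], [0.02, 0.6, -0.4]])
W_q, alpha = absmean_quantize(W)
x_q, beta = absmax_quantize_activations(np.array([[1.0, -2.0, 0.5]]))
```

Note how the single scalar α is all that survives of the original weight magnitudes; everything else is sign and sparsity.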

The straight-through estimator handles gradients through the non-differentiable quantization step. The optimizer updates the full-precision latent weights; the quantized weights are only used in the forward pass. This is standard quantization-aware training (QAT), applied consistently from the first training step.
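A minimal sketch of that training step, with the caveat that real training uses autograd rather than a hand-written gradient; the function name and the simple SGD update here are illustrative assumptions:

```python
import numpy as np

def ste_training_step(W_latent, x, grad_y, lr=0.01):
    """One hypothetical QAT step with a straight-through estimator.

    Forward uses the ternary weights; backward treats the quantizer as the
    identity, so the gradient computed against W_q is applied directly to
    the full-precision latent weights."""
    alpha = np.mean(np.abs(W_latent))
    W_q = np.clip(np.round(W_latent / alpha), -1, 1)  # forward: ternary weights
    y = alpha * (W_q @ x)                             # quantized forward pass
    grad_W = np.outer(grad_y, x)                      # dL/dW_q, passed straight through
    return W_latent - lr * grad_W, y                  # optimizer updates latents only

W_latent = np.array([[0.5, -0.5]])
W_next, y = ste_training_step(W_latent, np.array([1.0, 2.0]), np.array([1.0]))
```

The key property: the latent weights drift continuously under gradient updates, and the ternary snapshot only flips a weight between {-1, 0, +1} once the latent value crosses a threshold.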

The result is a model that has internalized the ternary constraint. About 30 to 50 percent of weights converge to zero, which the model uses for sparsity, gating, and ignoring irrelevant inputs. The non-zero weights are {-1, +1}, which contribute the pure additions and subtractions the model relies on for everything else.

The Multiply-Free Matmul

Here is where the practical performance story comes from. A standard linear layer computes y = Wx, which for a full-precision model requires N×M floating-point multiply-accumulate operations. On modern hardware, this is the dominant cost.

For a BitNet b1.58 layer:

y_i = (Σ x_j where W_q[i,j]=+1) - (Σ x_j where W_q[i,j]=-1)

The inner sum is purely addition and subtraction. The only multiplications are two scalar scale factors per output element (the weight scale α and the activation scale β), applied after the sum. For a 3B model’s 4096-dimensional linear layers, that’s 4096 multiply-accumulates replaced with 4096 additions or subtractions per output element.
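The equation above translates into code almost verbatim. A reference sketch (the real kernels are vectorized integer code, not a Python loop):

```python
import numpy as np

def ternary_matvec(W_q, x, alpha, beta=1.0):
    """y_i = alpha * beta * (sum of x_j where W_q[i,j]=+1
                             minus sum of x_j where W_q[i,j]=-1)."""
    y = np.empty(W_q.shape[0])
    for i in range(W_q.shape[0]):
        # the inner "matmul" is pure addition and subtraction
        y[i] = x[W_q[i] == 1].sum() - x[W_q[i] == -1].sum()
    return alpha * beta * y          # the only multiplications: the two scales

W_q = np.array([[1, 0, -1], [0, 1, -1]], dtype=np.int8)
x = np.array([1.0, 2.0, 3.0])
y = ternary_matvec(W_q, x, alpha=0.5)   # identical to alpha * (W_q @ x)
```

Verifying against the dense form `alpha * (W_q @ x)` confirms the two paths agree; the multiply-free version simply never materializes the products.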

On CPUs, this translates directly. Modern CPUs are memory-bandwidth-bound during LLM inference, not compute-bound. The transformer’s attention and FFN layers spend most of their time loading weight matrices from RAM into cache, then performing multiply-accumulate operations. BitNet reduces both the memory footprint (ternary packing achieves roughly 11x compression vs FP16) and the arithmetic cost. The actual kernels in bitnet.cpp use lookup-table based decoding with ARM NEON’s vtbl or x86 AVX2’s _mm256_shuffle_epi8 to process 8 ternary weights at a time, with no floating-point in the hot path.
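To make the lookup-table idea concrete, here is a hypothetical 2-bit packing scheme (00 → 0, 01 → +1, 10 → −1), four ternary weights per byte, decoded through a precomputed 256-entry table. The encoding and table here are illustrative assumptions; bitnet.cpp’s actual i2_s layout and SIMD shuffle kernels differ in detail:

```python
import numpy as np

# Assumed 2-bit code: 00 -> 0, 01 -> +1, 10 -> -1 (11 unused, mapped to 0)
DECODE = {0b00: 0, 0b01: 1, 0b10: -1, 0b11: 0}

# Every possible packed byte -> its four decoded ternary weights.
# A SIMD byte shuffle performs the same table lookup 16 or 32 bytes at a time.
LUT = np.array([[DECODE[(b >> (2 * k)) & 0b11] for k in range(4)]
                for b in range(256)], dtype=np.int8)

def pack4(w):
    """Pack four ternary weights {-1, 0, +1} into one byte."""
    code = {0: 0b00, 1: 0b01, -1: 0b10}
    byte = 0
    for k, v in enumerate(w):
        byte |= code[v] << (2 * k)
    return byte

packed = pack4([1, 0, -1, 1])
unpacked = LUT[packed]    # one table lookup decodes four weights at once
```

Four weights per byte is the source of the roughly 8x raw packing over FP16’s two bytes per weight; the ~11x figure quoted above presumably folds in denser sub-2-bit packing and format overhead.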

The Numbers

For the 3B model, which has been publicly released as microsoft/bitnet-b1.58-3B, the practical results on consumer hardware are:

  • Apple M2 Pro: approximately 45 to 55 tokens per second (single thread)
  • Intel Core i9-13900K: approximately 28 to 38 tokens per second
  • Raspberry Pi 5: approximately 5 to 8 tokens per second
  • RAM required: roughly 400 to 500 MB

For comparison, running a LLaMA 3B at FP16 on the same M2 Pro produces around 10 to 15 tokens per second and requires 6 GB of RAM. Even llama.cpp’s Q4_0 GGUF format, at 1.8 GB, runs noticeably slower than the ternary model because Q4 inference still performs integer multiply-accumulates, while BitNet’s kernels avoid multiplications entirely.

At 100B scale, the memory footprint lands around 12 to 15 GB, compared to roughly 200 GB for FP16. A machine with 32 GB of RAM can run it. The inference speed on a high-core-count server CPU (AWS Graviton3 at 64 cores, for instance) scales well because the all-add arithmetic is trivially parallelizable and the reduced memory bandwidth requirement means more cores can stay fed.

The Training Requirement

The limitation that comes up every time this project is discussed is the unavoidable one: you cannot BitNet-quantize an existing model. There is no BitNet conversion script for LLaMA, Mistral, or Gemma. Post-training quantization to ternary destroys quality because those models were never designed to work with three-valued weights. You have to train from scratch, with QAT, for the full training run.

This is a genuine constraint. Training a 3B model from scratch requires significant compute. Training a 100B model requires the kind of infrastructure that Microsoft Research has and most organizations do not. The 100B weights have not been released publicly, and whether they will be is an open question.

What this means in practice is that BitNet is currently a research result and a demonstration, more than an ecosystem. The GGUF/llama.cpp approach, despite its inferior compression ratios, works on every model that anyone has already trained. That’s a powerful advantage that quantization-aware approaches can’t easily overcome until there are large-scale BitNet-trained models available.

That said, the 2B model trained on 4T tokens (bitnet-b1.58-2B4T) is publicly available, and it runs well. The community has also produced BitNet-trained variants at 700M and 8B parameter scales. The ecosystem is small but growing, and the inference infrastructure in bitnet.cpp is mature enough to be practical.

What Changes at 100B

The scaling result is the part worth paying attention to. With post-training quantization approaches, quality degradation at extreme compression typically worsens as models get larger, because larger models have more complex weight distributions that are harder to approximate at low bit counts.

BitNet’s QAT approach does the opposite. At larger scales, the model has more capacity to adapt to the ternary constraint during training. The 100B result reports performance comparable to LLaMA-3 70B on standard benchmarks (MMLU, HellaSwag, ARC), despite the ternary weights. The performance gap vs FP16, which is visible at small scales, essentially disappears above 70B parameters.

This scaling behavior is what makes the project interesting beyond the current state of the weights. If you can train a 100B model that matches FP16 quality, runs in 15 GB of RAM at reasonable speed, and requires only a competent multi-core CPU to serve, you’ve changed the economics of inference for large models significantly. Serving a 100B BitNet model costs about as much as serving a 7B GGUF model in terms of hardware.

How to Run It Today

For the models that are available, setup is straightforward:

# Clone and build bitnet.cpp (GGML_NATIVE=OFF for a portable build)
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
pip install -r requirements.txt
cmake -B build -DGGML_NATIVE=OFF
cmake --build build --config Release

# Download the 2B model and convert it to the i2_s ternary GGUF format
huggingface-cli download microsoft/bitnet-b1.58-2B4T --local-dir models/bitnet-2b
python setup_env.py -md models/bitnet-2b -q i2_s

# Run inference with the llama.cpp-style CLI
./build/bin/llama-cli -m models/bitnet-2b/ggml-model-i2_s.gguf -p "What is the capital of France?" -n 128

The i2_s quantization format is the standard 2-bit-per-weight ternary packing used by bitnet.cpp. The conversion step handles the GGUF format specifics. From there it behaves like any llama.cpp-based inference, with the same basic sampling parameters and prompt formatting.

The Larger Point

BitNet is not a drop-in replacement for the existing local LLM ecosystem, because it requires training from scratch and the trained models are still limited in availability. But the underlying result is technically sound in a way that most “run LLMs locally” approaches are not. The multiply-free inference is not a compression trick; it’s a consequence of training models that fundamentally operate differently. The scaling behavior suggests that as more organizations can afford to train at 70B-plus parameter counts, ternary models might become the standard deployment target for edge inference.

For now, the 2B and 3B public models are worth running if CPU inference speed matters to you. The RAM requirements are so low that these models are practical on hardware that would struggle with any GGUF alternative, and the speed is competitive with 4-bit quantized models that are several times larger. The 100B benchmark is the headline, but the architecture behind the small models is the part that will matter over time.
