
Two Ways to Fit a 70B Model Into Your Laptop, and Why the Method Matters

Source: simonwillison

Running a large language model locally has gone from impossible to routine in about three years. The mechanism behind that shift is quantization, and Simon Willison’s recent ground-up walkthrough covers how the arithmetic works. What it does not address is the deeper question: there are now two fundamentally different strategies for achieving low-bit inference, and they represent different bets about the future of the field.

One strategy retrofits precision reduction onto models that were trained without it in mind. The other builds quantization constraints into training from the beginning. Both work. They work for different reasons, and the gap between them matters more as models get larger.

What Post-Training Quantization Is Actually Doing

The dominant approach in local LLM inference today is post-training quantization, or PTQ. You take a model trained at 16-bit or 32-bit precision and compress its weights after the fact to 8-bit, 4-bit, or lower. The formats and methods that llama.cpp, Ollama, and the Hugging Face ecosystem rely on all do this: GGUF, GPTQ, AWQ.

The core operation is affine quantization. A float32 weight x maps to an integer as:

quantized = round(x / scale) + zero_point
reconstructed = (quantized - zero_point) * scale

For INT8, the 256 available integer levels cover whatever range your weights span. For INT4, you have 16 levels. The scale factor is computed over a block of weights rather than the entire tensor, which is why formats like GGUF’s Q4_K_M report slightly more than 4 bits per weight in practice: the per-block scale values add overhead. A 64-weight block with an FP32 scale runs about 4.5 bits effective; 128-weight blocks bring it down to about 4.25.
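The block-wise arithmetic can be sketched in a few lines of plain Python. This is an illustration of the affine formula above applied to one block, not the packing that GGUF actually uses; the function names, the min/max choice of range, and the 32-bit scale size are assumptions made for the example.

```python
import random

def quantize_block(block, bits):
    """Affine-quantize one block of floats to unsigned `bits`-bit integers."""
    levels = (1 << bits) - 1              # 15 for 4-bit, 255 for 8-bit
    lo, hi = min(block), max(block)
    # Fallback of 1.0 only avoids division by zero for a constant block.
    scale = (hi - lo) / levels or 1.0
    zero_point = round(-lo / scale)
    # Same formula as in the text, plus a clamp into the representable range.
    q = [min(levels, max(0, round(x / scale) + zero_point)) for x in block]
    return q, scale, zero_point

def dequantize_block(q, scale, zero_point):
    """Reconstruct approximate floats from the integer codes."""
    return [(v - zero_point) * scale for v in q]

def effective_bits(bits, block_size, scale_bits=32):
    """Bits per weight once the per-block scale is counted."""
    return bits + scale_bits / block_size

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]   # one 64-weight block
q, scale, zp = quantize_block(weights, bits=4)
restored = dequantize_block(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max error {max_err:.5f} vs half a step {scale / 2:.5f}")
print(effective_bits(4, 64), effective_bits(4, 128))     # 4.5 4.25
```

The reconstruction error stays at roughly half a quantization step, which is why a smaller block (and therefore a tighter range per scale) buys accuracy at the cost of extra overhead bits.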

The reason PTQ works at all comes from a property of overparameterized models: most of the precision in a trained weight is redundant. A weight that carried a value of 0.314159 during training could be stored as 0.31 at inference without the model noticing, because the network learned to distribute information across millions of parameters. Rounding introduces error, but the error is small relative to the signal, and the network degrades gracefully rather than catastrophically.

The perplexity data from the llama.cpp project makes this quantitative. For a 7B parameter model on WikiText-2:

Format    Bits/weight   Perplexity
FP16      16.0          5.68
Q8_0      8.5           5.69
Q4_K_M    4.8           5.90
Q3_K_M    3.9           6.50
Q2_K      2.6           7.80

From FP16 to Q8_0 is nearly free. From FP16 to Q4_K_M costs about 0.2 perplexity points while cutting memory from 14 GB to 4.1 GB. The quality cliff sits between Q3 and Q4 formats, and below that the degradation accelerates sharply. These numbers are model-size-dependent: larger models tolerate the same quantization with less penalty because they have more redundancy to absorb the error.
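The memory figures follow from simple arithmetic on bits per weight. The sketch below assumes exactly 7 billion parameters, decimal gigabytes, and weights only (no KV cache or runtime overhead); `weight_memory_gb` is a name made up for this example.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return n_params * bits_per_weight / 8 / 1e9

# Bits-per-weight figures taken from the table above.
formats = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 2.6}
for name, bpw in formats.items():
    print(f"{name:8s} {weight_memory_gb(7e9, bpw):5.1f} GB")
```

Under these assumptions FP16 comes out at 14.0 GB and Q4_K_M at about 4.2 GB. Small gaps against published file sizes are expected, since real parameter counts are not exactly 7 billion and quantized files typically mix tensor types.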

The ecosystem built on PTQ is mature. Formats like Q5_K_M and Q6_K occupy the space between Q4 and Q8 for users with more memory headroom. Schemes like GPTQ and AWQ use calibration data to minimize quantization error more intelligently than naive rounding: GPTQ computes per-layer Hessian information to route quantization error toward less-sensitive weights; AWQ identifies the roughly 1% of weights corresponding to large activation channels and scales them before quantization to preserve their representational fidelity. Both meaningfully outperform block-wise round-to-nearest at 4-bit and especially at 3-bit.

But PTQ has a ceiling. It is compressing models that were not designed to be compressed, and every technique in the stack is trying to make the best of a mismatch between the training objective (minimize loss over a continuous parameter space) and the inference constraint (represent those parameters in 4 bits).

Training-Time Quantization: A Different Starting Point

Microsoft’s BitNet b1.58 and its subsequent 2B-4T variant represent the other approach. Instead of compressing after training, BitNet b1.58 constrains weights to ternary values during training: each weight must be in {-1, 0, +1}. The quantization constraint is baked into the forward pass from the first gradient update.

The technical mechanism uses absmean quantization with a straight-through estimator for gradients:

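import torch

def ternary_quantize(W: torch.Tensor, eps: float = 1e-5):
    # gamma: mean absolute value over the full weight matrix (one scale per matrix)
    gamma = W.abs().mean().clamp(min=eps)  # eps guards against division by zero
    # Scale, round to the nearest integer, then clip into {-1, 0, +1}
    W_q = (W / gamma).round().clamp(-1, 1)
    return W_q, gamma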

During training, gradients flow through round() as if it were the identity function. The weights accumulate gradient signal in full precision; only the quantized values participate in the forward pass. At inference time, the full-precision master copy is discarded. What remains is a matrix of {-1, 0, +1} with a single float16 scale per weight matrix, the gamma returned by the function above.
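To make the straight-through estimator concrete, here is a toy version in plain Python with the gradient written out by hand. It is a sketch under simplifying assumptions, not BitNet's training code: a single three-weight neuron, squared-error loss, gamma treated as a constant in the backward pass, and helper names (`absmean_ternary`, `ste_step`) invented for the example.

```python
def absmean_ternary(weights, eps=1e-8):
    """Return ternary codes in {-1, 0, +1} and the absmean scale gamma."""
    gamma = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / gamma))) for w in weights]
    return codes, gamma

def ste_step(latent, x, target, lr=0.05):
    """One SGD step on squared error using the straight-through estimator."""
    codes, gamma = absmean_ternary(latent)
    # Forward pass: only the quantized weights (gamma * code) are used.
    y = sum(gamma * c * xi for c, xi in zip(codes, x))
    err = y - target
    # Backward pass: d(loss)/d(w_q) is 2 * err * x_i. The STE applies that
    # gradient to the full-precision latent weight, as if round() were the
    # identity (gamma is held constant here, a simplification).
    new_latent = [w - lr * 2 * err * xi for w, xi in zip(latent, x)]
    return new_latent, err * err

latent = [0.30, -0.05, 0.20]          # full-precision master weights
x, target = [1.0, 2.0, -1.0], 0.5
for step in range(20):
    latent, loss = ste_step(latent, x, target)
print(absmean_ternary(latent)[0], loss)
```

The latent weights drift continuously while the codes only change when a weight crosses a rounding boundary, which is the behavior the paragraph above describes.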

This has a consequence that PTQ cannot achieve: matrix multiplication degenerates into additions and subtractions. Apart from applying the scale, the weight product needs no floating-point multiplies at inference. The compute density difference is substantial on hardware that supports it. On an Apple M2 CPU using bitnet.cpp’s ARM kernel (TL1; the TL2 kernel targets x86 CPUs), BitNet b1.58 at 3B parameters runs at roughly 40-60 tokens per second where an equivalent Q4_K_M GGUF model runs at 25-35. Memory is 0.8 GB versus 1.9 GB at comparable parameter count. Microsoft reports approximately 71% energy reduction relative to FP16 at 7B scale.
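The "additions and subtractions" claim is easy to see in a naive matrix-vector product. The sketch below is illustrative only: real kernels pack the ternary codes into a few bits each and use lookup tables or SIMD, and `ternary_matvec` is a hypothetical name.

```python
def ternary_matvec(codes, gamma, x):
    """Compute y = gamma * (codes @ x) where every code is -1, 0, or +1."""
    out = []
    for row in codes:
        acc = 0.0
        for c, xi in zip(row, x):
            if c == 1:
                acc += xi        # +1 weight: add the activation
            elif c == -1:
                acc -= xi        # -1 weight: subtract it
            # 0 weight: contributes nothing, so it is skipped
        out.append(gamma * acc)  # the only multiply: one scale per output
    return out

codes = [[1, 0, -1],
         [-1, 1, 1]]
print(ternary_matvec(codes, 0.5, [2.0, 3.0, 4.0]))  # [-1.0, 2.5]
```

The inner loop contains no multiplications at all; the scale is applied once per output element rather than once per weight.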

The quality story is more nuanced. BitNet b1.58 3B achieves a perplexity of 9.91 on a standard benchmark where a full-precision model at comparable scale reaches around 10.0. For this to be a useful comparison it needs context: the BitNet model was trained with its quantization constraint in place from scratch, which means the network learned to represent its knowledge within ternary weights. The result is that quality is competitive with, not simply degraded from, a higher-precision model. The parameter count is not interchangeable with a PTQ model’s parameter count because the information capacity per parameter is lower.

The Trade-offs Are Real on Both Sides

PTQ’s advantage is flexibility. Any model produced by any lab can be quantized after the fact. The open model ecosystem, from Llama variants to Mistral to Phi, is immediately quantizable. Users can choose their precision tradeoff at deployment time. A model released today can be running on a 4-bit quantized GGUF on a MacBook by tomorrow.

BitNet’s constraint is structural. It cannot be applied to existing pretrained models. Converting a Llama 3 model to ternary weights after training is not meaningful: the network learned to use floating-point representations, and imposing ternary constraints retroactively is not quantization, it is destruction. Any BitNet model requires training from scratch with the ternary constraint, which means the compute cost of producing one is the same as training any large model. That limits production-scale BitNet models to organizations that can afford that training budget.

The file format incompatibility also matters practically. GGUF and the llama.cpp ecosystem have no native representation for ternary weights. Running BitNet requires BitNet’s own kernels. The tooling is improving rapidly, but it is not the same level of integration that GGUF models enjoy in Ollama, LM Studio, or Jan.

Where This Leaves Local Inference

For the next few years, post-training quantization remains the practical answer for running large models locally. The formats have converged around GGUF with Q4_K_M as the sensible default when memory is the binding constraint, and Q5_K_M or Q6_K when you have headroom. The 2B-4T BitNet model from Microsoft is genuinely interesting as a demonstration that ternary-weight models can train to useful quality and that the inference efficiency gains are real, but it is not yet replacing PTQ workflows for the models most people want to run.

The longer-term question is whether the next generation of foundation models gets trained with quantization constraints in mind. The efficiency case for doing so is compelling: a model designed for INT8 or INT2 inference from the first training step can reach acceptable quality with significantly fewer parameters, and the inference cost reduction at deployment scale is not marginal. If model providers start releasing ternary or binary-weight models alongside their standard checkpoints, the tooling will follow. Until then, GGUF and its companions are doing genuinely good work compressing models that were never designed to be compressed.
