Scale Factors, Zero Points, and the Design Decisions Behind Q4_K_M
Source: simonwillison
Simon Willison published a ground-up explainer on quantization this week, and it is worth using as a launching point to go deeper on some of the design decisions that make modern LLM quantization work as well as it does. The practical result, Q4_K_M becoming the de facto format for local inference, is not obvious from the outside. Getting there required solving some non-trivial problems about how to represent floating-point distributions in integer form without destroying model quality.
What You Are Actually Doing When You Quantize
A language model trained in fp32 or bf16 stores each weight as a 32- or 16-bit floating-point number. Quantization maps those floats to integers, typically 8-bit or 4-bit, so the model occupies less memory. A 7-billion-parameter model in fp16 requires roughly 14 GB of RAM. At 4-bit integer precision, that same model fits in about 4.3 GB, a 3.25x reduction that turns a 24 GB workstation GPU into a machine that can run a reasonably capable model.
The mapping is defined by two parameters per quantization group: a scale and a zero point. For a block of weights with float range [x_min, x_max] being mapped to an integer range [q_min, q_max]:
scale = (x_max - x_min) / (q_max - q_min)
zero_point = round(q_min - x_min / scale)
x_int = round(x_float / scale) + zero_point
x_float ≈ (x_int - zero_point) * scale
This is affine (asymmetric) quantization. The zero point lets the integer range shift to match the actual distribution of the weights, rather than forcing the midpoint to correspond to zero. For symmetric quantization, you drop the zero point and force the range to be centered at zero, which is simpler and marginally faster at inference but wastes some integer range for skewed distributions.
The reconstruction is lossy. A 4-bit integer can take only 16 distinct values. When you dequantize back to float, every weight snaps to one of those 16 levels, introducing quantization error. How much error, and where it matters, determines whether the model stays usable.
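The four formulas above translate almost line for line into code. The following is a minimal sketch in plain Python, assuming an unsigned 4-bit target range of 0 to 15; the function names and sample weights are made up for illustration and are not taken from any particular library.

```python
def affine_params(x_min, x_max, q_min=0, q_max=15):
    """Scale and zero point that map [x_min, x_max] onto [q_min, q_max]."""
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = round(q_min - x_min / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, q_min=0, q_max=15):
    """Float -> integer level, clamped to the representable range."""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """Integer level -> approximate float."""
    return (q - zero_point) * scale


if __name__ == "__main__":
    # Hypothetical block of weights, skewed toward positive values.
    weights = [-0.30, -0.12, 0.00, 0.05, 0.21, 0.45, 0.90]
    scale, zp = affine_params(min(weights), max(weights))
    print(f"scale={scale:.4f} zero_point={zp}")
    for w in weights:
        q = quantize(w, scale, zp)
        w_hat = dequantize(q, scale, zp)
        print(f"{w:+.3f} -> level {q:2d} -> {w_hat:+.3f} (error {abs(w - w_hat):.3f})")
```

For weights inside the original range, the round-trip error never exceeds half a step, that is, scale / 2. That bound is why the size of the scale, and therefore the width of the range it has to cover, matters so much in the sections that follow.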
Why Per-Tensor Quantization Does Not Work
The naive approach is to compute one scale and one zero point for the entire weight matrix. This fails in practice for two reasons.
First, individual weight matrices have uneven distributions. Some have outlier values that are 10-20x larger than the typical weight. A single scale computed over the full matrix must accommodate that outlier, compressing all the typical values into a narrow band of integer levels. The quantization error on those typical weights, which represent the vast majority of the matrix, becomes large.
Second, and more important for LLMs, is the problem of activation outliers documented in the LLM.int8() paper by Dettmers et al. (2022). In models above roughly 6.7 billion parameters, a small number of activation channels develop values that are orders of magnitude larger than the rest. Quantizing matrices that interact with those channels using per-tensor statistics produces catastrophic quality loss rather than graceful degradation.
The solution is granularity: instead of one scale per tensor, compute separate scales for small groups of weights.
Per-Group Quantization and the K-Quant Design
The standard approach in GGUF (the format used by llama.cpp and Ollama) uses groups of 32 or 256 weights. Each group has its own scale, so the quantization can track local variations in the weight distribution rather than being dominated by distant outliers.
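To see why group size matters, here is a small experiment on synthetic data: 256 Gaussian weights with one outlier planted about 20x above the typical magnitude. This is an illustrative sketch using the affine scheme described earlier, not the actual GGUF kernels, and the weights, seed, and helper names are all invented for the demo.

```python
import random


def roundtrip(block, levels=16):
    """Affine-quantize a block to `levels` integer values, then dequantize."""
    lo, hi = min(block), max(block)
    if hi == lo:
        return list(block)
    scale = (hi - lo) / (levels - 1)
    zero_point = round(-lo / scale)
    out = []
    for x in block:
        q = max(0, min(levels - 1, round(x / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out


def mean_abs_error(weights, group_size):
    """Average round-trip error when each group gets its own scale."""
    total = 0.0
    for start in range(0, len(weights), group_size):
        block = weights[start:start + group_size]
        total += sum(abs(a - b) for a, b in zip(block, roundtrip(block)))
    return total / len(weights)


def demo_weights():
    """Synthetic weights: small Gaussian values plus a single large outlier."""
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 0.02) for _ in range(256)]
    weights[17] = 0.4
    return weights


if __name__ == "__main__":
    w = demo_weights()
    print("one scale for all 256 weights:", mean_abs_error(w, 256))
    print("one scale per 32 weights:     ", mean_abs_error(w, 32))
```

With a single scale, the outlier stretches the step size for every weight. With groups of 32, only the one group that contains the outlier pays that price, and the other seven keep a fine-grained scale.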
The K-quant variants (Q4_K_M, Q5_K_M, Q6_K, etc.) go further with a hierarchical scheme. A super-block of 256 weights is divided into smaller blocks, each with its own scale, and the block scales themselves are quantized at 6-bit precision. The super-block scale provides coarse range information; the per-block scales refine it. This captures more of the weight distribution for a modest increase in storage overhead.
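The storage overhead of that hierarchy is easy to tally. The sketch below assumes the Q4_K layout as I understand it from llama.cpp: eight blocks of 32 weights per super-block, a 6-bit scale and a 6-bit minimum per block, and two fp16 factors per super-block. Treat those layout details as assumptions that go beyond what is stated above. The comparison scheme is a flat layout that stores an fp16 scale and an fp16 minimum for every 32 weights.

```python
def bits_per_weight(n_weights, weight_bits, metadata_bits):
    """Effective bits per weight once scale metadata is included."""
    return (n_weights * weight_bits + metadata_bits) / n_weights


# Flat scheme: every block of 32 weights carries an fp16 scale and an fp16 minimum.
flat = bits_per_weight(32, 4, 16 + 16)

# Hierarchical scheme (assumed Q4_K layout): 8 blocks per 256-weight super-block,
# each with a 6-bit scale and a 6-bit minimum, plus two fp16 super-block factors.
hierarchical = bits_per_weight(256, 4, 8 * (6 + 6) + 2 * 16)

if __name__ == "__main__":
    print(f"flat fp16 scale + min per 32 weights: {flat} bits/weight")          # 5.0
    print(f"hierarchical 6-bit scales:            {hierarchical} bits/weight")  # 4.5
```

Under these assumptions the hierarchical layout stores the same per-block information for half a bit less per weight. Files labelled Q4_K_M come out above 4.5 bits because, as I understand the llama.cpp mixes, some tensors are kept in a higher-precision type.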
Q4_K_M works out to about 4.85 bits per weight rather than a clean 4.00, once you account for the stored scales and for the tensors that the _M mix keeps at higher precision. Those extra 0.85 bits buy considerably better accuracy than a naive 4-bit scheme would give. Community benchmarks on Llama-2 7B, measured as WikiText-2 perplexity, tell the story:
| Format | Bits/weight | PPL delta vs fp16 |
|---|---|---|
| Q8_0 | 8.5 | +0.01 |
| Q6_K | 6.6 | +0.06 |
| Q5_K_M | 5.7 | +0.10 |
| Q4_K_M | 4.85 | +0.16 |
| Q3_K_M | 3.91 | +0.58 |
| Q2_K | 3.35 | +1.25 |
Q8_0 is essentially lossless; the 0.01 perplexity difference is within measurement noise. Q4_K_M, at a 3.25x memory reduction, costs only 0.16 perplexity points. Below Q4, quality degrades quickly. The drop from Q4 to Q3 is far steeper than the gradual slope above it, which is why Q4_K_M became the practical floor for serious use.
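One way to make the cliff concrete is to compute how much perplexity each step down the table costs per bit saved. The sketch below uses only the numbers from the table above; the "perplexity per bit" metric is my own framing, not something the benchmarks report.

```python
# (format, bits per weight, perplexity delta vs fp16), copied from the table above.
TABLE = [
    ("Q8_0", 8.5, 0.01),
    ("Q6_K", 6.6, 0.06),
    ("Q5_K_M", 5.7, 0.10),
    ("Q4_K_M", 4.85, 0.16),
    ("Q3_K_M", 3.91, 0.58),
    ("Q2_K", 3.35, 1.25),
]


def marginal_costs(rows):
    """Perplexity added per bit/weight saved, for each step down the table."""
    steps = []
    for (hi_name, hi_bits, hi_ppl), (lo_name, lo_bits, lo_ppl) in zip(rows, rows[1:]):
        cost = (lo_ppl - hi_ppl) / (hi_bits - lo_bits)
        steps.append((f"{hi_name} -> {lo_name}", cost))
    return steps


if __name__ == "__main__":
    for step, cost in marginal_costs(TABLE):
        print(f"{step:18s} {cost:.3f} perplexity per bit saved")
```

On these figures the step from Q5_K_M to Q4_K_M costs about 0.07 perplexity per bit saved, while the step from Q4_K_M to Q3_K_M costs about 0.45, roughly six times as much.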
Larger models tolerate quantization better. Llama-2 70B at Q4_K_M shows only a +0.07 perplexity delta versus fp16, even though it is also compressed 3.25x. The redundancy in larger models leaves more room to absorb the rounding error. That 70B model at Q4_K_M fits in about 41 GB rather than 140 GB, which moves it from a multi-GPU data center workload into range of a single high-end GPU or an M2 Ultra Mac.
The Bigger Model, Lower Precision Heuristic
One of the most useful practical insights to emerge from this area: within a fixed memory budget, a larger model at lower precision often outperforms a smaller model at higher precision.
At an 8 GB budget, Llama-2 7B in fp16 (14 GB) does not fit, but Llama-2 13B at Q4_K_M (~7.8 GB) does, and at lower perplexity than the 7B model at any quantization level that fits. The model’s capacity to represent complex patterns scales with parameter count in ways that survive lossy compression better than you might expect.
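The arithmetic behind that comparison is just parameters times bits per weight. Here is a rough sketch, assuming nominal parameter counts (7e9 and 13e9) and decimal gigabytes, so the results land near the figures quoted above rather than exactly on them. It counts weights only and ignores the KV cache and runtime buffers.

```python
def weights_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(n_params, bits_per_weight, budget_gb):
    """True if the weights alone fit in the budget."""
    return weights_gb(n_params, bits_per_weight) <= budget_gb


if __name__ == "__main__":
    budget_gb = 8
    options = [
        ("Llama-2 7B fp16", 7e9, 16),
        ("Llama-2 7B Q4_K_M", 7e9, 4.85),
        ("Llama-2 13B Q4_K_M", 13e9, 4.85),
    ]
    for name, n_params, bpw in options:
        size = weights_gb(n_params, bpw)
        verdict = "fits" if fits(n_params, bpw, budget_gb) else "does not fit"
        print(f"{name}: {size:.1f} GB, {verdict} in {budget_gb} GB")
```

The 13B model clears the 8 GB line with very little to spare, so in practice the usable context length at that budget will be limited by whatever memory is left over.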
This has driven how quantized models are distributed in practice. The community standardized on Q4_K_M builds of larger models rather than Q8 or Q6 builds of smaller ones, because the larger model at the lower precision is almost always the better trade.
Inference Speed, Memory Bandwidth, and Why Int4 Is Faster
Quantization does not just save memory; it speeds up inference, though the reason is not purely fewer arithmetic operations.
LLM inference at batch size 1 (the single-user interactive case) is memory-bandwidth bound, not compute bound. For every generated token, each weight is loaded from VRAM once, used in a matrix multiply, and discarded. Modern GPUs take roughly 100x longer to load a weight from memory than to perform the multiply once it arrives. Halving the size of each weight (fp16 to int8) halves the load time, with only a small overhead for the dequantization step before compute.
At 4-bit precision versus fp16, the net result is typically a 2-4x throughput improvement for single-user inference. ExLlamaV2, which uses highly optimized CUDA kernels that fuse dequantization with the matrix multiply, benchmarks Llama-2 13B EXL2 at around 85 tokens per second on a single RTX 3090, against about 35 tokens per second for fp16 on the same GPU. The quantized model not only fits in the 24 GB VRAM where fp16 barely does, it also runs faster.
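If every weight has to cross the memory bus once per token, memory bandwidth sets a hard ceiling on tokens per second. The back-of-envelope sketch below assumes the RTX 3090's published bandwidth of about 936 GB/s and, purely for illustration, 4.85 bits per weight for the quantized case; the bitrate of the EXL2 file in the benchmark above is not stated. It ignores KV cache reads, activations, and kernel overhead.

```python
def max_tokens_per_second(n_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if each weight is read once per token."""
    bytes_per_token = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    bandwidth = 936  # GB/s, assumed RTX 3090 memory bandwidth
    for label, bpw in [("fp16", 16), ("4.85 bits/weight", 4.85)]:
        ceiling = max_tokens_per_second(13e9, bpw, bandwidth)
        print(f"13B at {label}: at most {ceiling:.0f} tokens/s")
```

The fp16 ceiling comes out at 36 tokens per second, close to the measured figure, which is what you would expect from a workload that is almost purely bandwidth bound. The quantized ceiling is around 119 tokens per second, so the measured 85 leaves room for dequantization overhead.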
For large-batch serving (the multi-user API case), the advantage shrinks as compute becomes the bottleneck. But memory savings still matter: a smaller model leaves more room for the KV cache, which in turn enables larger context windows or more concurrent sessions.
KV Cache Quantization: The Next Frontier
Weight quantization is well understood, but the KV cache, the stored key/value tensors from attention layers that grow linearly with context length, is increasingly the memory bottleneck for long-context inference. A 128K context window generates a KV cache that can dwarf the model weights.
KV cache quantization to int8 has been shown to maintain quality well across multiple model families, and int4 KV cache is now supported in vLLM and llama.cpp. At very long contexts, quantizing the KV cache can be more impactful on total memory than quantizing the weights further.
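The size of the cache follows directly from the model shape. The sketch below assumes a Llama-2 7B-like shape (32 layers, 32 KV heads, head dimension 128) run at a hypothetical 128K-token context, which is well beyond what that model was trained for and is used here only to show the scaling. It ignores the per-group scale overhead of the quantized formats, and models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bits_per_value):
    """Bytes needed to hold keys and values for every layer and token."""
    values = 2 * n_layers * n_kv_heads * head_dim * context_len  # 2 = keys + values
    return values * bits_per_value / 8


if __name__ == "__main__":
    shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)  # assumed 7B-like shape
    context_len = 128 * 1024
    for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
        gib = kv_cache_bytes(context_len=context_len, bits_per_value=bits, **shape) / 2**30
        print(f"{label} KV cache at {context_len} tokens: {gib:.0f} GiB")
```

Under these assumptions the fp16 cache needs 64 GiB, several times the roughly 14 GB of fp16 weights. Dropping the cache to int8 or int4 saves far more memory than shaving another bit off the weights would.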
The other direction is training-time quantization rather than post-training. BitNet b1.58 (Microsoft Research, 2024) trains models from scratch with ternary weights ({-1, 0, +1}). At 3B parameters, the paper reports perplexity on par with a same-size fp16 LLaMA baseline while running about 2.7x faster and using roughly 3.5x less GPU memory (about 70% less). That approach sidesteps the post-training accuracy loss by designing the model around the quantization from the start, though it requires training your own model rather than compressing an existing one.
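For a sense of what ternary weights look like, here is a sketch of the absmean mapping as I understand it from the BitNet b1.58 paper: divide each weight by the mean absolute value of the tensor, round, and clip to -1, 0, or +1. In the paper this mapping is applied during training, so the network learns around it. The standalone function below, with its made-up sample weights, shows only the mapping and is not a way to compress an existing model.

```python
def ternarize(weights, eps=1e-8):
    """Map weights to {-1, 0, +1} using an absmean scale."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


if __name__ == "__main__":
    # Hypothetical weights for illustration only.
    weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.45]
    ternary, scale = ternarize(weights)
    print("ternary weights:", ternary)
    print("reconstruction: ", [t * scale for t in ternary])
```

Three possible values per weight carry log2(3), or about 1.58 bits, of information, which is where the name comes from.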
For most practical work with released models, Q4_K_M in GGUF remains the right default: enough quality to be useful, enough compression to be accessible, and broad support across the tools that actually run these models.