· 7 min read ·

The Arithmetic Behind 4-Bit LLMs: Why Block Quantization Changed Local AI

Source: simonwillison

Simon Willison recently published a ground-up explanation of quantization that is worth reading for the clarity it brings to a topic that usually gets explained either too abstractly or not at all. This post goes a level deeper into the mathematics of why specific quantization schemes work better than others, and how the llama.cpp community arrived at the formats that now dominate local AI inference.

The Core Formula

Quantization is a mapping problem. You have a weight matrix full of 32-bit floats, and you want to represent those values using far fewer bits. The simplest possible approach is affine quantization:

scale (s)      = (x_max - x_min) / (q_max - q_min)
zero_point (z) = round(q_min - x_min / s)

quantize:      q = clamp(round(x / s) + z,  q_min, q_max)
dequantize:    x̂ = s * (q - z)

For symmetric INT8, where the range is [-127, 127] and the zero point is fixed at zero, this simplifies further:

s   = max(|x|) / 127
q_i = round(x_i / s)
x̂_i = q_i * s

The reconstruction error for any single weight is bounded by s / 2. If the scale is small, the error is small. If the scale is large, the error is large. This observation explains almost everything that comes after.

The Outlier Problem

Neural network weight tensors have an inconvenient property: they contain outliers. Most weights cluster in a narrow range, but a small fraction can be significantly larger in magnitude. When you compute the scale over an entire tensor, those outliers force a large scale factor, which in turn causes large quantization error for all the ordinary weights.

This is not a minor issue. Tim Dettmers’ LLM.int8() paper found that transformer models develop persistent outlier features at scale, concentrated in specific columns of specific layers. At 6.7B parameters and above, these outliers appear in nearly every transformer layer. A per-tensor INT8 scheme degrades significantly because the scale is being determined by a handful of extreme values, leaving the vast majority of weights quantized with lower effective precision than the bit width implies.

The LLM.int8() solution was to detect these outlier columns and keep them in float16, quantizing only the well-behaved columns to INT8. It works, but it adds complexity and requires runtime detection logic.

Block Quantization: The Key Insight

The llama.cpp approach, which now underlies the GGUF format, takes a different path. Instead of computing one scale for the whole tensor and fighting outliers, it computes a separate scale for each small block of consecutive weights, typically 32 values at a time.

With a block size of 32 and INT4 quantization, the storage layout looks like this:

[scale: float16][q0 q1 q2 ... q31: 4 bits each]

Each block takes 2 bytes for the scale plus 16 bytes for the weights, for 18 bytes total. That works out to 4.5 bits per weight rather than a clean 4. The extra half-bit pays for the scale storage.

This is enough to sidestep the outlier problem. A block of 32 weights rarely spans more than one or two orders of magnitude, so the scale is well-conditioned and the error bound stays tight. The early GGUF format types, Q4_0 and Q4_1, use this approach. Q4_0 stores one scale per block; Q4_1 stores both a scale and a minimum, enabling asymmetric quantization at the cost of one more float16 per block.

K-Quants: Quantizing the Quantizers

The next step in llama.cpp’s evolution was realizing that the scales themselves were a significant source of overhead. In Q4_0, every 32-weight block carries a full float16 scale, which is 2 bytes. For a 7B model with roughly 3.5 billion weight values, that’s around 219 million scale values, consuming about 440 MB just for scales at float16 precision.

The K-quant formats, introduced in mid-2023, quantize the block scales themselves. In Q4_K, the scales and minimum values for each block are stored as 6-bit integers rather than float16, with a second-level scale applied to interpret them. This reduces the overhead from 2 bytes per 32-weight block to about 0.75 bytes per block, saving meaningful space without much quality loss because the scales are smoother and more compressible than the weights themselves.

The S, M, and L suffixes on K-quant types indicate a mixed-precision strategy within a single model file. Q4_K_M, the most widely recommended format, applies Q6_K (6-bit K-quants) to the attention layers, which are more sensitive to quantization error, and Q4_K to the feed-forward layers, which are more tolerant. The result is a file that is nearly the same size as a flat Q4_K_S but measurably better in output quality, because the bits are spent where they matter.

The Perplexity Numbers

Perplexity on Wikitext-2 is the standard benchmark for quantization quality, where lower is better. Community benchmarks for Llama 2 7B show a clear picture:

FormatPerplexityvs float16
float165.68baseline
Q8_05.69+0.2%
Q6_K5.70+0.4%
Q5_K_M5.72+0.7%
Q4_K_M5.77+1.6%
Q3_K_M6.00+5.6%
Q2_K6.60+16%

Q4_K_M sits in a favorable position: under 2% perplexity degradation at roughly one-third the memory of float16. More importantly, for larger models, the perplexity gap between quantization levels narrows considerably. A 70B model at Q4_K_M often outperforms a 7B model at float16 on most benchmarks, which is the practical reason 4-bit quantization dominates local inference workflows.

The speed benefit compounds this. LLM inference during generation is almost entirely memory-bandwidth-bound, not compute-bound. Loading a 4-bit weight from VRAM or system RAM takes roughly one-quarter the time of a float16 weight, which translates almost linearly to faster token generation. Benchmarks on Apple Silicon show Q4_K_M generating tokens at 2 to 3 times the speed of float16, with the gap widening on machines where RAM is shared between CPU and GPU.

Methods That Go Further

GGUF quantization is simple and calibration-free, which is its main advantage. GPTQ and AWQ take a more expensive approach that yields better results at the same bit width.

GPTQ (Frantar et al., 2022) uses the Hessian of each layer’s output error, approximated from calibration data, to compensate for quantization error in already-quantized weights by adjusting the weights that have not yet been quantized. When a weight is rounded, the accumulated error is distributed across neighboring weights in a way that minimizes the total output distortion. The result is INT4 quality that approaches INT8 without calibration.

AWQ (Lin et al., 2023) takes a different angle. Rather than fixing up quantization error after the fact, it identifies the small fraction of weight channels that are most salient (typically around 1% of channels, identified by high activation magnitudes across a calibration set) and scales them up before quantization. A larger absolute value means the same rounding error is proportionally smaller. The corresponding activations are scaled down to compensate, which can be absorbed into the preceding layer norm. AWQ requires a single forward pass of calibration data rather than the full Hessian computation that GPTQ needs, making it faster to apply.

Both methods are used in server-side inference stacks like vLLM and TGI. For local use, GGUF remains dominant because it requires no calibration data and llama.cpp handles the format natively.

The more recent IQ (importance-matrix) formats in llama.cpp bring a calibration-aware approach to GGUF. IQ4_NL uses a non-linear lookup table with values distributed to match the actual distribution of weights rather than uniform integer steps, which reduces mean squared error without changing the bit width. IQ1_S and IQ2_XXS push into extreme compression territory by using importance matrices computed from a calibration pass to allocate bits non-uniformly across weight positions. These formats are mainly useful for very large models (70B and above) where the sheer parameter count gives the quantization enough headroom to preserve quality.

A Note on NF4

One quantization type that does not fit cleanly into the GPTQ/AWQ/GGUF taxonomy is NF4, introduced in the QLoRA paper. NF4 is a 4-bit data type whose 16 quantization levels are not uniformly spaced. Instead, they are chosen so that each interval between adjacent levels contains an equal fraction of the probability mass of a standard normal distribution. Since weight values are approximately normally distributed after training, this minimizes expected quantization error for the actual distribution of values rather than for a worst-case uniform distribution.

NF4 is primarily used for QLoRA fine-tuning, where the base model is kept in 4-bit and only the LoRA adapter weights are trained in full precision. It is not commonly used for inference because the non-uniform lookup adds overhead and GGUF formats are better optimized for generation throughput.

Choosing a Format

The practical decision tree for most local inference use cases is straightforward. For GPU inference on hardware with sufficient VRAM, Q6_K or Q8_0 are near-lossless. For fitting models into constrained memory, Q4_K_M is the first choice: the quality loss is small enough to be imperceptible on most tasks, and the speed and memory benefits are substantial. Q3_K_M is worth considering only when Q4_K_M does not fit, and Q2_K is generally too lossy for 7B models, though it can be acceptable at 70B scale.

For server inference where throughput matters and you have calibration capacity, AWQ INT4 via vLLM or GPTQ through ExLlamaV2 will outperform equivalent GGUF formats at the same bit width. The quality gap is real but small; the infrastructure cost of maintaining calibrated quantizations is the main consideration.

The deeper point, which the progression from Q4_0 to Q4_K_M to IQ formats illustrates clearly, is that bits per weight is only a first approximation of quality. Where the bits go, how the scales are stored, and whether the quantization is calibration-aware or calibration-free all determine how much of the model’s original capability survives. Two 4-bit quantizations of the same model can differ by as much as a 3-bit naive quantization differs from 5-bit, depending on how carefully the scheme handles the parts of the weight distribution that matter most.

Was this interesting?