The Outlier Problem That Made 4-Bit Quantization Hard to Get Right

Source: simonwillison

The memory math of large language models is unforgiving. A 70-billion-parameter model stored in FP16, the standard 16-bit floating point format used for most inference, requires roughly 140 gigabytes of GPU memory for the weights alone. Most consumer GPUs top out at 24 GB. Even professional workstations with multiple GPUs struggle to host models at this scale without creative engineering.

Quantization is the standard answer: reduce the bit width of each weight from 16 bits to 8, or 4, or even 2, and the memory requirement shrinks proportionally. A 70B model at 4-bit fits in about 35 GB, reachable with a pair of consumer GPUs or, depending on the model architecture, a single high-end card. Simon Willison recently published a ground-up explanation of how this works, covering the basic arithmetic involved. What that kind of introductory framing leaves out is why naive quantization falls apart at 4-bit, and what the last three years of research have done to fix it.
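The figures above follow from simple arithmetic. A minimal sketch, assuming memory is dominated by the weights and ignoring scale metadata, activations, and the KV cache (the function name is illustrative):

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params_70b = 70_000_000_000
fp16_gb = weight_memory_gb(params_70b, 16)  # 140.0
int8_gb = weight_memory_gb(params_70b, 8)   # 70.0
int4_gb = weight_memory_gb(params_70b, 4)   # 35.0
```

Real 4-bit files come out slightly larger than this estimate because each group of weights also stores its own scale.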

The Basic Arithmetic

The core operation is affine quantization. Given a floating-point value x, you compute:

x_quantized = round(x / scale + zero_point)
x_reconstructed = (x_quantized - zero_point) * scale

The scale maps the range of floating-point values onto the available integer range. For 4-bit unsigned integers, that range is 0 to 15. The zero_point shifts the range to handle distributions that don’t center on zero. To recover an approximation of the original value, you reverse the operation.

The error introduced here is the difference between x and x_reconstructed. It depends on how tightly the integer grid can represent the continuous distribution of the original weights, which is why granularity, meaning how many weights share a single scale value, matters as much as bit width.
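As a rough illustration of the round trip and its error, here is a minimal sketch of 4-bit affine quantization. The helper names are hypothetical, the scale and zero_point are derived from the minimum and maximum of the input, and a clamp to the integer range is added, which the formula above leaves implicit:

```python
def affine_params(values, bits=4):
    """Pick scale and zero_point so [min, max] maps onto [0, 2^bits - 1]."""
    qmax = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0  # avoid a zero scale for constant input
    zero_point = round(-lo / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, bits=4):
    q = round(x / scale + zero_point)
    return max(0, min(2 ** bits - 1, q))  # clamp to the unsigned range

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

weights = [-0.8, -0.1, 0.0, 0.3, 0.7]
scale, zero_point = affine_params(weights)
codes = [quantize(w, scale, zero_point) for w in weights]
restored = [dequantize(q, scale, zero_point) for q in codes]
errors = [abs(w - r) for w, r in zip(weights, restored)]
```

For values inside the calibrated range, each error is at most half of one scale step, which is why a smaller scale (a narrower range per group) directly means less error.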

Per-tensor quantization uses one scale for an entire weight matrix. It is fast but imprecise: the scale has to accommodate the full range of the matrix, so small values get very coarse representation. Per-channel quantization assigns one scale per output neuron, which is substantially better. Per-group quantization, where typically 128 consecutive weights share a scale, is the standard approach in 4-bit methods because it provides good accuracy without excessive metadata overhead.
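The effect of granularity can be seen on a toy example. This sketch uses symmetric absmax quantization and a group size of 4 rather than 128 purely to keep the numbers readable; both choices are assumptions made for illustration:

```python
def absmax_roundtrip(values, bits=4):
    """Quantize and dequantize one group with a single shared scale."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    step = peak / qmax if peak else 1.0
    return [max(-qmax - 1, min(qmax, round(v / step))) * step for v in values]

def grouped_roundtrip(values, group_size, bits=4):
    """Apply absmax_roundtrip independently to consecutive groups."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(absmax_roundtrip(values[start:start + group_size], bits))
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

weights = [0.01, -0.02, 0.03, 0.015, 2.0, -1.5, 1.0, 0.5]
per_tensor = grouped_roundtrip(weights, group_size=len(weights))
per_group = grouped_roundtrip(weights, group_size=4)
err_tensor = mean_abs_error(weights, per_tensor)
err_group = mean_abs_error(weights, per_group)
```

With one scale for all eight values, the four small weights all reconstruct to zero. With two groups, the first group gets its own much finer scale and keeps most of its information.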

The Outlier Problem

Naive round-to-nearest quantization at INT8 works reasonably well. At INT4, it fails in a way that was not fully understood until the LLM scaling era made it unavoidable.

Large language models develop what researchers now call “emergent outlier features”: a small fraction of channels, typically less than 0.1% of dimensions, carry values 10 to 100 times larger than the rest of the distribution. These outliers appear to encode semantically important information. They also destroy quantization quality. When you compute a scale that must accommodate both a value of 0.003 and a value of 3.2, every value in the normal range rounds to the same integer, usually zero, and the information is gone.
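The collapse is easy to reproduce. A small sketch, assuming a symmetric signed 4-bit grid and made-up values in the ranges mentioned above:

```python
def shared_scale_codes(values, bits=4):
    """Integer codes when all values share one absmax-derived scale."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(v) for v in values) / qmax
    return [round(v / step) for v in values]

# Five ordinary values plus one outlier in the same group.
group = [0.003, -0.004, 0.002, 0.005, -0.001, 3.2]
codes = shared_scale_codes(group)
```

The step size here is about 0.46, so all five ordinary values land on code 0 and only the outlier is distinguishable after quantization.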

Tim Dettmers and colleagues documented this phenomenon in LLM.int8() in 2022. Their solution was mixed-precision decomposition: identify the outlier dimensions, keep them in FP16, and quantize everything else to INT8. This worked well for 8-bit, but the decomposition adds computational overhead, so the method let larger models fit in memory without making inference any faster.
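The idea of the decomposition can be sketched in a few lines. This is a simplified illustration, not the LLM.int8() implementation: it works on a single input vector in pure Python, uses per-row absmax scaling, and the threshold of 6.0 and all function names are assumptions made for the example.

```python
def absmax_quantize(values, qmax=127):
    """Return INT8 codes and the scale that maps them back to floats."""
    peak = max((abs(v) for v in values), default=0.0) or 1.0
    scale = peak / qmax
    return [round(v / scale) for v in values], scale

def int8_matvec(weight_rows, x):
    """Matrix-vector product with both operands quantized to INT8."""
    qx, sx = absmax_quantize(x)
    out = []
    for row in weight_rows:
        qw, sw = absmax_quantize(row)
        acc = sum(a * b for a, b in zip(qw, qx))  # integer accumulation
        out.append(acc * sw * sx)
    return out

def mixed_precision_matvec(weight_rows, x, threshold=6.0):
    """Outlier input dimensions stay in floating point, the rest use INT8."""
    outlier = [i for i, v in enumerate(x) if abs(v) >= threshold]
    regular = [i for i in range(len(x)) if i not in outlier]
    y_int8 = int8_matvec([[row[i] for i in regular] for row in weight_rows],
                         [x[i] for i in regular])
    y_float = [sum(row[i] * x[i] for i in outlier) for row in weight_rows]
    return [a + b for a, b in zip(y_int8, y_float)]

W = [[0.1, 0.2, 0.3, -0.4], [-0.5, 0.6, 0.7, 0.8]]
x = [0.5, -0.25, 40.0, 0.125]  # the third dimension is the outlier
exact = [sum(w * v for w, v in zip(row, x)) for row in W]
mixed = mixed_precision_matvec(W, x)
naive = int8_matvec(W, x)
```

In this toy case the naive INT8 product loses most of the information in the small inputs because their scale is set by the value 40.0, while the mixed path stays close to the exact result. The extra bookkeeping of splitting and recombining is the overhead described above.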

GPTQ: Second-Order Compensation

The breakthrough for INT4 came from a different direction. GPTQ, published by Frantar, Ashkboos, and colleagues in late 2022, drew on a line of work stretching back to Optimal Brain Damage and Optimal Brain Surgeon from the early 1990s. The core idea is to use second-order information, specifically the Hessian of each layer’s output reconstruction error with respect to its weights, to compensate for the error introduced by quantization.

The algorithm processes one column of a weight matrix at a time. When a weight is quantized and the rounding error is computed, GPTQ updates the remaining unquantized weights in the row to partially cancel that error. The Hessian tells you which remaining weights most affect the output, so the compensation goes where it matters most. By the time the entire matrix is quantized, each weight’s error has been absorbed into the others as much as the Hessian information allows.
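The compensation step can be sketched for a single row of weights. This is a simplified version in the spirit of the Optimal Brain Surgeon update, not the GPTQ implementation itself, which works on whole matrices in blocks and uses a Cholesky factorization for stability. The Hessian below is a hypothetical 2x2 stand-in, and the fixed scale and helper names are assumptions.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def round_to_grid(w, scale, qmax=7):
    return max(-qmax - 1, min(qmax, round(w / scale))) * scale

def compensated_quantize_row(weights, hessian, scale, qmax=7):
    """Quantize left to right, pushing each rounding error onto later weights."""
    w = list(weights)
    h_inv = invert(hessian)
    n = len(w)
    out = [0.0] * n
    for i in range(n):
        out[i] = round_to_grid(w[i], scale, qmax)
        err = (w[i] - out[i]) / h_inv[i][i]
        for j in range(i + 1, n):
            w[j] -= err * h_inv[i][j]  # compensate not-yet-quantized weights
        # Eliminate coordinate i from the inverse Hessian.
        pivot, col_i, row_i = h_inv[i][i], [r[i] for r in h_inv], list(h_inv[i])
        for j in range(i + 1, n):
            for k in range(i + 1, n):
                h_inv[j][k] -= col_i[j] * row_i[k] / pivot
    return out

def output_error(weights, quantized, hessian):
    """Quadratic form d^T H d, proportional to the layer output error."""
    d = [a - b for a, b in zip(weights, quantized)]
    return sum(d[i] * hessian[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))

H = [[1.0, 0.9], [0.9, 1.0]]  # two strongly correlated inputs
row = [0.4, 0.4]
rtn = [round_to_grid(w, 1.0) for w in row]            # [0.0, 0.0]
compensated = compensated_quantize_row(row, H, 1.0)   # [0.0, 1.0]
```

Plain rounding sends both weights to zero. The compensated version rounds the first weight down, shifts the second weight up to make up for it, and ends with a much smaller output error under this Hessian (about 0.09 versus 0.61). When the Hessian is diagonal there is nothing to compensate with, and the result reduces to round-to-nearest.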

This requires a calibration dataset, typically 128 samples of 2048 tokens, to compute the Hessian. The quantization process takes minutes to hours depending on model size. The result is substantial: GPTQ at 4-bit with group size 128 typically produces perplexity within 2 to 5% of the FP16 baseline, compared to 15 to 20% degradation from naive rounding. On Llama-2-7B evaluated against WikiText-2, FP16 perplexity is roughly 5.47; GPTQ 4-bit lands around 5.60; naive round-to-nearest at 4-bit is closer to 6.50.

AWQ: Protecting What Matters

AWQ, from MIT’s Han Lab in 2023, takes a simpler approach that reaches comparable quality. Instead of using Hessian information to compensate for error after the fact, AWQ identifies the weight channels that correspond to large activation values and protects them from quantization degradation during the process itself.

The mechanism is a per-channel scale applied before quantization: multiply the salient weights by a scale factor greater than 1, quantize, then divide by the same factor after dequantization. The scaling does not change the mathematical output of the layer, but it changes the distribution that the quantizer sees. Salient weights occupy a larger fraction of the integer range and receive finer-grained representation, while less important weights absorb the additional rounding error.
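A toy version of the scaling trick, under stated assumptions: the per-channel scale is hand-picked rather than searched from activation statistics as AWQ does, the group is four weights wide, and the function names are illustrative only.

```python
def quantize_group(weights, bits=4):
    """Symmetric absmax quantize-dequantize over one group."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in weights) / qmax
    return [max(-qmax - 1, min(qmax, round(w / step))) * step for w in weights]

def scaled_quantize_group(weights, channel_scales, bits=4):
    """Scale each channel up, quantize, then divide the scale back out."""
    scaled = [w * s for w, s in zip(weights, channel_scales)]
    restored = quantize_group(scaled, bits)
    return [r / s for r, s in zip(restored, channel_scales)]

weights = [0.14, -0.70, 0.35, 0.05]   # assume channel 0 is the salient one
plain = quantize_group(weights)
protected = scaled_quantize_group(weights, [2.0, 1.0, 1.0, 1.0])
plain_err = abs(plain[0] - weights[0])          # about 0.04
protected_err = abs(protected[0] - weights[0])  # about 0.01
```

Because the scaled weight (0.28) is still below the group maximum (0.70), the step size does not change, and the salient weight is effectively quantized on a grid twice as fine. If the scale were large enough to raise the group maximum, every other weight in the group would pay for it with a coarser step, which is the trade-off the search in AWQ balances.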

AWQ is faster to apply than GPTQ because it does not require Hessian computation, and it generalizes better across tasks because it focuses on weights activated across many diverse inputs rather than optimizing against a specific calibration set. In practice, AWQ and GPTQ produce similar benchmark results, with AWQ showing a slight edge on instruction-following and chat tasks.

How GGUF K-Quants Handle the Scale Problem

The llama.cpp project developed its own quantization format, now GGUF, and introduced a class of types called K-quants around 2023. These address a subtle inefficiency in standard group quantization: when you store one scale value per 128 weights, the scales themselves consume memory and must be stored in FP32 or FP16. At model scale, this overhead adds up.

K-quants use a nested structure. A “super-block” of 256 weights is divided into smaller internal groups of 16 or 32. The scale and minimum value for each internal group are stored, but those values are themselves quantized to low-bit integers (6 bits in Q4_K) rather than kept in full floating point. FP16 super-block factors then dequantize the group scales and minimums. This hierarchy reduces the metadata overhead significantly without meaningfully affecting quality.
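The saving can be estimated with a bits-per-weight calculation. The layout numbers below are assumptions chosen for illustration (32-weight internal groups, a 6-bit scale and 6-bit minimum per group, and 32 bits of floating-point metadata per super-block); actual K-quant types differ in group size and bit widths.

```python
def bits_per_weight(block_size, weight_bits, sub_block_size,
                    sub_meta_bits, block_meta_bits):
    """Average storage per weight, including scale/minimum metadata."""
    n_sub = block_size // sub_block_size
    total = (block_size * weight_bits
             + n_sub * sub_meta_bits
             + block_meta_bits)
    return total / block_size

# Flat scheme: every 32 weights carry an FP16 scale and an FP16 minimum.
flat = bits_per_weight(32, 4, 32, sub_meta_bits=32, block_meta_bits=0)

# Nested scheme: 6-bit scale + 6-bit minimum per group of 32,
# plus 32 bits of super-block metadata shared by 256 weights.
nested = bits_per_weight(256, 4, 32, sub_meta_bits=12, block_meta_bits=32)
```

Under these assumptions the flat scheme costs 5.0 bits per weight and the nested one 4.5, while still giving every 32-weight group its own scale and minimum.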

The practical result is a family of types where Q4_K_M and Q4_K_S achieve better quality than the simpler Q4_0 at nearly the same storage. Q4_K_M is the most widely used format for local inference today; it offers a strong balance between size and quality, and virtually every major open model has pre-quantized variants available on Hugging Face and through Ollama.

A newer class, the IQ types using importance-matrix quantization, extends this further by incorporating a calibration-informed importance matrix to weight rounding decisions. IQ4_XS, for instance, achieves quality comparable to Q4_K_M in a slightly smaller file by concentrating precision where the importance matrix indicates it matters most.

FP8 and the Hardware-First Path

A different approach has taken hold on server hardware. NVIDIA’s H100 introduced native FP8 tensor cores, supporting two 8-bit floating-point formats: E4M3 (four exponent bits, three mantissa) and E5M2. FP8 retains the dynamic range advantages of floating point rather than compressing to a fixed integer grid, which handles the outlier problem more gracefully than INT8 does.
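The difference in dynamic range between the two formats can be computed directly. This sketch assumes the conventions of the OCP 8-bit floating point specification: E5M2 reserves its top exponent code for infinities and NaN as IEEE formats do, while E4M3 reserves only the all-ones pattern for NaN. The function names are illustrative.

```python
def fp8_max_finite(exp_bits, man_bits, ieee_like):
    """Largest finite magnitude for a 1-sign, exp_bits, man_bits format."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN (E5M2 convention).
        max_exp_code = 2 ** exp_bits - 2
        max_mantissa = 2 ** man_bits - 1
    else:
        # Only the all-ones bit pattern is NaN (E4M3 convention).
        max_exp_code = 2 ** exp_bits - 1
        max_mantissa = 2 ** man_bits - 2
    return 2.0 ** (max_exp_code - bias) * (1 + max_mantissa / 2 ** man_bits)

def fp8_min_subnormal(exp_bits, man_bits):
    """Smallest positive representable magnitude."""
    bias = 2 ** (exp_bits - 1) - 1
    return 2.0 ** (1 - bias - man_bits)

e4m3_max = fp8_max_finite(4, 3, ieee_like=False)  # 448.0
e5m2_max = fp8_max_finite(5, 2, ieee_like=True)   # 57344.0
e4m3_range = e4m3_max / fp8_min_subnormal(4, 3)
```

E4M3 spans a ratio of more than 200,000 between its largest and smallest positive values, whereas a signed INT8 grid spans 127 to 1. That is why a single large value does far less damage to the small ones in FP8.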

Production deployments at major inference providers now routinely use FP8 weight-and-activation quantization, delivering roughly 2x FLOP throughput versus BF16 with near-zero quality loss. Libraries including vLLM, TensorRT-LLM, and SGLang all support FP8 inference on H100. Without GPUs that have native FP8 tensor cores, this path is not available, which is why INT4 weight quantization with 16-bit activations remains the dominant approach on consumer hardware.

KV cache quantization has emerged as a separate and increasingly important dimension. As context windows have grown to 128K tokens and beyond, the key-value cache can exceed the model weights in memory consumption during long conversations. Quantizing the KV cache to INT8 or FP8 cuts this cost roughly in half with minimal perplexity impact, and production systems now treat it as a standard optimization rather than an experimental one.
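The size of the cache follows from the model shape. A minimal sketch, using a hypothetical 70B-class configuration (80 layers, 8 key-value heads with grouped-query attention, head dimension 128); these numbers are assumptions for illustration, not a statement about any particular model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value):
    """Keys and values for every layer, head, and cached token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

GIB = 2 ** 30
context = 128 * 1024  # 128K tokens

fp16_cache = kv_cache_bytes(80, 8, 128, context, bytes_per_value=2)
int8_cache = kv_cache_bytes(80, 8, 128, context, bytes_per_value=1)
fp16_gib = fp16_cache / GIB  # 40.0
int8_gib = int8_cache / GIB  # 20.0
```

Under these assumptions a single full-length sequence needs 40 GiB of cache in FP16, more than the roughly 35 GB the 4-bit weights occupy, and an 8-bit cache brings that down to 20 GiB.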

Choosing a Quantization Level

For most local inference tasks, Q4_K_M or an equivalent AWQ or GPTQ 4-bit quantization with group size 128 is the right default. The quality loss versus FP16 is real but small: a few percent on aggregate benchmarks, often undetectable on conversational or coding tasks. Where quality is critical and memory allows it, Q5_K_M or Q6_K approach FP16 performance closely enough that the difference rarely matters.

What the ground-up math ultimately shows is that quantization is not a single operation but a set of design choices, each with measurable consequences. Granularity, calibration method, and how the quantizer handles outlier features all compound. The gap between a naive 4-bit implementation and a well-calibrated GPTQ or AWQ quantization is larger than the gap between 4-bit and 5-bit using the same method, which is a useful thing to understand when evaluating claims about model quality at reduced precision.