The Outlier Problem: What Makes LLM Quantization Harder Than It Looks
Source: simonwillison
The basic premise of quantization is easy to state: model weights are typically stored as 16- or 32-bit floats during training, and you can represent them with fewer bits at inference time without the model falling apart entirely. Run a 70B parameter model in 4-bit instead of 16-bit and the weights take roughly 35 GB instead of roughly 140 GB. The quality degrades somewhat, and the degree of that degradation is the engineering problem.
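As a rough check on those sizes, the arithmetic can be sketched in a few lines. The helper name `weight_memory_gb` is made up for illustration, and the figures count only the raw weights, ignoring scale metadata, activations, and the KV cache.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


# A 70B-parameter model at several precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
```

Real quantized files come out somewhat larger than the 4-bit figure because every block of weights also carries scale metadata.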
Simon Willison’s recent walkthrough builds the concept from scratch, which is useful for building intuition. But there is a layer below the intuition worth digging into: why do some quantization schemes work dramatically better than others, and why did it take researchers several years to figure out how to quantize large language models without significant quality loss?
The Arithmetic of Precision Loss
Start with absmax quantization, the simplest possible scheme. Take a tensor of float32 weights, find the absolute maximum value, divide by 127 (for int8), and you have a scale factor. Every weight maps to round(weight / scale), producing an integer in [-127, 127]. Reconstruction multiplies by the same scale. The maximum possible error per weight is scale / 2.
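import torch

def absmax_quantize(weights):
    # One scale for the whole tensor, chosen so the largest magnitude maps to 127
    scale = weights.abs().max() / 127
    quantized = (weights / scale).round().clamp(-127, 127).to(torch.int8)
    return quantized, scale

def absmax_dequantize(quantized, scale):
    # Reconstruction: integers back to floats on the original scale
    return quantized.float() * scale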
This is clean and fast. For int8 it halves memory relative to float16, and for int4 it quarters it. The problem is that it does not work well for transformer weights, and the reason took time to become clear.
The Outlier Problem
In 2022, a paper from Dettmers et al. identified something unexpected in large transformer models: as models scale past roughly 6.7 billion parameters, a small fraction of hidden dimensions develop persistent, large-magnitude activations across nearly all inputs. These outliers are not artifacts; they appear to be functional. But they are catastrophic for quantization.
When a single dimension has values 100x larger than most others, the absmax scale gets set by that outlier. Everything else gets squashed into a few integer values near zero. A value that should map to, say, 45 out of 127 now maps to round(0.45) = 0. The reconstruction error for non-outlier values becomes enormous relative to their size.
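The squashing effect is easy to reproduce with a handful of numbers. The sketch below is a pure-Python version of the absmax scheme (the name `absmax_quantize_list` and the sample values are invented for illustration) applied to the same values with and without one outlier that is 100x larger.

```python
def absmax_quantize_list(values, qmax=127):
    """Per-tensor absmax quantization of a list of floats (int8 when qmax=127)."""
    scale = max(abs(v) for v in values) / qmax
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def absmax_dequantize_list(quantized, scale):
    return [q * scale for q in quantized]


# Hypothetical values: five "typical" entries, then the same plus one outlier.
typical = [0.35, -0.8, 0.12, 1.0, -0.45]
with_outlier = typical + [100.0]

q_clean, scale_clean = absmax_quantize_list(typical)
q_outlier, scale_outlier = absmax_quantize_list(with_outlier)

print(q_clean)    # [44, -102, 15, 127, -57]
print(q_outlier)  # [0, -1, 0, 1, -1, 127]
```

Without the outlier the five values spread across most of the int8 range. With it, they collapse onto -1, 0, and 1, so nearly all of their information is gone even though each error is still within the scale / 2 bound.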
The effect compounds with model size. Smaller models might have no outlier dimensions at all; models above 30B parameters typically have them in every layer. Applying naive int8 quantization to a 65B parameter model produces output noticeably worse than the float16 version. At 4-bit, the situation is worse still.
LLM.int8(), introduced in that same paper, was the first practical solution. The core idea: decompose the matrix multiplication. Identify outlier dimensions (typically fewer than 1% of them), keep those in float16, and quantize everything else to int8. The two partial results are added together. Memory savings are slightly reduced but still substantial, and quality is largely preserved.
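# Pseudocode for mixed-precision matmul; .int8() stands in for a quantize-to-int8 step
# Outlier feature dimensions: activation columns containing a value above the threshold
outlier_cols = find_outlier_dimensions(activation, threshold=6.0)
normal_cols = ~outlier_cols
# int8 path: integer matmul over the normal dimensions, rescaled back to float
result = (activation[:, normal_cols].int8() @ weights[normal_cols, :].int8()) * scale
# float16 path: outlier dimensions stay in half precision and are added on top
result += activation[:, outlier_cols].half() @ weights[outlier_cols, :].half()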
This works, but decomposing matrix multiplications adds latency and implementation complexity. It also does not directly address 4-bit quantization, where rounding errors are larger to begin with.
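For readers who want to run the decomposition, here is a minimal pure-Python sketch. It is not the bitsandbytes implementation: the helper names, the tiny matrices, and the choice of one absmax scale per activation row and per weight column are assumptions made for illustration, and plain Python floats stand in for float16.

```python
QMAX = 127


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def quantize_vec(vec):
    """Absmax-quantize one vector; returns (integers, scale)."""
    scale = max((abs(v) for v in vec), default=0.0) / QMAX or 1.0
    return [round(v / scale) for v in vec], scale


def mixed_precision_matmul(x, w, threshold=6.0):
    """x is tokens x features, w is features x outputs."""
    n_features, n_out = len(w), len(w[0])
    # A feature dimension is an outlier if any activation in it reaches the threshold.
    is_outlier = [any(abs(row[j]) >= threshold for row in x) for j in range(n_features)]
    normal = [j for j in range(n_features) if not is_outlier[j]]
    outliers = [j for j in range(n_features) if is_outlier[j]]
    # Quantize each weight column once, over the non-outlier rows only.
    w_cols = [quantize_vec([w[j][k] for j in normal]) for k in range(n_out)]
    result = []
    for row in x:
        xq, xs = quantize_vec([row[j] for j in normal])
        out_row = []
        for k in range(n_out):
            wq, ws = w_cols[k]
            int_acc = sum(a * b for a, b in zip(xq, wq))      # integer path
            fp_acc = sum(row[j] * w[j][k] for j in outliers)  # full-precision path
            out_row.append(int_acc * xs * ws + fp_acc)
        result.append(out_row)
    return result


def max_abs_error(a, b):
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


# Hypothetical activations: feature 2 is an outlier dimension.
x = [[0.5, -1.2, 40.0, 0.3], [1.1, 0.2, -55.0, -0.7]]
w = [[0.2, -0.5], [0.7, 0.1], [0.05, -0.03], [-0.4, 0.6]]

exact = matmul(x, w)
mixed = mixed_precision_matmul(x, w)
naive = mixed_precision_matmul(x, w, threshold=float("inf"))  # everything in int8
print(max_abs_error(exact, naive), max_abs_error(exact, mixed))
```

Setting the threshold to infinity sends every dimension through the int8 path, which gives a naive baseline to compare against. On this toy input the mixed path's worst-case error is more than an order of magnitude smaller.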
Block Quantization: A Structural Fix
Rather than decomposing the computation, block quantization assigns a separate scale factor to each small block of weights, typically 32 or 64 values. Each block is quantized relative to its own maximum. An outlier value in one block only sets the scale for that block, not the entire tensor.
A weight tensor with one outlier dimension might have poor representation for that dimension’s blocks, but all other blocks are quantized accurately relative to their own range. The aggregate error is far lower, without any change to how the matrix multiply is structured.
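A small experiment makes the difference concrete. The sketch below applies signed 4-bit absmax quantization to 64 synthetic weights containing a single outlier, first with one scale for the whole tensor and then with one scale per block of 32. The function names and the synthetic data are illustrative assumptions, not any library's API.

```python
import math


def fake_quantize(values, qmax):
    """Absmax-quantize then dequantize, returning the reconstructed floats."""
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def blockwise_fake_quantize(values, block_size, qmax):
    """Same scheme, but each block of block_size values gets its own scale."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(fake_quantize(values[start:start + block_size], qmax))
    return out


def mean_abs_error(original, reconstructed):
    return sum(abs(a - b) for a, b in zip(original, reconstructed)) / len(original)


# Synthetic weights in [-1, 1] with one large outlier in the first block.
weights = [math.sin(i) for i in range(64)]
weights[5] = 50.0
QMAX_4BIT = 7  # symmetric signed 4-bit range

per_tensor_err = mean_abs_error(weights, fake_quantize(weights, QMAX_4BIT))
blockwise_err = mean_abs_error(weights, blockwise_fake_quantize(weights, 32, QMAX_4BIT))
print(f"per-tensor: {per_tensor_err:.3f}  block-wise: {blockwise_err:.3f}")
```

The block holding the outlier is reconstructed just as badly as before, but the second block recovers its full 4-bit resolution, so the average error drops substantially.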
GGUF’s quantization formats, used extensively by llama.cpp, are built around block quantization with additional structure. Q8_0 uses blocks of 32 weights with a single float16 scale each. Q4_0 uses 4-bit integers in blocks of 32. The K-quant variants go further: they use a two-level scheme where the block scales themselves are quantized relative to a super-block scale.
In Q4_K_M specifically, most weights are stored in 4-bit, organized in blocks of 32. Groups of 8 blocks form a super-block of 256 weights. Each block's scale (and, in Q4_K, its minimum) is stored as a 6-bit integer relative to a float16 super-block value, compressing scale metadata significantly. The “M” suffix stands for medium and marks a mixed-precision recipe: some tensors that are more sensitive to precision loss, in llama.cpp's description half of the attention value projections and feed-forward down projections, use Q6_K instead of Q4_K. This is why Q4_K_M quality consistently exceeds the simpler Q4_0, even though the Q4_K blocks cost about 4.5 bits per weight once metadata is counted, rather than a clean 4.0, and the Q6_K tensors push the file average somewhat higher than that.
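The bits-per-weight figures follow directly from the block layouts. The calculation below is a sketch under one assumption that goes beyond the text above: that a Q4_K super-block stores a 6-bit scale and a 6-bit minimum per block plus two float16 super-block values, which is my reading of the llama.cpp format.

```python
def bits_per_weight(n_weights, weight_bits, metadata_bits):
    """Average storage cost per weight for one block or super-block."""
    return (n_weights * weight_bits + metadata_bits) / n_weights


# Q8_0 and Q4_0: 32 weights plus one float16 scale per block.
q8_0 = bits_per_weight(32, 8, 16)
q4_0 = bits_per_weight(32, 4, 16)

# Q4_K (assumed layout): 256 weights, 8 blocks with a 6-bit scale and a
# 6-bit minimum each, plus two float16 super-block values.
q4_k = bits_per_weight(256, 4, 8 * (6 + 6) + 2 * 16)

print(q8_0, q4_0, q4_k)  # 8.5 4.5 4.5
```

Under these assumptions Q4_K spends the same 4.5 bits per weight as Q4_0 but buys both a scale and an offset for every block, which is part of why it reconstructs weights more accurately at the same size.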
GPTQ: Using Curvature to Compensate
Block quantization handles the outlier problem structurally. GPTQ (Frantar et al., 2022) takes a different approach: it quantizes weights one at a time and uses information about the loss landscape to compensate for each rounding error before moving to the next weight.
The key insight comes from Optimal Brain Compression, which extends the classic Optimal Brain Surgeon pruning technique to quantization. For each weight you quantize, you can compute the resulting change in output error and then adjust the remaining unquantized weights to partially cancel it. The required information is the inverse Hessian of the layer’s output reconstruction error with respect to its weights, estimated from a calibration dataset of real text.
GPTQ makes this tractable for billion-parameter models by processing weights column by column and using a Cholesky decomposition to efficiently maintain the inverse Hessian. The calibration pass needs only on the order of a hundred text sequences (the paper uses 128) to compute activation statistics. The resulting post-quantization weights are not simply rounded originals; they are adjusted to minimize accumulated rounding error column by column.
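The compensation step can be sketched for a single output row in pure Python. This is a toy version of the idea, not GPTQ itself: it uses the explicit inverse-Hessian update from Optimal Brain Surgeon rather than the Cholesky formulation, a fixed rounding step instead of a real 4-bit grid, and synthetic calibration data. All function names are invented for illustration.

```python
import random


def invert(m):
    """Gauss-Jordan inverse of a small square matrix with partial pivoting."""
    n = len(m)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def gptq_like_quantize(w, samples, step, damp=0.01):
    """Quantize one weight row in order, compensating on the remaining weights."""
    d = len(w)
    # Hessian of the squared output error (up to a constant): sum of x x^T.
    h = [[sum(x[i] * x[j] for x in samples) for j in range(d)] for i in range(d)]
    mean_diag = sum(h[i][i] for i in range(d)) / d
    for i in range(d):
        h[i][i] += damp * mean_diag  # damping keeps the inverse well behaved
    hinv = invert(h)
    w = list(w)
    for q in range(d):
        target = w[q]
        w[q] = round(target / step) * step
        err = (target - w[q]) / hinv[q][q]
        for j in range(q + 1, d):  # push the rounding error onto later weights
            w[j] -= err * hinv[q][j]
        for i in range(q + 1, d):  # drop index q from the inverse Hessian
            for j in range(q + 1, d):
                hinv[i][j] -= hinv[i][q] * hinv[q][j] / hinv[q][q]
    return w


def output_error(w_ref, w_q, samples):
    return sum(sum((a - b) * xi for a, b, xi in zip(w_ref, w_q, x)) ** 2 for x in samples)


# Synthetic calibration inputs with correlated features (an assumption).
rng = random.Random(0)
samples = []
for _ in range(32):
    base = rng.gauss(0, 1)
    samples.append([base + 0.3 * rng.gauss(0, 1) for _ in range(4)])

weights = [0.37, -0.52, 0.18, 0.91]
STEP = 0.25
rounded = [round(v / STEP) * STEP for v in weights]
compensated = gptq_like_quantize(weights, samples, STEP)
print("round-to-nearest error:", output_error(weights, rounded, samples))
print("compensated error:     ", output_error(weights, compensated, samples))
```

Compensation usually, though not on every input, gives a lower output error than plain rounding, and the benefit depends on the input features being correlated, since that is what lets one weight absorb another's rounding error.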
In practice, GPTQ 4-bit produces lower perplexity than naive round-to-nearest 4-bit across a wide range of models, and the quality gap grows at smaller bit widths. At 3-bit, naive quantization is often unusable on capable models; GPTQ 3-bit remains usable, and on the largest models it stays reasonably close to float16 on many benchmarks.
AWQ: Protecting What Matters Most
AWQ (Activation-aware Weight Quantization) from Lin et al. starts from a simpler observation: not all weights contribute equally to the output. Weights on input channels with large-magnitude activations have outsized influence, because any rounding error on them is multiplied by a large activation before it reaches the output.
AWQ protects salient weights not by keeping them in higher precision, which would add implementation complexity, but by scaling them up before quantization and scaling the corresponding activations down to compensate. A weight of 0.5 becomes 4.0 after a per-channel scaling factor of 8, then rounds to a 4-bit integer that reconstructs with much lower relative error. At runtime, the activation scaling factor is absorbed into the preceding layer’s normalization or linear transform, adding no inference overhead.
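The effect of the scaling trick can be checked numerically. The sketch below assumes the quantization step for the group is fixed by some other, larger weight and does not change when the salient channel is scaled up; that assumption is what makes the trick pay off. The function names, the step size, and the channel scale of 2 are illustrative choices, not values from AWQ.

```python
def fake_quant(w, delta, qmax=7):
    """Round w to the nearest multiple of delta, clamped to a signed 4-bit range."""
    return max(-qmax, min(qmax, round(w / delta))) * delta


def awq_style_fake_quant(w, channel_scale, delta, qmax=7):
    """Scale the weight up, quantize, then fold the inverse scale back in.

    In a real kernel the division is applied to the activation (or absorbed
    into the previous layer); the product of weight and activation is the same.
    """
    return fake_quant(w * channel_scale, delta, qmax) / channel_scale


# Hypothetical group whose largest weight is 0.9 and stays the largest after
# scaling, so both variants share the same quantization step.
DELTA = 0.9 / 7
CHANNEL_SCALE = 2.0
SALIENT_ACTIVATION = 20.0

# Sweep the salient weight over a range of values and average the error.
sweep = [i / 100 for i in range(-40, 41)]
plain_err = sum(abs(w - fake_quant(w, DELTA)) for w in sweep) / len(sweep)
scaled_err = sum(
    abs(w - awq_style_fake_quant(w, CHANNEL_SCALE, DELTA)) for w in sweep
) / len(sweep)

print("mean output error, plain: ", SALIENT_ACTIVATION * plain_err)
print("mean output error, scaled:", SALIENT_ACTIVATION * scaled_err)
```

Scaling by 2 puts the salient weight on a grid twice as fine, so its average rounding error, and the output error it causes, roughly halves. The real method searches for per-channel scales that balance this gain against the cost to the other weights in the group.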
The calibration requirement for AWQ is lighter than GPTQ: it only needs activation statistics, not a full Hessian computation. This makes it faster to apply and less sensitive to what calibration data you use. The quality is generally comparable to GPTQ, with slight advantages on some models at lower bit widths.
AutoAWQ and bitsandbytes bring AWQ and LLM.int8() into the Hugging Face ecosystem. AutoAWQ handles calibration and weight scaling for AWQ; bitsandbytes provides the LLM.int8() and 4-bit NF4 formats used extensively in QLoRA fine-tuning workflows, where keeping a frozen quantized base model in memory alongside trainable adapters is the whole point.
Where the Research Has Moved
Post-training quantization remains a practical necessity for the near term: the models people want to run were trained at full precision and need to be compressed after the fact. For local inference, Q4_K_M or Q5_K_M via llama.cpp is the reasonable default for most use cases, landing at roughly 4.5 to 6 bits per weight with quality close to float16 on standard benchmarks.
The research frontier has pushed into sub-4-bit and training-time quantization. BitNet from Microsoft trains models from scratch with extremely low-bit weights: the original BitNet uses 1-bit weights (-1 or +1), and BitNet b1.58 uses ternary values (-1, 0, or +1), which carry log2(3) ≈ 1.58 bits each. This shows that models do not need to start at high precision and be compressed afterward. BitNet b1.58 reports perplexity competitive with full-precision models at the same parameter count while enabling dramatically faster inference on hardware that can exploit ternary arithmetic. The ecosystem tooling is not there yet, but the research result is significant: the constraint is not fundamental.
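To make the ternary idea concrete, here is a minimal sketch of the absmean weight quantizer as I understand it from the BitNet b1.58 paper: divide by the mean absolute weight, round, and clip to {-1, 0, +1}. It shows only the weight mapping. The actual method applies this inside training, which the sketch does not attempt, and the sample weights are invented.

```python
import math


def absmean_ternary(weights, eps=1e-8):
    """Map weights to {-1, 0, +1} using the mean absolute value as the scale."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    ternary = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return ternary, gamma


# Hypothetical weights for illustration.
weights = [0.9, -0.05, 0.4, -1.3, 0.1, -0.6]
ternary, gamma = absmean_ternary(weights)

print(ternary)       # [1, 0, 1, -1, 0, -1]
print(math.log2(3))  # information content of one ternary weight, about 1.58 bits
```

Weights much smaller than the average magnitude collapse to zero and everything else keeps only its sign, which is why this works as a training-time constraint rather than as a post-hoc compression step.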
Understanding the distinctions between schemes matters for practical decisions. Q4_K_M is not simply the 4-bit version of a model; it is a specific combination of block structure, two-level scale quantization, and mixed-layer precision that happens to land in a useful spot on the quality-size curve. GPTQ and AWQ are not interchangeable; they make different structural bets about where quantization error originates and how to correct it. The choice between them depends on your hardware, your calibration data, and which models you are running. Picking blindly based on bit count alone leaves a lot of quality on the table.