
Where Quantization Math Gets Hard: The Outlier Problem Behind GPTQ, AWQ, and GGUF K-Quants

Source: simonwillison

Simon Willison recently published a ground-up explanation of quantization that walks through the arithmetic of mapping floating-point weights to integers. The math is genuinely approachable once you see it laid out. What the ground-up framing reveals, though, is something worth dwelling on: the arithmetic itself is simple, but transformers have a structural property that makes applying that arithmetic to real models surprisingly difficult. That property is outliers, and understanding it explains almost every design decision in the quantization ecosystem.

The Arithmetic Itself

Linear quantization maps a floating-point value x into an integer q using a scale factor and an optional zero point:

q = round(x / scale + zero_point)
x_hat = (q - zero_point) * scale

For symmetric quantization (zero_point = 0), the scale is just max(|x|) / (2^(bits-1) - 1). For 8-bit symmetric quantization, that denominator is 127. Every value in your tensor gets divided by that scale, rounded to the nearest integer, and stored. Reconstruction multiplies back by the scale.

The quantization error for any single value is bounded by scale / 2. If your tensor has a max absolute value of 1.0, the scale is about 0.0079, and your worst-case error per weight is around 0.004. That is fine.
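As a minimal sketch of the two formulas above, assuming plain Python and a handful of made-up weights (the function names and sample values are illustrative, not taken from Willison's post):

```python
def symmetric_scale(values, bits=8):
    """Scale for symmetric quantization: max(|x|) / (2^(bits-1) - 1)."""
    return max(abs(v) for v in values) / (2 ** (bits - 1) - 1)


def quantize(x, scale, zero_point=0):
    """q = round(x / scale + zero_point)"""
    return round(x / scale + zero_point)


def dequantize(q, scale, zero_point=0):
    """x_hat = (q - zero_point) * scale"""
    return (q - zero_point) * scale


# Hypothetical weights with a max absolute value of 1.0.
weights = [1.0, -0.73, 0.004, 0.31, -0.002]

scale = symmetric_scale(weights)  # 1 / 127
codes = [quantize(w, scale) for w in weights]
restored = [dequantize(q, scale) for q in codes]
worst_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"scale = {scale:.4f}")
print("codes =", codes)
print(f"worst-case error = {worst_error:.4f} (bound: {scale / 2:.4f})")
```

The worst error in this sample stays under scale / 2, which is the bound described above.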

The problem is what happens when one value in your tensor is 100.0 and the rest cluster between -1.0 and 1.0.

Outliers Break the Scale Calculation

One outlier at 100.0 forces the scale to roughly 100 / 127 = 0.787. Now your worst-case error is 0.39, and the entire effective precision of your 8-bit quantization is being spent on a range that almost no weights actually occupy. The 99% of weights sitting between -1 and 1 are getting quantized with an effective resolution of about 2.5 steps across their entire range. You have used 8 bits to represent what you could have represented with 2.
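To make the loss of resolution concrete, here is a small sketch that counts how many distinct integer codes a set of small weights actually lands on, with and without the outlier. The evenly spaced weights and the helper names are assumptions chosen for illustration.

```python
def symmetric_scale(values, bits=8):
    """Symmetric scale: max(|x|) / (2^(bits-1) - 1)."""
    return max(abs(v) for v in values) / (2 ** (bits - 1) - 1)


def codes_used(values, scale):
    """Number of distinct integer codes the values are mapped to."""
    return len({round(v / scale) for v in values})


# 101 evenly spaced hypothetical weights between -1.0 and 1.0.
small_weights = [i / 50 - 1 for i in range(101)]

scale_clean = symmetric_scale(small_weights)              # 1 / 127
scale_outlier = symmetric_scale(small_weights + [100.0])  # 100 / 127

levels_clean = codes_used(small_weights, scale_clean)
levels_outlier = codes_used(small_weights, scale_outlier)

print("distinct codes without outlier:", levels_clean)    # 101
print("distinct codes with outlier:   ", levels_outlier)  # 3
```

With the outlier present, every small weight collapses onto -1, 0, or 1, which is the "about 2.5 steps" figure in practice.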

This would be a manageable nuisance in most neural networks. In transformers it is pervasive and systematic. The LLM.int8() paper by Dettmers et al. (2022) documented the phenomenon carefully: as transformer models scale to around 6.7 billion parameters and beyond, they develop persistent outlier features in specific activation dimensions. These outliers are not random noise. The same dimensions consistently produce activations an order of magnitude larger than the rest, across layers, across tokens, across inputs. The models have learned to use those dimensions for something important, and they are not going away.

Three Answers to One Problem

The post-training quantization landscape since 2022 is largely a set of different answers to this outlier problem. They make different tradeoffs between compute cost, quality, and deployment complexity.

LLM.int8(): Just Separate the Outliers

The LLM.int8() approach is the most direct: identify the outlier dimensions and handle them separately. The method detects which columns of the weight matrix correspond to outlier activation dimensions (typically less than 1% of all features), keeps those columns in float16, and quantizes the remaining 99% to int8. The matrix multiplication is decomposed accordingly.

This mixed-precision approach preserves quality remarkably well because it protects the dimensions the model actually cares about. The cost is runtime complexity: you are doing two matrix multiplications and combining the results. The bitsandbytes library implements this and makes it accessible via a simple flag in Hugging Face Transformers.
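The decomposition can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the bitsandbytes implementation: it handles a single input vector, uses a magnitude threshold of 6.0 to flag outlier dimensions, assumes at least one non-outlier dimension exists, and all names and sample values are hypothetical.

```python
def absmax_quantize(values, bits=8):
    """Symmetric quantization of one vector; returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale


def int8_matvec(x, W):
    """x @ W with x quantized as one vector and W quantized per output column."""
    xq, sx = absmax_quantize(x)
    out = []
    for column in zip(*W):
        wq, sw = absmax_quantize(column)
        acc = sum(a * b for a, b in zip(xq, wq))  # integer accumulation
        out.append(acc * sx * sw)                 # dequantize the result
    return out


def mixed_matvec(x, W, threshold=6.0):
    """Outlier input dimensions stay in float; the rest go through int8."""
    outlier = [i for i, v in enumerate(x) if abs(v) >= threshold]
    regular = [i for i in range(len(x)) if i not in outlier]
    y_int8 = int8_matvec([x[i] for i in regular], [W[i] for i in regular])
    y_float = [sum(x[i] * W[i][j] for i in outlier) for j in range(len(W[0]))]
    return [a + b for a, b in zip(y_int8, y_float)]


# Hypothetical activation vector with one outlier feature, and a 4x2 weight matrix.
x = [0.5, -0.3, 60.0, 0.2]
W = [[0.2, -0.4], [0.7, 0.1], [0.05, -0.02], [-0.6, 0.3]]

exact = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]
err_full = sum(abs(a - b) for a, b in zip(int8_matvec(x, W), exact))
err_mixed = sum(abs(a - b) for a, b in zip(mixed_matvec(x, W), exact))

print(f"all-int8 error: {err_full:.5f}")
print(f"mixed error:    {err_mixed:.5f}")
```

Pulling the outlier dimension out of the int8 path lets the remaining activations use the full 8-bit range, so the mixed result is closer to the exact product in this example.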

GPTQ: Fix the Error After the Fact

GPTQ (Frantar et al., 2022) takes a different approach rooted in second-order optimization. Rather than separating outliers, it accepts that quantization will introduce error and compensates for that error in the remaining unquantized weights.

The method processes weights column by column. When it quantizes weight w_i to q_i, it computes the quantization error delta_i = w_i - q_i, then adjusts all remaining unquantized weights to compensate, using the inverse of the Hessian of the layer’s output reconstruction error:

w_j := w_j - (delta_i / [H^-1]_{ii}) * [H^-1]_{ij}

This is a direct descendant of Optimal Brain Surgeon (Hassibi & Stork, 1993), adapted to work at the scale of billion-parameter models by sharing the Hessian computation across rows and using lazy batch updates. The Hessian is estimated from a small calibration dataset; the paper uses 128 sequences.
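The compensation step can be illustrated on a single tiny weight row. The sketch below is a simplified, unbatched version of the idea rather than the GPTQ implementation: it takes the error as w_i - q_i, inverts a hand-written 2x2 stand-in for the Hessian with plain Gauss-Jordan elimination, and skips the damping, Cholesky factorization, and blocking that the real method relies on. All names and numbers are hypothetical.

```python
def invert(matrix):
    """Gauss-Jordan inverse for a small, well-conditioned square matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for c in range(n):
        pivot_row = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot_row] = aug[pivot_row], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def quantize_with_compensation(w, hessian, scale):
    """Quantize left to right, pushing each rounding error onto later weights."""
    w = list(w)
    n = len(w)
    hinv = invert(hessian)
    q = [0.0] * n
    for i in range(n):
        q[i] = round(w[i] / scale) * scale
        err = (w[i] - q[i]) / hinv[i][i]
        for j in range(i + 1, n):
            w[j] -= err * hinv[i][j]
        # Remove weight i from the inverse Hessian (one elimination step).
        for j in range(i + 1, n):
            for k in range(i + 1, n):
                hinv[j][k] -= hinv[j][i] * hinv[i][k] / hinv[i][i]
    return q


def output_error(w, q, hessian):
    """Quadratic proxy for the layer output error: e^T H e with e = w - q."""
    e = [a - b for a, b in zip(w, q)]
    return sum(e[i] * hessian[i][j] * e[j]
               for i in range(len(e)) for j in range(len(e)))


# Hypothetical 2-weight row whose inputs are strongly correlated.
H = [[1.0, 0.9], [0.9, 1.0]]
w = [0.4, 0.4]
scale = 1.0  # integer grid, to keep the numbers easy to follow

q_rtn = [round(v / scale) * scale for v in w]     # plain round-to-nearest
q_comp = quantize_with_compensation(w, H, scale)  # with error compensation

print("round-to-nearest:", q_rtn, output_error(w, q_rtn, H))
print("compensated:     ", q_comp, output_error(w, q_comp, H))
```

Round-to-nearest sends both weights to 0. With compensation, the error from the first weight nudges the second from 0.4 up to about 0.76, so it rounds to 1 and the two errors partly cancel in the layer output.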

GPTQ runs slowly: quantizing a 70B model takes several hours on a GPU. But it runs once, offline, and the result is a quantized model that is loaded and served normally. Quality at 4-bit is consistently competitive with or better than naive int8 approaches.

AWQ: Weight the Important Ones

AWQ (Lin et al., 2023) makes a cleaner observation. If you look at which weights cause the most damage when quantized, it is not random: weights that correspond to channels with large input activations matter more, because errors in those weights get amplified by the large activations when computing the layer output. The error in the output is proportional to delta_w * x, not just delta_w.

AWQ uses a calibration dataset to identify salient weight channels, then applies a per-channel scaling transformation before quantization. If a channel’s weights are multiplied by a scale s > 1 before quantization and divided by s after reconstruction, the quantization grid effectively has finer resolution for that channel. The scale is chosen to minimize output error. The transformation is mathematically equivalent to absorbing the scale into the adjacent layer, so the deployed model has no runtime overhead.
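A toy version of the scaling trick, assuming a single group of four weights that share one 4-bit scale. The scale factor of 2 is hand-picked here for illustration, whereas AWQ searches for it using calibration activations; the function names and values are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest over one group sharing a single scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in weights) / qmax
    return [round(v / scale) * scale for v in weights]


def layer_output(weights, activations):
    """Dot product standing in for one output of a linear layer."""
    return sum(w * x for w, x in zip(weights, activations))


def awq_style_output(weights, activations, channel_scales):
    """Scale weights up per input channel, quantize, and scale activations down."""
    scaled_w = [w * s for w, s in zip(weights, channel_scales)]
    scaled_x = [x / s for x, s in zip(activations, channel_scales)]
    return layer_output(quantize_group(scaled_w), scaled_x)


# Hypothetical group: channel 1 has a small weight but a large activation.
weights = [7.0, 0.4, -3.0, 2.0]
activations = [0.1, 10.0, 0.1, 0.1]

exact = layer_output(weights, activations)
plain = awq_style_output(weights, activations, [1.0, 1.0, 1.0, 1.0])
scaled = awq_style_output(weights, activations, [1.0, 2.0, 1.0, 1.0])

plain_error = abs(plain - exact)
scaled_error = abs(scaled - exact)

print(f"output error without scaling: {plain_error:.2f}")
print(f"output error with s = 2:      {scaled_error:.2f}")
```

Without scaling, the salient weight 0.4 rounds to 0 and the large activation turns that into a large output error. Scaling it to 0.8 lets it round to 1, and because the group maximum is unchanged the other weights are quantized exactly as before.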

AWQ runs much faster than GPTQ (minutes rather than hours) and achieves similar perplexity scores at 4-bit. The AutoAWQ library provides an accessible implementation.

GGUF and K-Quants: Hierarchical Scales

The GGUF ecosystem that grew around llama.cpp has its own quantization format that takes a structural approach to the outlier problem. Instead of modifying the quantization procedure, it uses a two-level scale hierarchy.

The basic GGUF quantization types (Q4_0, Q5_0, Q8_0) use one scale per block of 32 weights. A single outlier in a block still degrades precision for the whole block, but the damage is contained to 32 weights rather than the entire tensor.
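The containment effect is easy to see in a sketch. The code below is not llama.cpp's Q4_0 routine (which has its own rounding and packing details); it is a simplified symmetric 4-bit quantizer applied once per tensor and once per block of 32, on made-up weights with one planted outlier.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric round-to-nearest with one scale for all the given values."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) * scale for v in values]


def quantize_in_blocks(values, block_size=32, bits=4):
    """Same quantizer, but with a separate scale for every block."""
    restored = []
    for start in range(0, len(values), block_size):
        restored.extend(quantize_dequantize(values[start:start + block_size], bits))
    return restored


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# 64 hypothetical weights in [-1, 1), with one outlier planted in the first block.
weights = [((i * 37) % 200 - 100) / 100 for i in range(64)]
weights[3] = 100.0

per_tensor = quantize_dequantize(weights)
per_block = quantize_in_blocks(weights)

second_block = slice(32, 64)  # the block that does not contain the outlier
err_tensor = max_error(weights[second_block], per_tensor[second_block])
err_block = max_error(weights[second_block], per_block[second_block])

print(f"second block, one scale per tensor: max error {err_tensor:.3f}")
print(f"second block, one scale per block:  max error {err_block:.3f}")
```

With a single tensor-wide scale, every small weight in the second block rounds to zero. With per-block scales, the outlier only affects the 32 weights that share its block.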

The K-quants (Q2_K through Q6_K) add a super-block structure. In Q4_K, for example, a super-block contains 256 weights, divided into 8 sub-blocks of 32. Each sub-block gets its own scale and minimum, but those values are themselves quantized to 6 bits and stored relative to float16 values for the super-block. Other K-quant types vary the sub-block size and the precision of the sub-block scales. This two-level encoding compresses the scale metadata while preserving enough precision to localize the quantization range.
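A two-level scale hierarchy can be sketched as follows. This is a deliberately simplified, symmetric variant with hypothetical names: it omits the per-sub-block minimums and the bit packing that llama.cpp's K-quant formats use, and keeps only the idea of 6-bit sub-block scale codes stored relative to one super-block scale.

```python
def quantize_super_block(weights, sub_size=32, weight_bits=4, scale_bits=6):
    """Return (super_scale, scale_codes, weight_codes) for one super-block."""
    qmax = 2 ** (weight_bits - 1) - 1  # 7 for signed 4-bit codes
    code_max = 2 ** scale_bits - 1     # 63 for 6-bit scale codes
    blocks = [weights[i:i + sub_size] for i in range(0, len(weights), sub_size)]

    # Level 1: the scale each sub-block would like to have on its own.
    ideal_scales = [max(abs(v) for v in b) / qmax for b in blocks]
    # Level 2: one float scale for the super-block, sized to the largest sub-scale.
    super_scale = max(ideal_scales) / code_max or 1.0

    # Each sub-block scale is stored as a small integer multiple of super_scale.
    scale_codes = [min(code_max, max(1, round(s / super_scale)))
                   for s in ideal_scales]
    weight_codes = [
        [max(-qmax, min(qmax, round(v / (code * super_scale)))) for v in block]
        for block, code in zip(blocks, scale_codes)
    ]
    return super_scale, scale_codes, weight_codes


def dequantize_super_block(super_scale, scale_codes, weight_codes):
    return [q * code * super_scale
            for code, block in zip(scale_codes, weight_codes) for q in block]


# 256 hypothetical weights in [-1, 1), with one outlier in the first sub-block.
weights = [((i * 37) % 200 - 100) / 100 for i in range(256)]
weights[5] = 100.0

super_scale, scale_codes, weight_codes = quantize_super_block(weights)
restored = dequantize_super_block(super_scale, scale_codes, weight_codes)

print("sub-block scale codes:", scale_codes)
print(f"super-block scale: {super_scale:.4f}")
```

The outlier's sub-block takes the largest scale code while the other seven fall to the smallest, so their weights are quantized on a much finer grid. The 6-bit scale codes also show the cost of the scheme: the other sub-blocks get the smallest representable scale, which is still somewhat coarser than the scale they would pick on their own.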

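Q4_K_M, the format most people reach for first, goes one step further by mixing types within a single file: it uses Q6_K for a subset of the tensors most sensitive to precision loss (in llama.cpp’s original scheme, half of the attention value and feed-forward down-projection tensors) and Q4_K for the rest. The “M” stands for “medium”, as opposed to the smaller “S” (small) variant of the same mix. A 7B parameter model in Q4_K_M format is roughly 4.1 to 4.4 GB, depending on the exact parameter count; the float16 version is about 14 GB.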

Here is how the quantization types roughly compare in bits per weight and relative quality for a typical 7B model:

Format    Bits/weight   File size (7B)   Quality vs F16
F16       16            ~14 GB           baseline
Q8_0      8.5           ~7.7 GB          near-lossless
Q6_K      6.6           ~6.0 GB          excellent
Q5_K_M    5.7           ~5.1 GB          very good
Q4_K_M    4.8           ~4.4 GB          good
Q3_K_M    3.9           ~3.5 GB          acceptable
Q2_K      3.4           ~3.1 GB          noticeable loss

Perplexity on standard benchmarks like WikiText-2 climbs slowly from Q8_0 down to Q4_K_M, then more steeply below that. Q4_K_M tends to sit at the knee of that curve for most use cases: going smaller costs noticeably more quality for each gigabyte saved.
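The file sizes in the table follow from the bits-per-weight column by simple arithmetic. The sketch below assumes a parameter count of about 7.24 billion, which is not stated in the article and is chosen here only because it roughly reproduces the table's sizes; it also ignores file metadata.

```python
def file_size_gb(n_params, bits_per_weight):
    """Approximate file size in decimal gigabytes, ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7.24e9  # assumption: not stated in the article

bits_per_weight = {
    "F16": 16, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
    "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 3.4,
}

sizes = {name: file_size_gb(N_PARAMS, bpw)
         for name, bpw in bits_per_weight.items()}

for name, size in sizes.items():
    print(f"{name:8s} {size:5.1f} GB")
```

A model marketed as "7B" with fewer parameters will come out proportionally smaller at the same bits per weight.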

Calibration Data and the Offline/Online Split

GPTQ and AWQ both require a calibration dataset to estimate which weights and channels matter. This creates a subtle dependency: the quality of your quantized model depends on how representative that calibration data is of your actual workload. Quantize a code model with general web text and you may see more quality loss on coding tasks than the perplexity numbers suggest.

llama.cpp’s K-quant approach avoids calibration entirely, working only from the weight statistics of the model itself. This makes it more general but means it cannot adapt to task-specific weight importance.

The newer GGUF importance matrix feature adds optional calibration back in: you provide a sample dataset, llama.cpp computes an importance score for each weight, and the quantizer uses those scores to prioritize precision where it matters. This is conceptually similar to AWQ’s per-channel scaling, but applied at the llama.cpp level.

What This Means When You Are Choosing a Format

For local inference with llama.cpp or Ollama, Q4_K_M is a reasonable default for models up to 13B if you are RAM-constrained. Q5_K_M gives a noticeable quality improvement for about 15% more memory. Q8_0 is worth using if you can fit it, because the gap from Q8_0 to F16 is small enough to ignore for most tasks.

For GPU inference via the Transformers ecosystem, AWQ and GPTQ at 4-bit are both solid options. AWQ tends to be faster to quantize and roughly equivalent in output quality. If you are fine-tuning rather than just running inference, QLoRA (which uses bitsandbytes NF4 quantization plus double quantization of the quantization constants themselves) lets you fine-tune a 7B model on a 16 GB GPU.

The underlying math, as Willison’s walkthrough makes clear, is a few lines of arithmetic. The engineering sitting on top of that arithmetic, the calibration loops, the hierarchical scales, the mixed-precision decompositions, exists almost entirely because transformers learned to concentrate information in outlier dimensions that naive quantization destroys. Understanding that one structural fact about transformer activations gives you a cleaner mental model of why each tool in the ecosystem makes the choices it does.
