
The Outlier Problem That Reshaped How We Quantize LLMs

Source: simonwillison

Simon Willison published a detailed walkthrough of quantization from the ground up that covers the arithmetic clearly: scale factors, zero points, block quantization. It is a good foundation. What it does not fully address is why the engineering had to evolve the way it did, and why the GGUF format names you see on Hugging Face encode a surprisingly non-trivial set of design decisions. This post picks up from those first principles and traces the path to where we are now.

The Basic Arithmetic and Why It Breaks

Quantization reduces weight precision by fitting a range of floating-point values into a smaller set of integers, storing a scale factor to recover approximate originals at inference time. For symmetric INT8:

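import numpy as np

scale = np.abs(weights).max() / 127  # largest magnitude maps to ±127
quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale  # approximate originals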

The reconstruction error per weight is bounded by scale / 2. Wide value range means large scale means coarse grid means high error. That relationship is linear in theory and catastrophic in practice once you cross a certain model size.
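As a quick check on that bound, here is a minimal sketch that round-trips a synthetic weight vector through symmetric INT8 and compares the worst-case error to scale / 2. The Gaussian weights and the seed are illustrative assumptions, not values from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=1024)  # synthetic stand-in for a weight tensor

scale = np.abs(weights).max() / 127
quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = quantized * scale

# Rounding moves each value by at most half a grid step.
max_err = np.abs(dequantized - weights).max()
print(f"scale={scale:.6f}  max_err={max_err:.6f}  bound={scale / 2:.6f}")
```

Because the scale is set from the largest magnitude, nothing is clipped here, and the only error source is rounding to the nearest grid point.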

For INT4, the quantization grid has only 16 levels instead of 256. The error tolerance is much tighter. In 2019 and 2020, per-tensor 4-bit quantization on models under a few hundred million parameters worked reasonably well. The quality loss was manageable. Then the scale of models increased, and something unexpected happened inside the weight distributions.

Emergent Outliers at 6.7B Parameters

In 2022, Tim Dettmers, Luke Zettlemoyer, and colleagues published LLM.int8(), which documented a phenomenon in models at and above 6.7 billion parameters: a small fraction of weight and activation dimensions, typically fewer than 0.1% of all channels, carry values 10 to 100 times larger than everything else in the tensor. These are not noise or training artifacts. They encode semantically important computations; removing or heavily distorting them collapses model quality.

The problem for quantization is geometric. Suppose a weight tensor has values with absolute magnitude around 0.01, except for one outlier at 8.5. Per-tensor symmetric quantization must set its scale to accommodate that outlier: scale = 8.5 / 127 ≈ 0.067. Every other weight then gets rounded to the nearest multiple of 0.067, and since 0.01 / 0.067 ≈ 0.15, nearly all of the small weights, which carry the actual learned information, round to zero. Mean squared error increases by several orders of magnitude compared to the same tensor without the outlier. And in large models, these outlier dimensions appear in every layer.
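The effect is easy to reproduce. The sketch below quantizes the same synthetic tensor with and without a single 8.5 outlier; the uniform distribution on [-0.02, 0.02] is an assumed stand-in for "magnitudes around 0.01", and the exact error ratio depends on that choice.

```python
import numpy as np

def int8_roundtrip(w):
    """Per-tensor symmetric INT8: return integer codes and mean squared error."""
    scale = np.abs(w).max() / 127
    q = np.clip(np.round(w / scale), -127, 127)
    return q, float(np.mean((q * scale - w) ** 2))

rng = np.random.default_rng(0)
small = rng.uniform(-0.02, 0.02, size=4096)  # assumed: magnitudes around 0.01
spiked = small.copy()
spiked[0] = 8.5                              # the single outlier from the text

_, mse_clean = int8_roundtrip(small)
q_spiked, mse_spiked = int8_roundtrip(spiked)

print("nonzero codes among small weights:", np.count_nonzero(q_spiked[1:]))
print("MSE ratio (with / without outlier):", mse_spiked / mse_clean)
```

With this distribution every small weight lands on integer code 0 once the outlier sets the scale, so the dequantized tensor keeps the outlier and nothing else.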

Naive INT8 already struggled with this at large model sizes. INT4 was essentially unusable for serious inference on 7B-plus models without a structural fix.

Block Quantization: Containing the Damage

The structural fix is block (or group) quantization: instead of one scale per entire weight matrix, compute a separate scale for every contiguous block of 32 or 128 weights. An outlier in one block inflates only that block’s scale, leaving adjacent blocks unaffected.

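import numpy as np

def quantize_blocked(weights, group_size=32, bits=4):
    """Symmetric absmax quantization with one scale per block of weights."""
    weights = np.asarray(weights, dtype=np.float32)
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scales = []
    quants = []
    for i in range(0, len(weights), group_size):
        g = weights[i:i + group_size]
        s = float(np.abs(g).max()) / qmax
        if s == 0.0:
            s = 1.0  # all-zero block: any scale reproduces it exactly
        scales.append(s)
        quants.extend(np.clip(np.round(g / s), -qmax, qmax).astype(np.int8))
    return quants, scales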
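To see the containment effect in numbers, here is a small self-contained comparison of one scale for the whole tensor against one scale per 32 weights. The helper name `dequant_error` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def dequant_error(weights, group_size, bits=4):
    """MSE after symmetric absmax quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1
    g = weights.reshape(-1, group_size)
    s = np.abs(g).max(axis=1, keepdims=True) / qmax
    s = np.where(s > 0, s, 1.0)              # guard all-zero groups
    q = np.clip(np.round(g / s), -qmax, qmax)
    return float(np.mean((q * s - g) ** 2))

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)
w[100] = 8.5                                 # one outlier, in block 3 of 128

per_tensor = dequant_error(w, group_size=w.size)
per_block = dequant_error(w, group_size=32)
print(f"per-tensor MSE={per_tensor:.3e}  per-block MSE={per_block:.3e}")
```

Only the block that contains the outlier pays for it; the other 127 blocks keep a scale that fits their own values.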

Storing one FP32 scale per 32 weights adds 32 / 32 = 1 extra bit per weight, which would bring 4-bit quantization to 5 effective bits. In practice, Q4_0 stores its scale as FP16, so the overhead is 16 / 32 = 0.5 bits and a Q4 format with group-32 uses about 4.5 effective bits per weight. That overhead is the price of containing outlier damage.
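The overhead arithmetic reduces to the width of the scale divided by the number of weights that share it. A tiny helper (the name `effective_bits` is made up for this sketch) makes the trade-off explicit:

```python
def effective_bits(weight_bits, group_size, scale_bits):
    """Bits per weight once the per-group scale is amortized over the group."""
    return weight_bits + scale_bits / group_size

print(effective_bits(4, 32, 32))   # FP32 scale per 32 weights
print(effective_bits(4, 32, 16))   # FP16 scale per 32 weights
print(effective_bits(4, 128, 16))  # larger groups amortize the scale further
```

Larger groups cost less metadata but let a single outlier distort more neighbors, which is the tension the rest of the format design works around.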

This is the foundation that llama.cpp’s GGUF format builds on. But the original Q4_0 format, which used absmax (no stored minimum, forced zero point at center) with group-32, still had measurable quality degradation compared to FP16. The question was how to extract more quality from the same bit budget without simply using more bits.

K-Quants: Hierarchical Scale Storage

K-quants, contributed to llama.cpp in mid-2023 by Iwan Kawrakow, introduced a two-level structure. Rather than storing one floating-point scale per block, Q4_K divides a super-block of 256 weights into 8 sub-blocks of 32 weights each. Each sub-block gets a 6-bit integer scale and a 6-bit integer minimum. Two FP16 values per super-block, one for the scales and one for the minimums, convert those 6-bit codes back to real values.

The storage math: 256 weights at 4 bits is 128 bytes. The metadata adds 8 sub-blocks * (6 + 6) bits + 2 * 16 bits FP16 = 128 bits = 16 bytes. Total: 144 bytes for 256 weights, or 4.5 effective bits per weight. That is the same size as eight flat Q4_0 blocks, but each sub-block now carries both a scale and an offset, so its 16 levels can follow an asymmetric weight range instead of being centered on zero. The quality improvement over Q4_0 is substantial, not because more bits are used, but because the metadata bits are allocated more intelligently.
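The two-level idea can be sketched in a few lines. This is a simplified illustration, not llama.cpp's actual layout or bit packing: it handles scales only, ignores any per-sub-block minimums, and the function names are hypothetical.

```python
import numpy as np

def quantize_superblock(weights, sub_size=32, weight_bits=4, scale_bits=6):
    """Two-level symmetric quantization of one super-block (scales only)."""
    w = np.asarray(weights, dtype=np.float64).reshape(-1, sub_size)
    qmax = 2 ** (weight_bits - 1) - 1           # 7 for 4-bit weights
    smax = 2 ** scale_bits - 1                  # 63 for 6-bit scale codes

    ideal = np.abs(w).max(axis=1) / qmax        # float scale each sub-block wants
    d = float(np.float16(ideal.max() / smax))   # one FP16 super-scale
    if d == 0.0:
        d = 1.0                                 # all-zero (or underflowing) block
    codes = np.clip(np.round(ideal / d), 0, smax).astype(np.uint8)

    eff = d * codes                             # scales available at decode time
    safe = np.where(eff > 0, eff, 1.0)[:, None]
    q = np.clip(np.round(w / safe), -qmax, qmax).astype(np.int8)
    return q, codes, d

def dequantize_superblock(q, codes, d):
    """Rebuild weights from 4-bit codes, 6-bit scale codes, and the super-scale."""
    return q * (d * codes)[:, None]

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=256)
q, codes, d = quantize_superblock(w)
w_hat = dequantize_superblock(q, codes, d).reshape(-1)
print("scale codes:", codes, " super-scale:", d)
```

Each sub-block still gets its own scale, but that scale costs 6 bits instead of 16 or 32 because it is expressed as a multiple of the shared super-scale.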

The Q4_K_M Mixed-Precision Trick

The _M suffix in Q4_K_M (for medium) denotes a mixed-precision strategy across tensors. Not all weight matrices in a transformer are equally sensitive to quantization error. In llama.cpp’s recipe, the attention value projections and feed-forward down projections are treated as the sensitive ones, because they degrade more visibly when quantized aggressively; the remaining tensors are more robust.

Q4_K_M promotes those sensitive layers to Q6_K while keeping the bulk of parameters at Q4_K. The resulting model averages around 4.8 bits per weight, not the clean 4.0 the format name implies. That 0.8 bits of additional overhead buys a perplexity improvement that is disproportionate to the extra storage. On WikiText-2, a 7B model quantized to Q4_K_M lands around 5.78 perplexity versus 5.90 for Q4_0 at comparable file size, and 5.68 for FP16. The gap between Q4_K_M and FP16 is smaller than the gap between Q4_K_M and Q4_0.

Where the Cliff Is and Why

The practical perplexity table for a 7B model on WikiText-2 looks roughly like this:

Format     Avg bits   Approx size   WikiText-2 PPL   Delta from FP16
FP16       16.0       14 GB         5.68             0.00
Q8_0       8.5        7.7 GB        5.69             +0.01
Q6_K       6.6        5.5 GB        5.72             +0.04
Q5_K_M     5.7        4.8 GB        5.74             +0.06
Q4_K_M     4.8        4.4 GB        5.78             +0.10
Q3_K_M     3.9        3.3 GB        6.13             +0.45
Q2_K       3.4        2.9 GB        7.05             +1.37

The cliff is sharp between Q4_K_M and Q3_K_M. Quality loss above 4 bits is gradual enough to be negligible for most tasks. Below 4 bits, the quantization grid becomes too coarse to represent weight distributions faithfully even with block structure and hierarchical scales. At Q3, the information capacity of the integer grid is hitting a floor that algorithmic improvements can only partially compensate for.
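One way to make the cliff concrete is to compute how much perplexity each removed bit costs between adjacent rows of the table. The sketch below uses only the figures quoted above; the helper name `marginal_cost` is an assumption.

```python
# (format, average bits per weight, WikiText-2 perplexity) from the table above
rows = [
    ("FP16", 16.0, 5.68),
    ("Q8_0", 8.5, 5.69),
    ("Q6_K", 6.6, 5.72),
    ("Q5_K_M", 5.7, 5.74),
    ("Q4_K_M", 4.8, 5.78),
    ("Q3_K_M", 3.9, 6.13),
    ("Q2_K", 3.4, 7.05),
]

def marginal_cost(rows):
    """Perplexity added per bit removed between adjacent formats."""
    out = {}
    for (a, bits_a, ppl_a), (b, bits_b, ppl_b) in zip(rows, rows[1:]):
        out[f"{a} -> {b}"] = (ppl_b - ppl_a) / (bits_a - bits_b)
    return out

costs = marginal_cost(rows)
for step, cost in costs.items():
    print(f"{step:<18} {cost:.3f} PPL per bit")
```

The per-bit cost stays small down to Q4_K_M and then jumps by close to an order of magnitude on the step to Q3_K_M.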

Perplexity undersells the damage in specific cases. Long-context generation accumulates small errors across many forward passes; multi-step arithmetic chains amplify rounding noise; code generation degrades faster than prose. Instruction-tuned models are generally more robust than base models at the same quantization level, probably because fine-tuning redistributes weight magnitudes toward more quantization-friendly distributions.

Calibration-Based Methods: GPTQ and AWQ

Block quantization and K-quants need no data beyond the weights themselves. GPTQ is also post-training, but it introduced calibration: a small set of representative text samples, typically 128 sequences of 2048 tokens, is run through the model during quantization. Second-order information, in the form of the Hessian of each layer’s output reconstruction error computed from those calibration inputs, is used to compensate for rounding errors column by column within each weight matrix. When one weight is rounded, the remaining unquantized weights in the same row are adjusted to partially cancel the introduced error.
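The compensation step can be sketched on a toy layer. What follows is a simplified, unoptimized version of the idea and not the GPTQ implementation itself: it keeps an explicit inverse Hessian and updates it one column at a time, whereas the real method uses a Cholesky factorization and batched updates. The function names, the fixed per-row grid, and the synthetic calibration data are all assumptions.

```python
import numpy as np

def rtn_rows(W, scale, qmax):
    """Plain round-to-nearest on a fixed per-row grid."""
    return np.clip(np.round(W / scale), -qmax, qmax) * scale

def gptq_like(W, X, bits=4, damp=0.01):
    """Quantize W (out x in) column by column with error compensation.

    X holds calibration inputs with shape (in, n_samples).
    """
    W = np.array(W, dtype=np.float64)            # working copy, updated as we go
    n_in = W.shape[1]
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale > 0, scale, 1.0)

    H = X @ X.T                                  # Hessian of ||WX - QX||^2 up to a constant
    H = H + damp * np.mean(np.diag(H)) * np.eye(n_in)  # damping for stability
    Hinv = np.linalg.inv(H)

    Q = np.zeros_like(W)
    for i in range(n_in):
        col = W[:, i:i + 1]
        Q[:, i:i + 1] = rtn_rows(col, scale, qmax)
        err = (col - Q[:, i:i + 1]) / Hinv[i, i]
        # Push the rounding error onto the columns not yet quantized.
        W[:, i + 1:] -= err @ Hinv[i:i + 1, i + 1:]
        # Remove column i from the inverse Hessian (one elimination step).
        Hinv[i + 1:, i + 1:] -= np.outer(Hinv[i + 1:, i], Hinv[i, i + 1:]) / Hinv[i, i]
    return Q, scale

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(8, 32))
common = rng.normal(size=(1, 512))
X = common + 0.3 * rng.normal(size=(32, 512))    # strongly correlated input channels

Q, scale = gptq_like(W, X)
Q_rtn = rtn_rows(W, scale, 2 ** 3 - 1)
err_gptq = np.linalg.norm(Q @ X - W @ X)
err_rtn = np.linalg.norm(Q_rtn @ X - W @ X)
print(f"output error  round-to-nearest={err_rtn:.4f}  compensated={err_gptq:.4f}")
```

Both results live on the same 4-bit grid; the difference is only in which grid point each weight is assigned, chosen so that errors cancel in the layer output rather than in the weights.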

AWQ takes a different route. Rather than computing Hessians, it identifies the small fraction of weight channels that correspond to large input activations and multiplies a per-channel scale into those channels before quantization. Salient weights occupy more of the integer range and get finer representation; the scale is divided back out at inference by absorbing it into adjacent normalization layers at zero additional runtime cost.
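A stripped-down version of that scaling search looks like the sketch below. It is an illustration under assumptions, not the AutoAWQ code: the scale is the mean absolute activation raised to a power alpha chosen by grid search, the inverse scale is applied to the weights directly rather than folded into a preceding layer, and the names `rtn_grouped` and `awq_like` are made up here.

```python
import numpy as np

def rtn_grouped(W, bits=4, group=32):
    """Round-to-nearest with one absmax scale per group of input channels."""
    qmax = 2 ** (bits - 1) - 1
    out_f, in_f = W.shape
    G = W.reshape(out_f, in_f // group, group)
    s = np.abs(G).max(axis=2, keepdims=True) / qmax
    s = np.where(s > 0, s, 1.0)
    return (np.clip(np.round(G / s), -qmax, qmax) * s).reshape(out_f, in_f)

def awq_like(W, X, bits=4, group=32, n_grid=11):
    """Search a per-input-channel scale act**alpha that minimizes output error."""
    act = np.maximum(np.abs(X).mean(axis=1), 1e-8)  # mean |activation| per channel
    ref = W @ X
    best_err, best_alpha, best_W = np.inf, None, None
    for alpha in np.linspace(0.0, 1.0, n_grid):     # alpha = 0 is plain rounding
        s = act ** alpha
        W_hat = rtn_grouped(W * s, bits, group) / s  # scale up, quantize, scale back
        err = float(np.linalg.norm(W_hat @ X - ref))
        if err < best_err:
            best_err, best_alpha, best_W = err, float(alpha), W_hat
    return best_W, best_alpha, best_err

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(16, 64))
X = rng.normal(size=(64, 256))
X[[3, 17, 40]] *= 30.0                              # a few high-activation channels

W_awq, alpha, err_awq = awq_like(W, X)
err_rtn = float(np.linalg.norm(rtn_grouped(W) @ X - W @ X))
print(f"alpha={alpha:.1f}  rtn error={err_rtn:.3f}  scaled error={err_awq:.3f}")
```

Because alpha = 0 reproduces plain rounding, the search can never do worse than the baseline on the calibration data; any gain comes from giving the high-activation channels a finer share of the grid.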

Both methods generally outperform K-quants at the same bit width, particularly below 4 bits. The tradeoff is that they require calibration data and GPU resources during the quantization pass, and the resulting files are less portable than GGUF. Libraries like AutoGPTQ and AutoAWQ handle the tooling, but the workflow is more involved than downloading a pre-quantized GGUF from Hugging Face.

For CPU and edge inference via llama.cpp, the newer importance matrix quantization (IQ-quants) in GGUF takes a calibration-informed approach within the GGUF ecosystem. Users can generate an importance matrix from domain-specific data and use it to produce quantizations that assign precision where it matters for their workload. IQ4_XS matches or exceeds Q4_K_M quality in a roughly 8% smaller file.

The Paradigm Shift: Training for Quantization

All of the above is post-training quantization: take a model trained in full precision, apply a compression scheme afterward. Microsoft’s BitNet b1.58 represents a different approach. Weights are constrained to ternary values {-1, 0, +1} during training itself, with quantization as a first-class training objective rather than a post-hoc compression step.

Matrix multiplication degenerates to additions and subtractions, eliminating floating-point multiply operations entirely. A 2B parameter model trained on 4 trillion tokens runs on a Raspberry Pi 5 at usable speeds. Energy consumption at 7B scale is roughly 71% lower than FP16 equivalent.
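The arithmetic claim can be illustrated with a few lines of NumPy. This sketch ternarizes random weights after the fact purely to show the computation, whereas BitNet learns under the constraint during training; the absmean-style scaling and both function names are assumptions made for the example.

```python
import numpy as np

def ternarize(W, eps=1e-8):
    """Map weights to {-1, 0, +1} with a single absmean scale (assumed scheme)."""
    gamma = np.abs(W).mean() + eps
    Wt = np.clip(np.round(W / gamma), -1, 1).astype(np.int8)
    return Wt, gamma

def ternary_matvec(Wt, gamma, x):
    """Compute gamma * (Wt @ x) using only sums over selected inputs."""
    pos = np.where(Wt == 1, x, 0.0).sum(axis=1)    # add inputs where weight is +1
    neg = np.where(Wt == -1, x, 0.0).sum(axis=1)   # subtract where weight is -1
    return gamma * (pos - neg)                     # one scale multiply per output

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(8, 64))
x = rng.normal(size=64)

Wt, gamma = ternarize(W)
y = ternary_matvec(Wt, gamma, x)
y_ref = gamma * (Wt.astype(np.float64) @ x)
print("max difference vs. ordinary matmul:", np.abs(y - y_ref).max())
```

The inner loop never multiplies a weight by an activation; the only multiplication left is the single rescale by gamma at the end of each output.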

The significant constraint is that BitNet models must be trained from scratch. Existing model weights, representing years of compute and fine-tuning, cannot be converted. The technical case for ternary-weight training is strong; the ecosystem case is weak. The GGUF universe represents thousands of community fine-tunes, specialized models, and domain-specific checkpoints built on top of base models trained at enormous expense. BitNet starts that ecosystem from zero.

FP8 hardware quantization on NVIDIA H100 with E4M3 and E5M2 formats sits in a different category: near-lossless server inference at roughly 2x the throughput of BF16, handled transparently by frameworks like vLLM and TensorRT-LLM. The hardware availability limits this to high-end server deployments, but FP8 activation quantization is now standard in production deployments at most large providers.

The Practical Takeaway

For running models locally, Q4_K_M is the sensible default: roughly 3.3x memory reduction from FP16, perplexity within noise for conversational and writing tasks, well-supported across hardware. Running a 70B model at Q4_K_M on a 64GB Mac Studio will outperform a 7B at FP16 on nearly any task that benefits from model capacity. Bit width matters less than how much total model quality fits in the memory budget.
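A back-of-the-envelope helper shows why the 70B comparison works out. It counts weight storage only, using the average bits per weight quoted earlier, and ignores the KV cache, activations, and runtime overhead, so treat the result as a lower bound. The function name is an assumption.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes); weights only."""
    return n_params * bits_per_weight / 8 / 1e9

print(weight_memory_gb(70e9, 4.8))   # 70B at Q4_K_M's average bit width
print(weight_memory_gb(7e9, 16.0))   # 7B at FP16
```

The 70B weights at 4.8 bits fit inside a 64GB budget with room left for context, which is what makes the larger quantized model the better use of the same machine.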

The format names encode real engineering. Q4_0 and Q4_K_M occupy similar file sizes but different quality levels because block structure, hierarchical scale storage, and mixed-layer precision add up. Understanding what the format names mean makes it easier to reason about what you are actually trading when you pick a quantization level for deployment.
