The Math Behind Model Quantization: Why Cutting Bits Doesn't Mean Cutting Corners

Source: simonwillison

Running a large language model locally used to mean either owning a server rack or settling for a toy-sized model. Quantization changed that. A 7B parameter model in full 32-bit precision needs roughly 28 GB of RAM. The same model quantized to 4 bits fits in under 5 GB. That difference is what makes local inference practical on consumer hardware, which is why Simon Willison’s recent deep-dive into quantization from first principles is worth paying close attention to, even if you think you already understand the concept.
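The arithmetic behind those figures is easy to check. The sketch below counts weight storage only, so it ignores the KV cache, activations, and the per-block metadata that real formats add; the helper name model_size_gb is an illustrative choice, not part of any library.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# A 7B-parameter model at several precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2} bits/weight: {model_size_gb(7e9, bits):5.1f} GB")
```

Real 4-bit files come out somewhat larger than the raw 3.5 GB because each block of weights also stores its own scale.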

The post walks through the mechanics from scratch, and that ground-up approach reveals something important: quantization is not just compression. It is a lossy numeric transformation with well-defined error characteristics, and understanding those characteristics is what separates informed model selection from guesswork.

Floating-Point Basics

Before you can understand quantization, you need a clear picture of what you are quantizing away from. Modern neural network weights are typically stored as 32-bit floats (FP32) or 16-bit floats (FP16/BF16). A 32-bit float follows the IEEE 754 standard: 1 sign bit, 8 exponent bits, and 23 mantissa bits. This gives a dynamic range spanning roughly 1.2e-38 to 3.4e+38, with about 7 decimal digits of precision.

FP16 cuts that to 1 sign bit, 5 exponent bits, and 10 mantissa bits. The dynamic range collapses dramatically, which is why training in FP16 requires loss scaling to avoid underflow. BF16 (Brain Float 16), introduced by Google, takes a different tradeoff: it keeps the 8 exponent bits from FP32 but reduces the mantissa to 7 bits. This preserves the dynamic range at the cost of precision, which works well for neural networks because weights tend to be distributed near zero with occasional large outliers.
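To make the range-versus-precision tradeoff concrete, here is a small sketch using only Python's standard struct module. The BF16 conversion is an assumption-laden shortcut: it truncates the FP32 bit pattern, whereas real converters usually round to nearest.

```python
import struct


def fp32_fields(x: float) -> tuple[int, int, int]:
    """Split a value (rounded to FP32) into sign, exponent, and mantissa fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def to_fp16(x: float) -> float:
    """Round-trip through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack(">e", struct.pack(">e", x))[0]


def to_bf16(x: float) -> float:
    """Approximate BF16 by keeping the top 16 bits of the FP32 encoding.

    Assumption: plain truncation, not round-to-nearest.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


print(fp32_fields(1.0))   # sign 0, biased exponent 127, mantissa 0
tiny = 1e-10              # a very small, gradient-sized value
print(to_fp16(tiny))      # too small for FP16's 5-bit exponent
print(to_bf16(tiny))      # survives in BF16, with fewer significant digits
```

The tiny value is lost in FP16 but kept, coarsely, in BF16, which is the underflow problem that loss scaling works around.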

The key insight here is that neural network weights are not random. They cluster. Most values in a transformer’s weight matrices sit in a fairly narrow numeric range, which is exactly what makes quantization viable.

Affine Quantization: The Core Mechanism

The dominant approach to quantization maps a range of floating-point values to a fixed set of integers. For 8-bit quantization, that means mapping to the range [0, 255] (unsigned) or [-128, 127] (signed). The mapping is affine:

q = round(x / scale + zero_point)
x_approx = (q - zero_point) * scale

Here, scale is a floating-point scalar that stretches or compresses the value range, and zero_point is an integer offset that shifts where zero lands in the quantized domain. Computing these two parameters requires knowing the min and max of the values you are quantizing.

For a weight matrix, you might compute:

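import torch

def quantize_tensor(x, bits=8):
    # Assumes x is a torch.Tensor. Signed range, e.g. [-128, 127] for 8 bits.
    qmin = -(2 ** (bits - 1))
    qmax = 2 ** (bits - 1) - 1
    x_min, x_max = x.min().item(), x.max().item()
    # Guard against a constant tensor, which would otherwise give scale == 0.
    scale = max((x_max - x_min) / (qmax - qmin), 1e-12)
    zero_point = round(qmin - x_min / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    # Round to the nearest integer level and clamp into the representable range.
    q = (x / scale + zero_point).round().clamp(qmin, qmax).to(torch.int32)
    return q, scale, zero_point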

The reconstruction error is bounded by scale / 2 per element, which means wider value ranges produce more quantization error. This is why outlier weights are a significant concern: a few extreme values force a large scale factor, which degrades precision for all the normal values.
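The effect of a single outlier can be seen with a few lines of plain Python. This is a minimal sketch that applies the same affine formulas to a list of floats; the weight values and the 8.0 outlier are made up for illustration.

```python
def affine_quantize(values, bits=8):
    """Quantize a list of floats with one scale and zero-point (per-tensor)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
    q = [max(qmin, min(qmax, round(v / scale + zero_point))) for v in values]
    return q, scale, zero_point


def max_error(values, bits=8):
    """Return (largest reconstruction error, scale) for the given values."""
    q, scale, zero_point = affine_quantize(values, bits)
    approx = [(qi - zero_point) * scale for qi in q]
    return max(abs(a - v) for a, v in zip(approx, values)), scale


weights = [-0.5 + i / 100 for i in range(101)]    # evenly spread over [-0.5, 0.5]
err, scale = max_error(weights)
err_out, scale_out = max_error(weights + [8.0])   # same weights plus one outlier

print(f"no outlier:   scale={scale:.5f}  max error={err:.5f}")
print(f"with outlier: scale={scale_out:.5f}  max error={err_out:.5f}")
```

In both runs the error stays within scale / 2, but the outlier makes the scale 8.5 times larger, so every ordinary weight is stored on a much coarser grid.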

Symmetric quantization simplifies this by forcing zero_point = 0 and mapping to a symmetric integer range. This is slightly less accurate but faster at inference because the zero-point offset arithmetic disappears.
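A symmetric variant might look like the following sketch. The function names and sample values are illustrative, and using the range [-127, 127] is one common convention rather than the only one.

```python
def symmetric_quantize(values, bits=8):
    """Symmetric quantization: zero_point is fixed at 0, range is [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Reconstruction needs only a multiply, with no offset to subtract."""
    return [qi * scale for qi in q]


centered = [-1.0, -0.25, 0.0, 0.5, 1.0]
skewed = [0.0, 0.1, 0.4, 0.7, 1.0]       # all non-negative values

q_c, s_c = symmetric_quantize(centered)
q_s, s_s = symmetric_quantize(skewed)
print(q_c, s_c)
print(q_s, s_s)   # only non-negative codes are used; the negative half sits idle

# An affine scheme would spread all 256 levels over the skewed range instead.
affine_scale = (max(skewed) - min(skewed)) / 255
print(s_s / affine_scale)   # ratio of step sizes, symmetric vs affine
```

For roughly zero-centered weights the two schemes behave almost identically, which is why the accuracy cost of the symmetric form is usually small.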

Per-Tensor vs Per-Channel vs Per-Block

Using a single scale and zero-point for an entire weight matrix (per-tensor quantization) is the simplest approach but produces the most error. Each row or column of a matrix can have a very different distribution, and forcing them all onto the same scale is wasteful.

Per-channel quantization assigns separate scale and zero-point values to each output channel of a weight matrix. This is the standard approach in frameworks like PyTorch’s quantization toolkit and improves accuracy significantly at a small storage overhead, since you store only one scale (and, for asymmetric schemes, one zero-point) per channel.
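The difference shows up clearly on a matrix whose rows differ in magnitude. The sketch below uses a hypothetical 2x4 matrix and symmetric 8-bit quantization; it illustrates the idea and does not reproduce any framework's API.

```python
def quantize_row(row, scale, bits=8):
    """Symmetric round-trip of one row with a given scale; returns the approximation."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in row]


def max_error(row, approx):
    """Largest absolute difference between a row and its reconstruction."""
    return max(abs(a - v) for a, v in zip(approx, row))


# Hypothetical weight matrix whose output channels differ 100x in magnitude.
matrix = [
    [0.010, -0.004, 0.007, -0.009],   # small-magnitude channel
    [1.000, -0.600, 0.300, -0.800],   # large-magnitude channel
]
qmax = 127

# Per-tensor: one scale derived from the largest value anywhere in the matrix.
tensor_scale = max(abs(v) for row in matrix for v in row) / qmax
per_tensor_err = [max_error(r, quantize_row(r, tensor_scale)) for r in matrix]

# Per-channel: each row gets its own scale.
channel_scales = [max(abs(v) for v in row) / qmax for row in matrix]
per_channel_err = [
    max_error(r, quantize_row(r, s)) for r, s in zip(matrix, channel_scales)
]

print("per-tensor  max error per row:", per_tensor_err)
print("per-channel max error per row:", per_channel_err)
```

The large channel is unaffected, while the small channel's error drops by roughly two orders of magnitude once it has its own scale.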

GGUF, the format used by llama.cpp, takes this further with block quantization. Rather than quantizing per-channel, it divides each weight tensor into small blocks (typically 32 values) and computes separate quantization parameters per block. This gives much finer granularity without the memory overhead of per-weight scales.
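Block quantization can be sketched the same way. The code below is a simplified illustration, not llama.cpp's actual storage layout: it assumes symmetric 4-bit values, blocks of 32, and one FP16 scale per block when counting storage.

```python
BLOCK_SIZE = 32   # block length mentioned above
BITS = 4


def quantize_blocks(values, block_size=BLOCK_SIZE, bits=BITS):
    """Symmetric quantization with one scale per block; returns the reconstruction."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) / qmax or 1.0   # avoid 0 for all-zero blocks
        out.extend(max(-qmax, min(qmax, round(v / scale))) * scale for v in block)
    return out


def bits_per_weight(block_size=BLOCK_SIZE, bits=BITS, scale_bits=16):
    """Storage cost per weight, assuming one FP16 scale per block."""
    return (block_size * bits + scale_bits) / block_size


# 64 small weights, with one large outlier placed in the second block.
weights = [0.01 * ((i % 7) - 3) for i in range(64)]
weights[40] = 5.0

per_tensor = quantize_blocks(weights, block_size=len(weights))   # a single block
per_block = quantize_blocks(weights)


def first_block_error(approx):
    """Largest error among the weights that do not share a block with the outlier."""
    return max(abs(a - w) for a, w in zip(approx[:BLOCK_SIZE], weights[:BLOCK_SIZE]))


print("first-block max error, one scale:  ", first_block_error(per_tensor))
print("first-block max error, block scale:", first_block_error(per_block))
print("bits per weight:", bits_per_weight())
```

The outlier now only degrades the 32 weights that share its block, and the price is half a bit per weight of scale storage under these assumptions.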

The K-Quant Taxonomy

If you have downloaded models from Hugging Face or run llama.cpp, you have seen names like Q4_K_M, Q5_K_S, Q6_K. The naming scheme encodes several things at once:

  • The leading number (Q4, Q5, Q6) is the bit width for most weights
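  • K marks the “k-quant” family, which groups blocks into larger super-blocks and stores the per-block scales in quantized form instead of keeping one full-precision scale per block
  • S, M, L indicate the size/quality tier (Small, Medium, Large) within that bit width; the larger tiers keep more of the sensitive tensors at a higher-precision type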

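K-quants use a two-level quantization approach. Weights are grouped into super-blocks of 256 values, each split into smaller blocks. Every block has its own scale (and, for some types, a minimum), but those per-block values are themselves quantized, to 6 bits in the case of Q4_K, relative to a higher-precision scale stored once per super-block. The weights within a block are still spaced uniformly; the non-uniform, codebook-style grids that resemble vector quantization belong to the later IQ family of formats.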
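The second level, where block scales are stored as 6-bit integers, can be sketched as follows. The scale values are hypothetical, and the storage count at the end assumes a layout modeled on llama.cpp's Q4_K block (256 four-bit weights, eight 6-bit scales and eight 6-bit minimums, two FP16 super-block values); treat those figures as assumptions for illustration.

```python
def quantize_scales(block_scales, scale_bits=6):
    """Store per-block scales as small integers relative to one super-block scale."""
    levels = 2 ** scale_bits - 1                # 63 for 6 bits
    super_scale = max(block_scales) / levels    # kept in higher precision
    codes = [round(s / super_scale) for s in block_scales]
    return codes, super_scale


def dequantize_scales(codes, super_scale):
    """Recover approximate per-block scales from their integer codes."""
    return [c * super_scale for c in codes]


# Hypothetical scales for 8 blocks of 32 weights (one 256-weight super-block).
block_scales = [0.012, 0.020, 0.007, 0.031, 0.015, 0.009, 0.026, 0.018]
codes, super_scale = quantize_scales(block_scales)
restored = dequantize_scales(codes, super_scale)

print(codes)   # integers in 0..63
print(max(abs(r - s) for r, s in zip(restored, block_scales)))

# Assumed storage for one super-block: weights + 6-bit scales and mins + 2 FP16 values.
storage_bits = 256 * 4 + 8 * (6 + 6) + 2 * 16
print(storage_bits / 256)   # bits per weight
```

Storing the per-block values in 6 bits rather than 16 is what keeps the metadata overhead small even though each block carries both a scale and a minimum.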

The practical result is that Q4_K_M (around 4.8 bits per weight on average) typically outperforms plain Q4_0 (4.5 bits per weight once the per-block FP16 scale is counted) on perplexity benchmarks, despite being only slightly larger. The community benchmark on the llama.cpp wiki shows Q4_K_M consistently within 0.1-0.3 perplexity points of F16, while Q4_0 falls further behind.

GPTQ and AWQ: Calibration-Based Approaches

The approaches above quantize trained weights directly, with no calibration data. GPTQ and AWQ are also post-training quantization (PTQ) methods, but they use a small calibration set to produce better results at the same bit width.

GPTQ uses second-order information, an approximate Hessian of each layer’s reconstruction error estimated from calibration inputs, to decide how to quantize. It processes weight columns one by one, quantizing each column and then adjusting the remaining columns to compensate for the introduced error. This “error compensation” step is the key insight: instead of independently minimizing per-column error, GPTQ minimizes the downstream effect of each quantization decision on the remaining weights.
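The error-compensation idea can be shown on just two weights. This is a toy sketch, not the GPTQ algorithm itself: the 2x2 matrix H is an assumed stand-in for the second-order statistics, the grid is simply multiples of 0.1, and the real method works on whole columns with a Cholesky-based update.

```python
def quantize(v, step=0.1):
    """Round to the nearest multiple of `step` (a stand-in for a real integer grid)."""
    return round(v / step) * step


def output_error(w, q, H):
    """Quadratic error e^T H e, with e = q - w, for a symmetric 2x2 H."""
    e0, e1 = q[0] - w[0], q[1] - w[1]
    return H[0][0] * e0 * e0 + 2 * H[0][1] * e0 * e1 + H[1][1] * e1 * e1


# Assumed inputs: two weights and a matrix with a large off-diagonal term,
# meaning the two weights see strongly correlated inputs.
w = [0.24, 0.33]
H = [[1.0, 0.9], [0.9, 1.0]]

# Baseline: round each weight independently.
rtn = [quantize(w[0]), quantize(w[1])]

# Compensated: quantize w[0], then shift w[1] to absorb that error before rounding.
q0 = quantize(w[0])
e0 = q0 - w[0]
w1_adjusted = w[1] - e0 * H[0][1] / H[1][1]   # minimizer of the quadratic in w[1]
compensated = [q0, quantize(w1_adjusted)]

print("round-to-nearest:", rtn, output_error(w, rtn, H))
print("compensated:     ", compensated, output_error(w, compensated, H))
```

In this example the compensated pair rounds the second weight "the wrong way" when viewed in isolation, yet the combined output error is lower because the two errors partly cancel.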

AWQ (Activation-aware Weight Quantization) takes a different angle. The observation is that not all weights matter equally. Weights that are multiplied by large activations contribute more to the output, so protecting those weights from quantization error matters more. AWQ identifies these “salient” weights by running a small calibration dataset through the model and then scales those weight channels up (and the corresponding activations down) before quantization, so that the quantization grid aligns better with the important values.
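A toy version of the scaling trick, under stated assumptions: the weights, activations, and the factor s = 2.0 are invented for illustration, whereas the real method searches for per-channel factors using calibration data.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-trip of a weight group sharing one scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


# Assumed toy data: channel 0 sees a much larger activation, so it is "salient".
weights = [0.2, -0.92, 0.47, 1.0]
activations = [12.0, 0.3, -0.5, 0.4]
s = 2.0   # hypothetical scaling factor for channel 0

# Plain 4-bit quantization.
plain = quantize_group(weights)

# Scaled: multiply the salient weight by s, quantize, and fold 1/s into the activation.
scaled_weights = [weights[0] * s] + weights[1:]
scaled_acts = [activations[0] / s] + activations[1:]
scaled = quantize_group(scaled_weights)

exact = sum(w * x for w, x in zip(weights, activations))
out_plain = sum(w * x for w, x in zip(plain, activations))
out_scaled = sum(w * x for w, x in zip(scaled, scaled_acts))

print("exact:       ", exact)
print("plain 4-bit: ", out_plain, "error", abs(out_plain - exact))
print("scaled 4-bit:", out_scaled, "error", abs(out_scaled - exact))
```

Because the group's largest weight is unchanged, the grid spacing stays the same while the salient weight's rounding error is effectively divided by s. For any single weight rounding can go either way, so the benefit holds on average rather than in every case.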

Both methods require a calibration dataset (typically a few hundred samples from C4 or a similar text corpus) and take significantly longer to quantize than direct PTQ, but they produce noticeably better results at 4-bit precision.

Quantization Error in Transformers: Where It Hurts

Not all parts of a transformer are equally sensitive to quantization. Attention query and key projections tend to be more sensitive than value projections. The first and last layers are almost always kept at higher precision because they have disproportionate influence on output quality. Feed-forward network layers in the middle of deep models are generally robust to aggressive quantization.

This is why mixed-precision quantization (keeping some layers at FP16 while quantizing others to INT4 or INT8) often beats uniform quantization at the same average bit width. The GGUF quantization mixes in llama.cpp work this way, storing selected tensors at a higher-precision type than the rest. Its “importance matrix” feature is complementary: a calibration pass estimates which weights matter most, and the quantizer then weights its rounding error toward protecting them.

What This Means in Practice

For running models locally, whether for a Discord bot, a development tool, or just experimentation, the practical hierarchy is roughly:

  • Q8_0 or Q6_K: nearly indistinguishable from FP16 on most tasks, roughly 7 to 7.5 GB (Q8_0) or 5.5 to 6 GB (Q6_K) for a 7B model
  • Q5_K_M: excellent quality, around 5 GB for 7B, the sweet spot if you have the RAM
  • Q4_K_M: very good quality, a little over 4 GB for 7B, the most common practical choice
  • Q3_K_M: acceptable for less sensitive tasks, meaningful quality degradation on reasoning
  • Q2_K: quality loss is large enough to notice on most tasks, mainly useful when you have no other option

The perplexity numbers matter less than you might think for most applications. Perplexity measures how well the model predicts held-out text (lower is better), but it does not directly measure factual accuracy, instruction following, or coding quality. A model at Q4_K_M might have perplexity 0.2 higher than its F16 counterpart and be completely indistinguishable on real tasks.
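For reference, perplexity is conventionally the exponential of the average negative log-probability the model assigns to the correct next token. The probabilities in this sketch are hypothetical and only illustrate how a small drop in confidence moves the number.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the true tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Hypothetical probabilities a model assigned to each correct next token.
full_precision = [0.60, 0.25, 0.90, 0.40, 0.75]
quantized = [0.58, 0.24, 0.89, 0.39, 0.74]   # slightly less confident everywhere

print(perplexity(full_precision))
print(perplexity(quantized))
print(perplexity([0.25] * 8))   # uniform guessing among 4 options gives about 4
```

Nothing in this calculation checks whether the most likely token is still the same one, which is part of why a small perplexity gap can be invisible in practice.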

Where quantization degrades noticeably before the perplexity numbers would suggest is long-context generation, multi-step arithmetic, and tasks requiring precise memorization. These are exactly the tasks where small cumulative errors in activation values compound across many transformer layers.

The Direction Things Are Heading

The research frontier has moved past simple PTQ. Quantization-aware training (QAT) lets the model learn to compensate for quantization error during training itself. On the post-training side, LLM.int8() by Tim Dettmers and co-authors introduced a mixed-precision decomposition that keeps the few outlier activation dimensions in 16-bit floating point and runs the rest of the matrix multiplication in INT8, with minimal quality loss. Subsequent work like QuIP# and AQLM has pushed the frontier to sub-4-bit quantization using codebook-based approaches that treat groups of weights as vectors rather than scalars.
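The mixed-precision decomposition can be illustrated on a single dot product. This is a simplified sketch in the spirit of LLM.int8(), not its implementation: the data is invented, the threshold of 6.0 is an assumption, and the real method operates on whole matrices with per-row and per-column scales.

```python
def absmax_int8(vec):
    """Quantize a vector to int8 codes with a single absmax scale."""
    scale = max(abs(v) for v in vec) / 127 or 1.0   # avoid 0 for an all-zero vector
    return [round(v / scale) for v in vec], scale


def int8_dot(x, w):
    """Dot product computed on int8 codes, then rescaled to floating point."""
    qx, sx = absmax_int8(x)
    qw, sw = absmax_int8(w)
    return sum(a * b for a, b in zip(qx, qw)) * sx * sw


def mixed_precision_dot(x, w, threshold=6.0):
    """Keep dimensions with large activations in floating point, the rest in int8."""
    outliers = [i for i, v in enumerate(x) if abs(v) >= threshold]
    regular = [i for i in range(len(x)) if i not in outliers]
    high = sum(x[i] * w[i] for i in outliers)
    low = int8_dot([x[i] for i in regular], [w[i] for i in regular]) if regular else 0.0
    return high + low


# Assumed toy data: one activation dimension is an outlier.
x = [0.5, -1.2, 40.0, 0.8]
w = [0.3, -0.7, 0.05, 0.9]

exact = sum(a * b for a, b in zip(x, w))
print("exact:          ", exact)
print("all int8:       ", int8_dot(x, w))
print("mixed precision:", mixed_precision_dot(x, w))
```

With the outlier removed from the INT8 path, the remaining activations share a much smaller scale, so their rounding error shrinks accordingly.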

The llama.cpp project continues to improve its quantization options with each release. The introduction of imatrix (importance matrix) quantization around the start of 2024 was a significant step, letting users generate better-quality lower-bit quantizations with a single calibration pass using their own domain-specific data.

Simon Willison’s post on this topic is valuable precisely because it builds the intuition from the numeric level upward, rather than presenting the formulas as a fait accompli. A scale factor is just a linear mapping, block quantization is just many of those mappings applied locally, and K-quants add a second level by quantizing the block scales themselves. Together these ideas form a coherent picture. Once you have that picture, the file size calculations, the quality tradeoffs, and the format names all make sense as engineering decisions rather than arbitrary parameters.
