
What Quantization Actually Does to a Neural Network's Numbers

Source: simonwillison

Simon Willison recently published a ground-up explanation of quantization that walks through the core math clearly. It’s worth reading for the intuition it builds. What I want to do here is take that foundation and push into the parts that matter most when you’re actually choosing a quantized model to run locally, or deciding whether to trust a quantized model’s outputs in a production setting.

The short version of why quantization exists: a 70-billion parameter model stored in 32-bit floats needs roughly 280 GB of memory. Most people do not have 280 GB of GPU VRAM. Quantization trades some numerical precision for a dramatic reduction in memory footprint, and it turns out that for inference, the trade-off is often better than you would expect.
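The arithmetic behind that figure is easy to check. A minimal sketch, assuming decimal gigabytes and counting weights only (no activations, KV cache, or per-block scale overhead); the function name is illustrative:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# A 70-billion-parameter model at several storage widths.
for label, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label:>9}: {weight_memory_gb(70e9, bits):6.1f} GB")
```

Real quantized files come out somewhat larger than the 8-bit and 4-bit rows suggest, because scale factors are stored next to the weights.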

What floating point actually stores

To understand what you lose when you quantize, you need to understand what you start with. An IEEE 754 single-precision float (FP32) uses 32 bits: 1 sign bit, 8 exponent bits, and 23 mantissa bits. The exponent gives you range across roughly 80 orders of magnitude. The mantissa gives you about 7 decimal digits of precision within any given range.

BF16, the 16-bit format Google developed for TPUs and now widely used in ML, keeps the full 8-bit exponent from FP32 and truncates the mantissa to 7 bits. This preserves the same dynamic range as FP32 at the cost of precision. FP16 (IEEE half-precision) instead uses a 5-bit exponent and 10-bit mantissa, which gives better precision in the normal range but can overflow or underflow where BF16 would not. For neural network weights, BF16 is usually preferable because weights tend to have outlier values that benefit from the larger dynamic range.
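The bit layouts can be inspected directly with Python’s standard library. This is a sketch rather than a reference implementation: the BF16 conversion here simply drops the low 16 bits of the FP32 pattern (real converters usually round to nearest), and the helper names are made up for illustration.

```python
import struct


def fp32_fields(x: float):
    """Split an FP32 value into (sign, biased exponent, mantissa) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def to_bf16_truncated(x: float) -> float:
    """Keep sign, 8 exponent bits and the top 7 mantissa bits; zero the rest."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    (y,) = struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))
    return y


def fits_fp16(x: float) -> bool:
    """True if x can be packed as an IEEE half-precision float without overflow."""
    try:
        struct.pack(">e", x)
        return True
    except (OverflowError, struct.error):
        return False


print(fp32_fields(1.0))            # sign, biased exponent, mantissa bits
print(to_bf16_truncated(3.14159))  # precision lost, magnitude kept
print(fits_fp16(1e5), to_bf16_truncated(1e5))  # FP16 overflows, BF16 does not
```

The last line shows the trade-off in miniature: 100000 is outside FP16’s range, while the BF16-style value stays within a fraction of a percent of it.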

The mechanics of integer quantization

When you quantize weights to INT8, you are mapping a range of float values onto 256 discrete integers. The two standard approaches are symmetric and asymmetric quantization.

Symmetric quantization picks a scale factor s such that the most extreme weight value maps to ±127:

s = max(abs(weight)) / 127
quantized = round(weight / s)
dequantized = quantized * s
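The three lines above translate almost directly into Python. A minimal sketch on a plain list of floats; the clamp and the guard against an all-zero tensor are additions of mine, and the sample weights are made up:

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] with a single scale."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:                                # all-zero input: any scale works
        scale = 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_symmetric(q, scale):
    return [v * scale for v in q]


weights = [0.347, -1.27, 0.02, 0.9]
q, scale = quantize_symmetric(weights)
restored = dequantize_symmetric(q, scale)
print(q, scale)
print([round(abs(w - r), 5) for w, r in zip(weights, restored)])
```

Each restored value lands within half a scale step of the original, which is the best a uniform grid can do.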

Asymmetric quantization also stores a zero point z, which lets the quantized range shift to cover asymmetric distributions (weights clustered above zero, for instance):

s = (max(weight) - min(weight)) / 255
z = round(-min(weight) / s)
quantized = round(weight / s) + z
dequantized = (quantized - z) * s
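A runnable version of the asymmetric formulas, again as a sketch. One assumption goes beyond the formulas as written: the range is widened to include zero so that the zero point always fits in the unsigned integer range, which is a common convention but not something the text above requires. The sample weights are invented.

```python
def quantize_asymmetric(weights, bits=8):
    """Map floats to unsigned integers in [0, qmax] using a scale and zero point."""
    qmax = 2 ** bits - 1                            # 255 for 8 bits
    lo = min(min(weights), 0.0)                     # assumption: range includes 0
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax or 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_asymmetric(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]


weights = [-0.2, 0.1, 0.351, 0.82]                  # skewed toward positive values
q, scale, zero_point = quantize_asymmetric(weights)
restored = dequantize_asymmetric(q, scale, zero_point)
print(q, scale, zero_point)
```

For this skewed input the step size is about 0.004, whereas the symmetric scheme would need 0.82 / 127, roughly 0.0065, to cover the same values.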

The scale factor and zero point must be stored alongside the weights. For per-tensor quantization you store one pair for the whole weight matrix. For per-channel quantization you store one pair per row or column. Per-channel quantization costs more storage overhead but dramatically improves accuracy because different channels in a transformer layer can have very different value distributions.
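The difference between the two granularities shows up as soon as one row has much larger values than another. A small sketch with an invented two-row matrix, using symmetric INT8 and treating each row as a channel:

```python
def quant_dequant(values, scale, qmax=127):
    """Round to the integer grid defined by scale, then map back to floats."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def max_abs_error(original, approx):
    return max(abs(a - b) for a, b in zip(original, approx))


matrix = [
    [0.013, -0.021, 0.004],   # a channel with small weights
    [1.0, -2.54, 0.5],        # a channel with large weights
]

# Per-tensor: one scale, dictated by the largest value anywhere in the matrix.
tensor_scale = max(abs(v) for row in matrix for v in row) / 127
per_tensor_err = [max_abs_error(row, quant_dequant(row, tensor_scale)) for row in matrix]

# Per-channel: each row gets a scale fitted to its own range.
per_channel_err = []
for row in matrix:
    row_scale = max(abs(v) for v in row) / 127
    per_channel_err.append(max_abs_error(row, quant_dequant(row, row_scale)))

print(per_tensor_err)
print(per_channel_err)
```

The large-valued row is unaffected, but the small-valued row goes from being crushed onto two or three grid points to using the full integer range.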

Why INT4 is harder than INT8

With INT8, you have 256 possible values to represent a continuous float distribution. With INT4, you have 16, so for the same range the gaps between representable values are roughly sixteen times wider. A weight that should be 0.347 has to snap to the nearest of sixteen points, and across billions of weights, those errors accumulate in ways that degrade model quality measurably.
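To put a number on that for the 0.347 example, here is a sketch that rounds it onto an 8-bit and a 4-bit grid. It assumes symmetric quantization and a tensor whose largest magnitude is 1.0, both of which are choices made for illustration (the symmetric 4-bit grid uses 15 of the 16 codes).

```python
def round_to_grid(w, max_abs, bits):
    """Quantize w symmetrically given the tensor's largest magnitude, then dequantize."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits
    scale = max_abs / qmax
    q = max(-qmax, min(qmax, round(w / scale)))
    return q * scale


w = 0.347
err8 = abs(w - round_to_grid(w, 1.0, 8))
err4 = abs(w - round_to_grid(w, 1.0, 4))
print(f"INT8 error: {err8:.6f}")
print(f"INT4 error: {err4:.6f}")
```

Under these assumptions the 4-bit error is around a hundred times the 8-bit one for this particular value; the typical gap is closer to the ratio of the grid spacings.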

The perplexity metric, standard for measuring language model quality, shows this clearly. On the Llama 3 70B model, TheBloke’s benchmarks and the llama.cpp perplexity testing infrastructure consistently show:

  • FP16: baseline perplexity ~4.5 on standard text corpora
  • Q8_0: within 0.01 of FP16, essentially lossless
  • Q5_K_M: roughly 0.05-0.1 perplexity increase
  • Q4_K_M: roughly 0.1-0.2 perplexity increase
  • Q3_K_M: 0.3-0.5 perplexity increase, noticeable quality drop
  • Q2_K: significant degradation, 1+ perplexity points lost

These numbers vary by model and task. Instruction-tuned models tend to be more robust to quantization than base models. Code generation tasks degrade faster than general text generation.
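For reference, perplexity is the exponential of the average negative log-likelihood the model assigns to each token of a test text, so lower is better. A minimal sketch, assuming you already have per-token natural-log probabilities; the probabilities below are invented purely to show the direction of the effect:

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token probabilities for the same text from two model variants.
full_precision = [math.log(p) for p in (0.30, 0.22, 0.25, 0.20)]
quantized = [math.log(p) for p in (0.28, 0.21, 0.24, 0.19)]

print(perplexity(full_precision))
print(perplexity(quantized))
```

A model that assigns probability 0.25 to every token has a perplexity of exactly 4, which is a handy sanity check for any implementation.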

The GGUF k-quant naming scheme decoded

When you browse Hugging Face for GGUF models, you see names like Q4_K_M, Q5_K_S, Q6_K, IQ3_XS. The naming is not arbitrary.

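The first number is the nominal bits per weight. Q4 means roughly 4 bits per weight; Q8 means 8 bits. The effective average is somewhat higher, because scales and other per-block metadata are stored alongside the weights.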

The _K suffix indicates “k-quants,” introduced in llama.cpp in mid-2023. K-quants use a two-level quantization scheme: a “super-block” of 256 weights is split into smaller blocks, each with its own scale, and those per-block scales are themselves quantized to a few bits against a single higher-precision scale stored once per super-block. This hierarchy keeps the metadata overhead low while still letting each small block track its local range, which recovers accuracy that a flat INT4 scheme loses.
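A toy version of the two-level idea, to make the storage accounting concrete. This is not llama.cpp’s actual layout: it assumes symmetric 4-bit codes, 6-bit block scales, and one 16-bit scale per super-block, and it leaves out the per-block minimums that the real 4-bit k-quant also stores. All names are illustrative.

```python
import math

BLOCK = 32              # weights per inner block
BLOCKS_PER_SUPER = 8    # 8 * 32 = 256 weights per super-block


def quantize_super_block(weights):
    """Quantize 256 weights to 4-bit codes with 6-bit per-block scale codes."""
    assert len(weights) == BLOCK * BLOCKS_PER_SUPER
    blocks = [weights[i:i + BLOCK] for i in range(0, len(weights), BLOCK)]
    ideal_scales = [max(abs(w) for w in b) / 7 for b in blocks]
    super_scale = max(ideal_scales) / 63 or 1.0     # would be stored as one FP16
    scale_codes = [max(1, round(s / super_scale)) for s in ideal_scales]
    codes = []
    for block, code in zip(blocks, scale_codes):
        scale = code * super_scale                  # block scale after its own rounding
        codes.append([max(-7, min(7, round(w / scale))) for w in block])
    return codes, scale_codes, super_scale


def dequantize_super_block(codes, scale_codes, super_scale):
    return [q * sc * super_scale for block, sc in zip(codes, scale_codes) for q in block]


# Synthetic weights whose magnitude grows from block to block.
weights = [math.sin(0.37 * i) * (0.02 + 0.01 * (i // BLOCK))
           for i in range(BLOCK * BLOCKS_PER_SUPER)]
codes, scale_codes, super_scale = quantize_super_block(weights)
restored = dequantize_super_block(codes, scale_codes, super_scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
bits_per_weight = (len(weights) * 4 + BLOCKS_PER_SUPER * 6 + 16) / len(weights)
print(scale_codes, max_error, bits_per_weight)
```

In this simplified layout the metadata adds a quarter of a bit per weight. Storing a full 16-bit scale for every 32-weight block would add half a bit instead.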

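The final letter (_S, _M, _L) indicates a size/quality tier within the k-quant family. The tiers differ in how many of the model’s more sensitive tensors are stored in a higher-bit quant type than the base one:

  • _S (small): the base type is used for nearly all tensors; smallest file, lowest quality
  • _M (medium): some sensitive tensors (for example, part of the attention and feed-forward projections) are promoted to a higher-bit type; the most commonly recommended tier
  • _L (large): more tensors are promoted, approaching the quality of the next higher bit level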

The IQ prefix (“i-quants”) marks a newer family that replaces evenly spaced quantization levels with non-uniform, lookup-table-based ones. It is usually paired with an importance matrix computed from calibration data, which steers the quantizer toward preserving the weights that affect the output most. IQ4_XS can match or come close to Q4_K_M quality at smaller file sizes by being smarter about where precision is spent.
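One way to picture how importance data can steer a quantizer: weight each rounding error by an importance score and pick the scale that minimizes the weighted total. This is a toy illustration of that idea only, not llama.cpp’s algorithm; the importance values, the candidate grid, and the function names are all made up.

```python
def quantize_with_scale(weights, scale, qmax=7):
    """Round onto a symmetric 4-bit grid with the given scale and map back."""
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def weighted_error(weights, approx, importance):
    """Squared error per weight, multiplied by that weight's importance."""
    return sum(i * (w - a) ** 2 for w, a, i in zip(weights, approx, importance))


def search_scale(weights, importance, qmax=7):
    """Try scales within +/-20% of the abs-max scale; keep the best weighted fit."""
    base = max(abs(w) for w in weights) / qmax
    candidates = [base * (1 + 0.01 * k) for k in range(-20, 21)]
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize_with_scale(weights, s, qmax), importance),
    )


weights = [0.31, -0.9, 0.12, 0.55]
importance = [10.0, 0.1, 0.1, 1.0]      # hypothetical: the first weight matters most

base_scale = max(abs(w) for w in weights) / 7
best_scale = search_scale(weights, importance)
err_base = weighted_error(weights, quantize_with_scale(weights, base_scale), importance)
err_best = weighted_error(weights, quantize_with_scale(weights, best_scale), importance)
print(base_scale, best_scale, err_base, err_best)
```

Because the abs-max scale is itself one of the candidates, the search can never do worse than the unweighted default on the weighted metric.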

Post-training quantization methods differ in where they apply intelligence

Not all quantization is equal even at the same bit width. The three dominant PTQ methods for LLMs each approach the problem differently.

GPTQ (from the paper by Frantar et al.) uses approximate second-order (Hessian) information, estimated from calibration data, to compensate for quantization errors. It processes weights layer by layer, and after quantizing each weight, adjusts the remaining unquantized weights in the same row to compensate. This is computationally expensive at quantization time but produces excellent quality, especially at 4 bits. The trade-off is that GPTQ models are typically served through GPU kernels and need calibration data to produce.

AWQ (Activation-aware Weight Quantization, from Lin et al.) observes that not all weights are equally important. A small fraction of weight channels corresponding to large activation magnitudes dominate model quality. AWQ protects these salient channels by scaling them before quantization, preserving their precision without storing extra bits. It runs faster than GPTQ at quantization time and produces competitive quality.
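The scaling trick can be shown on a single group of four weights. This sketch hard-codes the channel scale instead of deriving it from activation statistics the way AWQ’s search does, and all the numbers are invented: channel 0 has a small weight but a large activation.

```python
def quant_dequant(values, qmax=7):
    """Symmetric 4-bit round trip using the group's own abs-max scale."""
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.05, 0.9, -0.6, 0.3]
activations = [10.0, 0.1, 0.1, 0.1]       # channel 0 carries the large activation
channel_scale = [4.0, 1.0, 1.0, 1.0]      # hypothetical: boost only the salient channel

exact = dot(weights, activations)

# Plain 4-bit rounding: the small but important weight rounds to zero.
naive = dot(quant_dequant(weights), activations)

# Scale the salient weight up before quantizing and the activation down to match.
scaled_w = [w * s for w, s in zip(weights, channel_scale)]
scaled_x = [x / s for x, s in zip(activations, channel_scale)]
awq_like = dot(quant_dequant(scaled_w), scaled_x)

print(exact, naive, awq_like)
```

The group’s scale is unchanged because the boosted weight is still smaller than the largest one, so the other weights quantize exactly as before while the salient one now lands on a non-zero grid point.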

GGUF k-quants (llama.cpp’s format) focus on CPU-first inference. They use block-based quantization with carefully chosen block sizes (typically 32 weights per block) that align with SIMD vector widths. The quantization math is simpler than GPTQ or AWQ, but the format is designed to make dequantization fast on x86 and ARM CPUs using integer SIMD instructions. This is why llama.cpp can run efficiently on a laptop CPU where GPTQ or AWQ models cannot.

The activation quantization problem

Everything above concerns weight quantization. Quantizing activations, the intermediate values computed during a forward pass, is substantially harder and is where much current research sits.

The core difficulty is that activation distributions have outliers. In the LLM.int8() paper, Dettmers et al. observed that in transformer models a small number of activation dimensions develop very large magnitudes during training. These outliers appear systematically in the same dimensions across different inputs. A standard INT8 quantization of activations must stretch its scale factor to accommodate them, which wastes precision on the far more numerous normal values.

LLM.int8() handles this with mixed-precision decomposition: it identifies the outlier dimensions and computes their part of the matrix multiplication in FP16, handling the remaining dimensions in INT8. This maintains quality at INT8 but adds implementation complexity.
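A sketch of the decomposition for a single dot product. It is a simplification under stated assumptions: plain Python floats stand in for FP16, the magnitude cutoff of 6.0 is an assumed threshold, and the activation and weight values are invented.

```python
OUTLIER_THRESHOLD = 6.0   # assumed magnitude cutoff for "outlier" dimensions


def absmax_quantize(values):
    """Symmetric INT8 quantization with one abs-max scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale


def int8_dot(x, w):
    """Integer dot product, rescaled back to float at the end."""
    qx, sx = absmax_quantize(x)
    qw, sw = absmax_quantize(w)
    return sum(a * b for a, b in zip(qx, qw)) * sx * sw


def mixed_precision_dot(x, w, threshold=OUTLIER_THRESHOLD):
    """Outlier dimensions stay in float; everything else goes through INT8."""
    outliers = [i for i, v in enumerate(x) if abs(v) >= threshold]
    regular = [i for i in range(len(x)) if i not in outliers]
    high = sum(x[i] * w[i] for i in outliers)
    low = int8_dot([x[i] for i in regular], [w[i] for i in regular]) if regular else 0.0
    return high + low


x = [0.31, -0.52, 0.18, 60.0, 0.44]   # one outlier activation dimension
w = [0.8, -0.3, 0.5, 0.01, 0.6]

exact = sum(a * b for a, b in zip(x, w))
err_int8 = abs(int8_dot(x, w) - exact)
err_mixed = abs(mixed_precision_dot(x, w) - exact)
print(exact, err_int8, err_mixed)
```

With the outlier included, the activation scale is so coarse that the ordinary values collapse to -1, 0, or 1. Once it is pulled out, the same values use most of the INT8 range.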

More recent methods like SmoothQuant migrate quantization difficulty from activations to weights by multiplying a per-channel scaling factor into the weights and dividing it out of the activations, making both distributions more uniform and quantization-friendly.
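The rescaling leaves the layer’s output unchanged, which is what makes it safe to apply offline. A sketch using the per-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5; the matrices are invented, and the helper names are illustrative:

```python
def matmul(a, b):
    """Plain nested-list matrix product: (n x k) times (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def smooth(x, w, alpha=0.5):
    """Divide activation channel j by s_j and multiply weight row j by s_j."""
    n_channels = len(w)
    s = []
    for j in range(n_channels):
        x_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w[j])
        s.append(x_max ** alpha / w_max ** (1 - alpha))
    x_s = [[row[j] / s[j] for j in range(n_channels)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(n_channels)]
    return x_s, w_s, s


# Two tokens, three input channels; channel 1 is an activation outlier.
x = [[0.5, 40.0, -0.3], [-0.4, -36.0, 0.6]]
# Three input channels, two output features.
w = [[0.4, -0.2], [0.02, 0.01], [0.3, 0.5]]

x_s, w_s, s = smooth(x, w)
before = matmul(x, w)
after = matmul(x_s, w_s)
x_peak = max(abs(v) for row in x_s for v in row)
print(s)
print(before, after)
print(x_peak)
```

After smoothing, the activation outlier shrinks from 40 to below 1 and the matching weight row grows by the same factor, so neither tensor forces a coarse scale on its neighbors.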

The practical decision for local inference

For running models locally with llama.cpp or Ollama, the guidance shakes out simply:

If your machine can fit Q8_0, run Q8_0. The quality loss versus FP16 is negligible and you get fast integer arithmetic. For a 7B model, Q8_0 needs about 7.7 GB; Q4_K_M needs about 4.1 GB.

If you need to drop to 4-bit to fit in VRAM, Q4_K_M is the reasonable default. Q5_K_M gives noticeably better quality if you have the headroom. For smaller models where quality degradation is more pronounced, consider IQ4_XS if your llama.cpp build supports it.

For GPU inference where you want to use Python tooling, AWQ models through AutoAWQ or GPTQ through auto-gptq are well-supported. The bitsandbytes library provides on-the-fly 8-bit and 4-bit quantization (INT8, and the NF4 or FP4 formats at 4 bits) that requires no pre-quantized model file, which is convenient for experimentation.

What the math tells you about trust

Quantization is not magic. It is a lossy compression scheme with well-understood failure modes. The perplexity metric captures average-case quality loss but misses tail behavior, and models operating near the edge of their knowledge are more likely to be harmed by quantization than models confidently handling well-represented topics.

For applications where accuracy matters, Q8_0 is effectively safe. For Q4-level quantization, running a quality evaluation on your specific task distribution is worth doing before trusting the model in production. The theoretical understanding that Willison’s article builds is precisely what lets you reason about when a 4-bit model is good enough and when it is not.
