
How LLM Quantization Actually Works, From the Bits Up

Source: simonwillison

Simon Willison published a ground-up explanation of quantization recently, and it prompted me to write down how I actually think about this stuff. Quantization is one of those topics where the high-level explanation, “we make the numbers smaller,” is technically accurate but leaves out everything you need to reason about trade-offs.

So let me start from the actual bits.

How Float32 Stores a Number

A 32-bit float uses the IEEE 754 standard: 1 sign bit, 8 exponent bits, and 23 mantissa (significand) bits. The exponent gives you range, the mantissa gives you precision. Together they can represent numbers from roughly 1.2e-38 to 3.4e+38 with about 7 decimal digits of precision.
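
To make the layout concrete, here is a small sketch that pulls the three fields out of a float32 using only Python's standard library. The helper name float32_fields is mine, chosen for illustration.

```python
import struct

def float32_fields(x):
    """Return (sign, biased_exponent, mantissa) of x after rounding to float32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF        # 23 bits; normal numbers have an implicit leading 1
    return sign, exponent, mantissa

# -0.15625 is -1.25 * 2**-3, so the unbiased exponent is -3
sign, exponent, mantissa = float32_fields(-0.15625)
print(sign, exponent - 127, f"{mantissa:023b}")
```

The mantissa prints as 23 binary digits; the fractional part 0.25 shows up as the second-highest bit.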

Model weights have traditionally been stored as float32. A 7-billion-parameter model at float32 costs 4 bytes per parameter, so 28 GB just for the weights. That is more than the 24 GB of VRAM on a typical high-end consumer GPU.

The first move is to drop to float16 or bfloat16, both 16-bit formats. Float16 has 1 sign bit, 5 exponent bits, and 10 mantissa bits. It halves memory to 14 GB but narrows the representable range significantly (the largest finite value is 65504), which can cause overflow and numerical instability during training. That is why Google developed bfloat16: the same 1 sign bit and 8 exponent bits as float32, but only 7 mantissa bits. You keep the range, lose some fractional precision. Most modern LLM training and inference uses bfloat16 or float16 as the baseline.
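
The range-versus-precision trade can be seen without any ML library. The sketch below round-trips values through float16 using the struct module's half-precision format, and emulates bfloat16 by keeping only the top 16 bits of the float32 encoding. That emulation truncates rather than rounds to nearest, which is a simplification, and the helper names are illustrative.

```python
import struct

def to_float16(x):
    """Round-trip x through IEEE half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fits_float16(x):
    """True if x is within float16's finite range."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False

def to_bfloat16(x):
    """Emulate bfloat16 by truncating the float32 encoding to its top 16 bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Precision: float16 has 10 mantissa bits, bfloat16 only 7
print(to_float16(0.1), to_bfloat16(0.1))

# Range: 70000 overflows float16 but survives (coarsened) in bfloat16
print(fits_float16(70000.0), to_bfloat16(70000.0))
```

The first line shows float16 landing closer to 0.1; the second shows bfloat16 keeping a value that float16 cannot represent at all.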

But even 14 GB is a lot. Getting below 10 GB per 7B model requires going further.
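
The memory arithmetic so far is just parameters times bits. A tiny helper (the name weight_gigabytes is mine) makes it easy to plug in other sizes; it counts weights only and ignores activations, KV cache, and any per-block scale overhead.

```python
def weight_gigabytes(n_params, bits_per_weight):
    """Weight storage in GB (1 GB = 1e9 bytes), ignoring any metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9

for name, bits in [("float32", 32), ("float16/bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>17}: {weight_gigabytes(7e9, bits):5.1f} GB")
```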

Linear Quantization: The Core Math

The idea behind integer quantization is to map a range of floating-point values onto a fixed integer grid. For INT8, that grid runs from -128 to 127 (signed) or 0 to 255 (unsigned). The mapping is linear:

q = round(x / scale) + zero_point

Here scale is (max - min) / (2^bits - 1), and zero_point is the integer that real zero maps to; for the unsigned grid it is round(-min / scale). To dequantize:

x_approx = (q - zero_point) * scale

This is called asymmetric (or affine) quantization. For any value inside [min, max], the quantization error is at most scale / 2, so a tighter range of weights gives better precision. If you try to cover a range of [-10.0, 10.0] with 256 levels, your scale is about 0.078. If the actual weights cluster between [-0.5, 0.5], only about 13 of those 256 levels land where the weights are, and the rest cover a range that barely gets used.
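
Here is a minimal pure-Python sketch of the quantize and dequantize formulas on the unsigned 8-bit grid. It assumes zero_point = round(-min / scale) and widens the range to include 0.0 so that real zero is exactly representable; the function names are illustrative, not from any library.

```python
def asymmetric_params(values, bits=8):
    """Compute (scale, zero_point) for an unsigned grid of 2**bits levels."""
    lo = min(min(values), 0.0)   # widen the range so real zero is representable
    hi = max(max(values), 0.0)
    qmax = 2**bits - 1
    scale = (hi - lo) / qmax or 1.0   # fall back to 1.0 if every value is zero
    zero_point = round(-lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, bits=8):
    qmax = 2**bits - 1
    return [min(qmax, max(0, round(x / scale) + zero_point)) for x in values]

def dequantize(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]

weights = [-0.42, -0.1, 0.0, 0.07, 0.31, 0.5]
scale, zero_point = asymmetric_params(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.5f}", f"worst error={worst:.5f}")
```

The worst-case error stays within scale / 2, and 0.0 comes back as exactly 0.0 because it maps onto the zero_point integer.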

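Symmetric quantization simplifies this by forcing zero_point = 0, which means the scale is max(abs(min), abs(max)) / 127 for INT8. It wastes part of the grid when the values are not centered on zero, but it removes the zero-point terms from the integer matrix multiply, which makes kernels simpler. That is why it dominates in practice for weights, which tend to be roughly symmetric around zero; activations with a skewed range (for example after a ReLU) are where asymmetric quantization is more often used.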
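
A sketch of the symmetric variant, again in plain Python with an illustrative function name. The example input is deliberately one-sided to show the cost: with no negative values, the negative half of the grid goes unused.

```python
def symmetric_quantize(values, bits=8):
    """Quantize with zero_point = 0 and scale = absmax / (2**(bits-1) - 1)."""
    qmax = 2**(bits - 1) - 1                      # 127 for INT8
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

# One-sided values: every code is >= 0, so codes -127..-1 are never used
q, scale = symmetric_quantize([0.0, 0.2, 0.6, 1.0])
print(q)
```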

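INT8 inference is well-supported on modern hardware. The bitsandbytes library gives you INT8 quantization for PyTorch-based models with load_in_8bit=True, and NVIDIA GPUs from Turing onward have tensor cores with INT8 support, which in principle offer higher matrix-multiply throughput than float16. Whether you actually see a speedup depends on the kernels: the bitsandbytes 8-bit path is mainly a memory saving and is often somewhat slower than float16 because of its outlier handling.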

Why Block Quantization Matters

The problem with computing a single scale and zero_point for an entire weight matrix is that outlier values dominate. If most of your weights sit in [-0.3, 0.3] but a few are at ±5.0, your scale stretches to cover those outliers, and the quantization error for all the common values gets worse.

The fix is block quantization: split the weight tensor into small blocks (typically 32 or 64 values) and compute a separate scale for each block. Now outliers only hurt the resolution within their own block. This is the approach used in GGUF, the format that llama.cpp uses for running models locally.
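
The effect of a single outlier is easy to reproduce. The sketch below uses a simple symmetric absmax quantizer at 4 bits (my choice for illustration, not the exact GGUF scheme) and compares one scale for the whole tensor against one scale per 32-value block, on synthetic weights with one planted outlier.

```python
import random

def absmax_roundtrip(values, bits=4):
    """Quantize to a symmetric grid with one shared scale, then dequantize."""
    qmax = 2**(bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) * scale for v in values]

def blockwise_roundtrip(values, block_size=32, bits=4):
    """Same quantizer, but with a separate scale for each block."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(absmax_roundtrip(values[start:start + block_size], bits))
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
weights = [random.uniform(-0.3, 0.3) for _ in range(256)]
weights[5] = 5.0   # a single outlier

per_tensor = mean_abs_error(weights, absmax_roundtrip(weights))
per_block = mean_abs_error(weights, blockwise_roundtrip(weights))
print(f"per-tensor: {per_tensor:.4f}  per-block: {per_block:.4f}")
```

With one scale for everything, the outlier stretches the grid so far that every ordinary weight rounds to zero. With per-block scales, only the 31 neighbors of the outlier pay that price.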

GGUF has a large family of quantization types. A few key ones:

  • Q8_0: 8-bit, block size 32, one float16 scale per block. Excellent quality. The scale adds 16 bits per 32 weights, so the cost is 8.5 bits per weight: about 1.06x the size of raw INT8, or roughly 0.53x float16.
  • Q4_0: 4-bit, block size 32, one float16 scale per block. Each weight uses only 4 bits, so 2 weights pack into 1 byte. With the scale that is 4.5 bits per weight, roughly 0.28x float16.
  • Q4_K_M: A “K-quant” mix that uses different quantization types for different tensors. K-quants group weights into 256-value super-blocks whose per-block scales are themselves quantized. As described in llama.cpp, the M variant uses Q6_K for half of the attention.wv and feed_forward.w2 tensors and Q4_K for the rest. In practice this gives noticeably better quality than Q4_0 at similar size.
  • Q2_K: 2-bit with some overhead. Aggressive compression, meaningful quality loss on most tasks.
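
To see where figures like 4.5 and 8.5 bits per weight come from, here is a sketch of a Q4_0-style block: 32 weights, one float16 scale, two 4-bit codes per byte. The byte layout and rounding convention are simplified assumptions for illustration and do not match the actual ggml structs bit for bit.

```python
import struct

BLOCK = 32

def pack_block_4bit(values):
    """Pack 32 floats into 2 bytes of float16 scale plus 16 bytes of 4-bit codes."""
    assert len(values) == BLOCK
    raw_scale = max(abs(v) for v in values) / 7
    # Round the scale to float16 first so packing and unpacking agree on it
    scale = struct.unpack("<e", struct.pack("<e", raw_scale))[0] or 1.0
    codes = [max(-8, min(7, round(v / scale))) + 8 for v in values]   # 0..15
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, BLOCK, 2))
    return struct.pack("<e", scale) + packed

def unpack_block_4bit(blob):
    (scale,) = struct.unpack("<e", blob[:2])
    out = []
    for byte in blob[2:]:
        out.append(((byte & 0x0F) - 8) * scale)   # low nibble holds the even index
        out.append(((byte >> 4) - 8) * scale)     # high nibble holds the odd index
    return out

def bits_per_weight(block_size, code_bits, scale_bits=16):
    return (block_size * code_bits + scale_bits) / block_size

values = [(i - 16) / 40 for i in range(BLOCK)]
blob = pack_block_4bit(values)
restored = unpack_block_4bit(blob)
print(len(blob), "bytes per block,", len(blob) * 8 / BLOCK, "bits per weight")
print(bits_per_weight(32, 8), "bits per weight for a Q8_0-style block")
```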

The K-quant formats are notable because they are not uniform: the same format string encodes a heterogeneous quantization strategy. This reflects a real insight from the literature, that different weight matrices have different sensitivity to quantization error.

GPTQ and AWQ: Smarter Post-Training Quantization

Plain block quantization is local: it rounds each weight to the nearest grid point, minimizing the raw numerical error in the weights without accounting for how those errors propagate through the network. GPTQ (Frantar et al., 2022) takes a more principled approach, building on the Optimal Brain Quantization method from the Optimal Brain Compression framework. It works layer by layer, using a small calibration dataset to estimate the Hessian of each layer's output reconstruction error with respect to its weights (not of the full training loss). It then quantizes the weights of each row in a fixed order and, after each one, adjusts the not-yet-quantized weights in that row to compensate for the error just introduced.
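
The compensation step is easier to follow on a toy case. The sketch below applies the Optimal Brain Quantization update to a single two-weight row in pure Python, with the Hessian taken as the sum of x·xᵀ over the calibration inputs plus a small damping term. It is only meant to show the idea: the real GPTQ implementation processes whole weight matrices with a Cholesky-based, batched formulation, and every name here is illustrative.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting, for small dense matrices."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def round_to_grid(w, step):
    return round(w / step) * step

def compensated_quantize(weights, calib_inputs, step, damp=1e-6):
    """Quantize one row left to right, pushing each rounding error onto later weights."""
    n = len(weights)
    hessian = [[sum(x[i] * x[j] for x in calib_inputs) + (damp if i == j else 0.0)
                for j in range(n)] for i in range(n)]
    hinv = invert(hessian)
    w = list(weights)
    for q in range(n):
        target = round_to_grid(w[q], step)
        err = (w[q] - target) / hinv[q][q]
        for j in range(q + 1, n):
            w[j] -= err * hinv[j][q]          # compensate in the remaining weights
        w[q] = target
        d = hinv[q][q]
        col = [hinv[i][q] for i in range(n)]
        for i in range(q + 1, n):             # drop weight q from the inverse Hessian
            for j in range(q + 1, n):
                hinv[i][j] -= col[i] * col[j] / d
    return w

def squared_output_error(original, quantized, calib_inputs):
    total = 0.0
    for x in calib_inputs:
        diff = sum((a - b) * xi for a, b, xi in zip(original, quantized, x))
        total += diff * diff
    return total

calib = [[1.0, 1.0], [1.0, 0.8]]   # two strongly correlated input channels
weights = [0.26, 0.26]
step = 0.5                          # a deliberately coarse grid

rtn = [round_to_grid(w, step) for w in weights]
compensated = compensated_quantize(weights, calib, step)
print("round-to-nearest:", rtn, squared_output_error(weights, rtn, calib))
print("compensated:     ", compensated, squared_output_error(weights, compensated, calib))
```

Round-to-nearest pushes both weights up to 0.5, so their errors add. The compensated version rounds the first weight up and then lowers the second one before rounding it, which leaves the layer output much closer to the original on the calibration inputs.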

The result is that GPTQ at 4-bit is clearly better than naive round-to-nearest 4-bit and often comes close to the float16 baseline on downstream tasks, especially for larger models. It is slower to quantize (minutes to hours for large models), but that cost is paid once: the output is an ordinary set of low-bit weights with scales. ExLlamaV2 and several other inference engines have optimized GPTQ kernels.

AWQ (Lin et al., 2023) takes a different angle: instead of compensating after the fact, it identifies the small fraction of weights that are “salient” (high activation scale) and protects them by scaling the weight channel up before quantization and the corresponding activation channel down. The salient weights get more precision effectively without changing the bit budget. AWQ is faster to apply than GPTQ and has become popular as a default for quantizing models for vLLM and similar serving stacks.
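
The scaling trick can be checked numerically on a single channel. In the sketch below, the grid step of 0.1, the activation magnitude of 8.0, and the scale factor s = 2 are all made-up values; AWQ itself searches for per-channel scales from calibration activations. The weights are kept small enough that scaling by 2 does not change the group's maximum, so the grid stays the same.

```python
import random

STEP = 0.1   # assumed shared 4-bit grid step for the group (absmax 0.7 / 7 levels)

def fake_quant(w, step=STEP):
    """Round a weight to the nearest grid point."""
    return round(w / step) * step

def output_error(w, x, s=1.0):
    """Output error of one channel when the weight is scaled by s and the activation by 1/s."""
    return abs(fake_quant(w * s) * (x / s) - w * x)

random.seed(0)
salient_activation = 8.0
channel_weights = [random.uniform(-0.3, 0.3) for _ in range(1000)]

plain = sum(output_error(w, salient_activation) for w in channel_weights) / 1000
scaled = sum(output_error(w, salient_activation, s=2.0) for w in channel_weights) / 1000
print(f"mean output error without scaling: {plain:.3f}, with s=2: {scaled:.3f}")
```

Without quantization the two forms are mathematically identical, since (w·s)·(x/s) = w·x. With quantization, the rounding error on the scaled weight has the same absolute bound but is divided by s when it reaches the output, so on average the salient channel's error shrinks.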

The NF4 Format in QLoRA

Tim Dettmers’ QLoRA paper introduced NF4: 4-bit NormalFloat. Rather than a linear integer grid, NF4 uses 16 quantization levels that are spaced according to the quantiles of a standard normal distribution. Since model weights are roughly normally distributed, this packs more levels where the density is highest.
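
A simplified way to see the idea is to build 16 levels from evenly spaced quantiles of a standard normal and normalize them to [-1, 1]. This is not the exact NF4 table from the paper or bitsandbytes (the real construction is asymmetric so that zero is represented exactly), only a sketch of quantile-based spacing using the standard library's NormalDist. The function names are illustrative.

```python
from statistics import NormalDist

def normal_quantile_levels(n_levels=16):
    """Levels at evenly spaced quantiles of N(0, 1), normalized to [-1, 1]."""
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]

def quantize_to_levels(block, levels):
    """Normalize a block by its absmax, then store the index of the nearest level."""
    absmax = max(abs(v) for v in block) or 1.0
    idx = [min(range(len(levels)), key=lambda k: abs(levels[k] - v / absmax)) for v in block]
    return idx, absmax

def dequantize_levels(idx, absmax, levels):
    return [levels[i] * absmax for i in idx]

levels = normal_quantile_levels()
print([round(v, 3) for v in levels])

idx, absmax = quantize_to_levels([0.02, -0.11, 0.35, -0.04], levels)
print(idx, dequantize_levels(idx, absmax, levels))
```

The printed levels are tightly packed near zero and spread out toward ±1, which is the opposite of a uniform integer grid.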

NF4 is used with double quantization (the quantization constants themselves are quantized), which cuts the overhead of the constants from about 0.5 to about 0.13 bits per parameter, for an effective cost of roughly 4.13 bits per parameter. The bitsandbytes library implements this behind load_in_4bit=True with bnb_4bit_quant_type="nf4" and bnb_4bit_use_double_quant=True. The key use case is fine-tuning: you freeze the base model in NF4 and train LoRA adapter weights in 16-bit precision (bfloat16 in the paper), making it possible to fine-tune a 65B model on a single 48 GB GPU.
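
The per-parameter cost can be worked out by hand, following the overhead arithmetic in the QLoRA paper: a block size of 64 with one float32 constant per block, and for double quantization, 8-bit constants with one float32 second-level constant per 256 blocks. The function below is just that arithmetic under those assumptions.

```python
def nf4_bits_per_param(block=64, double_quant=True, dq_block=256):
    """Effective bits per parameter for 4-bit codes plus quantization constants."""
    if not double_quant:
        return 4 + 32 / block                       # one float32 constant per block
    return 4 + 8 / block + 32 / (block * dq_block)  # 8-bit constants plus a second-level float32

print(nf4_bits_per_param(double_quant=False))
print(round(nf4_bits_per_param(double_quant=True), 3))
```

Without double quantization the constants cost 0.5 bits per parameter; with it, about 0.127.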

Where Quality Degrades and Why

There is a rough empirical hierarchy. Float16 and bfloat16 are lossless for practical purposes; the precision difference rarely matters. INT8 is similarly close to lossless for most tasks. Q4_K_M on a large model (13B+) is genuinely hard to distinguish from float16 on standard benchmarks. Q4_0 on a small model (7B) starts to show meaningful perplexity increases. Q2 on anything produces noticeably worse outputs.

The model size matters more than people expect. Larger models tend to tolerate quantization better: a 70B model at around 4.5 bits per weight still carries roughly 40 GB of weights, far more capacity than a 7B model at float16 (14 GB), and its redundancy absorbs rounding error that would visibly hurt a small model. This is part of why running a larger quantized model is often better than running a smaller full-precision model that needs the same amount of VRAM.

Calibration data also matters for GPTQ and AWQ. Quantizing with a calibration set drawn from the target domain (code, if you are quantizing a code model) gives better results than using a generic pile of text. The quantization is in some sense supervised by what distribution it will see at inference time.

Practical Guidance

For local inference with llama.cpp or Ollama, Q4_K_M is a common default for good reason: it is close to the Pareto frontier of quality versus size. If you have the VRAM, Q6_K or Q8_0 are closer still to float16 quality, at roughly 0.4x and 0.53x its size. Q2_K and Q3_K_S are for when you are VRAM-constrained to the point where the alternative is not running the model at all.

For serving with vLLM or TGI at scale, AWQ-quantized models are a practical choice: they are quicker to produce than GPTQ models, have good kernel support, and the quality delta from float16 is small for models above 13B.

For fine-tuning, bitsandbytes NF4 with QLoRA is the standard approach and works well. Just be aware that the LoRA adapters are trained in higher precision while the frozen base weights are stored in 4-bit and dequantized on the fly for each forward and backward pass; the training dynamics are slightly different from full fine-tuning, which matters if you are pushing capability boundaries rather than doing domain adaptation.

Quantization is fundamentally a compression problem with the twist that the “decompressor” is the entire inference compute stack. Understanding the math from the bit level up makes it much easier to evaluate new formats and methods as they appear, rather than treating each quantization type as an opaque name to memorize.
