
What Quantization Actually Does to Your Model Weights

Source: simonwillison

Simon Willison recently published a thorough walkthrough of quantization from first principles, and it is one of the better explanations of the topic I have come across. It prompted me to go deeper on a few things the article touches but does not fully unpack: specifically, how grouped quantization in GGUF actually works, what the accuracy costs look like in practice, and how to reason about which quantization level to use rather than just defaulting to whatever Ollama recommends.

If you run any LLMs locally, you have encountered quantization whether you realized it or not. When you pull a Q4_K_M or Q5_K_S model through Ollama or llama.cpp, those letters and numbers describe how the model’s weights have been compressed. Understanding what they mean is the difference between making an informed choice and cargo-culting whatever shows up at the top of a leaderboard.

The Core Math

A float32 value uses 32 bits to represent a number, giving you roughly 7 decimal digits of precision across a wide dynamic range. A neural network weight stored as float32 can be any value from around -3.4e38 to 3.4e38. In practice, the weights in a well-trained LLM tend to cluster in a much narrower range, often within a few standard deviations of zero.

Quantization exploits this property. Instead of storing each weight as a full float32, you map the observed range of values onto a smaller integer type, store the integers, and reconstruct approximate floats at inference time using a stored scale factor (and optionally a zero point for asymmetric quantization).

For symmetric INT8 quantization, the math is straightforward:

scale = max(abs(weights)) / 127
quantized = round(weight / scale)          # stored as int8
reconstructed = quantized * scale          # used at inference
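
As a minimal runnable sketch of that round trip in plain Python, assuming a flat list of weights: the function names are mine, and the clamp to [-127, 127] is a safety net that the formulas above leave implicit.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers using a single scale and no zero point."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for int8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp so rounding can never step outside the integer range.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_symmetric(quantized, scale):
    """Reconstruct approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.02, -0.75, 0.31, 1.27, -0.004]
quantized, scale = quantize_symmetric(weights)
reconstructed = dequantize_symmetric(quantized, scale)
errors = [abs(w - r) for w, r in zip(weights, reconstructed)]
```

Here the largest weight (1.27) sets the scale to about 0.01, and the smallest weight (-0.004) collapses to zero, which is the kind of loss the rest of this article is about.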

For asymmetric quantization, you need a zero point to handle ranges that are not centered at zero:

scale = (max_val - min_val) / 255
zero_point = round(-min_val / scale)       # offset to align zero
quantized = round(weight / scale) + zero_point
reconstructed = (quantized - zero_point) * scale
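
The asymmetric variant can be sketched the same way. This assumes an unsigned 8-bit target (0 to 255) and adds a clamp that the formulas above do not show; the function names are hypothetical.

```python
def quantize_asymmetric(weights, bits=8):
    """Map floats onto 0..qmax using a scale and an integer zero point."""
    qmax = 2 ** bits - 1                          # 255 for uint8
    min_val, max_val = min(weights), max(weights)
    scale = (max_val - min_val) / qmax if max_val > min_val else 1.0
    zero_point = round(-min_val / scale)          # integer code that represents 0.0
    quantized = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return quantized, scale, zero_point


def dequantize_asymmetric(quantized, scale, zero_point):
    """Undo the offset, then rescale back to floats."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-0.2, 0.0, 0.35, 1.0, 2.35]            # range is not centered on zero
quantized, scale, zero_point = quantize_asymmetric(weights)
reconstructed = dequantize_asymmetric(quantized, scale, zero_point)
```

One property worth noticing: a weight of exactly 0.0 maps to the zero point and comes back as exactly 0.0.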

The error introduced by this round-trip is called quantization error, and it is bounded by half the scale value. If your scale is 0.01, the maximum per-weight error is 0.005. That sounds small, but in a model with billions of weights, these errors accumulate. The challenge of modern quantization research is keeping those accumulated errors from visibly degrading output quality.

Why Per-Tensor Quantization Breaks Down

The naive approach assigns a single scale factor to an entire weight matrix (per-tensor quantization). This works poorly because weight distributions vary significantly across different layers and even different rows within a layer. An outlier weight with a large magnitude forces a large scale factor, which coarsens the precision available for all the other weights in the tensor.

Per-channel quantization solves part of this by assigning one scale factor per output channel (one per row in a weight matrix). This is the approach used by many early INT8 implementations and by bitsandbytes, the library that powers Hugging Face’s load_in_8bit feature.
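
A small sketch of the difference, using a made-up 2x4 matrix with one outlier. The helper names and the numbers are illustrative assumptions, not taken from any library.

```python
def max_abs_scale(values, qmax=127):
    """Symmetric INT8 scale derived from the largest magnitude in `values`."""
    m = max(abs(v) for v in values)
    return m / qmax if m > 0 else 1.0


def round_trip(values, scale, qmax=127):
    """Quantize to integers and immediately reconstruct the floats."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Two output channels (rows); the second contains one large outlier.
matrix = [
    [0.011, -0.023, 0.017, -0.008],
    [0.015, -0.019, 6.350, 0.004],
]

# Per-tensor: one scale for everything, dominated by the 6.35 outlier.
tensor_scale = max_abs_scale([v for row in matrix for v in row])
per_tensor = [round_trip(row, tensor_scale) for row in matrix]

# Per-channel: each row gets its own scale.
per_channel = [round_trip(row, max_abs_scale(row)) for row in matrix]

row0_err_tensor = mean_abs_error(matrix[0], per_tensor[0])
row0_err_channel = mean_abs_error(matrix[0], per_channel[0])
```

With a single per-tensor scale, every weight in the first row rounds to zero. Per-channel scaling rescues that row, but the second row gains nothing because the outlier lives inside it.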

But even per-channel quantization has limits when you push to 4 bits or below. This is where grouped quantization becomes important.

Grouped Quantization: How GGUF K-Quants Work

The GGUF format (used by llama.cpp and Ollama) uses a scheme called K-quants for its recommended quantization levels. The key insight is to apply quantization at a much finer granularity, called a block or group, typically 32 or 256 consecutive weights.

For a weight matrix with millions of values, you slice it into chunks of 256 weights each. Within each chunk, you compute a local scale factor. Outliers in one chunk only affect the precision of that chunk, not the entire matrix. This dramatically reduces quantization error compared to per-tensor or even per-channel approaches at low bit depths.
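
The chunking idea can be sketched as follows. This is a simplified symmetric scheme with one float scale per block, not the actual K-quant layout, and the block size of 32 and the function names are assumptions chosen for a short example.

```python
def quantize_blockwise(weights, block_size=32, bits=4):
    """Symmetric round-to-nearest with one scale per block of consecutive weights."""
    qmax = 2 ** (bits - 1) - 1                   # 7 for 4-bit symmetric
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        codes = [max(-qmax, min(qmax, round(w / scale))) for w in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_blockwise(blocks):
    """Rebuild the flat weight list, using each block's own scale."""
    return [code * scale for scale, codes in blocks for code in codes]


# 64 small weights with a single outlier placed in the second block.
weights = [0.01 * ((i % 7) - 3) for i in range(64)]
weights[40] = 2.1

blocks = quantize_blockwise(weights, block_size=32, bits=4)
restored = dequantize_blockwise(blocks)
first_block_error = max(abs(a - b) for a, b in zip(weights[:32], restored[:32]))
second_block_error = max(abs(a - b) for a, b in zip(weights[32:], restored[32:]))
```

The outlier wrecks precision in its own block (the small weights there all round to zero) while the first block keeps a fine-grained scale.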

The GGUF naming convention encodes the method:

  • Q4_K_M means 4-bit quantization, K-quants method, medium variant
  • Q5_K_S means 5-bit quantization, K-quants method, small variant
  • Q3_K_L means 3-bit quantization, K-quants, large variant
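
If you want to handle these names in a script, a small parser is enough. This is a hypothetical helper that only covers the K-quant names discussed here; it is not part of llama.cpp or Ollama.

```python
import re

VARIANTS = {"S": "small", "M": "medium", "L": "large"}


def parse_quant_name(name):
    """Split a K-quant name such as 'Q4_K_M' into its parts."""
    match = re.fullmatch(r"Q(\d+)_K(?:_([SML]))?", name)
    if match is None:
        return None                      # not a K-quant name, e.g. Q8_0 or F16
    bits, variant = match.groups()
    return {
        "bits": int(bits),
        "method": "K-quants",
        "variant": VARIANTS.get(variant),  # None when there is no size suffix
    }


info = parse_quant_name("Q4_K_M")
```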

The size suffix (S/M/L) refers to how the quantization is applied to different parts of the model. In practice, the K-quant variants use a mixture: some layers (usually attention matrices and output layers) are quantized more conservatively at higher bit depth, while feed-forward layers use the nominal bit depth. The medium variant is generally the sweet spot because it protects the most quality-sensitive layers without ballooning file size.

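You can see the actual block structure in the ggml source code. For Q4_K, each superblock contains 256 weights split into 8 groups of 32. Each group has its own scale and minimum, themselves quantized to 6 bits, and the superblock stores two half-precision values that rescale those group scales and minimums. That works out to 4.5 bits per weight.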
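
As a rough size check, here is the storage arithmetic for one Q4_K superblock. The byte layout is my reading of the block_q4_K struct in ggml (two fp16 values, 12 bytes of packed 6-bit scales and minimums, 128 bytes of 4-bit codes), so treat it as an assumption and verify it against the source.

```python
WEIGHTS_PER_SUPERBLOCK = 256

# Assumed byte layout of one Q4_K superblock.
layout_bytes = {
    "d: fp16 multiplier for the group scales": 2,
    "dmin: fp16 multiplier for the group minimums": 2,
    "scales: 8 scales + 8 minimums, 6 bits each, packed": 12,
    "qs: 256 weights x 4 bits": 128,
}

packed_scale_bits = (8 + 8) * 6                  # should fill the 12-byte field
total_bytes = sum(layout_bytes.values())
bits_per_weight = total_bytes * 8 / WEIGHTS_PER_SUPERBLOCK
```

Under these assumptions a superblock is 144 bytes, or 4.5 bits per weight. A Q4_K_M file averages somewhat more than that because some tensors are stored in a higher-bit type.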

What the Accuracy Cost Actually Looks Like

The standard metric for measuring quantization quality is perplexity on a held-out text corpus (typically wikitext-2 or similar). Lower perplexity means better next-token prediction accuracy.
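
Perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming you already have the natural-log probability the model assigned to each correct token (the input values below are made up):

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities assigned to the correct next token at four positions.
probs = [0.25, 0.25, 0.25, 0.25]
ppl = perplexity([math.log(p) for p in probs])
```

A model that always gives the correct token a 1-in-4 chance has a perplexity of 4: it is as uncertain as a uniform guess among four options.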

For a 7B parameter model, rough perplexity degradation relative to float16 baseline looks like this:

Format    Bits/Weight   Perplexity Delta
F16       16            baseline
Q8_0      8.5           +0.01 to +0.05
Q5_K_M    5.7           +0.1 to +0.2
Q4_K_M    4.8           +0.2 to +0.4
Q3_K_M    3.9           +0.5 to +1.0
Q2_K      2.6           +1.5 to +3.0

These numbers come from llama.cpp’s own benchmarks and community measurements on the GGUF comparison spreadsheet. The absolute values differ by model family and size, but the relative ordering is consistent.
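
To turn the bits-per-weight column into approximate file sizes, multiply by the parameter count. This sketch counts weights only and assumes exactly 7 billion parameters; real files also carry metadata, and inference needs extra memory for the KV cache.

```python
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K": 2.6,
}


def weight_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


sizes = {
    name: round(weight_size_gb(7e9, bpw), 1)
    for name, bpw in BITS_PER_WEIGHT.items()
}
```

For a 7B model this gives about 14 GB at F16 and about 4.2 GB at Q4_K_M.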

The practical takeaway: Q4_K_M sits at a very good inflection point. At roughly 4.8 bits per weight you cut about 70% of the float16 memory footprint and see minimal perplexity degradation for most tasks. Going to Q3_K_M shrinks the file by a further 19% or so (about 6% of the original float16 size), but the perplexity hit starts to become noticeable on more demanding tasks, particularly multi-step reasoning.

Model size matters here too. Larger models are more robust to aggressive quantization because they have more redundancy. A 70B model at Q4_K_M often outperforms a 7B model at float16. This is the most important practical insight for local inference: running a larger quantized model almost always beats a smaller model at full precision.

The Hardware Angle

Quantization is not just about reducing memory footprint. It also affects inference speed, but the relationship is not simple.

On CPUs (a common way to run local models), the speed gain comes from memory bandwidth. At 4 bits per weight instead of 16, you move a quarter as much data from RAM to CPU during the matrix multiplications that dominate transformer inference. Token generation on a machine with 50 GB/s of RAM bandwidth is usually limited by that bandwidth rather than by arithmetic, so the saving is substantial. A Q4 model can sustain roughly three to four times the token throughput of a float16 model on the same CPU hardware.
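
A back-of-the-envelope bound makes the bandwidth argument concrete. It assumes every generated token has to read all the weights from RAM once and ignores compute, caches, and the KV cache, so real throughput will be lower; the function name is mine.

```python
def max_tokens_per_second(bandwidth_gb_s, n_params, bits_per_weight):
    """Upper bound on tokens/s if each token streams all weights from RAM once."""
    model_gb = n_params * bits_per_weight / 8 / 1e9
    return bandwidth_gb_s / model_gb


f16 = max_tokens_per_second(50, 7e9, 16)    # about 3.6 tokens/s
q4 = max_tokens_per_second(50, 7e9, 4)      # about 14.3 tokens/s
speedup = q4 / f16
```

At exactly 4 bits the ratio is 4x. At the roughly 4.8 bits per weight of Q4_K_M it is closer to 16 / 4.8, or about 3.3x, which is where the "three to four times" figure comes from.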

On GPUs, the picture is more complicated. NVIDIA tensor cores have supported INT8 and INT4 since Turing (RTX 20xx), although INT4 support varies across later architectures. But llama.cpp’s CUDA backend typically dequantizes weights to float16 before the matrix multiplication rather than running true INT4 matrix ops. This means you get the memory savings but not necessarily the compute savings on GPU. Custom CUDA kernels like those in GPTQ-for-LLaMA or ExLlamaV2 close part of the gap by unpacking the quantized weights on the fly inside the matmul kernel, but true INT4 matrix multiplication also requires quantizing the activations, which weight-only formats do not do.
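
To make "dequantizes weights to float16 before the matrix multiplication" concrete, here is a CPU-side sketch of the unpacking step. The nibble order and the fixed zero point of 8 are assumptions for illustration, not the layout of any particular format, and Python floats stand in for float16.

```python
def pack_nibbles(codes):
    """Pack pairs of unsigned 4-bit codes (0..15) into bytes, low nibble first."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_and_dequantize(packed, scale, zero_point=8):
    """Expand each byte into two floats: (code - zero_point) * scale."""
    out = []
    for byte in packed:
        out.append(((byte & 0x0F) - zero_point) * scale)
        out.append(((byte >> 4) - zero_point) * scale)
    return out


codes = [0, 15, 8, 9]
packed = pack_nibbles(codes)                      # 2 bytes hold 4 weights
weights = unpack_and_dequantize(packed, scale=0.5)
```

The weights travel through memory as packed bytes, but the arithmetic that follows runs on the expanded floats, which is why the memory saving does not automatically become a compute saving.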

ExLlamaV2 is worth mentioning because it takes a different approach to quantization than GGUF. It uses EXL2 format, which allows non-uniform bits-per-weight across different layers based on measured sensitivity. You can specify a target average bit depth (like 4.5 bpw) and ExLlamaV2 will allocate more bits to layers that matter more and fewer bits to layers where the model is robust to compression.
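
The "average bit depth" is a parameter-weighted mean. The sketch below only shows that arithmetic for a made-up allocation; it does not reproduce how ExLlamaV2 measures sensitivity or chooses the bits per layer.

```python
def average_bpw(layers):
    """Parameter-weighted mean bits per weight over (n_params, bits) pairs."""
    total_params = sum(n for n, _ in layers)
    total_bits = sum(n * bits for n, bits in layers)
    return total_bits / total_params


# Hypothetical allocation: a sensitive quarter of the weights at 6 bits,
# the remaining three quarters at 4 bits.
layers = [
    (1_000_000, 6.0),
    (3_000_000, 4.0),
]
avg = average_bpw(layers)
```

This example lands on 4.5 bpw: spending 6 bits on a quarter of the weights costs only half a bit on average.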

GPTQ and AWQ: Post-Training Quantization with Calibration Data

The quantization methods in GGUF and bitsandbytes are essentially round-to-nearest: compute scale, round weights, done. A more sophisticated family of methods uses a small calibration dataset to minimize the reconstruction error more carefully.

GPTQ (from Frantar et al., 2022) applies optimal brain compression ideas from classic neural network pruning literature. It processes weight matrices column by column, and after quantizing each weight, it updates the remaining unquantized weights in the same row to compensate for the introduced error. This error-compensation step is what makes GPTQ produce better quality at the same bit depth compared to round-to-nearest. The downside is that GPTQ quantization takes significant time and compute to run (hours for large models), and it requires a calibration dataset whose distribution should match your inference distribution.
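
The intuition behind error compensation can be shown with a deliberately simplified toy. This is not GPTQ: the real algorithm uses the inverse Hessian of the layer inputs to decide how to spread each correction over the not-yet-quantized weights. Here I assume every input feature is identical, so only the sum of the weights matters, and each rounding error is simply carried onto the next weight.

```python
def round_to_nearest(weights):
    """Baseline: round every weight independently."""
    return [round(w) for w in weights]


def round_with_error_carry(weights):
    """Quantize left to right, pushing each rounding error onto the next weight."""
    quantized = []
    carry = 0.0
    for w in weights:
        adjusted = w + carry          # compensate for the error made so far
        q = round(adjusted)
        carry = adjusted - q          # what is still owed to later weights
        quantized.append(q)
    return quantized


weights = [0.4, 0.4, 0.4]             # integer grid, i.e. scale = 1
x = 1.0                               # every input feature identical (toy assumption)
target = sum(w * x for w in weights)

rtn_error = abs(target - sum(q * x for q in round_to_nearest(weights)))
carry_error = abs(target - sum(q * x for q in round_with_error_carry(weights)))
```

Independent rounding sends all three weights to 0 and loses the full output of 1.2. Carrying the error forward produces [0, 1, 0] and misses by only 0.2, even though each individual weight is no closer to its original value.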

AWQ (Activation-Aware Weight Quantization, Lin et al., 2023) takes a different angle: it observes that roughly 1% of weights are much more important than the rest because they correspond to activation channels with large magnitudes. Rather than treating all weights equally, AWQ scales up those important weights before quantization (so the quantizer allocates more precision to them) and scales down the activations correspondingly, which is mathematically equivalent but dramatically reduces the quantization error on the weights that matter.
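
A two-channel toy shows why the rescaling helps. The numbers and the scale factor of 2 are made up, and real AWQ searches for per-channel scales using calibration activations. The trick also relies on the scaled weight not becoming the new maximum of its quantization group.

```python
def quantize_row(weights, bits=4):
    """Symmetric round-to-nearest with one scale for the whole row."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.17, 0.8]         # one output row, two input channels
activations = [10.0, 1.0]     # channel 0 carries large activations ("salient")
channel_scales = [2.0, 1.0]   # hypothetical AWQ-style scale for the salient channel

exact = dot(weights, activations)

# Plain round-to-nearest on the original weights.
plain_error = abs(exact - dot(quantize_row(weights), activations))

# Scale weights up and activations down; the exact product is unchanged.
scaled_weights = [w * s for w, s in zip(weights, channel_scales)]
scaled_activations = [a / s for a, s in zip(activations, channel_scales)]
awq_error = abs(exact - dot(quantize_row(scaled_weights), scaled_activations))
```

The quantization step is the same in both cases, but after dividing the activation by 2 the rounding error on the salient weight counts for half as much in the output. In this example the rescaled weight also happens to land closer to a grid point, so the output error drops by far more than a factor of two.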

For practical use, both GPTQ and AWQ models are widely available on Hugging Face in pre-quantized form. If you are using a GPU and want the best quality at 4 bits, a pre-quantized AWQ model through vLLM or a GPTQ model through ExLlamaV2 will generally beat a GGUF Q4_K_M model in quality, though the difference shrinks as models get larger.

Choosing in Practice

The decision tree I have settled on for local inference:

If running on CPU: GGUF Q4_K_M is the default choice. It has the best balance of quality and speed. Use Q5_K_M if you have memory headroom and care about output quality on reasoning tasks. Avoid anything below Q3_K unless you are genuinely memory-constrained.

If running on GPU with enough VRAM for the whole model: Float16 is your baseline. Use GPTQ or AWQ only if you need to run a model too large to fit in float16, or one that was only released in quantized form.

If running on GPU where you need quantization to fit: ExLlamaV2 with EXL2 format gives the best quality-per-gigabyte for NVIDIA GPUs. GGUF with GPU offload works but leaves some performance on the table.

For production inference at scale: Look at vLLM with AWQ or TensorRT-LLM with INT8 SmoothQuant. These are purpose-built for throughput and handle quantization more carefully than general-purpose loaders.
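
The same decision tree, written as a function. This is a hypothetical helper that just encodes the rules of thumb above; the parameter names are mine, and the returned strings are labels rather than real API values.

```python
def recommend_format(device, fits_in_fp16=False, production=False, memory_headroom=False):
    """Return the article's suggested starting point for a deployment target."""
    if production:
        return "vLLM with AWQ, or TensorRT-LLM with INT8 SmoothQuant"
    if device == "cpu":
        # Q5_K_M only when there is spare memory and quality matters more.
        return "GGUF Q5_K_M" if memory_headroom else "GGUF Q4_K_M"
    if device == "gpu":
        return "float16" if fits_in_fp16 else "EXL2 via ExLlamaV2"
    raise ValueError(f"unknown device: {device!r}")


choice = recommend_format("cpu")
```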

The broader point is that quantization is not a single thing. It is a family of techniques with different trade-offs across quality, speed, hardware requirements, and ease of use. Simon Willison’s ground-up walkthrough gives you the foundation to understand what these techniques share; knowing where they diverge is what lets you make the right call for a specific deployment target.
