From Float to Integer: What Quantization Actually Does to a Language Model
Source: simonwillison
Simon Willison recently published a thorough ground-up explanation of quantization that’s worth reading if you want to understand what’s happening inside your GGUF files. It prompted me to go deeper on the parts that I think matter most for developers who are actually making decisions about which quantized model to download.
The short version: quantization is why a 70-billion-parameter model that would require 140 GB of memory at 16-bit precision (and 280 GB in FP32) can run on a machine with 48 GB. The long version involves some arithmetic that, once you understand it, changes how you pick quantization formats.
What Floating Point Costs You
Every weight in an LLM is stored as a number. In a freshly trained model, those numbers are typically 32-bit floats (FP32) or 16-bit brain floats (BF16). BF16 uses 8 bits for the exponent and 7 bits for the mantissa, which gives it the same range as FP32 but lower precision. A 7-billion-parameter model in BF16 takes about 14 GB. In FP32 it’s 28 GB.
Quantization asks: do we actually need that much precision? The weights in a trained model cluster in relatively narrow ranges. Most of them are small values near zero, with occasional outliers. That distribution means a lot of the representational capacity of a 16-bit float is going unused. If you could map those weights to 8-bit or 4-bit integers, you’d cut memory consumption by 2x or 4x respectively, with some loss in precision.
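To make the arithmetic concrete, here is a minimal sketch of the weight-memory calculation. The helper name `weight_memory_gb` is my own, and it counts only the weights: activations, the KV cache, and any per-block scale overhead are ignored.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


# Same 7B model at different precisions.
for name, bits in [("FP32", 32), ("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"7B at {name}: {weight_memory_gb(7e9, bits):.1f} GB")
```

Going from 16 bits to 8 or 4 bits per weight gives the 2x and 4x reductions mentioned above; real formats land slightly higher because they also store scale factors.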
The mapping is called linear quantization. For a symmetric scheme:
scale = max_abs_value / (2^(bits-1) - 1)
quantized = round(original / scale)
dequantized = quantized * scale
For 8-bit symmetric quantization, you find the largest absolute value in a tensor, compute a scale factor that maps that value to 127, and round everything else accordingly. Dequantization is just multiplying by the same scale. The error introduced is the rounding: each weight gets snapped to the nearest representable integer, and you lose whatever was in between.
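The symmetric formulas translate almost directly into code. This is a minimal pure-Python sketch, not how any particular library implements it; the function names and the sample weights are made up for illustration.

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integers in [-qmax, qmax] with a single scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp defensively in case floating-point division overshoots qmax.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_symmetric(quantized, scale):
    """Recover approximate floats by multiplying by the same scale."""
    return [q * scale for q in quantized]


weights = [0.02, -0.31, 0.127, 0.5, -0.04]
quantized, scale = quantize_symmetric(weights, bits=8)
restored = dequantize_symmetric(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, scale, max_error)
```

The largest absolute value maps to 127, and the worst-case rounding error for any weight is half of one scale step.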
Asymmetric quantization adds a zero point to handle tensors where the distribution isn’t centered on zero:
scale = (max_value - min_value) / (2^bits - 1)
zero_point = round(-min_value / scale)
quantized = round(original / scale + zero_point)
dequantized = (quantized - zero_point) * scale
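This matters when a tensor’s values sit mostly or entirely on one side of zero. That is common for activations (for example, after a ReLU) and also shows up in individual blocks of weights; a symmetric scheme would waste half of its integer range on values that never occur.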
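Here is the asymmetric scheme as a small sketch that follows the four formulas above literally. The function names and sample values are illustrative. Note that with an all-positive input the zero point comes out negative; many real frameworks instead widen the range to include zero so the zero point fits in the unsigned integer type, which this sketch does not attempt.

```python
def quantize_asymmetric(values, bits=8):
    """Map floats to unsigned integers in [0, 2^bits - 1] using scale and zero point."""
    qmax = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    quantized = [max(0, min(qmax, round(v / scale + zero_point))) for v in values]
    return quantized, scale, zero_point


def dequantize_asymmetric(quantized, scale, zero_point):
    """Undo the mapping: shift by the zero point, then rescale."""
    return [(q - zero_point) * scale for q in quantized]


values = [0.5, 0.9, 1.4, 2.2, 3.0]  # all positive, so not centered on zero
quantized, scale, zero_point = quantize_asymmetric(values, bits=8)
restored = dequantize_asymmetric(quantized, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(values, restored))
print(quantized, scale, zero_point, max_error)
```

The minimum lands on integer 0 and the maximum on 255, so the full integer range is spent on values that actually occur.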
The Outlier Problem
Naive per-tensor quantization, where you compute one scale factor for an entire weight matrix, works acceptably for 8-bit but falls apart at 4-bit. The reason is outliers.
Transformer weight matrices contain a small number of values that are significantly larger than the rest. If you scale the entire tensor to accommodate those outliers, you compress all the small values into a narrow range of integers, losing precision everywhere. A tensor where most weights are in [-0.1, 0.1] but which has a single outlier at 2.0 will have its scale dominated by that outlier. At 4 bits the scale becomes 2.0 / 7 ≈ 0.29, so every weight in [-0.1, 0.1] rounds to 0.
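A quick sketch makes the collapse visible. It round-trips a handful of made-up weights through symmetric per-tensor quantization, with and without a single outlier at 2.0; the helper name `roundtrip_symmetric` is my own.

```python
def roundtrip_symmetric(values, bits):
    """Quantize with one per-tensor scale, then dequantize again."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]


small = [0.1, -0.08, 0.05, -0.02, 0.09, -0.1]
with_outlier = small + [2.0]

restored_small_4bit = roundtrip_symmetric(small, bits=4)
restored_4bit = roundtrip_symmetric(with_outlier, bits=4)
restored_8bit = roundtrip_symmetric(with_outlier, bits=8)

print("4-bit, no outlier:  ", restored_small_4bit)
print("4-bit, with outlier:", restored_4bit)
print("8-bit, with outlier:", restored_8bit)
```

Without the outlier, 4 bits preserve the small weights reasonably well. With it, the 4-bit version keeps the outlier and flattens everything else, while the 8-bit version still has enough integer levels to survive.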
The standard fix is per-channel or per-group quantization. Instead of one scale for the whole matrix, you compute separate scales for each row, column, or small block of weights. Each block gets its own scale factor calibrated to its own range, so outliers in one block don’t destroy precision in another.
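As a rough sketch of the per-group idea, the function below gives each group of consecutive weights its own symmetric scale. The group size of 4 and the sample weights are chosen only to keep the example short; real formats use larger groups such as 32 or 128.

```python
def quantize_groups(values, bits=4, group_size=4):
    """Round-trip values through symmetric quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1
    restored = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        restored.extend(round(v / scale) * scale for v in group)
    return restored


def mean_squared_error(original, restored):
    return sum((a - b) ** 2 for a, b in zip(original, restored)) / len(original)


weights = [0.1, -0.08, 0.05, -0.02, 0.09, -0.1, 0.03, 2.0]

per_tensor = quantize_groups(weights, group_size=len(weights))  # one scale for all
per_group = quantize_groups(weights, group_size=4)              # one scale per 4 weights

print("per-tensor MSE:", mean_squared_error(weights, per_tensor))
print("per-group MSE: ", mean_squared_error(weights, per_group))
```

The group that contains the outlier is still damaged, but the other group keeps its precision, which is exactly the containment the block-wise formats rely on.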
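This is the core idea behind block-wise quantization in llama.cpp. The original Q4_0 and Q8_0 formats already gave each block of 32 weights its own scale. The K-quant formats, contributed to llama.cpp by Iwan Kawrakow (ikawrakow) in mid-2023, refined that scheme into something considerably more sophisticated and largely superseded the older formats as the community default.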
What K-Quants Actually Do
The K in Q4_K_M stands for a specific quantization technique using super-blocks and sub-blocks. The structure looks like this:
- Weights are grouped into super-blocks of 256 values
- Each super-block is divided into sub-blocks of 32 values
- Each sub-block has its own scale and minimum, each quantized to 6 bits
- The super-block stores an overall scale for those sub-block values in FP16
So a Q4_K model stores each weight in 4 bits, but the scale factors themselves are stored with enough precision that the dequantization is accurate. The overhead from scale storage is small relative to the savings from 4-bit weights.
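The overhead claim is easy to check with a little arithmetic. The sketch below assumes the Q4_K layout as I understand it from llama.cpp: 256 four-bit weights, eight 6-bit sub-block scales, eight 6-bit sub-block minimums, and two FP16 super-block values. Treat the minimums and the second FP16 value as my assumption about the layout rather than something stated above.

```python
SUPER_BLOCK = 256  # weights per super-block
SUB_BLOCK = 32     # weights per sub-block


def q4_k_bits_per_weight():
    """Effective bits per weight for a Q4_K super-block under the assumed layout."""
    n_sub_blocks = SUPER_BLOCK // SUB_BLOCK  # 8 sub-blocks
    weight_bits = SUPER_BLOCK * 4            # the 4-bit weights themselves
    sub_scale_bits = n_sub_blocks * 6        # one 6-bit scale per sub-block
    sub_min_bits = n_sub_blocks * 6          # assumption: one 6-bit minimum per sub-block
    super_bits = 2 * 16                      # assumption: two FP16 super-block values
    total_bits = weight_bits + sub_scale_bits + sub_min_bits + super_bits
    return total_bits / SUPER_BLOCK


def q4_0_bits_per_weight():
    """Legacy Q4_0 for comparison: 32 four-bit weights plus one FP16 scale."""
    return (32 * 4 + 16) / 32


print(q4_k_bits_per_weight(), q4_0_bits_per_weight())
```

Under these assumptions the metadata adds half a bit per weight, the same budget Q4_0 spends on a single FP16 scale per block.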
The S, M, and L suffixes (as in Q3_K_S, Q3_K_M, Q3_K_L) indicate small, medium, and large variants that apply different quantization levels to different tensors; at 4 bits, llama.cpp ships only Q4_K_S and Q4_K_M. Attention layers and feed-forward layers have different sensitivity to quantization. Q4_K_M uses Q6_K for some of the more sensitive tensors while keeping Q4_K elsewhere, giving a better quality-per-byte tradeoff than uniformly quantizing everything at 4 bits.
This is why Q4_K_M became the community default rather than Q4_0. The two formats use roughly comparable storage per weight (Q4_K_M is slightly larger because of its higher-precision tensors), but Q4_K_M’s two-level block scaling and mixed precision give it measurably better output quality, especially for longer generations where rounding errors compound.
GPTQ and AWQ Take a Different Path
GGUF’s K-quants are designed for CPU-friendly inference with llama.cpp. The post-training quantization methods popular on Hugging Face Hub, GPTQ and AWQ, take a different approach.
GPTQ, from researchers at IST Austria, uses second-order information from the Hessian matrix to minimize the quantization error. Rather than just rounding each weight to the nearest integer, it adjusts the remaining weights to compensate for the rounding error introduced in already-quantized weights. This requires running a calibration set through the model during quantization, which takes compute but results in lower error than simple rounding.
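The compensation step is easier to see on a toy problem. The sketch below quantizes a two-weight row against two made-up calibration inputs, using the classic second-order update: after rounding weight i, the error divided by the i-th diagonal entry of the inverse Hessian is pushed onto the remaining weights. This is only the core idea; the real GPTQ algorithm adds blocking, Cholesky factorization, and damping that I have left out, and all names here are my own.

```python
def invert(matrix):
    """Invert a small square matrix with Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def quantize_with_compensation(weights, inputs, scale):
    """Quantize one weight row left to right, pushing each rounding error
    onto the weights that have not been quantized yet."""
    n = len(weights)
    # Hessian of the layer-wise squared error, up to a constant: H = X^T X.
    hessian = [[sum(x[i] * x[j] for x in inputs) for j in range(n)]
               for i in range(n)]
    h_inv = invert(hessian)
    w = list(weights)
    quantized = [0.0] * n
    for i in range(n):
        quantized[i] = round(w[i] / scale) * scale
        err = (w[i] - quantized[i]) / h_inv[i][i]
        for j in range(i + 1, n):
            w[j] -= err * h_inv[j][i]
        # Remove weight i from the inverse Hessian before moving on.
        d = h_inv[i][i]
        col = [h_inv[r][i] for r in range(n)]
        row = list(h_inv[i])
        for r in range(n):
            for c in range(n):
                h_inv[r][c] -= col[r] * row[c] / d
    return quantized


def output_error(weights, quantized, inputs):
    """Squared error of the layer output over the calibration inputs."""
    return sum(
        sum(x[k] * (weights[k] - quantized[k]) for k in range(len(weights))) ** 2
        for x in inputs
    )


inputs = [[1.0, 1.0], [1.0, 0.5]]  # toy calibration set with correlated features
weights = [0.4, 0.4]
scale = 1.0                        # deliberately coarse grid

round_to_nearest = [round(w / scale) * scale for w in weights]
compensated = quantize_with_compensation(weights, inputs, scale)

print(round_to_nearest, output_error(weights, round_to_nearest, inputs))
print(compensated, output_error(weights, compensated, inputs))
```

Plain rounding sends both weights to 0. With compensation, the error from the first weight is shifted onto the second, which then rounds up instead, and the output error on the calibration inputs drops sharply.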
AWQ (Activation-Aware Weight Quantization) from MIT observes that not all weights are equally important. Weights that correspond to channels with large activation magnitudes contribute more to the output and should be quantized more carefully. AWQ identifies these important weights and scales them before quantization to improve their precision, then rescales activations to compensate. The result is 4-bit quantization that better preserves the computationally significant parts of the model.
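The scale-then-compensate trick can be sketched in a few lines. Here channel 0 has a much larger activation than the others, so its weight is scaled up by 2 before quantization and its activation is divided by 2 afterwards. The numbers and the scale factor are hand-picked for illustration; the actual AWQ method searches for per-channel scales using calibration data.

```python
def quantize_row(weights, bits=4):
    """Symmetric round-trip quantization with one scale for the whole row."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.2, 0.9, -0.5, 0.33]
activations = [10.0, 0.1, 0.1, 0.1]    # channel 0 is the salient one
channel_scales = [2.0, 1.0, 1.0, 1.0]  # hand-picked; AWQ would search for these

scaled_weights = [w * s for w, s in zip(weights, channel_scales)]
scaled_activations = [x / s for x, s in zip(activations, channel_scales)]

exact = dot(weights, activations)
plain_error = abs(dot(quantize_row(weights), activations) - exact)
awq_error = abs(dot(quantize_row(scaled_weights), scaled_activations) - exact)

print("exact output:", exact)
print("plain 4-bit error:", plain_error)
print("scaled 4-bit error:", awq_error)
```

Before quantization the scaling is a no-op, since the factor cancels between weight and activation. After quantization, the salient weight sits on a relatively finer grid, so the error that gets multiplied by the large activation is smaller in this example.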
Both GPTQ and AWQ are often reported to produce higher-quality outputs than GGUF at the same bit width, though results vary by model and benchmark, and they’re less portable. GGUF runs anywhere llama.cpp runs, including CPU-only machines. GPTQ and AWQ typically require a GPU and specific inference frameworks like vLLM or TGI.
Reading the Quality Numbers Honestly
Benchmarks comparing quantization levels usually show something like this for a 7B model:
| Format | Size | Relative MMLU |
|---|---|---|
| BF16 | 14 GB | 100% |
| Q8_0 | 7.7 GB | ~99.7% |
| Q5_K_M | 5.1 GB | ~99.3% |
| Q4_K_M | 4.1 GB | ~98.8% |
| Q3_K_M | 3.3 GB | ~97.5% |
| Q2_K | 2.7 GB | ~94% |
Those percentages look reassuring until you remember that MMLU measures factual recall and reasoning on multiple-choice questions. The degradation shows up more in long-form generation, code completion, and tasks that require the model to maintain coherence over hundreds of tokens. Q4_K_M is genuinely good for most use cases. Q2_K is noticeably worse in practice than a 6% MMLU drop suggests.
The useful heuristic: Q5_K_M is the conservative choice if you have the memory; Q4_K_M is the practical default; anything below Q4 should be a last resort driven by hardware constraints, not a normal operating mode.
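If you want that heuristic as code, a sketch might look like the following. The sizes are copied from the 7B table above, while the function name and the 1 GB headroom for the KV cache and runtime buffers are my own assumptions; real headroom depends on context length and backend.

```python
# Sizes in GB copied from the 7B table above.
SIZES_7B_GB = {"Q5_K_M": 5.1, "Q4_K_M": 4.1, "Q3_K_M": 3.3, "Q2_K": 2.7}

# Preference order from the heuristic: conservative first, last resorts at the end.
PREFERENCE = ["Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"]


def pick_format(available_gb, headroom_gb=1.0):
    """Return the most preferred format that fits, or None if nothing does."""
    budget = available_gb - headroom_gb  # assumed allowance for KV cache etc.
    for name in PREFERENCE:
        if SIZES_7B_GB[name] <= budget:
            return name
    return None


for memory in (8.0, 5.5, 4.5, 3.0):
    print(memory, "GB ->", pick_format(memory))
```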
The Bitsandbytes Path
For developers using the Hugging Face transformers ecosystem rather than llama.cpp, bitsandbytes offers a third approach: loading models in 4-bit or 8-bit on-the-fly without pre-quantizing the model files. Setting load_in_4bit=True in a BitsAndBytesConfig passed to from_pretrained() handles quantization at load time. Adding bnb_4bit_quant_type="nf4" selects NF4 (Normal Float 4), a non-linear quantization format designed specifically for normally-distributed weights; without it, the library’s default 4-bit type is FP4.
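A minimal sketch of that load path is below. It assumes `torch`, `transformers`, and `bitsandbytes` are installed and that a supported GPU is available; the wrapper name `load_nf4_model` is my own, and the imports live inside the function so the snippet can be defined without those packages present.

```python
def load_nf4_model(model_id):
    """Load a causal LM with its linear layers quantized to NF4 at load time."""
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # select NF4 explicitly
        bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matmuls
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )
```

The weights on disk stay in their original precision; quantization happens as each layer is loaded, so the first load is slower than reading a pre-quantized file.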
NF4 is interesting because it exploits the fact that neural network weights follow a roughly normal distribution. Instead of uniformly spacing the 16 representable values across the numeric range, NF4 spaces them according to quantiles of a normal distribution, putting more representational capacity where most weights actually are. This gives better quality than uniform 4-bit quantization with no additional storage overhead.
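The quantile idea can be sketched with the standard library alone. This is a simplified illustration, not the exact NF4 table that bitsandbytes uses: the real format is constructed so that zero is exactly representable, which these 16 symmetric levels do not guarantee. The function names are my own.

```python
from statistics import NormalDist


def normal_quantile_levels(n_levels=16):
    """Place levels at evenly spaced quantiles of N(0, 1), normalized to [-1, 1]."""
    normal = NormalDist()
    probabilities = [(i + 0.5) / n_levels for i in range(n_levels)]
    raw = [normal.inv_cdf(p) for p in probabilities]
    largest = max(abs(v) for v in raw)
    return [v / largest for v in raw]


def quantize_to_levels(values, levels):
    """Scale by the absolute maximum, then snap each value to the nearest level."""
    absmax = max(abs(v) for v in values) or 1.0
    indices = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - v / absmax))
        for v in values
    ]
    restored = [levels[i] * absmax for i in indices]
    return indices, restored


levels = normal_quantile_levels()
indices, restored = quantize_to_levels([0.01, -0.03, 0.2, -0.5, 0.08], levels)
print([round(v, 3) for v in levels])
print(indices, restored)
```

The printed levels are packed tightly around zero and spread out toward ±1, which is where a roughly normal weight distribution needs the least resolution.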
Where This Is Going
The current frontier is per-layer and mixed-bit quantization at a finer granularity than K-quants provide. Work like QuIP# and AQLM uses lattice-based and codebook approaches to push below 4-bit while maintaining usable quality. Some of these techniques report quality approaching 4-bit formats at around 2 bits per weight by using lookup tables instead of direct integer storage.
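To give a feel for the lookup-table idea, here is a toy vector-quantization sketch: a 16-entry codebook of 2-D vectors, so one 4-bit index encodes two weights, or 2 bits per weight before counting the codebook itself. The codebook here is a plain grid that I made up, which is no better than scalar 2-bit rounding; methods like QuIP# and AQLM get their quality from carefully designed lattices or learned codebooks, which this does not attempt.

```python
# Hypothetical 16-entry codebook of 2-D vectors (a simple 4 x 4 grid).
LEVELS = [-0.6, -0.2, 0.2, 0.6]
CODEBOOK = [(a, b) for a in LEVELS for b in LEVELS]


def encode(weights):
    """Replace each pair of weights with the index of its nearest codebook entry."""
    pairs = zip(weights[0::2], weights[1::2])
    return [
        min(
            range(len(CODEBOOK)),
            key=lambda i: (CODEBOOK[i][0] - a) ** 2 + (CODEBOOK[i][1] - b) ** 2,
        )
        for a, b in pairs
    ]


def decode(indices):
    """Dequantize by table lookup: each index expands back into two weights."""
    restored = []
    for index in indices:
        restored.extend(CODEBOOK[index])
    return restored


weights = [0.55, -0.18, 0.1, 0.7]
indices = encode(weights)
restored = decode(indices)
print(indices, restored)
```

Decoding is a pure table lookup, which is why these formats depend on fast gather operations rather than integer arithmetic.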
Hardware is catching up too. Recent ARM processors include INT8 matrix multiplication instructions (the i8mm extension) that low-bit inference kernels build on, and NVIDIA’s Blackwell architecture has native FP4 support. When hardware can compute directly in quantized formats without dequantizing first, the performance gap between quantized and full-precision inference narrows significantly.
For now, if you’re running models locally, Q4_K_M is the format that won by being good enough at everything that matters. Understanding why it works the way it does makes you better at knowing when to spend the extra memory on Q5 or Q6, and when accepting Q3 is a reasonable tradeoff. The math is not that complicated once you see it laid out, and the engineering decisions in GGUF’s K-quants are genuinely clever solutions to a real precision problem.