Most explanations of quantization start from the wrong end. They lead with “here’s how to save disk space” when the real story is about memory bandwidth, and that shift in framing changes everything about which tradeoffs make sense.
The Bottleneck Is Not What You Think
When an LLM runs an inference pass, the model spends most of its time doing matrix multiplications: multiplying the current activation vectors by the model’s weight matrices to produce the next layer’s activations. A typical 7B-parameter model has around 32 transformer layers, each with several weight matrices, which together total billions of parameters.
For most consumer hardware, those matrix multiplications are memory bandwidth bound, not compute bound. Modern CPUs and GPUs can perform floating-point arithmetic much faster than they can load data from memory. The bottleneck is the time spent streaming weight matrices into the processor’s compute units, not the time spent multiplying.
This has a non-obvious consequence: if you reduce the size of your weights, inference gets faster even if the hardware has no native support for the reduced-precision format. You can store weights in INT4, load them roughly four times as fast as float16, dequantize them to float16 on the fly, and still come out ahead on time. The actual matrix multiply still happens at the higher precision; you are just spending less time on the memory load.
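A back-of-the-envelope calculation makes the bandwidth argument concrete. The sketch below assumes that generating one token means streaming every weight through the processor once, and that nothing else costs time. The 50 GB/s bandwidth figure and the function name are illustrative assumptions, not measurements, so treat the results as upper bounds rather than predictions.

```python
def max_tokens_per_second(n_params: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if loading weights is the only cost."""
    model_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes


# Hypothetical machine: 50 GB/s of memory bandwidth, 7B-parameter model.
for label, bits in [("float16", 16), ("INT8", 8), ("INT4", 4)]:
    rate = max_tokens_per_second(7e9, bits, bandwidth_gb_s=50)
    print(f"{label:>8}: at most {rate:.1f} tokens/s")
```

Under these assumptions the bound scales inversely with bits per weight, which is the whole argument in one line of arithmetic.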
That is why quantization is primarily an inference story, not just a storage story. Simon Willison’s walkthrough of the fundamentals covers the basic mechanics clearly, but understanding the bandwidth angle reframes why the engineering looks the way it does.
The Math: Affine Quantization
The simplest form of quantization maps a range of floating-point values to a fixed set of integers. For INT8, you have 256 possible values (0 to 255 for unsigned, or -128 to 127 for signed). The mapping is:
quantized = round(float / scale + zero_point)
float ≈ (quantized - zero_point) * scale
Here scale = (max - min) / 255, and zero_point is the integer that float zero maps to, so that zero is represented exactly. This is affine quantization, also called asymmetric quantization. The symmetric variant fixes zero_point to zero, which simplifies the math at the cost of slightly reduced range utilization.
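The two formulas can be sketched in a few lines of plain Python. This is a minimal illustration for unsigned INT8, not any particular library's implementation. The helper names are made up for the example, and it assumes the input range is widened to include zero so that zero_point is a valid integer.

```python
def affine_params(values, n_bits=8):
    """Derive scale and zero_point so [min, max] maps onto [0, 2**n_bits - 1]."""
    qmax = 2 ** n_bits - 1
    lo = min(min(values), 0.0)  # widen the range so it always contains 0
    hi = max(max(values), 0.0)
    if hi == lo:
        raise ValueError("all values are zero; nothing to quantize")
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, n_bits=8):
    """quantized = round(float / scale + zero_point), clamped to the integer range."""
    qmax = 2 ** n_bits - 1
    return [min(qmax, max(0, round(v / scale + zero_point))) for v in values]


def dequantize(quantized, scale, zero_point):
    """float ≈ (quantized - zero_point) * scale"""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.25, 0.0, 0.5, 1.55]
scale, zero_point = affine_params(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
print(q)
print(restored)
```

The round trip is lossy in general: each restored value can differ from the original by up to half a quantization step, which is scale / 2.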
For INT8, this works extremely well. Quantization error is small enough that most models see less than a 0.2% perplexity increase, which is undetectable in practice. The LLM.int8() paper from Tim Dettmers and colleagues in 2022 pushed INT8 into mainstream use by solving the outlier problem: in large models, a few feature dimensions of the activations carry values with much larger magnitude than the rest, so the scale factor is dominated by outliers and most of the INT8 range is wasted on values that rarely appear. The paper’s solution was to identify these outlier dimensions, compute them in float16, and quantize everything else to INT8, achieving near-zero quality loss at roughly half the memory footprint.
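The outlier idea can be illustrated with a toy dot product. The sketch below is a simplified stand-in for mixed-precision decomposition, not the LLM.int8() implementation. It uses one absmax scale per vector, an illustrative threshold of 6.0, synthetic Gaussian data, and function names invented for the example.

```python
import random


def absmax_quantize(vec):
    """Symmetric INT8: scale so the largest magnitude maps to 127."""
    largest = max((abs(v) for v in vec), default=0.0)
    scale = largest / 127 if largest > 0 else 1.0
    return [round(v / scale) for v in vec], scale


def int8_dot(x, w):
    """Dot product with both operands quantized to INT8, then rescaled."""
    qx, sx = absmax_quantize(x)
    qw, sw = absmax_quantize(w)
    return sum(a * b for a, b in zip(qx, qw)) * sx * sw


def mixed_dot(x, w, threshold=6.0):
    """Keep dimensions where |x| >= threshold in float; quantize the rest."""
    outliers = [i for i, v in enumerate(x) if abs(v) >= threshold]
    regular = [i for i in range(len(x)) if i not in outliers]
    exact_part = sum(x[i] * w[i] for i in outliers)
    return exact_part + int8_dot([x[i] for i in regular], [w[i] for i in regular])


rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(64)]
x[10] = 60.0  # one synthetic outlier dimension
rows = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(16)]

exact = [sum(a * b for a, b in zip(x, w)) for w in rows]
err_naive = sum(abs(int8_dot(x, w) - e) for w, e in zip(rows, exact))
err_mixed = sum(abs(mixed_dot(x, w) - e) for w, e in zip(rows, exact))
print(f"total error, everything in INT8: {err_naive:.3f}")
print(f"total error, outlier kept in float: {err_mixed:.3f}")
```

With the outlier included, the scale is set by the value 60.0 and the remaining inputs collapse onto a handful of integer levels. Pulling that one dimension out lets the other 63 use the full INT8 range.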
Why Naive INT4 Is Worse Than It Should Be
Drop to INT4 and the problem gets harder. You have only 16 possible values for each weight. Apply a single scale factor to a weight matrix and you are asking those 16 integers to represent a distribution that might span several orders of magnitude.
Simple round-to-nearest INT4 (the legacy Q4_0 format in GGUF is the closest widely used example) typically degrades perplexity by 4-6% on a 7B model. A 4-6% perplexity increase is noticeable in practice: models start repeating themselves more, hallucinate more frequently, and lose coherence over longer outputs.
Block quantization is the fix. Rather than computing one scale factor per weight matrix, which might be 4096 x 4096 = sixteen million values, you compute a separate scale factor for every small block of weights, typically 32 or 256 values. Each block gets its own scale, stored at higher precision (float16 or float32). The per-block scale factors add a small overhead, roughly 0.5 bits per weight for a block size of 32, but they dramatically reduce quantization error by ensuring the 16 INT4 levels are always well-distributed within each local range.
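A small experiment shows why local scales help. The sketch below is a generic symmetric round-to-nearest scheme written for illustration, not the exact layout of any GGUF format. It quantizes the same synthetic weights once with a single scale and once with one scale per block of 32. The data, the injected outlier, and the helper names are all assumptions.

```python
import random


def quantize_int4_symmetric(block):
    """Round to integers in -8..7 using one absmax scale; return dequantized values."""
    scale = max(abs(v) for v in block) / 7
    if scale == 0:
        return list(block)
    return [max(-8, min(7, round(v / scale))) * scale for v in block]


def blockwise(weights, block_size):
    """Quantize each consecutive block of weights with its own scale."""
    out = []
    for start in range(0, len(weights), block_size):
        out.extend(quantize_int4_symmetric(weights[start:start + block_size]))
    return out


def mean_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


rng = random.Random(0)
weights = [rng.gauss(0, 0.02) for _ in range(4096)]
weights[123] = 1.5  # a single large-magnitude outlier

err_tensor = mean_abs_error(weights, blockwise(weights, len(weights)))
err_block = mean_abs_error(weights, blockwise(weights, 32))
print(f"one scale for all 4096 weights: {err_tensor:.5f}")
print(f"one scale per block of 32:      {err_block:.5f}")

# Storage overhead of one float16 scale per block of 32 weights.
overhead_bits = 16 / 32
print(f"scale overhead: {overhead_bits} bits per weight")
```

With a single scale, the outlier stretches the grid so far that ordinary weights round to zero. With blocks, the damage is confined to the one block that contains the outlier.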
This idea underlies llama.cpp’s K-quants, introduced via a community contribution in mid-2023. Q4_K_M uses super-blocks of 256 weights, subdivided into smaller blocks whose scale factors are themselves quantized, with float16 scale factors at the super-block level. That brings perplexity degradation down from roughly 5% to about 2.7% while still averaging about 4.5 bits per weight. At that degradation level, the difference from the float16 original is hard to notice in casual use.
The “M” suffix means medium, selecting a configuration that keeps some sensitive tensors at higher precision. Q4_K_S (small) quantizes more aggressively; where an “L” (large) variant exists, as with Q3_K_L, it is more conservative.
NF4: Designing the Data Type Around the Data
The QLoRA paper (Dettmers et al., 2023) introduced a different approach: instead of fitting a generic integer format to the weight distribution, design a number format whose quantization levels match the actual distribution of weights in a trained model.
Pre-trained model weights follow an approximately normal distribution. Most weights cluster near zero, with rapidly fewer as you move toward larger magnitudes. A uniform INT4 grid spaces its 16 levels evenly, spending too many of them on the sparsely populated tails and too few near zero.
NormalFloat4 (NF4) places its 16 quantization levels at positions derived from the normal distribution’s quantiles, so each level represents approximately the same probability mass and therefore roughly the same number of actual weight values. The QLoRA paper describes this as the information-theoretically optimal encoding for normally distributed data.
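The quantile idea is easy to sketch with the standard library. The code below builds 16 levels by taking the midpoint quantile of each equal-probability bucket of a standard normal and rescaling to [-1, 1]. This is a simplified construction for illustration and does not reproduce the exact NF4 table, which, as far as I understand the paper, is built asymmetrically so that zero is represented exactly. The function names are hypothetical.

```python
from statistics import NormalDist


def equal_mass_levels(n_levels=16):
    """One level at the midpoint quantile of each equal-probability bucket of N(0, 1)."""
    normal = NormalDist()
    raw = [normal.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    largest = max(abs(v) for v in raw)
    return [v / largest for v in raw]  # normalize into [-1, 1]


def nearest_level(value, levels):
    """Index of the level closest to value; this index is the stored 4-bit code."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - value))


levels = equal_mass_levels()
print([round(v, 3) for v in levels])

# Levels are dense near zero and sparse in the tails.
print("spacing near zero:", round(levels[8] - levels[7], 3))
print("spacing at the tail:", round(levels[15] - levels[14], 3))
```

In use, a block of weights would first be divided by its absolute maximum so that it falls in [-1, 1], and each normalized weight would then be stored as the index returned by nearest_level.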
The practical result: NF4 achieves measurably better quality than INT4 at the same bit width, specifically for fine-tuning use cases where you quantize weights and then train adapters on top. bitsandbytes implements NF4 natively, and it is the default format for QLoRA fine-tuning runs.
The GGUF Naming Scheme, Decoded
GGUF is llama.cpp’s file format, which replaced the original GGML format in August 2023. When you download a model from Hugging Face and see filenames like model-Q4_K_M.gguf, the name encodes a set of decisions:
- Q4: weights stored at approximately 4 bits per parameter
- K: uses k-quant block quantization with per-block scale factors
- M: medium variant, some layers (typically attention) kept at higher precision because they are more sensitive to quantization
Common formats and approximate sizes for a 7B model:
| Format | Avg bits/weight | 7B size | Notes |
|---|---|---|---|
| Q8_0 | 8 | ~7.7 GB | Safe baseline, nearly lossless |
| Q6_K | 6 | ~5.5 GB | Very small quality loss |
| Q5_K_M | 5 | ~4.5 GB | Excellent quality-per-byte |
| Q4_K_M | 4.5 | ~4.1 GB | Most popular general-purpose choice |
| Q3_K_M | 3.5 | ~3.3 GB | Noticeable degradation, low-RAM use |
| Q2_K | 2.6 | ~2.7 GB | Significant degradation |
| IQ4_XS | ~4 | ~4.0 GB | Importance-matrix quants, calibration-based |
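The sizes in the table follow roughly from parameter count times average bits per weight. The sketch below shows that arithmetic under the assumption of exactly 7 billion parameters and the rounded bit widths from the table. The function name is made up for the example, and the formula ignores file metadata.

```python
def estimated_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate file size in GB (10**9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


for name, bits in [("Q8_0", 8), ("Q6_K", 6), ("Q4_K_M", 4.5), ("Q2_K", 2.6)]:
    print(f"{name:>7}: ~{estimated_size_gb(7e9, bits):.1f} GB")
```

The estimates come out somewhat below the table's figures. That is plausible because the bit widths are rounded averages and nominal "7B" models do not contain exactly 7 billion parameters, so use the formula for ballpark planning only.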
The IQ variants (often called i-quants) are newer, arriving in early 2024. They are typically produced with an importance matrix: a calibration dataset is used to score which weights matter most for output quality, and the quantizer then spends its limited precision where it counts most. The result is often slightly better perplexity than K-quants at the same file size, at the cost of requiring calibration data to generate.
The Bigger-Model Strategy
The practical implication that often surprises people: a quantized large model usually outperforms an unquantized small model of similar size. A Q4_K_M 70B model stores roughly 70 billion parameters at about 4.5 bits each, coming to around 40 GB. That is about what a float16 model with 20 billion parameters would occupy, and the quantized 70B generally wins the quality comparison by a significant margin on most benchmarks.
This is why practical advice for local inference has shifted away from “pick the highest-precision format you can afford” toward “pick the largest model you can fit, then quantize aggressively.” Parameter count matters more than precision once you are above roughly Q4_K_M quality.
Perplexity on WikiText-2 (Llama-2-7B, lower is better) makes the block quantization contribution concrete:
- FP16 baseline: 5.47
- Q8_0: 5.48
- Q4_K_M: 5.62
- Q4_0 (legacy round-to-nearest format): 5.72
- Q3_K_M: 5.97
- Q2_K: 7.26
The improvement from Q4_0 to Q4_K_M is 0.10 perplexity points, which recovers 40% of the 0.25-point gap between Q4_0 and the FP16 baseline at a similar bit width. The K-quant block scheme is doing real work.
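Converting the listed perplexities into relative increases over the FP16 baseline ties them back to the percentages quoted earlier. The snippet uses only the figures from the list above; the variable and function names are illustrative.

```python
perplexity = {
    "FP16": 5.47,
    "Q8_0": 5.48,
    "Q4_K_M": 5.62,
    "Q4_0": 5.72,
    "Q3_K_M": 5.97,
    "Q2_K": 7.26,
}


def relative_increase(ppl: float, baseline: float) -> float:
    """Percentage increase in perplexity over the baseline."""
    return (ppl - baseline) / baseline * 100


baseline = perplexity["FP16"]
increase = {name: relative_increase(ppl, baseline) for name, ppl in perplexity.items()}
for name, pct in increase.items():
    print(f"{name:>7}: +{pct:.1f}%")
```

On these numbers Q8_0 stays under 0.2%, Q4_K_M sits near 2.7%, and Q4_0 lands inside the 4-6% band.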
What the Tooling Looks Like
For local CPU and GPU inference, llama.cpp and its wrappers (Ollama, LM Studio) handle everything. You pick a GGUF file, and the format encodes what dequantization code to run at inference time. Converting float16 safetensors to GGUF takes a few minutes on a CPU: llama.cpp’s convert_hf_to_gguf.py script writes a GGUF file, and the included llama-quantize tool then produces any of the quantized formats.
For PyTorch-based inference and fine-tuning, bitsandbytes provides INT8 and NF4 quantization that integrates directly with Hugging Face Transformers. You pass a BitsAndBytesConfig with load_in_4bit=True to from_pretrained() (older code passes load_in_4bit=True directly) and the library handles dequantization; in INT8 mode it also handles the mixed-precision outlier path.
GPTQ and AWQ are alternatives that require a calibration pass but achieve somewhat better quality than round-to-nearest at the same bit width, particularly at 3-bit and below. GPTQ uses second-order (Hessian) information to minimize quantization error layer by layer; AWQ identifies salient weights via activation magnitudes and scales up important channels before quantizing, with no overhead at inference time.
The ecosystem has matured considerably since 2023. Model authors and the community now upload GGUF and GPTQ variants to Hugging Face within hours of a new release. The format choices have largely converged: Q4_K_M for CPU inference, AWQ or GPTQ 4-bit for GPU inference, NF4 via bitsandbytes for fine-tuning.
Below 4 bits, quality degradation becomes large enough that you should measure it on your specific use case. Above 6 bits, you are paying in size for gains that most tasks will not surface. The 4-5 bit range, with block quantization doing its job, is where most of the interesting engineering lives, and where the gap between the naive arithmetic and the practical results is widest.