Simon Willison recently published a thorough walkthrough of quantization from the ground up, and it is one of the cleaner explanations of the topic I have seen. It covers the core math, walks through code, and explains the GGUF naming scheme. What it leaves as an exercise is the question of why the field converged on the techniques it uses now, and what the tradeoffs actually look like when you are deciding which quantization level to use in practice. That is what I want to work through here.
The Memory Problem Is Not What You Think
The naive explanation for quantization is that it makes models smaller. That is true, but it undersells the mechanism by which smaller files translate to faster inference.
LLM inference on a single GPU is almost always memory-bandwidth bound, not compute bound. The forward pass through a transformer spends the bulk of its time loading weights from VRAM into registers and doing matrix-vector multiplications against the current token’s activations. The GPU’s compute units spend a substantial fraction of their cycles waiting for data to arrive, not performing arithmetic. For a 7B model at FP16, that is 14 GB of weights you need to sweep through for every generated token.
Halving the bit-width of stored weights halves the number of bytes that need to cross the memory bus per forward pass. On an Apple M2 with its unified memory architecture, the difference between a Q4_K_M and an FP16 model is roughly 2x throughput, not because the arithmetic is faster, but because fewer bytes move. This is why quantized models can be faster than their full-precision equivalents even on hardware where the integer arithmetic itself carries some overhead.
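A back-of-the-envelope way to see the bandwidth argument: if every weight has to be read once per generated token, memory bandwidth divided by model size gives a ceiling on tokens per second. The sketch below is illustrative only; the function name and the 100 GB/s bandwidth figure are assumptions chosen for the example, not measurements.

```python
def max_tokens_per_second(n_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if every weight is read once per token."""
    model_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes

# Illustrative numbers only: 7e9 parameters and 100 GB/s of memory bandwidth.
BANDWIDTH_GB_S = 100
fp16_tps = max_tokens_per_second(7e9, 16.0, BANDWIDTH_GB_S)
q4_tps = max_tokens_per_second(7e9, 4.8, BANDWIDTH_GB_S)

print(f"FP16 ceiling:    {fp16_tps:.1f} tok/s")
print(f"4.8-bit ceiling: {q4_tps:.1f} tok/s")
```

The ratio between the two ceilings is simply the ratio of bits per weight. Measured speedups tend to be smaller, since dequantization and everything that is not a weight read also take time.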
What You Are Losing (And Where)
Quantization maps a continuous floating-point value to one of a discrete set of representable integers, then stores a scale factor to convert back. The symmetric form is the simplest version:
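```python
import numpy as np

# weights: a float NumPy array holding one tensor
scale = np.max(np.abs(weights)) / 127  # for INT8
quantized = np.round(weights / scale).clip(-127, 127).astype(np.int8)
recovered = quantized * scale
```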
The reconstruction error per weight is bounded by scale / 2. The scale is determined by the maximum absolute value in the tensor. This is the source of most quantization degradation: a single outlier weight with a large magnitude forces a coarse scale that degrades precision for all other weights in the same tensor.
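To make the outlier effect concrete, here is a minimal sketch that quantizes the same synthetic tensor twice, once as-is and once with a single large value planted in it. The distribution, the outlier magnitude, and the helper name are assumptions for illustration.

```python
import numpy as np

def quantize_symmetric(weights, bits=8):
    """Symmetric per-tensor quantization; returns the reconstruction and the scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(weights)) / qmax
    q = np.round(weights / scale).clip(-qmax, qmax)
    return q * scale, scale

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=4096)   # synthetic "typical" weights
with_outlier = weights.copy()
with_outlier[0] = 2.0                        # one weight ~100x the typical magnitude

rec_clean, scale_clean = quantize_symmetric(weights)
rec_outlier, scale_outlier = quantize_symmetric(with_outlier)

# Mean absolute error on the ordinary weights only (index 0 is excluded).
err_clean = np.mean(np.abs(rec_clean[1:] - weights[1:]))
err_outlier = np.mean(np.abs(rec_outlier[1:] - with_outlier[1:]))
print(f"scale: {scale_clean:.6f} -> {scale_outlier:.6f}")
print(f"mean abs error on ordinary weights: {err_clean:.6f} -> {err_outlier:.6f}")
```

The outlier itself is reconstructed almost exactly; it is every other weight that pays for the coarser step size.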
This would be a manageable problem except that large transformer models develop what Tim Dettmers and colleagues identified in the LLM.int8() paper (2022) as emergent outlier features. Models above roughly 6.7 billion parameters have a small fraction of dimensions, fewer than 0.1% of channels, where activation magnitudes are 10 to 100 times larger than the rest. These are not noise. They are load-bearing features that the model uses to represent important distinctions, and suppressing them by folding them into a coarse quantization grid causes visible degradation.
The practical consequence: naive INT4 quantization of a 7B model degrades perplexity by 15 to 20 percent. That is the gap you are trying to close with everything that came after.
Group Quantization: The Core Insight
The fix that enabled practical 4-bit inference is simple in retrospect: instead of computing one scale factor per tensor or layer, compute one per small block of weights, typically 32 to 128 values. An outlier in one block ruins that block’s precision, but nothing else.
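```python
import numpy as np

def quantize_grouped(weights, group_size=32, bits=4):
    """Symmetric per-group quantization of a 1-D weight array."""
    # Trailing weights that do not fill a whole group are dropped in this sketch.
    n_groups = len(weights) // group_size
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scales, quantized = [], []
    for i in range(n_groups):
        group = weights[i * group_size:(i + 1) * group_size]
        # One scale per group; fall back to 1.0 so an all-zero group does not divide by zero.
        scale = np.max(np.abs(group)) / qmax or 1.0
        q = np.round(group / scale).clip(-qmax, qmax).astype(np.int8)
        scales.append(scale)
        quantized.append(q)
    return np.array(quantized), np.array(scales)
```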
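The storage overhead is real but small, and it depends on how the scale is stored. One FP32 scale per 32 weights adds 1 bit per weight, for an effective 5.0 bits at 4-bit precision. An FP16 scale halves that to 4.5 bits, which is the layout of llama.cpp’s Q4_0, and an FP16 scale per 128 weights brings it down to 4.125 bits. The quality improvement is substantial.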
How K-Quants Pushed Further
The GGUF formats used by llama.cpp implement a refinement called K-quants, introduced by Iwan Kawrakow in a pull request in mid-2023. The naming (Q4_K_M, Q5_K_S, and so on) looks cryptic but follows a pattern: the number is the target bits per weight, K means it uses the K-quant method, and the suffix (S/M/L) is a quality tier.
The K-quant architecture has two levels. In Q4_K, a superblock covers 256 weights divided into 8 sub-blocks of 32 weights each. Sub-block scales are stored as 6-bit integers rather than full FP32 or FP16 values, and a single FP16 super-scale normalizes them. This hierarchical structure reduces scale storage overhead compared to per-32-weight FP32 scales.
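The two-level idea can be sketched in a few lines. This is a simplified, symmetric version written for illustration; it is not the llama.cpp implementation or its on-disk layout, and the function names are hypothetical. It also assumes the superblock is not all zeros.

```python
import numpy as np

def quantize_superblock(w, sub_size=32, bits=4, scale_bits=6):
    """Two-level symmetric quantization of one superblock (simplified sketch)."""
    blocks = w.reshape(-1, sub_size)
    qmax = 2 ** (bits - 1) - 1
    sub_scales = np.abs(blocks).max(axis=1) / qmax        # ideal float scale per sub-block
    smax = 2 ** scale_bits - 1
    super_scale = np.float16(sub_scales.max() / smax)     # one FP16 value per superblock
    # Second level: each sub-block scale becomes a 6-bit integer multiple of the super-scale.
    q_scales = np.round(sub_scales / np.float32(super_scale)).clip(1, smax).astype(np.uint8)
    eff_scales = q_scales * np.float32(super_scale)       # scales as the decoder will see them
    q = np.round(blocks / eff_scales[:, None]).clip(-qmax, qmax).astype(np.int8)
    return q, q_scales, super_scale

def dequantize_superblock(q, q_scales, super_scale):
    """Rebuild float weights from 4-bit values, 6-bit scales, and the super-scale."""
    return (q * (q_scales * np.float32(super_scale))[:, None]).ravel()

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=256)                       # one synthetic superblock
q, q_scales, super_scale = quantize_superblock(w)
w_hat = dequantize_superblock(q, q_scales, super_scale)

# 4-bit payload + eight 6-bit scales + one FP16 super-scale, spread over 256 weights.
bits_per_weight = (256 * 4 + 8 * 6 + 16) / 256
print(bits_per_weight, np.max(np.abs(w_hat - w)))
```

In this sketch the scale metadata costs 0.25 bits per weight, against 1 bit for a separate FP32 scale on every 32-weight block.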
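More importantly, K-quants change how each block’s grid is chosen. The grid inside a sub-block is still uniform, but Q4_K and Q5_K store a per-sub-block minimum alongside the scale, so the grid is affine rather than symmetric around zero, and the quantizer searches for the scale and minimum that minimize reconstruction error instead of simply taking the maximum absolute value. Transformer weight distributions cluster near zero with heavier-than-Gaussian tails, so a grid pinned to the largest value wastes precision on extremes that rarely appear; fitting the grid to the block puts the steps where the weights actually are. Truly non-uniform grids, where the representable values form a lookup table concentrated near zero, appear in NF4 and in the later IQ4_NL and IQ4_XS formats. The effect is meaningful: Q4_K_M typically shows around 0.10 to 0.15 perplexity increase over FP16 on WikiText-2 benchmarks, compared to roughly 0.22 for the older Q4_0 format.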
The block size of 32 is not arbitrary. It lines up with SIMD register widths on both major CPU architectures: 32 one-byte values fill one 256-bit AVX2 register on x86 and two 128-bit NEON registers on ARM, which keeps the dequantization step during inference fast on both.
The Quality Tiers in Practice
For a 7B model, the tradeoffs look roughly like this:
| Format | Effective bits | Size (7B) | PPL delta vs FP16 |
|---|---|---|---|
| Q8_0 | 8.5 | ~7.7 GB | ~0.01 |
| Q6_K | 6.6 | ~5.5 GB | ~0.03 |
| Q5_K_M | 5.7 | ~4.8 GB | ~0.05 |
| Q4_K_M | 4.8 | ~4.1 GB | ~0.12 |
| Q3_K_M | 3.9 | ~3.3 GB | ~0.45 |
| Q2_K | 3.4 | ~2.9 GB | ~1.37 |
The cliff between Q3 and Q4 is real. The perplexity scale is not linear in any perceptually meaningful sense, and a 0.45 increase in WikiText-2 perplexity corresponds to noticeably degraded reasoning on complex tasks. Q4_K_M became the de facto default for consumer inference because it sits at the knee of this curve: below it, you start paying a visible quality tax; above it, you are paying in memory without a proportional quality return.
The IQ-quant formats (IQ4_XS and friends), added to llama.cpp later, use an importance matrix from a calibration pass to allocate bits where they matter most. IQ4_XS matches or exceeds Q4_K_M quality at roughly 8% smaller file sizes, at the cost of requiring a calibration step to generate.
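One way to picture what an importance matrix buys: instead of minimizing plain rounding error, the quantizer minimizes error weighted by how much each weight matters, which a calibration pass can estimate from activation statistics. The sketch below is a generic illustration of that idea, not the llama.cpp algorithm; the importance values, the candidate range, and the function names are all assumptions.

```python
import numpy as np

def quantize_block(w, scale, qmax=7):
    """Round onto a symmetric 4-bit grid and map back to floats."""
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def weighted_error(w, importance, scale):
    """Squared reconstruction error, weighted per weight by its importance."""
    return np.sum(importance * (w - quantize_block(w, scale)) ** 2)

def best_scale(w, importance, qmax=7, n_candidates=20):
    """Grid-search a block scale that minimizes importance-weighted error."""
    base = np.abs(w).max() / qmax
    # Candidates shrink the scale by up to 30%; the last one is the plain max-abs scale.
    candidates = base * np.linspace(0.7, 1.0, n_candidates)
    errors = [weighted_error(w, importance, s) for s in candidates]
    return candidates[int(np.argmin(errors))]

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=32)
importance = rng.uniform(0.1, 1.0, size=32)   # stand-in for calibration statistics
importance[:2] = 50.0                          # two weights that matter far more

base_scale = np.abs(w).max() / 7
chosen = best_scale(w, importance)
print(weighted_error(w, importance, base_scale), weighted_error(w, importance, chosen))
```

Because the plain max-abs scale is one of the candidates, the search can never do worse than it on the weighted objective.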
GPTQ and AWQ: The GPU Path
For GPU inference, the dominant post-training quantization methods are GPTQ and AWQ.
GPTQ (Frantar et al., 2022) quantizes each weight matrix one column at a time. It builds the Hessian of the layer’s output error from calibration activations once, and after quantizing each column it uses the inverse Hessian to adjust the remaining unquantized weights so that they compensate for the rounding error already committed. This is rooted in the Optimal Brain Surgeon literature from the 1990s, applied at the scale of billion-parameter models, with a Cholesky decomposition keeping the inverse-Hessian updates fast and numerically stable. The result is 2 to 5 percent perplexity degradation at INT4, compared to 15 to 20 percent for naive rounding. The cost is minutes to hours of GPU time and 128 calibration samples. AutoGPTQ automates this.
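The error-compensation loop is easier to follow in code. The sketch below is a simplified single-layer version in NumPy: no column blocking, no grouping, one fixed scale per output row, and synthetic calibration data. The function names and sizes are assumptions for illustration; it is not the reference implementation.

```python
import numpy as np

def quantize_rtn(w, scale, qmax):
    """Round to the nearest point on a symmetric integer grid, mapped back to floats."""
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def gptq_like(W, X, bits=4, damp=0.01):
    """Column-by-column quantization with error compensation (simplified sketch)."""
    W = W.astype(np.float64).copy()
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1) / qmax                   # one scale per output row, fixed up front
    H = 2.0 * X @ X.T                                      # Hessian of the layer-wise squared error
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])   # damping for numerical stability
    U = np.linalg.cholesky(np.linalg.inv(H)).T             # upper Cholesky factor of H^-1
    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        Q[:, i] = quantize_rtn(W[:, i], scale, qmax)
        err = (W[:, i] - Q[:, i]) / U[i, i]
        W[:, i:] -= np.outer(err, U[i, i:])                # push the error onto later columns
    return Q

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))                              # (out_features, in_features)
# Synthetic calibration inputs with a shared component, so features are correlated.
X = rng.normal(size=(64, 512)) + 2.0 * rng.normal(size=(1, 512))

row_scale = np.abs(W).max(axis=1, keepdims=True) / 7
Q_rtn = quantize_rtn(W, row_scale, 7)
Q_gptq = gptq_like(W, X)

rtn_err = np.linalg.norm(W @ X - Q_rtn @ X)
gptq_err = np.linalg.norm(W @ X - Q_gptq @ X)
print(f"round-to-nearest output error: {rtn_err:.2f}")
print(f"compensated output error:      {gptq_err:.2f}")
```

Both results live on the same 4-bit grid. The only difference is that the compensated version picks grid points that cancel each other's error on the calibration inputs rather than rounding each weight independently.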
AWQ (Lin et al., MIT Han Lab, 2023) takes a different angle. Rather than error-compensating during quantization, it identifies the roughly 1% of weight channels that correspond to large activation magnitudes, then scales those channels up before quantization so they occupy more of the integer range. The corresponding activations are scaled down by the same factor so the output is unchanged. At inference time, the scaling is absorbed into adjacent normalization layers, so there is zero overhead. AutoAWQ implements this for HuggingFace models.
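The scaling trick rests on a simple identity: multiplying an input channel of the weights by s and dividing the matching activation channel by s leaves the product unchanged, but it changes what the quantizer sees. A minimal sketch under assumed synthetic data follows; the scale of 2.0 is hand-picked here, whereas the real method searches for scales using activation statistics.

```python
import numpy as np

def quantize_rows(W, bits=4):
    """Symmetric per-row quantization, returned as dequantized floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))        # (out_features, in_features)
X = rng.normal(size=(128, 256))       # (in_features, n_tokens)
X[:2] *= 50.0                         # two input channels carry outlier activations

s = np.ones(128)
s[:2] = 2.0                           # hypothetical scale for the salient channels

# Before quantization the rescaling is an exact no-op.
assert np.allclose((W * s) @ (X / s[:, None]), W @ X)

plain_err = np.linalg.norm(W @ X - quantize_rows(W) @ X)
scaled_err = np.linalg.norm(W @ X - quantize_rows(W * s) @ (X / s[:, None]))
print(f"output error, plain quantization:   {plain_err:.1f}")
print(f"output error, salient channels x2:  {scaled_err:.1f}")
```

Scaling a salient channel up shrinks its rounding error relative to its value, and since that channel is multiplied by the largest activations, the output error drops even though the remaining channels may get a slightly coarser grid.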
In practice, both methods land at similar quality for most tasks. AWQ shows a slight edge on instruction-following benchmarks, likely because its salient channel identification generalizes better across diverse inputs than GPTQ’s single-calibration-set Hessian.
The QLoRA Connection
One downstream impact of all this infrastructure is QLoRA (Dettmers et al., 2023). The insight is that you can freeze a base model in 4-bit NF4 quantization (a normal-distribution-aware format introduced in the same paper) and fine-tune only small LoRA adapter matrices kept in higher precision (BF16 in the paper). The frozen base model contributes its capabilities through dequantized forward passes; the adapters learn the task-specific adjustments.
This brought fine-tuning of 65B-class models down to a single 48 GB GPU, the headline result of the paper, and put 30B-class models within reach of 24 GB consumer cards. It drove an enormous amount of the open fine-tuning ecosystem in 2023 and 2024. bitsandbytes is the library that provides the quantized layer types HuggingFace Transformers exposes through load_in_4bit=True.
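The structure of a QLoRA layer fits in a few lines of NumPy. The sketch below uses a plain symmetric 4-bit grid as a stand-in for NF4 and skips training entirely; the dimensions, rank, and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4

# Frozen base weight stored as 4-bit integers plus per-row scales.
# A plain symmetric grid is used here as a stand-in; real QLoRA uses NF4.
W = rng.normal(0.0, 0.02, size=(d_out, d_in))
scale = np.abs(W).max(axis=1, keepdims=True) / 7
W_q = np.clip(np.round(W / scale), -7, 7).astype(np.int8)

# Trainable low-rank adapters kept in floating point.
# B starts at zero, so the adapter contributes nothing before training.
A = rng.normal(0.0, 0.01, size=(rank, d_in))
B = np.zeros((d_out, rank))

def forward(x):
    """Dequantize the frozen base on the fly and add the low-rank update."""
    W_deq = W_q * scale              # W_q itself is never updated
    return W_deq @ x + B @ (A @ x)

x = rng.normal(size=d_in)
y = forward(x)
trainable, frozen = A.size + B.size, W_q.size
print(f"trainable parameters: {trainable}, frozen parameters: {frozen}")
```

Only A and B would receive gradients, which is why optimizer state and gradient memory stay small even when the base model is very large.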
Where the Research Frontier Is
The most interesting recent development is BitNet b1.58 from Microsoft Research, which takes quantization to its logical extreme: models trained from scratch with ternary weights {-1, 0, +1}. Rather than post-training quantization, the model learns to operate within ternary constraints via a straight-through estimator for gradients. The payoff is that matrix multiplication reduces to additions and subtractions, with no multiplication by the weights; the paper reports a 71.4x reduction in the arithmetic energy of matrix multiplication compared to FP16.
The BitNet-b1.58-2B-4T model released in April 2025 reaches competitive performance with LLaMA 3.2 and Gemma 3 at 2B parameters, runs on a Raspberry Pi 5, and weighs around 0.8 GB. The constraint is significant: there is no path to convert an existing FP16 model to ternary weights post-training. You commit to this at pretraining time.
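To see why ternary weights remove the multiplications, here is a small sketch of absmean ternarization and a matrix-vector product built from sums alone. In BitNet this rounding happens inside the training forward pass; applying it to a random matrix, as done here, only illustrates the arithmetic. The function names are hypothetical, and activations are kept as floats for simplicity.

```python
import numpy as np

def ternarize(W, eps=1e-8):
    """Absmean rounding: scale by the mean absolute value, then round into {-1, 0, +1}."""
    gamma = np.mean(np.abs(W))
    W_t = np.clip(np.round(W / (gamma + eps)), -1, 1).astype(np.int8)
    return W_t, gamma

def ternary_matvec(W_t, x):
    """Compute W_t @ x using only additions and subtractions of activations."""
    added = np.where(W_t == 1, x, 0.0).sum(axis=1)        # activations under a +1 weight
    subtracted = np.where(W_t == -1, x, 0.0).sum(axis=1)  # activations under a -1 weight
    return added - subtracted

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(8, 16))
x = rng.normal(size=16)

W_t, gamma = ternarize(W)
y = ternary_matvec(W_t, x)
print(np.unique(W_t), np.allclose(y, W_t @ x))
```

The single scale gamma is applied once to the output rather than per weight, so the inner loop never multiplies.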
On the server side, NVIDIA H100 hardware added native FP8 tensor cores with two formats: E4M3, typically used for weights and activations, and E5M2, whose wider range suits gradients during training. This delivers roughly 2x FLOP throughput versus BF16 with near-zero quality loss and is now standard at major inference providers via vLLM and TensorRT-LLM. It matters little for the bandwidth-bound local inference discussed above, but it explains why the cloud inference cost trajectory keeps dropping.
The Practical Decision
For running models locally, the hierarchy has stabilized. Q8_0 or Q6_K for maximum quality when you have the VRAM. Q4_K_M as the reliable default. Q3 only when you have a hard memory constraint and can accept visible quality degradation. Q2 for experiments, not production use.
There is also a frequently overlooked model-size versus quantization tradeoff. A 70B model at Q4_K_M occupies roughly 42 GB (70 billion weights at about 4.8 bits each) and outperforms a 7B model at FP16 on most tasks despite heavy quantization. If you have 32 GB of unified memory, a 34B at Q4_K_M, at roughly 20 GB, will generally serve you better than a 7B at Q8_0. Maximize the model size that fits at the highest feasible quantization tier rather than optimizing either dimension in isolation.
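That rule of thumb can be written down as a small search, using the effective bit widths from the table above. The 4 GB headroom for KV cache and the operating system, the list of model sizes, and the function names are assumptions for illustration; real file sizes vary by architecture.

```python
# Effective bits per weight, taken from the table in this article.
EFFECTIVE_BITS = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}

def weights_gb(params_billion, fmt):
    """Approximate weight size in GB: parameters times bits per weight, divided by 8."""
    return params_billion * EFFECTIVE_BITS[fmt] / 8

def best_fit(memory_gb, model_sizes=(7, 13, 34, 70), headroom_gb=4.0):
    """Largest model first, then the highest-precision format that fits the budget."""
    budget = memory_gb - headroom_gb   # hypothetical allowance for KV cache and the OS
    for size in sorted(model_sizes, reverse=True):
        for fmt in sorted(EFFECTIVE_BITS, key=EFFECTIVE_BITS.get, reverse=True):
            if weights_gb(size, fmt) <= budget:
                return size, fmt
    return None                        # nothing fits at Q4_K_M or better

choice = best_fit(32)
print(choice, round(weights_gb(*choice), 1), "GB")
```

The search deliberately stops at Q4_K_M: below that tier, the article's advice is to drop to a smaller model only when there is no alternative.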
The engineering here is genuinely impressive. Practical 4-bit inference for billion-parameter models required solving the outlier problem, the block quantization insight, SIMD-aligned data structures, learned codebooks, and calibration-based importance allocation, all piled on top of each other. Simon Willison’s walkthrough covers the fundamentals well; the accumulated engineering decisions that built on those fundamentals are what made it deployable.