
The Non-Linear Economics of LLM Quantization

Source: simonwillison

Simon Willison published a thorough ground-up explanation of quantization this week, and it is worth using as a launchpad to dig into something the tutorials usually gloss over: the quality loss from quantization is not linear, and the point where it becomes painful is very specific and knowable.

When I first started running models locally with llama.cpp and Ollama, I treated quantization levels like a simple slider between quality and size. Smaller number, smaller file, worse output. That mental model is wrong in ways that matter.

What You Are Actually Storing

A language model is, at its core, a large collection of floating-point numbers called weights. A 7-billion-parameter model stores 7 billion of them. In full 32-bit float (FP32), that is 28 GB. In 16-bit (FP16 or BF16), 14 GB. Neither fits in the 8 to 12 GB of VRAM on a typical consumer GPU.
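The arithmetic behind those figures is just parameters times bits, divided by eight. A minimal sketch, assuming decimal gigabytes and counting weights only (no KV cache or activations); the helper name `weight_memory_gb` is made up for illustration:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# A 7-billion-parameter model at a few storage widths.
for label, bits in [("FP32", 32), ("FP16/BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gb(7e9, bits):5.1f} GB")
```

Real quantized files come out somewhat larger than the nominal bit width suggests, because the per-block scales are stored too.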

Quantization replaces those high-precision floats with smaller integers, then stores a scaling factor alongside them to recover approximate original values at inference time. The simplest version looks like this:

import numpy as np

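def quantize_block(weights: np.ndarray, bits: int = 8) -> tuple:
    """Symmetric linear quantization of a weight block (bits <= 8)."""
    qmax = 2 ** (bits - 1) - 1  # e.g., 127 for int8, so +abs_max cannot overflow
    abs_max = float(np.abs(weights).max())
    scale = abs_max / qmax if abs_max > 0 else 1.0  # all-zero block: any scale works
    # Clip before casting so rounding can never wrap around in int8.
    quantized = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return quantized, scale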

def dequantize_block(quantized: np.ndarray, scale: float) -> np.ndarray:
    return quantized.astype(np.float32) * scale

# Example
weights = np.array([0.42, -1.3, 0.07, 2.1, -0.88], dtype=np.float32)
q, s = quantize_block(weights, bits=8)
reconstructed = dequantize_block(q, s)
print(f"Original:      {weights}")
print(f"Quantized:     {q}")
print(f"Reconstructed: {reconstructed}")
print(f"Max error:     {np.abs(weights - reconstructed).max():.6f}")

At 8 bits per weight, the reconstruction error is small enough to be nearly irrelevant. You have 256 distinct values to represent the full range of a weight tensor. At 4 bits you have 16. At 2 bits, 4. The representational poverty at the low end is where things break.
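To see how quickly the error grows as bits are removed, here is a small experiment on synthetic weights. It is a sketch under assumptions: the weights are random normal values rather than a real model's, and the symmetric scheme uses one level fewer than the full 2^bits so that zero stays exactly representable.

```python
import numpy as np

def roundtrip_rmse(weights: np.ndarray, bits: int) -> float:
    """Quantize with one absmax scale, dequantize, and return the RMS error."""
    qmax = 2 ** (bits - 1) - 1              # 127, 7, 3, 1 for 8, 4, 3, 2 bits
    scale = np.abs(weights).max() / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return float(np.sqrt(np.mean((weights - q * scale) ** 2)))

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, size=4096)   # synthetic stand-in for a weight tensor

errors = {bits: roundtrip_rmse(weights, bits) for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"{bits} bits: RMS error {err:.4f}")
```

The error does not shrink in proportion to the bit count: each bit removed roughly doubles the step size, and by 2 bits almost every weight collapses to zero.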

The Granularity Problem

The naive version above uses one scale per entire tensor. If your tensor has weights ranging from -2.1 to 2.1, but 90% of them cluster between -0.1 and 0.1, a single scale wastes most of your representational budget on the outliers. This is the fundamental problem with per-tensor quantization.

The solution is to divide weights into smaller blocks and compute a separate scale per block. GGUF’s K-quant formats group weights into super-blocks of 256, each split into sub-blocks of 16 or 32 weights. Every sub-block gets its own low-precision scale, and a higher-precision super-block scale is applied on top. This hierarchical approach, introduced to llama.cpp in mid-2023 by Iwan Kawrakow, is the reason K-quant formats (Q4_K_M, Q5_K_M, etc.) so dramatically outperform their older non-K counterparts at a similar nominal bit width.

The quality jump between Q4_0 and Q4_K_M is not mainly about storing more bits per weight. Q4_K_M does spend slightly more (4.8 versus 4.0 in the table below), but most of the gain comes from K-quants allocating precision more intelligently within a similar budget.
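The effect of granularity is easy to reproduce. The sketch below compares one scale for the whole tensor against one scale per 32-weight block at 4 bits. It uses flat blocks only, not the two-level super-block layout of the real K-quants, and the weight distribution (tight cluster plus a few outliers at 2.1) is synthetic.

```python
import numpy as np

def quantize_roundtrip(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric absmax quantization with one scale per row of `w`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=4096)              # most weights sit near zero
w[::512] = 2.1                                    # a handful of large outliers

per_tensor = quantize_roundtrip(w.reshape(1, -1)).reshape(-1)   # one scale total
per_block = quantize_roundtrip(w.reshape(-1, 32)).reshape(-1)   # one scale per 32

rmse_tensor = float(np.sqrt(np.mean((w - per_tensor) ** 2)))
rmse_block = float(np.sqrt(np.mean((w - per_block) ** 2)))
print(f"per-tensor RMSE: {rmse_tensor:.4f}")
print(f"per-block  RMSE: {rmse_block:.4f}")
```

With a single scale the outliers set the step size for everyone, so the small weights all round to zero. With per-block scales only the few blocks that contain an outlier pay that price.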

Where the Perplexity Cliff Actually Is

Perplexity on WikiText-2 is the standard quality benchmark for quantized models. Lower is better. For a Llama-2-7B baseline:

Format     Bits/weight   Perplexity   Delta vs FP16
FP16       16.0          5.68         (baseline)
Q8_0       8.0           5.69         +0.01
Q6_K       6.6           5.72         +0.04
Q5_K_M     5.7           5.74         +0.06
Q4_K_M     4.8           5.78         +0.10
Q4_0       4.0           5.90         +0.22
Q3_K_M     3.3           6.13         +0.45
Q2_K       2.6           7.05         +1.37

From FP16 down to Q4_K_M, you lose about 0.1 perplexity points while cutting weight memory from 14 GB to roughly 4.2 GB (7 billion weights at 4.8 bits each). Dropping further to Q3_K_M brings the delta to +0.45, more than four times the quality hit, for a comparatively modest additional size reduction. Q2_K is genuinely painful.

The cliff is between Q3 and Q4. Everything above Q4_K_M is essentially safe. Everything below Q3 requires a specific justification.
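One way to make the cliff visible is to ask how much perplexity each step costs per gigabyte it saves. The sketch below copies the figures from the table in this section and assumes a 7-billion-parameter model with weights-only sizes. The "perplexity per GB saved" metric is an illustrative choice, not a standard benchmark.

```python
# (format, bits per weight, WikiText-2 perplexity), copied from the table above.
ROWS = [
    ("FP16", 16.0, 5.68), ("Q8_0", 8.0, 5.69), ("Q6_K", 6.6, 5.72),
    ("Q5_K_M", 5.7, 5.74), ("Q4_K_M", 4.8, 5.78), ("Q4_0", 4.0, 5.90),
    ("Q3_K_M", 3.3, 6.13), ("Q2_K", 2.6, 7.05),
]
N_PARAMS = 7e9  # assumption: weights only, decimal gigabytes

def size_gb(bits: float) -> float:
    return N_PARAMS * bits / 8 / 1e9

steps = []
for (a, a_bits, a_ppl), (b, b_bits, b_ppl) in zip(ROWS, ROWS[1:]):
    saved_gb = size_gb(a_bits) - size_gb(b_bits)
    cost = (b_ppl - a_ppl) / saved_gb        # perplexity added per GB saved
    steps.append((f"{a} -> {b}", saved_gb, cost))
    print(f"{a:>7} -> {b:<7} saves {saved_gb:5.2f} GB at {cost:6.3f} ppl/GB")
```

By this measure every step down is more expensive than the one before it, and the price rises steeply once you pass Q4_K_M.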

Weights vs. Activations: The Harder Problem

Most local inference tools, including the entire GGUF ecosystem, quantize only weights. Activations, the intermediate computed values during inference, remain in higher precision. This is a deliberate choice: activations vary per input and contain outlier values that break simple quantization schemes.

Tim Dettmers’s LLM.int8() paper identified the core issue: transformer models develop a small number of “emergent” feature dimensions that carry very large activation magnitudes. These outliers appear reliably starting at around 6.7 billion parameters, which is not a coincidence given where most of the interesting open-source models land. His solution was mixed-precision decomposition: detect the outlier dimensions, keep them in FP16, quantize the rest to int8, and do two separate matrix multiplications. The overhead is real but the quality loss is near zero.
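The decomposition can be sketched in NumPy. This is a toy version under stated assumptions, not the bitsandbytes kernels: the function names are invented, the data is random with one artificially inflated feature dimension, and the threshold of 6.0 follows the value reported in the paper.

```python
import numpy as np

def int8_matmul(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Absmax int8 matmul: one scale per row of x and per column of w."""
    sx = np.abs(x).max(axis=1, keepdims=True) / 127.0
    sw = np.abs(w).max(axis=0, keepdims=True) / 127.0
    xq = np.round(x / sx).astype(np.int8)
    wq = np.round(w / sw).astype(np.int8)
    acc = xq.astype(np.int32) @ wq.astype(np.int32)   # integer accumulation
    return acc * (sx * sw)                            # rescale to floats

def mixed_precision_matmul(x, w, threshold: float = 6.0):
    """Outlier feature dimensions in full precision, the rest in int8.

    Assumes at least one outlier and one regular dimension are present.
    """
    outlier = np.abs(x).max(axis=0) >= threshold      # one flag per feature dim
    high = x[:, outlier] @ w[outlier, :]              # small full-precision matmul
    low = int8_matmul(x[:, ~outlier], w[~outlier, :])
    return high + low

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 64))
x[:, 3] *= 20.0                                       # one "emergent" outlier dim
w = rng.normal(size=(64, 32))

exact = x @ w
err_plain = float(np.abs(int8_matmul(x, w) - exact).mean())
err_mixed = float(np.abs(mixed_precision_matmul(x, w) - exact).mean())
print(f"plain int8 error: {err_plain:.4f}")
print(f"mixed error:      {err_mixed:.4f}")
```

In plain int8 the outlier dimension dictates every row's scale, so the ordinary values are quantized coarsely. Pulling that one dimension out lets the remaining values use the full int8 range.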

This is implemented in the bitsandbytes library, accessible from HuggingFace Transformers with:

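from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Recent Transformers releases expect the 8-bit flag inside a quantization config.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # LLM.int8() mixed precision
    device_map="auto",
)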

The NF4 (NormalFloat4) format from the same library takes a different approach for 4-bit quantization: instead of uniform integer levels, it uses levels placed at quantiles of the normal distribution, since model weights are empirically close to normally distributed. This is the format used in QLoRA for fine-tuning quantized models, where you freeze the base weights in 4-bit NF4 and train only small adapter matrices in higher precision.
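The idea of normal-spaced levels can be illustrated with the standard library's `NormalDist`. This is a simplified sketch and not the actual NF4 code table, which is built differently and includes an exact zero. The function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def normal_quantile_levels(bits: int = 4) -> np.ndarray:
    """2**bits levels at evenly spaced quantiles of N(0, 1), scaled to [-1, 1]."""
    n = 2 ** bits
    probs = (np.arange(n) + 0.5) / n                  # midpoints of n equal-mass bins
    levels = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return levels / np.abs(levels).max()

def quantize_to_levels(w: np.ndarray, levels: np.ndarray) -> np.ndarray:
    """Snap each weight (normalized by absmax) to its nearest level."""
    absmax = np.abs(w).max()
    idx = np.abs(w[:, None] / absmax - levels[None, :]).argmin(axis=1)
    return levels[idx] * absmax

levels = normal_quantile_levels(4)
gaps = np.diff(levels)
print("levels:", np.round(levels, 3))
print(f"gap near zero: {gaps[7]:.3f}, gap at the edge: {gaps[0]:.3f}")

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1024)
recon = quantize_to_levels(w, levels)
```

The levels crowd together near zero, where most weights live, and spread out in the tails. A uniform 4-bit grid would space all sixteen levels evenly instead.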

GPTQ and AWQ: When You Have a Calibration Dataset

RTN (round to nearest) is the simplest quantization: just round each weight to the nearest representable value. It is what the basic GGUF formats do. The more sophisticated methods use a small dataset to minimize the actual effect on model outputs.

GPTQ (Frantar et al., 2022) uses second-order information, specifically an approximate Hessian of each layer's reconstruction error with respect to its weights, to compensate for quantization errors layer by layer. It processes about 128 calibration samples, and the whole pass runs on a single GPU, typically in well under an hour for a 7B model. The result is measurably better than RTN at the same bit width.

AWQ (Lin et al., 2023) takes a cleaner approach. The observation is that only about 1% of weights are responsible for most of the quality: those corresponding to large activation channels. AWQ scales those weights up before quantization (making them more accurately representable) and then compensates by scaling the activations down. No Hessian computation is needed, so it is faster to run than GPTQ while achieving comparable quality.
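The scaling trick is easy to demonstrate on random data. The sketch below is a toy version under stated assumptions: a single hand-picked salient channel, a fixed scale of 2, and plain per-column round-to-nearest at 4 bits. The real method searches for per-channel scales from calibration activations and quantizes group-wise.

```python
import numpy as np

def rtn_per_column(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Round-to-nearest with one absmax scale per output column."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 128))
x[:, 5] *= 30.0                        # input channel 5 carries large activations
w = rng.normal(size=(128, 64))         # rows are input channels

s = np.ones(128)
s[5] = 2.0                             # hypothetical scale for the salient channel

y_ref = x @ w
y_rtn = x @ rtn_per_column(w)
# Scale the salient weights up before quantizing, and the activations down.
y_awq = (x / s) @ rtn_per_column(w * s[:, None])

mse_rtn = float(np.mean((y_rtn - y_ref) ** 2))
mse_awq = float(np.mean((y_awq - y_ref) ** 2))
print(f"plain RTN output MSE:   {mse_rtn:.3f}")
print(f"scaled (AWQ-style) MSE: {mse_awq:.3f}")
```

Without quantization the two scalings cancel exactly, so the output is unchanged. With quantization the salient row's rounding error is effectively halved, and that row is the one multiplied by the largest activations.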

Both methods are available through HuggingFace’s ecosystem via auto-gptq and autoawq. For GPU-based inference, ExLlamaV2 with its EXL2 format offers significantly better throughput than standard GPTQ kernels through a custom CUDA implementation.

The Size-Quality Trade-off Is Model-Dependent

Smaller models are generally more sensitive to quantization than larger ones. A 70B model at Q4_K_M will outperform a 7B model at FP16 on most tasks, and the 70B at Q4_K_M fits in about 43 GB, which runs on a Mac Studio with 64 GB unified memory or across two consumer GPUs.

This changes the calculus significantly. If you have 24 GB of VRAM, you can run a 7B model at FP16 with room to spare, or just about squeeze in a 70B model at Q2_K (70 billion weights at 2.6 bits is roughly 23 GB before the KV cache). The 70B at Q2_K will likely win despite the severe quantization because the underlying model has more capacity to begin with. You are looking for the combination of model size and quantization level that maximizes quality within your memory budget, not just the one that minimizes quantization.
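A rough way to enumerate the options for a given budget is sketched below. The bit widths come from the perplexity table earlier in the article. The 13B size, the 2 GB headroom for KV cache and runtime overhead, and the function names are all assumptions for illustration, and real GGUF files differ somewhat from these weights-only estimates.

```python
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.0, "Q4_K_M": 4.8, "Q3_K_M": 3.3, "Q2_K": 2.6}
MODEL_PARAMS = {"7B": 7e9, "13B": 13e9, "70B": 70e9}   # 13B is a hypothetical midpoint

def weights_gb(params: float, bits: float) -> float:
    return params * bits / 8 / 1e9

def options_within(budget_gb: float, headroom_gb: float = 2.0) -> list:
    """Every (model, format, size_gb) that fits, largest model and precision first."""
    fits = []
    for model, params in MODEL_PARAMS.items():
        for fmt, bits in BITS_PER_WEIGHT.items():
            size = weights_gb(params, bits)
            if size + headroom_gb <= budget_gb:
                fits.append((model, fmt, round(size, 1)))
    return sorted(fits, key=lambda o: (-MODEL_PARAMS[o[0]], -BITS_PER_WEIGHT[o[1]]))

for model, fmt, size in options_within(48.0):
    print(f"{model:>4} {fmt:<7} {size:5.1f} GB")
```

The list only says what fits. Choosing among the survivors still needs benchmark numbers for the specific models involved.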

The llama.cpp perplexity benchmarks maintained in the project’s GitHub discussions are the most comprehensive source for this kind of cross-model comparison data.

IQ-Quants: Importance-Weighted Compression

The most recent development in the GGUF ecosystem is importance-based quantization (IQ-quants), which takes inspiration from AWQ’s insight about weight importance. Before quantizing, llama.cpp runs calibration text through the model to compute an importance matrix that identifies which weights matter most, then spends its precision where that matrix says it counts. The IQ4_XS format achieves quality close to Q4_K_M at roughly 8% smaller file size.
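One simple way importance can steer quantization is through the choice of scale: instead of always using the absmax scale, try a few candidates and keep the one with the lowest importance-weighted error. The sketch below shows that idea only. It is not llama.cpp's implementation; the function names, the candidate range, and the random "importance" values are assumptions.

```python
import numpy as np

def weighted_error(w, importance, scale, qmax) -> float:
    """Importance-weighted squared reconstruction error for a given scale."""
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return float(np.sum(importance * (w - q * scale) ** 2))

def search_scale(w, importance, bits: int = 4, n_candidates: int = 32):
    """Pick the scale (at or below absmax) with the lowest weighted error."""
    qmax = 2 ** (bits - 1) - 1
    base = np.abs(w).max() / qmax                     # plain absmax scale
    best_scale, best_err = base, weighted_error(w, importance, base, qmax)
    for factor in np.linspace(0.6, 1.0, n_candidates):
        err = weighted_error(w, importance, base * factor, qmax)
        if err < best_err:
            best_scale, best_err = base * factor, err
    return best_scale, best_err

rng = np.random.default_rng(0)
block = rng.normal(0.0, 0.05, size=32)                # one 32-weight block
importance = rng.uniform(0.1, 10.0, size=32)          # stand-in for imatrix values

qmax = 7
absmax_scale = np.abs(block).max() / qmax
err_absmax = weighted_error(block, importance, absmax_scale, qmax)
scale, err_search = search_scale(block, importance)
print(f"absmax scale error:   {err_absmax:.6f}")
print(f"searched scale error: {err_search:.6f}")
```

A smaller scale clips the largest weight slightly but gives finer steps to everything else. The search accepts that trade only when the weights it helps are the important ones.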

The progression from RTN to K-quants to IQ-quants is a consistent story: the bits-per-weight number matters less than how intelligently you allocate those bits.

Why This Matters for Local Inference

Running models locally has gone from a GPU-lab exercise to something achievable on a MacBook in about two years. The engineering enabling that shift is not primarily hardware: it is the combination of efficient runtimes like llama.cpp and increasingly sophisticated quantization schemes that preserve model quality while compressing aggressively.

The practical upshot for most users: Q4_K_M is the safe default for memory-constrained situations. Q8_0 if you have headroom and care about subtle quality differences. Avoid Q2_K unless storage is the binding constraint and you have accepted the quality hit deliberately. And when choosing between a smaller model at high precision and a larger model at aggressive quantization, the larger model usually wins.

Understanding the math, even at a surface level, makes these decisions less arbitrary. The scale-and-zero-point mechanics are simple enough to fit in a few lines of NumPy; the interesting engineering happens in the granularity of those scales and in which weights you protect.
