
What Quantization Actually Does to a Model Weight

Source: simonwillison

Running a 70B parameter model on consumer hardware was implausible three years ago. Today it’s routine, and quantization is the reason. Simon Willison’s recent walkthrough of quantization from first principles is a good prompt to go deeper on the mechanics, because the intuition most people carry around for this is fuzzier than it needs to be.

Quantization is not just “making the model smaller.” It is a specific mathematical transform applied to the weights of a neural network, and understanding what that transform does at the bit level explains why some quantization schemes are nearly lossless and others visibly degrade output.

Where the Memory Goes

A transformer language model is mostly weights: large two-dimensional matrices stored in the layers. Each weight is a floating point number. In the standard training representation, that means 32 bits per value. A 7B parameter model stored in float32 occupies about 28 GB of memory. The same model in float16 is 14 GB. On a 16 GB consumer GPU, that is already tight once you account for the KV cache during inference.
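The figures above are parameter count times bytes per value. A minimal sketch of that arithmetic, assuming decimal gigabytes (1 GB = 10^9 bytes) and using a hypothetical helper name:

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Weights only: the KV cache and activations come on top of this.
for label, bits in [("float32", 32), ("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"7B @ {label}: {weight_memory_gb(7e9, bits):.1f} GB")
```

This counts weights only, which is why a 14 GB float16 model is already tight on a 16 GB card.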

Floating point numbers use a split representation: sign, exponent, and mantissa. IEEE 754 float32 uses 1 sign bit, 8 exponent bits, and 23 mantissa bits. The exponent determines the magnitude of the number; the mantissa stores the fractional precision within that magnitude. Float16 shrinks to 5 exponent bits and 10 mantissa bits, which means far less range and precision. The Google-introduced bfloat16 takes a different trade: it keeps the full 8 exponent bits from float32 but cuts the mantissa to 7 bits, preserving dynamic range at the cost of precision. This is why bfloat16 is preferred for training (gradient magnitudes vary wildly) while float16 sees more use in inference.
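To make the bit layout concrete, here is a small sketch that splits a float32 into its three fields and compares how float16 and bfloat16 treat a value just outside float16's range. The helper names are made up for illustration, and the bfloat16 conversion truncates rather than rounds, which is a simplification of what real hardware usually does.

```python
import struct
import numpy as np

def float32_fields(x):
    """Split a float32 into its sign, biased exponent, and mantissa fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF          # 23 bits
    return sign, exponent, mantissa

def to_bfloat16(x):
    """Emulate bfloat16 by keeping only the top 16 bits of a float32 (truncation)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(float32_fields(1.0))             # sign 0, biased exponent 127, mantissa 0
print(float(np.float16(70000.0)))      # exceeds float16's maximum (65504), so inf
print(to_bfloat16(70000.0))            # stays finite but loses low-order precision
```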

But even float16 may be more precision than the weights actually need. Most weights in a trained model cluster around small values near zero, with a distribution that rarely uses the full dynamic range available. This is the opening that quantization exploits.

Linear Quantization: The Math

The simplest form of quantization maps a range of floating point values onto a set of integers. For 8-bit quantization, you have 256 possible integer values, typically either 0..255 (unsigned) or -128..127 (signed).

The “absmax” or symmetric approach works like this: find the maximum absolute value in the tensor, call it alpha, then define a scale factor s = alpha / 127. Every weight x gets mapped to q = round(x / s). To recover the original weight approximately, you compute x_approx = s * q.

import numpy as np

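def quantize_int8_absmax(weights):
    # Largest magnitude in the tensor defines the representable range.
    alpha = np.max(np.abs(weights))
    # Map [-alpha, alpha] onto [-127, 127]; fall back to 1.0 for an all-zero tensor.
    scale = alpha / 127.0 if alpha > 0 else 1.0
    quantized = np.round(weights / scale).astype(np.int8)
    return quantized, scale

def dequantize_int8(quantized, scale):
    # Approximate reconstruction: x_approx = s * q.
    return quantized.astype(np.float32) * scale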

The asymmetric or zero-point variant adds an offset to handle tensors whose range is not centered around zero. Here you map [x_min, x_max] to [0, 255]:

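def quantize_int8_zeropoint(weights):
    # Map [x_min, x_max] onto the unsigned range [0, 255]; assumes x_max > x_min.
    x_min, x_max = weights.min(), weights.max()
    scale = (x_max - x_min) / 255.0
    # Integer offset chosen so that x_min lands at (or next to) 0.
    zero_point = int(np.round(-x_min / scale))
    quantized = np.round(weights / scale + zero_point).clip(0, 255).astype(np.uint8)
    return quantized, scale, zero_point

def dequantize_int8_zeropoint(quantized, scale, zero_point):
    # Undo the offset, then rescale: x_approx = s * (q - zero_point).
    return (quantized.astype(np.float32) - zero_point) * scale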

The reconstruction error from this process is bounded by scale / 2 per weight. Smaller scale means more precision. The problem is that scale is determined by the extremes of the distribution, so a single large outlier value forces a coarse scale that degrades precision for all the other weights.
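The scale / 2 bound can be checked empirically. This sketch repeats the absmax scheme inline on synthetic weights; the normal distribution and its standard deviation are assumptions chosen for illustration, not measurements from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a weight tensor: small values centered on zero.
weights = rng.normal(0.0, 0.02, size=4096).astype(np.float32)

scale = np.max(np.abs(weights)) / 127.0
quantized = np.round(weights / scale).astype(np.int8)
restored = quantized.astype(np.float32) * scale

# Rounding to the nearest integer step can be off by at most half a step.
max_error = np.max(np.abs(weights - restored))
print(f"scale = {scale:.6f}, max error = {max_error:.6f}, bound = {scale / 2:.6f}")
```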

The Outlier Problem and Block-wise Quantization

Transformer models are known to produce outlier activations, and the weight matrices that interact with those activations tend to contain outlier values too. A single weight with magnitude 10.0 in an otherwise narrow distribution around 0.01 will set a scale so large that the small weights lose all meaningful precision.
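The effect is easy to reproduce with the numbers from the paragraph above. A toy sketch, with a hypothetical helper that quantizes and immediately dequantizes:

```python
import numpy as np

def absmax_roundtrip(weights):
    """Quantize to int8 with a single absmax scale, then dequantize."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q.astype(np.float32) * scale

small = np.full(1000, 0.01, dtype=np.float32)
with_outlier = np.append(small, np.float32(10.0))

# Without the outlier, 0.01 is the maximum and survives the round trip.
print(absmax_roundtrip(small)[:3])
# With it, the scale is about 0.079, so every 0.01 rounds to integer 0.
print(absmax_roundtrip(with_outlier)[:3])
```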

Block-wise quantization addresses this directly: instead of computing one scale for an entire tensor, you divide the tensor into small blocks (typically 32 or 64 consecutive values along a row) and compute a separate scale per block. This way, an outlier in one block only damages that block rather than the entire layer.
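A minimal sketch of the per-block idea, using int8 for readability. The function names are made up, and the code assumes the tensor length is a multiple of the block size.

```python
import numpy as np

def quantize_blockwise_int8(weights, block_size=32):
    """Absmax int8 quantization with one scale per block of consecutive values.
    Assumes weights.size is a multiple of block_size."""
    blocks = weights.reshape(-1, block_size)
    scales = np.max(np.abs(blocks), axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)   # avoid dividing by zero
    quantized = np.round(blocks / scales).astype(np.int8)
    return quantized, scales.astype(np.float32)

def dequantize_blockwise(quantized, scales):
    """Rescale each block by its own scale and flatten back to one vector."""
    return (quantized.astype(np.float32) * scales).reshape(-1)

weights = np.full(64, 0.01, dtype=np.float32)
weights[0] = 10.0                       # the outlier lands in the first block only
q, s = quantize_blockwise_int8(weights)
restored = dequantize_blockwise(q, s)
# The first block's small weights collapse to zero; the second block is unaffected.
print(restored[1], restored[32])
```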

This is the approach llama.cpp uses in its GGUF format. The quantization type Q4_0 stores 4-bit weights in blocks of 32, with one float16 scale per block. The storage is (32 * 4 / 8) + 2 = 18 bytes per 32-weight block, working out to 4.5 bits per weight including the scale overhead.
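The 18-byte figure can be reproduced with a simplified packer. This is a sketch of the general idea only: the rounding rule and byte layout here are assumptions for illustration and do not claim to match llama.cpp's actual Q4_0 implementation.

```python
import numpy as np

BLOCK = 32

def pack_q4_block(block):
    """Quantize 32 floats to 4-bit codes plus one float16 scale (simplified)."""
    scale = np.float16(np.max(np.abs(block)) / 7.0)
    if scale == 0:
        scale = np.float16(1.0)
    # Signed codes in [-8, 7], shifted to [0, 15] so each fits in a nibble.
    codes = np.clip(np.round(block / np.float32(scale)), -8, 7).astype(np.int8) + 8
    codes = codes.astype(np.uint8)
    packed = codes[:16] | (codes[16:] << 4)    # two 4-bit codes per byte
    return scale.tobytes() + packed.tobytes()

def unpack_q4_block(data):
    """Inverse of pack_q4_block: recover 32 approximate float32 weights."""
    scale = np.frombuffer(data[:2], dtype=np.float16)[0]
    packed = np.frombuffer(data[2:], dtype=np.uint8)
    codes = np.concatenate([packed & 0x0F, packed >> 4]).astype(np.int8) - 8
    return codes.astype(np.float32) * np.float32(scale)

rng = np.random.default_rng(0)
block = rng.normal(0.0, 0.02, size=BLOCK).astype(np.float32)
data = pack_q4_block(block)
print(len(data), "bytes ->", len(data) * 8 / BLOCK, "bits per weight")
```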

Reading GGUF Type Names

The naming scheme for GGUF quantization types is initially opaque but has a consistent logic.

Q4_0: 4-bit, “type 0” (absmax, no min stored). Q4_1: 4-bit, “type 1” (scale and min stored, slightly more accurate but larger). Q5_0, Q5_1 and Q8_0 follow the same pattern.

The “K-quants,” introduced by Iwan Kawrakow (ikawrakow), add another layer. Types like Q4_K, Q5_K, and Q6_K use a super-block structure. In Q4_K and Q5_K, a block of 256 weights is divided into eight sub-blocks of 32 weights each. The sub-block scales (and mins) are themselves quantized to 6-bit integers, and float16 values at the super-block level store the scaling factors for those sub-scales. Q6_K instead uses sixteen sub-blocks of 16 weights with 8-bit scales. This compresses the metadata overhead while maintaining per-block accuracy.
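The metadata saving can be put in numbers. The sketch below compares a flat layout (a float16 scale and a float16 min per 32 weights) with a nested layout for 4-bit weights. It assumes each sub-block stores a 6-bit min alongside its 6-bit scale and that the super-block carries two float16 values, which is my understanding of Q4_K rather than something stated in full above.

```python
def bits_per_weight(n_weights, weight_bits, metadata_bits):
    """Average storage cost per weight, including scale/min metadata."""
    return (n_weights * weight_bits + metadata_bits) / n_weights

# Flat layout: every 32-weight block carries its own float16 scale and float16 min.
flat = bits_per_weight(32, 4, 2 * 16)

# Nested layout (assumed): 8 sub-blocks, each with a 6-bit scale and a 6-bit min,
# plus two float16 values per 256 weights to rescale those 6-bit integers.
nested = bits_per_weight(256, 4, 8 * (6 + 6) + 2 * 16)

print(f"flat: {flat} bits/weight, nested: {nested} bits/weight")
```

Under these assumptions the nested layout keeps a scale and a min per 32 weights while paying half a bit per weight less for them.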

The _S, _M, _L suffixes on types like Q4_K_S and Q4_K_M indicate different per-tensor precision assignments. In Q4_K_M, some tensors (half of the attention value and feed-forward down-projection tensors) are quantized at Q6_K rather than Q4_K to preserve quality in the parts of the model that matter most. The llama.cpp quantization documentation has a breakdown of which layers get which treatment.

For most local use, Q4_K_M is the practical sweet spot: around 4.8 bits per weight, near-float16 quality on most benchmarks, and fits 7B models comfortably in 8 GB VRAM.

Calibration-Based Quantization: GPTQ and AWQ

Post-training quantization methods like GPTQ and AWQ take a different approach. Instead of using purely local statistics (max, min per block), they run a calibration dataset through the model and use the resulting activations to make smarter quantization decisions.

GPTQ is based on Optimal Brain Compression, which applies second-order information to minimize quantization error. It works through the weight matrices layer by layer. For each one it approximates the Hessian of that layer's output reconstruction error, computed from the calibration inputs, then uses that information to both quantize weights and compensate for quantization error by adjusting the remaining unquantized weights. The result is that GPTQ can achieve 4-bit weights with quality closer to float16 than naive block-wise quantization, particularly on larger models where the calibration data can capture more of the distribution.

AWQ (Activation-aware Weight Quantization) takes a simpler but effective observation: across a weight matrix, some input channels are far more salient than others, as measured by average activation magnitude on calibration data. If a channel is activated heavily, its corresponding weights are more sensitive to quantization error. AWQ identifies these salient channels and scales them up before quantization (and divides activations by the same factor during inference), effectively giving those weights more of the quantization range. This is cheaper to compute than GPTQ and achieves comparable results.
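The scaling trick rests on a simple identity: multiplying a weight column by s and dividing the matching activation by s leaves the layer output unchanged. The toy sketch below checks that identity and then compares output error after a crude 4-bit per-row quantization. The fixed factor of 2, the choice of "salient" channels, and the quantizer are all assumptions for illustration; AWQ itself derives its scales from calibration statistics.

```python
import numpy as np

def quantize_rows_4bit(w):
    """Per-row absmax quantization to signed levels -7..7, then dequantize."""
    scale = np.max(np.abs(w), axis=1, keepdims=True) / 7.0
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(64, 128))   # (out_features, in_features)
x = rng.normal(0.0, 1.0, size=128)
x[:4] = 20.0                                # pretend the first 4 input channels are salient

s = np.ones(128)
s[:4] = 2.0                                 # hypothetical scale for the salient channels

# Identity: scaling weight columns by s and activations by 1/s preserves the output.
y_ref = W @ x
assert np.allclose((W * s) @ (x / s), y_ref)

err_plain = np.linalg.norm(quantize_rows_4bit(W) @ x - y_ref)
err_scaled = np.linalg.norm(quantize_rows_4bit(W * s) @ (x / s) - y_ref)
print(f"output error without scaling: {err_plain:.4f}, with scaling: {err_scaled:.4f}")
```

Because the salient columns' rounding error is divided by s at inference time, a setup like this one typically shows a lower output error for the scaled variant, as long as scaling does not inflate the per-row quantization step too much.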

A third format worth knowing is EXL2, the format used by ExLlamaV2, which assigns different bit widths to different rows within a single weight matrix based on calibration data. You can target a specific average bits-per-weight (3.7, 4.25, 5.0, etc.) and the quantizer will distribute precision where the calibration data says it matters most. This fine-grained allocation often beats GPTQ at the same average bit width.

What the Numbers Mean in Practice

Quantization quality is usually measured by perplexity on a held-out text corpus, where lower is better. For a Llama-3-8B class model, representative numbers look roughly like this:

Format    Bits/weight    Perplexity delta from FP16
Q8_0      8.5            ~0.001 (effectively lossless)
Q6_K      6.6            ~0.01
Q5_K_M    5.7            ~0.05
Q4_K_M    4.8            ~0.15
Q3_K_M    3.9            ~0.5
Q2_K      3.3            ~2.5 (significant)

These are approximate and model-dependent; larger models tolerate lower bit widths better than smaller ones because they have more redundancy in their weights.
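For reference, perplexity is the exponential of the average negative log-probability the model assigns to the true next tokens, and the delta is the difference between the quantized and FP16 values. A minimal sketch with made-up log-probabilities (in a real evaluation these come from running both models over the same corpus):

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities assigned to each true token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Made-up values for illustration only, not measurements.
fp16_log_probs = [-1.2, -0.4, -2.3, -0.9]
quant_log_probs = [-1.25, -0.42, -2.4, -0.95]

delta = perplexity(quant_log_probs) - perplexity(fp16_log_probs)
print(f"perplexity delta: {delta:.3f}")
```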

Speed is less intuitive than you might expect. Quantized inference is often faster than float16 not because integer arithmetic is faster, but because the bottleneck at inference time is typically memory bandwidth rather than compute. Loading a Q4 weight requires reading roughly a quarter of the bytes of a float16 weight (about 4.5 bits instead of 16), which translates directly to throughput on both GPU VRAM and system RAM. On Apple Silicon, where inference runs on unified memory with substantial memory bandwidth, this effect is pronounced.
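A back-of-the-envelope way to see the bandwidth argument: if generating each token requires reading every weight once, memory bandwidth divided by model size gives a ceiling on tokens per second. The bandwidth figure below is a hypothetical round number, and the estimate ignores the KV cache, compute time, and any overlap between the two.

```python
def max_tokens_per_second(num_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if every weight is read once per token."""
    model_bytes = num_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes

BANDWIDTH_GB_S = 400.0   # hypothetical memory bandwidth, not a measured figure
for label, bits in [("float16", 16), ("Q8_0", 8.5), ("Q4_0", 4.5)]:
    ceiling = max_tokens_per_second(7e9, bits, BANDWIDTH_GB_S)
    print(f"{label}: ~{ceiling:.0f} tokens/s ceiling")
```

The ratio between the ceilings depends only on bits per weight, so under this model a 4.5-bit format tops out at about 3.6 times the float16 rate on the same hardware.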

The Underlying Principle

What all of these schemes share is the recognition that neural network weights are not uniformly important or uniformly distributed. The distribution of values within a layer, the salience of different channels as measured by activations, the sensitivity of different layers to quantization error: all of these vary, and quantization schemes that account for this structure do better than those that treat all weights identically.

The progression from Q4_0 to Q4_K_M to GPTQ to EXL2 is a progression in how much of that structure the scheme is able to exploit. The cost is complexity and calibration overhead. For most use cases, the K-quants strike the right balance: they require no calibration data, are fast to produce, and get most of the quality of the more expensive approaches at practical bit widths.
