From Float32 to Q4_K_M: What LLM Quantization Actually Does to Weights

Simon Willison published a ground-up walkthrough of quantization recently, and it is worth reading on its own. But after spending a fair amount of time picking between Q4_K_M and Q5_K_M files on Hugging Face without fully understanding what I was trading off, I wanted to go deeper on the parts that the high-level explanations tend to skip over.

The core problem quantization solves is simple: a 7 billion parameter model stored as 32-bit floats requires around 28 GB of memory. Most consumer hardware does not have that. Halve the precision to float16 and you get to 14 GB, which still exceeds a lot of VRAM budgets. Quantization takes this further by representing weights as integers, and the interesting engineering is in doing that without destroying the model’s ability to reason.

The Arithmetic

At its core, quantization maps a range of floating-point values onto a smaller set of integer values. The simplest version is symmetric quantization:

import numpy as np

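def quantize_symmetric(weights: np.ndarray, bits: int = 8):
    # Storage is int8, so this sketch only supports bit widths up to 8
    qmax = 2 ** (bits - 1) - 1
    max_val = np.max(np.abs(weights))
    # One scale maps the largest magnitude onto the largest integer level;
    # fall back to 1.0 for an all-zero tensor to avoid dividing by zero
    scale = max_val / qmax if max_val > 0 else 1.0
    quantized = np.round(weights / scale).clip(-qmax - 1, qmax).astype(np.int8)
    return quantized, scale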

def dequantize(quantized: np.ndarray, scale: float) -> np.ndarray:
    return quantized.astype(np.float32) * scale

For asymmetric quantization, you add a zero point to shift the integer range to match the actual distribution of values:

def quantize_asymmetric(weights: np.ndarray, bits: int = 4):
    w_min, w_max = weights.min(), weights.max()
    n_levels = 2 ** bits - 1
    scale = (w_max - w_min) / n_levels
    zero_point = round(-w_min / scale)
    quantized = np.round(weights / scale + zero_point).clip(0, n_levels).astype(np.uint8)
    return quantized, scale, zero_point
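
The asymmetric scheme needs its own inverse, because the zero point has to be subtracted before rescaling. A minimal round-trip sketch follows; `dequantize_asymmetric` is a name introduced here for illustration, and the quantizer is repeated only so the block runs on its own.

```python
import numpy as np

def quantize_asymmetric(weights: np.ndarray, bits: int = 4):
    w_min, w_max = float(weights.min()), float(weights.max())
    n_levels = 2 ** bits - 1
    scale = (w_max - w_min) / n_levels
    zero_point = round(-w_min / scale)
    quantized = np.round(weights / scale + zero_point).clip(0, n_levels).astype(np.uint8)
    return quantized, scale, zero_point

def dequantize_asymmetric(quantized: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # Hypothetical helper: undo the shift first, then the scaling
    return (quantized.astype(np.float32) - zero_point) * scale

# All-positive weights: a symmetric grid would waste its negative half here
weights = np.linspace(0.1, 0.9, 16).astype(np.float32)
q, scale, zp = quantize_asymmetric(weights)
restored = dequantize_asymmetric(q, scale, zp)
max_err = float(np.max(np.abs(weights - restored)))
print(f"levels used: {q.min()}..{q.max()}   max error: {max_err:.4f}   step: {scale:.4f}")
```

The worst-case round-trip error stays within one grid step, since both the value and the zero point are rounded.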

The error introduced by this process is called quantization error, and for 8-bit integers it is typically small enough that model output is nearly indistinguishable from the float16 baseline. With 4 bits, things get more interesting.
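
To put a number on the difference between 8 and 4 bits, the sketch below round-trips the same synthetic weights at both widths. The Gaussian weights, the seed, and the helper name `roundtrip_mse` are assumptions for illustration, not measurements from a real model.

```python
import numpy as np

def roundtrip_mse(weights: np.ndarray, bits: int) -> float:
    """Symmetric quantize, dequantize, and return the mean squared error."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(weights)) / qmax
    restored = np.round(weights / scale).clip(-qmax, qmax) * scale
    return float(np.mean((weights - restored) ** 2))

rng = np.random.default_rng(0)  # arbitrary fixed seed
weights = rng.normal(0.0, 0.1, size=4096).astype(np.float32)

mse8 = roundtrip_mse(weights, bits=8)
mse4 = roundtrip_mse(weights, bits=4)
print(f"8-bit MSE: {mse8:.2e}   4-bit MSE: {mse4:.2e}   ratio: {mse4 / mse8:.0f}x")
```

Going from 8 to 4 bits widens the grid step by 127/7, about 18x, so the squared error grows by roughly 18 squared, a few hundred times.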

Why LLMs Break Naive Quantization

The straightforward approach is to quantize all the weights in a layer together, using a single scale factor for the entire tensor. This works adequately for many neural network architectures, but LLMs have a property that makes it fail at 4 bits: outliers.

Researchers found in 2022 that transformer models develop a small number of channels with values an order of magnitude or more larger than the rest. The effect was first documented for activations (the LLM.int8() work), and weight tensors can contain outliers of their own. These outliers are not anomalies; they are load-bearing features the model has learned to use for specific computations. But their presence forces the quantization scale to span a much wider range, which means all the other weights get shoved into a coarser grid.

Consider a weight tensor where 99.9% of values fall between -0.5 and 0.5, but a handful reach ±10. With symmetric INT4 quantization across the whole tensor, your scale factor has to accommodate those ±10 values. That gives you 15 integer levels (-7 to 7) spread across a range of 20, which is a step size of 10/7, or about 1.43. Every weight with a magnitude below half a step (about 0.71) rounds to zero, so all your small weights, which had meaningful differences at the 0.01 level, collapse to a single integer.

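# Demonstrating the outlier problem (uses quantize_symmetric and dequantize from above)
weights = np.random.randn(1024).astype(np.float32) * 0.1

def roundtrip_mse(w: np.ndarray) -> float:
    # Quantize to 4 bits, dequantize, and measure the mean squared error
    return float(np.mean((w - dequantize(*quantize_symmetric(w, 4))) ** 2))

print(f"MSE without outliers: {roundtrip_mse(weights):.6f}")
weights_with_outlier = weights.copy()
weights_with_outlier[0] = 10.0
print(f"MSE with one outlier: {roundtrip_mse(weights_with_outlier):.6f}")
# Values vary with the random draw; expect roughly 0.0002 vs 0.0100.
# With the outlier, every small weight rounds to 0, so the MSE approaches
# the variance of the weights themselves (0.1 ** 2 = 0.01).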

The error increases by a factor of about 50 because of a single value, and every weight except the outlier is now reconstructed as zero.

Group Quantization Fixes This

The solution is to use separate scale factors for small groups of weights rather than one scale for the entire tensor. If you quantize in blocks of 32 weights, the outlier in one block only ruins the precision of those 32 values, not the entire layer.

def quantize_grouped(weights: np.ndarray, group_size: int = 32, bits: int = 4):
    n = len(weights)
    assert n % group_size == 0
    groups = n // group_size
    all_quantized = []
    scales = []
    for i in range(groups):
        g = weights[i * group_size:(i + 1) * group_size]
        scale = np.max(np.abs(g)) / (2 ** (bits - 1) - 1)
        q = np.round(g / scale).clip(-(2**(bits-1)), 2**(bits-1)-1).astype(np.int8)
        all_quantized.append(q)
        scales.append(scale)
    return np.concatenate(all_quantized), np.array(scales, dtype=np.float32)
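
To see the effect on the outlier example, the sketch below compares one scale per tensor with one scale per group of 32. `fake_quant` is a hypothetical vectorised helper that quantizes and immediately dequantizes; the seed and synthetic weights are likewise assumptions.

```python
import numpy as np

def fake_quant(weights: np.ndarray, bits: int = 4, group_size=None) -> np.ndarray:
    """Quantize then dequantize with one scale per group (per tensor if group_size is None)."""
    qmax = 2 ** (bits - 1) - 1
    groups = weights.reshape(1, -1) if group_size is None else weights.reshape(-1, group_size)
    scale = np.max(np.abs(groups), axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero groups
    restored = np.round(groups / scale).clip(-qmax, qmax) * scale
    return restored.reshape(weights.shape)

rng = np.random.default_rng(0)  # arbitrary fixed seed
weights = rng.normal(0.0, 0.1, size=1024).astype(np.float32)
weights[0] = 10.0               # a single outlier

mse_tensor = float(np.mean((weights - fake_quant(weights)) ** 2))
mse_grouped = float(np.mean((weights - fake_quant(weights, group_size=32)) ** 2))
print(f"per-tensor MSE: {mse_tensor:.6f}   grouped MSE: {mse_grouped:.6f}")
```

Only the 31 neighbours that share a group with the outlier are flattened to zero; the other 31 groups keep a fine grid.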

The cost is that you now store one float32 scale per 32 weights, adding 1 bit of overhead per weight at group size 32. For 4-bit weights, that overhead is 25%, bringing the effective bit-width to 5 bits per parameter. Storing the scale as float16 instead, as the Q4_0 format in llama.cpp does, halves the overhead and gives 4.5 bits per parameter. Either way it is still a substantial improvement over float16.
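
The overhead arithmetic is easy to check directly. The helper name `effective_bits` is an assumption for illustration; it simply amortises the scale over the group.

```python
def effective_bits(weight_bits: int, group_size: int, scale_bits: int) -> float:
    """Bits stored per weight once the per-group scale is amortised."""
    return weight_bits + scale_bits / group_size

print(effective_bits(4, 32, 32))   # float32 scale per 32 weights  -> 5.0
print(effective_bits(4, 32, 16))   # float16 scale per 32 weights  -> 4.5
print(effective_bits(4, 128, 16))  # float16 scale per 128 weights -> 4.125
```

Larger groups and narrower scales both shrink the overhead, at the price of exposing more weights to each outlier.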

This is the foundation that GGUF and llama.cpp build on.

K-Quants: Hierarchical Scales on Top of Group Quantization

Standard group quantization stores one full-precision scale per block. K-quants, introduced to llama.cpp by contributor Iwan Kawrakow, keep a uniform grid inside each block but change how the grid is chosen and how its parameters are stored. The K is sometimes taken to mean k-means, but these formats do not use a learned codebook.

The insight is that LLM weight distributions are not uniform. Most weights cluster around zero, with fewer large values, and the values in a given block are often not centred on zero. Several K-quant types (Q2_K, Q4_K, Q5_K) therefore store a minimum alongside the scale for each sub-block, which is the asymmetric scheme from earlier, while Q3_K and Q6_K store a scale only. In both cases the reference quantizer searches for a scale that reduces reconstruction error in the block rather than simply taking the largest magnitude.

The storage format gets more elaborate. K-quants use a superblock structure: in Q4_K, a superblock of 256 weights contains 8 sub-blocks of 32 weights each (some other types use 16 sub-blocks of 16). Each sub-block has its own quantization scale, stored at reduced precision (6 bits in Q4_K), and the superblock has a higher-precision float16 scale that normalises the sub-block scales. This hierarchical approach lets you store sub-block scales more cheaply without losing the ability to represent large dynamic ranges.
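
The two-level scale idea can be sketched in a few lines. This is a simplified, scale-only illustration and not the actual llama.cpp block layout: the function names, the round-up rule for sub-block scales, and the 6-bit width are assumptions chosen to mirror the description above.

```python
import numpy as np

def quantize_superblock(block, bits=4, sub_size=32, scale_bits=6):
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit weights
    smax = 2 ** scale_bits - 1                 # 63 for 6-bit sub-block scales
    subs = np.asarray(block, dtype=np.float32).reshape(-1, sub_size)
    ideal = np.abs(subs).max(axis=1) / qmax    # float scale each sub-block would like
    super_scale = float(np.float16(ideal.max() / smax))  # shared scale, rounded to float16
    if super_scale == 0.0:
        super_scale = 1.0                      # all-zero (or underflowing) block
    # Round sub-block scales up so the largest weight still fits on the grid
    sub_q = np.clip(np.ceil(ideal / super_scale), 1, smax).astype(np.uint8)
    scales = sub_q.astype(np.float32) * super_scale
    q = np.clip(np.round(subs / scales[:, None]), -qmax, qmax).astype(np.int8)
    return q, sub_q, super_scale

def dequantize_superblock(q, sub_q, super_scale):
    scales = sub_q.astype(np.float32) * super_scale
    return (q.astype(np.float32) * scales[:, None]).reshape(-1)

rng = np.random.default_rng(0)  # arbitrary fixed seed
block = rng.normal(0.0, 0.1, size=256).astype(np.float32)
q, sub_q, super_scale = quantize_superblock(block)
restored = dequantize_superblock(q, sub_q, super_scale)
mse = float(np.mean((block - restored) ** 2))
bits_per_weight = 4 + 6 / 32 + 16 / 256        # weights + sub-block scales + super scale
print(f"MSE: {mse:.6f}   bits/weight: {bits_per_weight}")
```

In this sketch the scales cost 0.25 bits per weight, compared with 0.5 bits for a flat float16 scale per 32 weights.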

Decoding the GGUF Naming Convention

When you see a filename like model-Q4_K_M.gguf, each part means something specific:

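Q4: nominally 4 bits per weight (the effective average is higher once scales and higher-precision tensors are counted)
_K: K-quant method (superblock layout with compactly stored sub-block scales)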
_M: Medium variant
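
The naming scheme above is regular enough to parse mechanically. The sketch below is a hypothetical helper, not part of llama.cpp or any GGUF tooling, and it only covers tags of the form Q<bits>_K, Q<bits>_K_<S|M|L>, and the older Q<bits>_0 / Q<bits>_1.

```python
import re
from typing import NamedTuple, Optional

class QuantTag(NamedTuple):
    bits: int                # nominal bits per weight
    k_quant: bool            # True for the _K family
    variant: Optional[str]   # "S", "M", "L", or None

_TAG = re.compile(r"Q(\d+)_(?:K(?:_([SML]))?|[01])(?![A-Za-z0-9])", re.IGNORECASE)

def parse_quant_tag(filename: str) -> Optional[QuantTag]:
    """Extract the quantization tag from a GGUF-style filename, if one is present."""
    m = _TAG.search(filename)
    if m is None:
        return None
    variant = m.group(2).upper() if m.group(2) else None
    return QuantTag(bits=int(m.group(1)), k_quant="_K" in m.group(0).upper(), variant=variant)

print(parse_quant_tag("model-Q4_K_M.gguf"))
print(parse_quant_tag("model-Q8_0.gguf"))
```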

The S/M/L suffixes control how many tensors get elevated to higher precision. Some tensors, notably the attention value projection and the feed-forward down projection, are more sensitive to quantization errors than the rest; the M variant of Q4_K promotes a portion of these tensors to Q6_K (6-bit K-quant) while keeping the bulk of weights at Q4_K. The S (Small) variant keeps almost everything at 4 bits for maximum compression; L (Large), which llama.cpp offers for Q3_K, is more conservative.

This per-layer precision mixing is why Q4_K_M consistently outperforms Q4_0 (naive 4-bit) on perplexity benchmarks despite the average bit-width being similar. The extra bits are spent where they matter most.

What the Perplexity Numbers Mean

Perplexity on the WikiText-2 dataset is the standard benchmark for comparing quantization quality. Lower is better. For a Llama 2 7B model, representative numbers look like this:

Format	Avg bits/weight	Approx size	WikiText-2 PPL
F16	16	13.5 GB	~5.25 (baseline)
Q8_0	8.5	7.2 GB	~5.26
Q6_K	6.6	5.5 GB	~5.28
Q5_K_M	5.7	4.8 GB	~5.32
Q4_K_M	4.8	4.1 GB	~5.40
Q3_K_M	3.9	3.3 GB	~5.73
Q2_K	3.4	2.9 GB	~6.59

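Perplexity is the exponential of the average per-token cross-entropy, so differences are best read as ratios rather than absolute gaps. Going from F16 to Q4_K_M raises perplexity by about 3%, while going from Q4_K_M to Q2_K raises it by about 22%. That second jump translates to noticeably worse output quality on open-ended generation tasks.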
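
Since perplexity is the exponential of the average per-token loss, taking the logarithm puts the table on an additive scale. The sketch below does that for three rows of the table; the dictionary simply copies those values.

```python
import math

ppl = {"F16": 5.25, "Q4_K_M": 5.40, "Q2_K": 6.59}  # values from the table above

# Extra cross-entropy (nats per token) relative to the F16 baseline
extra_loss = {name: math.log(value / ppl["F16"]) for name, value in ppl.items()}

for name, value in ppl.items():
    rel = value / ppl["F16"] - 1.0
    print(f"{name:7s} PPL {value:.2f}  ({rel:+.1%} vs F16, {extra_loss[name]:+.3f} nats/token)")
```

On this scale Q4_K_M costs about 0.03 nats per token relative to F16, while Q2_K costs about 0.23, roughly eight times as much.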

Q8_0 and Q6_K are essentially lossless for practical purposes. Q4_K_M is the sweet spot for most use cases: a 3.3x memory reduction from F16 with a perplexity increase that most users will not notice in conversation. Q3_K_M is where degradation starts to become apparent on reasoning tasks. Q2_K is useful only when memory is the hard constraint and quality is secondary.
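
The size column follows almost directly from the bits-per-weight column. The sketch below assumes a parameter count of 6.74 billion for Llama 2 7B and ignores file metadata; `estimated_size_gb` is a name introduced here for illustration.

```python
def estimated_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate file size in GB (10**9 bytes) from parameter count and average bit width."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 6.74e9  # assumed parameter count for Llama 2 7B

for name, bpw in [("F16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 3.4)]:
    print(f"{name:7s} ~{estimated_size_gb(N_PARAMS, bpw):.1f} GB")
```

The estimates land within about 0.1 GB of the table, which is a useful sanity check when a download page lists only one of the two numbers.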

GPTQ and AWQ: Different Approaches for GPU Inference

GGUF and llama.cpp are primarily designed for CPU inference (though GPU layers are supported). For pure GPU inference, GPTQ and AWQ are more common.

GPTQ quantizes layer by layer and uses second-order information (the Hessian of each layer's output reconstruction error, estimated from calibration inputs) to compensate for quantization error. After it quantizes a weight, it adjusts the not-yet-quantized weights in the same row to absorb the error that rounding introduced. This makes it more accurate than plain round-to-nearest group quantization at equivalent bit widths, at the cost of a slower quantization process.
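
The compensation step can be illustrated on a single weight row. The sketch below follows the simpler optimal-brain-quantization style update that GPTQ builds on, not GPTQ itself, which processes columns in blocks and works from a Cholesky factorisation for speed and stability. The function names, the fixed `scale`, and the tiny hand-picked Hessian are assumptions for illustration.

```python
import numpy as np

def quantize_row_compensated(w, H, scale, bits=4):
    """Quantize one weight row left to right, pushing each rounding error
    onto the weights that have not been quantized yet."""
    w = np.asarray(w, dtype=np.float64).copy()
    qmax = 2 ** (bits - 1) - 1
    Hinv = np.linalg.inv(np.asarray(H, dtype=np.float64))
    q = np.zeros_like(w)
    for i in range(w.size):
        q[i] = np.clip(np.round(w[i] / scale), -qmax, qmax) * scale
        err = (w[i] - q[i]) / Hinv[i, i]
        w[i + 1:] -= err * Hinv[i, i + 1:]                     # compensate remaining weights
        Hinv -= np.outer(Hinv[:, i], Hinv[i, :]) / Hinv[i, i]  # remove weight i from the problem
    return q

def output_error(w, q, H):
    """Squared error of the layer output, (w - q) H (w - q)^T."""
    e = np.asarray(w, dtype=np.float64) - q
    return float(e @ H @ e)

H = np.array([[1.0, 0.9], [0.9, 1.0]])  # two strongly correlated inputs
w = np.array([0.4, 0.4])
q_rtn = np.round(w)                     # round-to-nearest on an integer grid (scale = 1)
q_comp = quantize_row_compensated(w, H, scale=1.0)
print(q_rtn, output_error(w, q_rtn, H))    # [0. 0.], error about 0.608
print(q_comp, output_error(w, q_comp, H))  # [0. 1.], error about 0.088
```

Rounding both weights down loses their combined contribution. After the first weight is rounded down, the update nudges the second one up to 0.76, so it rounds to 1 and most of the output is recovered.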

AWQ takes a different approach. Rather than modifying weights after quantization, it identifies the 1% of weight channels that are most important based on activation patterns, then scales those channels up before quantization so they get finer-grained representation. The scaling is mathematically equivalent to scaling the corresponding activation down, so the model output is preserved while the important weights get more precision.
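
The equivalence that AWQ relies on is easy to verify numerically. In the sketch below the scale rule (mean activation magnitude raised to a fixed `alpha`) is a simplifying assumption; AWQ searches for its scales using calibration data. The helper names and synthetic data are likewise hypothetical.

```python
import numpy as np

def fake_quant_rows(W: np.ndarray, bits: int = 4) -> np.ndarray:
    """Quantize then dequantize each output row on a symmetric integer grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(W / scale), -qmax, qmax) * scale

def channel_scales(X: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Per-input-channel scale from mean activation magnitude (alpha is an assumed fixed value)."""
    return (np.abs(X).mean(axis=0) + 1e-8) ** alpha

rng = np.random.default_rng(0)           # arbitrary fixed seed
X = rng.normal(size=(256, 64))           # calibration activations: (samples, in_features)
X[:, :2] *= 20.0                         # two input channels carry much larger activations
W = rng.normal(0.0, 0.1, size=(32, 64))  # weights: (out_features, in_features)

s = channel_scales(X)
y_ref = X @ W.T
y_scaled = (X / s) @ (W * s).T           # scale weights up, activations down: same output

err_plain = float(np.mean((X @ fake_quant_rows(W).T - y_ref) ** 2))
err_scaled = float(np.mean(((X / s) @ fake_quant_rows(W * s).T - y_ref) ** 2))
print(f"output MSE plain: {err_plain:.4f}   with channel scaling: {err_scaled:.4f}")
```

Before quantization the two outputs agree to floating-point precision. Whether the quantized error actually drops depends on the scales chosen, which is why AWQ tunes them rather than fixing them in advance.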

Both produce models that can be loaded with libraries like AutoGPTQ or AutoAWQ and run on consumer GPUs. The tradeoff is that GPTQ and AWQ require a calibration dataset and can take hours to quantize a large model, while GGUF quantization from a float16 source takes minutes.

The Practical Takeaway

For local inference with llama.cpp, Q4_K_M is the right default for most 7B to 13B models. You get a model whose weights fit in roughly 4 to 8 GB of RAM, that runs at acceptable speed on CPU or GPU, and that produces output subjectively indistinguishable from the float16 original on most tasks. Q5_K_M is worth the extra memory if you are doing extended reasoning or math, where the additional bit per weight can make a measurable difference.
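
If you would rather let the memory budget decide, the choice reduces to a lookup. The sketch below is a hypothetical helper that picks the highest-quality format from the Llama 2 7B table that fits; the 1 GB default headroom for context and runtime buffers is an assumption, not a measured figure.

```python
from typing import Optional

# (format, approximate size in GB) for Llama 2 7B, copied from the table above, best quality first
FORMATS = [("F16", 13.5), ("Q8_0", 7.2), ("Q6_K", 5.5), ("Q5_K_M", 4.8),
           ("Q4_K_M", 4.1), ("Q3_K_M", 3.3), ("Q2_K", 2.9)]

def pick_format(memory_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the first (highest-quality) format whose weights fit in the budget."""
    budget = memory_gb - headroom_gb
    for name, size_gb in FORMATS:
        if size_gb <= budget:
            return name
    return None

print(pick_format(8.0))  # Q6_K
print(pick_format(6.0))  # Q5_K_M
print(pick_format(3.0))  # None
```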

The deeper point is that these are not arbitrary compression settings. The naming encodes specific engineering decisions about where precision matters and where it can be sacrificed. Understanding the arithmetic behind those decisions makes the tradeoffs legible rather than opaque, and that legibility matters when you are trying to squeeze a capable model onto real hardware.