
From Float32 to Four Bits: The Engineering Behind LLM Quantization

Source: simonwillison

Running a 7-billion-parameter model at full float32 precision requires roughly 28 GB of memory: 7 billion parameters multiplied by 4 bytes each. Move to bfloat16 and that drops to 14 GB. At int8 it is 7 GB, and at int4 it is 3.5 GB. That arithmetic is why quantization is not a curiosity but a practical requirement for running large language models on consumer hardware.
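The arithmetic above generalizes to any parameter count and bit width. A minimal sketch, assuming GB means 10^9 bytes and counting only the weights (no activations or KV cache); `model_memory_gb` is a name chosen here for illustration:

```python
def model_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9

# Reproduce the figures for a 7-billion-parameter model.
for name, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:>8}: {model_memory_gb(7e9, bits):.1f} GB")
```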

Simon Willison’s recent piece on quantization from the ground up walks through the mathematical foundation of how this works. It is worth building on that foundation with the practical engineering details of how quantization gets deployed across the most common formats today, and where the quality losses actually come from.

What Quantization Does to a Weight Tensor

A neural network’s parameters are stored as floating-point numbers. In training, these are usually bfloat16 or float32, formats that cover a wide dynamic range with many more distinct values than a small integer can hold. Quantization maps these values to a lower-precision integer representation, typically int8 (256 possible values) or int4 (16 possible values).

The simplest version is absmax quantization. Take a weight tensor, find the absolute maximum value, and use that to scale all values into the range [-127, 127] for int8:

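import numpy as np

def absmax_quantize(weights: np.ndarray) -> tuple[np.ndarray, float]:
    # Scale so the largest magnitude maps to 127.
    # An all-zero tensor would give scale 0, so fall back to 1.0 in that case.
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = np.round(weights / scale).astype(np.int8)
    return quantized, scale

def absmax_dequantize(quantized: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float values; the rounding error is not recoverable.
    return quantized.astype(np.float32) * scale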

This is simple in concept but inherently lossy. Any value in the original tensor that falls between two quantization steps gets rounded to the nearest one, so dequantizing never recovers the original exactly. For int8, the steps are fine enough that errors are small. For int4, you have only 16 buckets across the entire range of a tensor, so each step is far wider and the rounding errors are correspondingly larger.
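The difference between the two bit widths is easy to measure. The sketch below quantizes and dequantizes a synthetic tensor at 8 and 4 bits and reports the mean absolute error. The weights are random normal values with a standard deviation of 0.02, which is an assumption made for illustration and not data from a real model.

```python
import numpy as np

def absmax_roundtrip(weights: np.ndarray, bits: int) -> np.ndarray:
    """Quantize to a symmetric signed grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for int8, 7 for int4
    scale = np.max(np.abs(weights)) / qmax
    codes = np.clip(np.round(weights / scale), -qmax, qmax)
    return (codes * scale).astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # synthetic weights

for bits in (8, 4):
    err = np.abs(w - absmax_roundtrip(w, bits)).mean()
    print(f"int{bits}: mean abs error = {err:.6f}")
```

The int4 grid has steps roughly 18 times wider than the int8 grid (127 / 7), and the error grows accordingly.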

The real problem with per-tensor absmax quantization is outliers. A single large weight can push the scale factor high enough that the majority of values, which cluster near zero, all collapse to the same quantized integer. You lose most of the information in the tensor trying to preserve its extremes.
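A small experiment makes the collapse concrete. This sketch quantizes the same synthetic tensor to int4 twice, once as is and once with a single extreme value planted in it, and counts how many distinct integer codes are actually used. The value 5.0 and the 0.01 standard deviation are hypothetical choices for illustration.

```python
import numpy as np

def int4_absmax_codes(weights: np.ndarray) -> np.ndarray:
    """Per-tensor absmax quantization onto the symmetric grid [-7, 7]."""
    scale = np.max(np.abs(weights)) / 7.0
    return np.round(weights / scale).astype(np.int8)

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 0.01, size=1024).astype(np.float32)
with_outlier = clean.copy()
with_outlier[0] = 5.0                      # one hypothetical extreme value

for name, w in [("clean", clean), ("with outlier", with_outlier)]:
    codes = int4_absmax_codes(w)
    print(f"{name}: {np.unique(codes).size} distinct codes, "
          f"{np.count_nonzero(codes)} non-zero values")
```

With the outlier present, every other value is far smaller than half a quantization step, so all of them round to zero and only the outlier survives.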

Block Quantization and Why It Fixes the Outlier Problem

The solution is to not quantize the entire tensor with a single scale factor. Instead, divide it into blocks, typically of 32 or 64 values, and compute a separate scale factor for each block. This localizes the damage from outliers: an extreme value only degrades precision within its own block, not across the whole layer.
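A minimal sketch of block-wise absmax quantization follows. It assumes the tensor length is a multiple of the block size and keeps the scales as a separate float32 array; the function names are chosen here for illustration and do not correspond to any library API.

```python
import numpy as np

def blockwise_absmax_quantize(weights: np.ndarray, block_size: int = 32, bits: int = 4):
    """Quantize each block of `block_size` values with its own absmax scale."""
    qmax = 2 ** (bits - 1) - 1
    blocks = weights.reshape(-1, block_size)
    scales = np.max(np.abs(blocks), axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)       # guard all-zero blocks
    codes = np.clip(np.round(blocks / scales), -qmax, qmax).astype(np.int8)
    return codes, scales.astype(np.float32)

def blockwise_dequantize(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (codes.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.01, size=1024).astype(np.float32)
w[0] = 5.0                                            # hypothetical outlier

# A block size equal to the tensor length is the per-tensor case.
per_tensor = blockwise_dequantize(*blockwise_absmax_quantize(w, block_size=w.size))
per_block = blockwise_dequantize(*blockwise_absmax_quantize(w, block_size=32))

print("per-tensor mean abs error:", np.abs(w - per_tensor).mean())
print("per-block  mean abs error:", np.abs(w - per_block).mean())
```

Only the block that contains the outlier loses its small values; the other blocks keep a scale that matches their own range.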

This is the core technique behind the GGUF quantization levels used by llama.cpp. A format labeled Q4_0 uses 4-bit weights with a block size of 32 and one scale per block. Q4_1 adds a per-block offset (the block minimum) alongside the scale, giving asymmetric quantization that handles non-centered distributions more accurately.
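The symmetric versus asymmetric distinction can be illustrated with a short sketch. The asymmetric version below stores a minimum and a scale per block and uses unsigned codes from 0 to 15. It is a simplified illustration of the idea, not the GGUF on-disk layout, and the test data (all-positive values) is a deliberately off-center assumption.

```python
import numpy as np

def asymmetric_block_quantize(weights: np.ndarray, block_size: int = 32):
    """4-bit codes in [0, 15] with a per-block minimum and scale."""
    blocks = weights.reshape(-1, block_size)
    mins = blocks.min(axis=1, keepdims=True)
    scales = (blocks.max(axis=1, keepdims=True) - mins) / 15.0
    scales = np.where(scales == 0, 1.0, scales)       # guard constant blocks
    codes = np.clip(np.round((blocks - mins) / scales), 0, 15).astype(np.uint8)
    return codes, scales, mins

def asymmetric_block_dequantize(codes, scales, mins) -> np.ndarray:
    return (codes * scales + mins).reshape(-1)

def symmetric_block_roundtrip(weights: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Baseline: symmetric absmax per block on the grid [-7, 7]."""
    blocks = weights.reshape(-1, block_size)
    scales = np.max(np.abs(blocks), axis=1, keepdims=True) / 7.0
    return (np.clip(np.round(blocks / scales), -7, 7) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.uniform(0.5, 1.0, size=1024).astype(np.float32)   # non-centered values

asym = asymmetric_block_dequantize(*asymmetric_block_quantize(w))
sym = symmetric_block_roundtrip(w)
print("symmetric  mean abs error:", np.abs(w - sym).mean())
print("asymmetric mean abs error:", np.abs(w - asym).mean())
```

The symmetric grid spends half of its codes on negative values that never occur in this data, while the offset lets all sixteen codes cover the range the block actually occupies.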

The K-quant formats (Q4_K_M, Q5_K_S, Q6_K, and so on) go further. They use a mixed strategy where the tensors most sensitive to rounding error, such as some of the attention and feed-forward projections, get slightly higher precision while less critical weights get compressed more aggressively. The “S” and “M” suffixes in formats like Q4_K_S and Q4_K_M stand for “small” and “medium”. They describe the size of the resulting file, which reflects how much of the mixed-precision budget goes to the higher-precision tensors.

A rough memory guide for a 7B parameter model:

Format    Size       Notes
F32       ~28 GB     Training precision
BF16      ~14 GB     Typical inference
Q8_0      ~7 GB      Near-lossless
Q5_K_M    ~5 GB      Good quality/size tradeoff
Q4_K_M    ~4.1 GB    Common default choice
Q3_K_M    ~3.3 GB    Noticeable quality drop
Q2_K      ~2.7 GB    Significant degradation

These numbers reflect actual model files distributed through Hugging Face and Ollama’s model library. The Q4_K_M format has become a reliable default for running models on machines with 8 to 16 GB of unified memory.
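One reason 4-bit files come out larger than the naive 3.5 GB figure is that every block carries its own metadata. The sketch below works through that overhead, assuming blocks of 32 weights with a 16-bit scale and, for the offset variant, a second 16-bit value. Those metadata sizes are assumptions for illustration; the K-quant formats use a more elaborate layout that this does not model.

```python
def bits_per_weight(weight_bits: int, block_size: int, metadata_bits: int) -> float:
    """Effective bits per weight once per-block metadata is amortized."""
    return weight_bits + metadata_bits / block_size

def file_size_gb(n_params: float, bpw: float) -> float:
    return n_params * bpw / 8 / 1e9

layouts = [
    ("4-bit, scale only", 4, 16),
    ("4-bit, scale + offset", 4, 32),
    ("8-bit, scale only", 8, 16),
]
for label, weight_bits, metadata_bits in layouts:
    bpw = bits_per_weight(weight_bits, 32, metadata_bits)
    print(f"{label:<22} {bpw:.2f} bits/weight, "
          f"{file_size_gb(7e9, bpw):.2f} GB for 7B parameters")
```

Smaller blocks give tighter scales but raise this overhead, which is the trade-off behind the choice of block size.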

GPTQ and AWQ: Using a Calibration Dataset

GGUF-style quantization is applied purely from the weight values themselves. GPTQ and AWQ take a different approach: they use a small calibration dataset to guide the quantization process.

GPTQ (post-training quantization for generative pre-trained transformers) quantizes weights layer by layer. For each layer it builds an approximate Hessian of the layer’s output reconstruction error with respect to the weights, estimated from the calibration inputs, which indicates how sensitive the output is to rounding error in each weight. Weights are then quantized a column at a time, and the error each step introduces is partially compensated by adjusting the not-yet-quantized weights in the same layer. This produces better quality at the same bit width compared to naive absmax quantization, at the cost of requiring a calibration run over a representative text corpus.
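The error-compensation idea can be sketched on a toy layer. The code below is a simplified, unoptimized illustration of the underlying update rule: quantize one input column at a time and spread its rounding error over the remaining columns using the inverse Hessian. It is not the GPTQ implementation, which adds batching, a Cholesky reformulation, and grouping. The layer sizes, the damping value, and the random calibration inputs are all assumptions.

```python
import numpy as np

def round_to_grid(w, scale, qmax):
    """Symmetric round-to-nearest onto {-qmax..qmax} * scale."""
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def compensated_quantize(W, X, bits=4, damp=0.01):
    """W: (d_out, d_in) weights. X: (d_in, n_samples) calibration inputs."""
    W = W.astype(np.float64).copy()
    d_in = W.shape[1]
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax   # one scale per output row
    H = 2.0 * X @ X.T                                     # Hessian of ||WX - QX||^2
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)        # damping for stability
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for i in range(d_in):
        Q[:, i:i+1] = round_to_grid(W[:, i:i+1], scale, qmax)
        err = (W[:, i:i+1] - Q[:, i:i+1]) / Hinv[i, i]
        W[:, i+1:] -= err @ Hinv[i:i+1, i+1:]             # compensate later columns
        # Remove column i from the inverse Hessian before the next step.
        Hinv[i+1:, i+1:] -= np.outer(Hinv[i+1:, i], Hinv[i, i+1:]) / Hinv[i, i]
    return Q, scale

rng = np.random.default_rng(0)
mix = rng.normal(size=(16, 16))
X = mix @ rng.normal(size=(16, 256))          # correlated calibration inputs
W = rng.normal(0.0, 0.02, size=(8, 16))

Q, scale = compensated_quantize(W, X)
naive = round_to_grid(W, scale, 7)
print("naive output error:      ", np.linalg.norm(W @ X - naive @ X))
print("compensated output error:", np.linalg.norm(W @ X - Q @ X))
```

When the calibration inputs are correlated, the compensated version usually reconstructs the layer output more closely than plain rounding, although this toy version does not guarantee it for every input.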

AWQ (Activation-aware Weight Quantization) observes that roughly 1% of weights, those corresponding to large activation magnitudes, are disproportionately responsible for model quality. AWQ protects those weights by scaling their input channels up before quantization and scaling the matching activations down by the same factor. The layer’s full-precision output is unchanged, but the important weights suffer less relative rounding error, without any increase in average bit width. AWQ models often run faster than GPTQ models on hardware with INT4 kernel support, which is usually attributed to lower dequantization overhead at inference time, though the gap depends heavily on the kernels in use.
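The scaling trick rests on a simple identity: multiplying an input channel of the weights by s and dividing the matching activation by s leaves the output unchanged until rounding is applied. The sketch below illustrates that with a fixed exponent `alpha = 0.5`, which is an assumption here; AWQ searches for the scaling per layer. The function names and the synthetic data, including the two high-activation channels, are hypothetical.

```python
import numpy as np

def awq_style_scales(X: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Per-input-channel scale from mean activation magnitude. X: (n_tokens, d_in)."""
    magnitude = np.abs(X).mean(axis=0)
    return np.maximum(magnitude, 1e-8) ** alpha

def quantize_columns_int4(W: np.ndarray) -> np.ndarray:
    """Round-to-nearest int4 with one absmax scale per output column. W: (d_in, d_out)."""
    scale = np.abs(W).max(axis=0, keepdims=True) / 7.0
    return np.clip(np.round(W / scale), -7, 7) * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 64))
X[:, :2] *= 30.0                           # two hypothetical high-activation channels
W = rng.normal(0.0, 0.02, size=(64, 32))

s = awq_style_scales(X)
reference = X @ W
plain = X @ quantize_columns_int4(W)
scaled = (X / s) @ quantize_columns_int4(W * s[:, None])

print("plain int4 output error: ", np.abs(reference - plain).mean())
print("scaled int4 output error:", np.abs(reference - scaled).mean())
```

The rows of `W` that meet large activations are enlarged before rounding, so they land on the grid with less relative error, at the price of slightly coarser steps for the remaining rows.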

Pre-quantized GPTQ checkpoints and on-the-fly bitsandbytes quantization both integrate cleanly with the Hugging Face ecosystem:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load a GPTQ-quantized model via auto-gptq
from auto_gptq import AutoGPTQForCausalLM

gptq_model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7B-GPTQ",
    device="cuda:0",
    use_triton=True,
)

# Or use bitsandbytes for on-the-fly 4-bit loading.
# The bnb_4bit_* options are passed through a BitsAndBytesConfig object.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
nf4_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)

NF4: Exploiting the Normal Distribution of Weights

The bitsandbytes library introduced NF4 (Normal Float 4-bit) for QLoRA fine-tuning. Neural network weights after training converge toward a roughly normal distribution, and NF4 exploits this. Instead of uniform integer steps, NF4 uses quantization levels spaced to match the quantiles of a standard normal distribution. More levels cluster near zero where most weight values live, and fewer levels cover the sparse tails.

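NF4 is not the same as int4. Each block of weights is first normalized by its absolute maximum into the range [-1, 1], and the sixteen quantization levels are: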

[-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848,
 -0.0911,  0.0,   0.0796,  0.1609,  0.2461,  0.3379,
  0.4407,  0.5626, 0.7230,  1.0]

Each weight is mapped to its nearest value in this set and stored as a 4-bit index. This makes NF4 a form of non-uniform quantization rather than uniform integer quantization. It performs better than int4 for weights that genuinely follow a normal distribution, which is most of them once training has converged.
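The nearest-value lookup can be sketched with NumPy, using the sixteen values listed above as the codebook. The block size of 64 and the function names are assumptions for illustration, and the indices are kept one per byte instead of being packed two per byte as a real implementation would.

```python
import numpy as np

NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848,
    -0.0911, 0.0, 0.0796, 0.1609, 0.2461, 0.3379,
    0.4407, 0.5626, 0.7230, 1.0,
], dtype=np.float32)

def nf4_quantize(weights: np.ndarray, block_size: int = 64):
    """Return 4-bit codebook indices and one absmax per block."""
    blocks = weights.reshape(-1, block_size)
    absmax = np.max(np.abs(blocks), axis=1, keepdims=True)
    absmax = np.where(absmax == 0, 1.0, absmax)       # guard all-zero blocks
    normalized = blocks / absmax                      # now within [-1, 1]
    distances = np.abs(normalized[..., None] - NF4_LEVELS)
    indices = distances.argmin(axis=-1).astype(np.uint8)
    return indices, absmax

def nf4_dequantize(indices: np.ndarray, absmax: np.ndarray) -> np.ndarray:
    return (NF4_LEVELS[indices] * absmax).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)   # synthetic weights
indices, absmax = nf4_quantize(w)
restored = nf4_dequantize(indices, absmax)
print("mean abs error:", np.abs(w - restored).mean())
```

Because the levels are unevenly spaced, the lookup is a nearest-neighbor search over the codebook and cannot be done with a single divide and round.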

Where Quality Actually Degrades

The standard metric for evaluating quantization quality is perplexity on a held-out text corpus, usually WikiText-2 or C4. Lower perplexity means the model assigns higher probability to the actual next token. The pattern across formats is consistent: Q8 quantization is nearly indistinguishable from full bfloat16 inference. Q6 and Q5 are close. Q4_K_M starts to show a measurable perplexity increase, and the loss tends to surface first on reasoning-heavy tasks. Q3 and Q2 degrade noticeably on most benchmarks.
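Perplexity itself is a short calculation: the exponential of the average negative log-probability the model assigned to each correct token. A minimal sketch, assuming natural-log probabilities have already been collected from the model; the uniform-guess input is a made-up sanity check.

```python
import numpy as np

def perplexity(token_log_probs: np.ndarray) -> float:
    """exp(mean negative log-likelihood) over the evaluated tokens."""
    return float(np.exp(-np.mean(token_log_probs)))

# Sanity check: a model that spreads probability evenly over 50 candidate
# tokens at every position has a perplexity of 50.
uniform_guess = np.full(1000, np.log(1.0 / 50.0))
print(perplexity(uniform_guess))
```

Comparing formats then amounts to running the same corpus through each quantized model and comparing the resulting numbers.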

The degradation is not uniform across model sizes. Larger models quantize more gracefully than smaller ones because they have redundancy to absorb rounding errors. A 70B model at Q4 will often outperform a 7B model at Q8 on benchmark tasks, not because quantization is helping, but because the base model carries substantially more representational capacity.

Context length also matters in ways that are easy to overlook. Quantization errors tend to compound over long sequences, so a Q4 model may be adequate for short completions but noticeably worse than Q8 for inputs spanning many thousands of tokens. This is worth measuring directly for your specific use case rather than assuming the short-context benchmark numbers transfer.

Picking a Format

For CPU or Apple Silicon inference via Ollama or llama.cpp, Q4_K_M is a solid starting point. It fits most 7B models into 8 GB of memory with acceptable quality, and the K-quant mixed-precision strategy gives it better per-layer quality than Q4_0 at essentially the same size. Step up to Q5_K_M if you have the memory headroom and care about quality on complex reasoning tasks.

For GPU inference in Python, bitsandbytes with NF4 is the easiest path for loading large models on consumer GPUs without requiring a pre-quantized checkpoint. AWQ models from Hugging Face Hub are worth trying if you want better inference throughput without a quality penalty over NF4.

For production deployment on NVIDIA hardware, GPTQ with ExLlamaV2 kernels currently offers strong throughput for 4-bit inference, though this space moves quickly and the tooling around AWQ has been closing the gap.

The mathematical foundation underneath all of this is a rounding problem. The engineering challenge is in the details: how you partition tensors into blocks, how you order the quantization decisions, and what you know about the statistical structure of the weights you are compressing. Getting those details right is the difference between Q4 models that are useful and Q4 models that are not.
