
What Your Computer Is Actually Doing When It Loads a 4-bit Model

Source: simonwillison

Running a large language model on a laptop was not supposed to be possible. A 7-billion-parameter model with 32-bit weights occupies roughly 28 GB of memory. Most consumer hardware has nowhere near that. Quantization is why it works anyway, and Simon Willison’s walkthrough prompted me to think through what the technique is actually doing at each step, because the high-level description, “we store weights in fewer bits,” glosses over several genuinely interesting engineering decisions.

The core idea in one paragraph

Neural network weights are floating-point numbers. During training they are typically stored as 32-bit floats (FP32), or increasingly as 16-bit (BF16 or FP16). Quantization maps those continuous values onto a smaller set of discrete values, most commonly 8-bit integers (INT8) or 4-bit integers (INT4). The mapping involves two parameters: a scale factor and a zero point. Given those, you can convert between the quantized integer representation and an approximation of the original float.

The formula is straightforward:

original ≈ scale × (quantized - zero_point)

And the inverse:

quantized = round(original / scale) + zero_point

For symmetric quantization, zero_point is zero, and you only need the scale. For asymmetric quantization, you track both, which handles weight distributions that are not centered around zero.
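As a minimal sketch of the two formulas above, here is an asymmetric round trip onto the unsigned 8-bit range. The function names and the choice of `uint8` are assumptions made for illustration, not any particular library's API.

```python
import numpy as np

def affine_quantize_uint8(x: np.ndarray):
    """Asymmetric (affine) quantization of a float array onto [0, 255]."""
    x_min, x_max = float(x.min()), float(x.max())
    # Scale: the size of one integer step in float units.
    # A constant tensor would give 0, so fall back to 1.0.
    scale = (x_max - x_min) / 255.0 or 1.0
    # Zero point: the integer that represents the float value 0.0.
    zero_point = int(round(-x_min / scale))
    # quantized = round(original / scale) + zero_point
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def affine_dequantize(q: np.ndarray, scale: float, zero_point: int):
    # original ≈ scale × (quantized - zero_point)
    return scale * (q.astype(np.float32) - zero_point)

x = np.linspace(-0.2, 1.0, 1000).astype(np.float32)  # not centered on zero
q, scale, zero_point = affine_quantize_uint8(x)
x_hat = affine_dequantize(q, scale, zero_point)
print("max round-trip error:", float(np.max(np.abs(x - x_hat))))
```

The round-trip error stays within about half a quantization step, which is the best a uniform grid can do.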

Why floating point is overkill for inference

During training you need high-precision gradients because small differences accumulate across millions of update steps. During inference you are just doing a forward pass: multiply weights by activations, add biases, apply nonlinearities. The model is not learning anything. The question is how much precision it actually needs for its outputs to remain useful.

The answer depends on the layer and the model architecture, but a consistent empirical finding is that most weights in a trained transformer carry redundant precision. The important information is concentrated in the relative magnitudes, not in the exact floating-point values. This is what makes aggressive quantization viable.

There is a useful way to think about this: a weight that is 0.314159 and a weight that is 0.31 are almost certainly doing the same job in the network. The error introduced by rounding is usually smaller than the noise present in the training data that produced those weights in the first place.

Absmax quantization: the simplest possible scheme

The most naive approach is absmax quantization. Take all the values in a tensor, find the absolute maximum, and use it as your scale factor:

import numpy as np

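def absmax_quantize_int8(weights: np.ndarray):
    # One scale for the whole tensor: the largest magnitude maps to ±127.
    scale = float(np.max(np.abs(weights))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: avoid dividing by zero
    quantized = np.round(weights / scale).astype(np.int8)
    return quantized, scale

def dequantize(quantized: np.ndarray, scale: float):
    # Recover an approximation of the original floats.
    return quantized.astype(np.float32) * scale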

This maps the range [-max, max] uniformly onto [-127, 127]. The problem is that transformer weight distributions are not uniform. They typically have a tall peak near zero with a few large outliers. If your maximum value is 15.0 but 99% of your weights are between -0.5 and 0.5, the scale is about 0.118, so those weights all land on the nine integers from -4 to 4. Almost all of the 8-bit range is spent on values that almost never appear, and the nuance the model learned near zero is destroyed.
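To make the outlier effect concrete, the sketch below counts how many distinct INT8 values a tensor actually uses before and after a single large outlier is planted in it. The synthetic weights (normal, standard deviation 0.2) and the helper name are assumptions chosen for illustration.

```python
import numpy as np

def int8_levels_used(weights: np.ndarray) -> int:
    """Count the distinct int8 codes produced by per-tensor absmax quantization."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return int(np.unique(q).size)

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.2, size=10_000).astype(np.float32)

before = int8_levels_used(weights)  # no outlier: most of the grid is in use
weights[0] = 15.0                   # plant one hypothetical outlier
after = int8_levels_used(weights)   # the bulk collapses onto a few codes

print("levels used without outlier:", before)
print("levels used with outlier:   ", after)
```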

Block-wise quantization fixes the outlier problem

The solution used by bitsandbytes and most modern quantization libraries is to compute scale factors not over the entire tensor but over small blocks, typically 64 or 128 consecutive values. Each block gets its own scale, so a layer with a few outliers in one region does not contaminate the precision of the rest.

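def blockwise_quantize(weights: np.ndarray, block_size: int = 64):
    # Assumes weights.size is a multiple of block_size; pad beforehand if not.
    blocks = weights.reshape(-1, block_size)
    # One absmax scale per block (row), so an outlier only affects its own block.
    scales = np.max(np.abs(blocks), axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero blocks
    quantized = np.round(blocks / scales).astype(np.int8)
    return quantized, scales.flatten()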

The overhead is storing one scale per block. For 64-element blocks with FP32 scales, the overhead is 4 bytes per 64 bytes of INT8 data, about 6.25%. For INT4 data it is 4 bytes per 32 bytes, 12.5%. This is why “4-bit quantization” in practice costs closer to 4.5 or 5 bits per weight than a clean 4, and why the compression ratio relative to FP16 falls short of 4x.
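The overhead arithmetic can be written as a small helper. This is a back-of-the-envelope sketch: the function names are made up here, and real formats may add further metadata beyond one scale per block.

```python
def bits_per_weight(weight_bits: int, block_size: int, scale_bits: int) -> float:
    """Effective storage cost per weight when each block carries one scale."""
    return weight_bits + scale_bits / block_size

def model_size_gb(n_params: float, bpw: float) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return n_params * bpw / 8 / 1e9

print(bits_per_weight(8, 64, 32))   # INT8, 64-element blocks, FP32 scale
print(bits_per_weight(4, 64, 32))   # INT4, 64-element blocks, FP32 scale
print(model_size_gb(7e9, 32))       # 7B parameters in FP32
print(model_size_gb(7e9, bits_per_weight(4, 64, 32)))
```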

NF4: matching the quantization grid to the data distribution

NF4 (4-bit NormalFloat) was introduced in the QLoRA paper (Dettmers et al., 2023) and implemented in bitsandbytes, and it is a clever idea. Instead of using uniformly spaced quantization levels, NF4 places levels at quantiles of a normal distribution. Since pretrained weights are approximately normally distributed, this puts more quantization levels where there is more data, and fewer where there is less.

The 16 NF4 values are precomputed constants:

NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
     0.07958029955625534,  0.16093020141124725,  0.24611230194568634,
     0.33791524171829224,  0.44070982933044434,  0.5626170039176941,
     0.7229568362236023,   1.0
]

Quantizing to NF4 means scaling each block by its absolute maximum so the values fall in [-1, 1], then finding the closest value in this lookup table. Dequantizing is a lookup followed by a multiplication by the stored block scale. The result is that NF4 typically loses less information than uniform INT4 for normally distributed weights, because the levels are densest where the weights are.
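A nearest-level lookup over one block might look like the sketch below. It reuses the 16 constants listed above. The function names and the per-block absmax normalization are written here for illustration, and this is not the bitsandbytes kernel, which also packs two codes per byte.

```python
import numpy as np

NF4 = np.array([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
], dtype=np.float32)

def nf4_quantize_block(block: np.ndarray):
    """Return one 4-bit code per value plus the block's absmax scale."""
    absmax = float(np.max(np.abs(block))) or 1.0  # guard all-zero blocks
    normalized = block / absmax                   # now within [-1, 1]
    # Index of the nearest NF4 level for every element.
    codes = np.argmin(np.abs(normalized[:, None] - NF4[None, :]), axis=1)
    return codes.astype(np.uint8), absmax

def nf4_dequantize_block(codes: np.ndarray, absmax: float):
    # Table lookup, then undo the normalization.
    return NF4[codes] * absmax

block = np.array([0.0, 0.5, -1.0, 2.0], dtype=np.float32)
codes, absmax = nf4_quantize_block(block)
print(codes, nf4_dequantize_block(codes, absmax))
```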

GGUF and llama.cpp: a different philosophy

Where bitsandbytes operates at the Python/CUDA layer and is tightly coupled to PyTorch, llama.cpp implements its own quantization stack in C and C++, with formats defined in the GGUF file format. The GGUF quantization types have evolved considerably and offer more granular choices.

The naming convention tells you the scheme:

  • Q4_0: 4-bit, 32-element blocks, no zero point
  • Q4_1: 4-bit, 32-element blocks, with zero point (min value stored per block)
  • Q4_K_M: 4-bit K-quants, “medium” mixed precision
  • Q5_K_M: 5-bit K-quants, medium
  • Q8_0: 8-bit, 32-element blocks
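As a rough illustration of what a Q4_0-style block holds, the sketch below packs 32 floats into one 16-bit scale plus 16 bytes of 4-bit codes, which works out to 18 bytes or 4.5 bits per weight. It is loosely modeled on the scheme described above. The function names are hypothetical, and the result is not guaranteed to be bit-exact with llama.cpp.

```python
import numpy as np

BLOCK = 32  # elements per block

def q4_0_like_pack(block: np.ndarray):
    """Pack 32 floats into (fp16 scale, 16 bytes holding two 4-bit codes each)."""
    assert block.shape == (BLOCK,)
    # Use the largest-magnitude value, keeping its sign, so it maps to code 0.
    signed_max = block[int(np.argmax(np.abs(block)))]
    d = np.float16(signed_max / -8.0)
    inv = 0.0 if d == 0 else 1.0 / float(d)
    # Codes are stored with an offset of 8, so they fit in 0..15.
    q = np.clip(np.round(block * inv) + 8, 0, 15).astype(np.uint8)
    # First half of the block in the low nibbles, second half in the high nibbles.
    packed = q[:16] | (q[16:] << 4)
    return d, packed

def q4_0_like_unpack(d, packed: np.ndarray):
    lo = (packed & 0x0F).astype(np.int16) - 8
    hi = (packed >> 4).astype(np.int16) - 8
    return np.concatenate([lo, hi]).astype(np.float32) * np.float32(d)

rng = np.random.default_rng(1)
block = rng.normal(size=BLOCK).astype(np.float32)
d, packed = q4_0_like_pack(block)
restored = q4_0_like_unpack(d, packed)
print("bits per weight:", (packed.nbytes + 2) * 8 / BLOCK)
```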

The K-quant variants (introduced in llama.cpp in mid-2023) are more sophisticated. They use super-blocks: a 256-element super-block is divided into smaller sub-blocks (16 or 32 elements, depending on the type). Scale factors for the sub-blocks are themselves quantized, to 6 bits in Q4_K for example, rather than stored as full floats, with one higher-precision scale per super-block. Additionally, the _S, _M, and _L suffixes denote mixed-precision recipes across the model: selected tensors that are more sensitive to quantization error, such as some of the attention value projections and feed-forward output projections, are stored in a higher-precision type than the rest.
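The storage cost of a super-block can be estimated with simple arithmetic. The sketch below assumes a Q4_K-like layout: 256 weights at 4 bits, eight 32-element sub-blocks each with a 6-bit scale and a 6-bit minimum, and a 16-bit scale and 16-bit minimum per super-block. Treat these parameters as assumptions and check them against the llama.cpp source before relying on them.

```python
def super_block_bits_per_weight(
    super_block: int = 256,      # weights per super-block
    sub_block: int = 32,         # weights per sub-block (assumed)
    weight_bits: int = 4,
    sub_scale_bits: int = 6,     # quantized per-sub-block scale
    sub_min_bits: int = 6,       # quantized per-sub-block minimum (assumed)
    super_scale_bits: int = 16,  # fp16 scale for the sub-block scales (assumed)
    super_min_bits: int = 16,    # fp16 scale for the sub-block minimums (assumed)
) -> float:
    n_sub = super_block // sub_block
    total_bits = (
        super_block * weight_bits
        + n_sub * (sub_scale_bits + sub_min_bits)
        + super_scale_bits
        + super_min_bits
    )
    return total_bits / super_block

print(super_block_bits_per_weight())
```

Under these assumptions the per-tensor cost is 4.5 bits per weight. A mixed recipe such as Q4_K_M averages higher across the whole file because some tensors are stored in a wider type.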

For most users, Q4_K_M has become the default recommendation because it sits at a good point on the quality/size curve. A 7B model in Q4_K_M is around 4.1 GB, fits comfortably in 6 GB of VRAM, and typically lands within a few tenths of a perplexity point of the FP16 baseline on standard benchmarks.

GPTQ: post-training quantization with second-order information

GPTQ (Frantar et al., 2022) takes a different approach. Rather than rounding each weight independently, it processes one layer at a time and minimizes the quantization error at the layer’s output. It uses second-order information, a Hessian of that layer-wise reconstruction error estimated from calibration data, to decide how the error from each rounded weight should be absorbed by the weights that have not been quantized yet.

The core insight is that weights are not equally sensitive to quantization. Some weights sit in directions of high curvature, meaning small errors in them cause large changes in the output. GPTQ quantizes the weights of a layer in a fixed order, a column at a time, and after each step adjusts the remaining unquantized weights to compensate for the error just introduced.

This makes GPTQ quantization slower to perform (you need calibration data and a non-trivial optimization) but often produces better results at the same bit width compared to round-to-nearest approaches. The AutoGPTQ library is a widely used implementation.
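The error-propagation idea can be sketched for a single weight row. This is a simplified illustration in the spirit of GPTQ, not the AutoGPTQ API and not the paper's optimized algorithm: it inverts the Hessian directly instead of using a Cholesky factorization, uses one fixed scale, and all names are hypothetical.

```python
import numpy as np

def rtn_quantize(w, scale: float, qmax: int = 7):
    """Plain round-to-nearest onto a symmetric integer grid."""
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def error_compensated_quantize(w, X, scale: float, qmax: int = 7, damp: float = 0.01):
    """Quantize one weight row left to right, pushing each rounding error
    onto the weights that have not been quantized yet.
    w: (d,) weights, X: (n_samples, d) calibration inputs."""
    w = w.astype(np.float64).copy()
    d = w.size
    # Hessian of the layer-wise squared error (up to a constant), with damping.
    H = X.T @ X
    H = H + damp * np.mean(np.diag(H)) * np.eye(d)
    Hinv = np.linalg.inv(H)
    q = np.zeros(d)
    for i in range(d):
        q[i] = rtn_quantize(w[i], scale, qmax)
        err = (w[i] - q[i]) / Hinv[i, i]
        # Adjust the remaining weights to compensate for this rounding error.
        w[i + 1:] -= err * Hinv[i, i + 1:]
        # Remove weight i from the inverse Hessian of the remaining problem.
        Hinv[i + 1:, i + 1:] -= np.outer(Hinv[i + 1:, i], Hinv[i, i + 1:]) / Hinv[i, i]
    return q

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))
w = rng.normal(size=16) * 0.5
q_rtn = rtn_quantize(w, scale=0.1)
q_comp = error_compensated_quantize(w, X, scale=0.1)
print("output error, round-to-nearest:", np.linalg.norm(X @ (w - q_rtn)))
print("output error, compensated:     ", np.linalg.norm(X @ (w - q_comp)))
```

The comparison printed at the end measures error at the layer's output rather than on the weights themselves, which is the quantity this family of methods tries to reduce.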

AWQ: paying attention to activation magnitudes

AWQ (Activation-aware Weight Quantization) (Lin et al., 2023) observes that a small fraction of weight channels are much more important than others, and their importance correlates with the magnitude of the activations that pass through them. Rather than protecting those channels by keeping them at higher precision (which complicates the hardware implementation), AWQ scales up those important channels before quantization and scales down the activations correspondingly. The network’s output is unchanged, but the important weights now span a larger portion of the quantization range and lose less information.
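The rescaling trick can be checked numerically. In the sketch below the per-channel scale is the mean activation magnitude raised to a fixed exponent `alpha`. That choice, along with the function name, is an assumption for illustration, and the published method searches for its scaling factors rather than fixing them.

```python
import numpy as np

def awq_like_rescale(W: np.ndarray, X: np.ndarray, alpha: float = 0.5):
    """W: (out_features, in_features), X: (n_samples, in_features) calibration activations.
    Returns the rescaled weights and the per-input-channel scales."""
    act_mag = np.mean(np.abs(X), axis=0)         # average magnitude per input channel
    s = np.maximum(act_mag, 1e-8) ** alpha       # larger activations -> larger scale
    return W * s[None, :], s

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(32, 8))
X[:, 3] *= 20.0                                  # one channel with large activations

W_scaled, s = awq_like_rescale(W, X)
x = X[0]
# Scaling weights up and activations down leaves the output unchanged.
print(np.allclose(W @ x, W_scaled @ (x / s)))
```

Quantization would then be applied to `W_scaled` instead of `W`, with the division by `s` folded into whatever produces the activations.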

In practice AWQ and GPTQ perform similarly, with AWQ often being faster to quantize because it does not require per-weight Hessian computation.

What you actually lose

Perplexity is the standard way to measure quantization degradation. Lower is better. For a typical 7B model:

Format    Bits/weight   PPL (WikiText-2)   Size (7B)
FP16      16            ~5.7               14 GB
Q8_0      8.5           ~5.7               7.2 GB
Q4_K_M    4.8           ~5.9               4.1 GB
Q4_0      4.5           ~6.1               3.8 GB
Q3_K_M    3.9           ~6.5               3.3 GB
Q2_K      2.6           ~7.8               2.7 GB

The numbers vary by model, but the shape is consistent: Q8_0 is nearly free, Q4_K_M costs about 0.2 perplexity points, Q3_K_M about 0.8, and Q2_K roughly 2. There is a cliff around 2 bits where models become noticeably degraded for open-ended generation even if benchmark scores hold up.

Perplexity does not capture everything. Quantized models sometimes degrade on specific domains or tasks even when aggregate perplexity looks fine. This is why the llama.cpp quantization comparison tables track multiple benchmarks and why it is worth testing quantized models on your specific use case rather than trusting a single number.
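For reference, perplexity is the exponential of the average negative log-likelihood the model assigns to each actual next token. A minimal sketch, assuming you already have those per-token natural-log probabilities from some evaluation harness:

```python
import math

def perplexity(token_log_probs):
    """token_log_probs: natural-log probability assigned to each true next token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that always gives the correct token probability 0.25 is,
# on average, as uncertain as a uniform choice among 4 options.
print(perplexity([math.log(0.25)] * 8))
```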

The hardware side

One reason the ecosystem is fragmented across GGUF, GPTQ, and AWQ is that different hardware handles them differently. Modern NVIDIA GPUs have INT8 tensor cores that can execute 8-bit matrix multiplications directly, giving a real throughput benefit over dequantize-then-multiply approaches. Hardware INT4 matrix support is less universally available and varies between GPU generations, so many 4-bit kernels unpack the weights to a wider type before multiplying.

On CPU, llama.cpp uses AVX2/AVX-512 SIMD instructions to process quantized blocks efficiently. The 4-bit GGUF formats are specifically designed around these instruction widths, which is part of why the block sizes are powers of two.

For Apple Silicon, Metal shaders in llama.cpp handle quantized matrix multiplication directly, which is why running a Q4_K_M model on an M-series Mac is faster than you might expect given that neural network inference is not what Apple originally optimized those chips for.

The right mental model

Quantization is not lossless compression. It is a controlled approximation that exploits the fact that neural networks are trained with more precision than they need for inference. The interesting engineering is in figuring out where the precision matters, which is why K-quants, NF4, GPTQ, and AWQ all exist as distinct techniques rather than one winning format.

When you download a Q4_K_M file from Hugging Face and run it through Ollama or llama.cpp, the loading code is reading those block scales, reconstructing approximate weight values on the fly, and doing matrix multiplications with approximated numbers. The fact that the output is coherent says something about the robustness of the representations that emerged from training, not just about the cleverness of the quantization scheme.
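The "reconstruct on the fly" step can be sketched as a matrix-vector product that works block by block and applies each block's scale to a partial dot product, without ever materializing the full float matrix. This sketch uses 8-bit codes and NumPy for readability. Real kernels operate on packed 4-bit data with SIMD or GPU code, and the names here are hypothetical.

```python
import numpy as np

def quantize_rows_blockwise(W: np.ndarray, block_size: int = 32):
    """W: (out_features, in_features), in_features divisible by block_size."""
    out_f, in_f = W.shape
    blocks = W.reshape(out_f, in_f // block_size, block_size)
    scales = np.max(np.abs(blocks), axis=2, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero blocks
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales.astype(np.float32)

def matvec_dequant_on_the_fly(q: np.ndarray, scales: np.ndarray, x: np.ndarray):
    out_f, n_blocks, block_size = q.shape
    xb = x.reshape(n_blocks, block_size).astype(np.float32)
    # Dot product of integer codes with activations, one value per (row, block).
    partial = np.einsum("obk,bk->ob", q.astype(np.float32), xb)
    # Apply each block's scale, then sum the blocks of every row.
    return (partial * scales[:, :, 0]).sum(axis=1)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64)).astype(np.float32)
x = rng.normal(size=64).astype(np.float32)

q, scales = quantize_rows_blockwise(W)
y = matvec_dequant_on_the_fly(q, scales, x)
print("max deviation from the float result:", float(np.max(np.abs(y - W @ x))))
```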
