
What the GGUF Naming Scheme Is Actually Telling You

Source: simonwillison

Simon Willison recently published a ground-up explanation of quantization covering the core math behind reducing model weight precision. It’s a good foundation. What I want to do here is extend that foundation outward into the practical engineering of the GGUF format and the specific design decisions inside llama.cpp that determine what you’re actually getting when you download a Q4_K_M file.

The basic math, briefly

A neural network weight is stored as a floating-point number. In full precision that’s 32 bits per weight (FP32), or more commonly today 16 bits (FP16 or BF16). A 7B parameter model in BF16 takes roughly 14 GB of memory. That’s too large to fit in most consumer GPU VRAM, so quantization maps those floats to smaller integers.

The simplest form is affine (asymmetric) quantization. For a range of weights, you compute a scale factor s and a zero point z, then store each weight as an integer q where:

q = round(w / s) + z
w_approx = (q - z) * s
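As a minimal sketch of the two formulas above, the following pure-Python round trip quantizes a handful of weights to 4 bits. The function names and the choice of deriving s and z from the min/max of the range are assumptions made for illustration; real implementations differ in how they pick the range.

```python
def affine_quantize(weights, bits=8):
    """Map floats to unsigned integers in [0, 2**bits - 1]."""
    qmax = 2**bits - 1
    lo, hi = min(weights), max(weights)
    # Assumption: the scale spans the observed min..max range.
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for constant input
    zero_point = round(-lo / scale)
    # q = round(w / s) + z, clamped to the representable range
    q = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def affine_dequantize(q, scale, zero_point):
    """w_approx = (q - z) * s"""
    return [(qi - zero_point) * scale for qi in q]


weights = [-0.40, -0.10, 0.00, 0.25, 1.20]
q, s, z = affine_quantize(weights, bits=4)
approx = affine_dequantize(q, s, z)
max_err = max(abs(w - a) for w, a in zip(weights, approx))

print("codes:", q)
print("scale:", round(s, 4), "zero point:", z)
print("worst-case error:", round(max_err, 4))
```

Note that the weight 0.0 lands exactly on the zero point, so it reconstructs to exactly 0.0; that property is the main reason the zero point is stored as an integer.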

Symmetric quantization drops the zero point by assuming weights are centered at zero, which holds reasonably well for model weights in practice. You compute:

s = max(|w|) / (2^(bits-1) - 1)
q = round(w / s)

The reconstruction error is w - w_approx, and the goal of every quantization scheme is to minimize that error while using fewer bits.
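A short sketch of the symmetric variant makes the cost of dropping bits concrete. The helper names are illustrative, and the sample weights are made up; the point is only that the same values reconstruct far less accurately at 4 bits than at 8.

```python
def symmetric_quantize(weights, bits=8):
    """s = max(|w|) / (2^(bits-1) - 1), q = round(w / s)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale


def reconstruction_error(weights, q, scale):
    """Per-weight error w - w_approx, with w_approx = q * s."""
    return [w - qi * scale for w, qi in zip(weights, q)]


weights = [-0.40, -0.10, 0.00, 0.25, 1.20]
worst_error = {}
for bits in (8, 4):
    q, s = symmetric_quantize(weights, bits)
    errors = reconstruction_error(weights, q, s)
    worst_error[bits] = max(abs(e) for e in errors)
    print(f"{bits}-bit: scale={s:.4f} worst error={worst_error[bits]:.4f}")
```

Because rounding moves each value by at most half a step, the worst-case error is bounded by s / 2, and s roughly doubles for every bit removed.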

At INT8, you store each weight in 8 bits and the quality loss is nearly imperceptible in most benchmarks. At INT4, things get interesting.

Why per-tensor quantization breaks down at INT4

In early INT8 quantization schemes, a single scale factor covered an entire weight matrix, or at best an entire row (per-channel). This is fine at 8 bits because the 256 representable values give enough resolution to cover a matrix’s value range without much rounding error.

At 4 bits you have only 16 representable values. If a weight matrix contains a few outlier values, the scale factor must be set wide enough to cover them, which spreads the 16 levels evenly across a range that is mostly empty. The bulk of the weights, which sit close to zero, end up sharing only two or three of those levels, and the effective precision collapses.

The fix is per-group quantization: instead of one scale per row, you compute one scale per small block of consecutive weights, typically 32 or 128 values. Each group gets its own scale, so outliers in one group don’t degrade the resolution everywhere else. The scale factors themselves take up overhead, but at group sizes of 32 the overhead is modest: one FP16 scale per 32 INT4 weights adds 16 / 32 = 0.5 bits per weight, for 4.5 bits per weight in total.
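The effect of a single outlier is easy to reproduce. The sketch below uses synthetic Gaussian weights with one large value planted in them (both the distribution and the helper names are assumptions for illustration) and compares the mean reconstruction error of one scale for everything against one scale per 32-weight group.

```python
import random


def fake_quantize(values, bits=4):
    """Symmetric quantize-then-dequantize with a single scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) * scale for v in values]


def fake_quantize_grouped(values, group_size=32, bits=4):
    """Same as above, but each consecutive group gets its own scale."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(fake_quantize(values[start:start + group_size], bits))
    return out


def mean_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx)) / len(original)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
weights[5] = 1.0  # one outlier, far outside the rest of the distribution

per_tensor_err = mean_abs_error(weights, fake_quantize(weights))
per_group_err = mean_abs_error(weights, fake_quantize_grouped(weights))

print(f"one scale for all 256 weights: {per_tensor_err:.5f}")
print(f"one scale per 32 weights:      {per_group_err:.5f}")
```

With a single scale, the outlier forces a step size so large that nearly every other weight rounds to zero. With groups, only the 32 weights that share a block with the outlier pay that price.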

This is the mechanism underlying all the modern Q4_K and Q5_K formats in GGUF. The K stands for k-quants, introduced into llama.cpp in mid-2023 by Kawrakow, representing a significant quality jump over the original Q4_0 and Q4_1 formats.

Decoding the GGUF naming scheme

GGUF filenames follow a loose convention: Q[bits]_[method]_[size]. The bits part is self-explanatory. The method and size letters are less obvious.

Q4_0 is the original 4-bit format with a single FP16 scale per 32-weight block and no zero point. Q4_1 adds a zero point (making it asymmetric), using two FP16 constants per block. Both are superseded by k-quants for most use cases.
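The storage cost of these two layouts follows directly from the description: 32 four-bit values plus one or two 16-bit constants per block. The helper names below are hypothetical, and the calculation is just the arithmetic, not a reading of the actual struct definitions.

```python
def block_bytes(weight_bits, constants_per_block, block_size=32, constant_bits=16):
    """Bytes needed for one block: packed weights plus per-block constants."""
    payload_bits = weight_bits * block_size
    overhead_bits = constant_bits * constants_per_block
    return (payload_bits + overhead_bits) // 8


def bits_per_weight(weight_bits, constants_per_block, block_size=32):
    """Effective storage cost once the per-block constants are amortized."""
    return 8 * block_bytes(weight_bits, constants_per_block, block_size) / block_size


# Q4_0: one FP16 scale per block. Q4_1: scale plus a second FP16 constant.
for name, constants in (("Q4_0", 1), ("Q4_1", 2)):
    print(name, block_bytes(4, constants), "bytes per block,",
          bits_per_weight(4, constants), "bits per weight")
```

So a nominally 4-bit format costs 4.5 bits per weight as Q4_0 and 5.0 as Q4_1, which is why file sizes never quite match the naive bits-times-parameters estimate.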

Q4_K_S and Q4_K_M are the 4-bit k-quant variants that llama.cpp itself produces; an L suffix also exists (Q3_K_L in llama.cpp, and Q4_K_L in some community uploads). The S/M/L suffix describes how much of the model is kept above the nominal bit width across different tensor types:

  • S (Small): Essentially everything quantized to 4-bit, minimal quality preservation. Smallest file size.
  • M (Medium): Attention and feed-forward layers get different treatment. Some tensors (in llama.cpp’s recipe, part of the attention value projections and feed-forward down projections, plus the final output layer) are quantized to 6-bit instead of 4-bit, because these tensors are more sensitive to precision loss.
  • L (Large): More tensors bumped up to higher precision.

The Q6_K format stores weights at roughly 6 bits per value with k-quant grouping; it is nearly indistinguishable from Q8_0 in perplexity benchmarks but uses less memory. Q8_0 is symmetric 8-bit with one scale per 32 weights and is generally considered lossless for practical purposes.
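Since the convention is regular enough to read mechanically, here is a small sketch of a parser for it. The function parse_quant_tag and its return fields are hypothetical, and the pattern only covers the tag shapes discussed in this article, not every type llama.cpp supports.

```python
import re

# Assumption: tags look like Q<bits>_<0|1|K> with an optional _S/_M/_L suffix.
_TAG = re.compile(r"^Q(?P<bits>\d)_(?P<method>K|0|1)(?:_(?P<mix>S|M|L))?$")


def parse_quant_tag(tag):
    """Split a tag such as 'Q4_K_M' into its nominal bits, family and mix."""
    m = _TAG.match(tag.upper())
    if m is None:
        raise ValueError(f"unrecognized quantization tag: {tag!r}")
    method = m["method"]
    return {
        "bits": int(m["bits"]),
        "family": "k-quant" if method == "K" else "legacy",
        # Only meaningful for the legacy formats: _1 stores a zero point, _0 does not.
        "zero_point": None if method == "K" else method == "1",
        "mix": m["mix"],  # S, M, L, or None when the tag has no suffix
    }


for tag in ("Q4_0", "Q4_1", "Q4_K_M", "Q6_K", "Q8_0"):
    print(tag, parse_quant_tag(tag))
```

The nominal bit count is only a starting point: as the M and L variants show, the average bits per weight in the file is usually higher.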

For a 7B model, the approximate sizes are:

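| Format | Size | Perplexity delta vs FP16 |
| --- | --- | --- |
| Q8_0 | ~7.7 GB | ~0.0 |
| Q6_K | ~6.1 GB | ~0.02 |
| Q5_K_M | ~5.1 GB | ~0.05 |
| Q4_K_M | ~4.4 GB | ~0.1–0.15 |
| Q4_K_S | ~4.1 GB | ~0.2 |
| Q3_K_M | ~3.5 GB | ~0.5 |
| Q2_K | ~2.8 GB | ~1.5+ |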

The perplexity deltas are measured on standard text corpora; a delta above ~0.5 tends to correlate with noticeable degradation in reasoning tasks.
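One way to sanity-check the sizes above is to convert them into effective bits per weight. The sketch below does that under a loud assumption: it takes "7B" to mean about 7.2 billion parameters, which is a round illustrative figure and varies between model families. The sizes are copied from the table.

```python
# Approximate file sizes in GB, as listed in the table above.
SIZES_GB = {
    "Q8_0": 7.7, "Q6_K": 6.1, "Q5_K_M": 5.1, "Q4_K_M": 4.4,
    "Q4_K_S": 4.1, "Q3_K_M": 3.5, "Q2_K": 2.8,
}

PARAMS = 7.2e9  # assumption: parameter count of a "7B" model


def effective_bits_per_weight(size_gb, params=PARAMS):
    """Average bits stored per parameter, including scales and mixed-precision tensors."""
    return size_gb * 1e9 * 8 / params


def reduction_vs_fp16(size_gb, params=PARAMS):
    """Fraction of the 16-bit footprint that quantization removes."""
    fp16_gb = params * 2 / 1e9
    return 1 - size_gb / fp16_gb


for name, size in SIZES_GB.items():
    print(f"{name:7s} {effective_bits_per_weight(size):4.1f} bits/weight, "
          f"{reduction_vs_fp16(size):.0%} smaller than FP16")
```

Under that assumption the "4-bit" files come out closer to 4.5 to 5 bits per weight, which reflects the per-block constants and the tensors kept at higher precision.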

The calibration question

Some quantization methods, particularly GPTQ and AWQ (Activation-aware Weight Quantization), require a calibration dataset. During quantization, a small set of real text prompts is passed through the model, and the quantizer observes which weights most strongly affect the output. AWQ in particular identifies a small fraction (roughly 1%) of weight channels that carry disproportionate influence, judged by the magnitude of the activations they see, and protects them by rescaling them before quantization rather than storing them at higher precision. The whole matrix can then be quantized aggressively at a single bit width.

GGUF’s k-quants don’t use activation-based calibration in the same way. The layer-type heuristic (bumping up attention projections) is a static approximation of the same idea: some structural positions in a transformer are more sensitive than others regardless of the specific data distribution. This trades calibration accuracy for deployment simplicity, and for most practical use cases the difference is small.

Importance-matrix quantization, now available in llama.cpp via the --imatrix flag, brings activation-based calibration to GGUF. You generate an importance matrix by running the model over a representative dataset, then use that matrix to guide which individual weights within each group get rounded up versus down. The quality gains are most visible at aggressive quantization levels (Q3 and below) and for instruction-tuned models where task-specific weight patterns matter more.
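To make the idea concrete, here is a deliberately simplified sketch of importance-weighted quantization. It is not llama.cpp’s algorithm: it simply searches a few candidate scales for one group and keeps the one that minimizes squared error weighted by a per-weight importance value. All names, the sample weights, and the importance values are made up for illustration.

```python
def absmax_scale(weights, qmax):
    """The plain choice: largest magnitude maps to the largest code."""
    return max(abs(w) for w in weights) / qmax


def quantize_with_scale(weights, scale, qmax):
    """Round to the nearest code, clamping values a smaller scale would overflow."""
    return [max(-qmax, min(qmax, round(w / scale))) for w in weights]


def weighted_error(weights, scale, qmax, importance):
    """Sum of importance * (w - w_approx)^2 for this choice of scale."""
    q = quantize_with_scale(weights, scale, qmax)
    return sum(i * (w - qi * scale) ** 2 for w, qi, i in zip(weights, q, importance))


def best_scale(weights, importance, bits=3, steps=50):
    """Try scales from 0.5x to 1.5x the absmax scale; keep the lowest weighted error."""
    qmax = 2 ** (bits - 1) - 1
    base = absmax_scale(weights, qmax)
    candidates = [base * (0.5 + k / steps) for k in range(steps + 1)]
    return min(candidates, key=lambda s: weighted_error(weights, s, qmax, importance))


weights = [0.31, -0.12, 0.05, 0.90, -0.44, 0.27, 0.02, -0.63]
importance = [1.0, 1.0, 1.0, 1.0, 1.0, 8.0, 1.0, 1.0]  # pretend weight 5 matters most

qmax = 2 ** (3 - 1) - 1
naive_err = weighted_error(weights, absmax_scale(weights, qmax), qmax, importance)
tuned = best_scale(weights, importance, bits=3)
tuned_err = weighted_error(weights, tuned, qmax, importance)

print(f"absmax scale error: {naive_err:.5f}")
print(f"tuned scale error:  {tuned_err:.5f}")
```

The search can accept a slightly worse fit on unimportant weights, or even clip the largest one, if that buys a better fit on the weights the calibration data flagged as important.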

Where Apple Silicon fits

The MLX framework from Apple takes a different approach to quantization for Apple Silicon. MLX uses grouped quantization similar to k-quants but expressed in its own format, and it can quantize a model’s weights after loading them rather than requiring pre-quantized files. The mlx_lm.convert utility, run with quantization enabled, converts a Hugging Face model to 4-bit and saves the result locally for reuse.

Because Apple’s unified memory architecture shares RAM between CPU and GPU, the memory savings from quantization translate directly into the ability to run larger models on consumer hardware. A 70B model in Q4_K_M needs roughly 40 to 42 GB for the weights alone, before the KV cache, which is within reach of the higher-memory M2 Ultra and M3 Ultra Mac Studio configurations. The same model in BF16 would need roughly 140 GB.
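The arithmetic behind those figures is a one-liner. In the sketch below, the 4.8 bits per weight used for Q4_K_M is an assumed average rather than a measured value, and the estimate covers weights only; the KV cache and runtime buffers come on top.

```python
def weight_memory_gb(params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


PARAMS_70B = 70e9
Q4_K_M_BITS = 4.8  # assumption: approximate average for a Q4_K_M file

bf16_gb = weight_memory_gb(PARAMS_70B, 16)
q4_gb = weight_memory_gb(PARAMS_70B, Q4_K_M_BITS)

print(f"BF16:   {bf16_gb:.0f} GB")
print(f"Q4_K_M: {q4_gb:.0f} GB")
```

Swapping in a different parameter count or bit width gives a quick first check of whether a given file has any chance of fitting on a given machine.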

MLX’s quantization is generally held to be comparable quality to GGUF k-quants at the same bit width, though the two formats aren’t directly interchangeable and benchmarks comparing them across identical models are sparse.

The practical decision

For most people running local models, Q4_K_M is the right default. It sits at the inflection point where memory savings are substantial (roughly 70% smaller than FP16) and quality loss is small enough to be invisible in conversational use. If you have enough VRAM or RAM to run Q5_K_M, it’s worth it for tasks that require precise instruction following or multi-step reasoning. Below Q4 you’re trading meaningful quality for marginal memory savings, and the calculus only favors that if the alternative is not running the model at all.

The underlying math that Simon’s article walks through (scale factors, rounding error, and the geometry of mapping a continuous distribution onto a small set of integers) is what determines all of these tradeoffs. The GGUF naming scheme is just an encoding of the engineering decisions that the llama.cpp team made while navigating that math at different operating points. Reading the name tells you roughly where on that tradeoff curve a given file sits.
