canirun.ai does something simple and useful: you tell it your GPU and it tells you which local LLMs your hardware can handle. For most people, that answer is all they need. But if you want to understand why the tool gives the answers it does, you need to look at the arithmetic underneath it. The constraints that determine whether a model fits on your hardware are not mysterious. They come from a small set of formulas that stay constant even as models and tooling keep evolving.
This post works through that math. Weight storage, quantization, KV cache, and hardware tiers are all knowable from first principles. Once you have the formulas in your head, a tool like canirun.ai turns from a black box into something you could reproduce yourself on the back of a napkin.
## Weight Memory: The Baseline
The most basic cost of running a model is storing its weights. Every parameter in a neural network needs to be loaded into memory before a single token can be generated. The formula is:
memory_bytes = num_parameters × bytes_per_parameter
A 7 billion parameter model stored in float16 (2 bytes per value) costs 7,000,000,000 × 2 = 14 GB. Add 10-20% overhead for activation buffers, runtime allocations, and framework state, and you land closer to 16 GB just to have the model resident in memory.
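The weight arithmetic is simple enough to script. A minimal sketch (the 15% default overhead is an assumed midpoint of the 10-20% range above, not a measured value):

```python
def weight_memory_gb(num_params: float, bytes_per_param: float,
                     overhead: float = 0.15) -> float:
    """Estimate resident memory for model weights, in decimal GB.

    overhead: assumed fraction of extra headroom for activation buffers,
    runtime allocations, and framework state (10-20% is typical).
    """
    base_gb = num_params * bytes_per_param / 1e9
    return base_gb * (1 + overhead)

# 7B parameters in float16 (2 bytes each): 14 GB of raw weights,
# roughly 16 GB resident once overhead is included.
print(weight_memory_gb(7e9, 2))
```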
Why GPU memory specifically? A CPU can technically run inference through system RAM, but GPUs perform the matrix multiplications that drive transformer inference orders of magnitude faster. A modern GPU processes thousands of parallel operations per clock cycle; a CPU handles a handful. For interactive use, VRAM is the resource that matters. If your GPU does not have enough, you are either CPU-offloading with a significant throughput penalty, or unable to run the model at all.
## Quantization: The Main Lever
Float16 is precise but wasteful for inference. Most model weights do not need 16-bit precision to produce good output. Quantization reduces each weight to fewer bits, compressing the model significantly at a modest cost to output quality.
The most widely used quantization ecosystem for local inference is llama.cpp, which introduced the GGUF file format and a family of K-quants that mix different precision levels across layers. The Q4_K_M format averages roughly 4.5 bits per weight. For a 7B model that works out to approximately 7,000,000,000 × 4.5 / 8 ≈ 3.9 GB, less than a quarter of the float16 cost.
Here is how the numbers look across quant levels for Llama 3.1 8B:
| Format | Effective Bits/Weight | Approx Size | Min VRAM (8K context) |
|---|---|---|---|
| float16 | 16 | ~16 GB | ~18 GB |
| Q8_0 | ~8.5 | ~8.5 GB | ~10 GB |
| Q5_K_M | ~5.7 | ~5.7 GB | ~7 GB |
| Q4_K_M | ~4.9 | ~4.9 GB | ~6.5 GB |
| Q3_K_M | ~3.9 | ~3.9 GB | ~5.5 GB |
The jump from float16 to Q4_K_M is substantial, and the quality degradation on most benchmarks is surprisingly small. Q3 starts to show more noticeable degradation on complex reasoning tasks, but for many applications it remains usable. For GPU-native workflows, GPTQ and AWQ offer alternatives that quantize specifically for GPU execution and tend to preserve accuracy better than post-hoc GGUF quantization, at the cost of reduced flexibility for CPU offloading.
## KV Cache: The Hidden Memory
Weight memory is predictable. KV cache is where people get surprised. Every token in a model’s context window requires storing intermediate key and value tensors for every attention layer, and that storage scales linearly with context length.
The formula for KV cache memory is:
kv_size = 2 × n_layers × n_kv_heads × head_dim × seq_len × bytes_per_element
Llama 3.1 8B uses grouped query attention (GQA) with 32 layers, 8 KV heads, and a head dimension of 128. At 8K context in float16:
2 × 32 × 8 × 128 × 8192 × 2 = 1,073,741,824 bytes ≈ 1 GB
Manageable. But Llama 3.1 supports up to 128K tokens of context. At that length the same formula yields approximately 16 GB, so even a well-quantized model that fits comfortably in VRAM at short contexts can overflow memory at long ones.
GQA is worth noting specifically because it is the reason newer models are friendlier to local inference than older ones at the same parameter count. Earlier architectures used multi-head attention (MHA) with 32 KV heads instead of 8. Run that same calculation with n_kv_heads = 32 and the cache is four times larger. A lot of the observation that modern models run better locally traces back to this single architectural decision.
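The formula makes the GQA-versus-MHA difference concrete. A sketch using the Llama 3.1 8B geometry from above:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in decimal GB: keys + values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Llama 3.1 8B (GQA, 8 KV heads), float16 cache:
print(kv_cache_gb(32, 8, 128, 8192))    # 8K context:  ~1.07 GB
print(kv_cache_gb(32, 8, 128, 131072))  # 128K context: ~17.2 GB
# Same geometry with MHA (32 KV heads): exactly four times larger.
print(kv_cache_gb(32, 32, 128, 8192))   # ~4.29 GB at 8K
```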
llama.cpp exposes --cache-type-k and --cache-type-v flags that quantize the KV cache itself, typically to q8_0 or q4_0, halving or quartering its memory footprint at a minor precision cost. For long-context use cases on constrained hardware, these flags are worth knowing about.
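The effect of those flags is just a change in bytes per cached element. A sketch (assumption: q4_0 is treated as a flat 0.5 bytes per element, ignoring the small per-block scale overhead the quant formats carry):

```python
# Bytes per cached element under each llama.cpp KV cache type.
CACHE_BYTES = {"f16": 2.0, "q8_0": 1.0, "q4_0": 0.5}

def kv_cache_quantized_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                          seq_len: int, cache_type: str = "f16") -> float:
    """KV cache size in decimal GB for a given cache element type."""
    n_elems = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return n_elems * CACHE_BYTES[cache_type] / 1e9

# Llama 3.1 8B at the full 128K context:
for cache_type in CACHE_BYTES:
    gb = kv_cache_quantized_gb(32, 8, 128, 131072, cache_type)
    print(f"{cache_type}: ~{gb:.1f} GB")  # f16 ~17.2, q8_0 ~8.6, q4_0 ~4.3
```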
## Hardware Tiers in Practice
The formulas above sort hardware into three practical tiers.
Entry tier (8 GB VRAM), covering cards like the RTX 3060 and RTX 4060 as well as 16 GB Apple M-series Macs: this tier handles 7-8B models at Q4_K_M comfortably, with short-to-medium context windows. It covers coding assistants, chat, and summarization tasks without issue.
Mid tier (16-24 GB VRAM), covering the RTX 3090, RTX 4090, and 32-48 GB unified memory M-series Macs: this opens up 13-34B models at reasonable quantization, and enables partial CPU offloading for 70B models. Partial offloading keeps some layers in VRAM and the rest in system RAM; generation is slower, but viable for non-interactive batch tasks.
High tier (48 GB and above), covering professional cards like the RTX 6000 Ada, A100, and H100: 70B parameter models run comfortably at Q4_K_M or above without any offloading.
Apple Silicon deserves a separate note. Its unified memory architecture means a 64 GB Mac has 64 GB of memory accessible at GPU bandwidth, rather than the separate VRAM and system RAM pools of discrete GPU systems. A 70B model at Q4_K_M comes to roughly 40 GB; no consumer discrete GPU comes close to accommodating that in a single card. For running large models without a server-grade GPU budget, high-RAM Apple Silicon is currently the most cost-effective path.
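Combining weight memory and KV cache gives the back-of-napkin fit check the whole post has been building toward. A sketch (`fits_in_vram` is a hypothetical helper, not canirun.ai's actual logic; the 15% overhead and bits-per-weight figures are assumptions):

```python
def fits_in_vram(vram_gb: float, num_params: float, bits_per_weight: float,
                 n_layers: int, n_kv_heads: int, head_dim: int, seq_len: int,
                 overhead: float = 0.15, kv_bytes: int = 2):
    """Rough fit check: quantized weights + overhead + float16 KV cache."""
    weights_gb = num_params * bits_per_weight / 8 / 1e9
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * seq_len * kv_bytes / 1e9
    total_gb = weights_gb * (1 + overhead) + kv_gb
    return total_gb <= vram_gb, total_gb

# Llama 3.1 8B at Q4_K_M (~4.9 effective bpw) with 8K context on an 8 GB card:
ok, total = fits_in_vram(8, 8e9, 4.9, 32, 8, 128, 8192)
print(ok, round(total, 1))  # fits, at roughly 6.7 GB total
```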
## Tools That Make This Practical
llama.cpp is the foundation of most local inference setups. Written in C++, it runs on both CPU and GPU, ships its own OpenAI-compatible HTTP server, and supports GGUF natively. Most of the other tools in this space wrap or depend on it.
Ollama provides Docker-style model management on top of llama.cpp, handling model downloads, versioning, and process lifecycle with a minimal CLI. It is the fastest path from nothing to a running model. LM Studio adds a GUI layer for discovery and running models, which is useful when you are exploring hardware limits interactively rather than scripting inference.
For GPU-native workflows, ExLlamaV2 offers the EXL2 format with strong quality-per-bit characteristics on CUDA hardware. One welcome trend across the ecosystem: official model repositories on Hugging Face increasingly ship pre-quantized GGUF files directly, reducing dependence on third-party quantizers and improving provenance clarity.
## The Efficiency Trend
The math here is fixed. What changes is how much capability fits within a given parameter count, and that ratio has been improving faster than VRAM prices have been falling. Phi-3.5 Mini at 3.8 billion parameters competes meaningfully with 13B models from roughly a year ago. DeepSeek’s R1 distillation work puts reasoning-capable models at sizes that fit on entry-tier hardware. The threshold between too small to be useful and genuinely useful keeps moving down the parameter scale.
canirun.ai needs to keep updating its model database as new releases push the capability frontier. The arithmetic it relies on does not change. If you understand the weight memory formula, the quantization table, and the KV cache calculation, you can work out whether any model at any quant level fits your hardware before any compatibility tool has even heard of it.