The Hardware Math Behind Running AI on Your Own Machine

Source: hackernews

When a site like canirun.ai lands on the Hacker News front page with 394 upvotes, it signals something about where local AI inference stands. The question “can I actually run this on my hardware?” now has a precise, calculable answer, and the audience for that answer has grown large enough to justify a single-purpose compatibility checker.

The premise is simple: enter your specs, get a verdict on which models you can run. But the more useful question is what the tool is actually measuring, and why memory constrains local inference far more than compute does.

The Weight Memory Calculation

A language model at inference time is a large collection of floating-point weights that must be loaded into memory for every forward pass. The baseline calculation is straightforward. A model with N billion parameters stored in 16-bit floating point (FP16 or BF16) requires approximately N × 2 gigabytes of VRAM:

  • 7B model in FP16: ~14 GB
  • 13B model in FP16: ~26 GB
  • 70B model in FP16: ~140 GB

A single RTX 4090 has 24 GB of VRAM. Running Llama 3.3 70B at full precision would require roughly six of them. Full-precision inference is a non-starter for consumer hardware at this scale, which is where quantization becomes the enabling technology.
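The baseline arithmetic is simple enough to script. A minimal sketch, using only the 2-bytes-per-parameter FP16 assumption from above:

```python
def fp16_weight_gb(params_billion: float) -> float:
    """Approximate memory to hold FP16/BF16 weights: 2 bytes per parameter."""
    return params_billion * 1e9 * 2 / 1e9  # bytes -> decimal GB

for n in (7, 13, 70):
    print(f"{n}B model in FP16: ~{fp16_weight_gb(n):.0f} GB")
# 7B -> 14 GB, 13B -> 26 GB, 70B -> 140 GB
```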

Quantization: The Numbers That Make Local Inference Viable

Quantization reduces the number of bits used to represent each weight. The dominant format in the local AI ecosystem is GGUF, developed and maintained by the llama.cpp project, which defines a tiered set of quantization levels:

Format   Bits/weight   7B size   70B size
FP16     16            ~14 GB    ~140 GB
Q8_0     8             ~7 GB     ~70 GB
Q5_K_M   5             ~5 GB     ~48 GB
Q4_K_M   4             ~4.1 GB   ~40 GB
Q3_K_M   3             ~3.3 GB   ~32 GB
Q2_K     2             ~2.8 GB   ~26 GB

The _K_M suffix denotes k-quants, which apply different quantization levels to different weight matrices rather than treating all weights uniformly. Attention projection weights get more bits; feed-forward weights, which are more numerous but less sensitive, get fewer. The result is that Q4_K_M loses meaningfully less quality than a naive 4-bit quantization would suggest.

In practice, Q4_K_M and Q5_K_M sit at the best quality-to-size tradeoff for most use cases. A Q4_K_M 7B model occupies about 4.1 GB of VRAM, which fits on hardware as modest as an older RTX 3060 12GB.
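The same arithmetic extends to quantized formats, with one wrinkle: k-quants store per-block scale metadata, so the effective bits per weight run slightly above the nominal figure (Q4_K_M averages closer to ~4.8 bpw in practice, which is why a 7B model lands near 4.1 GB rather than the naive 3.5 GB). A sketch, with the bpw values as approximations:

```python
def quantized_size_gb(params_billion: float, effective_bpw: float) -> float:
    """Estimate quantized weight size from effective bits per weight."""
    return params_billion * 1e9 * effective_bpw / 8 / 1e9  # bytes -> decimal GB

# Effective bpw is approximate and varies slightly per model architecture.
print(f"7B @ ~4.8 bpw (Q4_K_M): ~{quantized_size_gb(7, 4.8):.1f} GB")
print(f"70B @ ~4.8 bpw (Q4_K_M): ~{quantized_size_gb(70, 4.8):.0f} GB")
```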

The KV Cache: The Variable Most Checkers Miss

Model weights are only part of the memory budget. The KV (key-value) cache scales with context length and can exceed the model weights themselves at long contexts.

During autoregressive decoding, a transformer retains the key and value tensors for every token in the current context window. The memory required is approximately:

kv_cache_bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × element_size

For Llama 3.1 8B with its 128K context window filled entirely, this approaches 16 GB at FP16, more than the quantized weights themselves. A model that fits comfortably at a 4K context will OOM at 32K.
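Plugging Llama 3.1 8B's published architecture into the formula (32 layers, 8 KV heads via grouped-query attention, head dimension 128) reproduces that figure; the element_size parameter also shows roughly what a lower-precision cache buys, ignoring block-scale overhead:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, element_size=2):
    """KV cache size: 2x for keys and values; element_size in bytes (2 = FP16)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * element_size

# Llama 3.1 8B with the full 128K context, FP16 cache
full = kv_cache_bytes(32, 8, 128, 131072, element_size=2)
print(f"FP16 KV cache: {full / 2**30:.0f} GiB")  # 16 GiB
# An 8-bit cache halves it (modulo quantization metadata)
print(f"8-bit KV cache: {kv_cache_bytes(32, 8, 128, 131072, 1) / 2**30:.0f} GiB")
```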

Ollama and llama.cpp default to 16-bit KV cache, but llama.cpp exposes --cache-type-k and --cache-type-v flags to drop the cache to 8-bit or even 4-bit, which trades a small amount of quality for a significant reduction in memory overhead at long contexts. This is the knob to reach for when you load a model successfully but hit memory limits as soon as you paste in a long document.

Tools and Their Tradeoffs

Four tools cover the bulk of local inference use cases:

llama.cpp is the foundation of the ecosystem. It is a C/C++ inference implementation with backends for CUDA, Metal, ROCm, Vulkan, and plain CPU. The key parameter for hybrid inference is -ngl (number of GPU layers): you can offload any number of transformer blocks to GPU while running the remainder on CPU RAM.

# Offload 28 of 32 layers to GPU, run the remaining 4 on CPU
./llama-cli -m ./llama-3.1-8b-q4_k_m.gguf -ngl 28 -c 4096 -p "Explain attention mechanisms"

This matters for machines with 8-10 GB of VRAM trying to run a model that needs 12 GB in total: partial offload keeps the fast GPU layers on VRAM while accepting slower CPU throughput for the rest.
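A back-of-envelope way to pick -ngl: divide the quantized weight size by the layer count and count how many layers fit after reserving VRAM for KV cache and runtime buffers. This is a rough sketch, not llama.cpp's actual allocator, and the reserve figure is an assumed round number:

```python
def estimate_ngl(model_gb: float, num_layers: int, vram_gb: float,
                 reserve_gb: float = 2.0) -> int:
    """Estimate how many transformer layers fit in VRAM after a fixed reserve."""
    per_layer_gb = model_gb / num_layers
    budget_gb = max(vram_gb - reserve_gb, 0.0)
    return min(num_layers, int(budget_gb / per_layer_gb))

# A ~12 GB model on a 10 GB card: most layers on GPU, the remainder on CPU
print(estimate_ngl(model_gb=12.0, num_layers=32, vram_gb=10.0))  # 21
```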

Ollama wraps llama.cpp in a daemon with a REST API and a model registry. Running ollama run llama3.3 fetches a pre-quantized variant (Q4_K_M by default) from the registry automatically. It is the lowest-friction path to local inference. The tradeoff is reduced control over the memory and context parameters that matter for edge cases.

LM Studio provides a GUI frontend, targeting users who prefer not to work in the terminal. It added MLX inference support for Apple Silicon in 2024, which improved Mac performance considerably over the Metal backend in llama.cpp alone.

vLLM is oriented toward serving with concurrent users rather than single-user local inference. Its PagedAttention mechanism manages KV cache memory in pages rather than contiguous blocks, enabling continuous batching across concurrent requests. For a researcher running a local API server shared among several team members, vLLM’s throughput advantages over llama.cpp are substantial.

Hardware Tiers and What They Actually Run

8-12 GB VRAM (RTX 3060 12GB, RTX 4060 Ti, RTX 3080 10GB): 7B-8B models in Q4 or Q5 run at 30-60 tokens/second, faster than most people read. Llama 3.1 8B, Mistral 7B v0.3, Gemma 2 9B, and Phi-4 Mini all live comfortably here.

16-20 GB VRAM (RTX 4080 16GB, RTX 4070 Ti Super): 13B models fully on GPU in Q4-Q5. Some 20B-class models fit at aggressive quantization. Qwen 2.5 14B is a good target for this tier.

24 GB VRAM (RTX 3090, RTX 4090): The current enthusiast sweet spot. Runs 34B models quantized without compromise, handles 70B models at Q2-Q3 with quality degradation, or splits 70B Q4 across GPU and CPU for slower but higher-quality inference. The 4090’s 1.0 TB/s memory bandwidth also makes it faster per token than the 3090 at comparable VRAM capacity.

40-80 GB VRAM (A100, H100, dual 3090 via NVLink): Full Q4 inference on 70B models with headroom for long contexts. DeepSeek R1 70B and Llama 3.3 70B at Q4_K_M each require around 40 GB, fitting comfortably in a single A100 80GB.

Apple Silicon as a Special Case

Apple’s unified memory architecture changes the VRAM capacity constraint entirely. On conventional desktops, GPU VRAM is physically separate from system RAM; a machine with 128 GB of DDR5 and an RTX 4090 can still only load models up to 24 GB into fast GPU memory.

On M-series chips, the CPU and GPU share the same memory pool. An M2 Ultra with 192 GB of unified memory holds a full 70B Q4 model with room for a long context window. Memory bandwidth runs around 400-800 GB/s depending on the chip variant, which is lower than a 4090’s 1.0 TB/s but the capacity advantage is decisive for large models.

The MLX framework from Apple ML Research is designed specifically for this architecture and often outperforms llama.cpp on Metal for quantized inference on Apple Silicon. An M3 Max 128 GB typically runs Llama 3.3 70B at Q4 at roughly 8-10 tokens/second, which is interactive enough for most purposes.
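Those throughput figures follow from memory bandwidth: single-stream decoding is memory-bound, since every generated token requires streaming all active weights through the processor. A hard ceiling estimate is bandwidth divided by model size; real throughput lands below it because of KV cache reads and compute overhead:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/second for memory-bandwidth-bound decoding."""
    return bandwidth_gb_s / model_gb

# 70B at Q4 (~40 GB) on an M2 Ultra (~800 GB/s) vs an RTX 4090 (~1008 GB/s)
print(f"M2 Ultra ceiling: ~{decode_ceiling_tok_s(800, 40):.0f} tok/s")
# The 4090 ceiling is higher, but the model would not fit in its 24 GB
print(f"RTX 4090 ceiling: ~{decode_ceiling_tok_s(1008, 40):.0f} tok/s")
```

This is why capacity and bandwidth are the two axes that matter: unified memory wins on capacity, discrete GPUs on bandwidth per gigabyte actually resident.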

CPU-Only Inference

Running entirely on CPU with llama.cpp -ngl 0 is viable but significantly slower: roughly 2-8 tokens/second for a Q4 7B model on a modern 8-core CPU, and under 1 token/second for a 70B model. The memory requirement shifts entirely to system RAM, which is often abundant on modern workstations even when VRAM is limited.

For batch generation tasks where latency does not matter, CPU inference is entirely practical. A server with 128 GB of RAM and no GPU can run a 70B model at Q4_K_M (about 40 GB) with no special hardware and process long-form content generation overnight.

What Hardware Checkers Are Actually Measuring

A tool like canirun.ai automates the arithmetic above: look up VRAM for the GPU model, compare against model sizes at common quantization levels, return a pass or fail. The more thorough implementations account for KV cache overhead at the context length you specify and flag partial GPU offload as an option when a model does not fit entirely in VRAM.
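A minimal version of that check fits in a few lines. This is an illustrative sketch, not canirun.ai's actual logic: real checkers use lookup tables per GPU and per quantization, and the overhead and offload thresholds here are assumed round numbers:

```python
def can_run(vram_gb: float, weights_gb: float, kv_cache_gb: float,
            overhead_gb: float = 1.5) -> str:
    """Pass/fail verdict: do weights + KV cache + runtime overhead fit in VRAM?"""
    needed = weights_gb + kv_cache_gb + overhead_gb
    if needed <= vram_gb:
        return "fits fully on GPU"
    # Assume partial offload is worthwhile if at least half the weights fit
    if weights_gb * 0.5 + kv_cache_gb + overhead_gb <= vram_gb:
        return "partial offload possible"
    return "does not fit"

# 8B @ Q4_K_M (~4.7 GB weights) with a 4K-token FP16 cache (~0.5 GB) on a 12 GB card
print(can_run(vram_gb=12, weights_gb=4.7, kv_cache_gb=0.5))  # fits fully on GPU
```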

The fact that this tooling exists and attracts attention reflects a genuine shift in who is running models locally. The combination of accessible tooling like Ollama, a growing ecosystem of smaller high-quality models, and the steady improvement in per-parameter quality from better training methods has made local inference a mainstream option rather than a researcher-only workflow. The hardware math has not changed; what has changed is that smaller models now deliver results that were only available from large cloud-served models two years ago, and that compression of capability into consumer VRAM budgets is what makes the compatibility question worth asking.
