
The Arithmetic Intensity Threshold That Makes LLM Batch Size Critical

Source: huggingface

The HuggingFace walkthrough on continuous batching from first principles, originally published in November 2025, builds the scheduling mechanics carefully. What it compresses is the hardware physics underneath: the performance difference between low and high batch sizes in LLM serving is not a matter of modest percentage gains. There is a threshold, determined by GPU hardware, and whether you clear it depends entirely on your scheduler’s ability to sustain a specific concurrent sequence count.

The Roofline Model, Briefly

GPUs operate under two independent performance limits: floating-point throughput and memory bandwidth. Which one constrains a workload depends on its arithmetic intensity, measured in FLOPs (floating-point operations) per byte of data accessed from memory.

For an A100 SXM 80GB, the relevant numbers are:

  • Peak FP16 throughput: 312 TFLOPS
  • HBM bandwidth: approximately 2 TB/s
  • Ridge point: 312T / 2T = 156 FLOPs per byte

A workload performing more than 156 FLOPs for every byte it reads is compute-bound; the GPU’s arithmetic units are the bottleneck. Below 156, the GPU waits on memory transfers and the arithmetic units sit largely idle. Large matrix multiplications in training are compute-bound at scale. Autoregressive decode is the opposite.
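
As a sanity check, the ridge point follows directly from the two hardware limits (a minimal sketch using the A100 figures above):

```python
# Ridge point for an A100 SXM 80GB: the arithmetic intensity (FLOPs/byte)
# at which a workload shifts from memory-bound to compute-bound.
PEAK_FP16_FLOPS = 312e12  # 312 TFLOPS peak FP16 throughput
HBM_BYTES_PER_SEC = 2e12  # ~2 TB/s HBM bandwidth

ridge_point = PEAK_FP16_FLOPS / HBM_BYTES_PER_SEC
print(ridge_point)  # 156.0 FLOPs per byte
```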

What Happens During a Single Decode Step

Autoregressive decode generates one token per forward pass. The dominant cost is running each of the model’s weight matrices against the current hidden state vector. Unlike training, where the hidden state is a matrix (batch of sequences), at batch size one the hidden state is a single vector. The computation is a series of matrix-vector multiplications rather than matrix-matrix multiplications.

For a model with P parameters in FP16:

  • Memory accessed per decode step: approximately 2P bytes (reading all weight matrices through HBM)
  • FLOPs per decode step: approximately 2P (one multiply-add per parameter)
  • Arithmetic intensity: approximately 1 FLOP per byte

For LLaMA-2 13B, the numbers are concrete:

  • Weight data read per step: 26 GB
  • Time to read at A100 bandwidth: 26 GB / 2 TB/s = 13 milliseconds
  • Compute time for 2 × 13B FLOPs at 312 TFLOPS: 83 microseconds

The compute completes in 0.08 ms. The memory transfer takes 13 ms. The arithmetic units are active for less than 1% of the step. The GPU is functioning as an extremely expensive memory bus.
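
Those step times can be reproduced in a few lines (a sketch using the A100 numbers above; real kernels add overheads this ignores):

```python
# Per-step cost model for batch-size-1 decode of LLaMA-2 13B in FP16.
P = 13e9                  # parameters
WEIGHT_BYTES = 2 * P      # FP16: 2 bytes per parameter -> 26 GB
FLOPS_PER_STEP = 2 * P    # one multiply-add per parameter
HBM_BYTES_PER_SEC = 2e12  # ~2 TB/s
PEAK_FP16_FLOPS = 312e12  # 312 TFLOPS

memory_time = WEIGHT_BYTES / HBM_BYTES_PER_SEC   # 13 ms to stream the weights
compute_time = FLOPS_PER_STEP / PEAK_FP16_FLOPS  # ~83 microseconds of math
utilization = compute_time / memory_time         # arithmetic units busy <1% of the step
```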

How Batch Size Changes the Physics

When B sequences decode simultaneously, the weight matrices are read once and applied to B token vectors in a batched matrix multiplication. Memory accesses remain roughly constant (the weights are the dominant term); FLOPS scale with B.

Arithmetic intensity at batch size B: approximately B FLOPs per byte.

The ridge point gives the compute saturation threshold directly: B must reach 156 for the A100 to become compute-bound. Below that threshold, throughput scales approximately linearly with B because more useful work is extracted from the same memory bandwidth. Above it, throughput plateaus at the hardware compute ceiling.

For LLaMA-2 13B on one A100:

B = 1:    ~76 tokens/sec   (99% of time spent on memory transfers)
B = 10:   ~760 tokens/sec
B = 50:   ~3,800 tokens/sec
B = 100:  ~7,600 tokens/sec
B = 156:  ~11,800 tokens/sec  (ridge point, compute saturates)
B = 300:  ~12,100 tokens/sec  (diminishing returns)
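
The figures in the table follow from a simple roofline throughput model (a sketch; measured numbers vary with kernel efficiency, so the model lands within a few percent of the rounded values above):

```python
def decode_tokens_per_sec(batch_size, params=13e9, hbm=2e12, peak=312e12):
    """Roofline estimate: a decode step takes the longer of the weight-read
    time (amortized across the batch) and the batched matmul compute time."""
    memory_time = 2 * params / hbm                 # read all FP16 weights once
    compute_time = batch_size * 2 * params / peak  # 2P FLOPs per token in the batch
    return batch_size / max(memory_time, compute_time)

for b in (1, 10, 50, 100, 156, 300):
    print(b, round(decode_tokens_per_sec(b)))
```

Below the ridge point the step time is pinned at the 13 ms weight read, so throughput is linear in B; above it, throughput plateaus near peak / 2P ≈ 12,000 tokens/sec.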

Going from batch size 1 to batch size 156 multiplies throughput by roughly 156x. That is not an incremental optimization. Below the ridge point, every inference system you deploy is burning GPU budget generating a fraction of the tokens it could.

Why Static Batching Missed the Threshold

The goal is clear: sustain batch sizes near 156. Static batching failed in two specific ways.

First, head-of-line blocking reduced effective batch size over time. When the longest sequence in a static batch needed 1,500 tokens and most others finished in 50, the batch shrank to a handful of active sequences for most of its lifetime. Average batch size across the full run was far below the nominal size at admission.

Second, KV cache pre-allocation consumed GPU memory at maximum sequence length. The KV cache stores key and value projections for all prior tokens so decode steps can run attention without recomputing the full sequence from scratch. For LLaMA-2 13B:

KV cache per token = 2 (K and V) × 40 layers × 40 heads × 128 head_dim × 2 bytes
                   = 819,200 bytes ≈ 800 KB per token

Per sequence at maximum 2048 output tokens: ~1.6 GB
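
The per-token and per-sequence footprint falls out of the model shape (a sketch using the 13B dimensions quoted above):

```python
# KV cache footprint for LLaMA-2 13B: 40 layers, 40 heads, head_dim 128, FP16.
LAYERS, HEADS, HEAD_DIM, DTYPE_BYTES = 40, 40, 128, 2

kv_bytes_per_token = 2 * LAYERS * HEADS * HEAD_DIM * DTYPE_BYTES  # K and V
kv_bytes_per_seq_max = kv_bytes_per_token * 2048  # pre-allocated at max length

print(kv_bytes_per_token)  # 819200 bytes (~800 KB per token)
```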

An A100 80GB with LLaMA-2 13B weights loaded in FP16 uses 26 GB for weights, leaving roughly 54 GB for KV cache. At maximum pre-allocation:

Maximum concurrent sequences = 54 GB / 1.6 GB ≈ 33

33 is 21% of the 156 needed to saturate the hardware. Systems were architecturally locked into the memory-bandwidth-bound regime. Actual average outputs, typically 200-400 tokens rather than 2048, made this worse: requests held 1.6 GB for 300 tokens of actual generation. The vLLM paper measured 60-80% KV cache waste in pre-PagedAttention systems.
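
The concurrency ceiling under pre-allocation is a one-line division (a sketch that ignores activation memory and fragmentation; the exact count lands within one or two of the rounded figure in the text):

```python
# Static batching: every admitted sequence reserves KV cache at max length.
KV_BUDGET = 54e9                      # ~80 GB minus 26 GB of FP16 weights
KV_PER_SEQ_PREALLOC = 819_200 * 2048  # ~1.6 GB reserved per sequence

max_static_batch = int(KV_BUDGET / KV_PER_SEQ_PREALLOC)  # low 30s
RIDGE_BATCH = 156
print(max_static_batch < RIDGE_BATCH)  # True: locked into the memory-bound regime
```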

Continuous scheduling alone, without solving the memory problem, would cycle through undersized batches faster without approaching the hardware’s potential.

PagedAttention as the Threshold Enabler

PagedAttention divides KV cache into fixed-size blocks of 16 tokens each. Each sequence’s KV cache is a list of non-contiguous block pointers rather than a contiguous pre-allocated buffer. Blocks are allocated one at a time as tokens are generated, never reserved speculatively for outputs that may not arrive. A custom CUDA kernel follows block table indirection to gather K and V vectors from scattered physical addresses during attention computation.
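
A toy version of the block-table bookkeeping, assuming a simple free-list allocator (names here are illustrative, not vLLM's actual API):

```python
BLOCK_TOKENS = 16  # tokens per fixed-size KV block

class BlockTable:
    """Toy paged-KV allocator: each sequence holds a list of physical block ids."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, token_pos):
        """Allocate a block only when a sequence crosses a block boundary."""
        table = self.tables.setdefault(seq_id, [])
        if token_pos % BLOCK_TOKENS == 0:  # first token of a new block
            table.append(self.free_blocks.pop())
        # physical location = (block id, offset within block)
        return table[token_pos // BLOCK_TOKENS], token_pos % BLOCK_TOKENS

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.tables.pop(seq_id, []))
```

Blocks are handed out one at a time as generation proceeds, and a finished sequence's blocks become immediately reusable by newly admitted sequences, which is what lets iteration-level scheduling keep the batch full.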

With actual average outputs of 300 tokens:

KV per sequence (actual): 800 KB/token × 300 tokens ≈ 240 MB

Achievable concurrent sequences: 54 GB / 240 MB ≈ 225

225 exceeds the 156 threshold. Combined with iteration-level scheduling that fills vacated slots after every forward pass, the system can operate in the compute-bound regime under sufficient load. The vLLM paper reported 2x to 24x throughput improvement over static-batching baselines depending on workload characteristics and sequence length distribution. The larger gains correspond to workloads where the static baseline was furthest below the ridge point.
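
The same division with actual rather than reserved footprints shows why this clears the threshold (a sketch under the same assumptions as above; the exact count lands slightly below the rounded figure in the text):

```python
KV_BUDGET = 54e9                   # ~80 GB minus 26 GB of FP16 weights
KV_PER_SEQ_ACTUAL = 819_200 * 300  # ~240 MB at a typical 300-token output

max_paged_batch = int(KV_BUDGET / KV_PER_SEQ_ACTUAL)  # ~220 concurrent sequences
print(max_paged_batch > 156)  # True: the compute-bound regime is reachable
```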

Grouped-Query Attention Shifts the Calculation

Grouped-query attention (GQA), used in LLaMA-2 70B, Mistral 7B, and most models released after 2023, reduces KV cache size by sharing a small number of K and V heads across multiple query heads. LLaMA-2 70B uses 8 KV heads serving 64 query heads, i.e. 8 query heads per group:

KV per token (LLaMA-2 70B, GQA) = 2 × 80 layers × 8 heads × 128 dim × 2 bytes
                                 ≈ 320 KB per token

Compared to a hypothetical 70B model with full multi-head attention and 64 KV heads, the GQA variant uses 8x less KV cache memory. For a fixed GPU memory budget, this directly multiplies the achievable concurrent sequence count. Multi-query attention (MQA), with a single K and V head shared across all queries, takes this further and was used in some Falcon variants and CodeLlama.
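
KV footprint as a function of KV-head count makes the 8x concrete (a sketch; the full-MHA variant is hypothetical, for comparison only):

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Bytes of K and V stored per generated token, FP16 by default."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

gqa_70b = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)   # LLaMA-2 70B
mha_70b = kv_bytes_per_token(layers=80, kv_heads=64, head_dim=128)  # hypothetical full MHA
mqa_70b = kv_bytes_per_token(layers=80, kv_heads=1, head_dim=128)   # MQA variant

print(mha_70b // gqa_70b)  # 8x smaller KV cache with GQA
```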

GQA was motivated partly by inference memory constraints: reducing KV footprint is one of the few architectural decisions that improves the arithmetic intensity achievable at a given hardware configuration without requiring more GPUs.

The Economic Implication

The arithmetic intensity calculation explains something beyond raw performance benchmarks: the cost per token in a serving system.

At batch size 1, a single A100 generates roughly 76 tokens per second with LLaMA-2 13B. At batch size 156+, the same hardware generates roughly 11,800 tokens per second, a 155x difference for the same hardware cost per hour. Per-token costs at low batch sizes are structurally high because the GPU is idle for 99% of each decode step.
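
Folded into a rental price, the throughput ratio carries straight through to cost per token (a sketch; the $2/hour A100 price is an assumption for illustration, not a figure from the article):

```python
A100_USD_PER_HOUR = 2.00  # hypothetical rental price (assumption)

def usd_per_million_tokens(tokens_per_sec, usd_per_hour=A100_USD_PER_HOUR):
    """Cost of generating one million tokens at a sustained decode rate."""
    return usd_per_hour / (tokens_per_sec * 3600) * 1e6

low_batch = usd_per_million_tokens(76)       # batch size 1
high_batch = usd_per_million_tokens(11_800)  # at/above the ridge point

print(round(low_batch / high_batch))  # ~155x cheaper per token at saturation
```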

API pricing for LLM inference has dropped roughly 90-95% since mid-2023. That price collapse tracks the broad deployment of continuous batching and paged KV cache management. The economics shifted when systems could reliably operate near the arithmetic intensity threshold rather than at a small fraction of it.

For anyone planning capacity, this has a practical consequence: adding more hardware does not help if scheduling constraints prevent reaching the ridge point. An underloaded system with continuous batching enabled will be just as memory-bandwidth-bound as a static-batching system, because arithmetic intensity scales with concurrent sequences, not with peak hardware. The HuggingFace walkthrough covers how the scheduling mechanics produce high utilization; the hardware math above is why getting there matters as much as it does.
