Google's TPU Split and the Memory Bandwidth Wall That Made It Inevitable
Source: hackernews
Google’s eighth-generation TPU announcement is not primarily a story about raw FLOPS. It is a story about two fundamentally different problems that have been sharing silicon for too long.
The announcement introduces a pair of chips, tentatively referred to by their URL slugs as TPU-8T and TPU-8P, one optimized for training and one for the inference workloads that define the agentic era. The framing around “agentic” is deliberate marketing, but the hardware reasoning underneath it is sound.
How We Got Here
Google’s TPU lineage is worth tracing because the split at the eighth generation is the logical endpoint of a trajectory that started over a decade ago. The original TPU v1 (deployed internally in 2015 and announced in 2016) was an inference-only chip: essentially a large matrix multiply unit attached to datacenter servers as a coprocessor to accelerate already-trained models. With v2 and v3, Google broadened the scope to include training, and for several generations a single architecture family handled both.
The first visible crack appeared with TPU v5 in 2023, which came in two distinct variants: v5p (“performance”) aimed at large-scale training, and v5e (“efficiency”) tuned for inference and smaller training runs. The v5p delivers around 459 TFLOPS of BF16 compute per chip; the v5e sits at roughly 197 TFLOPS but with a lower cost per token for serving workloads. Trillium (TPU v6e) followed in 2024, delivering approximately 4.7x the peak compute per chip of v5e and continuing the efficiency-oriented “e” line.
The eighth generation formalizes what v5 started: these are no longer variants of the same chip. They are different chips built against different constraints.
The Memory Bandwidth Wall
To understand why the split is architecturally justified, you have to understand what actually limits inference for large language models. Training is a compute-bound problem: you process large batches, perform forward and backward passes, and the GPU or TPU spends most of its time doing matrix multiplications at high utilization. The bottleneck is FLOPS.
Inference for autoregressive models is a memory-bandwidth-bound problem. For every single token you generate, the chip has to stream the model weights and the full KV cache for the current sequence out of HBM and through the compute units, while doing very little arithmetic per byte loaded. With a 70 billion parameter model and a 128k context window, the KV cache for a single request can reach into the tens of gigabytes. When you have thousands of concurrent users, each with their own active context, the aggregate memory bandwidth demand is enormous.
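The "tens of gigabytes" figure can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the layer count, KV head count, and head dimension are assumed values typical of a 70B-class model with grouped-query attention, not figures from Google's announcement, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for one sequence.

    Each layer stores one key and one value vector (hence the factor of 2)
    per KV head, per token.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len


# Assumed shape of a 70B-class model with grouped-query attention,
# storing the cache in a 2-byte format such as BF16.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 80, 8, 128
CONTEXT = 128 * 1024  # a "128k" context window

size = kv_cache_bytes(CONTEXT, N_LAYERS, N_KV_HEADS, HEAD_DIM)
print(f"{size / 2**30:.1f} GiB per request")  # 40.0 GiB per request
```

Under these assumptions a single full-length request occupies 40 GiB. Without grouped-query attention (64 KV heads instead of 8), the same request would need eight times as much.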
This creates a hardware contradiction. A training-optimized chip wants high compute density, high-bandwidth interconnects between chips for gradient synchronization, and relatively modest memory capacity per chip (since the weights are distributed anyway). An inference-optimized chip wants massive HBM capacity, high memory bandwidth per FLOP, and low-latency access patterns suited to the irregular, scattered reads of KV cache retrieval.
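One way to make "memory bandwidth per FLOP" concrete is a roofline-style estimate. The sketch below is a simplification under stated assumptions: it models a single dense layer as one matrix multiply, uses the 459 TFLOPS figure quoted above for peak compute, and plugs in an assumed HBM bandwidth of 2.8 TB/s purely for illustration. The function names are hypothetical.

```python
def arithmetic_intensity(batch_tokens: int, d_in: int, d_out: int,
                         bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for a (batch_tokens x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * batch_tokens * d_in * d_out
    # Bytes moved: the weight matrix, the input activations, the outputs.
    bytes_moved = bytes_per_elem * (
        d_in * d_out + batch_tokens * d_in + batch_tokens * d_out
    )
    return flops / bytes_moved


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Intensity (FLOPs/byte) above which a chip is limited by compute."""
    return peak_flops / mem_bandwidth


def classify(intensity: float, ridge: float) -> str:
    return "compute-bound" if intensity >= ridge else "memory-bound"


PEAK_FLOPS = 459e12      # BF16 peak quoted in the text for v5p
HBM_BANDWIDTH = 2.8e12   # bytes/s, an assumed illustrative value
ridge = ridge_point(PEAK_FLOPS, HBM_BANDWIDTH)

decode = arithmetic_intensity(batch_tokens=1, d_in=8192, d_out=8192)
training = arithmetic_intensity(batch_tokens=4096, d_in=8192, d_out=8192)

print(f"ridge point: {ridge:.0f} FLOPs/byte")
print(f"decode, 1 token:      {decode:.2f} FLOPs/byte -> {classify(decode, ridge)}")
print(f"training, 4096 tokens: {training:.0f} FLOPs/byte -> {classify(training, ridge)}")
```

With one token in flight, each weight byte loaded supports about one FLOP, far below the ridge point, so the compute units sit idle waiting on memory. A large training batch reuses every loaded weight thousands of times and lands well above it.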
These are not just different points on a trade-off curve. They pull in opposite directions at the silicon level. Die area and power spent on compute units are area and power not spent on memory interfaces. Design choices that maximize throughput, such as deep pipelines and large batch sizes, tend to worsen the latency characteristics that matter for interactive inference.
What Agentic Workloads Actually Demand
The “agentic era” framing in Google’s announcement deserves unpacking, because it captures something real about how inference workloads have changed.
A single-turn chatbot query has a short context and a quick response, and then it is done. An agentic loop looks different: an orchestrating model maintains a growing context window across dozens of tool calls, spawns subagents with their own contexts, and runs for minutes or hours rather than seconds. Gemini 2.0’s one-million-token context window is not a demo feature; it is a prerequisite for tasks that require holding large codebases, long document histories, or multi-session memory in context.
For hardware, this means:
- KV cache sizes grow linearly with context length, while the attention compute needed to prefill that context grows quadratically. A sequence twice as long requires twice the KV cache memory and roughly four times the attention FLOPs during prefill. At one million tokens, the KV cache for a single request with a large model can exceed the total HBM capacity of many current chips.
- Prefill and decode have different hardware signatures. Prefill (processing the input prompt) is compute-bound and looks like training. Decode (generating each output token) is memory-bandwidth-bound. Agentic workflows mix both continuously, with long prefill phases as tool outputs are incorporated into context.
- Batching is harder. High-throughput inference relies on batching many requests together to amortize memory loads. Agentic requests have wildly different context lengths and generation patterns, making efficient batching significantly more complex.
A chip tuned purely for training throughput handles none of these well.
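The batching problem can be illustrated with a deliberately naive model. The sketch below assumes static batching, where every request in a batch is padded to the longest context, and measures how much of the padded work is wasted. Production serving systems typically use more sophisticated schemes such as continuous batching and paged KV caches; the function names and the power-of-two bucketing here are illustrative assumptions, not a description of any TPU serving stack.

```python
from collections import defaultdict


def padding_waste(context_lengths: list[int]) -> float:
    """Fraction of a statically padded batch that is padding, not real tokens."""
    padded = max(context_lengths) * len(context_lengths)
    return 1 - sum(context_lengths) / padded


def bucket_by_length(context_lengths: list[int]) -> dict[int, list[int]]:
    """Group requests whose lengths round up to the same power of two."""
    buckets: dict[int, list[int]] = defaultdict(list)
    for n in context_lengths:
        # (n - 1).bit_length() is the exponent of the next power of two >= n.
        buckets[(max(n, 1) - 1).bit_length()].append(n)
    return dict(buckets)


# Hypothetical mix: two short chat turns and two long agentic contexts.
requests = [500, 600, 120_000, 110_000]

print(f"one batch: {padding_waste(requests):.0%} wasted")
for exponent, group in sorted(bucket_by_length(requests).items()):
    print(f"bucket <= 2^{exponent}: {group}, {padding_waste(group):.0%} wasted")
```

Mixing short and long requests in one padded batch wastes roughly half of the work in this example, while grouping by length recovers most of it at the cost of smaller batches and longer queueing for rare lengths.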
The Interconnect Layer
One underappreciated aspect of Google’s TPU designs is the Inter-Chip Interconnect (ICI), the high-speed fabric that connects chips into pods. For training, ICI needs to sustain all-reduce operations across thousands of chips with predictable latency, because gradient synchronization stalls the entire training step. For inference pods serving agentic workloads, the requirements shift toward disaggregated serving architectures.
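To see why gradient synchronization is so sensitive to the fabric, consider the textbook ring all-reduce cost model. This is an assumption made for illustration: the source does not say which collective algorithm or topology TPU pods use, and the link speed and latency below are made-up parameters.

```python
def ring_all_reduce_seconds(num_chips: int, payload_bytes: float,
                            link_bytes_per_s: float,
                            hop_latency_s: float = 0.0) -> float:
    """Estimated time for a ring all-reduce of payload_bytes across num_chips.

    The ring algorithm runs 2 * (N - 1) steps (reduce-scatter, then
    all-gather), each moving one 1/N slice of the payload over one link.
    """
    if num_chips < 2:
        return 0.0
    steps = 2 * (num_chips - 1)
    bytes_per_step = payload_bytes / num_chips
    return steps * (hop_latency_s + bytes_per_step / link_bytes_per_s)


# Hypothetical numbers: 10 GB of gradients, 100 GB/s links, 2 us per hop.
for chips in (8, 256, 4096):
    t = ring_all_reduce_seconds(chips, 10e9, 100e9, hop_latency_s=2e-6)
    print(f"{chips:5d} chips: {t * 1e3:.1f} ms per all-reduce")
```

In this model the bandwidth term flattens out near twice the payload divided by link speed, while the latency term keeps growing with chip count. Every chip waits for the slowest step, which is why predictable per-hop latency matters as much as raw bandwidth.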
Disaggregated prefill and decode is a technique gaining traction in production serving systems, including those running on TPUs. The idea is to route prefill requests (which are compute-intensive) to one pool of chips, and decode requests (which are memory-bandwidth-intensive) to another, with KV cache state transferred between them over the interconnect. This architecture lets you independently scale each phase based on actual demand patterns.
For this to work efficiently, the interconnect needs different characteristics than it does for training: lower latency for KV cache transfers, support for more irregular communication patterns, and the ability to handle variable-length payloads without the regular, predictable structure that all-reduce operations have. A chip designed with this serving topology in mind will allocate its ICI budget differently than one designed for gradient synchronization.
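A toy model of disaggregated serving shows where the interconnect enters the picture. Everything here is a hypothetical sketch: the `Request` and `DisaggregatedRouter` names, the least-loaded placement policy, the per-token KV size, and the link speed are all assumptions for illustration, not how any real TPU serving system works.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt_tokens: int


def kv_transfer_seconds(prompt_tokens: int, kv_bytes_per_token: int,
                        link_bytes_per_s: float,
                        link_latency_s: float = 0.0) -> float:
    """Time to ship a finished prefill's KV cache to a decode chip."""
    return link_latency_s + prompt_tokens * kv_bytes_per_token / link_bytes_per_s


class DisaggregatedRouter:
    """Assigns each request a prefill worker and a decode worker."""

    def __init__(self, prefill_workers: list[str], decode_workers: list[str]):
        self.prefill_load = {w: 0 for w in prefill_workers}  # queued prompt tokens
        self.decode_load = {w: 0 for w in decode_workers}    # resident KV tokens

    def route(self, req: Request) -> tuple[str, str]:
        # Pick the least-loaded worker in each pool (ties go to the first).
        prefill = min(self.prefill_load, key=self.prefill_load.get)
        decode = min(self.decode_load, key=self.decode_load.get)
        self.prefill_load[prefill] += req.prompt_tokens
        self.decode_load[decode] += req.prompt_tokens
        return prefill, decode


router = DisaggregatedRouter(["prefill-0", "prefill-1"], ["decode-0", "decode-1"])
req = Request("agent-step-17", prompt_tokens=100_000)
print(router.route(req))

# Assumed 320 KiB of KV cache per token and a 100 GB/s link.
seconds = kv_transfer_seconds(req.prompt_tokens, 327_680, 100e9)
print(f"KV handoff: {seconds * 1e3:.0f} ms")
```

Under these assumptions a 100,000-token prompt adds roughly a third of a second of handoff time before the first decoded token, which is why link latency and bandwidth for large, irregular transfers become first-order design parameters for a serving chip.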
Comparison with the NVIDIA Approach
NVIDIA has taken a different path: rather than splitting by chip, they have pushed toward chips that are extremely capable at both workloads and let software orchestration handle the differences. The H100 and B200 are general-purpose accelerators with high compute density and large HBM capacity, relying on frameworks like TensorRT-LLM and vLLM to extract inference efficiency through software.
NVIDIA has also moved toward inference-oriented hardware: the H200 is essentially an H100 with its memory upgraded from HBM3 to HBM3e, raising both the capacity and the memory bandwidth that inference workloads are bottlenecked on. But it remains a single chip family rather than a formal split.
Google’s approach of building purpose-built silicon for each workload type has the advantage of tighter optimization, but it introduces operational complexity. Cloud customers need to think about which chip type their workload maps to, and hybrid workloads that mix training and serving on the same hardware become harder to schedule. The upside is that purpose-built silicon can achieve efficiency levels that general-purpose chips structurally cannot.
For Google specifically, the economics point toward the split making sense. Google is simultaneously one of the largest trainers of foundation models and one of the largest serving operators. They run both Gemini training runs and the production API that handles Gemini requests at scale. Having the right chip for each use case compounds over billions of requests.
What This Means for Developers
For most developers building on Google Cloud, the practical implication is straightforward: when you reach for a TPU to run inference on a large model, you will be reaching for a different chip than when you run a fine-tuning job. The Cloud TPU documentation will eventually reflect per-chip guidance on which workloads belong where.
More interesting is the signal this sends about where Google sees the load shifting. The explicit framing around agentic workloads suggests that Google’s internal capacity planning shows a significant increase in long-context, multi-turn, tool-using inference relative to batch training. The chip roadmap follows the workload distribution.
For anyone building infrastructure on top of Google Cloud or thinking about where the TPU ecosystem is heading, the eighth generation is less about the headline performance numbers and more about the acknowledgment that training and inference have grown into separate hardware problems. That acknowledgment, built into silicon, is worth paying attention to.