Google's TPU Split: What Two Chips for the Agentic Era Actually Means
Source: hackernews
Google’s eighth-generation TPU announcement has an architectural detail embedded right in the product URL: two chips, TPU 8T and TPU 8P. That split is not a product marketing decision. It is an admission, made in silicon, that training workloads and agentic inference workloads have diverged far enough that a single chip cannot serve both efficiently.
How We Got Here
The TPU lineage is worth tracing because each generation reflects Google’s priorities at the time. The original TPU v1, deployed internally in 2015 and announced in 2016, was inference-only: 8-bit integer matrix multiplications on 28nm silicon at roughly 92 TOPS, built specifically to cut the cost of serving Google’s ranking models. It used a systolic array architecture, a grid of processing elements that pass operands and partial sums directly to their neighbors instead of reading and writing memory at every step, which proved extremely well suited to the dense matmuls at the core of neural network computation.
TPU v2 added training capability with HBM memory and bfloat16 floating point. That format, 1 sign bit, 8 exponent bits, and 7 mantissa bits, keeps the exponent range of float32 while giving up most of its precision. It was a deliberate bet that dynamic range mattered more for training stability than precision. The bet paid off: bfloat16 is now supported across NVIDIA, AMD, and Intel hardware.
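The range-versus-precision tradeoff can be seen by converting a float32 value to bfloat16 by hand. The sketch below is illustrative only: the function names are made up for this example, it uses round-to-nearest-even on the dropped 16 bits, and it does not handle NaN inputs.

```python
import struct


def float32_to_bfloat16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x (NaN inputs are not handled)."""
    # Reinterpret x as the 32 bits of an IEEE 754 float32.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    # Round to nearest, ties to even, on the 16 low bits that get dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bfloat16_bits_to_float(b: int) -> float:
    """Expand a bfloat16 bit pattern back to a Python float."""
    # bfloat16 is the top half of a float32, so pad with 16 zero bits.
    (x,) = struct.unpack(">f", struct.pack(">I", (b & 0xFFFF) << 16))
    return x


def to_bfloat16(x: float) -> float:
    """Round-trip x through bfloat16."""
    return bfloat16_bits_to_float(float32_to_bfloat16_bits(x))


if __name__ == "__main__":
    print(to_bfloat16(1.001))   # precision: only 7 mantissa bits survive
    print(to_bfloat16(3.0e38))  # range: still finite, as in float32
```

Values near 1.0 are spaced 2^-7 apart in bfloat16, so small differences vanish, while magnitudes near 3e38 remain representable. IEEE float16, by comparison, tops out around 65504.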
By TPU v4 in 2021, Google was building 4,096-chip pods whose inter-chip interconnect (ICI) links were routed through optical circuit switches, giving thousands of devices the high-bandwidth collective operations that transformer training needs for efficient tensor parallelism. Then TPU v5 split into two variants: v5e for cost-optimized training and serving, v5p for large pretraining runs. Trillium (v6e) followed in 2024 with roughly 4.7x the peak ML compute per chip compared to v5e, doubled HBM capacity, and a 67% improvement in energy efficiency.
The split into labeled T and P variants at generation 8 is the formalization of a trend that has been building since v5.
Training and Inference Are Not the Same Problem
Training is computationally regular. You process fixed-size batches, run forward and backward passes, update weights, repeat. The bottleneck is raw FLOP throughput and inter-chip bandwidth for gradient communication. A well-tuned training run can sustain over 50% of the hardware’s theoretical peak FLOPS, and the XLA compiler does solid work keeping systolic arrays fed through operation fusion and tiling.
Agentic inference is structurally different. An agent receives a prompt, generates a response, calls a tool, waits for results, continues generating, potentially spawns sub-agents. Each step is a separate forward pass. Independent agents are at different positions in their reasoning chains at any given moment, which limits effective batching. The metrics that matter shift from aggregate throughput to per-request latency: time to first token and inter-token latency for a live user or downstream agent.
Long context makes the memory situation acute. A model maintaining a million-token context window carries a KV cache, a key vector and a value vector per attention head per layer per token, that can reach tens of gigabytes per active sequence. During autoregressive decoding, each new token requires reading that entire cache. The binding constraint becomes memory bandwidth, not FLOP throughput. A chip optimized for training, with high peak FLOPS but moderate HBM bandwidth, will sit largely idle during decoding, waiting on memory reads.
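The arithmetic behind that claim is short enough to write down. The sketch below estimates KV cache size and the minimum time per decoded token if the whole cache has to be streamed from memory once. Every number in it is an assumption chosen for illustration (a hypothetical model with 32 layers, 4 grouped-query KV heads of dimension 128, bfloat16 cache entries, and a chip with 1 TB/s of memory bandwidth), not a specification of any real model or TPU.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to hold keys and values for one sequence."""
    # Factor of 2: one key vector and one value vector per head, layer, token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def min_decode_step_seconds(cache_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on time per token if the full cache is read once per step."""
    return cache_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Hypothetical model and chip, for illustration only.
    cache = kv_cache_bytes(num_layers=32, num_kv_heads=4, head_dim=128,
                           seq_len=1_000_000)
    print(cache)                                  # 65536000000 bytes, ~65.5 GB
    print(min_decode_step_seconds(cache, 1e12))   # ~0.0655 s per token
```

Under these assumptions a single million-token sequence needs about 65.5 GB of cache, and no amount of extra compute brings the per-token time below roughly 65 ms. Only more bandwidth, a smaller cache, or reading less of it per step does.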
The PagedAttention technique from vLLM addresses the software side of KV cache management by treating cache memory like virtual memory pages, avoiding fragmentation and enabling more efficient multiplexing across requests. Google’s own JetStream inference server takes a similar approach tuned for TPU memory hierarchies. But software optimizations only go so far when the hardware itself is optimized for a different workload shape. The 8P chip is presumably designed with a higher memory bandwidth ratio relative to peak FLOPS, more on-chip SRAM for hot cache pages, and whatever interconnect topology best serves semi-independent serving instances rather than tightly coupled training pods.
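The paging idea can be sketched as a block table that maps each sequence's logical token positions to fixed-size physical pages. This is a toy allocator written for this article, not vLLM's or JetStream's actual implementation. The class name, method names, and page size are all hypothetical, and it tracks only page bookkeeping, not the tensors themselves.

```python
class PagedKVCache:
    """Toy block-table allocator for KV cache pages (illustrative only)."""

    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # physical page ids
        self.block_tables = {}                    # seq id -> list of page ids
        self.lengths = {}                         # seq id -> tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for the next token; return (page id, offset)."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new page is needed only when the last one is full (or none exists).
        if length % self.page_size == 0:
            if not self.free_pages:
                raise MemoryError("no free KV cache pages")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return self.locate(seq_id, length)

    def locate(self, seq_id, position: int):
        """Translate a logical token position to (page id, offset in page)."""
        if not 0 <= position < self.lengths[seq_id]:
            raise IndexError("position out of range")
        page = self.block_tables[seq_id][position // self.page_size]
        return page, position % self.page_size

    def release(self, seq_id):
        """Return all pages of a finished sequence to the free list."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_pages=4, page_size=2)
    for _ in range(3):
        cache.append_token("agent-a")
    cache.append_token("agent-b")
    print(cache.block_tables, len(cache.free_pages))
```

Because pages are allocated on demand and need not be contiguous, a sequence that pauses on a tool call wastes at most one partially filled page, and its pages return to the shared pool as soon as it finishes.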
The Broader Industry Pattern
Google is not alone in making this split explicit. NVIDIA’s data center lineup has bifurcated similarly: H100, H200, and B200 at the training end with NVLink for multi-GPU scaling; L40S and inference-optimized Blackwell configurations at the serving end. The B200’s 192GB of HBM3e would have seemed extravagant a few years ago, but it makes sense once the target workloads include long-context agent serving with large KV caches.
What Google gains from co-designing its chips and compiler stack is the ability to expose the full hardware description to XLA. The compiler can emit instructions that exploit specific memory access patterns, custom collective primitives, or on-chip routing that would be invisible to a CUDA kernel targeting a general-purpose GPU. MaxText on the training side and JetStream on the inference side are built with TPU characteristics as first-class constraints rather than afterthoughts.
The ICI topology also differs from NVLink-based designs. ICI is laid out as a torus across hundreds or thousands of chips within a pod, which serves pipeline and tensor parallelism during training. For a serving chip like the 8P, the relevant topology is different: many relatively independent nodes serving separate requests, with less need for tight synchronization across thousands of devices. A torus optimized for training is wasteful overhead for serving; the 8P presumably uses a topology that reflects that.
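For readers unfamiliar with the term, a torus is a grid whose edges wrap around, so every chip has the same number of direct neighbors and the worst-case hop count is halved relative to a plain mesh. The sketch below computes neighbors and hop distance in an N-dimensional torus. The 4x4x4 shape is an assumption picked for illustration, not the layout of any particular TPU pod.

```python
def torus_neighbors(coord, dims):
    """Direct neighbors of a chip in a torus with the given dimensions."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            # The modulo is the wraparound link that makes the grid a torus.
            n[axis] = (n[axis] + step) % size
            neighbors.append(tuple(n))
    # Note: an axis of size 2 yields the same neighbor twice.
    return neighbors


def torus_hops(a, b, dims):
    """Minimum number of link hops between chips a and b."""
    # Along each axis, go whichever way around the ring is shorter.
    return sum(min(abs(x - y), size - abs(x - y))
               for x, y, size in zip(a, b, dims))


if __name__ == "__main__":
    dims = (4, 4, 4)  # hypothetical 64-chip slice
    print(torus_neighbors((0, 0, 0), dims))
    print(torus_hops((0, 0, 0), (3, 3, 3), dims))  # 3 hops, vs 9 in a mesh
```

This uniform, short-diameter structure is what makes ring-style collectives efficient during training. Independent serving replicas rarely exercise those long-range links, which is the sense in which the topology can be overbuilt for inference.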
The Bet Embedded in the Announcement
The “agentic era” framing in the announcement is a product signal as much as a technical one. Google is stating that the dominant inference pattern has shifted from short-context, single-turn completions toward multi-step, long-context, tool-using agents. That changes the hardware optimization target substantially: less about peak token throughput on fixed-length prompts, more about sustaining decent throughput across variable-length conversations with large, frequently updated KV caches.
From an infrastructure cost perspective, this matters because inference costs typically dwarf training costs over a model’s lifetime. Pretraining a frontier model can cost hundreds of millions of dollars; running it for millions of users over years costs considerably more. If inference is dominated by memory bandwidth rather than FLOP throughput, buying more FLOPS per dollar is the wrong optimization. Specialized silicon lets Google price both workloads more efficiently, which directly affects the margins on Gemini API access and Cloud TPU rentals.
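The standard way to reason about this is the roofline model: attainable throughput is the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity (FLOPs performed per byte moved). The sketch below applies it to a hypothetical chip. The peak, bandwidth, and intensity figures are assumptions for illustration, not measurements of any real hardware or workload.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float,
                     arithmetic_intensity: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


if __name__ == "__main__":
    peak = 1e15        # hypothetical: 1 PFLOP/s peak compute
    bandwidth = 2e12   # hypothetical: 2 TB/s memory bandwidth
    decode = 2.0       # assumed FLOPs per byte for small-batch decoding
    training = 1000.0  # assumed FLOPs per byte for large-batch training

    print(attainable_flops(peak, bandwidth, training))      # compute-bound
    print(attainable_flops(peak, bandwidth, decode))        # bandwidth-bound
    print(attainable_flops(2 * peak, bandwidth, decode))    # more FLOPS: no gain
    print(attainable_flops(peak, 2 * bandwidth, decode))    # more bandwidth: 2x
```

With these assumed numbers, the decoding workload reaches well under 1% of peak, doubling peak FLOPS changes nothing, and doubling bandwidth doubles throughput. That asymmetry is the economic case for a serving chip with a different balance of compute and memory bandwidth.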
For teams building on Google Cloud, the two-chip split is largely transparent. You declare your intent, training or serving, when provisioning TPU VMs, and the underlying hardware changes accordingly. The Cloud TPU documentation reflects these distinctions in the configuration options.
The more open question is whether the 8P serving chip will be competitive with NVIDIA’s inference-optimized Blackwell parts on the latency and memory bandwidth numbers that determine infrastructure vendor selection. Google has the advantage of vertical integration from chip through compiler through framework through API; NVIDIA has a much larger ecosystem of tooling, profiling, and optimization experience. The eighth generation is Google’s clearest statement yet that it intends to compete on both sides of that tradeoff simultaneously.