Session-Pinned KV Cache: What WebSockets Actually Change for Agent Latency

Source: openai

When OpenAI published their writeup on speeding up agentic workflows with WebSockets, the headline was straightforward: persistent connections reduce overhead in the Codex agent loop. But the mechanism behind the improvement is worth pulling apart, because it touches on something most API consumers never think about: what the server actually throws away between requests, and what it costs to reconstruct.

The Compounding Cost of Agentic Turns

A single-turn chat request is relatively cheap. You send a prompt, the model runs a forward pass over every token, generates a response, done. But agents don’t make single-turn requests. The Codex agent loop, like most coding agents, works by alternating between model output and tool execution: generate a plan, run a shell command, read the output, generate the next step, repeat. Each iteration appends to the conversation history. By turn five or ten, you’re sending thousands of tokens of prior context on every request just to give the model the state it needs.
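To make the growth concrete, here is a small sketch of how the resent context accumulates. The numbers (a 1,500-token starting prompt and 400 tokens appended per turn) are illustrative assumptions, not measurements from Codex.

```javascript
// Illustrative only: how much prior context an agent resends as turns accumulate.
// initialTokens and tokensPerTurn are assumed values, not Codex measurements.
function contextGrowth(initialTokens, tokensPerTurn, turns) {
  const perTurn = [];
  let cumulative = 0;
  for (let turn = 0; turn < turns; turn++) {
    // Each request carries the starting prompt plus everything appended so far.
    const promptTokens = initialTokens + turn * tokensPerTurn;
    cumulative += promptTokens;
    perTurn.push(promptTokens);
  }
  return { perTurn, cumulative };
}

const growth = contextGrowth(1500, 400, 10);
console.log(growth.perTurn[9]);  // prompt size on the tenth request
console.log(growth.cumulative);  // total prompt tokens sent across all ten requests
```

Under these assumptions the tenth request alone carries 5,100 prompt tokens, and the session as a whole sends 33,000, even though the final conversation contains only 5,100 distinct tokens.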

Without any caching, the model recomputes key and value vectors for every token in that prefix on every single call. That’s where much of the latency goes at scale. Each new token attends over every prior token, so processing a 4,000-token prompt from scratch means projecting keys and values for all 4,000 tokens at every layer, and the number of attention scores grows quadratically with prompt length. The KV cache exists to avoid repeating that work.

What the KV Cache Actually Is

In inference, the key-value cache is the materialized result of running the attention layers over your input prefix. For a context of N tokens, each transformer layer produces N key vectors and N value vectors. Computing these is expensive; reusing them across multiple forward passes (as in autoregressive generation) is what makes streaming generation fast at all.
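A rough size estimate shows why servers do not keep these tensors around casually. The sketch below multiplies out the standard dimensions; the model configuration is hypothetical and does not describe any specific OpenAI model.

```javascript
// Rough KV cache size: 2 tensors (K and V) per layer, one vector per token per KV head.
// The model dimensions below are hypothetical, chosen only to make the arithmetic concrete.
function kvCacheBytes({ layers, kvHeads, headDim, bytesPerValue }, tokens) {
  return 2 * layers * kvHeads * headDim * bytesPerValue * tokens;
}

const hypotheticalModel = { layers: 32, kvHeads: 8, headDim: 128, bytesPerValue: 2 };
const bytes = kvCacheBytes(hypotheticalModel, 4000);
console.log((bytes / 2 ** 20).toFixed(1) + " MiB");
```

For this assumed configuration, a 4,000-token context occupies 500 MiB of accelerator memory per session, which is the resource the server has to budget when it decides what to keep.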

The problem for agentic APIs is that in a traditional HTTP model, each request is stateless. The server receives your full prompt, runs attention over all of it, generates tokens, and then discards the intermediate KV tensors. On your next request, it starts over. Prompt caching, as implemented in OpenAI’s API and Anthropic’s cache_control feature, addresses this by storing KV cache entries keyed to a hash of the token prefix. If your next request starts with the same prefix, the server can retrieve the cached KV tensors and skip recomputation.
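The lookup described above can be sketched as a map from prefix hashes to cached state. This is a toy model under stated assumptions: the class, its methods, and the token-by-token granularity are hypothetical, and production systems typically cache in larger blocks.

```javascript
const { createHash } = require("node:crypto");

// Toy prefix cache: maps a hash of the first n tokens to a stand-in for their KV tensors.
// Real servers cache at coarser granularity; every name here is hypothetical.
class PrefixCache {
  constructor() {
    this.entries = new Map();
  }

  static key(tokens) {
    return createHash("sha256").update(tokens.join(",")).digest("hex");
  }

  // Record that KV state exists for this exact token prefix.
  store(tokens) {
    this.entries.set(PrefixCache.key(tokens), { length: tokens.length });
  }

  // Return the length of the longest cached prefix of `tokens` (0 on a miss).
  longestHit(tokens) {
    for (let n = tokens.length; n > 0; n--) {
      if (this.entries.has(PrefixCache.key(tokens.slice(0, n)))) return n;
    }
    return 0;
  }
}

const cache = new PrefixCache();
const stablePrefix = [101, 7, 7, 42];        // e.g. a tokenized system prompt
cache.store(stablePrefix);

const turnTwo = [...stablePrefix, 9, 13, 5]; // same prefix plus a new conversation tail
const hit = cache.longestHit(turnTwo);
console.log(`reused ${hit} tokens, recomputed ${turnTwo.length - hit}`);
```

Note that the client still has to send all seven tokens: the server can only find the entry by hashing the prompt it receives.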

This works well for static system prompts. It works less well for the dynamic, ever-growing conversation history of an agent loop, where the prefix changes on every turn by definition. You can still cache the stable early portion (system prompt, initial context), but the tail of the conversation, which is often the longest part by mid-session, gets recomputed every time.

What Connection-Scoped Caching Changes

The WebSocket approach shifts the caching model from hash-based prefix matching to connection identity. When the Codex client opens a WebSocket to the Responses API, the server can associate a KV cache with that specific connection. Each subsequent request over that connection arrives with a direct pointer to the cached state from the previous turn, rather than requiring the server to hash the prompt, look up a cache table, and reconstruct partial state.

This is a meaningful architectural difference. With hash-based caching, the server identifies a cache entry from the prompt itself, so the client has to resend the full prefix on every request. With connection-scoped caching, the server only has to process a delta: the new tokens appended since the last turn are the only ones that need full attention computation. The prefix KV tensors stay pinned in memory, associated with the open connection, and the server extends them incrementally.
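A minimal sketch of that bookkeeping, assuming the server keys its state by connection: the class and field names are hypothetical, and `cachedTokens` is only a counter standing in for pinned KV tensors. How OpenAI's servers actually implement this is not described in the source.

```javascript
// Toy model of connection-scoped caching: state is looked up by connection, not by prompt hash.
// `cachedTokens` stands in for pinned KV tensors; all names are hypothetical.
class ConnectionCache {
  constructor() {
    this.byConnection = new Map();
  }

  // Process one turn and report how many tokens were reused versus freshly computed.
  handleTurn(connectionId, newTokens) {
    const state = this.byConnection.get(connectionId) ?? { cachedTokens: 0 };
    const reused = state.cachedTokens;
    state.cachedTokens += newTokens.length; // extend the pinned prefix in place
    this.byConnection.set(connectionId, state);
    return { reused, computed: newTokens.length };
  }

  // Closing the socket releases the pinned state.
  close(connectionId) {
    this.byConnection.delete(connectionId);
  }
}

const server = new ConnectionCache();
const first = server.handleTurn("conn-1", new Array(1500).fill(0));
const second = server.handleTurn("conn-1", new Array(400).fill(0));
console.log(first);  // the whole prompt is computed on the first turn
console.log(second); // only the 400 appended tokens are computed
```

The design cost is visible in `close`: the cached state lives exactly as long as the connection, so a dropped socket means starting from zero.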

The result is that time-to-first-token (TTFT) in an agent loop drops significantly. Instead of paying O(N) attention cost over the full conversation history on every turn, you pay O(delta) for just the new input tokens, plus retrieval cost for the existing cache. OpenAI’s post describes this in the context of Codex, where multi-step coding tasks involve long, stable contexts that accumulate tool outputs over time.

The HTTP Overhead Is Also Real

Beyond the KV cache story, persistent connections eliminate per-request TCP and TLS overhead. A fresh HTTPS connection requires a TCP three-way handshake plus a TLS handshake, which together cost two round trips with TLS 1.3 (three with TLS 1.2) before the first byte of payload can be sent. Depending on geography, that typically adds 100-300ms. For single large requests, this is a small fraction of total latency. For an agent that makes twenty short tool-augmented turns in sequence without reusing connections, it adds up.
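As a back-of-the-envelope check, the function below totals the setup cost over a session. It assumes one round trip for TCP plus one for TLS 1.3, a 50ms round-trip time, and a client that opens a new connection for every request; clients with HTTP keep-alive would land somewhere in between, and the extra round trip for the WebSocket upgrade is ignored.

```javascript
// Connection setup cost for an agent session, in milliseconds.
// Assumes 1 round trip for TCP plus `tlsRoundTrips` for TLS (1 for TLS 1.3, 2 for TLS 1.2),
// and no connection reuse in the per-request case. All inputs are illustrative.
function handshakeOverheadMs({ rttMs, tlsRoundTrips, turns, persistent }) {
  const perConnection = (1 + tlsRoundTrips) * rttMs;
  const connections = persistent ? 1 : turns;
  return perConnection * connections;
}

const session = { rttMs: 50, tlsRoundTrips: 1, turns: 20 };
const perRequest = handshakeOverheadMs({ ...session, persistent: false });
const pinned = handshakeOverheadMs({ ...session, persistent: true });
console.log(perRequest, pinned);
```

With these inputs, twenty fresh connections spend 2,000ms on handshakes, against 100ms for a single persistent one.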

WebSockets pay that connection establishment cost once, at session open. All subsequent messages use the already-established connection. For Codex running inside a cloud environment close to OpenAI’s infrastructure, this is a modest gain. For users running agents from regional endpoints or through NAT layers, the difference is more pronounced.

The WebSocket protocol also has lower framing overhead than HTTP/1.1. Each HTTP request carries a full set of headers; a WebSocket frame adds as little as 2 bytes of header for small server-to-client messages, or 6 bytes client-to-server, because client frames carry a 4-byte masking key. For the tool result payloads that flow back through the Codex loop (file contents, shell output, test results), this is negligible, but it contributes to a cleaner latency profile at high message rates.
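The frame header sizes follow directly from the layout in RFC 6455, and can be computed in a few lines. The function name is ours; the byte counts come from the specification.

```javascript
// WebSocket frame header size per RFC 6455: 2 base bytes, an extended length field
// for larger payloads, and a 4-byte masking key on client-to-server frames.
function wsHeaderBytes(payloadLength, masked) {
  let size = 2;
  if (payloadLength > 65535) size += 8;    // 64-bit extended payload length
  else if (payloadLength > 125) size += 2; // 16-bit extended payload length
  if (masked) size += 4;                   // masking key, required from clients
  return size;
}

console.log(wsHeaderBytes(100, false));  // small server-to-client frame
console.log(wsHeaderBytes(100, true));   // same payload, client-to-server
console.log(wsHeaderBytes(20000, true)); // a 20 KB tool result sent by the client
```

Even the largest case tops out at 14 bytes, compared with the hundreds of bytes of headers on a typical HTTP/1.1 request.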

How This Compares to Anthropic’s Approach

Anthropic’s prompt caching uses explicit cache_control breakpoints in the message structure. You mark specific message boundaries as cacheable, and Anthropic’s servers store KV tensors at those boundaries. The cache has a five-minute TTL for standard caching and up to an hour for extended caching on certain tiers.

This puts control in the developer’s hands. You decide what to cache, which means you can cache a long static system prompt, a retrieved document corpus, or a set of tool definitions without caching the dynamic conversation tail. The tradeoff is that you have to think about it explicitly and structure your prompts accordingly.
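As an illustration, a request body with a single breakpoint on the system prompt might be built like this. The helper name, the prompt text, and the model ID placeholder are assumptions for the sketch, and no network call is made; only the `cache_control` marker itself comes from Anthropic's API.

```javascript
// Request body sketch for Anthropic's Messages API with a cache breakpoint on the system prompt.
// The model ID and prompt text are placeholders; this only builds the JSON payload.
function buildCachedRequest(stableContext, userQuestion) {
  return {
    model: "MODEL_ID_PLACEHOLDER",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: stableContext,
        cache_control: { type: "ephemeral" }, // the prefix up to this block is cacheable
      },
    ],
    // The dynamic part stays outside the cached prefix.
    messages: [{ role: "user", content: userQuestion }],
  };
}

const body = buildCachedRequest(
  "You are a coding agent. <long repository summary goes here>",
  "Why does the integration test fail?"
);
console.log(JSON.stringify(body, null, 2));
```

Only the marked block is a candidate for reuse, so the varying question can change freely without invalidating the cached context. Very short prompts may fall below the provider's minimum cacheable length, in which case the marker has no effect.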

OpenAI’s connection-scoped caching is more automatic. The connection is the unit of caching, and the server manages what to retain. This is simpler to use (you open a WebSocket and it just works) but gives you less control over cache boundaries. If your agent loop produces diverse tool outputs that invalidate useful prefix caches, you may not be able to fine-tune what gets retained.

Neither approach is strictly better. For a coding agent with long, stable context growth like Codex, connection-scoped caching fits naturally because each turn genuinely extends the prior state. For a retrieval-augmented agent that loads a large document corpus on each request but varies the question, explicit cache breakpoints are more useful.

The Responses API Context

The Responses API, which OpenAI launched in early 2025, was designed explicitly around agentic use cases. Unlike the Chat Completions API, it has first-class support for tool use, streaming, and now persistent connections. The API can manage tool execution on the server side for built-in tools like web search and code execution, which means the agent loop itself can run closer to the model.

WebSocket support fits this architecture well. If the server is managing both model calls and tool execution within a session, maintaining connection-scoped state is a natural extension. The client opens a connection, sends a task, and receives a stream of events (tool calls, intermediate results, final output) over the same connection. The KV cache spans the entire session because the session is the unit of work.

This is closer to how the Realtime API works for audio, where persistent bidirectional connections are mandatory because audio input and output are inherently streaming. Bringing the same connection model to text-based agents makes architectural sense.

What This Means for Building Agents

If you’re building an agent that makes sequential API calls with growing context, the practical guidance is straightforward. For OpenAI, using the Responses API over WebSockets rather than issuing a separate HTTPS request for each turn should reduce TTFT on later turns of a long session. The benefit scales with session length: short two- or three-turn interactions won’t see much difference, but agents that run for twenty or more turns with substantial tool output will.

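Here’s what the connection setup looks like conceptually. Note that the URL, beta header, and event names in this snippet are those of the Realtime API, used here to show the general shape; consult OpenAI’s documentation for the exact Responses API WebSocket endpoint and message schema: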

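// Node.js: the `ws` package accepts custom headers; the browser WebSocket does not.
const WebSocket = require('ws');

// Placeholder instructions; substitute your agent's real system prompt.
const systemPrompt = 'You are a coding agent.';

// Note: this URL and the OpenAI-Beta header belong to the Realtime API.
const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-4o', {
  headers: {
    'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
    'OpenAI-Beta': 'realtime=v1'
  }
});

ws.on('open', () => {
  // Session is established; KV cache will be maintained per-connection
  ws.send(JSON.stringify({
    type: 'response.create',
    response: {
      modalities: ['text'],
      instructions: systemPrompt
    }
  }));
});

ws.on('message', (data) => {
  // `ws` delivers a Buffer, so decode it before parsing.
  const event = JSON.parse(data.toString());
  // Handle streaming events: response.text.delta, tool_calls, etc.
  console.log(event.type);
});

// Surface connection failures (bad key, network drop) instead of failing silently.
ws.on('error', (err) => console.error('socket error:', err));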

The key difference from making a separate REST call per turn is that you keep this connection open across all turns. Each new user message or tool result goes over the same socket, and the server’s cached state from prior turns remains available without the client resending the full history for reprocessing.

For Anthropic users, the equivalent optimization is using cache_control markers on your system prompt and any stable retrieved context. The extended cache (up to one hour on supported tiers) is worth enabling for long-running coding sessions where the same codebase context is re-sent on each turn.

The Broader Direction

What this architectural move signals is that the distinction between stateless inference APIs and stateful agent runtimes is shrinking. Traditional LLM APIs were designed around the request-response model of the web: send everything, get a response, repeat. That model is friction for agents, which are fundamentally stateful processes.

Connection-scoped KV caching is one step toward treating an agent session as a first-class object on the server side. If the server maintains not just the cache but also the tool state, execution context, and session history, the API starts looking more like a process boundary than a stateless endpoint. OpenAI’s Codex implementation, where the server orchestrates multi-step coding tasks with WebSocket streaming, is a concrete instance of that shift.

The latency numbers matter for user experience, but the architectural shift matters more for what kinds of agents become practical to build. Agents that need low turn-to-turn latency to feel responsive, like interactive pair programming tools, were genuinely hard to build on request-per-turn REST APIs. WebSocket sessions with pinned KV state make them viable.
