
Session Affinity and KV Cache Locality: What WebSockets Actually Change for Agent Loops

Source: openai

When you run an agent that calls a model dozens of times in a loop, the latency profile looks nothing like a single chat turn. Each round trip compounds: the model reads accumulated context, produces a tool call, your code executes the tool, and then the full updated context goes back to the model. The bottleneck isn’t always generation speed. Often it’s the prefill phase, where the server has to compute attention over every token in the conversation before it can emit a single output token.

OpenAI’s announcement about WebSocket support in the Responses API addresses this directly, using the Codex agent as the motivating case. The short version: a persistent WebSocket connection gives the server a reason to keep your session’s KV cache resident in GPU memory between turns, rather than evicting it the moment your HTTP response closes.

Why prefill is expensive for agents

The KV cache stores the key and value vectors that each attention layer computes for every input token. When a transformer processes a 10,000-token context, it computes these vectors for every token at every layer. If you make a second request with that same context plus 200 new tokens, recomputing the original 10,000 tokens from scratch is wasteful, because that work was already done. With a warm KV cache, the server only computes queries, keys, and values for the 200 new tokens, which then attend to the cached keys and values for the rest.
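To make the reuse concrete, here is a minimal sketch of causal attention with and without cached keys and values. It assumes a single head and a single layer, uses random vectors in place of real embeddings, and shrinks the sizes so it runs in plain Python; the names `project` and `attend` are illustrative and not taken from any inference engine.

```python
import math
import random

def project(tokens, weights):
    """Multiply each token vector by a projection matrix."""
    return [[sum(x * w for x, w in zip(tok, col)) for col in zip(*weights)] for tok in tokens]

def attend(q, k, v):
    """Causal single-head attention; q holds the last len(q) positions."""
    offset = len(k) - len(q)
    scale = math.sqrt(len(q[0]))
    out = []
    for i, q_row in enumerate(q):
        visible = offset + i + 1  # query i may see keys 0 .. offset + i
        scores = [sum(a * b for a, b in zip(q_row, k[j])) / scale for j in range(visible)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights)) / total
                    for c in range(len(v[0]))])
    return out

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

d = 8
w_q, w_k, w_v = rand(d, d), rand(d, d), rand(d, d)
old_tokens = rand(40, d)  # stands in for the 10,000-token context
new_tokens = rand(4, d)   # stands in for the 200 new tokens

# Cold path: project and attend over every token, keep the last 4 outputs.
everything = old_tokens + new_tokens
cold = attend(project(everything, w_q),
              project(everything, w_k),
              project(everything, w_v))[-4:]

# Warm path: keys and values for the old tokens come from the cache,
# so only the 4 new tokens are projected.
k_cache, v_cache = project(old_tokens, w_k), project(old_tokens, w_v)
warm = attend(project(new_tokens, w_q),
              k_cache + project(new_tokens, w_k),
              v_cache + project(new_tokens, w_v))

outputs_match = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
                    for row_c, row_w in zip(cold, warm)
                    for a, b in zip(row_c, row_w))
```

Both paths produce the same outputs for the new positions. The warm path still reads every cached key and value, but it skips the projections and attention rows for the old tokens, which is where the prefill savings come from.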

For a single-turn chat interface, this is a nice optimization. For an agent loop, it’s the difference between viable and sluggish. Codex’s loop looks roughly like:

for each iteration:
    build context (system prompt + tool schemas + history + current observation)
    call model → get tool invocation
    execute tool (read file, run shell command, write patch)
    append result to history
    repeat

The system prompt and tool schemas are static. The history grows by a few hundred tokens per turn. Without cache reuse, every iteration pays the full prefill cost of the accumulated context. That context grows linearly with the number of turns, and because attention cost is quadratic in sequence length, the compute per prefill grows faster than linearly.
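A rough way to see how this adds up is to count prefilled tokens across a whole run. The sketch below assumes a 4,000-token static prefix, 300 new tokens per turn, and 40 turns; all three numbers are illustrative, and the function name is made up for this example.

```python
def prefill_tokens(static_tokens, tokens_per_turn, turns, cached):
    """Total tokens the server must prefill across one agent run."""
    total = 0
    for turn in range(turns):
        context = static_tokens + turn * tokens_per_turn
        if cached and turn > 0:
            total += tokens_per_turn  # only the delta is new
        else:
            total += context          # the whole context is recomputed
    return total

cold = prefill_tokens(4000, 300, 40, cached=False)  # 394000
warm = prefill_tokens(4000, 300, 40, cached=True)   # 15700
```

Under these assumptions the uncached run prefills about 25 times as many tokens. This counts tokens only; since attention cost per prefill is superlinear in context length, the gap in actual compute would be wider.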

OpenAI already provides automatic server-side prompt caching that kicks in for repeated prefixes. But that cache is best-effort across a fleet: your cached KV state only exists on whichever machine happened to compute it, and nothing guarantees that the next request lands there. If you get routed to a different machine, you pay full prefill again.

What a WebSocket connection actually pins

A WebSocket upgrade establishes a persistent TCP connection, so every message in the session travels over the same long-lived link. For the Responses API, this lets OpenAI provide session affinity: all turns in your agent loop can be served by the same inference process. That process can keep your accumulated KV cache in GPU memory for the duration of the connection, with no risk of a load balancer routing a later turn to a machine that has never seen your context.

The connection-scoped cache goes beyond prefix caching. It doesn’t just cache a static system prompt prefix — it caches the full KV state of the conversation as it grows, turn by turn. Each new turn only needs to prefill the delta.

Establishing the connection looks similar to OpenAI’s Realtime API pattern:

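import asyncio
import json
import os

import websockets

# SYSTEM_PROMPT, TOOL_SCHEMAS, initial_observation, extract_tool_call and
# execute_tool are placeholders your agent supplies. The event and field
# names below are illustrative, not a verified schema.

async def run_agent_loop():
    uri = "wss://api.openai.com/v1/responses"
    api_key = os.environ["OPENAI_API_KEY"]
    headers = {"Authorization": f"Bearer {api_key}"}

    # websockets >= 14 takes additional_headers; older releases call it extra_headers
    async with websockets.connect(uri, additional_headers=headers) as ws:
        # initial context sent once
        await ws.send(json.dumps({
            "model": "gpt-4o",
            "system": SYSTEM_PROMPT,  # large, static, cached for the session
            "tools": TOOL_SCHEMAS,
            "input": initial_observation
        }))

        async for message in ws:
            event = json.loads(message)
            if event.get("type") == "response.done":
                tool_call = extract_tool_call(event)
                if tool_call is None:
                    break  # assumes the helper returns None when the model is finished
                result = execute_tool(tool_call)

                # only the delta goes over the wire
                await ws.send(json.dumps({
                    "type": "conversation.item.create",
                    "item": {"role": "tool", "content": result}
                }))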

The critical shift here: the large static portions of context (system prompt, tool schemas) travel once over the wire and get computed once. Subsequent turns send only the new observations and receive only the new output. Compare this to stateless use of the HTTP Responses API, where every request carries the full conversation history in the request body and any request that misses the prefix cache starts with a cold prefill.

The connection overhead trade-off

HTTPS carries connection overhead: DNS resolution, TCP handshake, TLS negotiation. For GPT-4o-class models, this overhead is small relative to generation time on long outputs. But for an agent doing rapid tool calls, with short outputs and immediate round-trips, it adds up whenever a request cannot reuse a pooled keep-alive connection. A 150ms connection setup cost on every turn, across 40 iterations, is six seconds of pure overhead that has nothing to do with model intelligence.

WebSocket amortizes that overhead across the entire session. The handshake happens once. Every subsequent turn pays only the network round-trip and the actual inference time.
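The arithmetic behind that comparison is small enough to write down. This sketch assumes a fixed setup cost per new connection and ignores everything else (network round-trips, inference time); the function name is made up for illustration.

```python
def handshake_overhead_s(setup_ms, turns, persistent):
    """Seconds spent on connection setup across an agent run."""
    connections = 1 if persistent else turns
    return connections * setup_ms / 1000

per_request = handshake_overhead_s(150, 40, persistent=False)  # 6.0
websocket = handshake_overhead_s(150, 40, persistent=True)     # 0.15
```

With one connection per turn the run spends 6 seconds on setup; with a single persistent connection it spends 0.15 seconds, regardless of how many turns follow.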

There’s also a bandwidth dimension. Sending a 15,000-token conversation in every HTTP request means re-serializing and transmitting tokens that the server already has. With connection-scoped state, only the incremental content moves.

Comparing with Anthropic’s approach

Anthropic’s prompt caching uses explicit cache_control breakpoints. You mark specific blocks in your request as cacheable, and Claude caches the KV state up to that breakpoint for five minutes by default. The cache key is deterministic: the same prefix bytes map to the same cached state for as long as the entry is alive.

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": large_static_context,
                "cache_control": {"type": "ephemeral"}  # cache this prefix
            },
            {
                "type": "text",
                "text": current_observation  # not cached, changes each turn
            }
        ]
    }
]

This approach is stateless from the server’s perspective: you can still be routed to any server, but Anthropic’s infrastructure ensures the cached prefix is available fleet-wide. The trade-off is that it requires you to structure your requests explicitly around the cache boundary, and the default five-minute TTL, which is refreshed each time the cached prefix is used, means long-running agents need to keep making requests before it expires.
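A small model of that expiry behaviour shows why slow tool calls matter. The sketch assumes a 300-second TTL that restarts on every request (a hit refreshes the entry, a miss rewrites it) and treats the first request as a miss; it is a simplification for reasoning about agent pacing, not a description of Anthropic’s implementation.

```python
def cache_hits(request_times_s, ttl_s=300.0):
    """For each request time, report whether the cached prefix was still alive."""
    hits = []
    last = None
    for t in request_times_s:
        hits.append(last is not None and t - last < ttl_s)
        last = t  # a hit refreshes the entry; a miss rewrites it
    return hits

# A 320-second pause between the third and fourth request lets the entry expire.
pattern = cache_hits([0, 60, 200, 520, 600])
# → [False, True, True, False, True]
```

An agent that sometimes waits on long-running tools would, under this model, pay a full prefill after any gap longer than the TTL, even though every other turn is a hit.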

OpenAI’s connection-scoped approach is more aggressive: the server owns the full conversational state, not just a cached prefix. This is simpler to use (no cache_control annotations), but it introduces server affinity. If your WebSocket connection drops, you lose the cached state and pay full prefill on reconnect. For an agent running for hours with a deep history, that reconnect cost can be substantial.

What this means for agent architecture

The practical implication is that connection lifetime becomes a first-class concern for agent design. With HTTP, your agent is stateless: any request can be retried, rerouted, or resumed from a checkpoint without penalty beyond the current request’s prefill cost. With WebSocket connection-scoped caching, your session is stateful on the server. Reconnections are expensive, and you need explicit handling for connection failures mid-loop.
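One way to handle a mid-loop failure is to keep a local copy of the history and replay it when the connection has to be rebuilt. The sketch below is transport-agnostic: `connect` is a hypothetical factory returning any object with a `send(items)` method, and `FlakyConnection` is a test double, not a real client. It assumes a failed send surfaces as `ConnectionError`.

```python
class AgentSession:
    """Sends deltas on a live connection and replays history after a drop."""

    def __init__(self, connect):
        self._connect = connect  # hypothetical factory for a new connection
        self._conn = None
        self.history = []
        self.reconnects = 0

    def send(self, item):
        self.history.append(item)
        if self._conn is None:
            # First connection: the server has no state yet.
            self._conn = self._connect()
            return self._conn.send(list(self.history))
        try:
            return self._conn.send([item])  # warm path: delta only
        except ConnectionError:
            # Server-side state is gone; rebuild it from the local copy.
            self.reconnects += 1
            self._conn = self._connect()
            return self._conn.send(list(self.history))


class FlakyConnection:
    """Test double that fails on a chosen call and reports items received."""

    def __init__(self, fail_on_call=None):
        self.calls = 0
        self.fail_on_call = fail_on_call

    def send(self, items):
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise ConnectionError("socket closed")
        return len(items)


pool = iter([FlakyConnection(fail_on_call=3), FlakyConnection()])
session = AgentSession(connect=lambda: next(pool))
sent = [session.send(f"observation {i}") for i in range(4)]
# sent == [1, 1, 3, 1]: the third turn replays the full history after the drop
```

The replayed turn is the expensive one: it carries the whole history and triggers a full prefill on the new connection. A real agent would likely also cap retries and checkpoint the history to disk.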

For Codex-style workloads (focused, bounded coding tasks that complete in minutes) this trade-off is straightforward. The session is short enough that reconnection risk is low, and the latency wins compound meaningfully across dozens of tight tool-call iterations. For longer autonomous agents that might run for hours, the right approach might be a hybrid: use WebSocket connections for bursts of rapid tool calls, and fall back to HTTP with automatic prompt caching for slower phases where the connection overhead is less significant.

The Responses API’s WebSocket support also aligns with a broader shift in how OpenAI is positioning the API for agentic use. The Responses API itself is structured around multi-turn sessions with built-in tool invocation, rather than the stateless completion model of the original Chat Completions API. WebSocket support is the natural infrastructure complement to that design: if the API is intended for persistent sessions with evolving context, persistent connections with session-local caching are the right transport.

The underlying insight is that agentic workloads aren’t just many independent chat turns. They’re computation pipelines where context accumulates, tools produce output that feeds back into context, and the model’s job is to orchestrate that pipeline over many iterations. Infrastructure that treats each iteration as an independent event imposes overhead that doesn’t exist in the workload’s logic. WebSocket connection-scoped caching closes that gap.
