Session-Pinned KV Cache: What WebSockets Actually Change for Agent Loops
Source: openai
The agent loop problem has always been a throughput problem dressed up as a latency problem. When you’re running something like Codex, each turn in the loop is a full round trip: build the context window, send it over HTTP, wait for the model to reconstruct its key-value cache from scratch, generate a response, parse tool calls, run the tools, and start again. For a single-turn chat interface that rhythm is fine. For an agent that might take 15 to 30 tool-calling steps to complete a task, it adds up fast.
OpenAI’s recent writeup on WebSockets in the Responses API makes this concrete using their Codex agent as the case study. The core insight is not that WebSockets are faster than HTTP in some abstract sense, but that a persistent connection gives the server a stable identity to hang state off. Specifically, the KV cache.
What the KV Cache Is and Why It Gets Wasted
Transformer inference works by computing attention over sequences of tokens. For each token in the input, the model computes key and value vectors across every attention head in every layer. When you generate the next token, you need those same vectors again, so instead of recomputing them, the runtime stores them in memory: the KV cache.
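To put a rough number on that saving, the sketch below counts how many per-token key/value computations one request needs with and without the cache. It is a counting model only, per layer and per head, and the function names are made up for this illustration.

```python
def kv_projections_without_cache(prompt_len: int, new_tokens: int) -> int:
    """K/V computations if every decode step reprocesses the whole sequence."""
    total = 0
    for step in range(new_tokens):
        # Decode step `step` sees the prompt plus every token generated so far.
        total += prompt_len + step
    return total


def kv_projections_with_cache(prompt_len: int, new_tokens: int) -> int:
    """K/V computations when earlier tokens' vectors stay in memory."""
    # One prefill pass over the prompt, then one new token's K/V per later step.
    # The final generated token is never fed back in, so it adds nothing.
    return prompt_len + max(new_tokens - 1, 0)


print(kv_projections_without_cache(8000, 500))  # 4124750
print(kv_projections_with_cache(8000, 500))     # 8499
```

The gap widens with both prompt length and output length, which is why no serious inference runtime decodes without a cache.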
Within a single request the cache saves a lot of compute. The problem is that with stateless HTTP, every new request starts cold. Even if you send the same 8,000-token system prompt and accumulated conversation history on turn 12 of your agent loop, the model infrastructure has no guarantee that the previous computation is still sitting warm in accelerator memory. Some providers implement prompt caching, which stores cached prefixes keyed by a hash of the token sequence, but this is a best-effort optimization and it requires your prefix to match exactly. Any variation inside the prefix, including inserting tool results into the middle of the context, breaks the match from that point onward, and everything after the change is processed as a cold start.
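A toy version of hash-keyed prefix caching makes the exact-match requirement visible. This is a minimal sketch, not any provider's implementation: the block size, the hashing scheme, and the names block_keys and cached_prefix_len are all assumptions chosen for illustration.

```python
import hashlib

BLOCK = 4  # tokens per cached block; hypothetical, real systems use far larger blocks


def block_keys(tokens: list[int]) -> list[str]:
    """One key per full block; each key hashes the whole prefix up to that block."""
    keys = []
    h = hashlib.sha256()
    for start in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(repr(tokens[start:start + BLOCK]).encode())
        keys.append(h.copy().hexdigest())
    return keys


def cached_prefix_len(tokens: list[int], cache: set[str]) -> int:
    """Number of leading tokens whose blocks are already cached."""
    hit = 0
    for key in block_keys(tokens):
        if key not in cache:
            break  # the first miss ends the reusable prefix
        hit += BLOCK
    return hit


previous_turn = list(range(16))
cache = set(block_keys(previous_turn))

appended = previous_turn + [100, 101, 102, 103]
edited = previous_turn[:5] + [999] + previous_turn[5:]

print(cached_prefix_len(appended, cache))  # 16
print(cached_prefix_len(edited, cache))    # 4
```

Because each key covers the entire prefix, a single changed token early in the context invalidates every block after it, while a pure append leaves the earlier blocks reusable, provided they have not been evicted in the meantime.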
Connection-scoped caching is different. When the server ties KV cache entries to a live WebSocket session rather than to a content hash, it can maintain and incrementally extend that cache across every turn of the conversation. Turn 12 doesn’t have to reprocess turns 1 through 11; it picks up from where turn 11 left off and processes only the new tokens. This is closer to how you’d implement stateful inference yourself if you had direct access to the GPU memory.
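As a rough mental model, connection-scoped caching amounts to keying the cache by connection rather than by content. The sketch below uses a plain token list as a stand-in for real KV tensors; SessionCache, Server, and the connection IDs are hypothetical names, and nothing here reflects how OpenAI actually structures its serving stack.

```python
from dataclasses import dataclass, field


@dataclass
class SessionCache:
    """Stand-in for the KV entries pinned to one connection."""
    tokens: list[int] = field(default_factory=list)

    def extend(self, new_tokens: list[int]) -> int:
        """Append one turn's new tokens; return how many needed prefill."""
        self.tokens.extend(new_tokens)
        return len(new_tokens)


class Server:
    def __init__(self) -> None:
        self.sessions: dict[str, SessionCache] = {}

    def handle_turn(self, connection_id: str, new_tokens: list[int]) -> int:
        # The cache is looked up by connection identity, not by content hash.
        cache = self.sessions.setdefault(connection_id, SessionCache())
        return cache.extend(new_tokens)

    def close(self, connection_id: str) -> None:
        # Closing the connection releases its cache; the next turn starts cold.
        self.sessions.pop(connection_id, None)


server = Server()
print(server.handle_turn("conn-1", list(range(8000))))  # 8000
print(server.handle_turn("conn-1", list(range(500))))   # 500
print(len(server.sessions["conn-1"].tokens))            # 8500
```

The second turn prefills only its 500 new tokens even though the session now holds 8,500, which is the behavior described above.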
The HTTP Tax on Long Agentic Chains
To see why this matters in practice, consider what a Codex-style coding agent actually does. A representative task might look like:
1. Receive a high-level instruction
2. Read relevant files (tool call)
3. Reason about the structure, possibly read more files
4. Write proposed changes (tool call)
5. Run tests (tool call)
6. Inspect test output, iterate
Each numbered step involves a separate model call, and each call, under HTTP, requires transmitting the full conversation context accumulated so far. By step 6 you might be sending 20,000 tokens of context just to get a response that adds 500 new tokens. With connection-scoped caching over WebSockets, steps 2 through 6 only need to transmit the incremental additions, chiefly the new tool results. The model's own previous reply and the rest of the cached context are already on the server.
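The difference in transmitted tokens is easy to work through. The sketch below assumes the server already holds the model's own replies, so a session client uploads only tool results. The starting context of 2,000 tokens and the per-step sizes are invented so that the context reaches the 20,000 tokens mentioned above by step 6; they are not measurements.

```python
def tokens_sent(initial: int, steps: list[tuple[int, int]]) -> tuple[int, int]:
    """Return (full_resend_total, incremental_total) over all model calls.

    Each step is (reply_tokens, tool_result_tokens) added after a call.
    """
    context = initial
    full = initial         # the first call ships the whole context either way
    incremental = initial
    for reply_tokens, tool_tokens in steps:
        context += reply_tokens + tool_tokens
        full += context             # stateless: resend everything accumulated
        incremental += tool_tokens  # session: upload only the new tool results
    return full, incremental


# Hypothetical task: 2,000-token start, then five steps of 500 + 3,100 tokens.
full, incremental = tokens_sent(2000, [(500, 3100)] * 5)
print(full)         # 66000
print(incremental)  # 17500
```

The stateless total grows roughly with the square of the number of steps, while the session total grows linearly, so the gap keeps widening on longer tasks.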
The latency improvement compounds. Prompt processing for large contexts is not instantaneous even with hardware-accelerated prefill. Eliminating that prefill for cached prefixes reduces time-to-first-token, which is the metric that most directly determines how fast an agent loop feels in practice.
How the Responses API WebSocket Protocol Works
The Responses API, introduced in early 2025, succeeded the older Chat Completions API as OpenAI’s primary interface for structured, multi-turn generation, although Chat Completions remains available. Unlike the Realtime API, which targets audio and live speech scenarios, the Responses API over WebSockets targets text-based agentic use where you need low-latency turns and stateful context.
A WebSocket connection is established once at the start of a session. The client sends messages using the same JSON structure as REST requests, but the connection remains open between turns. The server can stream response tokens back over the same socket, and critically, it keeps the session’s KV cache resident as long as the connection is alive.
A minimal Python client establishing this kind of connection looks roughly like:
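    import asyncio
    import json
    import os

    import websockets

    API_KEY = os.environ["OPENAI_API_KEY"]

    async def run_agent_loop(system_prompt, initial_task, execute_tools):
        # Message shapes here are illustrative; check the API reference for the exact schema.
        url = "wss://api.openai.com/v1/responses"
        headers = {"Authorization": f"Bearer {API_KEY}"}
        # Newer releases of the websockets library call this argument additional_headers.
        async with websockets.connect(url, extra_headers=headers) as ws:
            # First turn: cold start, full context
            await ws.send(json.dumps({
                "model": "codex-1",
                "input": [{"role": "user", "content": system_prompt + initial_task}],
            }))
            async for message in ws:
                event = json.loads(message)
                if event["type"] == "response.completed":
                    # execute_tools is a caller-supplied helper that runs any requested tools
                    tool_results = execute_tools(event["output"])
                    if not tool_results:
                        break  # no tool calls left, so the task is finished
                    # Subsequent turns: server reuses KV cache for prior context
                    await ws.send(json.dumps({"input": tool_results}))

    # Run with: asyncio.run(run_agent_loop(system_prompt, initial_task, execute_tools))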
The key difference from the REST equivalent is that subsequent send calls on the same connection carry only the new input. The server infers the full conversation state from the session, not from a context blob the client ships each time.
Connection Scoping vs. Prompt Caching
OpenAI has offered prompt caching for a while. It works by detecting shared token prefixes across requests and reusing cached KV entries when the prefix matches. This is useful but limited. It works well for fixed system prompts that never change. It works poorly when tool call outputs are interleaved throughout the conversation, because those change every turn and break the prefix invariant.
Connection-scoped caching sidesteps the prefix constraint entirely. The server treats a given WebSocket connection as a single evolving context and extends the cache incrementally. There is no prefix matching requirement because the server already knows the full prior state of the session. You get the caching benefit even when the conversation structure does not resemble a static prefix.
This distinction matters most for agents that use dense tool calling with varied outputs. A coding agent reading different files on every turn will produce a conversation history that looks like noise from a prefix-matching perspective, but is still entirely cacheable on a connection-scoped basis.
Codex as the Test Bed
OpenAI used Codex as the primary benchmark for this work for good reason. Codex is not an assistant; it is an autonomous agent that takes software tasks and completes them, running shell commands and reading and writing files. Its loop can run for minutes on a single task, accumulating a long conversation with many tool-call turns. That is exactly the workload where stateless HTTP creates the most overhead.
OpenAI reports the improvements under two headings, API overhead and model latency, which separate the time spent processing the context from the time spent generating the response. Eliminating repeated context prefill cuts API overhead, and the model latency number also improves because a warm KV cache allows the inference stack to start generating sooner. For Codex specifically, where a task might involve 20 or more model calls, the cumulative effect is substantial.
Architectural Implications for Agent Frameworks
Most existing agent frameworks, including LangChain, LlamaIndex, and various homegrown orchestrators, are built around stateless API calls. They manage conversation history as a list that grows each turn and serialize the full list into each request. Adopting WebSocket-based sessions requires rethinking this model.
The framework needs to maintain a live connection rather than constructing a request payload. It needs to track whether a connection is still alive and handle reconnections gracefully, because a dropped connection loses the cached session and the next request starts cold again. Error handling gets more complex; a timeout mid-generation needs to be handled differently when the client is holding a persistent socket.
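One way a framework might handle this is to track how much of the history the current connection has already seen and to replay everything after a reconnect. The sketch below is an assumption-laden illustration: SessionClient, the Socket protocol, and the {"input": ...} message shape are hypothetical, and a real client would also record the model's outputs in the history so that a replay rebuilds the full context.

```python
import json
from typing import Callable, Optional, Protocol


class Socket(Protocol):
    """Minimal socket interface assumed by this sketch."""
    def send(self, data: str) -> None: ...


class SessionClient:
    """Sends only unseen items, and replays the full history after a reconnect."""

    def __init__(self, connect: Callable[[], Socket]) -> None:
        self._connect = connect
        self._socket: Optional[Socket] = None
        self._history: list[dict] = []
        self._sent = 0  # index of the first item this connection has not seen

    def send_turn(self, items: list[dict]) -> None:
        self._history.extend(items)
        for _attempt in range(2):
            if self._socket is None:
                self._socket = self._connect()
                self._sent = 0  # fresh connection: the server-side cache is gone
            try:
                payload = {"input": self._history[self._sent:]}
                self._socket.send(json.dumps(payload))
                self._sent = len(self._history)
                return
            except ConnectionError:
                self._socket = None  # drop the dead socket, retry once from cold
        raise ConnectionError("send failed after reconnect")
```

The cost of a dropped connection shows up as a larger payload and a cold prefill on the next turn, not as lost state, because the client keeps the authoritative copy of the history.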
There is also a resource implication on the server side. Connection-scoped caching means the server must keep KV cache entries alive for the duration of the connection. For large context windows across many concurrent sessions, this is significant accelerator memory. OpenAI is presumably managing this with connection timeouts and eviction policies, but the tradeoff is real: lower latency per turn at the cost of higher per-connection memory overhead.
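The memory cost can be estimated with the standard KV cache sizing formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The model dimensions below are hypothetical, picked only to make the arithmetic concrete; they do not describe any OpenAI model.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_element: int = 2) -> int:
    """Bytes of KV cache for one sequence (default element size is 16-bit)."""
    # The factor of 2 covers keys and values, stored separately in every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_element * tokens


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 16-bit values.
per_token = kv_cache_bytes(1, layers=32, kv_heads=8, head_dim=128)
per_session = kv_cache_bytes(20_000, layers=32, kv_heads=8, head_dim=128)

print(per_token)                          # 131072
print(f"{per_session / 2**30:.2f} GiB")   # 2.44 GiB
```

At that size, a thousand idle sessions with 20,000-token contexts would pin well over two terabytes of accelerator memory, which is why timeouts and eviction are likely to be part of any real deployment.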
The Broader Pattern
What OpenAI describes here is a specific instance of a pattern that shows up whenever you try to apply stateless protocols to stateful workloads. Stateless HTTP worked well for chat interfaces because each conversation was short and infrequent. As the workload shifts toward long-running autonomous agents making dozens of model calls per task, the overhead of statelessness becomes the dominant cost.
WebSockets are not new technology. They have been used for live data feeds, multiplayer games, and collaborative editing for over a decade. Applying them to LLM inference sessions is the straightforward move once you recognize that agent loops have the same temporal locality properties as any other stateful client-server interaction. The KV cache is just another form of server-side session state, and persistent connections are the natural way to keep session state warm.
For developers building production agent systems today, this is worth integrating as soon as OpenAI makes the WebSocket Responses API generally available. The latency gains are not marginal optimizations; for agents doing real work over many turns, eliminating repeated context transmission and cold-cache prefill changes the practical ceiling for how long and how complex a task the agent can handle within reasonable time bounds.