Persistent Connections and the KV Cache: What WebSockets Actually Change for Agentic Loops
Source: openai
When you build something that talks to a language model in a loop, the protocol overhead you tolerate on a single request compounds badly. Each tool call, each intermediate reasoning step, each pass through an agent’s control flow adds another round trip. OpenAI’s recent work on WebSocket support in the Responses API, demonstrated through the Codex coding agent, is fundamentally about what happens when you treat every turn of an agent loop as an independent, stateless HTTP request, and what you gain when you stop doing that.
The Cost of a Connection Per Request
Requests to OpenAI’s API go over HTTPS with TLS 1.3. Each new connection has to be negotiated: TCP handshake, TLS handshake, HTTP framing, the works. With keep-alive or HTTP/2 multiplexing, you can amortize that setup across requests to the same host, but model-side state does not persist across requests in the standard HTTP flow. Every request arrives as a fresh context, and the model re-processes its tokens from scratch unless prompt caching applies.
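As rough arithmetic, here is what the setup cost looks like in the worst case. This sketch assumes a fresh connection per turn with no keep-alive reuse, one round trip for the TCP handshake, and one more for a full TLS 1.3 handshake. The 50 ms round-trip time and the 40-turn count are illustrative numbers, not measurements.

```python
def setup_overhead_ms(turns: int, rtt_ms: float, reuse_connection: bool) -> float:
    """Handshake time spent before any request bytes are sent."""
    round_trips_per_setup = 2  # 1 for TCP + 1 for a full TLS 1.3 handshake
    setups = 1 if reuse_connection else turns
    return setups * round_trips_per_setup * rtt_ms


per_request = setup_overhead_ms(turns=40, rtt_ms=50.0, reuse_connection=False)
persistent = setup_overhead_ms(turns=40, rtt_ms=50.0, reuse_connection=True)
print(per_request, persistent)  # 4000.0 100.0
```

In practice HTTP clients usually pool connections, so the real per-request figure tends to sit somewhere between these two extremes.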
OpenAI’s prompt caching mechanism, available on both the Chat Completions and Responses APIs, does reduce this cost: if your request prefix matches a previously cached prefix exactly, you pay a discounted per-token rate and skip recomputation for those tokens. But prompt caching is keyed by prefix content and is subject to eviction. It is a best-effort cache shared across all clients, not a reserved slot for your session. Under load, your cached prefix can be evicted between turns, and you are back to paying full input token cost for your entire system prompt and conversation history on every call.
Connection-scoped caching is a structurally different concept. When your client holds an open WebSocket connection to the model server, the server can pin the KV cache for that connection’s context. The model does not re-process your system prompt on turn 12 just because turns 1 through 11 have already completed. The loaded attention state stays resident for as long as the connection is alive. For a single-turn chatbot this distinction barely registers. For a Codex-style agent that lists a directory, reads files, writes a patch, runs tests, reads the error output, and patches again, all within the same task, it matters considerably.
How Transformer KV Caching Works and Why Connection Affinity Matters
To understand why this is non-trivial, it helps to understand what the KV cache is. In a transformer, each attention layer computes queries, keys, and values over the input tokens. For the tokens that make up your system prompt and conversation history, those key and value tensors can be computed once and stored. On subsequent forward passes that extend the same prefix, the model loads those cached tensors rather than recomputing them. This is the prefill optimization that makes long-context models tolerable to serve.
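The idea can be shown with a toy, single-head, single-layer attention step in plain Python. This is only a sketch: the weight matrices, the two-dimensional "token" vectors, and the `KVCache` class are made up for illustration, and a real model has many layers and heads. It shows that extending a cached prefix projects only the new token while producing the same attention output as recomputing everything.

```python
import math


def project(vec, weights):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]


class KVCache:
    """Stores key/value projections so a longer prefix only pays for new tokens."""

    def __init__(self, w_k, w_v):
        self.w_k, self.w_v = w_k, w_v
        self.keys, self.values = [], []
        self.projected_tokens = 0

    def extend(self, token_vecs):
        for vec in token_vecs:
            self.keys.append(project(vec, self.w_k))
            self.values.append(project(vec, self.w_v))
            self.projected_tokens += 1


W_K = [[0.5, 0.1], [0.2, 0.3]]  # hypothetical key projection
W_V = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical value projection
prefix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-in for system prompt tokens
new_token = [0.5, -0.5]  # stand-in for an appended tool result

# With a cache: project the prefix once, then only the new token.
cached = KVCache(W_K, W_V)
cached.extend(prefix)
cached.extend([new_token])

# Without a cache: the second turn re-projects the whole sequence.
turn1 = KVCache(W_K, W_V)
turn1.extend(prefix)
turn2 = KVCache(W_K, W_V)
turn2.extend(prefix + [new_token])
uncached_cost = turn1.projected_tokens + turn2.projected_tokens

query = [1.0, 0.5]
same_output = attend(query, cached.keys, cached.values) == attend(
    query, turn2.keys, turn2.values
)
print(cached.projected_tokens, uncached_cost, same_output)  # 4 7 True
```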
Content-addressed prompt caching, as OpenAI implements it for standard HTTP requests, works by hashing your token prefix and looking up whether a matching cache entry exists on the serving infrastructure. This works well when the same prefix recurs frequently across many clients, such as a popular system prompt. It works less reliably for prefixes that grow by one tool call result per turn, because each successive turn produces a new hash. The cache might hold your prefix from two turns ago but not from one turn ago, either because that result just arrived and hasn’t been cached yet, or because the request landed on a different server node.
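A toy version of content-addressed prefix lookup makes the failure mode visible. This is a sketch under stated assumptions, not OpenAI's implementation: the `PrefixCache` class, the token strings, and the per-token granularity are all hypothetical, and real serving systems typically cache in coarser blocks and shard entries across nodes.

```python
import hashlib


class PrefixCache:
    """Toy content-addressed cache: entries are keyed by a hash of the token prefix."""

    def __init__(self):
        self._entries = set()

    @staticmethod
    def key(tokens):
        return hashlib.sha256("\x1f".join(tokens).encode()).hexdigest()

    def store(self, tokens):
        self._entries.add(self.key(tokens))

    def evict_all(self):
        self._entries.clear()

    def longest_cached_prefix(self, tokens):
        """Return how many leading tokens are covered by a cached entry."""
        for end in range(len(tokens), 0, -1):
            if self.key(tokens[:end]) in self._entries:
                return end
        return 0


cache = PrefixCache()
turn1 = ["<system>", "list", "files"]
cache.store(turn1)

# The next turn appends a tool result, so the full prefix hashes differently.
turn2 = turn1 + ["tool_result:a.py"]
partial_hit = cache.longest_cached_prefix(turn2)  # only turn 1's tokens are covered

# Under memory pressure the shared cache may drop the entry entirely.
cache.evict_all()
after_eviction = cache.longest_cached_prefix(turn2)
print(partial_hit, after_eviction)  # 3 0
```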
With a persistent WebSocket connection, the server maintains cache state that is associated with the connection itself, not with a content hash. There is no lookup: the model simply resumes from where it left off. This is closer in spirit to how KV caching works in a local inference runtime like llama.cpp, where you control the KV cache directly and can reuse it across calls in the same process. The WebSocket approach brings that locality to a remote API.
The Responses API WebSocket Mode
The Responses API, introduced in early 2025 as a more structured successor to Chat Completions for agentic workloads, supports streaming via Server-Sent Events over HTTP by default. The WebSocket transport opens a persistent bidirectional channel. You connect once, then send input messages and receive streamed response events without re-establishing the connection for each turn.
The wire format uses the same event types as SSE streaming: response.created, response.output_item.added, response.output_text.delta, and related events. The structural difference is that your client sends subsequent inputs over the same socket rather than opening a new HTTPS request, which gives the server the connection identity it needs to maintain pinned cache state.
A simplified agent loop over WebSocket looks like this:
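```python
import asyncio
import json
import os

import websockets

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
SYSTEM_PROMPT = "You are a coding agent."  # placeholder
TOOL_DEFINITIONS = []  # placeholder: function tool schemas go here


async def execute_tool(name: str, arguments: str) -> str:
    """Placeholder: dispatch to a real tool; `arguments` is a JSON string."""
    raise NotImplementedError(name)


def has_pending_tool_calls(event: dict) -> bool:
    """True if the completed response contains function calls still to be answered."""
    output = event.get("response", {}).get("output", [])
    return any(item.get("type") == "function_call" for item in output)


async def run_agent(task: str):
    uri = "wss://api.openai.com/v1/responses"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "responses-websocket-v1",
    }
    # websockets >= 14 names this parameter additional_headers;
    # older releases call it extra_headers.
    async with websockets.connect(uri, additional_headers=headers) as ws:
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "model": "gpt-4.1",
                "instructions": SYSTEM_PROMPT,
                "input": [{"type": "message", "role": "user", "content": task}],
                "tools": TOOL_DEFINITIONS,
            },
        }))
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "response.output_item.done":
                item = event["item"]
                if item["type"] == "function_call":
                    result = await execute_tool(item["name"], item["arguments"])
                    # Tool result goes back on the same connection
                    await ws.send(json.dumps({
                        "type": "response.create",
                        "response": {
                            "input": [{
                                "type": "function_call_output",
                                "call_id": item["call_id"],
                                "output": result,
                            }]
                        },
                    }))
            elif event["type"] == "response.completed":
                if not has_pending_tool_calls(event):
                    break
```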
The system prompt is sent once, in the first message on the connection. Every tool result goes back over the same socket. There is no re-serialization of conversation history, no re-transmission of instructions on every turn, and no reliance on a shared cache hitting warm on a growing prefix.
The Codex Agent Loop in Concrete Terms
Codex, OpenAI’s coding agent, runs tasks that involve many sequential tool calls: reading directory structures, reading file contents, running shell commands, applying patches, re-running tests, and interpreting failures. A non-trivial task can span 20 to 40 tool call turns within a single session.
With HTTP-per-request and prompt caching at its best, you still pay the connection setup overhead for each turn, and you are vulnerable to cache misses as your prefix grows with each appended tool result. OpenAI’s system prompt for Codex is substantial: it includes environment context, available tools with their schemas, behavioral instructions, and format constraints. That prefix alone might be 2,000 to 4,000 tokens. At 40 turns, paying full prefill cost on every cache miss adds up.
With WebSocket transport and connection-scoped caching, the system prompt is processed once at the start of the session. The growing conversation history is appended incrementally to the existing KV state rather than recomputed from scratch each turn. Time to first token on turn 15 is not materially different from turn 2, because the model is not re-reading its own instructions on every call. OpenAI describes this as reducing both API overhead and model latency: the API overhead saving comes from eliminating per-turn connection setup, and the model latency saving comes from the improved KV cache hit rate.
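To put rough numbers on the difference, the sketch below counts prefill tokens over a 40-turn task in two extreme cases: every turn misses the cache, versus a pinned cache that only processes newly appended tokens. The 3,000-token system prompt sits in the range quoted above; the 500 tokens appended per turn is an assumed figure, and the model ignores partial cache hits, which would land between the two totals.

```python
def prefill_tokens(turns, system_tokens, tokens_per_turn, pinned_cache):
    """Total input tokens the model must process across all turns."""
    total = 0
    for turn in range(1, turns + 1):
        prompt_len = system_tokens + turn * tokens_per_turn
        if pinned_cache and turn > 1:
            # Only the tokens appended since the previous turn are new.
            total += tokens_per_turn
        else:
            # Cold start or cache miss: the whole prompt is prefilled.
            total += prompt_len
    return total


always_miss = prefill_tokens(40, 3000, 500, pinned_cache=False)
pinned = prefill_tokens(40, 3000, 500, pinned_cache=True)
print(always_miss, pinned)  # 530000 23000
```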
Precedent: The Realtime API
This is not OpenAI’s first persistent-connection API. The Realtime API, which launched in late 2024, uses WebSockets as its primary transport because for voice interaction, WebSockets are non-negotiable. You need low-latency bidirectional streaming for turn detection, audio chunking, and VAD event handling. HTTP request-response simply cannot serve that use case.
The architecture OpenAI built for the Realtime API, persistent connections with session state and incremental context accumulation, appears to be the foundation the Responses API WebSocket mode builds on. The Realtime API uses an input_audio_buffer concept for accumulating audio before committing a conversation turn; the Responses API WebSocket mode applies a structurally similar accumulation pattern for text and tool outputs. The event schemas are closely related, suggesting shared infrastructure rather than independent implementations.
What the Responses API work adds is bringing this persistent-connection model to the text and tool-use domain where Codex and similar coding agents live. Voice latency requirements forced the architecture early; agent loop latency requirements are now motivating the same solution in a different serving context.
Why This Is Familiar If You Build Bots
Anyone who has written a Discord bot knows that WebSocket connections are the baseline, not an optimization. The Discord gateway is a persistent WebSocket; you connect once and receive events for the lifetime of your bot process. The alternative, polling REST endpoints, is technically available but is worse for event handling latency by design.
The model API world has historically been different because inference requests are computationally expensive and stateless serving is easier to scale horizontally. Each HTTP request can land on any available GPU worker; no session affinity is required. A persistent WebSocket session needs to stay pinned to the same backend or have its KV state transferred between nodes, which is real infrastructure complexity.
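The affinity problem can be sketched with a toy router. Everything here is hypothetical: the worker names, the round-robin policy, and the simplification that only the worker that served the previous turn holds usable KV state. It counts how many turns of a session land on a warm worker under each policy.

```python
from itertools import cycle


def warm_turns(turns, workers, sticky):
    """Count turns that land on the worker holding the session's latest KV state."""
    round_robin = cycle(workers)
    holder = None  # worker that served the previous turn
    warm = 0
    for _ in range(turns):
        if sticky and holder is not None:
            worker = holder  # session affinity: stay where the state lives
        else:
            worker = next(round_robin)  # stateless: any worker will do
        if worker == holder:
            warm += 1
        holder = worker
    return warm


workers = ["gpu-0", "gpu-1", "gpu-2", "gpu-3"]
stateless_hits = warm_turns(12, workers, sticky=False)
sticky_hits = warm_turns(12, workers, sticky=True)
print(stateless_hits, sticky_hits)  # 0 11
```

The sticky policy wins on cache locality but gives up the freedom to place each request on the least loaded worker, which is the scaling cost described above.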
What the Responses API WebSocket work represents is OpenAI accepting that infrastructure complexity because the latency and cache benefits justify it for agentic workloads. The trade-off is the same one Discord made for gateway versus polling, just applied at the model-serving layer rather than the event-delivery layer.
What This Does Not Solve
Connection-scoped caching helps with within-session latency. It does not help with the cold start on the first turn, where your full system prompt still requires processing. It does not help with tasks that span multiple sessions or are interrupted and resumed, because the KV state is tied to the connection lifetime. For long-running tasks that outlive a typical connection, or for agents that check in periodically over hours, you fall back to content-addressed caching or full recomputation.
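One way a client might handle that fallback is to keep its own transcript and replay it whenever the connection is new. The sketch below is an assumption about client design, not a documented protocol: `TurnPayloadBuilder` is a hypothetical helper, the payload shape mirrors the example earlier in this article, and the history is abbreviated (a real replay would also need the model's own output items).

```python
class TurnPayloadBuilder:
    """Tracks which history items the current connection has already received."""

    def __init__(self, instructions):
        self.instructions = instructions
        self.history = []
        self.synced = 0  # items already sent on the live connection

    def add(self, item):
        self.history.append(item)

    def on_reconnect(self):
        # Assumption: a new connection starts with no server-side state.
        self.synced = 0

    def next_payload(self):
        payload = {"input": self.history[self.synced:]}
        if self.synced == 0:
            payload["instructions"] = self.instructions
        self.synced = len(self.history)
        return payload


builder = TurnPayloadBuilder("You are a coding agent.")
builder.add({"type": "message", "role": "user", "content": "fix the tests"})
first = builder.next_payload()  # instructions plus the first item
builder.add({"type": "function_call_output", "call_id": "call_1", "output": "ok"})
incremental = builder.next_payload()  # only the new tool output
builder.on_reconnect()
replay = builder.next_payload()  # instructions plus the full history again
print(len(first["input"]), len(incremental["input"]), len(replay["input"]))  # 1 1 2
```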
There is also a resource question. Pinning KV cache state per connection means holding GPU memory proportional to the prefix length for each active connection. At large scale, this is a meaningful constraint on how many concurrent sessions a deployment can hold. It is presumably why the feature is being demonstrated through Codex rather than being universally available without restriction.
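A back-of-the-envelope estimate shows why. The KV cache stores one key and one value vector per layer, per KV head, per token. The model dimensions below are hypothetical (OpenAI does not publish them), as is the 20,000-token context; only the formula is standard.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache per token: keys and values for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit values.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)
per_connection = per_token * 20_000  # one session holding a 20,000-token context
print(per_token)  # 131072
print(round(per_connection / 2**30, 2))  # 2.44
```

At roughly 2.4 GiB per pinned session under these assumptions, a few dozen idle connections would occupy as much memory as a typical accelerator has in total.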
Finally, WebSockets add client-side complexity relative to SSE. Server-Sent Events over HTTP are straightforward to implement and debug: open a request, read `data:` lines separated by blank lines, parse each payload as JSON, and handle connection drops by retrying. WebSocket session management, reconnection with state recovery, message ordering guarantees during tool execution timeouts, and heartbeat handling are all additional concerns. For a production agent, this is manageable but not trivial.
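For a sense of how little the SSE side requires, here is a minimal parser sketch. It assumes each event carries a JSON payload in one or more `data:` fields and ends with a blank line, and it ignores the other SSE fields (`event:`, `id:`, `retry:`) and comment lines. The sample lines are made up.

```python
import json


def parse_sse(lines):
    """Yield decoded JSON payloads from an iterable of SSE text lines."""
    data_lines = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield json.loads("\n".join(data_lines))
                data_lines = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the data
            data_lines.append(value)
        # Other fields and comment lines are skipped in this sketch.


sample = [
    'event: response.created',
    'data: {"type": "response.created"}',
    '',
    'data: {"type": "response.completed"}',
    '',
]
events = list(parse_sse(sample))
print([e["type"] for e in events])  # ['response.created', 'response.completed']
```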
When to Use It
If you are building an agent that runs a loop with many sequential model calls per task, the WebSocket transport in the Responses API is worth adopting. The latency improvement is not marginal on long tasks: eliminating dozens of connection setups and replacing speculative prompt cache hits with guaranteed KV locality changes the execution feel of the agent in a measurable way.
For single-turn applications, short conversations, or low-frequency interactions, the overhead of maintaining a persistent connection probably is not worth it. Use SSE over HTTP, rely on prompt caching for your system prompt, and keep the implementation simple. The WebSocket mode solves a problem that only becomes painful at the scale of a real agentic loop, and it is calibrated for exactly that use case.