
Persistent Connections and KV Cache Locality: What WebSockets Actually Fix in Agent Loops

Source: openai

When you build something that talks to a language model in a loop, the protocol overhead you tolerate on a single request compounds badly. Each tool call, each intermediate reasoning step, each pass through an agent’s control flow adds another round trip. OpenAI’s recent work on WebSocket support in the Responses API, demonstrated through the Codex agent, is fundamentally about what it costs to treat every turn of an agent loop as an independent, stateless HTTP request, and what you gain when you stop doing that.

The Cost of a Connection Per Request

Requests to OpenAI’s API go over HTTPS, typically with TLS 1.3. A fresh connection has to be negotiated before any request bytes flow: TCP handshake, TLS handshake, then the HTTP headers. With HTTP/1.1 keep-alive or HTTP/2 multiplexing, you can amortize the handshakes across requests to the same host. But the model-side cost, specifically the KV cache, doesn’t persist across requests in the standard HTTP flow at all. Every request arrives as a fresh context, and the model re-processes the tokens from scratch unless prompt caching applies.

OpenAI’s prompt caching mechanism (available on the Chat Completions and Responses APIs) does reduce this: if your request prefix matches a previously cached prefix exactly, you pay a lower per-token rate and skip re-computation for those tokens. But prompt caching is keyed by prefix content and expires. It’s a best-effort cache across all clients, not a reserved slot for your session. Under high load, your cached prefix can be evicted between requests, and you’re back to paying full input token cost for your entire system prompt and conversation history every turn.
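To make the failure mode concrete, here is a toy model of a prefix-keyed, best-effort cache. It is not OpenAI’s implementation: the `PrefixCache` class, its LRU eviction policy, and the token lists are all assumptions made for illustration. It shows that a hit only covers the longest cached prefix, and that one eviction between turns sends the whole prompt back through prefill.

```python
from collections import OrderedDict


class PrefixCache:
    """Toy shared prefix cache with LRU eviction (hypothetical, for illustration)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._entries = OrderedDict()  # cached prompts, least recently used first

    def tokens_to_prefill(self, tokens: list) -> int:
        """Return how many tokens must be recomputed, then cache the full prompt."""
        key = tuple(tokens)
        hit = 0
        for cached in self._entries:
            # A cached entry only helps if it is an exact prefix of the new prompt.
            if len(cached) <= len(key) and key[:len(cached)] == cached:
                hit = max(hit, len(cached))
        self._entries[key] = None
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return len(key) - hit


cache = PrefixCache(capacity=1)
turn1 = ["sys"] * 100 + ["user-1"]
turn2 = turn1 + ["tool-1"]
turn3 = turn2 + ["tool-2"]

print(cache.tokens_to_prefill(turn1))      # 101: cold start
print(cache.tokens_to_prefill(turn2))      # 1: only the new token is recomputed
cache.tokens_to_prefill(["other-client"])  # another client evicts our entry
print(cache.tokens_to_prefill(turn3))      # 103: full re-processing after eviction
```

The capacity of 1 is deliberately tiny so the eviction is visible; a real cache is far larger, but the same miss can happen under load.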

Connection-scoped caching is a different concept. When your client holds an open WebSocket connection to the model, the server can pin the KV cache for that connection’s context. The model doesn’t re-process your system prompt on turn 12 just because turns 1 through 11 happened to complete. The loaded state stays in memory for as long as the connection is alive. For a single-turn chatbot this distinction barely matters. For a Codex-style agent that runs a file listing, reads a file, writes a patch, runs tests, reads the error output, and patches again, all within the same task, it matters quite a lot.

How the Responses API WebSocket Mode Works

The Responses API, introduced in early 2025 as a more structured successor to Chat Completions for agentic scenarios, supports streaming via Server-Sent Events over HTTP by default. The newer WebSocket transport opens a persistent bidirectional channel. You connect once, then send input messages and receive streamed response events without re-establishing the connection for each turn.

The wire format uses the same event types you’d see in SSE streaming: response.created, response.output_item.added, response.output_text.delta, and so on. The structural difference is that your client sends subsequent inputs over the same socket rather than opening a new HTTPS request. This lets the server maintain context about your session at the infrastructure layer, which is what enables the connection-scoped KV cache behavior.

A simplified connection looks like this:

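import asyncio
import json
import os

import websockets

# Placeholders for this sketch: supply your own prompt, tool schemas, and tool runner.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
SYSTEM_PROMPT = "You are a coding agent."
TOOL_DEFINITIONS = []

async def execute_tool(name: str, arguments: str) -> str:
    # arguments is a JSON-encoded string; decode it and dispatch to your tool here
    raise NotImplementedError(f"no implementation for tool {name!r}")

async def agent_loop(task: str):
    # The endpoint and beta header here are illustrative;
    # check them against the current API reference before use.
    uri = "wss://api.openai.com/v1/realtime?model=gpt-4.1"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "responses-v2",
    }

    # websockets >= 14 names this argument additional_headers;
    # older releases call it extra_headers.
    async with websockets.connect(uri, additional_headers=headers) as ws:
        # Send the initial request: the only place the system prompt is transmitted
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "instructions": SYSTEM_PROMPT,
                "input": [{"type": "message", "role": "user", "content": task}],
                "tools": TOOL_DEFINITIONS,
            },
        }))

        # Handle streaming events and tool calls within the same connection
        tool_outputs = []
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.output_item.done":
                item = event["item"]
                if item["type"] == "function_call":
                    result = await execute_tool(item["name"], item["arguments"])
                    tool_outputs.append({
                        "type": "function_call_output",
                        "call_id": item["call_id"],
                        "output": result,
                    })
            # Terminal event name assumed to match SSE streaming
            elif event["type"] == "response.completed":
                if not tool_outputs:
                    break  # no tool calls this turn: the task is finished
                # Submit tool results back on the SAME connection
                await ws.send(json.dumps({
                    "type": "response.create",
                    "response": {"input": tool_outputs},
                }))
                tool_outputs = []

if __name__ == "__main__":
    asyncio.run(agent_loop("Fix the failing test"))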

Notice that the system prompt is sent once at the start. Every subsequent tool result goes back over the same socket, and the server knows it’s the same session. There’s no repeated serialization of the full conversation history, no re-transmission of your instructions on every turn, and no hope-the-cache-is-warm problem.

The Codex Agent Loop in Concrete Terms

Codex, OpenAI’s coding agent, runs tasks that typically involve many sequential tool calls: reading directory structures, reading file contents, running shell commands, patching files, re-running tests. A non-trivial task might involve 20 to 40 tool call turns. With HTTP-per-request and prompt caching at its best, you still pay the connection setup overhead 40 times and you’re vulnerable to cache misses as your prefix grows with each appended tool result.
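A back-of-envelope estimate puts a number on the setup overhead. This is a sketch under stated assumptions: a 50 ms round-trip time, no connection reuse in the per-request case, and two round trips per fresh connection (one for the TCP handshake, one for a full TLS 1.3 handshake). The function name and figures are illustrative, not measurements.

```python
def setup_overhead_ms(turns: int, rtt_ms: float, reuse_connection: bool) -> float:
    """Handshake time spent before any request bytes are sent."""
    handshake_rtts = 2  # 1 RTT for TCP + 1 RTT for a full TLS 1.3 handshake
    connections = 1 if reuse_connection else turns
    return connections * handshake_rtts * rtt_ms


per_request = setup_overhead_ms(turns=40, rtt_ms=50.0, reuse_connection=False)
persistent = setup_overhead_ms(turns=40, rtt_ms=50.0, reuse_connection=True)
print(per_request, persistent)  # 4000.0 100.0
```

HTTP keep-alive narrows this gap on its own, and the WebSocket upgrade adds one extra round trip that the sketch ignores, so treat the result as an upper bound on the transport-level saving.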

With WebSocket transport and connection-scoped caching, the system prompt is processed once. The growing conversation history is appended incrementally to the existing KV state rather than re-processed from scratch. Time to first token on turn 15 is not materially different from turn 2, because the model isn’t re-reading your system instructions every time.
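The difference between re-processing and incremental prefill is easy to see with arithmetic. The sketch below assumes a 2,000-token system prompt and 500 new input tokens per turn over 30 turns; the numbers and the `prefill_tokens` helper are hypothetical, and model output tokens are left out to keep the sum simple.

```python
def prefill_tokens(system_tokens: int, tokens_per_turn: int, turns: int, cached: bool) -> int:
    """Total tokens run through prefill over a whole task."""
    total = 0
    context = system_tokens
    for _ in range(turns):
        context += tokens_per_turn  # new tool result or user input for this turn
        # With a warm, pinned cache only the new tokens need prefill;
        # without it, the entire context is re-processed.
        total += tokens_per_turn if cached else context
    if cached:
        total += system_tokens  # the system prompt is still processed once
    return total


print(prefill_tokens(2000, 500, 30, cached=False))  # 292500
print(prefill_tokens(2000, 500, 30, cached=True))   # 17000
```

Without the cache, prefill work grows quadratically with the number of turns; with it, the growth is linear.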

OpenAI’s article describes this as reducing both API overhead and model latency. The API overhead reduction is the eliminated connection setup cost. The model latency reduction is the KV cache hit behavior: the model genuinely has less prefill work to do on each turn because the cache is warm and pinned.

Why This Pattern Is Familiar If You Build Bots

Anyone who has written a Discord bot knows that WebSocket connections are the baseline, not a special optimization. The Discord gateway is a persistent WebSocket; you connect once and receive events for the lifetime of your bot’s session. The alternative, polling HTTP endpoints, is available but is plainly worse for latency-sensitive event handling.

The model API world has historically been different because requests are computationally expensive and stateless serving is easier to scale horizontally. Each HTTP request lands on any available worker; there’s no stickiness required. A persistent WebSocket session needs to stay pinned to the same backend instance or have its KV cache state transferred, which adds real infrastructure complexity.
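The routing difference can be sketched in a few lines. Nothing here describes OpenAI’s actual load balancer: the worker names, the round-robin policy, and the hash-based pinning are assumptions chosen to show why a stateful session constrains placement.

```python
import hashlib
from itertools import cycle

WORKERS = ["gpu-0", "gpu-1", "gpu-2", "gpu-3"]  # hypothetical backend names

_round_robin = cycle(WORKERS)


def route_stateless() -> str:
    """Any worker will do: the request carries its full context."""
    return next(_round_robin)


def route_sticky(session_id: str) -> str:
    """Always map a session to the worker that holds its KV cache."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return WORKERS[int.from_bytes(digest[:8], "big") % len(WORKERS)]


print([route_stateless() for _ in range(4)])         # spreads across all workers
print({route_sticky("session-42") for _ in range(4)})  # a single worker, every time
```

Simple modulo hashing reshuffles sessions whenever the worker pool changes size, which is part of the complexity the paragraph above refers to; a production system would need consistent hashing, a session table, or cache migration.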

What the Responses API WebSocket work represents is OpenAI deciding that for agentic workloads, the latency and cache benefits are worth the stickiness cost. This is the same trade-off Discord made for gateway events versus REST polling, just applied at the model-serving layer.

Comparing the Realtime API Precedent

OpenAI’s Realtime API, which launched in late 2024, already used WebSockets as its primary transport. For voice and audio interaction, WebSockets are non-negotiable: you need low-latency bidirectional streaming for turn detection and audio chunking. The architecture OpenAI built for Realtime, persistent connections with session state and incremental context accumulation, appears to be the foundation that the Responses API WebSocket mode is built on.

The Realtime API uses a concept it calls input_audio_buffer for accumulating audio before committing it as a conversation turn. The Responses API WebSocket mode uses a structurally similar pattern for text: you accumulate context over the lifetime of the connection without re-sending history. The event schema is closely related between the two APIs, which suggests they share a significant portion of infrastructure.

What This Doesn’t Solve

Connection-scoped caching helps with within-session latency. It doesn’t help with the cold start on the first turn, where your full system prompt still needs to be processed. It doesn’t help with tasks that span multiple sessions or that are interrupted and resumed, since the KV state is tied to the connection lifetime. For very long tasks that outlive a typical connection, you’re back to re-establishing context.
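One way to cope with interrupted sessions is to keep a client-side transcript and replay it on a fresh connection. The sketch below is an assumption about how a client might do that, not a documented recovery mechanism: `Transcript` is a hypothetical helper, and the message shape simply mirrors the simplified example earlier in this article.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Transcript:
    """Client-side record of everything sent and received in a task."""
    instructions: str
    items: list = field(default_factory=list)

    def record(self, item: dict) -> None:
        self.items.append(item)

    def replay_request(self) -> str:
        """Build the first message for a new connection: full history, cold cache."""
        return json.dumps({
            "type": "response.create",
            "response": {"instructions": self.instructions, "input": self.items},
        })


transcript = Transcript(instructions="You are a coding agent.")
transcript.record({"type": "message", "role": "user", "content": "Fix the failing test"})
transcript.record({"type": "function_call_output", "call_id": "call_1", "output": "2 failed"})
print(transcript.replay_request())
```

The replayed request pays the full prefill cost again, which is exactly the cold start described above; the transcript only saves you from losing the task.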

There’s also a resource question on OpenAI’s end. Pinning KV cache state per connection means holding GPU memory proportional to the prefix length for each active connection. At scale, this is a significant constraint. It’s presumably why the feature is being rolled out through specific use cases like Codex rather than being universally available with no restrictions.
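The memory pressure can be estimated with the standard KV cache size formula for transformer attention. The model dimensions below (32 layers, 8 KV heads, head dimension 128, 16-bit values) are hypothetical and say nothing about any OpenAI model; they only show how quickly pinned state adds up.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes of KV cache held for one sequence of the given length."""
    # Factor of 2: one key vector and one value vector per token, layer, and KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


per_token = kv_cache_bytes(1, layers=32, kv_heads=8, head_dim=128)
per_connection = kv_cache_bytes(20_000, layers=32, kv_heads=8, head_dim=128)

print(per_token)                         # 131072 bytes, i.e. 128 KiB per token
print(round(per_connection / 2**30, 2))  # 2.44 GiB for a 20,000-token context
```

Under these assumptions, a thousand idle connections with 20,000-token contexts would pin roughly 2.4 TiB of accelerator memory.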

Finally, the bidirectional nature of WebSockets adds client-side complexity. SSE over HTTP is easy to implement: open a request, read newline-delimited events. WebSocket session management, reconnection logic, and message ordering are more work to handle correctly, particularly when tool execution can fail or time out mid-session.
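For a sense of how little code the SSE side needs, here is a minimal parser for the event-stream format. It is a sketch that handles only the `event:` and `data:` fields and comment lines; `parse_sse` is a hypothetical helper, and a production client would also handle `id:`, `retry:`, and reconnection.

```python
def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of decoded SSE lines."""
    event, data = "message", []  # "message" is the default event type
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data.append(value[1:] if value.startswith(" ") else value)


stream = [
    "event: response.created\n",
    'data: {"id": "r1"}\n',
    "\n",
    ": keep-alive\n",
    "data: a\n",
    "data: b\n",
    "\n",
]
for name, payload in parse_sse(stream):
    print(name, repr(payload))
```

There is no equivalent shortcut on the WebSocket side, because the client also owns the write path, the session lifetime, and recovery when the socket drops mid-turn.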

The Practical Takeaway

If you are building an agent that runs a loop with many sequential model calls, the WebSocket transport in the Responses API is worth adopting. The latency improvement is not marginal on long tasks. Eliminating 20 to 40 connection setups and replacing speculative prompt cache hits with guaranteed KV locality changes the feel of the agent’s execution speed.

For single-turn applications or short interactions, the overhead of maintaining a persistent connection probably isn’t worth it. Use SSE over HTTP, lean on prompt caching for your system prompt, and move on. The WebSocket mode is solving a specific problem that only becomes painful at the scale of a real agentic loop, and it solves that problem well.
