The Session-Pinned KV Cache Behind OpenAI's WebSocket Responses API

OpenAI’s blog post on WebSockets in the Responses API frames the feature as a latency improvement for the Codex agent loop, which it is. But the more interesting story is what “connection-scoped caching” means at the inference layer, how it differs from the prompt caching developers are already familiar with, and what architectural constraints it introduces. These are distinct questions, and most coverage of the feature has only touched the first one.

Why Agentic Loops Break Prompt Caching

OpenAI, Anthropic, and Google all offer some form of prefix-based prompt caching. The mechanism is the same across providers: the server computes a hash of the token sequence up to some breakpoint, and if an incoming request shares that exact prefix, it skips recomputing the key-value attention states for those tokens and loads them from a cache instead. This is worth real money. Anthropic’s prompt caching prices cached input reads at roughly 10% of the normal input token cost. OpenAI’s automatic prefix caching applies to prompts with 1024 or more repeated tokens at the start and charges 50% of the normal rate for cache hits.

For a chat application, this works well. The system prompt and any preamble stay fixed across turns, so each turn reuses the cached KV states for the static prefix and only pays full compute for the new user message. The prefix is genuinely stable because users don’t inject content into the middle of it.

Agentic loops break this model. Consider the Codex loop: the model receives a task, calls tools (bash commands, file reads, code execution), processes the output, and calls tools again. Each tool result gets appended to the conversation history. The message list after three turns might look like this:

system: [large system prompt + tool definitions]
user: implement the feature described in SPEC.md
assistant: <function call: read_file("SPEC.md")>
tool: [1800 tokens of spec content]
assistant: <function call: read_file("src/auth.py")>
tool: [600 tokens of existing code]
assistant: <function call: bash("pytest tests/auth_test.py")>
tool: [test output]
assistant: [analysis and next steps]

The prefix after turn one is not the same as the prefix after turn two. Tool results are inserted at arbitrary positions in the message sequence, or appended to grow the sequence. Prefix caching can protect the static system prompt and tool definitions at the very beginning, but the accumulating conversation history, which grows with each turn, cannot be cached because it changes every turn by definition. By turn ten of a Codex session, you might be prefilling 30,000 or more tokens, most of which are new tool results and model responses from prior turns.

This is the gap connection-scoped caching targets.

What Connection-Scoped Caching Actually Does

In transformer inference, the KV cache stores the projected key and value vectors for every token in the context, one entry per attention head per layer per position. During the prefill phase, the model processes the entire input prompt and fills this cache. During the decode phase, each newly generated token reads from the full cache to attend back to all prior context, and writes one new entry. The prefill phase is compute-bound; it processes all input tokens in parallel on the GPU. The decode phase is memory-bandwidth-bound; it reads the entire cache once per output token.

With HTTP-based requests, each request arrives stateless. The server may have cached KV states from a prior request with the same prefix, but it has no inherent guarantee of continuity. If the cache slot for your session was evicted by an LRU policy to serve another user’s request, your next turn pays full prefill cost. Cache eviction is especially likely for agentic workloads because their KV caches are large (many tokens) and their inter-turn gaps can be minutes long while tools run.

A persistent WebSocket connection changes the unit of caching from content to session. The server can pin a KV cache slot to the connection identifier rather than to a content hash. As long as the connection stays open, the server maintains that slot and incremental turns only require computing KV states for the newly added tokens. The work scales with how much new content appeared since the last turn, not with the total accumulated context length.

For the Codex turn profile above, the difference is concrete. At turn ten with 30,000 tokens of accumulated context:

HTTP without cache hit: prefill 30,000 tokens
HTTP with prefix cache hit (system + tools, say 3,000 tokens): prefill 27,000 tokens
WebSocket with connection-scoped cache: prefill only the new tokens added since the last turn, typically 1,000 to 3,000 tokens depending on tool output size

The prefill reduction translates directly into time to first token. Prefill latency scales roughly linearly with token count for a given context length, so dropping from 27,000 tokens to 2,000 tokens is roughly a 13x improvement in TTFT for that specific turn. The improvement compounds over a long session because the denominator stays manageable rather than growing without bound.

The Infrastructure Bet

This optimization requires stateful server infrastructure, and that is a real constraint. Stateless HTTP servers can be scaled horizontally without concern for session affinity. Any request can go to any server in the pool because no server holds state specific to a user session. WebSocket connections are pinned to a specific backend instance for their lifetime. If that instance needs to be drained, restarted, or if load needs to be rebalanced, the active connections must either be migrated (expensive and complex) or dropped (breaking the session).

This matters more in serverless environments. AWS Lambda, Cloudflare Workers, and similar platforms are built around short-lived invocations, typically with hard execution time limits. An agentic loop that runs for ten minutes across many turns cannot be served from a single Lambda invocation. If each tool-calling step is a separate invocation, there is no persistent process to own a WebSocket connection to the model provider. You’re back to reconnecting on each turn and losing the session cache.

The practical implication is that connection-scoped caching rewards long-lived processes. A server that manages many concurrent agentic sessions, each represented by a persistent WebSocket to the Responses API, benefits fully. A serverless or short-lived architecture cannot hold those connections and must fall back to HTTP with whatever prefix caching survives between requests.

This is OpenAI making an architectural statement: they expect production agentic workloads to run as long-lived services, not as stateless functions.

Comparing Caching Strategies Across Providers

Anthropic resolved a related problem differently with prompt caching. Rather than tying cache lifetime to a connection, Anthropic lets clients annotate specific positions in the prompt with cache_control: {"type": "ephemeral"} markers. The server caches KV states at those positions with a five-minute TTL. You can place up to four cache breakpoints, which is enough to protect system prompt, tool definitions, and a long reference document, but the growing conversation history is still recomputed on each turn unless you happen to hit the TTL window between turns.

Google’s context caching for Gemini takes an even more explicit approach: you create a named cache object containing whatever content you want preserved, with a configurable TTL, and reference it by ID in subsequent requests. The cache is decoupled from any specific connection, which means it works well across stateless invocations and can even be shared across users if the content warrants it. The tradeoff is that you must manage cache objects explicitly, and the growing per-session conversation history is still outside the cache.

OpenAI’s connection-scoped approach is the most aggressive of the three because it caches the full accumulated session state, not just a fixed prefix. It does not require the developer to mark what to cache or manage cache objects. The tradeoff is the connection management complexity described above.

None of these approaches is a clean winner for all workloads. For stateless, bursty workloads with a large static preamble, Anthropic’s explicit markers or Google’s named caches are well suited. For long-running, turn-heavy agent sessions where the conversation history itself is the dominant cost, connection-scoped caching wins by a significant margin.

What This Means for How You Build Agent Servers

If you’re building an agent server on top of the Responses API today, the decision tree is roughly: if your agent sessions run for more than three or four turns on average and your infrastructure can hold long-lived processes, using the WebSocket transport is worth the connection management overhead. The TTFT savings per turn will be significant enough to matter for user experience, especially for interactive coding sessions where the user is watching the agent work.

For a Discord bot that runs single-turn or two-turn AI responses per message, HTTP is fine. The overhead of establishing a WebSocket and maintaining it for one turn exceeds the prefill savings. But for a bot that runs a multi-step research task or debugging session on behalf of a user, a persistent WebSocket session per active task makes sense.

The connection management itself is not particularly complex at small scale. Most WebSocket client libraries handle reconnection and ping-keepalive. What requires thought is what happens to in-flight agent state when a connection drops unexpectedly mid-turn: the session-pinned cache is gone, so reconnecting means paying full prefill cost for the accumulated context. You need to checkpoint enough state client-side to rebuild the request if the connection is lost, which is good practice anyway for agent fault tolerance.

The deeper takeaway from OpenAI’s Responses API WebSocket support is that the infrastructure assumptions underneath API-based AI are shifting. Chat completion was designed around stateless, short-lived requests. Agentic workloads are session-oriented, long-running, and latency-sensitive in a different way: not latency on a single request but cumulative latency across many turns. Connection-scoped caching is an infrastructure concession to that reality, and the fact that it required WebSockets rather than working within the existing HTTP model tells you how different the resource management problem actually is.