
When the Model Provider Becomes the Infrastructure Provider

Source: openai

Building an AI agent has always involved two distinct problems: getting the model to reason correctly, and wiring up the infrastructure the model needs to act on the world. Files, sandboxes, browser sessions, command execution, persistent state across turns. Every serious agent implementation has spent significant engineering time on that second problem, and the solutions have been varied: spinning up Docker containers, integrating E2B sandboxes, wrapping Playwright, managing conversation history manually. OpenAI’s Responses API, and specifically the agent runtime they describe building on top of it, is a direct answer to that infrastructure problem. The architecture is worth understanding in detail because it represents a meaningful shift in how the model provider is positioning itself.

What the Responses API Changes

The Chat Completions API is stateless. You send the full message history on every request, the model generates a response, and state management is entirely your responsibility. For single-turn interactions that is fine. For agents running multi-step tasks across dozens of turns, that design forces you to maintain a growing array of messages, serialize it across requests, and pass it back each time. The payload grows with every step.

The Responses API changes this with server-side state. Each response has an id; you reference it in the next call via previous_response_id, and OpenAI stores the full conversation history on their end:

response = client.responses.create(
    model="gpt-4o",
    input="Summarize the CSV I uploaded",
    previous_response_id="resp_abc123",
    tools=[{"type": "shell", "container": {"type": "auto"}}]
)

This is not just a convenience. For long-running agents where the accumulated context can reach hundreds of thousands of tokens, server-side storage means you are not paying to re-transmit that history on every turn. The API chains responses by reference rather than by value. You also stop writing conversation management code entirely, which eliminates a class of subtle bugs around message ordering, role validation, and history truncation.
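The difference is easy to see in payload terms. Here is a minimal sketch contrasting by-value and by-reference chaining; the per-turn token count is an illustrative assumption, not a measurement:

```python
# Sketch: request payload growth, by-value (Chat Completions) vs
# by-reference (Responses). Token counts are illustrative.

def chat_completions_payload_tokens(turns, tokens_per_turn=2_000):
    # Stateless API: every request re-sends the whole history,
    # so turn N transmits N * tokens_per_turn of context.
    return [n * tokens_per_turn for n in range(1, turns + 1)]

def responses_payload_tokens(turns, tokens_per_turn=2_000):
    # Server-side state: each request sends only the new input plus a
    # previous_response_id reference, so payload size stays flat.
    return [tokens_per_turn for _ in range(turns)]

print(sum(chat_completions_payload_tokens(50)))  # 2550000 tokens transmitted
print(sum(responses_payload_tokens(50)))         # 100000 tokens transmitted
```

Over a 50-turn session the by-value approach transmits roughly 25x more data in this toy model, and the gap widens quadratically with session length.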

The output format is also different. Rather than a choices[] array, the Responses API returns a typed output[] list that can include messages, tool calls, tool results, reasoning summaries (for o-series models), and the results of hosted tool execution all in one structure.
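In practice you consume that list by dispatching on item type. The sketch below uses simplified dicts rather than the SDK's real classes, and the exact item type names are assumptions based on the description above:

```python
# Sketch: walking a Responses-style typed output list. Item shapes are
# simplified dicts, not the SDK's real objects.

def summarize_output(output):
    """Group output items by their type field for inspection."""
    summary = {}
    for item in output:
        summary.setdefault(item["type"], []).append(item)
    return summary

output = [
    {"type": "reasoning", "summary": "Plan: list files, then read the CSV."},
    {"type": "shell_call", "command": "ls /mnt/data"},
    {"type": "shell_call_output", "stdout": "sales.csv\n"},
    {"type": "message", "content": "The file sales.csv is present."},
]

print(sorted(summarize_output(output)))
# ['message', 'reasoning', 'shell_call', 'shell_call_output']
```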

The Shell Tool Is Not a REPL

The most significant addition to the Responses API is the shell tool. It is worth being precise about what this means, because it is easy to conflate it with Code Interpreter, which has existed in various forms since the Assistants API.

Code Interpreter was a Python sandbox. The model wrote Python, the sandbox executed it, and the output came back. Useful for data analysis and computation. The shell tool is something broader: a full bash shell running in a Linux container.

{
  "type": "shell",
  "container": {"type": "auto"}
}

When the model invokes this tool, it executes arbitrary shell commands in a persistent container environment. The container survives across multiple turns of the same session. Files written in turn three are still there in turn seven. You can install packages with apt or pip, run compiled binaries, manipulate the filesystem with standard Unix tools, chain commands with pipes, and redirect output. The model is operating something closer to a laptop than a code runner.

This matters because many real tasks require multiple interacting processes or depend on tools that are not pure Python. A task like “clone this repository, run the tests, identify which ones are failing, and generate a fix” requires git, a test runner, file reads and writes, and possibly compilers or interpreters specific to the project’s language. Code Interpreter could not do most of that. The shell tool can.
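Chaining turns against the same container looks like this in outline. The helper and the response id are illustrative; the actual API calls are shown but not executed:

```python
# Sketch: two chained turns sharing one container. previous_response_id
# carries both the conversation state and the container forward.

def chained_turn(text, previous_response_id=None, model="gpt-4o"):
    """Build a Responses request that reuses the session's shell container."""
    req = {
        "model": model,
        "input": text,
        "tools": [{"type": "shell", "container": {"type": "auto"}}],
    }
    if previous_response_id:
        req["previous_response_id"] = previous_response_id
    return req

turn_1 = chained_turn("Clone the repo and run the tests.")
# resp_1 = client.responses.create(**turn_1)

# The clone from turn 1 is still on disk: same container, same filesystem.
turn_2 = chained_turn("Fix the failing test.", previous_response_id="resp_abc123")
# resp_2 = client.responses.create(**turn_2)
```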

The container itself has a lifecycle tied to the session. OpenAI provisions it on first use when you specify "container": {"type": "auto"}, or you can manage containers explicitly:

POST /v1/containers
{"name": "analysis-session", "expires_after": {"anchor": "last_active_at", "minutes": 60}}

You can upload files into a container before the agent runs, seed it with data or code, and the agent reads and writes through the same filesystem the uploads land in. OpenAI has also indicated support for persistent containers that survive across sessions, though the details of that feature were still being finalized at launch.
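The expiry policy in the request above is worth unpacking: anchoring to last_active_at means the 60-minute clock resets on every use, so a container stays alive as long as the agent keeps touching it. A local model of that policy:

```python
from datetime import datetime, timedelta

# Sketch: modeling the idle-expiry policy from the container request above.
# anchor="last_active_at" resets the countdown on each interaction.

def is_expired(last_active_at, now, minutes=60):
    return now > last_active_at + timedelta(minutes=minutes)

last_active = datetime(2025, 1, 1, 12, 0)
print(is_expired(last_active, datetime(2025, 1, 1, 12, 45)))  # False
print(is_expired(last_active, datetime(2025, 1, 1, 13, 30)))  # True
```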

How This Compares to the Alternative Approaches

The landscape of agent sandboxing has developed quickly. E2B offers purpose-built sandboxes for AI code execution with roughly 150ms cold starts and SDKs for Python and TypeScript. Modal provides cloud compute for Python workloads with support for custom Docker images, GPU access, and persistent storage. Daytona builds Git-native development environments increasingly aimed at agent workloads.

All three of these are infrastructure-agnostic: they work with any model provider. They are composable primitives you assemble into an agent stack.

OpenAI’s containers are the opposite of infrastructure-agnostic. They are only accessible through the Responses API. The integration is tight by design: the model and the execution environment are managed by the same service, which means OpenAI can handle the tool execution loop internally without requiring you to poll for results and submit them back manually. For hosted tools, the model emits a shell_call output item, OpenAI executes it, and the result is automatically included in the next turn’s context. You do not write the tool execution loop at all.
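For contrast, here is the loop that a client-side tool integration has to implement itself. This is a deliberately minimal sketch with a stubbed model function standing in for the network call; a production loop would also handle retries, timeouts, and sandboxing of the executed command:

```python
import subprocess

# Sketch of the client-side tool loop that hosted shell execution replaces.
# fake_model stands in for a real chat completion request.

def fake_model(history):
    # Emit one tool call, then finish once a result is in the history.
    if any(m["role"] == "tool" for m in history):
        return {"role": "assistant", "content": "done"}
    return {"role": "assistant", "tool_call": {"command": ["echo", "hello"]}}

def run_agent(task):
    history = [{"role": "user", "content": task}]
    while True:
        msg = fake_model(history)
        call = msg.get("tool_call")
        if call is None:
            return msg["content"]
        # You execute the tool yourself, then feed the result back in.
        out = subprocess.run(call["command"], capture_output=True, text=True)
        history.append(msg)
        history.append({"role": "tool", "content": out.stdout})

print(run_agent("say hello"))  # done
```

With hosted tools, everything inside that while loop happens server-side between output items.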

Anthropic’s computer use feature takes the opposite design philosophy. Claude can control a desktop or browser, but Anthropic provides only the model capability, not the infrastructure. You run your own VM, typically using their reference Docker implementation with an X11 server and noVNC, and you implement the screenshot-action loop yourself. The advantage is full control over the environment, the ability to run it on-premises, and no lock-in to a single model provider. The cost is operational complexity that the OpenAI approach eliminates.

The E2B cold start advantage (150ms versus several seconds for a full container boot) is real for high-frequency, short-lived tasks. But for multi-step agent sessions where the container stays warm across a long conversation, cold start time matters less than the operational overhead of managing the sandbox lifecycle yourself.

Security and the Prompt Injection Surface

Computer use introduces a meaningful security problem that container-based code execution does not have: the model is interacting with a rendered GUI, and the content of that GUI comes from the external world.

A webpage the agent visits can include text styled to look invisible to humans but readable to the model, attempting to override instructions. A PDF the agent opens can contain instructions in white text on a white background. This is prompt injection through the environment, and it is structurally difficult to defend against because the same visual parsing capability that makes computer use powerful also makes it susceptible.

OpenAI addresses this with a combination of model-level instruction following (the model is trained and instructed to flag and pause on sensitive actions) and a requires_action mechanism that pauses mid-task and hands control to the human when the agent encounters something requiring confirmation, such as submitting a form or entering credentials. This is a reasonable mitigation but not a complete defense. The security posture of a computer use agent depends significantly on the trust level of the environments it is asked to interact with.

The shell tool has a cleaner threat model. The container environment is populated by you or by the model itself. The attack surface is primarily the model’s own code generation, which is well-studied territory. Network egress from containers is allowed by default but is configurable, which matters for tasks where internet access is a liability rather than a feature.

The Strategic Bet

OpenAI is making a specific architectural bet: that developers building agents on their models will prefer a managed, vertically integrated stack over assembling their own from composable parts. The Responses API packages the model, the execution environment, the tool loop, and the state management into a single service boundary.

That is a real value proposition. The amount of engineering that goes into wiring up a reliable agent sandbox, handling container lifecycle, implementing retry logic, managing conversation history at scale, and building the tool execution loop is substantial. Offloading it to the model provider removes entire layers of infrastructure you would otherwise have to operate.

The trade-off is the one that always comes with vertical integration: reduced flexibility and a binding to a single vendor’s model quality and pricing decisions. A team that builds on E2B and the Chat Completions API can switch models without changing their infrastructure layer. A team built on the Responses API shell tool is more deeply coupled.

For many applications, particularly those where iteration speed matters more than infrastructure control, that trade-off will be worth making. The Responses API is genuinely simpler to build on for agentic tasks than assembling the equivalent stack yourself. Whether that simplicity is worth the coupling is a question each team will answer differently depending on how central model choice is to their product strategy.
