The Responses API Gets a Runtime: What OpenAI's Hosted Containers Actually Mean
Source: openai
back-filled from OpenAI’s March 11 post on the Responses API computer environment extension, this is worth examining carefully because it represents a shift in what OpenAI is actually selling. Not a language model API. An agent runtime.
To understand the significance, you need to start with what the Responses API replaced.
From Chat Completions to Responses
The Chat Completions API was designed for a world where calling a model meant sending a question and receiving an answer. State was the caller’s problem. Tools were schemas you defined yourself, and when the model called one, you received a tool_call object, ran the function on your end, and sent the result in the next request. The model never touched anything directly.
The Responses API, launched alongside the Agents SDK in early 2025, rethought this from the ground up. State becomes server-side: instead of resending your full conversation history on every request, you pass a previous_response_id and the server retrieves the prior context. Built-in tools can execute server-side without a client round-trip. And the response object is richer, an output array of typed items rather than a flat message:
from openai import OpenAI
client = OpenAI()
response = client.responses.create(
model="gpt-4o",
tools=[{"type": "web_search_preview"}],
input="What changed in Python packaging this week?",
)
follow_up = client.responses.create(
model="gpt-4o",
previous_response_id=response.id,
input="Focus on the pip changes specifically.",
)
Server-side state is not a minor convenience. In a long agentic session, every prior turn costs tokens on the next request when you manage state yourself. With previous_response_id, you are referencing stored context, not resending it. That changes the economics of multi-step tasks substantially, especially for tasks that require dozens of tool calls before reaching a result.
The Shell Tool and Hosted Containers
The computer environment extension described in the March 11 post takes this architecture further. The shell tool adds a bash type to the Responses API tool set, giving the model access to a persistent shell session across turns. Unlike a one-off command, this session maintains working directory and environment variables between invocations:
{
"type": "bash",
"command": "cd /workspace && python scripts/run_analysis.py --input data.csv"
}
On the next turn, the working directory is still /workspace. The model can build up a workspace incrementally: write a file, run a script, check the output, iterate. That is qualitatively different from passing text back and forth.
The hosted container is what makes this practical. Rather than requiring callers to maintain their own virtual machine or sandbox, OpenAI provisions a container environment alongside the inference request. Files can be uploaded before the session starts, and outputs can be retrieved after. The container has a lifecycle tied to the session: ephemeral, isolated, and torn down when the task ends.
The security model mirrors what code_interpreter has been doing since 2023: no outbound network access from within the container, filesystem isolation, and resource limits on CPU and memory. The shell tool extends this to a richer execution environment where the model can run arbitrary commands, install packages, and manage a stateful workspace across multiple reasoning steps.
How This Compares to the Alternatives
The most direct competitor is E2B, which has been offering sandboxed execution environments for LLM agents since 2023. E2B’s model is framework-agnostic: you spin up a Sandbox, run code in it, and the sandbox persists across multiple LLM calls within a session.
from e2b_code_interpreter import Sandbox
with Sandbox() as sandbox:
sandbox.run_code("x = 42")
result = sandbox.run_code("print(x * 2)")
print(result.logs.stdout) # "84" — state persists across calls
E2B supports custom Docker images, streaming stdout, configurable network access, and pricing around $0.000014 per vCPU-second. The flexibility is higher than OpenAI’s hosted containers and the cost per execution-second is lower. But E2B does not include the model. You are orchestrating the pieces yourself.
Anthropic’s computer use works at a similar layer: Claude emits actions (clicks, keystrokes, shell commands) and you execute them in an environment you maintain. Anthropic provides a reference Docker image that sets up a virtual display and the tooling needed to capture screenshots and drive a browser. The action schema is explicit:
# A click action
{"type": "computer", "action": "left_click", "coordinate": [760, 400]}
# A shell command
{"type": "bash", "command": "ls -la /workspace"}
Anthropic’s approach is more explicit about the boundary: the model talks to your environment, not to an Anthropic-managed one. You control the VM, the network, the filesystem. That means more operational complexity for you, but it also means the model can access internal services, connect to databases, or run on a machine with your specific toolchain installed.
OpenAI’s hosted container trades that control for integration. You provision nothing. The runtime appears when you create a session. The tradeoff is that the environment is generic. If your task requires a specific library version, a GPU, or access to a corporate VPN, the hosted container will not serve you.
Modal and Fly.io Machines represent a third approach: infrastructure primitives fast enough to be useful as on-demand agent execution environments. Modal’s cold start is around 200ms, it supports GPU workloads, and billing is per-function-invocation. Fly.io Machines boot full micro-VMs in roughly 500ms and support persistent network services. Neither is purpose-built for LLM agents, but both offer more control than OpenAI’s container and more flexibility than E2B’s opinionated sandbox.
The Vertical Integration Question
What OpenAI is building with the Responses API plus hosted containers is vertical integration in the infrastructure sense. The model, the state store, the tool execution, and the compute environment are all within one API surface and one billing relationship. For developers who want to build agents without stitching together a model provider, a sandbox service, and a vector store, this is a genuine simplification.
The cost of that simplification is portability. Agent logic written against the Responses API, relying on server-side state and hosted containers, is not straightforwardly portable to Claude or Gemini. The previous_response_id state model, the specific tool schemas, the container lifecycle assumptions: these are all OpenAI-specific. Migration would require rearchitecting the state management layer, not just swapping a model identifier.
This is a familiar dynamic in cloud infrastructure. S3, Lambda, and RDS each make individual services easier to use, but building on them together makes switching providers progressively harder. The Responses API with hosted containers is a similar bet. OpenAI is not competing only on model quality anymore; it is competing on how much infrastructure work it can absorb on the developer’s behalf.
Whether that is the right tradeoff depends on the use case. If you are building a product where the model and execution environment are implementation details and you want to iterate quickly, the integrated stack is worth the lock-in cost. If you are building infrastructure that needs to be model-agnostic, or if your execution requirements are specific enough that OpenAI’s generic container does not fit, assembling from components remains the more defensible architecture.
What Stays With the Developer
Even with a hosted container and a shell tool, the Responses API does not remove the hard part of building reliable agents. The model still makes mistakes. Shell commands fail. File paths do not exist. Tasks that look straightforward in a linear description turn out to require retry logic, partial failure handling, and result validation.
The Agents SDK provides some scaffolding here: an Agent class with built-in tool configuration, a Runner that manages the execution loop, and tracing tied to the OpenAI dashboard. Guardrails can validate inputs and outputs. Handoffs let one agent delegate to another with different instructions or tools.
from openai.agents import Agent, Runner
agent = Agent(
name="analyst",
model="gpt-4o",
tools=[{"type": "bash"}, {"type": "file_search", "vector_store_ids": ["vs_abc"]}],
instructions="Analyze the uploaded dataset and produce a summary report.",
)
result = await Runner.run(agent, "Run the full analysis pipeline.")
print(result.final_output)
But the scaffolding does not solve the fundamental problem of verifying that an agent completed a task correctly. That still requires either a human in the loop or a separate evaluation step, and neither the Responses API nor the hosted container addresses that.
Where This Lands
The March 11 announcement is a meaningful step toward a world where deploying an agent means configuring a runtime rather than writing an orchestration loop from scratch. The shell tool and hosted containers make a specific bet: developers want fewer moving parts, even at the cost of fewer configuration knobs.
For production use cases with standard requirements, that bet is probably right. For the cases at the edges, the existing alternatives still have meaningful advantages: E2B for custom environments and cheaper compute, Anthropic’s computer use for full control over the execution substrate, Modal for GPU workloads and function-level billing.
The more interesting question is what happens when all the major model providers offer hosted execution environments. At that point, the competition shifts from who has the best sandbox to who has the best model doing the actual work inside it.