
Hosting the Compute: What the Responses API's Container Model Actually Changes

Source: openai

The original article from OpenAI, published March 11, covers how they wired the Responses API to a container runtime using a shell tool and hosted execution environments. Looking at this a week later, the more interesting story isn’t the feature itself but the architectural position it represents. OpenAI isn’t just adding a tool; they’re making a claim about where the agent runtime should live.

The State Problem That Killed Earlier Agent APIs

The Assistants API threaded conversation history and file attachments across turns, which was a genuine improvement over stateless completions. But its compute story was narrow. The Code Interpreter gave you a sandboxed Python kernel with a fixed package set and no real OS access. You could run Python and read uploaded files, but you couldn’t install arbitrary packages, execute shell scripts, spawn subprocesses, or do anything that required a general-purpose Linux environment.

This constraint sounds workable until you actually try to build agents that do real work. Shell pipelines, build tools, CLI utilities, git operations, arbitrary package installation, and process management are all outside what a Python sandbox can express cleanly. The gap between what developers needed and what the hosted environment provided is exactly what pushed teams toward building their own execution infrastructure.

Services like E2B built their entire product around this gap: isolated, ephemeral compute sandboxes with clean SDKs, designed specifically for AI agent workloads. Modal and similar platforms offered another path, though more focused on serverless Python execution than agent-specific concerns. Teams were essentially paying twice: once for the model API and once for the execution environment the model API should have provided.

What the Responses API Does Differently

The Responses API separates model state from compute state. The conversation threads forward via previous_response_id, but the container is its own layer with its own lifecycle. A container provisioned for a session persists across all tool calls within that session, meaning state accumulates correctly: installed packages stay installed, written files remain accessible, environment variables persist.

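from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4o",
    # The prompt assumes a dataset is already available to the container;
    # the upload step is not shown here.
    input="I've uploaded a dataset. Install scipy, fit a normal distribution to the latency_ms column, and return the parameters.",
    tools=[
        {
            "type": "code_interpreter",
            # "auto" asks the API to provision the container for this session
            "container": {"type": "auto"}
        }
    ],
    store=True  # keep the response so a later turn can reference it by ID
)

print(response.output_text)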

The container: {"type": "auto"} configuration tells the API to provision an ephemeral Linux environment scoped to this session. That’s the meaningful departure from the old Code Interpreter: the container is a real compute environment, not a sandboxed Python process.
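To make the split between model state and compute state concrete, here is a minimal sketch of how a second turn could be threaded onto the first. The helper names first_turn and follow_up_turn are hypothetical; only previous_response_id, tools, and store come from the request shown above. Whether the follow-up lands in the same container depends on the API's container lifecycle rules, which this sketch does not verify.

```python
from typing import Any, Dict

# Tool configuration copied from the request above.
CODE_INTERPRETER_TOOL: Dict[str, Any] = {
    "type": "code_interpreter",
    "container": {"type": "auto"},
}


def first_turn(prompt: str, model: str = "gpt-4o") -> Dict[str, Any]:
    """Build keyword arguments for the opening request (hypothetical helper)."""
    return {
        "model": model,
        "input": prompt,
        "tools": [CODE_INTERPRETER_TOOL],
        "store": True,  # stored responses can be referenced by later turns
    }


def follow_up_turn(previous_response_id: str, prompt: str,
                   model: str = "gpt-4o") -> Dict[str, Any]:
    """Build keyword arguments for a later turn in the same conversation."""
    request = first_turn(prompt, model)
    # Model state: the conversation threads forward through this ID.
    request["previous_response_id"] = previous_response_id
    return request
```

With a real client, the second call would look like client.responses.create(**follow_up_turn(response.id, "Now plot the fitted curve.")). Keeping the request construction in plain functions makes the threading easy to inspect without calling the API.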

The Shell Tool and What It Unlocks

The shell tool extends the execution model beyond Python into a full POSIX shell. This matters more than it might seem at first glance.

When a model can only run Python, every task has to be expressible as a Python program. A lot of real agent work involves running compilers, calling CLI utilities, processing files with standard Unix pipelines, invoking curl or wget, checking system state with ps or df, or running arbitrary build commands. None of that maps cleanly onto “execute a Python script.”

# The kind of shell invocation the model can now generate
# (jq -c writes one record per line, so wc -l counts the error records)
apt-get install -y --quiet jq && \
curl -s https://api.example.com/data | jq -c '.records[] | select(.status == "error")' > errors.json && \
wc -l errors.json

The shell tool is closer to giving the model a terminal. It can install packages, pipe commands together, call arbitrary executables, write files, and in principle do anything the container’s process context permits.

This design choice has an important implication for agent architecture. A Python Code Interpreter is relatively easy to reason about: the model generates Python, the Python runs, and its effects mostly stay inside a single interpreter process. A shell tool is much harder to reason about, not because shell is more expressive (both are Turing-complete) but because its side effects are so much broader. The agent can now install software, modify system configuration, and create persistent artifacts, all within the session boundary.

Security and the Container Isolation Model

This is where hosted containers make their most important trade-off. OpenAI controls the runtime, which means they control the isolation boundaries. Containers are session-scoped and destroyed at session end. Network access can be restricted. Resource limits cap runaway processes. The security model is real.
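As a local illustration of what “resource limits cap runaway processes” means in practice, here is a minimal sketch that runs a shell command with a wall-clock timeout and a CPU-time limit. It says nothing about how OpenAI implements its limits; the function name run_limited and its default values are assumptions chosen for the example, and it relies on Unix-only facilities.

```python
import os
import resource
import signal
import subprocess
from typing import Tuple


def run_limited(command: str, wall_seconds: float = 2.0,
                cpu_seconds: int = 1) -> Tuple[str, bool]:
    """Run a shell command; return (combined output, timed_out)."""

    def apply_limits() -> None:
        # Runs in the child before exec: cap CPU time for the command.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))

    proc = subprocess.Popen(
        ["/bin/sh", "-c", command],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # own process group, so children die with it
        preexec_fn=apply_limits,
    )
    try:
        output, _ = proc.communicate(timeout=wall_seconds)
        return output, False
    except subprocess.TimeoutExpired:
        try:
            # Kill the whole group, not just the shell that spawned it.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited on its own just after the timeout
        output, _ = proc.communicate()
        return output, True
```

A hosted runtime enforces limits like these on your behalf. The trade-off discussed in this section is that you do not choose the values or the mechanism.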

But the alternative approaches offer different properties. Firecracker microVMs provide stronger isolation through hardware virtualization with sub-second startup times; AWS Lambda runs on them. gVisor intercepts system calls in user space, adding another boundary between container code and the host kernel. When OpenAI hosts the container, you don’t get to choose either one: the isolation technology is whatever OpenAI runs.

E2B’s approach is more flexible: they provision sandboxes in their infrastructure but give you more control over the sandbox configuration, how sandboxes connect to your application, and how long they persist. The Responses API is more opinionated: OpenAI owns the runtime, and you work within their model. For teams that handle regulated data or need containers running inside their own VPC, hosted containers aren’t an option regardless of the security properties OpenAI provides.

The Agentic Loop Without the Infrastructure

What the Responses API with hosted containers actually enables is the standard agent loop without any external infrastructure:

  1. Model receives a task
  2. Model emits tool calls: shell commands, file reads, web requests
  3. Container executes the tool calls and returns results
  4. Model processes results and decides the next action
  5. Loop until done
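The five steps above can be sketched locally. In this minimal sketch, a scripted policy stands in for the model and a temporary directory plus a local shell stands in for the hosted container; both are assumptions made for illustration, and all names (agent_loop, run_in_session, scripted_policy) are hypothetical.

```python
import subprocess
import tempfile
from typing import Callable, List, Optional, Tuple

Transcript = List[Tuple[str, str]]  # (command, output) pairs
# Stand-in for the model: sees the transcript, returns the next command
# or None when the task is done.
Policy = Callable[[Transcript], Optional[str]]


def run_in_session(command: str, workdir: str) -> str:
    """Step 3: execute one tool call inside the session's working directory."""
    result = subprocess.run(
        ["/bin/sh", "-c", command],
        cwd=workdir, capture_output=True, text=True, timeout=30,
    )
    return result.stdout + result.stderr


def agent_loop(policy: Policy, max_steps: int = 10) -> Transcript:
    transcript: Transcript = []
    # The directory lives for the whole session and is removed at the end,
    # loosely mirroring a session-scoped container.
    with tempfile.TemporaryDirectory() as workdir:
        for _ in range(max_steps):
            command = policy(transcript)   # steps 1, 2 and 4
            if command is None:            # step 5: the policy says it is done
                break
            output = run_in_session(command, workdir)
            transcript.append((command, output))
    return transcript


def scripted_policy(transcript: Transcript) -> Optional[str]:
    """Fixed script: write a file in one turn, read it back in the next."""
    script = ["printf 'a\\nb\\nc\\n' > data.txt", "wc -l < data.txt"]
    return script[len(transcript)] if len(transcript) < len(script) else None
```

Calling agent_loop(scripted_policy) runs both commands in the same directory, so the second command sees the file the first one wrote. That shared state across tool calls is the property a session-scoped container provides without this plumbing.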

Previously, step 3 required building or renting the execution environment, handling I/O plumbing, managing file state, and wiring everything together with orchestration code. Frameworks like LangChain and LlamaIndex grew substantially from this need: the model APIs were stateless, so someone had to manage the state.

The session-scoped container makes that orchestration unnecessary for a large class of tasks. A model that writes a script in one turn, installs a dependency two turns later, and runs the combined output three turns after that is operating coherently across a stateful environment. That’s a different kind of agent capability than chaining completions over a context window.

What This Changes in Practice

For agent workloads that fit within OpenAI’s container model, the Responses API removes substantial infrastructure burden. Teams that were provisioning E2B sandboxes, managing Modal deployments, or running local Docker containers as execution environments for their agents now have a hosted option that’s integrated directly into the API.

The trade-offs are real, though. Custom base images, persistent storage across sessions, private networking, and compliance isolation requirements are not addressed by what’s been publicly documented. Observability is also harder: when the execution environment is external, understanding exactly what the agent did inside the container requires trusting the logs and output the API surfaces to you, which constrains how you build evaluation and debugging pipelines.
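Because the API's output is the only window into the container, one practical step is to extract an audit trail from each response. The sketch below works on plain dictionaries. The item type "code_interpreter_call" and the fields code, container_id, and status are assumptions about the response format and should be checked against the current API reference; audit_trail is a hypothetical helper.

```python
from typing import Any, Dict, List


def audit_trail(output_items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect what the API reports about code executed in the container."""
    trail = []
    for item in output_items:
        # Assumed item type for container executions; other items
        # (messages, reasoning) are skipped.
        if item.get("type") != "code_interpreter_call":
            continue
        trail.append({
            "container_id": item.get("container_id"),
            "code": item.get("code"),
            "status": item.get("status"),
        })
    return trail
```

With the Python SDK, the input could be built as [item.model_dump() for item in response.output]. Whatever the exact fields turn out to be, the trail can only contain what the API chooses to surface, which is the constraint on evaluation and debugging described above.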

There’s a broader pattern here worth noting. OpenAI has steadily moved from providing a model API toward providing an agent runtime. The Assistants API was the first step. The Responses API with containers is a more coherent version of the same bet. Every layer they add to the stack, including tool execution, file storage, and now compute environments, is a layer that developers who need more control will have to route around.

For most prototyping and a meaningful fraction of production agent work, the hosted container story is compelling. For systems with real data requirements, security boundaries, or infrastructure maturity, you will still be running your own containers and managing your own state. The Responses API makes the first category significantly easier; it doesn’t change the second category at all.
