
From Model to Shell: How OpenAI Folded the Execution Layer Into the Responses API

Source: OpenAI

When OpenAI launched the Responses API in early 2025, the architectural story was about unification: one endpoint for tool use, built-in integrations, and server-side conversation state, replacing the patchwork of Chat Completions calls and hand-rolled function dispatch that most agent developers had assembled themselves. The latest extension goes further. By adding a shell tool backed by hosted, sandboxed containers, OpenAI has folded the execution environment itself into the API surface.

The announcement describes how the Responses API now supports agents running in a computer environment: an isolated container that persists state across turns, where models can write files, execute programs, chain shell commands, and observe output. The model does not call a remote function and wait for a webhook; it calls the shell tool and OpenAI’s infrastructure runs the command, captures output, and returns it as context for the next turn. The execution environment has become part of the model API.

Understanding this as infrastructure design rather than feature addition clarifies the trade-offs.

What the Responses API Changed

The original Chat Completions API was stateless by design. Every request carried the full message history. Tool execution was the developer’s responsibility: write a function, register it as a tool, handle the model’s tool call in your code, run the function, and post the result back. State management, retry logic, and execution isolation all lived outside the API boundary.

The Responses API moved two things server-side: conversation state, managed through a previous_response_id parameter that lets you chain responses without rebuilding history arrays, and built-in tool execution. The web_search_preview tool calls infrastructure OpenAI runs, not an API you own. The code_interpreter tool executes in a Python sandbox OpenAI manages. The developer’s responsibility shifted from “wire up every tool yourself” toward “tell the model what to do.”
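The chaining pattern is easiest to see as request bodies. A minimal sketch using the parameter names described above; the `make_turn` helper and the response id are hypothetical, purely for illustration:

```python
# Hypothetical helper illustrating previous_response_id chaining; the
# request-body shape follows the Responses API as described in the text.

def make_turn(text, previous_response_id=None):
    """Build a Responses API request body for one conversation turn."""
    body = {"model": "gpt-4o", "input": text}
    if previous_response_id is not None:
        # Chain off server-side state instead of resending history arrays.
        body["previous_response_id"] = previous_response_id
    return body

first = make_turn("Find the latest release notes.")
# ...send `first`, read the id from the returned response...
follow_up = make_turn(
    "Summarize only the breaking changes.",
    previous_response_id="resp_abc123",  # hypothetical id from turn one
)
```

The second request carries no history array at all; the server reconstructs context from the referenced response.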

The shell tool extends this pattern to general-purpose execution. Where code_interpreter gave agents a constrained Python environment, a shell gives them a more complete Unix-like system: a filesystem they can write to, arbitrary programs they can invoke, and state that persists across turns within a session.

from openai import OpenAI

client = OpenAI()

# Declare the built-in shell tool; OpenAI's hosted container executes
# whatever commands the model decides to issue.
response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "shell"}],
    input="Check out this repository, run the test suite, and report which tests are failing and why."
)

The model generates shell tool calls. OpenAI’s infrastructure runs them in a container. Output comes back. The model reasons about it and issues more commands. The loop lives entirely within the API boundary, with no developer-managed execution layer required.
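That loop can be made concrete with a small simulation. Everything here is a stand-in: the action and context shapes, the fake model, and the fake container are illustrative, not the API's wire format.

```python
# Schematic of the server-side loop described above. The action and
# context shapes are illustrative, not the actual API wire format.

def run_agent_loop(model_step, execute_in_container, task, max_turns=8):
    """Alternate model reasoning with sandboxed command execution."""
    context = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = model_step(context)  # model decides the next step
        if action["type"] == "final":
            return action["content"]
        # action["type"] == "shell": run it, feed the output back in
        output = execute_in_container(action["command"])
        context.append({"role": "tool", "content": output})
    return None

# Toy stand-ins so the loop runs without any API or container:
def fake_model(context):
    if any(m["role"] == "tool" for m in context):
        return {"type": "final", "content": "1 test failing: test_login"}
    return {"type": "shell", "command": "pytest -q"}

def fake_container(command):
    return "FAILED tests/test_auth.py::test_login - 1 failed, 41 passed"

result = run_agent_loop(fake_model, fake_container, "Run the test suite.")
```

The point of the design is that `run_agent_loop` lives on OpenAI's side of the boundary; the developer supplies only the task.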

The Container Model

Giving a model shell access without isolation would be reckless. The hosted container environment is what makes this practical: each agent session gets an isolated container with its own filesystem and process namespace, preventing interference between tenants and limiting the damage a misbehaving agent can cause.

The containers are ephemeral across sessions but stateful within one. Files written in an early turn remain readable in later turns. A long-running process can be started and observed across multiple tool calls. This makes multi-step tasks tractable in a way that stateless execution cannot. Consider an agent tasked with cloning a repository, installing dependencies, building the project, observing build errors, modifying source files, and rebuilding. Each step depends on the state left by the previous one. In a stateless model, the agent would need to either perform all of this in a single script or reconstruct state from scratch on each turn. With persistent container state, the filesystem is simply there, in whatever condition the previous commands left it, ready for the next tool call.
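The difference statefulness makes can be shown with an in-memory stand-in for a session. This is a toy, not the real sandbox: it supports just enough fake shell to demonstrate state surviving between tool calls.

```python
# A minimal in-memory stand-in for a stateful container session,
# illustrating why persistence across turns matters (not a real sandbox).

class FakeSession:
    """Filesystem state survives between tool calls within a session."""

    def __init__(self):
        self.files = {}

    def run(self, command):
        # Support just enough fake shell to show cross-turn state.
        if command.startswith("echo "):
            text, _, path = command[len("echo "):].partition(" > ")
            self.files[path] = text
            return ""
        if command.startswith("cat "):
            return self.files.get(command[len("cat "):], "")
        return ""

session = FakeSession()
session.run("echo build-ok > status.txt")  # an early turn writes state
later = session.run("cat status.txt")      # a later turn observes it
```

In a stateless model, the second call would find an empty filesystem; here the state is simply there, as the text describes.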

From a systems perspective this closely resembles how CI/CD pipelines work: a fresh container per job, with consistent state throughout the job’s execution. The meaningful difference is that the job is defined dynamically by the model’s reasoning rather than statically by a YAML configuration file.

What Developers Were Doing Before

Before this, giving an agent real execution capabilities meant either building your own execution infrastructure or using a dedicated service. The main options each represent different points on the control-versus-convenience spectrum, and all require more developer-side orchestration than the Responses API approach.

E2B is the most widely adopted dedicated sandbox for AI agents. It provides a container primitive with SDK support for Python and JavaScript, handles lifecycle management, and integrates with LangChain and LlamaIndex. The developer manages the container explicitly: spin it up, pass a reference to the agent framework, route tool calls to it, clean it up afterward. The integration is more manual, but the developer controls what is installed and how the container is configured.

Modal takes a different angle: serverless containers for arbitrary Python workloads, with GPU support and persistent volumes. Agents built on Modal can access specialized hardware and custom dependencies. It is general-purpose infrastructure adapted to agent use rather than agent infrastructure specifically.

Self-managed Docker containers, on local machines or cloud instances, remain the baseline for production agent systems that need precise control over execution environment, network access policies, or integration with internal systems that cannot be reached from an external service.
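As a reference point, the self-managed baseline is small. A sketch assuming Docker is installed and a long-lived container has been started (e.g. `docker run -d --name agent-sandbox python:3.12 sleep infinity`); the helper names are ours, not from any SDK:

```python
# Self-managed baseline: route each agent command into a long-lived
# Docker container via `docker exec`. Container name is an assumption.
import subprocess

def sandbox_argv(command, container="agent-sandbox"):
    """Build the `docker exec` argv that runs one shell command."""
    return ["docker", "exec", container, "sh", "-c", command]

def run_in_sandbox(command, container="agent-sandbox", timeout=60):
    """Execute a command in the container; return (exit_code, output)."""
    proc = subprocess.run(
        sandbox_argv(command, container),
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.returncode, proc.stdout + proc.stderr
```

What this buys is full control over image contents, network policy, and data locality; what it costs is owning lifecycle, isolation, and cleanup yourself.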

OpenAI’s hosted containers sit at the most managed end of this spectrum. The developer gets zero infrastructure configuration, no lifecycle management, and tight integration with the model’s tool-calling loop, at the cost of less control over the environment and OpenAI having full visibility into what agents execute. Whether that trade-off is acceptable depends heavily on the application.

The Stateful Execution Problem

The harder problem that hosted containers address is not running individual commands; it is managing state across a multi-step task where the agent’s reasoning depends on actual system state rather than a model of it.

An agent operating purely through text reasons about state implicitly, from prior turns in the conversation. An agent with a persistent shell environment can observe state directly: list the directory, check whether a file exists, read a log, inspect a process table. This changes the error-recovery loop. When a command fails, the agent does not have to reason from first principles about what went wrong; it can check. When a long-running process finishes, the agent can verify the outcome rather than assuming it.
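The observe-then-recover pattern looks like this in miniature. The `run` callable stands in for a shell-tool call returning an exit code and output; the specific commands and the failure scenario are illustrative:

```python
# Sketch of observe-then-recover: on failure, the agent inspects real
# filesystem state instead of guessing. `run` stands in for the shell
# tool; commands and scenario are illustrative.

def install_with_recovery(run):
    """Try an install; on failure, observe the filesystem and retry."""
    code, out = run("pip install -r requirements.txt")
    if code == 0:
        return out
    # The command failed: check actual state rather than reasoning blind.
    _, listing = run("ls")
    if "requirements.txt" not in listing:
        _, found = run("find . -name requirements.txt -maxdepth 3")
        path = found.strip().splitlines()[0]
        code, out = run(f"pip install -r {path}")
    return out

# A fake shell that fails the first attempt, to exercise the recovery:
def fake_run(cmd):
    if cmd == "pip install -r requirements.txt":
        return 1, "No such file or directory"
    if cmd == "ls":
        return 0, "app\nREADME.md"
    if cmd.startswith("find"):
        return 0, "./app/requirements.txt\n"
    return 0, "install ok"

outcome = install_with_recovery(fake_run)
```

A text-only agent would have to hypothesize why the install failed; an agent with a shell can simply look.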

Persistent execution state also allows incremental progress across longer tasks. An agent working through a complex data processing pipeline can complete partial work, be interrupted, and resume from where it left off because the filesystem state is preserved. Without this, every agent run is a cold start, and any failure means starting over.

The Data Visibility Question

When OpenAI runs your agent’s shell commands in their containers, they have visibility into the commands the agent executes, the files it writes, and the programs it runs. OpenAI already sees prompts and responses; shell execution extends that visibility into compute behavior, which matters for applications where agent operations are sensitive. Proprietary source code, internal system credentials passed as environment variables, business logic encoded in shell scripts: all of these pass through infrastructure OpenAI controls.

This is not unique to OpenAI’s offering. E2B, Modal, and any other hosted execution service have the same access to workloads. The question is whether the simplicity trade-off is acceptable for a given application. Many applications will find it is. Applications handling regulated data, operating under strict data residency requirements, or working with security-sensitive processes will likely continue using self-managed execution environments.

The hybrid approach deserves consideration here. The Responses API supports custom function tools alongside built-in ones, so it is possible to use the hosted shell for general-purpose tasks while routing sensitive operations to containers you control. This architecture captures most of the state management and tool routing benefits of the Responses API while keeping specific workloads off OpenAI’s infrastructure.
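Concretely, the hybrid setup is a tools array mixing the built-in shell with a custom function tool. The function below is hypothetical, named for illustration; its calls would be executed in infrastructure you control and the results posted back as tool output:

```python
# Hypothetical hybrid tool list: the hosted shell for general work, a
# custom function tool for operations that stay on your own infra.
tools = [
    {"type": "shell"},  # runs in OpenAI's hosted container
    {
        "type": "function",
        "name": "query_internal_db",  # hypothetical sensitive operation
        "description": "Run a read-only query against the internal warehouse.",
        "parameters": {
            "type": "object",
            "properties": {"sql": {"type": "string"}},
            "required": ["sql"],
        },
    },
]
# When a response contains a query_internal_db call, execute it in a
# container you control and return the result as the tool's output.
```

The model routes between the two transparently; the developer decides, by tool definition, which operations ever touch OpenAI's containers.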

What the Trend Points To

The Responses API computer environment is part of a broader pattern of model providers moving down the stack. Inference was the starting point, then tool routing and state management, and now compute environments. The abstraction level available to developers keeps rising, and the infrastructure assembly that previously required significant engineering effort can increasingly be replaced with API primitives.

This sets a concrete reference baseline for agent infrastructure. Teams evaluating alternatives now have a clear comparison point: if your stack can match the Responses API’s shell tool in terms of security isolation, state persistence, and integration with the model’s reasoning loop, you have a viable independent option. If not, it is worth understanding which gaps you are accepting and why.

What improving infrastructure does not change is the difficulty of the agent reasoning problems themselves. Execution environments make those problems tractable, not solved. The container faithfully runs whatever the model decides to run, and the model still needs to be right about what to run. That part remains hard regardless of how seamlessly the execution layer is integrated.
