Hosting the Shell: What OpenAI's Agent Runtime Actually Requires

A year ago, OpenAI launched the Responses API as the intended successor to the Assistants API. At launch, it supported stateful multi-turn conversations via previous_response_id, streaming, and a set of built-in tools: web_search_preview, file_search, and computer_use_preview, with code_interpreter following a few months later. On March 11, 2026, OpenAI published a retrospective on how they extended that API with a shell tool and hosted containers to produce what amounts to a full agent runtime. Looking back at what they built and why shows where the AI infrastructure layer is heading: toward model APIs that also host the execution environment.

The Division That Used to Exist

Before the Responses API matured into its current form, the implicit contract between model providers and developers went like this: the provider handles inference, the developer handles everything else. Tool use worked through function calling. The model would emit a structured JSON blob indicating which function to call and with what arguments. The developer’s server received that, executed whatever the function mapped to (a database query, an HTTP request, a subprocess call), and fed the result back as a new message. The model never touched a filesystem or executed code; it just described what it wanted done.
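To make that contract concrete, here is a minimal sketch of the developer-side half of the loop. The message shape (call_id, name, arguments, output) and the get_order_status function are hypothetical stand-ins chosen for illustration, not any provider's actual wire format.

```python
import json

# Hypothetical local function the model is allowed to request.
def get_order_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}

# Registry mapping tool names (as the model emits them) to local callables.
TOOLS = {"get_order_status": get_order_status}

def dispatch(tool_call: dict) -> dict:
    """Execute one model-emitted tool call and build the result message."""
    name = tool_call["name"]
    if name not in TOOLS:
        payload = {"ok": False, "error": f"unknown tool: {name}"}
    else:
        try:
            # Assumption: arguments arrive as a JSON-encoded string.
            args = json.loads(tool_call["arguments"])
            payload = {"ok": True, "result": TOOLS[name](**args)}
        except (TypeError, json.JSONDecodeError) as exc:
            payload = {"ok": False, "error": f"bad arguments: {exc}"}
    # The developer sends this back to the model as a new message.
    return {
        "type": "tool_result",
        "call_id": tool_call["call_id"],
        "output": json.dumps(payload),
    }
```

Everything inside dispatch runs on the developer's server, which is the point: the model only ever sees the serialized output.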

This arrangement is still how Claude’s client-executed tool use works, and it’s how most LangChain and LlamaIndex integrations are structured. The advantage is simplicity in the provider’s infrastructure and full developer control over the execution environment. The disadvantage is that every developer has to build and secure their own execution layer, even when the workload is the same: run some code, read some files, produce a result.

E2B emerged specifically to fill this gap, providing sandboxed code execution environments as a service that developers could drop in as the execution backend for their agents. Modal, Fly.io, and similar platforms offer ephemeral compute that can serve the same purpose. These tools work, but they represent an extra integration step, an extra billing relationship, and an extra thing to secure.

What the Shell Tool Changes

The shell tool in the Responses API removes that indirection for a common class of workload. Instead of the model returning a function call descriptor that your server then dispatches to some execution environment, the model can directly invoke a shell in a container that OpenAI hosts. The container persists for the duration of a session (keyed to a conversation), so files written in one turn remain accessible in subsequent turns. The shell is a POSIX environment with standard utilities and the ability to install packages, write scripts, and pipe output between commands.

This is architecturally similar to what code interpreter has done since ChatGPT’s Advanced Data Analysis days, but generalized. Code interpreter ran a restricted Python kernel in a Jupyter-adjacent environment. The shell tool is broader: it gives the agent access to the full POSIX toolchain, which means it can run compiled binaries, invoke system utilities, manipulate files in formats that Python libraries might not easily handle, and coordinate between multiple languages in a single session.

A rough sketch of how a request flows through this system:

Developer sends message to Responses API
  -> Model decides to use shell tool
  -> OpenAI routes tool execution to hosted container
     (container assigned to session ID, warm if already used)
  -> Shell command executes, stdout/stderr captured
  -> Output returned to model as tool result
  -> Model incorporates result into response
  -> Streaming response returned to developer

State lives at two levels: the conversation state (managed by previous_response_id, which stitches context across turns), and the container state (files, installed packages, environment variables, process state within the session lifetime). Developers still write the outer loop if they need multi-session persistence or want to extract artifacts when the session closes.
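The container half of that state can be imitated locally to see what session scoping means in practice. The sketch below is a stand-in under loud assumptions: it uses a temporary directory per session ID instead of a real container, provides no isolation at all, and the SessionShells class is a hypothetical name, not part of any OpenAI SDK.

```python
import subprocess
import tempfile

class SessionShells:
    """Local stand-in for hosted containers: one working directory per session ID."""

    def __init__(self) -> None:
        self._workdirs: dict = {}

    def run(self, session_id: str, command: str, timeout: float = 10.0) -> dict:
        # Reuse the directory if this session has run before ("warm"), else create it.
        if session_id not in self._workdirs:
            self._workdirs[session_id] = tempfile.TemporaryDirectory()
        workdir = self._workdirs[session_id].name
        proc = subprocess.run(
            ["sh", "-c", command],
            cwd=workdir,
            capture_output=True,  # stdout/stderr become the tool result
            text=True,
            timeout=timeout,
        )
        return {"exit_code": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}

    def close(self, session_id: str) -> None:
        # Session expiry: the files written during the session are removed.
        workdir = self._workdirs.pop(session_id, None)
        if workdir is not None:
            workdir.cleanup()

if __name__ == "__main__":
    shells = SessionShells()
    shells.run("sess-1", "echo 42 > answer.txt")         # turn 1 writes a file
    print(shells.run("sess-1", "cat answer.txt"))        # turn 2 can still read it
    print(shells.run("sess-2", "cat answer.txt"))        # a different session cannot
    shells.close("sess-1")
    shells.close("sess-2")
```

Anything that must outlive close() has to be copied out first, which is the outer loop the developer still owns.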

The Multi-Tenant Isolation Problem

Hosting shell environments for arbitrary agents at scale is not a straightforward infrastructure problem. The security surface is wide. A container that can run arbitrary code, read from and write to a filesystem, and potentially make network calls has to be isolated from neighboring tenants by something stronger than conventional container namespaces.

The industry’s established tools for this are sandboxes that put an extra boundary between the workload and the host kernel. gVisor interposes a user-space kernel between the container and the host, intercepting system calls before they reach the host kernel. Firecracker, developed by Amazon for AWS Lambda and Fargate, uses KVM to run lightweight microVMs with boot times as low as 125 milliseconds and memory overhead under 5 MiB per microVM. Either approach adds a layer of isolation that pure Linux namespaces and cgroups do not provide.

OpenAI has not published the full technical details of their container isolation model, but their code interpreter feature has been running multi-tenant sandboxed execution since 2023, and the shell tool extends that same infrastructure. The key properties a hosted agent runtime needs to guarantee are: code in one container cannot read memory or files from another, resource limits prevent any single container from monopolizing host CPU or memory, network egress is controlled to prevent data exfiltration or unexpected outbound connections, and containers are cleanly destroyed after session expiry.

The last point is subtle. An agent that writes a file containing sensitive data (retrieved from a web search, extracted from an uploaded document) and then lets the session expire should not leave that data accessible. Container destruction needs to be complete and auditable.
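Two of those properties, resource limits and verifiable teardown, can be illustrated on a single Unix machine. This is a sketch only: it assumes a POSIX host, relies on setrlimit and a temporary directory rather than gVisor or Firecracker, and the function name and the CPU_SECONDS value are illustrative choices, not anything OpenAI has described.

```python
import os
import resource
import shutil
import subprocess
import tempfile

CPU_SECONDS = 2  # illustrative cap, not a value from OpenAI

def _apply_limits() -> None:
    """Run in the child just before exec: cap CPU time per process (children inherit it)."""
    _, hard = resource.getrlimit(resource.RLIMIT_CPU)
    soft = CPU_SECONDS if hard == resource.RLIM_INFINITY else min(CPU_SECONDS, hard)
    resource.setrlimit(resource.RLIMIT_CPU, (soft, hard))

def run_then_destroy(command: str) -> dict:
    """Run a command in a fresh workspace, then delete the workspace and report on it."""
    workdir = tempfile.mkdtemp(prefix="session-")
    try:
        proc = subprocess.run(
            ["sh", "-c", command],
            cwd=workdir,
            capture_output=True,
            text=True,
            preexec_fn=_apply_limits,  # Unix-only hook
            timeout=30,
        )
        result = {"exit_code": proc.returncode, "stdout": proc.stdout}
    finally:
        # Teardown happens whether the command succeeded, failed, or timed out.
        shutil.rmtree(workdir, ignore_errors=True)
    # Record whether teardown actually removed the directory, so it can be audited.
    result["workdir"] = workdir
    result["destroyed"] = not os.path.exists(workdir)
    return result
```

The destroyed flag is the small-scale version of the audit requirement: the caller gets evidence that the workspace is gone rather than an assumption that it is.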

The Developer Trade-Off

For many use cases, the hosted container model is the right choice. Getting an agentic workflow running without setting up your own execution infrastructure saves real time. The billing model (compute time charged separately from tokens) keeps costs proportional to actual usage, similar to how code interpreter is currently metered.

For use cases that need more control, the hosted container model has genuine limitations. You cannot pre-install proprietary software, customize the base image, or give the container access to private network resources within your infrastructure without additional tunneling arrangements. Debugging is harder: when a shell command fails inside a hosted container, you are working from stdout and stderr, not from a terminal session you can attach to. And the execution environment changes when OpenAI updates the container image, which can break reproducible workflows.

This is why alternatives like E2B remain relevant even after OpenAI ships hosted containers. E2B lets you define your own sandbox template with a custom Dockerfile, gives you a programmatic API for interacting with the filesystem, and integrates with multiple model providers. If you need an agent that runs tools requiring a specific system library version or that needs to reach a private database, you want execution infrastructure you control.

The Patterns That Emerge

Looking at a year of the Responses API in production, the pattern that seems to matter most is the coupling between state and execution. The code interpreter era taught developers that ephemeral execution with no persistent filesystem is limiting; you end up re-generating data every session or building your own persistence layer on top. The hosted container model with session-scoped state is a meaningful improvement, but it is still session-scoped. Long-running agents that need to persist state across weeks or months still need external storage.

The shell tool also exposes something that agentic frameworks often paper over: the cost of tool failure recovery. When a shell command produces unexpected output or fails partway through a multi-step operation, the model has to reason about what state the container is in and how to recover. Structured tool interfaces with explicit success and error schemas are easier for models to handle reliably than free-form shell output. OpenAI’s tool use documentation has moved toward encouraging explicit output schemas for exactly this reason.
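One way to narrow that gap is to wrap shell execution so the model always receives the same fields, whatever the command printed. The sketch below assumes a local subprocess as the executor; the ShellResult field names and the error labels are hypothetical, not a schema taken from OpenAI's documentation.

```python
import json
import subprocess
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class ShellResult:
    ok: bool
    exit_code: Optional[int]
    stdout: str
    stderr: str
    error: Optional[str] = None  # machine-readable failure label, None on success

def run_structured(command: str, timeout: float = 10.0, max_chars: int = 4000) -> str:
    """Run a shell command and return a JSON string with a fixed set of fields."""
    try:
        proc = subprocess.run(
            ["sh", "-c", command], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return json.dumps(asdict(ShellResult(False, None, "", "", error="timeout")))
    succeeded = proc.returncode == 0
    result = ShellResult(
        ok=succeeded,
        exit_code=proc.returncode,
        stdout=proc.stdout[-max_chars:],  # keep only the tail so huge output stays bounded
        stderr=proc.stderr[-max_chars:],
        error=None if succeeded else "nonzero_exit",
    )
    return json.dumps(asdict(result))
```

With this wrapper the model can branch on ok and error before reading stdout at all, instead of inferring failure from the text of the output.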

The infrastructure direction is clear: the model API and the execution environment are converging into a single hosted runtime. Whether that convergence happens inside one provider’s platform or stays distributed across multiple services depends on what trade-offs matter most to the applications being built. The Responses API’s hosted containers answer the convenience question well. The control question is still better answered elsewhere.