The agentic loop has a state problem that context window management alone cannot fix. When a model decides to write code, install a library, parse a file, and then run the result, each of those steps produces artifacts that the next step depends on. The installed library needs to be present when the code runs. The parsed file needs to be on disk when the analysis script reads it. The code needs to have been written before it can be tested.
These dependencies are not representable as tokens in a context window. They require a persistent execution environment, one that survives not just the current tool call but the entire multi-step reasoning chain. That is compute state, and it is structurally different from the conversation history that model providers have been managing for years.
The Responses API computer environment update is OpenAI’s answer to this problem, and the design decision at its center is straightforward: the container persists for the session.
Two Kinds of State
Every provider that has shipped a stateful API has taken responsibility for context window state. OpenAI’s Responses API manages it through previous_response_id chaining: responses are stored server-side, linked explicitly, and the API reconstructs context on each call. Anthropic’s Messages API is deliberately stateless, with the documented expectation that the caller manages history. Google’s Gemini API accepts conversation history via the contents array. These are different strategies for the same problem: how the model knows what has already happened in a multi-turn conversation.
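The chaining mechanics can be sketched by building the request payloads directly. This is a minimal illustration, not SDK code; the response id "resp_abc" is a placeholder standing in for whatever id the API actually returns:

```python
def build_request(model, user_input, previous_response_id=None):
    """Build a Responses API request payload. When previous_response_id
    is set, the server reconstructs the conversation context itself;
    the caller never resends the transcript."""
    request = {"model": model, "input": user_input}
    if previous_response_id is not None:
        request["previous_response_id"] = previous_response_id
    return request

# First turn: no prior state to link to.
first = build_request("gpt-4o", "Load sales.csv and describe it")

# Suppose the API returned a response with id "resp_abc" (placeholder).
# The next turn links to it rather than replaying the history.
second = build_request("gpt-4o", "Now plot monthly totals", "resp_abc")
```

The caller's bookkeeping reduces to carrying one id forward; the token-level history lives server-side.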
Compute state is different. It is not tokens. It is filesystem nodes, installed packages, running processes, intermediate files. When an agent performs a task across multiple reasoning steps, it produces artifacts at each step that subsequent steps depend on. A pandas installation, a parsed CSV, a generated chart file: none of these fit in a context window without lossy conversion to text, and some of them cannot be converted at all.
Managing this second kind of state was the developer’s problem, until now.
Why Stateless Execution Breaks Multi-Step Agents
Consider an agent asked to analyze a CSV dataset, produce summary statistics, generate a visualization, and assemble a report combining both. A capable model will produce a reasonable plan: install pandas and matplotlib, load the data, compute statistics, generate a chart, build the report.
If each tool call executes in a fresh environment, this plan collapses at the second step. The pandas installation is gone. The CSV file written to disk is gone. The agent either re-executes all prior setup on every turn, which is expensive and prone to divergence, or it passes every artifact through the context window as text, which loses binary data and hits token limits fast.
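The failure mode is easy to demonstrate with a toy model of stateless execution, where the "filesystem" is a dict created fresh inside every call:

```python
def run_in_fresh_env(command):
    """Toy model of stateless execution: every call starts from an
    empty filesystem, so nothing written earlier survives."""
    files = {}  # created fresh on every call -- this is the whole problem
    verb, _, rest = command.partition(" ")
    if verb == "write":
        name, _, content = rest.partition(" ")
        files[name] = content
        return f"wrote {name}"
    if verb == "read":
        return files.get(rest, "<missing>")
    return "<unsupported>"

run_in_fresh_env("write sales.csv month,total")  # succeeds within its call
run_in_fresh_env("read sales.csv")               # the file is already gone
```

The second call returns "<missing>": the write happened, but in an environment that no longer exists.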
This is why stateless execution sandboxes are inadequate for non-trivial agent workflows. The issue is not complexity; it is whether intermediate state can persist across the loop iterations the agent needs to reason about.
Early approaches to this involved managing sandboxes directly. E2B provides microVM-based sandboxes, built on the same Firecracker isolation used by AWS Lambda, where a shell environment persists across SDK calls. The developer connects it to LLM tool calls by wrapping it in a function-calling schema, dispatching tool results to the sandbox in a loop, and tearing it down after the task completes. Modal solves a related problem from the serverless side, letting agents spin up function containers on demand without tight per-call semantics. Both approaches work, but they put the marshaling layer, sandbox lifecycle, resource limits, and cleanup on failure squarely on the developer.
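The lifecycle that falls on the developer in this model can be sketched with a stand-in sandbox class. The interface below is illustrative only, not E2B's or Modal's actual SDK; what it shows is the shape of the work: create a session, marshal tool calls in and results out, and clean up even on failure:

```python
class ToySandbox:
    """Stand-in for a persistent sandbox session. Unlike the fresh-env
    case, state written by one call is visible to the next -- but the
    developer owns creation, dispatch, and teardown."""
    def __init__(self):
        self.files = {}  # persists for the life of the sandbox object

    def run(self, command):
        verb, _, rest = command.partition(" ")
        if verb == "write":
            name, _, content = rest.partition(" ")
            self.files[name] = content
            return f"wrote {name}"
        if verb == "read":
            return self.files.get(rest, "<missing>")
        return "<unsupported>"

    def close(self):
        self.files.clear()  # cleanup on success and on failure is the developer's job

sandbox = ToySandbox()
try:
    # The tool-dispatch loop: each model tool call is marshaled into the
    # sandbox, and each result is marshaled back into the conversation.
    outputs = [sandbox.run(cmd) for cmd in
               ["write sales.csv month,total", "read sales.csv"]]
finally:
    sandbox.close()
```

Everything in the try/finally is the "marshaling layer" the article refers to: it works, but none of it is the agent's actual task.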
What Session-Scoped Containers Provide
The shell tool in the Responses API eliminates the marshaling layer and the sandbox lifecycle management. You declare the tool in your request:
```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4o",
    input="Analyze sales.csv, compute monthly totals, and produce a summary report",
    tools=[{"type": "shell"}],
)
```
The model emits shell commands as tool calls. OpenAI executes them in a hosted Linux container, returns stdout and stderr as tool results, and the model continues from there. Critically, the container is not destroyed between tool calls. It persists across the entire session. The pandas installation the model runs in turn one is available when the analysis script runs in turn three. Files written to disk are readable by subsequent commands without any developer-managed state transfer.
The multi-step execution looks like this from the inside:
```
model: install dependencies
  -> shell: pip install pandas matplotlib
  -> result: Successfully installed pandas 2.2.1 matplotlib 3.8.3
model: load and describe the data
  -> shell: python3 -c "import pandas as pd; df = pd.read_csv('sales.csv'); print(df.describe())"
  -> result: [summary statistics table]
model: generate chart and save it
  -> shell: python3 generate_chart.py
  -> result: chart.png written
model: assemble report
  -> shell: python3 build_report.py
  -> result: report.html written
model: read and return the report
  -> shell: cat report.html
  -> result: [report content]
```
Each command sees the state left by all previous commands. No developer code manages the transitions between steps, handles container handles, or polls a sandbox status endpoint.
How Other Providers Handle This
Google’s Gemini API has offered a code execution tool since mid-2024. The model generates Python code, Google runs it in a hosted sandbox, and the output returns inline in the response. The constraint is significant: Python only, no arbitrary shell access, and limited file persistence across calls by default. It covers calculation and data analysis in pure Python competently, but tasks that require package management, compiled tools, shell scripting, or chaining multiple language runtimes are outside its scope.
Anthropic does not offer a hosted execution environment. The computer use capability in Claude 3.5 Sonnet and later models operates through discrete action objects: the model emits screenshot, click, type, and key commands, and the developer provides the graphical environment and returns screenshots. The compute side is entirely developer-managed. Anthropic’s documentation explicitly recommends Docker containers or VM-based setups, with E2B as a named option for sandboxed execution.
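The developer-managed side can be sketched as a dispatcher over those action objects. The action names ("screenshot", "left_click", "type") follow Anthropic's documented computer-use schema; StubEnv is a placeholder for whatever graphical environment the developer actually provides:

```python
class StubEnv:
    """Placeholder for the developer-provided graphical environment."""
    def __init__(self):
        self.log = []

    def screenshot(self):
        return "<png bytes>"  # a real env captures the display

    def click(self, x, y):
        self.log.append(("click", x, y))

    def type_text(self, text):
        self.log.append(("type", text))

def dispatch(action, env):
    """Route one model-emitted action object to the environment and
    return a tool result for the next model turn."""
    kind = action["action"]
    if kind == "screenshot":
        return {"type": "image", "data": env.screenshot()}
    if kind == "left_click":
        env.click(*action["coordinate"])
        return {"type": "text", "text": "clicked"}
    if kind == "type":
        env.type_text(action["text"])
        return {"type": "text", "text": "typed"}
    raise ValueError(f"unhandled action: {kind}")

env = StubEnv()
dispatch({"action": "left_click", "coordinate": [100, 200]}, env)
result = dispatch({"action": "screenshot"}, env)
```

The model supplies the decisions; every line of this loop, plus the environment behind StubEnv, is the developer's infrastructure.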
The comparative picture across providers: Google offers hosted execution constrained to Python; Anthropic provides model capability without hosted compute; OpenAI now provides both, with a general-purpose shell rather than a language-specific runtime. This is a meaningful product differentiation, and it puts OpenAI in direct competition with the sandboxed execution providers whose value proposition is handling exactly this problem.
Where the Model Breaks Down
Session-scoped persistence solves the within-session state problem. It does not solve the across-session problem, and it is not a replacement for application infrastructure.
The container is destroyed when the session ends. Agent workflows that span multiple user sessions, produce results requiring durable storage, or need to write to a database or emit events to an external system still require that layer to be built and exposed through function tools. The hosted container is an ephemeral workspace.
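That durable layer is exposed through ordinary function tools declared alongside the shell tool. The save_artifact tool below is hypothetical, its name and schema invented for illustration; it shows the shape of the handoff, not a real API surface:

```python
# Hypothetical function tool declared next to the hosted shell tool so the
# model can hand container artifacts to durable storage the developer owns.
tools = [
    {"type": "shell"},
    {
        "type": "function",
        "name": "save_artifact",  # illustrative name, not a built-in tool
        "description": "Copy a file from the session container to durable storage",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Path inside the container"},
                "destination": {"type": "string", "description": "Durable storage key"},
            },
            "required": ["path", "destination"],
        },
    },
]
```

When the model calls save_artifact at the end of a session, the developer's handler does the actual upload; the ephemeral workspace and the durable store stay cleanly separated.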
The more significant constraint for production systems is network topology. The container runs in OpenAI’s infrastructure. Connecting it to internal services, private databases, or a VPC requires those services to be reachable externally. Organizations with data residency requirements or strict network security policies will continue managing their own sandboxes, where E2B’s custom base images and Modal’s function environments offer more control: you can pre-install domain-specific dependencies, mount internal paths, or route through a private network. The hosted shell cannot do any of that.
For builders working at the prototype level or building agents against public data and APIs, the hosted container removes a meaningful infrastructure burden. For production systems with internal data access, compliance requirements, or custom execution environments, it raises the floor without replacing the need for custom sandboxing.
The underlying shift is still significant. The Responses API began as a cleaner interface over a stateless model completion primitive. With session-scoped containers and a shell tool, it has become a runtime: a co-located environment where both model reasoning and artifact production happen under a single session boundary. Context window state and compute state are now managed by the same provider, which simplifies the default architecture considerably. Whether that simplification is appropriate for a given system depends on where the data lives and how much of the infrastructure stack actually needs to be owned.