· 6 min read ·

The Execution Environment Is Now the API: Inside OpenAI's Hosted Agent Runtime

Source: openai

The most tedious part of building an AI agent has never been the model call. It is everything around it: spinning up a container for code execution, threading file handles between turns, managing context so prior tool outputs do not overflow the window, writing retry logic for flaky subprocess calls, deciding where state lives when the session ends. Most of the code you write is infrastructure that has nothing to do with what the agent is actually supposed to accomplish.

OpenAI’s writeup on equipping the Responses API with a computer environment, published on March 11, 2026, describes how they have pushed that scaffolding into the platform itself. The shell tool, hosted containers, and session-level state management are not convenience wrappers; they represent a structural change in where the execution layer lives. A week out from that announcement, it is worth looking at what the architecture actually implies.

From Function Calling to Hosted Execution

The Chat Completions API has supported function calling since mid-2023. The pattern is well understood: the model returns a tool_calls array, your application code dispatches those calls to whatever handlers you have wired up, and the results come back in the next request. The model is stateless. Your application holds everything.

The Responses API inverts part of that. Built-in tools like web_search, file_search, code_interpreter, and now shell execute on OpenAI’s infrastructure rather than in your process. You do not receive a specification to execute; you receive the result of execution. Chaining turns uses previous_response_id rather than reconstructing full message histories.

from openai import OpenAI

client = OpenAI()

# First turn: set up an environment
response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "shell"}],
    input="Create a virtual environment and install the httpx library"
)

# Second turn: the container from turn one is still alive
followup = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "shell"}],
    previous_response_id=response.id,
    input="Write a script using httpx to fetch https://httpbin.org/get and print the response status"
)

The virtual environment created in the first call persists into the second. The file written there can be read, modified, or executed. This is the core design decision: treat the container as session state rather than as disposable compute.

How the Container Layer Works

What OpenAI is shipping here closely resembles what E2B has offered as a standalone service for the past couple of years: sandboxed Linux environments purpose-built for AI agent execution. E2B’s premise was that agent-suitable sandboxes require different defaults than general-purpose containers, with tighter network controls, fast cold start, and APIs shaped around the read-file, write-file, run-command loop that agents actually need.

The difference is integration depth. With E2B or a similar tool, you manage the sandbox explicitly in your application code. You decide when to start it, when to kill it, how to pass outputs back to the model, and how to handle failures. With the Responses API shell tool, that management is implicit; it happens inside the API call.

# E2B: you manage the sandbox lifecycle explicitly
from e2b_code_interpreter import Sandbox

with Sandbox() as sandbox:
    result = sandbox.run_code(
        "import httpx; r = httpx.get('https://httpbin.org/get'); print(r.status_code)"
    )
    # Pass result back to your model call manually
    tool_result = result.text

# Responses API: execution is inside the API boundary
response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "shell"}],
    input="Use httpx to fetch https://httpbin.org/get and print the status code"
)

Simpler to write, and significantly more opaque. The container specification, resource limits, and network policy are controlled by OpenAI. You can observe what commands ran through the response output, but you cannot inspect the container state directly or configure the sandbox beyond what the API surface exposes.

For most application builders, that tradeoff is acceptable. For anyone running agents against internal systems or sensitive data, the opacity warrants real scrutiny before production adoption.

State Management and Context Accounting

The Assistants API, which predates the Responses API by about a year, also offered stateful threads and code execution. Its design was more heavyweight: Assistants, Threads, Runs, and Messages as distinct objects with their own lifecycle management. Developers who needed stateful agents but found the Assistants abstractions too rigid often built their own state management on top of Chat Completions instead.

The Responses API simplifies this to a single pointer. previous_response_id tells the API which prior response to continue from. The platform handles context window accumulation, including tool outputs and intermediate model reasoning, rather than requiring your code to serialize and resend everything. For a long-running agent that runs fifty shell commands across twenty turns, this matters: the request payload stays small, and you cannot accidentally truncate or reorder the history.

The portability tradeoff that plagued the Assistants API is still present here. Session state tied to a response ID on OpenAI’s servers cannot be migrated to Anthropic’s API or a local model without reconstruction. If you later want to replay the session, run it against a different model, or audit the exact sequence of tool calls, you are dependent on whatever inspection surface OpenAI provides.

Sandboxing at Scale

Shell access in a hosted environment raises a standard set of container security questions. The sandbox needs to prevent arbitrary outbound network access that could exfiltrate data or reach internal services, filesystem access beyond the session’s allocated store, resource exhaustion that affects other tenants, and container escape. These are well-studied problems in the runtime security space.

The approaches that have proven most effective involve VM-level isolation rather than relying solely on namespace and cgroup boundaries. gVisor, Google’s user-space kernel that intercepts syscalls before they reach the host, powers the sandboxing in Cloud Run and several other Google services. Firecracker, developed by AWS for Lambda and Fargate, uses lightweight microVMs with millisecond startup times. Either approach provides substantially stronger isolation than a standard container runtime.

OpenAI has not published specifics about their isolation stack for hosted tools. Given the security requirements for a multi-tenant shell execution service, VM-level isolation or something equivalent is the expected baseline, but the details are worth watching for as they release more documentation.

What Agent Frameworks Are Left Doing

Libraries like LangChain, CrewAI, and AutoGen built substantial infrastructure around tool dispatch, agent loop management, state persistence, and memory systems precisely because the underlying model APIs did not provide those things. If the Responses API absorbs execution, session state, and tool routing, the value proposition of a full-stack agent framework narrows.

What remains is the layer above execution: workflow orchestration across multiple agents with different capabilities, evaluation infrastructure to know when agents are failing, observability into what decisions the agent made and why, and the domain-specific logic that defines what the agent actually does. These are not small problems, but they are different problems than managing context windows and subprocess lifecycles by hand.

This is roughly what happened with serverless compute. AWS Lambda did not eliminate infrastructure engineering; it moved the complexity up the stack. Teams that had deep expertise in server provisioning found the skill less differentiated. Teams that had domain expertise built faster because the platform absorbed the undifferentiated parts.

The parallel to agent development seems fairly direct. Deep knowledge of how to hand-roll a reliable agent loop is becoming less of a differentiator as platforms absorb those concerns. The work moves toward higher-order questions: what should the agent do, how do you know when it has done it correctly, and how do you intervene when it has not.

The Architectural Bet

OpenAI’s Responses API stack now covers model, built-in tools, a hosted execution environment, and an Agents SDK that wires these together into a Python abstraction with multi-agent handoffs. Each layer reduces the work required at the layer above it. For teams building straightforward agents, the result is a genuine reduction in scaffolding code.

The constraint is that the stack is tightly coupled. The execution environment, session state, and model are all managed by the same provider, with the same pricing, the same rate limits, and the same terms of service. Developers who need control over the execution environment for compliance reasons, or who want to run the same agent against multiple model providers, or who need their state to be portable across infrastructure changes, will find this stack more constraining than composable alternatives built on explicit sandboxing tools and provider-agnostic frameworks.

That tension is inherent to managed platforms. The tradeoff is real and the right answer depends on what you are actually building. For rapid development and straightforward use cases, the hosted runtime is a meaningful improvement. For production systems with specific security, portability, or infrastructure requirements, the composable approach built on something like E2B plus a provider-agnostic agent framework gives you more control at the cost of more code.

Both paths will have users for a long time.

Was this interesting?