What OpenAI's Hosted Containers Add to the Agent Equation

The model has never been the hard part of building an agent. Intelligence is straightforward to integrate once you have an API endpoint. What consistently proved difficult is the execution environment: where does the agent run commands, how does it maintain file state between tool calls, and how do you isolate that execution safely enough to deploy without building and operating your own container infrastructure?

An article OpenAI published earlier this month addresses this directly. It describes how the company combined the Responses API with a shell tool and hosted containers to create a complete agent runtime, where the model, execution environment, and state management all live under a single API surface. It’s worth unpacking what each layer does and what the combination changes for developers building agents.

The Responses API Foundation

The Responses API (POST /v1/responses) was introduced as OpenAI’s stateful alternative to Chat Completions. The key difference is server-side state management: you pass previous_response_id in your request, and OpenAI maintains the conversation context on their end. For an agent that takes many sequential actions, this eliminates the need to reconstruct and re-send the entire conversation history on every step.
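
To make the difference concrete, here is a minimal sketch of the two request shapes an agent loop would build. It only constructs plain dictionaries and calls no API. The helper names `stateless_request` and `chained_request` and the placeholder response ID are hypothetical; the only field taken from the description above is `previous_response_id`.

```python
from typing import Any, Optional

Message = dict[str, str]


def stateless_request(model: str, history: list[Message], new_message: Message) -> dict[str, Any]:
    """Chat Completions style: the client re-sends the full history on every step."""
    return {"model": model, "messages": [*history, new_message]}


def chained_request(model: str, previous_response_id: Optional[str], new_message: Message) -> dict[str, Any]:
    """Responses API style: only the new input plus a pointer to server-held state."""
    payload: dict[str, Any] = {"model": model, "input": [new_message]}
    if previous_response_id is not None:
        payload["previous_response_id"] = previous_response_id
    return payload


# Simulate ten completed agent steps, then build the request for the next one.
history: list[Message] = []
for step in range(10):
    history.append({"role": "user", "content": f"step {step}"})
    history.append({"role": "assistant", "content": f"result {step}"})

next_message = {"role": "user", "content": "step 10"}
stateless = stateless_request("gpt-4.1", history, next_message)
chained = chained_request("gpt-4.1", "resp_placeholder", next_message)

print(len(stateless["messages"]))  # 21: grows with every step
print(len(chained["input"]))       # 1: constant regardless of history length
```

The stateless payload grows linearly with the number of steps, while the chained payload stays the same size because the history lives on the server.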

The API also introduced a different model for built-in tools. Instead of returning every tool call to the client for execution, the Responses API supports hosted tools that execute entirely server-side. The model calls the tool, the infrastructure runs it, and the result feeds back into the next model turn, all without a round-trip to your application code.

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1",
    tools=[
        {"type": "web_search_preview"},
        {"type": "shell"}  # file writes and edits go through shell commands
    ],
    input=[
        {"role": "user", "content": "Fetch the latest pandas changelog and write a summary to /tmp/summary.md"}
    ]
)

This request authorizes the model to search the web, run shell commands, and edit files, all within a single API call. The infrastructure handles the execution loop.

The Shell Tool and Hosted Containers

The shell tool is where the hosted container architecture becomes concrete. When you include {"type": "shell"} in your tools array, OpenAI provisions an isolated Linux container for the session. The container comes with standard tooling: Python, Node.js, curl, git, and common package managers. The model can execute arbitrary shell commands inside this environment.

State persists across tool calls within the session. A file created in one shell invocation is available to the next; a Python package installed mid-session stays installed. The distinction from Code Interpreter matters: Code Interpreter is a Python-only sandbox that is intentionally limited, while the shell tool gives the model a general-purpose Linux environment.

The container lifecycle is managed through a container_id field in the API response. To continue working in the same environment across multiple API calls, you reference this ID in subsequent requests:

# First call provisions a new container
response_1 = client.responses.create(
    model="gpt-4.1",
    tools=[{"type": "shell"}],
    input=[{"role": "user", "content": "Clone the repo and run the test suite"}]
)

container_id = response_1.container_id

# Second call reuses the same container and its filesystem state
response_2 = client.responses.create(
    model="gpt-4.1",
    tools=[{"type": "shell"}],
    input=[{"role": "user", "content": "Fix the failing tests and run them again"}],
    container_id=container_id
)

This is a clean solution to a problem that every agent framework handles differently. LangChain does it with custom memory stores and tool implementations. Earlier agentic systems maintained execution state in local files or external databases. Here it’s a single field in the API response, and the infrastructure keeps the container warm as long as you’re referencing it.

Computer Use in the Hosted Environment

The computer_use_preview tool extends the hosted environment with a virtual display. The model can take screenshots of the container’s desktop or browser, issue cursor and keyboard actions, and observe the results. For tasks that require interacting with a GUI, a web application, or a tool that lacks a programmatic API, this is the most general approach, because it doesn’t require you to hard-code every possible interaction.

The hosted execution model changes the operational picture considerably. Before hosted containers, using computer_use_preview required the client application to manage its own browser or virtual machine, feed screenshots as image inputs, receive action JSON from the model, execute those actions locally, and repeat that cycle until the task was done. That’s substantial scaffolding, especially for anything running at scale.
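
The client-side loop described above has roughly the following shape. This is a sketch built on stand-ins, not the real protocol: `FakeBrowser`, `scripted_model`, and the action dictionaries are hypothetical placeholders for a real browser driver, a real model call, and the actual action schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Action = dict  # hypothetical action shape, e.g. {"type": "click", "x": 10, "y": 20}


@dataclass
class FakeBrowser:
    """Stand-in for the browser or VM the client would have to operate."""
    log: list = field(default_factory=list)

    def screenshot(self) -> bytes:
        # A real implementation would return image bytes of the display.
        return f"screen-after-{len(self.log)}-actions".encode()

    def execute(self, action: Action) -> None:
        self.log.append(action)


def run_client_side_loop(
    browser: FakeBrowser,
    next_action: Callable[[bytes], Optional[Action]],
    max_steps: int = 20,
) -> int:
    """Screenshot -> model -> action -> execute, until the model stops or the budget runs out."""
    for step in range(max_steps):
        action = next_action(browser.screenshot())
        if action is None:  # the model signals that the task is finished
            return step
        browser.execute(action)
    return max_steps


def scripted_model(script: list) -> Callable[[bytes], Optional[Action]]:
    """Stand-in for the model call: replays a fixed list of actions, then stops."""
    remaining = list(script)
    return lambda _screenshot: remaining.pop(0) if remaining else None


browser = FakeBrowser()
steps = run_client_side_loop(
    browser,
    scripted_model([
        {"type": "click", "x": 120, "y": 48},
        {"type": "type", "text": "dashboard"},
    ]),
)
print(steps, browser.log)
```

Even in this toy form, the client owns the display, the step budget, and the termination condition, which is the scaffolding the paragraph above refers to.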

With hosted containers, that loop runs server-side. The client sends a request and waits for a result. The model takes screenshots, clicks, types, and reads the screen through infrastructure OpenAI manages; the client doesn’t need to run anything locally.

response = client.responses.create(
    model="computer-use-preview",
    tools=[
        {
            "type": "computer_use_preview",
            "display_width": 1024,
            "display_height": 768,
            "environment": "browser"
        },
        {"type": "shell"}
    ],
    input=[
        {"role": "user", "content": "Go to the project dashboard, take a screenshot, and save it to /tmp/dashboard.png"}
    ]
)

How This Compares to the Alternatives

Developers were solving this problem before this announcement, using services like E2B and Modal.

E2B is the closest prior art. It provides developer-provisioned sandboxes based on Firecracker microVMs, with explicit SDK calls to manage container lifecycle and tight integration with LLM frameworks like LangChain and LlamaIndex. A sandbox costs roughly $0.05 per hour of compute, provides strong isolation, supports custom Docker templates, and offers predictable lifecycle management. The tradeoff is that you manage provisioning explicitly: you call Sandbox.create(), track the sandbox ID across your application’s state, and call sandbox.close() when done.

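# E2B approach: you own the lifecycle
from e2b_code_interpreter import Sandbox

# Python source that runs inside the sandbox, not in your own process
test_code = """
import subprocess
proc = subprocess.run(['pytest', '--tb=short'], capture_output=True, text=True)
print(proc.stdout)
"""
sandbox = Sandbox()                    # provision explicitly
result = sandbox.run_code(test_code)   # execute remotely
sandbox.close()                        # tear down when done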
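
Owning the lifecycle also means owning cleanup when something fails. One common pattern is to wrap the sandbox in a context manager so it is released even if an agent step raises. The sketch below uses a hypothetical `FakeSandbox` stand-in rather than the E2B SDK, so it runs without credentials; its `close()` method simply mirrors the call shown above.

```python
from contextlib import contextmanager
from typing import Iterator


class FakeSandbox:
    """Hypothetical stand-in for a provisioned sandbox; not the E2B class."""

    def __init__(self) -> None:
        self.closed = False

    def run_code(self, code: str) -> str:
        if self.closed:
            raise RuntimeError("sandbox already closed")
        return f"ran {len(code)} characters of code"

    def close(self) -> None:
        self.closed = True


@contextmanager
def managed_sandbox(factory=FakeSandbox) -> Iterator[FakeSandbox]:
    """Create a sandbox and guarantee it is closed, even when the body raises."""
    sandbox = factory()
    try:
        yield sandbox
    finally:
        sandbox.close()


with managed_sandbox() as sandbox:
    print(sandbox.run_code("print('hello')"))

print(sandbox.closed)  # True: released once the block exits
```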

Modal is a broader compute platform, useful for agent pipelines that need GPU access or complex multi-step workflows. It’s more powerful but requires more setup: function decorators, deployment steps, volume configuration for persistent state. It’s general-purpose compute repurposed for agents rather than agent tooling built from the start with the model loop in mind.

OpenAI’s approach trades configurability for simplicity. You get a managed environment with no provisioning code and tight integration with the model loop, but you cannot customize the base image, you have less visibility into container internals, and your code and data run on OpenAI’s infrastructure. For many use cases that’s a worthwhile tradeoff. For cases where you need air-gapped execution, custom tooling installed at image build time, or specific data residency guarantees, owning the execution layer remains the right call.

The Security Model

The security argument for hosted containers is real, though bounded. Each container is isolated per session and per customer. OpenAI handles resource limits, network restrictions, and process isolation. An agent with a bug or an unexpected prompt can’t easily affect other workloads or escape into the broader infrastructure.

What it doesn’t provide is protection against the agent doing things you didn’t intend within its own sandbox. A model with shell access in a container that has outbound internet access can make arbitrary HTTP requests, read any file it encounters in the working directory, or install and run additional processes. The container boundary limits lateral movement; it does not constrain what the agent does within its own scope.

The hosted-and-sandboxed framing can create a false sense of safety. The container is one layer of a defense-in-depth approach; it doesn’t replace careful prompt design, output filtering, and thoughtful decisions about which tools you expose to which agents. If you’re giving an agent shell access to a container that can reach your internal network, the hosted isolation layer is not the control you should be relying on.
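
One way to make the tool-exposure decision explicit is a per-agent allowlist applied before a request is built. The sketch below is an illustration only: the agent roles and the `tools_for` helper are hypothetical, and the tool type strings are the ones used earlier in this article.

```python
# Hypothetical mapping from agent role to the tool types it may use.
ALLOWED_TOOLS = {
    "researcher": {"web_search_preview"},
    "coder": {"shell"},
    "browser_operator": {"computer_use_preview"},
}


def tools_for(role: str, requested: list) -> list:
    """Return tool definitions for a role, refusing anything not on its allowlist."""
    allowed = ALLOWED_TOOLS.get(role, set())  # unknown roles get no tools
    denied = [tool["type"] for tool in requested if tool["type"] not in allowed]
    if denied:
        raise PermissionError(f"role {role!r} may not use: {', '.join(denied)}")
    return list(requested)


print(tools_for("coder", [{"type": "shell"}]))

try:
    tools_for("researcher", [{"type": "web_search_preview"}, {"type": "shell"}])
except PermissionError as exc:
    print(exc)
```

Failing closed here means a misconfigured agent gets no shell at all, rather than a sandboxed shell whose isolation is then the only remaining control.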

What the Consolidation Changes

The practical impact is that the barrier to building an agent with real execution capabilities has dropped substantially. Previously you needed to run your own container infrastructure, integrate a third-party sandbox service, or accept the limitations of Code Interpreter. Now those capabilities are built into the same API layer used to call the model.

That consolidation has genuine value: it reduces the operational surface, standardizes the execution model, and means your compute billing, state management, and model calls share a single integration. For prototyping and for production use cases where the OpenAI infrastructure constraints are acceptable, this is a meaningful simplification over maintaining a separate E2B or Modal integration alongside your OpenAI calls.

The remaining question is whether this approach scales to complex, long-running workflows. A single-session container is a reasonable unit for bounded tasks, such as a coding assistant fixing a bug or a data pipeline processing a batch of files. For agents that need to run for hours, branch into parallel execution paths, or coordinate across multiple models, the architecture feels constrained. That’s not a criticism of the current design so much as an observation about where the next layer of the problem lives, and it suggests that purpose-built execution platforms won’t be obsolete any time soon.