From Chat to Compute: What OpenAI's Hosted Agent Containers Actually Change
Source: openai
The most interesting thing about OpenAI’s latest Responses API update is not the shell tool itself. It is what the shell tool implies: OpenAI is no longer building an API that models call; they are building infrastructure that models run inside.
That distinction matters a lot for anyone who has spent time wiring up agent pipelines.
The Lineage Matters Here
To understand what is actually new, you need to trace the lineage. The Chat Completions API is fundamentally stateless. You ship the entire conversation history in every request and get a response. State is your problem. Scaling is your problem. Tool execution is your problem.
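The client-side bookkeeping that implies can be sketched in a few lines. No API call is made here; the sketch only assembles the payload that `client.chat.completions.create(...)` would receive, to show that every turn re-ships the whole transcript:

```python
# Minimal sketch of the stateless pattern: the client keeps the transcript
# and re-sends all of it on every request. State is your problem.
history = []

def build_request(user_message: str) -> dict:
    history.append({"role": "user", "content": user_message})
    # Every request carries the entire conversation so far.
    return {"model": "gpt-4o", "messages": list(history)}

first = build_request("Summarize this log file")
history.append({"role": "assistant", "content": "Here is a summary..."})
second = build_request("Now extract the error lines")
# `second` re-sends both earlier messages plus the new one: three in total.
```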
The Assistants API, launched in late 2023, was OpenAI’s first attempt at managing some of that complexity. It introduced threads (persistent conversation state), runs (execution units you poll for completion), and built-in tools: code interpreter, file search, function calling. The workflow was: create an assistant, attach files, start a thread, kick off a run, poll until done. Developers criticized it for being opaque, slow to stream, and awkward to debug. The run lifecycle required polling, tool execution was invisible, and the threading model added more state to manage, not less.
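That polling lifecycle can be sketched generically. Here `fetch_status` stands in for a call like `client.beta.threads.runs.retrieve(...).status`; the commented wiring is an assumption about shape, not a complete example:

```python
import time

def poll_run(fetch_status, interval: float = 0.0, timeout: float = 5.0) -> str:
    """Poll until the run reaches a terminal state -- the Assistants pattern."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = fetch_status()
        if status in ("completed", "failed", "cancelled", "expired"):
            return status
        time.sleep(interval)
    raise TimeoutError("run did not finish in time")

# With the real SDK the fetcher would be something like (not run here):
# run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=a.id)
# poll_run(lambda: client.beta.threads.runs.retrieve(
#     thread_id=thread.id, run_id=run.id).status)

# Canned status sequence standing in for the API:
statuses = iter(["queued", "in_progress", "completed"])
final = poll_run(lambda: next(statuses))  # -> "completed"
```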
The Responses API, released in March 2025, was positioned as the successor. It simplified state management through previous_response_id chaining, where each response can reference the prior one and the API reconstructs context server-side. Streaming got cleaner. Tool use became more transparent, with each tool call and result surfaced as discrete events in the stream. The Responses API was a clear improvement in ergonomics.
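The chaining pattern, sketched as payload construction only (no calls are made; the real call would be `client.responses.create(**payload)`, and the response id value here is invented for illustration):

```python
# previous_response_id chaining: each turn references the prior response's id
# and the API reconstructs the conversation context server-side.
def first_turn(user_input: str) -> dict:
    return {"model": "gpt-4o", "input": user_input}

def next_turn(prev_id: str, user_input: str) -> dict:
    return {
        "model": "gpt-4o",
        "input": user_input,
        "previous_response_id": prev_id,  # server rebuilds context from this
    }

turn1 = first_turn("Write a merge sort in Python")
turn2 = next_turn("resp_abc123", "Now add unit tests")
```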
But it still assumed you were handling compute yourself. If the model decided to run code, that code ran somewhere you controlled.
The computer environment update changes that assumption.
What “Computer Environment” Actually Means
The architecture is cleaner than it sounds. When you enable the shell tool in a Responses API call, the model can emit shell commands as tool calls. OpenAI executes those commands inside a hosted, isolated container and returns the output as a tool result. The model then decides what to do next: read more files, run another command, call a different tool, or produce a final response.
The container persists for the duration of the session. Files written in one tool call are available in the next. You can install packages, compile binaries, manipulate a file tree, run a web server and curl it, or chain shell scripts across multiple model turns. The state lives inside OpenAI’s infrastructure, not yours.
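A local stand-in makes that property concrete: run successive commands in one shared working directory and state carries across them, which is what the hosted container does for the model's tool calls. This is a toy simulation, not OpenAI's implementation:

```python
import subprocess
import tempfile

# Toy stand-in for the hosted container: successive "tool calls" share one
# working directory, so files written by one command are visible to the next.
workdir = tempfile.mkdtemp()

def run_shell(cmd: str) -> str:
    out = subprocess.run(cmd, shell=True, cwd=workdir,
                         capture_output=True, text=True)
    return out.stdout + out.stderr

run_shell("echo data > notes.txt")        # turn 1: write a file
contents = run_shell("cat notes.txt")     # turn 2: the file persists
```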
This is the meaningful part. Previous approaches to giving models shell access required you to manage the sandbox yourself. You would spin up a container, expose a code execution endpoint, handle stdin/stdout marshaling, enforce resource limits, tear the container down after use, and deal with cleanup when things went sideways. Libraries like E2B exist specifically to solve this problem, offering sandboxed cloud environments with an SDK for connecting them to LLM tool calls. Modal solves a related problem from the serverless side. Daytona and similar tools target the development environment angle.
OpenAI has absorbed that problem into the API.
Security and the Isolation Question
Running arbitrary shell commands from a language model in a shared cloud environment requires careful isolation. The exact implementation details are not public, but the shape is familiar. Modern container isolation at scale typically combines Linux namespaces and cgroups with a hypervisor layer, whether that is gVisor (used by Google Cloud Run), Firecracker (used by AWS Lambda), or equivalent. The goal is the same: limit syscall surface, restrict network egress, enforce memory and CPU ceilings, and ensure that nothing escaping one container can touch another.
For most agent use cases, the critical constraints are network access and persistence. Containers in this model are session-scoped. When the session ends, the container is destroyed. There is no long-running server that accumulates state across requests. That is the right default for security; it also means you cannot use this as a replacement for actual application infrastructure. If your agent needs to maintain a database, call internal services, or store results in durable storage, you still need to build that layer yourself and expose it through function tools.
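One plausible shape for that split is pairing the hosted shell tool with a function tool that fronts your own storage. This assumes the Responses API's flat function-tool schema, and `save_result` is a hypothetical tool name you would implement yourself:

```python
# Ephemeral compute from the hosted container, durable storage via your own
# function tool. `save_result` is hypothetical -- you implement and host it.
tools = [
    {"type": "shell"},  # hosted, session-scoped, destroyed at session end
    {
        "type": "function",
        "name": "save_result",
        "description": "Persist a result to application-owned storage",
        "parameters": {
            "type": "object",
            "properties": {
                "key": {"type": "string"},
                "value": {"type": "string"},
            },
            "required": ["key", "value"],
        },
    },
]
```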
What the hosted environment does well is the ephemeral compute case: run a script, process a file, do a calculation, explore a directory structure, generate code and test it. These are patterns that come up constantly in agent workflows and are painful to sandbox safely on your own.
The DIY Comparison
Here is a concrete illustration of the difference. If you are building an agent that can, say, write and test code, the old approach looked roughly like this:
```python
import e2b
from openai import OpenAI

# You own the sandbox lifecycle: create it, marshal code in and out,
# and tear it down when the session ends.
sbx = e2b.Sandbox()

def run_code(code: str) -> str:
    result = sbx.run_code(code)
    return result.stdout or result.stderr

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write and test a merge sort in Python"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "run_code",
            "description": "Execute Python code",
            "parameters": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            },
        },
    }],
    tool_choice="auto",
)
# Handle tool call, dispatch to sbx, loop
```
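That final comment hides real work. A rough version of the handling loop follows; `dispatch_tool_calls` is a helper name of our own, the stub object stands in for `response.choices[0].message`, and the driver at the bottom is left as comments because it needs a live client:

```python
import json
from types import SimpleNamespace

def dispatch_tool_calls(message, registry):
    """Execute each tool call the model requested and build the tool-result
    messages to append before the next completion request."""
    results = []
    for call in message.tool_calls or []:
        fn = registry[call.function.name]           # look up the local function
        args = json.loads(call.function.arguments)  # arguments arrive as JSON text
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": fn(**args),
        })
    return results

# Stubbed message standing in for response.choices[0].message:
stub = SimpleNamespace(tool_calls=[SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="run_code",
                             arguments='{"code": "print(1)"}'))])
tool_msgs = dispatch_tool_calls(stub, {"run_code": lambda code: "1\n"})

# Real driver sketch (requires the live client, messages, tools from above):
# while True:
#     msg = response.choices[0].message
#     if not msg.tool_calls:
#         break
#     messages.append(msg)
#     messages.extend(dispatch_tool_calls(msg, {"run_code": run_code}))
#     response = client.chat.completions.create(
#         model="gpt-4o", messages=messages, tools=tools)
```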
With the Responses API and shell tool, the sandbox is implicit. You declare the tool, and OpenAI handles execution:
```python
from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="gpt-4o",
    input="Write and test a merge sort in Python",
    tools=[{"type": "shell"}],
)
```
The model handles the loop. Shell commands go out, results come back, the agent decides its next step. You never manage a container handle or poll a sandbox status.
The simplification is genuine for the cases it covers. The trade-off is that you are now constrained to the environment OpenAI has provisioned. You cannot bring your own base image, pre-install dependencies for your specific domain, or mount internal filesystem paths. If you need any of that, E2B and Modal still have the advantage.
What This Changes for Builders
For developers building at the prototype or product MVP level, this removes a significant infrastructure concern from the stack. The pattern of “model decides what to do, runs code to do it, sees the result, decides what to do next” is the core loop of most practical agents. OpenAI now owns that loop end to end.
For builders who need more control, the story is different. The hosted container is a black box relative to your infrastructure. You cannot connect it to a private database, route traffic through your VPC, or attach it to internal tooling without exposing those services externally and accepting the associated risk. The model also needs network access to call those services, which means they need to be reachable from OpenAI’s infrastructure, not just your private network.
The architecture question is not whether OpenAI’s hosted compute is good; it is whether the convenience is worth the coupling. Organizations that are building agent systems with non-trivial security requirements, internal data access, or operational SLAs will likely continue managing their own sandboxes. The E2B and Modal use cases do not disappear.
What does shift is the baseline. Starting from zero, you no longer need to think about containerization to build a capable coding agent. That lowers the floor meaningfully. It also means that OpenAI is now in a different competitive position relative to companies like E2B whose primary value proposition is sandboxed execution for AI systems.
The Responses API began as a cleaner interface for the same thing the Chat Completions API provided. With hosted containers and the shell tool, it has become something qualitatively different: a runtime, not just an endpoint. Whether that runtime fits your architecture depends on where your state lives and how much infrastructure you are willing to hand over.