The Agent Runtime OpenAI Embedded in the Responses API

When OpenAI introduced the Responses API in early 2025 alongside the Agents SDK, the framing was deliberate: Chat Completions was designed for conversations; the Responses API was designed for agents. The architectural differences were real, but for anything beyond the built-in tools the API still depended on the developer to execute tool calls. The model could decide to call a function; you had to run it.

The March 11, 2026 announcement closes that loop. With hosted containers, a shell tool, and server-managed file state, the Responses API now provides not just an interface to the model but an environment for the agent to operate in. This is a meaningful shift in what an API boundary means for agentic workloads.

What Changed in the Responses API

The Responses API already had built-in tool execution for web search, code interpreter, and file search. In those cases, OpenAI handles tool execution on its infrastructure, the model receives results, and the developer writes no tool handler. The computer environment extends that pattern to the shell itself.

A session now has:

A hosted Linux container with persistent state across turns
A file system scoped to the session
A shell tool the model can invoke to run arbitrary commands
Standard utilities and the ability to install packages mid-session

The session ID is the persistence anchor. State is tied to it: files written in turn one are still there in turn three. The model can write a script, execute it, read the output, and iterate, without the developer orchestrating any of that.

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "shell"}],
    input=[{
        "role": "user",
        "content": (
            "Write a Python script that parses nginx access logs "
            "and summarizes the top 10 IPs by request count. "
            "Save it as analyze.py, then run it against /var/log/sample.log."
        )
    }]
)

print(response.output_text)

The model generates the script, invokes the shell tool to write it to disk, runs it, and returns the output, all within a single API round-trip from the caller’s perspective. No tool dispatch loop on your side.
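
To make the persistence property concrete, here is a minimal local sketch of what "state survives across turns" means. It is not how the hosted container is implemented: LocalSession is a hypothetical class that stands in for a session by pinning every command to one temporary directory, and it assumes a POSIX shell is available.

```python
import shlex
import subprocess
import sys
import tempfile
from pathlib import Path


class LocalSession:
    """Hypothetical local stand-in for a hosted session: one directory, many turns."""

    def __init__(self) -> None:
        self._tmp = tempfile.TemporaryDirectory()
        self.workdir = Path(self._tmp.name)

    def run(self, command: str) -> str:
        """Run one shell command inside the session directory and return its stdout."""
        completed = subprocess.run(
            command,
            shell=True,
            cwd=self.workdir,
            capture_output=True,
            text=True,
            timeout=30,
        )
        return completed.stdout

    def close(self) -> None:
        """Discard the session directory and everything written to it."""
        self._tmp.cleanup()


session = LocalSession()
# Turn 1: write a script to disk.
session.run("echo 'print(6 * 7)' > calc.py")
# Turn 2: a later command still sees the file.
listing = session.run("ls")
# Turn 3: execute the script written two turns earlier.
output = session.run(f"{shlex.quote(sys.executable)} calc.py")
session.close()
```

Each call to run is a separate process, so the only thing connecting the three turns is the directory. That is the role the article attributes to the session ID in the hosted version.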

The Stateful API Problem

Understanding why this matters requires understanding what the Responses API was already doing differently from Chat Completions.

In Chat Completions, state is entirely the developer’s problem. Every request includes the full conversation history. The model has no memory between calls; you reconstruct it each time. Tool calling follows the same pattern: the model returns a tool_calls array, you execute each one, append results to the message array, and call the API again. For simple tools this is manageable. For agents with dozens of tool calls across a long session, the conversation history grows large, and the developer is responsible for trimming, compressing, or summarizing it without losing fidelity.
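
The dispatch loop described above can be sketched without touching the network. The message dictionaries follow the Chat Completions shape (tool_calls on the assistant message, a "tool" role for results), but fake_model, count_lines, and TOOLS are hypothetical stand-ins introduced for illustration; in real code call_model would wrap the API call.

```python
import json
from typing import Callable


def count_lines(text: str) -> int:
    """Hypothetical local tool: count the lines in a string."""
    return len(text.splitlines())


# Registry mapping tool names to the local functions that implement them.
TOOLS: dict[str, Callable[..., object]] = {"count_lines": count_lines}


def run_tool_loop(call_model: Callable[[list], dict], messages: list, max_turns: int = 10) -> str:
    """Call the model, execute any requested tools, append results, and repeat."""
    for _ in range(max_turns):
        reply = call_model(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply["content"]  # no tools requested: this is the final answer
        for call in tool_calls:
            function = TOOLS[call["function"]["name"]]
            arguments = json.loads(call["function"]["arguments"])
            result = function(**arguments)
            # The caller is responsible for feeding every result back to the model.
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}
            )
    raise RuntimeError("tool loop did not finish within max_turns")


def fake_model(messages: list) -> dict:
    """Stand-in for the API: request one tool call, then answer from its result."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"role": "assistant", "content": f"The text has {last['content']} lines."}
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "count_lines",
                    "arguments": json.dumps({"text": "a\nb\nc"}),
                },
            }
        ],
    }


history = [{"role": "user", "content": "How many lines are in the text?"}]
answer = run_tool_loop(fake_model, history)
```

One tool call already turns a single user message into four history entries. Every one of them has to be retransmitted on the next request, which is the growth problem the paragraph describes.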

The Responses API inverted this for built-in tools. It stores conversation state server-side behind a previous_response_id. You reference the prior response rather than retransmitting its content. The computer environment extends that principle to a persistent filesystem and shell, creating a coherent runtime rather than a stateless function that returns to zero between calls.
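
A minimal sketch of that chaining follows. The previous_response_id parameter on client.responses.create is the mechanism named above; continue_conversation, FakeClient, and FakeResponses are hypothetical helpers so the example runs without an API key. Whether the container session is carried by the same parameter is an assumption here, not something the text confirms.

```python
from types import SimpleNamespace
from typing import Optional


def continue_conversation(client, text: str, previous_id: Optional[str] = None, model: str = "gpt-4o"):
    """Send one turn, pointing at the prior response instead of resending history."""
    kwargs = {"model": model, "input": text}
    if previous_id is not None:
        kwargs["previous_response_id"] = previous_id
    return client.responses.create(**kwargs)


class FakeResponses:
    """Hypothetical stub that records each request and returns a numbered response."""

    def __init__(self) -> None:
        self.calls: list[dict] = []

    def create(self, **kwargs):
        self.calls.append(kwargs)
        return SimpleNamespace(id=f"resp_{len(self.calls)}", output_text="(stub)")


class FakeClient:
    """Hypothetical stand-in exposing the same attribute path as the real client."""

    def __init__(self) -> None:
        self.responses = FakeResponses()


client = FakeClient()
first = continue_conversation(client, "Create notes.txt with a title line.")
second = continue_conversation(client, "Append today's summary.", previous_id=first.id)
```

The second request carries only the new input and an ID. Swapping FakeClient for a real OpenAI() client should leave continue_conversation unchanged.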

This matters for agent reliability. A common failure mode in agentic loops is state drift: the model’s understanding of the world diverges from actual state because intermediate steps are imperfectly captured in the message history. When the filesystem is the authoritative state and the model reads it directly via shell commands, that class of divergence is eliminated for file-based operations.
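
State drift is easy to reproduce in miniature. The sketch below is an illustration under assumed conditions, not a description of any real agent: it keeps a history of "wrote <file>" events, performs one deletion that never reaches the history, and compares what the history implies with what the directory actually contains. All function names are hypothetical.

```python
import tempfile
from pathlib import Path


def believed_files(history: list[str]) -> set[str]:
    """Reconstruct the file set purely from logged 'wrote <name>' events."""
    return {line.split(" ", 1)[1] for line in history if line.startswith("wrote ")}


def actual_files(workdir: Path) -> set[str]:
    """Read the file set from the filesystem, the authoritative source."""
    return {path.name for path in workdir.iterdir() if path.is_file()}


def find_drift() -> set[str]:
    """Return files the history claims exist but the filesystem does not have."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        history: list[str] = []

        (workdir / "analyze.py").write_text("print('ok')\n")
        history.append("wrote analyze.py")

        (workdir / "report.txt").write_text("draft\n")
        history.append("wrote report.txt")

        # A later step removes the report, but the event is lost
        # (for example, trimmed away when the history was compressed).
        (workdir / "report.txt").unlink()

        return believed_files(history) - actual_files(workdir)


drift = find_drift()
```

An agent reasoning from the history alone would try to read report.txt and fail. An agent that lists the directory first never forms the wrong belief.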

Shell vs. GUI: Two Approaches to the Same Problem

Anthropic’s computer use capability, released in late 2024, approaches agent environment access from a different angle. The model receives screenshots, decides what to click or type, and those actions are dispatched against a desktop environment. It is general-purpose in the sense that it can operate any GUI application, but each action requires a full screenshot cycle, and the model’s view of state is a rendered image rather than structured data.

OpenAI’s shell environment is narrower in scope and more efficient for developer-oriented workflows. Running a build, processing a CSV, checking git status, running tests: these operations map cleanly onto shell commands and produce text output the model can parse directly. The latency profile is also different: a shell command executes and returns, while a GUI action requires screenshot capture, transmission, model processing, and action dispatch. For workflows where the agent performs many sequential operations, that per-action overhead compounds.
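
The compounding is simple arithmetic: per-action cost multiplied by the number of sequential steps. The sketch below shows only the shape of the calculation. Every timing in it is a hypothetical placeholder, not a measurement of either product, and the stage names for the shell path are assumptions.

```python
# All values are hypothetical placeholders in milliseconds, chosen only to
# illustrate the calculation. Substitute your own measurements.
GUI_STAGES_MS = {
    "screenshot_capture": 100,
    "transmission": 200,
    "model_processing": 1500,
    "action_dispatch": 100,
}
SHELL_STAGES_MS = {
    "model_processing": 800,
    "command_execution": 100,
}


def per_action_ms(stages: dict[str, int]) -> int:
    """Cost of one action: the sum of its sequential stages."""
    return sum(stages.values())


def session_ms(stages: dict[str, int], steps: int) -> int:
    """Cost of a session in which every step pays the full per-action cost."""
    return per_action_ms(stages) * steps


steps = 40
gui_total = session_ms(GUI_STAGES_MS, steps)
shell_total = session_ms(SHELL_STAGES_MS, steps)
difference = gui_total - shell_total
```

Because the total is linear in the step count, any fixed per-action gap grows in direct proportion to the length of the workflow.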

Neither approach is universally better. For automating a legacy application with no API surface, computer use against a desktop is often the only path. For CI pipeline automation, code generation, data processing, or repository analysis, the shell environment is more efficient, and more reliable because the feedback loop between action and observation is tighter and less ambiguous.

Security Model and Isolation

The container isolation model matters for anything involving arbitrary code execution. The architecture here follows the same patterns as similar hosted sandboxing systems. The expected guarantees for hosted containers of this kind include container-level isolation between sessions using separate filesystems and process namespaces, network egress controls to limit what containers can reach, resource limits on CPU and memory to prevent runaway processes, and session-scoped state that does not persist after session termination unless explicitly exported.
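
Two of those guarantees, resource limits and state that disappears with the session, can be illustrated locally. This is a much weaker mechanism than container isolation and is not a description of OpenAI’s implementation: it assumes a POSIX system, caps CPU time only, and run_limited is a hypothetical helper.

```python
import resource
import subprocess
import sys
import tempfile


def run_limited(command: list[str], cpu_seconds: int = 2, wall_seconds: int = 10):
    """Run a command with a CPU-time cap inside a scratch directory that is then deleted."""

    def apply_limits() -> None:
        # Executed in the child process just before exec, so the cap
        # applies to the command and not to the calling program.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))

    with tempfile.TemporaryDirectory() as scratch:
        return subprocess.run(
            command,
            cwd=scratch,
            preexec_fn=apply_limits,
            capture_output=True,
            text=True,
            timeout=wall_seconds,  # wall-clock backstop on top of the CPU cap
        )


result = run_limited([sys.executable, "-c", "print('hello')"])
```

A process that spins past its CPU allowance is terminated by the kernel, and anything it wrote to the scratch directory is gone once run_limited returns. Filesystem and process namespaces and egress controls need container-level tooling and are outside the scope of this sketch.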

This puts the security burden on OpenAI rather than on the developer. The trade-off is giving up control over the execution environment: you cannot supply a custom base image, configure specific kernel parameters, or audit the host infrastructure. For teams running regulated workloads or requiring specific compliance postures, running equivalent infrastructure on self-managed Kubernetes with gVisor or Firecracker provides more control at the cost of more operational surface area. The hosted model is the right default for teams that do not want that complexity.

The Agents SDK Layer

The computer environment integrates cleanly with the OpenAI Agents SDK, which provides higher-level abstractions over the raw Responses API. The SDK’s shell tool surfaces the same capability with the agentic loop already managed, and the tracing infrastructure records each turn including tool inputs and outputs.

import asyncio

from agents import Agent, Runner, ShellTool

# Agent definition: one shell tool plus task instructions.
agent = Agent(
    name="build-agent",
    model="gpt-4o",
    tools=[ShellTool()],
    instructions=(
        "You are a build agent. When given a repository path, "
        "run the test suite and report any failures with full context "
        "including the relevant source lines."
    )
)


async def main() -> None:
    # Runner.run is a coroutine, so it must be awaited inside an event loop.
    result = await Runner.run(agent, "Run tests in /workspace/myproject")
    print(result.final_output)


asyncio.run(main())

The tracing piece is worth emphasizing. When an agent takes fifteen steps across multiple tool calls and produces unexpected output, understanding where it diverged requires a complete record of each action and its result. The SDK emits structured traces that integrate with OpenAI’s dashboard and with third-party observability platforms. This is infrastructure that teams building on raw API calls have to build themselves, and getting it right is not trivial.
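
To show what "a complete record of each action and its result" involves, here is a minimal hand-rolled recorder. It is not the SDK’s trace format or API: Trace, Step, and their fields are hypothetical, and the two lambdas stand in for real tools.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Callable


@dataclass
class Step:
    """One recorded action: what was called, with what, and what came back."""

    index: int
    tool: str
    tool_input: str
    tool_output: str
    duration_ms: float


@dataclass
class Trace:
    """Ordered record of every step taken while working on one task."""

    task: str
    steps: list[Step] = field(default_factory=list)

    def record(self, tool: str, tool_input: str, run: Callable[[str], str]) -> str:
        """Execute a tool, store its input, output, and duration, and return the output."""
        started = time.perf_counter()
        output = run(tool_input)
        elapsed_ms = (time.perf_counter() - started) * 1000
        self.steps.append(Step(len(self.steps) + 1, tool, tool_input, output, elapsed_ms))
        return output

    def to_json(self) -> str:
        """Serialize the whole trace for storage or inspection."""
        return json.dumps(asdict(self), indent=2)


trace = Trace(task="Run tests in /workspace/myproject")
trace.record("shell", "ls tests", lambda command: "test_api.py\n")
trace.record("shell", "pytest -q", lambda command: "1 failed, 4 passed\n")
exported = trace.to_json()
```

Even this toy version has to decide on step ordering, timing, and serialization. A production version also needs to handle errors, nesting, and export, which is the work the SDK takes off your hands.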

What This Architecture Signals

The pattern OpenAI is building toward is an agent that has everything it needs within the API boundary: tools, compute, persistent state, and network access. The developer provides the task; the platform handles execution. This is a fundamentally different model from running your own agent loop and calling the API only for model inference.

The lock-in question is real. An agent built on hosted containers, server-side state, and built-in tool execution is tightly coupled to OpenAI’s platform. Moving to Anthropic’s API or a local model requires rebuilding the execution environment from scratch. The LangChain and LlamaIndex ecosystems exist partly as an abstraction layer over exactly this kind of coupling.
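
One common way to limit that coupling is to keep agent logic behind a thin interface of your own. The sketch below is an assumption about how such a seam might look, not a pattern taken from LangChain, LlamaIndex, or OpenAI: ExecutionBackend, LocalShellBackend, CannedBackend, and collect_outputs are all hypothetical names.

```python
import subprocess
from typing import Protocol


class ExecutionBackend(Protocol):
    """Anything that can run a command and return its text output."""

    def run(self, command: str) -> str: ...


class LocalShellBackend:
    """Runs commands on the local machine; a hosted backend would satisfy the same protocol."""

    def run(self, command: str) -> str:
        done = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=30)
        return done.stdout


class CannedBackend:
    """Returns preset output per command, useful for tests and dry runs."""

    def __init__(self, outputs: dict[str, str]) -> None:
        self.outputs = outputs
        self.seen: list[str] = []

    def run(self, command: str) -> str:
        self.seen.append(command)
        return self.outputs.get(command, "")


def collect_outputs(backend: ExecutionBackend, commands: list[str]) -> dict[str, str]:
    """Agent-side logic depends only on the protocol, not on where commands execute."""
    return {command: backend.run(command) for command in commands}


backend = CannedBackend({"git status": "clean\n"})
report = collect_outputs(backend, ["git status", "make test"])
```

The seam does not remove the cost of rebuilding an execution environment elsewhere. It only confines the change to one class instead of spreading it through the agent logic.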

The counter-argument is that the hardest parts of running reliable agents at scale are not model selection; they are execution isolation, state management, error recovery, and observability. Offloading those to the platform eliminates a significant category of engineering work. For teams without dedicated infrastructure engineers, the hosted model lowers the barrier to production deployment substantially.

The Responses API started as a cleaner interface for agentic workloads. The computer environment update from March 2026 turns it into the runtime those workloads actually need to execute.