· 6 min read ·

OpenAI's Agents SDK Grows Up: Native Sandboxes and the Model-Native Harness

Source: openai

The agent framework space has a recurring problem: most frameworks are wrappers. They sit between your code and a model’s tool-calling interface, translating back and forth, adding abstraction layers that feel productive until something subtle breaks and you’re debugging three levels of indirection. OpenAI’s latest Agents SDK update takes direct aim at that pattern with two additions: native sandbox execution and what they’re calling a model-native harness. Together, they represent a meaningful shift in what an agent framework is supposed to be.

What the SDK Was Before This

The OpenAI Agents SDK launched in early 2025 as the production successor to the experimental Swarm framework. Swarm had good ideas but was explicitly a research prototype. The Agents SDK took those primitives, principally agents with instructions and tools, handoffs between agents, and a structured runner loop, and packaged them for production use. The Python library (openai-agents) gave you decorators for tool registration, a Runner.run() entrypoint, and built-in tracing.

A basic agent setup looked like this:

from agents import Agent, Runner, function_tool

@function_tool
def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

agent = Agent(
    name="file-reader",
    instructions="You help users understand the contents of files.",
    tools=[read_file],
)

result = await Runner.run(agent, "Explain what's in config.json")

Clean enough. The model calls read_file, gets the contents, synthesizes a response. But the execution of that tool happened in your process, in your environment, with your filesystem permissions. If you were building something that needed to run untrusted or model-generated code, you had to bring your own sandbox. The SDK had no opinion on that problem.

Native Sandbox Execution

The new native sandbox changes the execution model fundamentally. Instead of tool calls running in the host process, the SDK can now route code execution into an isolated sandbox environment, one that the SDK manages rather than the developer. This matters most for the code interpreter use case, where a model writes Python, executes it, and uses the output to drive the next step.

The security boundary here is real. When a model is orchestrating multi-step work across files and tools, the surface area for unintended side effects grows. A model that can write and execute arbitrary code needs to do so somewhere it cannot touch the host filesystem, network, or process table without explicit permission. External solutions like E2B and Modal filled this gap previously by providing sandboxed execution environments you could integrate manually. The SDK now provides this natively, which reduces the plumbing required and, more importantly, puts the security boundary under the same abstraction as the rest of the agent lifecycle.

This is architecturally similar to what Anthropic did when they built computer use around Docker container isolation rather than exposing the model to a live desktop. The pattern is becoming a standard: capable models need bounded execution environments, and the framework layer is the right place to enforce that boundary.

The Model-Native Harness

The second piece, the model-native harness, addresses a different class of problem. Traditional agent frameworks accumulate abstraction. They define their own tool schemas, their own message formats, their own loop logic, then translate all of that into whatever the underlying model API expects. The translation is usually lossy in subtle ways. You lose access to model-specific features that don’t fit the framework’s abstraction, and you sometimes get behavior that’s slightly off because the framework’s notion of a tool call doesn’t map cleanly onto the model’s native notion.

A model-native harness inverts this. Rather than the framework defining the interface and translating to the model, the framework builds directly on the model’s own function-calling and tool-use primitives. For OpenAI’s models, this means the harness speaks the same tool-call format that the API exposes natively, without an intermediate translation layer.

The practical consequence is better feature parity. When OpenAI ships a new capability at the model level, structured output support, parallel tool calls, extended context handling, a model-native harness gets access immediately. A framework with its own abstraction layer has to update the translation logic first.

This is a design philosophy that LangGraph has been moving toward with its lower-level graph primitives, separating the orchestration logic from the LLM abstraction so you can use whatever the model natively supports. The Agents SDK is making a similar move, but from the position of a framework controlled by the same organization that controls the model, which gives them a consistency guarantee that third-party frameworks cannot have.

Long-Running Agents and the State Problem

The update also targets long-running agents specifically, and this is where the engineering challenges compound. Short-lived agents, ones that run a single task across a handful of tool calls, are relatively tractable. Long-running agents face several harder problems.

Context window limits become a real constraint when an agent runs for minutes or hours across dozens of tool calls. The history of what happened has to go somewhere. Either you summarize it (lossy), truncate it (lossy in a different way), or externalize it to a retrieval system (adds latency and complexity). The SDK needs an opinion on this if it wants to support long-running work without requiring developers to implement their own context management.

State persistence across failures is the other hard problem. If a long-running agent crashes halfway through a multi-step task, what do you have? With stateless execution, you have nothing and must restart from scratch. A durable state model lets you resume from the last checkpoint. This requires the framework to treat agent state as a first-class concern rather than an implementation detail of the runner loop.

Frameworks like AutoGen and CrewAI have each made different bets here. AutoGen leans heavily on the conversation history as the implicit state, which is simple but hits context limits fast. CrewAI uses a task/crew decomposition that makes state more explicit but requires more upfront design. The Agents SDK’s approach, as it matures, seems to be combining the sandbox boundary with more structured lifecycle management, letting the framework handle resumption rather than leaving it to the developer.

Comparing the Landscape

It is worth placing this update in context. The agent framework space in 2026 is crowded: LangChain and LangGraph, AutoGen, CrewAI, LlamaIndex Workflows, Anthropic’s Claude agent patterns, Google’s Vertex AI Agents. Each has made different trade-offs between abstraction, flexibility, and native model integration.

OpenAI’s position is distinct because they control both the framework and the most widely used models. That vertical integration means the model-native harness is not just a design preference but a genuine advantage: they can ship model capabilities and framework support simultaneously, with no translation gap. The risk is lock-in. Developers who build deeply on native harness features get fast access to new model capabilities but accept coupling to OpenAI’s model API surface.

For the sandbox execution story, the competition is mostly external services. E2B has built a strong developer experience around isolated code execution. Modal offers more general-purpose serverless compute with good isolation. Bringing sandboxing natively into the SDK removes one integration point, but it also means OpenAI controls the execution environment, including its resource limits, networking constraints, and pricing.

Where This Leaves Framework Design

The broader signal from this update is that agent frameworks are becoming execution environments, not just orchestration libraries. The distinction matters. An orchestration library tells you how to structure code you write and run yourself. An execution environment takes on responsibility for where and how the code runs, including the security boundaries and lifecycle management.

That is a larger surface area to manage, and it raises the stakes for framework correctness. A bug in an orchestration library produces wrong behavior. A bug in an execution environment can produce security incidents. OpenAI taking ownership of the sandbox means they are also taking ownership of the security guarantees, which is a meaningful commitment.

For developers building production agents today, the practical question is whether the native sandbox and model-native harness reduce enough friction to offset the increased coupling. For teams already on OpenAI’s models and planning to stay there, the answer is probably yes. For teams who value model portability or need custom execution environments, the abstraction that costs something in the short term may be the right long-term investment.

Was this interesting?