
The Execution Gap: Why Knowing Which Tool to Call Is Only Half the Problem


Looking back at the OpenEnv in Practice post from Turing and Hugging Face, published in February 2026, one number stands out more than any other: agents succeeded on calendar tasks roughly 90% of the time when given explicit calendar identifiers, and dropped to around 40% when the same tasks were phrased with natural language descriptions. That gap is not about intelligence. The model understands the task either way. The gap is about execution, about correctly resolving a human reference like “my work calendar” to the specific identifier the API requires, then forming a valid tool call around it.

This is the core insight the Calendar Gym benchmark surfaces, and it reframes how I think about building agents entirely.

What OpenEnv Is

OpenEnv launched in October 2025 as a joint project between Meta and Hugging Face. On the surface it is an evaluation framework, but the more interesting part is its design philosophy. Rather than exposing raw tool lists to models and hoping they figure out argument schemas, OpenEnv wraps tools inside what it calls “agentic environments”: secure sandboxes with explicit state, isolated execution contexts, and structured error semantics. The environments expose a Gymnasium-style API, so interaction is reset(), iterative step() calls, and close().

The framework is built natively on Model Context Protocol, which means tool discovery and invocation go through two action types: ListToolsAction and ToolCallAction. A minimal interaction looks like this:

from openenv import MCPEnvClient, ListToolsAction, ToolCallAction

client = MCPEnvClient.from_hub("TuringEnterprises/calendar-gym")
obs = client.reset()

# Discover what tools exist in this environment
tools_response = client.step(ListToolsAction())

# Invoke a tool with structured arguments
action = ToolCallAction(
    tool_name="events_insert",
    arguments={
        "calendarId": "primary",
        "summary": "Team sync",
        "start": {"dateTime": "2026-03-12T10:00:00-07:00"},
        "end": {"dateTime": "2026-03-12T11:00:00-07:00"}
    }
)
result = client.step(action)

One of the structural decisions worth paying attention to: the same environment runs for both reinforcement learning post-training and production evaluation. This is uncommon. Most evaluation frameworks are read-only benchmarks disconnected from training pipelines. OpenEnv explicitly targets the combination, which means reward signals come from the same state machine that production agents will face.
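To make the "one environment, two consumers" idea concrete, here is a minimal sketch of how the same Gymnasium-style contract can feed both a training loop and an evaluator. The CalendarEnvStub class and its reward logic are illustrative stand-ins I wrote for this post, not OpenEnv's actual implementation.

```python
class CalendarEnvStub:
    """Toy stateful environment following the reset/step/close contract."""

    def reset(self):
        self.created = set()
        return {"observation": "fresh calendar"}

    def step(self, action):
        # Reward a valid event creation; give nothing for a duplicate.
        event_id = action["summary"]
        reward = 0.0 if event_id in self.created else 1.0
        self.created.add(event_id)
        return {"observation": f"{len(self.created)} events"}, reward

    def close(self):
        self.created.clear()


def rollout(env, actions):
    """An RL trainer and an evaluator can both consume this same loop:
    the trainer treats the rewards as a learning signal, the evaluator
    aggregates them into a score."""
    env.reset()
    rewards = [env.step(a)[1] for a in actions]
    env.close()
    return sum(rewards)


# A duplicate insert earns no reward, so the total is 1.0, not 2.0.
score = rollout(CalendarEnvStub(), [{"summary": "sync"}, {"summary": "sync"}])
```

Because the reward comes from the environment's own state machine, a policy trained against it is graded on exactly the transitions a production agent would face.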

Why Calendars Are a Good Test

The Calendar Gym contributed by Turing might look like a narrow domain choice, but calendar APIs have properties that stress-test agents in ways that matter broadly. Access control lists mean an agent operating on behalf of one user may have read but not write access to another user’s calendar, and figuring that out requires trying an operation and handling the permission error correctly rather than assuming access upfront. Multi-step workflows require correct ordering: you cannot invite attendees to an event that has not been created yet, and the agent needs to track which steps it has completed. Temporal reasoning adds another layer since RFC 3339 datetime strings with timezone offsets are required, and small formatting mistakes produce silent failures or validation errors.
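The timezone-offset failure mode is cheap to guard against locally. A sketch, using Python's datetime.fromisoformat, which accepts ISO 8601 strings (a superset that overlaps with, but is not identical to, RFC 3339); the key check is whether the parsed value actually carries an offset:

```python
from datetime import datetime


def check_rfc3339_offset(value: str) -> bool:
    """Return True only if the string parses and carries a UTC offset."""
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        return False
    return parsed.tzinfo is not None


check_rfc3339_offset("2026-03-12T10:00:00-07:00")  # True: offset present
check_rfc3339_offset("2026-03-12T10:00:00")        # False: naive datetime
```

The second string is the exact shape of the silent-failure case: it parses fine locally, so nothing crashes, but the API on the other end either rejects it or interprets it in the wrong timezone.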

These are the same categories of complexity that appear in any real API-driven workflow: CRM systems, ticketing platforms, infrastructure automation. The calendar is a proxy.

The Finding About Arguments

More than half the errors in Calendar Gym evaluations came from malformed tool arguments, not wrong tool selection. The three dominant categories were schema validation errors (missing required fields, incorrect JSON nesting, type mismatches), permission errors where the agent attempted operations outside its OAuth scope, and datetime format errors where the model produced something like 2026-03-12T10:00:00 without a timezone offset when the API requires 2026-03-12T10:00:00-07:00.
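Schema errors in particular can be caught before the call ever leaves the agent. A minimal pre-flight check along these lines, where the schema format is a simplified stand-in of my own, not OpenEnv's actual tool schema:

```python
# Required fields and their expected Python types for a hypothetical
# events_insert tool, mirroring the example call earlier in the post.
EVENTS_INSERT_SCHEMA = {
    "calendarId": str,
    "summary": str,
    "start": dict,
    "end": dict,
}


def preflight(arguments: dict, schema: dict) -> list[str]:
    """Return a list of problems instead of letting the API reject the call."""
    problems = []
    for field, expected_type in schema.items():
        if field not in arguments:
            problems.append(f"missing required field: {field}")
        elif not isinstance(arguments[field], expected_type):
            problems.append(f"type mismatch on {field}")
    return problems


# A flat string where the API expects a nested object is caught locally:
preflight({"calendarId": "primary", "summary": "Team sync",
           "start": "2026-03-12T10:00:00-07:00"}, EVENTS_INSERT_SCHEMA)
# → ['type mismatch on start', 'missing required field: end']
```

This catches exactly the "incorrect JSON nesting" class of error: the model emitted a bare datetime string where the API wanted a {"dateTime": ...} object.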

This aligns with what the MAST failure taxonomy, published at NeurIPS 2025, found across a much broader dataset of 1,600 annotated agent traces. Different models fail in distinctly different ways: GPT-OSS-120B shows cascading collapse with an average of 5.3 failure modes per failed trace; Gemini tends to hallucinate task completion; Kimi-K2 terminates prematurely. The failures are model-specific and structural, not just random noise.

The OpenEnv response to this is to return structured error payloads rather than opaque failures. When a tool call fails, the environment returns a typed error like validation_error, permission_error, or format_error along with specifics about what went wrong. This makes repair-and-retry loops possible in principle. Whether a given agent actually uses that signal to correct itself is another question, and it is one of the things the benchmark reveals about agent architecture.
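A repair-and-retry loop keyed on typed errors might look like the sketch below. The error names mirror the ones described above, but fake_step and the repair table are illustrative stand-ins, not part of OpenEnv:

```python
def fake_step(arguments):
    """Stand-in for an environment step that returns typed errors."""
    if arguments["start"]["dateTime"].endswith("-07:00"):
        return {"ok": True}
    return {"ok": False, "error": "format_error",
            "field": "start.dateTime", "expected_format": "RFC3339"}


def repair_format_error(arguments, error):
    # Illustrative repair: append a timezone offset to the flagged field.
    fixed = dict(arguments)
    fixed["start"] = {"dateTime": arguments["start"]["dateTime"] + "-07:00"}
    return fixed


# Dispatch table: one repair strategy per typed error.
REPAIRS = {"format_error": repair_format_error}


def call_with_retry(arguments, max_attempts=3):
    for _ in range(max_attempts):
        result = fake_step(arguments)
        if result["ok"]:
            return arguments
        repair = REPAIRS.get(result["error"])
        if repair is None:
            raise RuntimeError(f"unrepairable: {result['error']}")
        arguments = repair(arguments, result)
    raise RuntimeError("retries exhausted")


fixed = call_with_retry({"start": {"dateTime": "2026-03-12T10:00:00"}})
```

The point of the sketch is the dispatch structure: a typed error makes the repair step a table lookup, whereas an opaque 400 would force the agent to guess what went wrong.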

Where This Fits in the Benchmark Landscape

The agent evaluation space has changed rapidly. SWE-bench Verified was deprecated in February 2026 after contamination and flawed test cases undermined its reliability as a signal. AgentBench from 2023 spanned eight environments and found that open-source models of that era scored below 1.0 on an 8-point scale where GPT-4 reached 3.78. tau-bench from Sierra AI introduced the pass^k metric to measure consistency across multiple runs of the same task, which revealed that single-run accuracy on multi-turn policy tasks is misleadingly optimistic. IBM’s IT-Bench, also from early 2026, put agents against real Kubernetes incident triage and got 0% task completion on financial operations scenarios.
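The pass^k idea is worth seeing numerically. My understanding of the tau-bench estimator: given c successes in n trials of a task, the probability that k independent runs all succeed is estimated combinatorially, analogous to the pass@k estimator but requiring all k to pass:

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that k independent runs of a task all
    succeed, given c successes observed in n trials."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)


# A task that succeeds 7 of 8 times looks strong at k=1, but consistency
# across 4 runs is much lower:
pass_hat_k(8, 7, 1)  # 0.875
pass_hat_k(8, 7, 4)  # 0.5
```

This is the sense in which single-run accuracy is misleadingly optimistic: an 87.5% task collapses to a coin flip once you require it to succeed four times in a row.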

OpenEnv occupies a different position from any of these. It is not primarily a leaderboard. It is infrastructure for defining and sharing evaluation environments, with the Hub as a discovery mechanism. The environments on the Hub as of early 2026 include a Python REPL, a BrowserGym integration, Wordle, Sudoku, and the Calendar Gym. The model is closer to a package registry than a benchmark: you publish an environment, others evaluate against it, and training pipelines can consume the same environment.

The MCP-native design is the biggest structural differentiator. Most agent benchmarks define their own tool call formats, which means agents trained or optimized for one benchmark do not transfer cleanly to another. Building on Model Context Protocol creates at least the possibility of shared tooling across evaluation environments and production deployments.

What This Means for Building Agents

I build Discord bots, so my agentic code tends to involve tool calls to Discord’s API, GitHub, maybe a database or two. The failure modes the Calendar Gym surfaces are immediately familiar. Permission errors when a bot tries to post in a channel it cannot access. Argument errors when an embed field exceeds the character limit. Multi-step ordering failures when a reaction role assignment runs before the message it targets has been confirmed to exist.
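The embed case is a good example of a limit worth checking locally. Discord's documented limit for an embed field value is 1024 characters; a guard like the sketch below (my own helper, not part of any Discord library) turns an opaque HTTP 400 into an actionable local error:

```python
# Documented Discord limit for a single embed field's value.
EMBED_FIELD_VALUE_LIMIT = 1024


def validate_embed_field(name: str, value: str) -> list[str]:
    """Return a list of problems with this field, empty if it is fine."""
    problems = []
    if len(value) > EMBED_FIELD_VALUE_LIMIT:
        problems.append(
            f"field '{name}' value is {len(value)} chars "
            f"(limit {EMBED_FIELD_VALUE_LIMIT})")
    return problems


validate_embed_field("changelog", "x" * 2000)  # one problem: over the limit
validate_embed_field("changelog", "short")     # → []
```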

The ReAct pattern of interleaving explicit reasoning traces with tool calls is the standard approach for giving agents the structure to handle these situations. But ReAct alone does not guarantee that an agent will correctly parse a structured error payload and modify its subsequent tool call accordingly. That behavior needs to be either prompted explicitly, trained in, or enforced architecturally.

OpenEnv’s approach of returning typed error payloads rather than raw HTTP error codes or exception strings is a practical design choice. An agent seeing {"error": "validation_error", "field": "start.dateTime", "expected_format": "RFC3339"} has something to work with. An agent seeing a 400 status code and an opaque error message from a misconfigured API does not.

The 90%-to-40% accuracy gap on explicit versus natural language task descriptions points to a specific capability that is undersupported in most agent training: entity resolution against real system state. An agent needs to list the user’s calendars, match the natural language description to one of them, and then use that identifier in subsequent calls. This is a multi-step sub-task that occurs before the primary task even begins, and it is exactly the kind of thing that gets dropped in benchmarks that only measure final task success.
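The entity-resolution sub-task can be made explicit as a step of its own. A sketch of resolving "my work calendar" to a concrete identifier before the primary task begins; the calendar list and the lexical matching heuristic are illustrative, not OpenEnv's API:

```python
# A stand-in for the result of a list-calendars tool call.
CALENDARS = [
    {"id": "primary", "summary": "Personal"},
    {"id": "c_work123@group.calendar.google.com", "summary": "Work"},
]


def resolve_calendar(description: str, calendars: list[dict]) -> str:
    """Naive lexical match against calendar summaries. A real agent would
    perform this resolution as an explicit step, rather than assuming an
    identifier that was never confirmed against system state."""
    desc = description.lower()
    for cal in calendars:
        if cal["summary"].lower() in desc:
            return cal["id"]
    raise LookupError(f"no calendar matches {description!r}")


resolve_calendar("my work calendar", CALENDARS)
# → "c_work123@group.calendar.google.com"
```

The failure mode the 90-to-40 gap points at is exactly the case where this step is skipped: the agent guesses "primary" or hallucinates an identifier instead of listing and matching first.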

Where This Goes

The evaluation problem for agents is genuinely harder than the evaluation problem for base language models, and frameworks like OpenEnv are making progress on the right dimensions: real APIs with real permission systems, stateful environments where earlier actions affect later ones, structured feedback that enables intelligent error recovery, and infrastructure that bridges evaluation and training.

The Calendar Gym finding about argument formation errors is not a surprising result in retrospect, but it is useful to have it quantified. Building agents that can correctly resolve ambiguous references, form valid tool arguments on the first try, and recover from typed errors gracefully remains the practical engineering work. Evaluation frameworks that surface these failure modes specifically are more useful than those that report only overall task completion rates.

The environments available on the OpenEnv Hub are still a small set, and the real value of the framework depends on whether the community builds and shares environments that cover a wider range of real APIs. That trajectory looks promising, but it is early.
