
Tool Schema Design Is the Hidden Variable in Agent Reliability

Source: huggingface

The assumption baked into most discussion about AI agent performance is that model quality is the primary variable. Pick a smarter model, get better agents. The findings from IBM Research’s VAKRA benchmark complicate that assumption in a specific and useful way: for a significant category of failures, the bottleneck is not the model but the interface through which the model accesses tools.

VAKRA covers 5,187 tasks across 62 domains, backed by more than 8,000 locally hosted API instances, and its evaluation framework scores both the final answer and the full execution trajectory. Errors are classified by the earliest stage of failure: tool selection, argument specification, argument value, or response generation. These categories are disjoint. A tool selection error is a tool selection error regardless of what the model produces downstream from it. That disjoint classification is what allows the benchmark to locate failure in the execution chain rather than just count wrong answers.
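The earliest-stage rule described above can be sketched in a few lines. This is an illustration only, not VAKRA's evaluation code: the stage identifiers and the `earliest_failure` and `failure_profile` helpers are hypothetical names, and the sketch assumes each run arrives as the set of stages that failed a check.

```python
from collections import Counter
from typing import Optional

# Stage names mirror the article's taxonomy; list order encodes "earliest first".
STAGES = [
    "tool_selection",
    "argument_specification",
    "argument_value",
    "response_generation",
]

def earliest_failure(failed_stages: set[str]) -> Optional[str]:
    """Return the first failing stage in pipeline order, or None if nothing failed."""
    for stage in STAGES:
        if stage in failed_stages:
            return stage
    return None

def failure_profile(runs: list[set[str]]) -> Counter:
    """Count runs by earliest failing stage, so each run lands in exactly one bucket."""
    return Counter(earliest_failure(run) or "success" for run in runs)
```

Because every run is assigned to a single bucket, the counts add up to the number of runs, which is what makes the resulting profile readable as a breakdown of where the chain first breaks.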

The SLOT-BIRD vs SEL-BIRD Experiment

VAKRA’s first capability tier tests API chaining across 2,077 tasks in 54 domains. What makes it methodologically interesting is that the same underlying data operations are exposed through two different tool collection designs simultaneously.

SLOT-BIRD provides seven generic tools with rich optional-parameter surfaces. A sort_data function might accept direction, column, nulls_first, and additional parameters depending on context. SEL-BIRD covers the identical operations but encodes parameters into the tool name: sort_data_ascending and sort_data_descending are separate tools with narrower argument surfaces. The reasoning task is identical. The data is identical. The only variable is the tool API design.
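To make the contrast concrete, here is a rough sketch of the two styles as JSON-Schema-like Python dictionaries. The schema layout, the enum values, and the `specialize` helper are assumptions made for illustration; they are not taken from the benchmark's actual tool definitions.

```python
import copy

# Generic tool (SLOT-BIRD style): one tool, behavior chosen through arguments.
sort_data = {
    "name": "sort_data",
    "description": "Sort rows by a column.",
    "parameters": {
        "type": "object",
        "properties": {
            "column": {"type": "string"},
            "direction": {"type": "string", "enum": ["ascending", "descending"]},
            "nulls_first": {"type": "boolean"},
        },
        "required": ["column"],
    },
}

def specialize(tool: dict, param: str) -> list[dict]:
    """Split a generic tool into one tool per enum value of `param` (SEL-BIRD style)."""
    variants = []
    for value in tool["parameters"]["properties"][param]["enum"]:
        variant = copy.deepcopy(tool)
        variant["name"] = f"{tool['name']}_{value}"
        variant["description"] = f"{tool['description']} ({param}={value})"
        params = variant["parameters"]
        del params["properties"][param]  # the choice now lives in the tool name
        params["required"] = [r for r in params["required"] if r != param]
        variants.append(variant)
    return variants

sel_tools = specialize(sort_data, "direction")
print([tool["name"] for tool in sel_tools])
```

The generic tool has three parameters and one name; the specialized pair has two names and two parameters each. The information is the same, but it has moved from the argument surface into the tool vocabulary.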

Models perform differently on the two collections in a pattern that reveals something structural. Failure on SLOT-BIRD is dominated by argument specification errors: models know which tool to call but misspecify optional parameters or omit required ones. Failure on SEL-BIRD shifts toward tool selection errors: models must pick among a larger vocabulary of more specific tools.

This is not a finding about model intelligence in the abstract. It is a finding about how the model’s cognitive task changes depending on how tools are designed. On SLOT-BIRD, the model must reason about parameter semantics. On SEL-BIRD, it must reason about tool selection from a wider menu. These are different loads, and current models handle them differently.

The practical implication for anyone who controls a tool schema: you are choosing which failure mode you are more willing to tolerate. Narrower tool signatures with fewer optional parameters tend to reduce argument specification errors. The tradeoff is a larger tool inventory, which increases selection complexity. Neither is free, but they are addressable through different engineering interventions, and knowing which one your system produces is more valuable than a single accuracy number.

The 128-Tool Ceiling and What Comes After It

Capability 2 tests tool selection from inventories ranging from 6 to 328 tools per domain, averaging 116. This immediately surfaces a hard engineering constraint: the OpenAI API caps function definitions at 128 per request. Any domain with more than 128 tools cannot be handled in a single call without architectural intervention.

The standard response is a shortlisting mechanism, a retrieval step that selects candidate tools from the full inventory before passing them to the model. VAKRA treats shortlisting capability as a measured variable rather than an implementation detail, and agents that can build and use such a mechanism perform better on large-tool-set tasks.

Shortlisting is not fundamentally a model-quality problem. It is an architectural problem. The tools themselves need metadata that supports retrieval, whether through embeddings, category tags, or structured descriptions. The retrieval step needs to run before the model generates its tool calls. The retrieved candidate set needs to be small enough to fit within context while still containing the correct tool with high probability. A better model does not solve any of these by default; they require deliberate decisions at the agent infrastructure layer.
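A minimal sketch of such a retrieval step follows. It uses plain token overlap between the query and each tool's name and description as a stand-in for embeddings, purely to stay dependency-free; the `shortlist` function, the scoring choice, and the example tools are all hypothetical.

```python
import re

MAX_TOOLS_PER_REQUEST = 128  # the per-request cap cited above

def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; underscores in tool names act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def shortlist(query: str, tools: list[dict], k: int = 10) -> list[dict]:
    """Return the k tools whose name and description best overlap the query."""
    k = min(k, MAX_TOOLS_PER_REQUEST)
    query_tokens = tokens(query)

    def score(tool: dict) -> float:
        tool_tokens = tokens(tool["name"] + " " + tool["description"])
        union = query_tokens | tool_tokens
        # Jaccard similarity: shared tokens over all tokens.
        return len(query_tokens & tool_tokens) / len(union) if union else 0.0

    return sorted(tools, key=score, reverse=True)[:k]

inventory = [
    {"name": "sort_data_ascending", "description": "Sort rows in ascending order"},
    {"name": "sort_data_descending", "description": "Sort rows in descending order"},
    {"name": "filter_rows", "description": "Keep rows matching a condition"},
]
candidates = shortlist("sort rows descending by price", inventory, k=2)
print([tool["name"] for tool in candidates])
```

In a real system the scoring function would more plausibly be an embedding similarity or a tag filter, but the shape is the same: rank the inventory, truncate, and only then hand the candidates to the model.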

Capability 2 also isolates a failure mode that is distinct from both tool selection and argument specification: what happens after a model calls the right tool correctly. VAKRA identifies this as the tool-to-answer synthesis gap. Models can select and invoke the correct tools, receive correct outputs, and then fail to extract the answer from those outputs. The execution was right; the answer assembly was wrong. This points to the synthesis step as a capability that needs explicit architectural attention, separate from everything that precedes it in the agent loop.

Error Compounding in Multi-Hop Chains

Capability 3 tests chains of one to five sequential reasoning hops across 869 tasks. Performance degrades monotonically with depth: more hops, worse results, for every model tested. The standard interpretation is that deeper chains exceed model reasoning capacity.

VAKRA’s disjoint error taxonomy supports a more specific explanation. When hop N produces an incorrect intermediate result, that result becomes the input to hop N+1. The agent has no mechanism to detect the error before proceeding. The failure propagates and compounds through all subsequent steps. The degradation curve reflects cumulative error propagation, not a single reasoning failure at a given hop depth.
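A toy calculation shows why propagation alone produces a declining curve. It assumes every hop succeeds independently with the same probability, which is a simplification for illustration and not something VAKRA reports; the 0.9 figure is invented.

```python
def chain_success(per_hop_success: float, hops: int) -> float:
    """Probability that every hop succeeds, assuming independent, identical hops."""
    return per_hop_success ** hops

# With no way to catch a bad intermediate result, one failed hop sinks the chain.
for hops in range(1, 6):
    print(hops, round(chain_success(0.9, hops), 3))
```

Under these assumptions a hop that is right nine times out of ten yields a five-hop chain that is right only about 59 percent of the time, without any single hop being harder than the others.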

An agent architecture with explicit intermediate verification would interrupt this propagation. Between each hop, before proceeding to the next step, the agent could check whether the intermediate result is coherent: does it match the expected type, fall within a plausible range, satisfy basic schema constraints? This does not require a separate model call in most cases. It requires that the agent loop be designed to pause and verify rather than passing intermediate results forward unconditionally.
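One way to structure such a loop is sketched below. The `run_chain` function, the `HopVerificationError` exception, and the idea of pairing each step with a plain Python predicate are assumptions made for illustration, not a description of any particular agent framework.

```python
from typing import Any, Callable

Step = Callable[[Any], Any]
Check = Callable[[Any], bool]

class HopVerificationError(Exception):
    """Raised when an intermediate result fails its coherence check."""

def run_chain(hops: list[tuple[Step, Check]], initial: Any) -> Any:
    """Run each hop in order, verifying its output before feeding the next hop."""
    value = initial
    for index, (step, is_coherent) in enumerate(hops, start=1):
        value = step(value)
        if not is_coherent(value):
            # Stop here instead of passing a suspect value downstream.
            raise HopVerificationError(
                f"hop {index} produced an implausible result: {value!r}"
            )
    return value

# Example: a lookup that should return a year, followed by an age calculation.
hops = [
    (lambda name: 1990, lambda year: isinstance(year, int) and 1900 <= year <= 2100),
    (lambda year: 2024 - year, lambda age: isinstance(age, int) and 0 <= age <= 130),
]
print(run_chain(hops, "some person"))
```

What the caller does with the exception is a separate design choice: retry the hop, ask the model to reconsider, or surface the failure. The point of the sketch is only that the bad value never reaches the next hop.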

The observation that no tested model shows robustness to deep chains suggests this compounding pattern is a property of current agent loop architectures, not solely of current models. You could address it with the same models by changing how the loop handles intermediate outputs.

Policy Constraints Must Enter at Plan Generation

Capability 4 introduces policy constraints expressed in natural language, instructions like “if the user’s query relates to Technology and Software, use only document retrievers.” Every model tested except Granite-4.0-h-Small-32B shows measurable performance drops when such constraints are active.

The failure mode is instructive. Models do not simply ignore the constraints; they read them. The problem is timing. When the model generates its tool-selection plan, it typically selects tools based on what would optimally answer the question, then encounters the policy restriction at execution time. The constraint was available at plan generation but was not operative in the planning step.

The fix is architectural. Policy constraints need to be integrated into tool selection at plan generation time, not checked against a completed plan after the fact. If the model generates a plan and a second pass then validates it against policy, violations are caught but the model never reasoned within the constrained solution space when constructing the plan. It needs to plan with the constraint active, not plan and then conform.
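One simple way to make a constraint operative during planning is to narrow the inventory before the planner ever sees it. The sketch below assumes the policy has already been translated into a topic-to-tool-kind table and that the query's topic is classified upstream; the `POLICIES` table, the `kind` field, and the tool names are hypothetical.

```python
# Hypothetical policy table: topic -> tool kinds the planner may use.
POLICIES = {"Technology and Software": {"document_retriever"}}

def planning_inventory(
    tools: list[dict], topic: str, policies: dict[str, set[str]]
) -> list[dict]:
    """Return only the tools the planner is allowed to consider for this topic."""
    allowed_kinds = policies.get(topic)
    if allowed_kinds is None:  # no policy for this topic: expose everything
        return list(tools)
    return [tool for tool in tools if tool["kind"] in allowed_kinds]

tools = [
    {"name": "search_docs", "kind": "document_retriever"},
    {"name": "get_stock_price", "kind": "api"},
]
visible = planning_inventory(tools, "Technology and Software", POLICIES)
print([tool["name"] for tool in visible])
```

Filtering this way means the plan is constructed inside the permitted set rather than trimmed to fit it afterwards. It only covers policies that reduce to tool availability; constraints on ordering or argument values would still need to appear in the planning prompt itself.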

Granite-4.0-h-Small-32B’s robustness is notable because, at 32B parameters, it holds up where significantly larger models do not. VAKRA does not fully account for why. One reasonable hypothesis is that Granite’s training included more policy-constrained agentic scenarios, which forced the model to incorporate constraints as planning inputs rather than execution-time checks. If that is the cause, it suggests that policy-adherent agent behavior is a trainable capability, not just an emergent property of model scale.

Designing Tool Schemas for Current Models

Reading across the VAKRA capabilities, a few design principles emerge that are grounded in the specific failure categories the benchmark identifies.

Prefer narrower signatures over generic parameter-rich tools when you control the tool surface. The argument specification failure mode on SLOT-BIRD is avoidable through schema design choices. The tradeoff is inventory size, which is a more tractable problem than hallucinated or omitted parameter values.

Build shortlisting into the architecture before you reach the tool-count limit. Above 128 tools per request, you need a retrieval layer. Below 128, you probably still benefit from shortlisting if tool descriptions are long or semantically similar. Treat this as infrastructure rather than a workaround added when the system breaks.

Add intermediate verification to multi-hop reasoning loops. Between each hop, check whether the intermediate result is coherent before proceeding. This does not require a dedicated model call in most cases; lightweight schema or type validation at the loop level is sufficient to catch the most common propagation failures.

Incorporate policy constraints into tool selection at plan generation time. Design your planning context to include active policy constraints as part of the tool-selection framing, not as a separate post-hoc adherence check.

The VAKRA dataset and evaluation code are publicly available. The benchmark runs against locally hosted APIs rather than live internet services, which makes it reproducible without depending on third-party availability. For anyone building agent systems that involve chained tool use, the benchmark provides a structured way to identify which failure modes your current architecture produces before those failures surface in production. Knowing that your system fails at argument specification tells you something different and more actionable than knowing it achieved 60 percent accuracy overall.
