What VAKRA Exposes About the Gap Between Tool Invocation and Agent Reliability

Most agent benchmarks test whether a model produces the right JSON. VAKRA tests whether the model gets the right answer after the JSON runs. That gap is where the interesting failures live.

VAKRA, released by IBM Research, is a tool-grounded, executable benchmark built around 8,000+ locally hosted APIs spanning 62 domains, all backed by real databases. Tasks require anywhere from 3 to 7 reasoning steps, and the evaluation pipeline actually executes predicted tool calls and checks the outputs against ground truth rather than comparing parameter names at the syntax level. If a model hallucinates an argument value and the API returns garbage, the failure registers downstream where it belongs.

That design choice changes what you learn from the results. Static evals can give you a false sense of competence when models produce syntactically valid but semantically wrong calls. VAKRA forces the error to surface at the only level that matters: did the agent retrieve the right information and produce a grounded answer.
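The difference between the two evaluation styles can be made concrete with a toy example. This is a minimal sketch, not VAKRA's evaluation code: the in-memory table, the `select_equal` tool, and both checker functions are hypothetical stand-ins invented for illustration.

```python
# Toy in-memory "database"; rows and field names are made up for illustration.
TEAMS = [
    {"team": "Team A", "play_speed": 31},
    {"team": "Team B", "play_speed": 45},
]

def select_equal(key_name, value):
    """Hypothetical tool: names of teams whose field equals the given value."""
    return [row["team"] for row in TEAMS if row.get(key_name) == value]

def static_match(predicted, expected):
    """Syntax-level check: same tool name and same argument names."""
    return (predicted["name"] == expected["name"]
            and set(predicted["arguments"]) == set(expected["arguments"]))

def execution_match(predicted, expected, tools):
    """Execution-level check: run both calls and compare what comes back."""
    def run(call):
        return tools[call["name"]](**call["arguments"])
    return run(predicted) == run(expected)

tools = {"select_equal": select_equal}
expected = {"name": "select_equal",
            "arguments": {"key_name": "play_speed", "value": 31}}
# A well-formed call with a hallucinated argument value.
predicted = {"name": "select_equal",
             "arguments": {"key_name": "play_speed", "value": 13}}

static_ok = static_match(predicted, expected)
execution_ok = execution_match(predicted, expected, tools)
```

Here `static_ok` is true while `execution_ok` is false: the call is well-formed, but it retrieves nothing, which is exactly the kind of error a syntax-level comparison lets through.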

Four Capabilities, Four Distinct Failure Profiles

VAKRA structures evaluation around four progressively more complex capability tiers.

Capability 1 covers API chaining: 2,077 instances across 54 domains using two different tool collection designs. SLOT-BIRD provides 7 generic, parameterized tools modeled on interfaces like Tableau or Google Analytics. SEL-BIRD flattens those parameters into specialized tools, so instead of one sort_data function with a direction argument, you get sort_data_ascending and sort_data_descending as separate tools. The trade-off is explicit: SLOT-BIRD has fewer tools with richer argument surfaces; SEL-BIRD has more tools but simpler argument surfaces per tool.
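To make the two designs tangible, here is a rough sketch of the sort example in both styles. The function names come from the description above, but the signatures and the row format are assumptions and not the benchmark's actual tool definitions.

```python
def sort_data(rows, key_name, direction="ascending"):
    """Slot-style tool: one function, behavior chosen by the direction argument."""
    if direction not in ("ascending", "descending"):
        raise ValueError(f"unknown direction: {direction!r}")
    return sorted(rows, key=lambda r: r[key_name],
                  reverse=(direction == "descending"))

def sort_data_ascending(rows, key_name):
    """Selection-style tool: the direction is baked into the tool name."""
    return sort_data(rows, key_name, "ascending")

def sort_data_descending(rows, key_name):
    """Selection-style tool: the direction is baked into the tool name."""
    return sort_data(rows, key_name, "descending")

# Made-up rows for illustration.
rows = [{"score": 2}, {"score": 9}, {"score": 5}]
slot_result = sort_data(rows, "score", direction="descending")
sel_result = sort_data_descending(rows, "score")
```

Both calls produce the same ordering. What changes is where the model can go wrong: in the slot style it has to fill `direction` correctly, while in the selection style it has to pick the right name from a longer tool list.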

A typical Capability 1 chain looks like this:

{
  "query": "Which football team has build-up play speed of 31, dribbling of 53, and passing of 32?",
  "tool_calls": [
    {"name": "get_data", "arguments": {"tool_universe_id": "486ea46224d1-aeb8037c5e78"}},
    {"name": "select_data_equal_to", "arguments": {"data_label": "retrieved_data_1", "key_name": "play_speed", "value": 31}},
    {"name": "select_data_equal_to", "arguments": {"data_label": "FILTERED_DF_0", "key_name": "play_dribble", "value": 53}},
    {"name": "select_data_equal_to", "arguments": {"data_label": "FILTERED_DF_1", "key_name": "play_passing", "value": 32}},
    {"name": "get_team_name", "arguments": {"data_label": "FILTERED_DF_2", "n": 1}}
  ],
  "answer": "FC Barcelona"
}

The interesting failure mode in Capability 1 is not tool selection but argument naming, especially with SLOT-BIRD’s interface, which leans heavily on optional parameters. Models that perform well overall still drop points by omitting optional arguments or hallucinating parameter names that do not exist in the schema.
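Both kinds of argument error are cheap to detect before a call ever runs. The following is a minimal sketch of such a check; the schema layout (`required` and `optional` lists) and the `sort_key` typo are assumptions made up for this example, not VAKRA's schema format.

```python
def check_arguments(call, schema):
    """Compare a call's argument names against a tool schema.

    Returns three sorted lists: names the schema does not define,
    required names the call left out, and optional names the call did not use.
    """
    given = set(call["arguments"])
    required = set(schema["required"])
    optional = set(schema["optional"])
    unknown = sorted(given - required - optional)
    missing = sorted(required - given)
    unused = sorted(optional - given)
    return unknown, missing, unused

# Hypothetical schema for a slot-style sort tool.
schema = {"required": ["data_label", "key_name"], "optional": ["direction"]}
# The model invented "sort_key" instead of using "key_name".
call = {"name": "sort_data",
        "arguments": {"data_label": "retrieved_data_1", "sort_key": "play_speed"}}

unknown, missing, unused = check_arguments(call, schema)
```

For this call, `unknown` holds the hallucinated `sort_key`, `missing` holds the `key_name` it displaced, and `unused` holds the omitted `direction`. Whether an omitted optional argument matters depends on the query, which is why that case only shows up reliably when the call is executed.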

Capability 2 introduces a tool selection problem at scale: 1,597 instances across 17 domains, with REST-style endpoint collections ranging from 6 to 328 tools per domain, averaging 116. This immediately surfaces an architectural reality. OpenAI’s API caps the tool list at 128 entries, so for the Capability 2 domains that exceed that cap, up to the largest at 328 tools, any system built on that API must implement a shortlisting mechanism before the model ever sees the options. That is not a minor implementation detail. It is a retrieval problem nested inside a reasoning problem, and mistakes at the retrieval stage cascade into tool selection failures downstream.
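A shortlisting step can be as simple as ranking tools by lexical overlap with the query. The sketch below is a deliberately naive version under stated assumptions: the tool catalog is fabricated, and a real system would more likely use embeddings or another retriever in place of word overlap.

```python
import re

MAX_TOOLS = 128  # the tool-list cap cited above

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def shortlist(query, tools, limit=MAX_TOOLS):
    """Keep the `limit` tools whose name and description share the most words with the query."""
    query_tokens = tokenize(query)

    def overlap(tool):
        return len(query_tokens & tokenize(tool["name"] + " " + tool["description"]))

    # sorted() is stable, so tools with equal overlap keep their catalog order.
    return sorted(tools, key=overlap, reverse=True)[:limit]

# Hypothetical 328-tool domain: 327 filler endpoints plus one relevant tool listed last.
catalog = [
    {"name": f"endpoint_{i}", "description": f"unrelated lookup number {i}"}
    for i in range(327)
]
catalog.append({"name": "get_team_roster",
                "description": "List players on a football team roster"})

shortlisted = shortlist("Which players are on the football team roster?", catalog)
```

The relevant tool survives the cut only because the ranking put it there. If the retriever had missed it, the model would never have had the chance to select it, however good its reasoning.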

The finding that stood out here: even when models selected and executed the correct tool calls, they frequently failed to synthesize a coherent final answer from the tool responses. The execution was right; the answer assembly was wrong. That pattern points to something distinct from tool-use capability, closer to reading comprehension over structured outputs.

Capability 3 tests multi-hop reasoning across 869 instances with chains requiring 1 to 5 logical hops. Performance drops predictably with hop depth. Single-hop queries show the best results across all evaluated models. Two-hop queries show significant degradation. Three or more hops degrade further. The curve is not surprising, but its steepness reinforces the view that models are not doing genuine multi-step planning; they pattern-match well on shallow chains and lose coherence as dependency depth increases.

Capability 4 is where things get genuinely hard and where the most deployment-relevant findings appear. It stacks everything: API calls combined with document retrieval (RAG), multi-turn conversational context, and explicit tool-use policies expressed in plain text.

Policy Adherence Is the Real Test

The policy constraints in Capability 4 are the most practically significant part of the benchmark. An example policy looks like this:

If a user's query pertains to Technology & Software (codebases, software platforms,
applications, user interactions in tech), answer using only document retrievers.
Do not use other types of tools.

This is not a capability restriction tucked into a system prompt as an afterthought. It is a routing rule of the kind you would write in any real enterprise deployment where different data sources have different access controls, compliance requirements, or cost implications. Every production agent system I have seen or built eventually needs something like this.

The finding: almost every evaluated model shows performance drops under policy constraints. Models either violate the constraint outright or, when they respect it, fail to retrieve enough information through the permitted channels to answer correctly. The one exception in the benchmark was Granite-4.0-h-Small-32B, which maintained performance under policy constraints, though the paper does not detail why.

The failure pattern for constraint violation is telling. These models understand the policy text. They can parse it. The problem is that when the policy restricts access to the information source the model would normally reach for, many models fall back on parametric knowledge or reach for the disallowed tool anyway rather than working harder through the permitted retrieval path. Constraint adherence under resource restriction is a different cognitive load than constraint adherence when the unconstrained path is also available.
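A policy like the one quoted above lends itself to a programmatic check once the query's domain and each tool's type are known. This is a minimal sketch under those assumptions: the tool names, the type labels, and the premise that the domain has already been classified are all hypothetical, and this is not the benchmark's own checker.

```python
def policy_violations(query_domain, tool_calls, tool_types, allowed_type_by_domain):
    """Return the names of called tools that the policy forbids for this domain.

    Domains with no entry in `allowed_type_by_domain` are unrestricted.
    """
    allowed = allowed_type_by_domain.get(query_domain)
    if allowed is None:
        return []
    return [call["name"] for call in tool_calls
            if tool_types.get(call["name"]) != allowed]

# Hypothetical tool inventory and a single routing rule.
tool_types = {"search_docs": "document_retriever", "get_repo_stats": "api"}
allowed_type_by_domain = {"Technology & Software": "document_retriever"}

calls = [
    {"name": "search_docs", "arguments": {"query": "login flow"}},
    {"name": "get_repo_stats", "arguments": {"repo": "example"}},
]

violations = policy_violations("Technology & Software", calls,
                               tool_types, allowed_type_by_domain)
unrestricted = policy_violations("Sports", calls,
                                 tool_types, allowed_type_by_domain)
```

The check flags `get_repo_stats` for the restricted domain and nothing for the unrestricted one. Note that it only catches the first half of the failure pattern: a model that obeys the rule and then answers from parametric knowledge passes this check and fails later, at the grounding stage.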

What the Evaluation Architecture Gets Right

VAKRA’s waterfall evaluation pipeline is worth examining on its own terms. Rather than a single judge call, it stages verification:

1. Policy adherence check (programmatic, for Capability 4)
2. Tool sequence comparison via execution and output matching
3. Final response grounding check

The tool sequence comparison is permissive in a useful way: it rewards alternative valid reasoning paths rather than requiring exact reproduction of the ground-truth call sequence. If a model finds a different sequence of tool calls that recovers the same information, that counts. This avoids penalizing genuine generalization, a common flaw in strict sequence-matching evals.
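The staged structure can be sketched as an ordered list of checks that stops at the first failure. Everything here is an illustrative assumption rather than VAKRA's implementation: the sample fields, the output-coverage test standing in for tool sequence comparison, and the substring test standing in for the grounding check.

```python
def waterfall(sample, checks):
    """Run checks in order and stop at the first one that fails."""
    for stage, check in checks:
        if not check(sample):
            return {"passed": False, "failed_stage": stage}
    return {"passed": True, "failed_stage": None}

checks = [
    # Stage 1: no tool call broke the policy.
    ("policy", lambda s: not s["violations"]),
    # Stage 2: the predicted calls recovered every ground-truth output,
    # regardless of which call sequence produced them.
    ("tool_outputs", lambda s: all(out in s["predicted_outputs"]
                                   for out in s["expected_outputs"])),
    # Stage 3: the final answer appears in what the tools returned.
    ("grounding", lambda s: any(s["answer"] in str(out)
                                for out in s["predicted_outputs"])),
]

good = {"violations": [], "predicted_outputs": ["FC Barcelona"],
        "expected_outputs": ["FC Barcelona"], "answer": "FC Barcelona"}
# Same tool outputs, but the final answer contradicts them.
ungrounded = dict(good, answer="Some Other Team")

good_result = waterfall(good, checks)
ungrounded_result = waterfall(ungrounded, checks)
```

The second sample passes the policy and tool-output stages and fails only at grounding, so the report separates "called the right tools" from "gave the right answer" instead of folding both into one verdict.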

For cases where programmatic comparison is inconclusive, a secondary LLM-based semantic equivalence check runs. The scoring for Capability 4 weights multi-source hybrid tasks at 2x relative to single-source tasks, which reflects their actual difficulty rather than flattening everything to a uniform score.

The leaderboard score formula is simple:

Leaderboard_Score = 1/4 × (Capability_1 + Capability_2 + Capability_3 + Capability_4)

Capability 4’s internal weighting:

Capability_4 = (# correct multi-source × 2 + # correct API/RAG-only) /
               (# total multi-source × 2 + # total API/RAG-only)
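The two formulas translate directly into code. In this short sketch the function and parameter names are my own, and the counts and per-capability scores in the example are made-up numbers, not benchmark results.

```python
def capability_4_score(correct_multi, total_multi, correct_single, total_single):
    """Weighted accuracy: multi-source tasks count twice, API/RAG-only tasks once."""
    numerator = 2 * correct_multi + correct_single
    denominator = 2 * total_multi + total_single
    return numerator / denominator if denominator else 0.0

def leaderboard_score(c1, c2, c3, c4):
    """Unweighted mean of the four capability scores."""
    return (c1 + c2 + c3 + c4) / 4

# Made-up counts: 25 of 100 multi-source and 150 of 200 single-source tasks correct.
c4 = capability_4_score(correct_multi=25, total_multi=100,
                        correct_single=150, total_single=200)
overall = leaderboard_score(0.75, 0.5, 0.25, c4)
```

With these counts the weighted score is (50 + 150) / (200 + 200) = 0.5, whereas unweighted accuracy over the same 300 tasks would be 175 / 300, roughly 0.58. The doubling pulls the score toward performance on the harder multi-source tasks.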

The Gap That Matters for Building Systems

I build Discord bots. The surface area of tool use in that context is narrow: a handful of Discord API calls, some database reads, maybe an HTTP request to an external service. Even at that scale, I have watched language models pick the right function and fill it with wrong values, or pick the wrong function because the description was ambiguous, or succeed at every individual call and then hallucinate a synthesis that contradicted what the tools returned.

VAKRA quantifies that experience at enterprise scale and gives it a taxonomy. The failure categories (tool selection errors, argument omission and hallucination, argument value errors, and final response grounding failures) map cleanly onto the bugs I have debugged in production.

What the benchmark makes concrete is that surface-level tool invocation competence and end-to-end agent reliability are different properties. A model that calls tools correctly in isolation can still fail compositionally when those calls need to chain, when the relevant tools exceed what fits in a single context window, when the answer requires integrating structured API responses with retrieved documents, or when a policy constrains which tools are allowed at all.

The enterprise software world is in the process of discovering this gap the hard way. VAKRA is a structured way to measure it before deployment rather than after. The dataset is publicly available on Hugging Face, the code is on GitHub, and the leaderboard is live. If you are evaluating models for any agentic workflow that involves more than a single tool call, running your candidates against this benchmark seems like a reasonable step before committing to an architecture.