Where Agent Pipelines Break: VAKRA's Approach to Stage-Wise Failure Attribution

Most agent benchmarks measure outcomes. VAKRA, a benchmark from IBM Research, is built to diagnose failures. That distinction shapes what you actually learn from the results.

When an agent fails a task on SWE-bench or GAIA, you know it failed. You do not know whether it selected the wrong tool, named an argument incorrectly, supplied a wrong value to a correctly named argument, or retrieved everything right and then synthesized a bad final answer. These failures have different causes and different fixes. Collapsing them into a single pass/fail score tells you almost nothing about where to direct engineering effort. VAKRA runs a waterfall-style failure attribution pipeline that classifies the first point of breakdown in the execution chain. The benchmark covers 5,187 test instances across four capability areas, backed by over 8,000 locally hosted APIs spanning 62 domains.

The Four Capabilities

The first three capabilities form a ladder of increasing compositional difficulty.

API Chaining (2,077 instances) requires agents to sequence 1 to 12 tool calls where each call’s output feeds into the next as a labeled reference. The benchmark uses two tool collections: SLOT-BIRD, which provides 7 generic data manipulation tools with many optional parameters, and SEL-BIRD, which provides specialized tools with categorical arguments and a larger selection per domain. A representative chaining task:

{
  "query": "Which football team has a build-up play speed of 31, dribbling of 53, and passing of 32?",
  "tool_calls": [
    {"name": "get_data", "arguments": {"tool_universe_id": "486ea46224d1"}, "label": "retrieved_data_1"},
    {"name": "select_data_equal_to", "arguments": {"data_label": "retrieved_data_1", "key_name": "play_speed", "value": 31}, "label": "FILTERED_DF_0"},
    {"name": "select_data_equal_to", "arguments": {"data_label": "FILTERED_DF_0", "key_name": "play_dribble", "value": 53}, "label": "FILTERED_DF_1"},
    {"name": "select_data_equal_to", "arguments": {"data_label": "FILTERED_DF_1", "key_name": "play_passing", "value": 32}, "label": "FILTERED_DF_2"},
    {"name": "get_team_name", "arguments": {"data_label": "FILTERED_DF_2", "n": 1}}
  ],
  "answer": "FC Barcelona"
}

Each intermediate label threads through subsequent calls, creating an explicit dependency graph across execution.
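The label threading in this example can be sketched as a small executor that stores each call's output under its label and resolves data_label references before the next call. This is a minimal sketch, not VAKRA's harness: the in-memory table, the three Python tool functions, and run_chain are hypothetical stand-ins for the hosted APIs, and the team names are made up.

```python
from typing import Any, Callable, Dict, List

# Toy table standing in for the hosted data source (all values are made up).
TABLE = [
    {"team": "Team A", "play_speed": 31, "play_dribble": 53, "play_passing": 32},
    {"team": "Team B", "play_speed": 31, "play_dribble": 40, "play_passing": 32},
    {"team": "Team C", "play_speed": 60, "play_dribble": 53, "play_passing": 32},
]

# Hypothetical local implementations of the tools named in the example.
def get_data(tool_universe_id: str) -> List[dict]:
    return list(TABLE)

def select_data_equal_to(data: List[dict], key_name: str, value: Any) -> List[dict]:
    return [row for row in data if row[key_name] == value]

def get_team_name(data: List[dict], n: int) -> List[str]:
    return [row["team"] for row in data[:n]]

TOOLS: Dict[str, Callable[..., Any]] = {
    "get_data": get_data,
    "select_data_equal_to": select_data_equal_to,
    "get_team_name": get_team_name,
}

def run_chain(tool_calls: List[dict], tools: Dict[str, Callable[..., Any]]) -> Any:
    """Execute calls in order, resolving each data_label against earlier outputs."""
    labels: Dict[str, Any] = {}
    result = None
    for call in tool_calls:
        args = dict(call["arguments"])
        if "data_label" in args:
            ref = args.pop("data_label")
            if ref not in labels:
                raise KeyError(f"label {ref!r} used before it was produced")
            args["data"] = labels[ref]
        result = tools[call["name"]](**args)
        if "label" in call:
            labels[call["label"]] = result
    return result

EXAMPLE_CALLS = [
    {"name": "get_data", "arguments": {"tool_universe_id": "486ea46224d1"}, "label": "retrieved_data_1"},
    {"name": "select_data_equal_to", "arguments": {"data_label": "retrieved_data_1", "key_name": "play_speed", "value": 31}, "label": "FILTERED_DF_0"},
    {"name": "select_data_equal_to", "arguments": {"data_label": "FILTERED_DF_0", "key_name": "play_dribble", "value": 53}, "label": "FILTERED_DF_1"},
    {"name": "select_data_equal_to", "arguments": {"data_label": "FILTERED_DF_1", "key_name": "play_passing", "value": 32}, "label": "FILTERED_DF_2"},
    {"name": "get_team_name", "arguments": {"data_label": "FILTERED_DF_2", "n": 1}},
]

print(run_chain(EXAMPLE_CALLS, TOOLS))  # prints ['Team A']
```

A call that references a label no earlier call produced raises KeyError here, which is one concrete way a broken dependency graph surfaces at execution time.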

Tool Selection (1,597 instances) uses REST-style dashboard APIs with 6 to 328 tools per domain, averaging 116. Because the OpenAI function-calling API limits tool specifications to 128 per request, VAKRA incorporates a shortlisting mechanism as the baseline. This capability isolates whether models can identify the correct API in a large tool set without necessarily chaining many calls.
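To make the shortlisting idea concrete, here is a minimal sketch that ranks tools by word overlap with the query and keeps at most 128. The scoring rule, the shortlist function, and the tool dictionaries are assumptions for illustration; the text does not describe how VAKRA's own shortlisting mechanism ranks tools.

```python
import re
from typing import Dict, List, Set

MAX_TOOLS = 128  # per-request cap on tool specifications cited above

def tokenize(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens; underscores act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def shortlist(query: str, tools: List[Dict[str, str]], k: int = MAX_TOOLS) -> List[Dict[str, str]]:
    """Keep the k tools whose name and description share the most words with the query."""
    query_tokens = tokenize(query)

    def overlap(tool: Dict[str, str]) -> int:
        return len(query_tokens & tokenize(tool["name"] + " " + tool["description"]))

    # sorted() is stable, so tools with equal overlap keep their original order.
    return sorted(tools, key=overlap, reverse=True)[:k]

# Hypothetical domain with 328 tools, the largest size mentioned above.
tools = [{"name": f"tool_{i}", "description": "generic endpoint"} for i in range(327)]
tools.insert(300, {"name": "get_delivery_delay", "description": "Average delivery delay per region"})

candidates = shortlist("average delivery delay by region", tools)
print(len(candidates), candidates[0]["name"])
```

A lexical ranker like this is only a baseline. If the correct tool is cut at this step, every later stage fails regardless of how well the model reasons, which is why shortlisting quality belongs inside the tool selection measurement.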

Multi-Hop Reasoning (869 instances) requires 1 to 5 logical hops over REST-BIRD, the same REST-style dashboard collection used for Tool Selection. Performance degrades predictably with hop depth: 1-hop tasks score highest, and each additional reasoning step compounds error rates across all tested models.
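A back-of-the-envelope model shows why errors compound with depth. It assumes each hop succeeds independently with the same probability, which is a simplification and not something the benchmark reports; the 0.9 per-hop figure is purely illustrative and is not a VAKRA result.

```python
def chain_success(per_hop_accuracy: float, hops: int) -> float:
    """Probability that every hop succeeds, assuming independent, identical hops."""
    if not 0.0 <= per_hop_accuracy <= 1.0:
        raise ValueError("per_hop_accuracy must be between 0 and 1")
    if hops < 1:
        raise ValueError("hops must be at least 1")
    return per_hop_accuracy ** hops

# Illustrative only: 0.9 is an assumed per-hop accuracy, not a measured one.
for hops in range(1, 6):
    print(hops, round(chain_success(0.9, hops), 3))
```

Under these assumptions a model that gets each hop right 90% of the time completes a 5-hop task only about 59% of the time, so a modest per-step error rate is enough to produce a steep decline by hop depth.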

Multi-Source Reasoning and Policy Adherence (644 instances) combines API calls with document retrieval, multi-turn dialog context, and plain-text access constraints. An example policy constraint from the benchmark:

If a user's query pertains to Technology & Software, which focuses on codebases,
software platforms, applications, and user interactions in tech, make sure you try
answering them by only using document retrievers. Do not use other types of tools.

This capability is qualitatively different from the first three. The agent must reason about constraints on its own tool access, not just execute sequences correctly.

The Waterfall Evaluation Pipeline

VAKRA attributes failures to one of four stages, in order:

Tool selection: were the right tools chosen?
Argument specification: were argument names correct, without hallucination or omission?
Argument values: were the correct values supplied?
Final response: was the retrieved information synthesized accurately?
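The ordering of these four stages can be sketched as a function that returns the first stage at which a prediction diverges from a reference. This is a simplified illustration: it compares against a single reference call sequence and uses exact string matching for the answer, whereas VAKRA compares executed tool responses and applies a semantic check. The Stage enum and first_failure are hypothetical names.

```python
from enum import Enum
from typing import List, Optional

class Stage(Enum):
    TOOL_SELECTION = "tool_selection"
    ARGUMENT_SPECIFICATION = "argument_specification"
    ARGUMENT_VALUES = "argument_values"
    FINAL_RESPONSE = "final_response"

def first_failure(pred_calls: List[dict], gold_calls: List[dict],
                  pred_answer: str, gold_answer: str) -> Optional[Stage]:
    """Return the earliest failing stage, or None if every stage passes."""
    # Stage 1: were the right tools chosen, in the right order?
    if [c["name"] for c in pred_calls] != [c["name"] for c in gold_calls]:
        return Stage.TOOL_SELECTION
    # Stage 2: argument names, catching hallucinated or omitted ones.
    for pred, gold in zip(pred_calls, gold_calls):
        if set(pred["arguments"]) != set(gold["arguments"]):
            return Stage.ARGUMENT_SPECIFICATION
    # Stage 3: argument values, checked only once the names agree.
    for pred, gold in zip(pred_calls, gold_calls):
        if pred["arguments"] != gold["arguments"]:
            return Stage.ARGUMENT_VALUES
    # Stage 4: final answer (exact match here, as a stand-in for a semantic check).
    if pred_answer.strip().lower() != gold_answer.strip().lower():
        return Stage.FINAL_RESPONSE
    return None

gold = [{"name": "select_data_equal_to", "arguments": {"key_name": "play_speed", "value": 31}}]
wrong_name = [{"name": "select_data_equal_to", "arguments": {"column": "play_speed", "value": 31}}]
wrong_value = [{"name": "select_data_equal_to", "arguments": {"key_name": "play_speed", "value": 13}}]

print(first_failure(wrong_name, gold, "FC Barcelona", "FC Barcelona"))
print(first_failure(wrong_value, gold, "FC Barcelona", "FC Barcelona"))
```

Because the function returns at the first divergence, each failed instance gets exactly one label, and a later-stage problem is never counted when an earlier stage already broke.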

To evaluate tool-sequence correctness, the benchmark runs predicted tool calls in the same execution environment as the ground truth and compares the resulting sets of tool responses. This matters in a way schema-level comparison does not: a model can emit syntactically valid JSON for a tool call that still produces incorrect results. A two-stage verification follows: a programmatic check for information recovery, then an LLM-based semantic equivalence check adapted from the CRAG evaluation framework to handle cases where structurally different responses carry equivalent meaning.

Running the actual execution traces rather than comparing predicted call schemas to ground truth schemas is the key methodological move. It tolerates alternative valid tool sequences while still catching the specific failure of retrieving incorrect or incomplete information.
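One plausible reading of the programmatic information-recovery check is a containment test over executed responses: every response the ground-truth trace produced must also appear somewhere in the predicted trace. The sketch below assumes JSON-serializable responses; canonical and recovers_gold are hypothetical names, and the exact comparison rule VAKRA applies is not specified here.

```python
import json
from typing import Any, List

def canonical(response: Any) -> str:
    """Serialize a tool response so that dict key order does not affect comparison."""
    return json.dumps(response, sort_keys=True)

def recovers_gold(pred_responses: List[Any], gold_responses: List[Any]) -> bool:
    """True if every ground-truth response also appears among the predicted ones."""
    predicted = {canonical(r) for r in pred_responses}
    return all(canonical(g) in predicted for g in gold_responses)

gold_responses = [[{"team": "FC Barcelona", "play_speed": 31}]]

# Same information, different key order, plus an extra exploratory call: passes.
pred_ok = [{"tables": ["Team"]}, [{"play_speed": 31, "team": "FC Barcelona"}]]
# A well-formed call that returned the wrong rows: fails.
pred_bad = [[{"team": "Real Madrid", "play_speed": 31}]]

print(recovers_gold(pred_ok, gold_responses), recovers_gold(pred_bad, gold_responses))
```

This version tolerates extra calls and reordering but not traces whose intermediate results legitimately differ from the reference, which is one reason a second, semantic check is useful on top of a programmatic one.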

Where Each Capability Breaks

API Chaining failures split by tool collection type. SLOT-BIRD’s many optional parameters drive argument naming errors; models generate plausible-looking parameter names that do not match the actual schema. SEL-BIRD’s larger tool set shifts failures upstream to tool selection. The failure type is not fixed across a capability; it depends on what kind of complexity the task structure imposes.

Tool Selection exposes something the VAKRA analysis explicitly highlights: models that select the right tool and call it with correct arguments can still fail to synthesize a correct final answer from the tool’s response. Tool execution competence and answer synthesis competence are separate skills, and they do not correlate reliably. Error rates are high at both the selection stage and the value specification stage, while required-parameter omission and hallucination rates are comparatively low. The practical implication is that optimizing purely for function-calling accuracy leaves a meaningful gap in end-to-end reliability.

Multi-Hop Reasoning degradation with hop depth is consistent with known limits of transformer attention over long contexts. Each tool call result is appended to the context, so as the execution trace grows, results from earlier hops compete with more recent additions and become harder for the model to use. GPT-OSS-120B performed best on API chaining tasks; Gemini-3-flash-preview led on dashboard API tasks, likely because of stronger performance across the large tool selection surface in Capability 2.

Multi-Source and Policy Adherence produces the most informative failure analysis. The benchmark distinguishes two policy situations: policies consistent with the information source the agent would naturally use, labeled “No Updates to Answer,” and policies that restrict access to a source the agent needs, labeled “Policy Updates Answer.” The first situation has minimal performance impact. The second causes a clear drop across GPT-OSS-120B, Gemini-3-flash-preview, and Claude-Sonnet-4-5. Granite-4.0-h-Small-32B showed notably less degradation under restrictive policies.

Failures under restrictive policies divide between two patterns: the agent violates the constraint and uses the restricted tool, or the agent honors the constraint but fails to find sufficient information through permitted sources. Both patterns produce wrong answers, but they represent different underlying problems. One is a compliance failure; the other is a compensatory reasoning failure.
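The two patterns can be separated mechanically once a trace records which tools were called. The sketch below is an illustration only: classify_policy_outcome, the tool names, and the outcome labels are hypothetical, and it treats any use of a restricted tool as a compliance failure regardless of the answer.

```python
from typing import Iterable, Set

def classify_policy_outcome(called_tools: Iterable[str],
                            restricted_tools: Set[str],
                            answer_correct: bool) -> str:
    """Label one instance run under a restrictive access policy."""
    # Pattern 1: the agent used a tool the policy forbids.
    if any(tool in restricted_tools for tool in called_tools):
        return "compliance_failure"
    # Pattern 2: the agent obeyed the policy but could not recover the answer.
    if not answer_correct:
        return "compensatory_reasoning_failure"
    return "success"

# Hypothetical policy: only document retrievers are allowed for this query.
restricted = {"get_data", "select_data_equal_to"}

print(classify_policy_outcome(["retrieve_documents", "get_data"], restricted, False))
print(classify_policy_outcome(["retrieve_documents"], restricted, False))
print(classify_policy_outcome(["retrieve_documents"], restricted, True))
```

Tracking the two labels separately matters because they point to different fixes: the first calls for better instruction-following on action constraints, the second for better use of the sources that remain available.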

What VAKRA Adds to the Benchmarking Landscape

The Berkeley Function Calling Leaderboard tests whether models generate syntactically correct function calls with the right schema. GAIA stratifies tasks by the number of reasoning steps and tools required. ToolBench evaluates API discovery and chaining on real-world REST endpoints. Each of these measures something real, but none attributes failures to specific pipeline stages the way VAKRA does.

The closest analogue is ablation-style evaluation: fix everything upstream and measure what breaks downstream. VAKRA does not hold stages artificially correct; it runs the full execution trace and identifies the first real failure point. This is more computationally expensive but produces more useful signal for understanding where a model’s agent architecture is weak, and more importantly, which weakness is the binding constraint on overall performance.

For practitioners building enterprise agents, the Capability 4 results carry the most immediate relevance. Access policies in production deployments are frequently expressed as natural language in system prompts. If models do not treat those constraints as first-class reasoning inputs during tool routing, behavioral reliability is bounded by something that more tool-use training data will not directly address. Instruction-following on action constraints is a different problem from schema-correct function calling, and benchmarks that collapse the two will mask this gap.

The VAKRA dataset, leaderboard, and GitHub repository are publicly available. The per-capability breakdown on the leaderboard is more diagnostic than the aggregate score: a model with a strong Capability 1 score and a weak Capability 4 score has a different engineering profile than one that is uniformly mid-tier across all four, and those profiles warrant different responses in how you build the agent system around it.