The 14 Ways Enterprise AI Agents Fail, and What IBM and Berkeley Found When They Looked Closely
Source: huggingface
Back in February 2026 (though the work was already circulating on arXiv in early 2025), IBM Research and UC Berkeley published a combined benchmark and failure taxonomy that I think deserves more attention than it got. The IT-Bench benchmark combined with the MAST (Multi-Agent System Failure Taxonomy) analysis framework isn’t just another leaderboard. It’s an attempt to answer a question that’s been nagging anyone building production agents: when an agent fails, what exactly went wrong, and how do you fix it?
Most agent benchmarks give you a number. SWE-bench gives you a percentage of GitHub issues resolved. AgentBench gives you a score across eight environments. Those numbers are useful for comparing models, but they tell you nothing about the failure mechanism. Two agents can score identically while failing for completely different reasons, which means the engineering interventions required to improve them are completely different too.
IT-Bench and MAST try to close that gap.
What IT-Bench Actually Tests
The benchmark covers three domains of enterprise IT automation: Site Reliability Engineering (SRE), compliance and security operations (CISO), and financial operations (FinOps). These aren’t toy tasks. The SRE scenarios drop an agent into Kubernetes environments with real observability data: Prometheus metrics, OpenTelemetry logs and traces, raw Kubernetes events, and alert snapshots in JSON format. The agent’s job is to identify which pods, services, or deployments are responsible for an ongoing incident. The CISO scenarios require generating correct OPA/Rego or Kyverno policies for Kubernetes compliance requirements. The FinOps scenarios ask agents to identify which resources caused cost anomalies in cloud billing data.
Across 65 publicly available scenarios in ITBench-Lite, the benchmark evaluates correctness, safety, and execution speed together, which is closer to what an enterprise actually cares about than correctness alone.
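To make "evaluated together" concrete, here is a minimal sketch of a per-scenario result that gates a pass on all three axes at once. The field names and the timeout are illustrative assumptions, not ITBench's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    # Illustrative fields only -- not ITBench's real schema.
    scenario_id: str
    resolved: bool          # did the agent identify the faulty entities?
    safety_violations: int  # destructive or out-of-scope actions taken
    wall_clock_seconds: float

    def passes(self, max_seconds: float = 600.0) -> bool:
        """A scenario counts only if the agent was correct, safe, AND fast enough."""
        return (self.resolved
                and self.safety_violations == 0
                and self.wall_clock_seconds <= max_seconds)
```

The point of the conjunction is that a correct diagnosis delivered after the incident escalated, or reached by running destructive commands, shouldn't count as a win in an enterprise setting.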
The results from the initial paper (arXiv 2502.05352) were instructive. The best models at the time of release achieved a 13.8% resolution rate on SRE tasks, 25.2% on CISO tasks, and 0% on FinOps. Zero. Cloud cost anomaly attribution is apparently beyond the current capability of any available model when evaluated rigorously. CISO compliance policy generation is the “easiest” domain and still fails three-quarters of the time.
The February 2026 blog post focused on a deeper study: 310 SRE diagnostic traces from three models, analyzed through the MAST framework. Gemini-3-Flash achieved 75.5% mean recall (using recall rather than F1, since SRE triage prioritizes not missing faults over precision), Kimi-K2-Thinking achieved 28.6%, and GPT-OSS-120B achieved 12.4%. Those numbers span a wide range, but the more useful question is why.
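The recall metric used here is worth spelling out, since it encodes the triage priority the post describes. A minimal sketch (entity names are hypothetical):

```python
def fault_recall(predicted: list[str], actual: list[str]) -> float:
    """Recall over faulty entities: the fraction of true faults the agent
    flagged. SRE triage favors recall over precision because a missed
    fault is costlier than a spurious one -- a human reviewer can discard
    false positives, but never sees what the agent failed to surface."""
    truth = set(actual)
    if not truth:
        return 1.0  # nothing to find
    return len(set(predicted) & truth) / len(truth)

# Hypothetical trace: agent flags one of the two truly faulty entities.
score = fault_recall(["pod-a", "svc-x"], ["pod-a", "pod-b"])  # 0.5
```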
The MAST Failure Taxonomy
MAST was built by analyzing traces from five multi-agent frameworks across more than 150 tasks, with six human annotators reaching a Cohen’s Kappa of 0.88, which is strong agreement by any reasonable standard. The resulting taxonomy has 14 distinct failure modes organized into three categories, covering 1,600+ annotated traces in the MAST-Data dataset.
Category 1: Specification and System Design Failures. These are failures in how the agent was set up before it ever ran a task. They include disobeying the task specification, getting stuck in step repetition loops (FM-1.3), losing conversation history as context windows grow (FM-1.4), and being unaware of when to terminate (FM-1.5). These are architectural problems. They usually can’t be solved by prompting alone.
Category 2: Inter-Agent Misalignment. These failures involve communication and coordination breakdowns, including failing to ask for clarification when ambiguity blocks progress (FM-2.2), drifting off task entirely (FM-2.3), and the particularly frustrating reasoning-action mismatch (FM-2.6), where the agent correctly identifies what to do next and then does something else entirely.
Category 3: Task Verification and Termination. This covers agents that stop too early (FM-3.1, premature termination), and agents that hallucinate success, declaring a task complete based on faulty self-verification rather than tool-backed evidence (FM-3.3).
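For trace annotation, the modes called out above can be collected into a simple lookup table. This covers only the modes named in this post (labels paraphrased from the text); the full taxonomy has 14 modes:

```python
# Partial map of MAST failure modes -- only those discussed above.
FAILURE_MODES = {
    # Category 1: specification and system design
    "FM-1.3": "Step repetition",
    "FM-1.4": "Loss of conversation history",
    "FM-1.5": "Unaware of termination conditions",
    # Category 2: inter-agent misalignment
    "FM-2.2": "Failure to ask for clarification",
    "FM-2.3": "Task derailment",
    "FM-2.6": "Reasoning-action mismatch",
    # Category 3: task verification and termination
    "FM-3.1": "Premature termination",
    "FM-3.3": "Incorrect verification",
}
```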
The Per-Model Failure Signatures
This is where the analysis earns its value. Gemini-3-Flash, Kimi-K2-Thinking, and GPT-OSS-120B don’t just fail at different rates. They fail in structurally distinct ways.
Gemini fails like an agent that found the answer but declared victory too early. Its dominant failure mode is FM-3.3 (Incorrect Verification), appearing 52% more frequently in failed traces than successful ones. Gemini identifies the right signals in the observability data but terminates before cross-referencing them against actual system state. It shows zero instances of FM-1.4 (memory loss), meaning it maintains context cleanly throughout long traces. The fix for Gemini is an external verification gate: require tool-based confirmation (AlertManager clearance or a clean Kubernetes health check) before allowing the agent to exit.
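Such a gate is straightforward to sketch. The two check functions below stand in for real tool calls, e.g. an AlertManager query and a Kubernetes readiness probe; the names are hypothetical:

```python
from typing import Callable

def exit_gate(agent_claims_resolved: bool,
              alerts_cleared: Callable[[], bool],
              workload_healthy: Callable[[], bool]) -> bool:
    """External verification gate: the agent's own 'done' claim is never
    sufficient. Exit requires tool-backed evidence from live system state,
    which is exactly what FM-3.3 failures skip."""
    if not agent_claims_resolved:
        return False
    return alerts_cleared() and workload_healthy()
```

The key design point is that the gate sits outside the model loop: the agent cannot talk its way past it, because the checks are deterministic tool calls rather than self-assessment.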
Kimi-K2 fails at the exit condition entirely. FM-3.1 (Premature Termination) spikes 46% in failed traces, FM-1.5 (Unaware of Termination Conditions) spikes 43%, and FM-2.6 (Reasoning-Action Mismatch) appears in 92% of its failures. Kimi often correctly identifies the next diagnostic step and then executes something unrelated instead, sometimes drifting into debugging its own investigation scripts rather than the original incident. The recommended fix is a deterministic finite state machine for termination control, not more prompting.
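A deterministic termination FSM in this spirit can be very small. The phases and event names below are illustrative assumptions, not from the paper:

```python
from enum import Enum, auto

class Phase(Enum):
    TRIAGE = auto()
    VERIFY = auto()
    DONE = auto()
    ABORT = auto()

# Legal transitions only. Anything else is a no-op, so the model cannot
# jump from TRIAGE straight to DONE (the FM-3.1 / FM-1.5 pattern).
TRANSITIONS = {
    (Phase.TRIAGE, "hypothesis_ready"):  Phase.VERIFY,
    (Phase.VERIFY, "evidence_confirms"): Phase.DONE,
    (Phase.VERIFY, "evidence_refutes"):  Phase.TRIAGE,
    (Phase.TRIAGE, "budget_exhausted"):  Phase.ABORT,
}

def advance(phase: Phase, event: str) -> Phase:
    """Ignore illegal events instead of trusting the model's exit claims."""
    return TRANSITIONS.get((phase, event), phase)
```

The harness, not the model, owns the state variable; the model's output is reduced to an event label, which is what makes the termination condition enforceable rather than advisory.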
GPT-OSS-120B shows cascading collapse. It averages 5.3 distinct failure modes per failed trace, compared to 2.6 for Gemini. FM-1.4 (Loss of Conversation History) appears in 24% of its traces, versus 0% for Gemini and 7% for Kimi. As the SRE trace grows longer, it forgets which alerts it was originally triaging. FM-2.6 appears in 94% of its traces, nearly three times the rate seen in Gemini. These aren’t isolated problems; they compound each other. The agent loses context, then reasons inconsistently, then executes the wrong action, then can’t verify because it no longer remembers the original goal. Aggressive context hygiene and early-exit detection are the interventions that matter here.
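"Context hygiene" here can mean something as simple as pinning the original goal so it is never evicted. A minimal sketch (a real implementation would summarize the elided steps with a model rather than drop them):

```python
def compact_history(messages: list[dict], keep_recent: int = 8) -> list[dict]:
    """Context hygiene sketch: pin the original incident description so the
    goal survives however long the trace grows, collapse the middle of the
    trace to a stub, and keep only the most recent steps verbatim."""
    if len(messages) <= keep_recent + 1:
        return messages
    pinned = messages[0]  # the original alert / task statement
    elided = len(messages) - 1 - keep_recent
    stub = {"role": "system", "content": f"[{elided} earlier steps elided]"}
    return [pinned, stub] + messages[-keep_recent:]
```

This directly targets the FM-1.4 pattern described above: however many diagnostic turns accumulate, the alerts the agent was originally triaging remain in position zero.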
The MAST paper (arXiv 2503.13657), which was presented at NeurIPS 2025, quantified the improvement potential: prompt engineering alone for memory-related failures yields roughly 15.6% improvement, while structural interventions, like a Summarizer Agent to maintain state combined with a State Machine to enforce termination conditions, yield up to 53% improvement. The gap between those two numbers should inform how you spend your engineering time when you’re trying to improve an enterprise agent.
What This Changes About Building Agents
The conventional approach to improving an underperforming agent is to adjust the prompt, maybe add chain-of-thought, maybe switch models, and run the benchmark again. If the score goes up, you ship it. IT-Bench and MAST argue for a different diagnostic loop: collect traces, classify failures by mode, identify which modes are fatal versus recoverable, and then apply targeted structural fixes.
Some failure modes turn out to be non-fatal. FM-1.3 (Step Repetition) appears in over 90% of successful Kimi-K2 runs. Looping through diagnostic steps repeatedly is apparently just what good SRE agents do; it’s not a signal of breakdown. FM-3.3 (Incorrect Verification) appears in failed traces, but also in successful ones, meaning agents can recover mid-trace from poor self-assessment. Treating every failure mode as equally urgent would send you chasing the wrong problems.
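One way to operationalize the fatal-versus-recoverable distinction is to compare each mode's frequency in failed versus successful traces, which is essentially what the per-model numbers above report. A minimal sketch, assuming each trace has already been annotated with its set of mode labels:

```python
from collections import Counter

def mode_deltas(failed: list[set], succeeded: list[set]) -> dict:
    """For each failure mode, the difference between its frequency in
    failed traces and in successful ones. Large positive deltas (like
    FM-3.3 for Gemini) mark modes worth engineering around; modes common
    in both populations (like FM-1.3 step repetition) are likely benign."""
    def freq(traces):
        counts = Counter(m for trace in traces for m in trace)
        return {m: counts[m] / len(traces) for m in counts}
    f, s = freq(failed), freq(succeeded)
    return {m: round(f[m] - s.get(m, 0.0), 3) for m in f}
```

Running this over your own annotated traces gives a ranked list of which structural fixes to attempt first, instead of treating all 14 modes as equally urgent.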
The failure modes that strongly predict task failure are FM-1.5, FM-3.1, FM-1.4, and FM-2.3. These are the ones worth engineering around first.
For anyone building production agents that need to operate reliably in enterprise IT environments, the ITBench-Trajectories dataset gives you a starting point for trace-level analysis without having to generate your own corpus. The MAST taxonomy gives you the annotation schema. Applying both to your own agent’s failure traces is the practical takeaway from this research.
The FinOps domain remaining at 0% is worth sitting with. Cloud cost attribution requires reasoning across billing data, workload configuration, and time-series anomalies simultaneously, and no current model can do that reliably in an autonomous loop. That’s a hard constraint on what enterprise FinOps automation can actually promise today, regardless of what any vendor’s marketing says.
The benchmark and taxonomy are both open; the IT-Bench code is on GitHub and the 65-scenario public dataset is on Hugging Face. If you’re building agents for enterprise IT workflows, it’s worth running your system against it before you trust it with a production Kubernetes cluster.