What It Actually Takes to Benchmark AI Agents for the Factory Floor
Source: huggingface
When benchmark results look good on paper but agents fail in production, the problem is usually the benchmark. IBM Research’s AssetOpsBench, which landed on Hugging Face in January 2026, takes a pointed stance on this: most existing AI agent benchmarks were not built with industrial realities in mind, and the scores they produce do not translate to reliable deployment.
Coming back to this a couple of months later, the core argument holds up well.
The Problem With Existing Agent Benchmarks
Most agent benchmarks in wide use today, whether SWE-bench for code repair, GAIA for general assistant tasks, or WebArena for browser navigation, share a common structure: present the agent with a task, observe whether it succeeds, report a pass rate. This works when the task boundaries are clean and the environment is deterministic.
Industrial asset operations do not have clean task boundaries. A maintenance engineer working with sensor telemetry is navigating data that is noisy, incomplete, and temporally dependent. Logs from different systems may conflict. A sensor that went offline last Tuesday leaves a gap that has to be reasoned around rather than filled in. Work orders reference failure modes that require domain knowledge to interpret. Getting this kind of work right means handling uncertainty gracefully, not just pattern-matching against a training distribution.
AssetOpsBench was built to evaluate agents in exactly this messier territory: anomaly detection across sensor streams, failure mode reasoning and diagnostics, KPI forecasting, and work order summarization and prioritization. The dataset behind it is substantial, with 2.3 million sensor telemetry points, 4,200 work orders, 53 structured failure modes, and over 150 expert-curated evaluation scenarios.
A Six-Dimensional Scoring Framework
The most technically interesting design decision in AssetOpsBench is the move away from binary task completion toward a six-dimensional scoring framework. Each agent run is scored across task completion, retrieval accuracy, result verification, sequence correctness, clarity and justification, and hallucination rate.
This is worth unpacking. Sequence correctness captures whether the agent took steps in the right order, which matters enormously in multi-step workflows where premature action can corrupt downstream state. Result verification checks whether the agent validated its own output before proceeding, a proxy for self-correction behavior. Hallucination rate is measured directly rather than inferred from task success, which means an agent can complete a task while still fabricating intermediate reasoning that would not survive scrutiny in a real deployment.
The effect is that two agents with similar task completion scores can look very different once you apply all six dimensions. This mirrors how industrial software systems are actually evaluated: correctness of the final output is necessary but not sufficient.
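To make the contrast concrete, here is a minimal sketch of what multi-dimensional scoring buys you. The six dimension names come from the article; the 0-100 scale, the equal weighting, and the aggregation formula are my assumptions for illustration, not the benchmark's published scheme.

```python
from dataclasses import dataclass

@dataclass
class RunScore:
    """One agent run scored on AssetOpsBench's six dimensions.
    Scale 0-100 and equal weighting are illustrative assumptions."""
    task_completion: float
    retrieval_accuracy: float
    result_verification: float
    sequence_correctness: float
    clarity_justification: float
    hallucination_rate: float  # lower is better

    def aggregate(self) -> float:
        # Invert hallucination rate so that higher aggregate = better.
        positives = [
            self.task_completion,
            self.retrieval_accuracy,
            self.result_verification,
            self.sequence_correctness,
            self.clarity_justification,
        ]
        return (sum(positives) + (100 - self.hallucination_rate)) / 6

# Two hypothetical agents with identical task completion:
a = RunScore(80, 90, 40, 55, 70, 25)  # completes tasks, verifies little
b = RunScore(80, 60, 85, 90, 75, 5)   # verifies, sequences, rarely fabricates
```

Under binary pass/fail these two agents are indistinguishable; the aggregate, and more importantly the per-dimension profile, separates them immediately.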
TrajFM: Making Failure Analysis Systematic
The benchmark’s failure analysis pipeline, called TrajFM, is the contribution that carries the most practical weight. It runs in three stages: LLM-guided extraction of failure events from execution trajectories, embedding-based clustering to surface recurring patterns, and structured visualization for developer feedback.
The clustering step is particularly significant because it allows the taxonomy to grow. Rather than mapping every failure to a predefined category, TrajFM can discover new failure patterns from the data and add them to the taxonomy over time. This is closer to how postmortems work in real engineering organizations: you start with a known set of failure categories, but production surprises you regularly.
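The open-taxonomy idea can be sketched with a toy version of the clustering stage. Everything here is simplified: TrajFM uses LLM-extracted failure events and learned embeddings, while this sketch uses bag-of-words vectors and greedy threshold clustering purely to show how a new failure pattern founds a new cluster instead of being forced into an existing category.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline would use a learned model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(events, threshold=0.5):
    """Greedy clustering: an event joins the first cluster whose centroid
    is similar enough; otherwise it starts a new cluster. This is how an
    open taxonomy grows as unseen failure patterns appear in the data."""
    clusters = []  # list of (centroid, members)
    for e in events:
        v = embed(e)
        for centroid, members in clusters:
            if cosine(v, centroid) >= threshold:
                members.append(e)
                centroid.update(v)
                break
        else:
            clusters.append((Counter(v), [e]))
    return [members for _, members in clusters]

events = [
    "tool call timed out on telemetry query",
    "tool call timed out on telemetry query retry",
    "agent reported success but work order missing",
]
groups = cluster(events)
```

The two timeout events land in one cluster; the overstated-completion event, sharing no vocabulary with them, starts a cluster of its own.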
From the community evaluation run so far, involving 225 users and over 300 agent submissions, the benchmark has identified five dominant failure patterns. Ineffective error recovery accounts for 31.2% of observed failures. Overstated completion, where the agent reports success despite having failed, accounts for 23.8%. Formatting issues come in at 21.4%, unhandled tool errors at 10.3%, and ignored feedback at 8.0%.
The overstated completion figure is the one that should concern anyone building agents for production use. An agent that silently fails and reports success is considerably more dangerous than one that fails loudly. In a maintenance workflow, a false positive on task completion could mean a critical inspection is skipped. TrajFM surfacing this as a named, quantified failure mode rather than burying it inside a generic accuracy number is a meaningful step forward.
Multi-Agent Coordination: Where Performance Degrades
The benchmark tracks two evaluation tracks separately: planning-oriented agents that coordinate multiple sub-agents, and execution-oriented agents that operate as dynamic single-unit workflows. The performance gap between these tracks reveals a structural problem in current LLM capabilities.
Single-agent accuracy across tested models sits at 68%. Multi-agent accuracy drops to 47%, a 21-point degradation. This is not a marginal gap; it reflects a genuine failure mode in how current models handle coordination. When information has to be passed between agents, context management becomes a source of error. Agents disagree about shared state. The orchestrator may issue instructions that sub-agents interpret inconsistently.
Among the specific models evaluated, GPT-4.1 achieved the highest planning score at 68.2 and the highest execution score at 72.4, but its primary failure mode was hallucinating task completion on complex workflows, exactly the overstated completion pattern. Mistral-Large scored 64.7 on planning and 69.1 on execution but struggled with multi-hop tool sequences. LLaMA-4 Maverick scored 66.0 and 70.8 but frequently missed clarifying questions before acting on ambiguous inputs.
None of these models hit the benchmark’s 85-point deployment readiness threshold, defined as the score level at which IBM Research considers an agent viable for industrial deployment. That no current model reaches this threshold is the benchmark’s central finding.
Tool Use as the Key Differentiator
One result that stands out from the evaluation data: tool accuracy is the single biggest differentiator between high- and low-performing agents. Top-performing agents achieve 94% tool accuracy; low-performing agents land at 61%. The 33-point spread here is larger than the gap between any two models on overall task completion.
This makes sense given the domain. Industrial asset operations require agents to query telemetry databases, retrieve historical work orders, access failure mode knowledge bases, and execute diagnostic tools in the correct order. An agent that misuses or fails to invoke these tools correctly will produce plausible-sounding outputs that are grounded in nothing. Hallucination rate and tool accuracy are closely coupled: most hallucinated outputs trace back to tool invocation failures rather than factual errors in the model’s training data.
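A metric like tool accuracy is straightforward to instrument on your own traces. This sketch assumes a trace format of step dicts with `kind`, `tool`, `expected_tool`, and `succeeded` fields; the benchmark's exact definition of tool accuracy is not spelled out in the article, so this counts a call as correct only if it hits the expected tool and succeeds.

```python
def tool_accuracy(trace) -> float:
    """Fraction of tool invocations that hit the expected tool and succeeded.
    Record fields here are assumptions about a plausible trace schema."""
    calls = [s for s in trace if s["kind"] == "tool_call"]
    if not calls:
        return 0.0
    ok = sum(1 for s in calls if s["tool"] == s["expected_tool"] and s["succeeded"])
    return ok / len(calls)

trace = [
    {"kind": "tool_call", "tool": "query_telemetry",
     "expected_tool": "query_telemetry", "succeeded": True},
    {"kind": "thought", "text": "sensor 12 has a gap since Tuesday"},
    # Agent summarized instead of fetching work orders first:
    {"kind": "tool_call", "tool": "summarize",
     "expected_tool": "fetch_work_orders", "succeeded": True},
]
```

Note that the second call "succeeded" in the narrow sense but invoked the wrong tool, which is exactly the kind of error a plain success counter would miss.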
The benchmark also found that access to structured failure mode databases improved performance, but that retrieval-augmented generation was not used optimally even when the knowledge was available. Models retrieved relevant context but did not always integrate it correctly into their reasoning chains, suggesting that retrieval quality and reasoning integration are partially decoupled problems worth addressing separately.
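One crude way to probe that decoupling on your own traces is to check whether retrieved snippets actually surface in the agent's reasoning chain. The heuristic below (lexical overlap on longer terms, a 0.5 usage threshold) is entirely my invention as an illustration; a serious measure would need semantic matching.

```python
def integration_rate(retrieved_snippets, reasoning_text) -> float:
    """Share of retrieved snippets whose key terms appear in the reasoning.
    A rough proxy for retrieval-to-reasoning integration, not a real metric."""
    text = reasoning_text.lower()
    used = 0
    for snippet in retrieved_snippets:
        terms = [w for w in snippet.lower().split() if len(w) > 4]
        if terms and sum(t in text for t in terms) / len(terms) >= 0.5:
            used += 1
    return used / len(retrieved_snippets) if retrieved_snippets else 0.0

snippets = [
    "bearing failure causes elevated vibration amplitude",
    "seal leakage correlates with pressure drop",
]
reasoning = "elevated vibration amplitude points to a bearing failure"
```

Here only the first snippet is reflected in the reasoning, so half the retrieved knowledge went unused despite being available: retrieval succeeded, integration did not.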
What This Means in Practice
AssetOpsBench is useful as a benchmark, but it also functions as a design document for anyone building agents that operate in industrial or enterprise settings. The six scoring dimensions read as a checklist: does your agent verify results before proceeding, maintain correct operation order in multi-step workflows, and fail loudly rather than claiming false success?
The multi-agent coordination findings suggest that current orchestration approaches carry real costs. The 21-point accuracy drop under multi-agent scenarios implies that the overhead of coordination, context passing, and state management is not yet handled reliably by any off-the-shelf framework. Systems designed for industrial deployment probably need explicit verification steps at agent handoff points rather than relying on the orchestrator to maintain coherent shared state.
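An explicit handoff check can be sketched in a few lines. This is a hypothetical pattern inspired by the finding above, not a construct from the benchmark: the receiving agent validates the shared state it depends on before acting, rather than trusting the orchestrator to have kept it coherent.

```python
def handoff(state: dict, required_keys, validate=None) -> dict:
    """Verification gate at an agent handoff point: reject the handoff
    loudly if required fields are missing or a domain check fails,
    instead of letting a sub-agent act on incoherent shared state."""
    missing = [k for k in required_keys if k not in state]
    if missing:
        raise ValueError(f"handoff rejected, missing fields: {missing}")
    if validate is not None and not validate(state):
        raise ValueError("handoff rejected, domain validation failed")
    return state

# Hypothetical shared state passed from a planner to a diagnostic agent.
state = {"asset_id": "pump-7", "anomaly_window": ("2026-01-10", "2026-01-12")}
```

The cost is a little boilerplate per handoff; the payoff is that coordination failures surface at the boundary where they occur instead of as a wrong answer three agents later.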
The TrajFM methodology is also worth borrowing independently of the benchmark itself. Treating failure trajectory analysis as a first-class evaluation signal, clustering failure patterns from actual execution traces, and evolving the taxonomy as new patterns emerge is a more honest accounting of agent behavior than a single aggregate score. The same approach could be applied to any domain where agents operate in complex, multi-step workflows.
The AssetOpsBench playground is live on Hugging Face Spaces, and the competition on CodaBench is open for submissions. The GitHub repository includes the full dataset and evaluation code.
For anyone building agents for real operational environments, the benchmark’s primary message is direct: task completion rates are a floor, not a ceiling, and the interesting work starts when you instrument what actually goes wrong.