How EVA Exposes the Measurement Gap in Voice Agent Evaluation

Voice agents are no longer experimental. Customer service pipelines, enterprise tooling, and consumer apps increasingly rely on spoken AI interfaces that must understand language, orchestrate tools, and respond audibly in real time. The engineering has moved faster than the evaluation. Most benchmarks still test pieces of the pipeline in isolation, which means the numbers developers see before shipping bear little resemblance to what users experience during a call.

EVA (Evaluating Voice Agents), published by ServiceNow AI in early 2026, is a direct response to that problem. It is an end-to-end evaluation framework that runs complete multi-turn spoken conversations against a voice agent, scores them across two distinct axes, and surfaces failure modes that component-level testing cannot see.

Why Component Testing Falls Short

The prior evaluation landscape is fragmented by design. Frameworks like AudioBench, VoiceBench, and VoxEval measure speech understanding quality in single-turn interactions. EmergentTTS-Eval and SHEET focus on synthesis output and listener experience. FD-Bench and Full-Duplex-Bench measure turn-taking dynamics. VoiceAgentBench and CAVA add tool-calling into the picture but stop short of evaluating the full conversational workflow from opening greeting to task resolution.

Each of these frameworks is internally valid. The problem is composition. A cascade architecture (speech-to-text feeding a language model feeding text-to-speech) can score well on both ASR quality and TTS naturalness while still failing users systematically, because the failure modes that matter in production are often emergent. A single misrecognized character in a booking confirmation number leads to an authentication failure, the agent cannot proceed, and the conversation deteriorates across three more turns before terminating without resolution. No component-level metric captures that chain.

Audio-native models (speech-to-speech and large audio language models) create additional evaluation challenges. They do not have a clean STT/TTS boundary, so you cannot even slot them into most existing component pipelines for measurement.

The Two-Axis Model

EVA’s core contribution is measuring task accuracy and user experience as separate but related scores, then reporting both simultaneously.

EVA-A (Accuracy) breaks into three sub-dimensions. Task Completion is deterministic: after a conversation ends, the framework compares the final state of the scenario database against a ground truth state. Either the flight was changed to the correct date with the correct seat and within budget, or it was not. Faithfulness is evaluated by an LLM judge reviewing transcripts for hallucinations, policy violations, and misrepresentations against the agent’s declared instructions and tool outputs. Speech Fidelity is the most novel sub-dimension: a large audio language model judges whether the agent’s spoken audio accurately reproduced the intended text, with particular attention to named entities like confirmation codes, dollar amounts, and flight numbers.
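The deterministic Task Completion check can be pictured as a plain state comparison. The sketch below is illustrative only: the field names and the flat dictionary layout are assumptions, not EVA's actual database schema or scoring code.

```python
from typing import Any

def state_mismatches(final_state: dict[str, Any], expected_state: dict[str, Any]) -> list[str]:
    """List every expected field whose value is missing or different in the final state."""
    return [
        key
        for key, value in expected_state.items()
        if key not in final_state or final_state[key] != value
    ]

def task_completed(final_state: dict[str, Any], expected_state: dict[str, Any]) -> bool:
    """Binary outcome: the task passes only if nothing mismatches."""
    return not state_mismatches(final_state, expected_state)

# Hypothetical ground truth for a rebooking scenario (field names are assumptions).
expected = {"flight_date": "March 25", "seat_type": "window", "fare_difference": 120}

good_run = {"flight_date": "March 25", "seat_type": "window", "fare_difference": 120}
bad_run = {"flight_date": "March 25", "seat_type": "aisle", "fare_difference": 120}

print(task_completed(good_run, expected))   # True
print(state_mismatches(bad_run, expected))  # ['seat_type']
```

Returning the list of mismatched fields alongside the binary verdict keeps the score deterministic while still showing which part of the booking went wrong.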

EVA-X (Experience) covers three dimensions evaluated by LLM judges: Conciseness (response length appropriate for spoken delivery), Conversation Progression (maintaining context, avoiding repetition, driving toward resolution), and Turn-Taking (appropriate timing, no interruptions, no excessive silence).

This two-axis structure matters because accuracy and experience can diverge. An agent can complete tasks correctly while being verbose, repetitive, and awkward to speak with. Another agent might be pleasant to interact with while quietly failing the backend booking state. Optimizing for one without measuring the other produces systems that look good on half the picture.

Bot-to-Bot Evaluation Architecture

EVA runs evaluations without human annotators. A user simulator generates speech audio via TTS, sends it to the voice agent under test, receives audio in return, and continues the conversation according to a structured goal and persona specification. Tool calls made by the agent are executed by a deterministic Python environment connected to a per-scenario database. The full audio stream, transcript, and tool logs are then passed to the metrics suite.
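The control flow of such a bot-to-bot exchange can be sketched in a few lines. This is a minimal illustration under stated assumptions, not EVA's implementation: `run_conversation`, `scripted_user`, and `echo_agent` are hypothetical names, and plain byte strings stand in for real audio.

```python
from typing import Callable, Optional

def run_conversation(
    user_turn: Callable[[Optional[bytes]], Optional[bytes]],
    agent_turn: Callable[[bytes], bytes],
    max_turns: int = 20,
) -> list[tuple[str, bytes]]:
    """Alternate simulator and agent turns until the simulator ends the call."""
    log: list[tuple[str, bytes]] = []
    agent_audio: Optional[bytes] = None
    for _ in range(max_turns):
        user_audio = user_turn(agent_audio)
        if user_audio is None:  # the simulator decided to hang up
            break
        log.append(("user", user_audio))
        agent_audio = agent_turn(user_audio)
        log.append(("agent", agent_audio))
    return log

# Stand-ins for the real components (assumptions for illustration only).
script = iter([b"I need to rebook my flight.", b"Yes, that works."])

def scripted_user(_agent_audio: Optional[bytes]) -> Optional[bytes]:
    """Play back a fixed script; a real simulator would react to the agent's audio."""
    return next(script, None)

def echo_agent(user_audio: bytes) -> bytes:
    """Trivial agent that acknowledges whatever it hears."""
    return b"ACK: " + user_audio

log = run_conversation(scripted_user, echo_agent)
print(len(log))  # 4
```

In a real harness the log, together with the recorded audio and tool calls, is what the metrics suite would consume.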

The scenario specification format is specific enough to make evaluations reproducible. A user goal defines exactly what the user needs (rebook a flight from AUS to LAX on March 25, arrive by 4 PM, spend no more than $120, confirm a window seat). A decision tree encodes how the user should negotiate: what criteria are hard requirements, what is acceptable to trade away, when to accept an offer, and when to terminate the call. A persona layer adds behavioral texture: communication style, patience level, how the user handles unexpected offers.
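One way to picture such a specification is as a small set of typed records plus an acceptance rule. The sketch below is a guess at the shape, not EVA's actual schema: every class and field name is hypothetical, and treating arrival time and budget as hard requirements while leaving the seat negotiable is an assumption made for the example.

```python
from dataclasses import dataclass

@dataclass
class UserGoal:
    origin: str
    destination: str
    travel_date: str
    latest_arrival: str   # zero-padded 24-hour "HH:MM"
    max_cost: float       # dollars
    seat_preference: str

@dataclass
class Persona:
    communication_style: str
    patience: str

@dataclass
class Scenario:
    goal: UserGoal
    persona: Persona

def meets_hard_requirements(goal: UserGoal, offer: dict) -> bool:
    """Accept only offers that arrive on time and stay within budget (assumed hard rules)."""
    # Zero-padded HH:MM strings compare in chronological order.
    return offer["arrival"] <= goal.latest_arrival and offer["cost"] <= goal.max_cost

scenario = Scenario(
    goal=UserGoal("AUS", "LAX", "March 25", "16:00", 120.0, "window"),
    persona=Persona(communication_style="terse", patience="low"),
)

print(meets_hard_requirements(scenario.goal, {"arrival": "15:30", "cost": 95.0}))  # True
print(meets_hard_requirements(scenario.goal, {"arrival": "17:10", "cost": 80.0}))  # False
```

Encoding the acceptance rule as code rather than free text is what lets two runs of the same scenario negotiate the same way.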

The dataset ships with 50 airline scenarios covering irregular operations rebooking, voluntary itinerary changes, cancellations, same-day standby, and compensation vouchers, with 15 backend tools available to the agent. Airline customer service is a sensible initial domain: it has well-defined policies, real consequences for errors, complex multi-step workflows, and heavy reliance on named entities that are acoustically fragile.

Loading the scenarios is straightforward:

from datasets import load_dataset

dataset = load_dataset("ServiceNow-AI/eva", "airline")

Speech Fidelity and LALM-as-Judge

The Speech Fidelity metric deserves specific attention. Most evaluation work on voice agents checks whether the agent correctly understood the user; EVA is among the first frameworks to formally check whether the agent correctly spoke its own output.

This is not a trivial concern. TTS systems verbalize entity strings inconsistently. A confirmation code like ZK3FFW might be spoken as “zed-kay-three-eff-eff-double-you” by one TTS engine and “ZK three FFW” by another. For a caller writing down a confirmation number, that difference determines whether the code can be copied correctly. EVA uses a large audio language model as the judge for this dimension, evaluating the audio output directly rather than comparing transcripts, which can silently absorb errors that are audible in the speech itself.

What the Early Results Show

EVA tested 20 systems, mixing proprietary and open-source models across cascade and audio-native architectures. Two findings stand out.

First, the accuracy-experience tradeoff is real and consistent across configurations. Systems that score well on task completion tend to score worse on conversational experience, and vice versa. No evaluated system dominated both axes. A benchmark that measures only one axis cannot surface this tradeoff, which means developers relying on single-axis evaluations have been making architectural choices based on an incomplete signal.
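Whether any system dominates both axes is a standard Pareto-dominance check over (accuracy, experience) pairs. The sketch below uses made-up scores purely for illustration; they are not EVA results, and the system names are placeholders.

```python
Score = tuple[float, float]  # (accuracy, experience)

def dominates(a: Score, b: Score) -> bool:
    """a dominates b if it is at least as good on both axes and differs on at least one."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b

def pareto_front(scores: dict[str, Score]) -> list[str]:
    """Return the systems that no other system dominates."""
    return sorted(
        name
        for name, score in scores.items()
        if not any(dominates(other, score) for other in scores.values())
    )

# Illustrative numbers only (NOT taken from the EVA results).
scores = {
    "system_1": (0.80, 0.55),  # accurate but less pleasant
    "system_2": (0.60, 0.75),  # pleasant but less accurate
    "system_3": (0.55, 0.50),  # worse than system_1 on both axes
}

print(pareto_front(scores))  # ['system_1', 'system_2']
```

A front with more than one member means no single system wins on both axes, so choosing between them is an explicit tradeoff rather than a ranking.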

Second, the gap between pass@k and pass^k is large. Pass@k (at least one of k runs succeeds) and pass^k (all k runs succeed) describe different properties: peak capability versus consistency. For production deployment, consistency is what determines user experience at scale. A system that can complete a complex rebooking task in one out of three attempts is not a reliable system. The EVA results show substantial gaps between these two metrics across the board, which suggests that current voice agents have more variance than their accuracy numbers imply.
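The two metrics are easy to compute side by side from repeated runs. The sketch below follows the definitions given above (any of k runs succeeds versus all k runs succeed) and averages over scenarios; the per-scenario outcomes are invented for illustration and are not EVA data.

```python
from typing import Callable

def pass_at_k(runs: list[bool]) -> bool:
    """Peak capability: at least one of the k runs succeeded."""
    return any(runs)

def pass_hat_k(runs: list[bool]) -> bool:
    """Consistency: every one of the k runs succeeded."""
    return all(runs)

def mean_rate(metric: Callable[[list[bool]], bool], results: dict[str, list[bool]]) -> float:
    """Fraction of scenarios for which the metric holds."""
    return sum(metric(runs) for runs in results.values()) / len(results)

# Hypothetical outcomes for k = 3 runs per scenario.
results = {
    "scenario_a": [True, True, True],
    "scenario_b": [True, False, False],
    "scenario_c": [False, False, False],
    "scenario_d": [True, True, False],
}

print(mean_rate(pass_at_k, results))   # 0.75
print(mean_rate(pass_hat_k, results))  # 0.25
```

In this toy data three of four scenarios are solvable at least once, but only one is solved every time, which is the kind of gap between peak capability and consistency described above.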

Named entity transcription emerged as the dominant failure mode. A single misheard character early in a conversation, particularly in a confirmation number or flight code, cascades through authentication failures and ultimately breaks the entire workflow. This is an argument for specific investment in named entity handling, separate from general ASR quality improvement.

Acknowledged Limitations

The framework is transparent about its constraints. The user simulator uses a commercial TTS provider, which may favor ASR systems tuned on similar audio characteristics. The 50-scenario airline domain is narrow; the same agent behaviors that work on flight rebooking may not generalize to healthcare or financial services. LLM judges carry their own biases, especially when the evaluated model and the judge model share a provider. Task completion is binary, which discards useful information about partial progress.

Bot-to-bot evaluation also introduces a gap relative to production. Real callers produce disfluencies, emotional variation, background noise, and off-topic detours that a structured user simulator will not. The gap between pass@k and pass^k that EVA measures is therefore likely narrower than the one that appears in production.

What This Changes

EVA does not solve voice agent evaluation; it establishes a more complete problem statement for it. The two-axis scoring model gives teams a framework for making explicit tradeoffs rather than optimizing blindly for whichever metric their existing tooling happens to measure. The Speech Fidelity sub-dimension and the bot-to-bot architecture both point toward evaluation practices that the field has been neglecting.

The dataset and framework are both available under an MIT license. The GitHub repository contains the evaluation code, and the project site includes the early results visualization. For anyone building voice agents with production ambitions, running evaluations against this benchmark is a reasonable baseline check before the agent reaches real users.