Evaluating Voice Agents as Complete Systems: What EVA Gets Right

The proliferation of speech benchmarks over the past few years has produced a detailed map of how well models transcribe audio, synthesize speech, detect intent, and identify speakers. None of that tells you whether a voice agent can book a flight. That is the gap EVA, a new evaluation framework from ServiceNow AI, is designed to close.

Published on the Hugging Face blog and backed by an open dataset and GitHub repository, EVA builds a bot-to-bot audio pipeline that evaluates voice agents end-to-end on task-oriented conversations. The methodology is worth examining in detail, because it exposes structural problems in how these systems are built and measured.

The Component Evaluation Trap

Existing speech benchmarks are good at what they measure. SUPERB evaluates frozen speech representations across a battery of tasks: phoneme recognition, keyword spotting, intent classification, ASR word error rate. Frameworks like AudioBench, VoxEval, and VoiceBench evaluate single-turn audio understanding. TTS quality benchmarks measure intelligibility, naturalness, and prosody in controlled listening conditions.

The problem is that voice agents fail in ways these benchmarks cannot see. A pipeline with strong ASR accuracy and high TTS quality scores can still fail reliably on a common case: a user says “confirmation code B3R-7K9,” the agent mishears one character, the lookup returns a booking-not-found error, and the conversation collapses. The ASR score was measured on a different input distribution than the one the agent encountered. The TTS metrics said nothing about whether the agent accurately spoke the confirmation code back before asking the user to confirm.

Component benchmarks also cannot capture failures that arise from interactions between components. Context accumulated over ten turns can overwhelm a model’s effective attention, causing it to drop a constraint stated early in the conversation. An LLM response can be factually grounded in its context but phonetically ambiguous when synthesized, misleading the user without producing a hallucination by any evaluator’s definition. A silence gap that is unremarkable in text-based chat becomes disorienting in voice because the caller cannot tell whether the system is processing or has dropped the call.

EVA’s central design decision is to evaluate the complete audio-to-audio pipeline in an interactive, multi-turn conversation rather than isolating any single stage.

The Bot-to-Bot Pipeline

The evaluation setup pairs the voice agent under test with a simulated caller: a conversational AI with a specified goal and persona that speaks through TTS. Both sides operate on audio throughout the conversation. The simulated caller draws its objectives from a 50-scenario airline customer service dataset covering rebookings, voluntary itinerary changes, cancellations, standby requests, and compensation voucher issuance. Each scenario includes a persona specification with details like speaking style and patience level.

Critically, tool call responses are deterministic. Each scenario has its own isolated database tracking state changes. When a voice agent calls a lookup function, it always gets the same answer. This eliminates a variance source that makes multi-run evaluation hard to interpret: you can distinguish between “the agent gave wrong information” and “the backend returned different data” because the latter cannot happen. Task completion is assessed by comparing the expected final database state to the actual final state, entirely programmatically, with no LLM judge involved.
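A minimal sketch of what deterministic tool execution and state-based scoring could look like. The class, method, and field names here are hypothetical and chosen for illustration; they are not EVA's actual schema or API.

```python
import copy

# Hypothetical scenario data; field names are illustrative, not EVA's schema.
INITIAL_DB = {
    "bookings": {
        "B3R7K9": {"passenger": "Dana Lee", "flight": "SN101", "status": "confirmed"},
    },
}


class ScenarioBackend:
    """Isolated per-scenario state with deterministic tool responses."""

    def __init__(self, initial_db):
        # Deep copy so one run can never leak state into another.
        self.db = copy.deepcopy(initial_db)

    def lookup_booking(self, code):
        # Same input and same state always give the same answer.
        booking = self.db["bookings"].get(code)
        if booking is None:
            return {"error": "booking_not_found"}
        return dict(booking)

    def rebook(self, code, new_flight):
        booking = self.db["bookings"].get(code)
        if booking is None:
            return {"error": "booking_not_found"}
        booking["flight"] = new_flight
        return {"ok": True, "flight": new_flight}


def task_completed(backend, expected_db):
    """Binary task completion: the final state must equal the expected state."""
    return backend.db == expected_db


# Expected end state for a rebooking scenario.
expected = copy.deepcopy(INITIAL_DB)
expected["bookings"]["B3R7K9"]["flight"] = "SN202"

good_run = ScenarioBackend(INITIAL_DB)
good_run.rebook("B3R7K9", "SN202")

bad_run = ScenarioBackend(INITIAL_DB)
bad_run.rebook("B3R7K8", "SN202")  # one misheard character in the code

print(task_completed(good_run, expected))  # True
print(task_completed(bad_run, expected))   # False
```

Because the comparison is plain equality on the final state, the score needs no judge model, and a failed run can be traced back to the tool calls the agent made.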

The framework also includes validators that check whether each simulated conversation was completed and whether the user simulator faithfully reproduced the intended persona behavior. Conversations that fail validation are regenerated automatically, without manual annotation.
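The validate-and-regenerate loop can be sketched as follows. The validator names, the conversation record fields, and the retry cap are all assumptions made for illustration; the source does not describe how EVA implements this step.

```python
from typing import Callable, Optional

Conversation = dict  # hypothetical record, e.g. {"ended_normally": True, ...}


def conversation_completed(conv: Conversation) -> bool:
    """Hypothetical check: did the call reach a normal ending?"""
    return conv.get("ended_normally", False)


def persona_followed(conv: Conversation) -> bool:
    """Hypothetical check: did the simulated caller stay in persona?"""
    return conv.get("persona_violations", 0) == 0


VALIDATORS = [conversation_completed, persona_followed]


def run_until_valid(
    simulate: Callable[[int], Conversation],
    max_attempts: int = 3,  # assumed cap, not a documented EVA setting
) -> Optional[Conversation]:
    """Regenerate the conversation until every validator passes."""
    for attempt in range(max_attempts):
        conv = simulate(attempt)
        if all(check(conv) for check in VALIDATORS):
            return conv
    return None  # no valid conversation within the attempt budget


def fake_simulate(attempt: int) -> Conversation:
    # Stand-in for a real bot-to-bot run: the first attempt ends
    # abnormally, later attempts are clean.
    if attempt == 0:
        return {"attempt": 0, "ended_normally": False, "persona_violations": 0}
    return {"attempt": attempt, "ended_normally": True, "persona_violations": 0}


valid = run_until_valid(fake_simulate)
print(valid["attempt"])  # 1
```

Keeping validation separate from scoring means a simulator failure is retried rather than counted against the agent under test.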

The pipeline is built on Pipecat, an open-source Python framework for real-time voice applications that supports both cascade architectures (STT to LLM to TTS) and audio-native models that produce speech output directly.

Six Metrics, Two Axes

EVA reports six metrics organized into two axes: accuracy (EVA-A) and experience (EVA-X).

On the accuracy side, task completion is the fully deterministic database-state comparison described above. Faithfulness is evaluated by an LLM judge that checks whether agent responses are grounded in available information, instructions, and tool results, flagging hallucinations, fabrications, and misrepresentations. Speech fidelity is the structurally novel addition: it uses a large audio language model as judge, operating directly on the raw audio output of the agent, checking whether named entities spoken aloud match what they should be. Confirmation codes, flight numbers, dollar amounts, and dates are the primary failure surface. An agent can produce a correct text transcript but synthesize it in a way that sounds wrong on playback; only evaluating the audio catches this class of error. No existing evaluation framework had included a metric of this kind.

On the experience side, conciseness checks whether responses are sized appropriately for spoken delivery, since users cannot skim or reread audio. Conversation progression checks whether the dialogue moves toward task completion rather than looping on repeated questions or losing earlier context. Turn-taking assesses interruption handling and silence management.

Like faithfulness, the conciseness, progression, and turn-taking metrics use LLM judges. Following the methodology established in MT-Bench, each judge is validated against a human-labeled dataset and selected based on its agreement with those annotations. Different metrics may use different judge models, with the best-performing judge for each metric winning the assignment.
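Per-metric judge selection can be sketched as below. The source does not say which agreement statistic EVA uses, so raw label agreement is an assumption here; a chance-corrected statistic such as Cohen's kappa may be preferable when labels are imbalanced. Judge names and labels are invented for illustration.

```python
def agreement(judge_labels, human_labels):
    """Fraction of items where the judge matches the human annotation."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def select_judges(human, candidates):
    """Pick, for each metric, the candidate judge that best matches humans.

    human:      {metric: [labels]}
    candidates: {judge_name: {metric: [labels]}}
    """
    best = {}
    for metric, gold in human.items():
        scores = {
            name: agreement(preds[metric], gold)
            for name, preds in candidates.items()
        }
        best[metric] = max(scores, key=scores.get)
    return best


# Invented labels: 1 = acceptable, 0 = not acceptable.
human = {
    "conciseness": [1, 0, 1, 1],
    "turn_taking": [0, 0, 1, 1],
}
candidates = {
    "judge_a": {"conciseness": [1, 0, 1, 1], "turn_taking": [1, 0, 1, 0]},
    "judge_b": {"conciseness": [1, 1, 0, 1], "turn_taking": [0, 0, 1, 1]},
}

print(select_judges(human, candidates))
# {'conciseness': 'judge_a', 'turn_taking': 'judge_b'}
```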

The Consistency Gap

EVA runs each scenario three times and reports both pass@3 and pass^3, adapting the pass@k framing from HumanEval and code generation evaluation. Pass@3 measures whether at least one of three trials succeeds; pass^3 measures whether all three succeed. The framework reports that the gap between these two numbers is substantial across all tested configurations.
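The two numbers can be computed from per-scenario trial outcomes as in this sketch. The scenario names and outcomes are invented for illustration and are not EVA results.

```python
def pass_at_k(trials):
    """True if at least one trial succeeded."""
    return any(trials)


def pass_hat_k(trials):
    """True only if every trial succeeded."""
    return all(trials)


def summarize(results):
    """Return (pass@k rate, pass^k rate) over all scenarios."""
    n = len(results)
    at_k = sum(pass_at_k(t) for t in results.values()) / n
    hat_k = sum(pass_hat_k(t) for t in results.values()) / n
    return at_k, hat_k


# Invented outcomes for three trials of four scenarios.
results = {
    "rebook_01": [True, True, True],
    "cancel_02": [True, False, True],
    "standby_03": [False, False, False],
    "voucher_04": [False, True, False],
}

at_3, hat_3 = summarize(results)
print(at_3, hat_3)  # 0.75 0.25
```

As a rough intuition for why the gap can be large: if trials were independent with per-trial success rate p, pass@3 would be 1 - (1 - p)^3 and pass^3 would be p^3. At p = 0.7 that gives 0.973 against 0.343. Real trials are not guaranteed to be independent, so this is only an illustration.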

This finding has the most direct consequence for teams building production voice agents. A system with good pass@3 but poor pass^3 can solve a task occasionally but cannot do so reliably. Users encounter the failures, not the successes. Informal evaluation, where a developer runs a scenario once or twice to verify that it works, measures something closer to pass@3. The number that matters for a production deployment is pass^3.

The gap is amplified by failure cascades specific to voice. A single misheard character in a booking code blocks authentication. Authentication failure prevents retrieval of the user’s itinerary. Without the itinerary, every downstream step in a rebooking conversation fails. A three-turn failure mode of this kind shows up in pass^3 as a complete miss, even when the system handles all other aspects of the conversation competently. Transcript-level evaluation catches this only if the transcript already reflects the audio error; the Speech Fidelity metric is designed for exactly the cases where the transcript does not.

The Accuracy-Experience Tradeoff

The second major empirical finding is that no tested system dominates both EVA-A and EVA-X simultaneously. Systems with strong task completion and faithfulness scores tend to produce responses that are verbose, slow to progress toward resolution, or poorly timed for voice interaction. Systems with stronger conversational experience metrics tend to sacrifice factual grounding or completion rate.
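The claim that no system dominates on both axes can be made precise with a Pareto check, sketched here. The system names and scores are invented for illustration and are not EVA results; treating each axis as a single aggregate score is also an assumption.

```python
def dominates(a, b):
    """a dominates b if it is at least as good on both axes and better on one.

    Each argument is an (accuracy, experience) pair.
    """
    at_least_as_good = a[0] >= b[0] and a[1] >= b[1]
    strictly_better = a[0] > b[0] or a[1] > b[1]
    return at_least_as_good and strictly_better


def pareto_front(systems):
    """Names of systems that no other system dominates."""
    return sorted(
        name
        for name, score in systems.items()
        if not any(
            dominates(other, score)
            for other_name, other in systems.items()
            if other_name != name
        )
    )


# Invented (EVA-A, EVA-X) aggregate scores.
systems = {
    "sys_a": (0.80, 0.55),  # accurate but verbose
    "sys_b": (0.60, 0.75),  # pleasant but less grounded
    "sys_c": (0.55, 0.50),  # worse than both on both axes
}

print(pareto_front(systems))  # ['sys_a', 'sys_b']
```

When the front contains more than one system, as here, choosing between them is a product decision about which axis matters more, not a ranking question.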

This tradeoff reflects where model development incentives have concentrated. LLMs optimized for instruction following and factual grounding tend toward comprehensive, thorough responses. These perform well in text-based benchmarks and behave poorly as voice agents, where length and pacing carry meaning in ways they do not in text. Concision in voice requires the model to determine, in the moment, which information is load-bearing and which can be omitted, a capability that standard RLHF pipelines do not directly reward.

EVA surfaces this tradeoff quantitatively, which allows it to be treated as a design decision with measurable costs rather than as a vague tension felt subjectively during development.

Where the Framework Stands

EVA is explicit about its current limitations. Fifty scenarios in a single English-language domain is a narrow scope for general claims about voice agent quality. The user simulator uses a single TTS provider, which may systematically disadvantage ASR systems not trained on that voice distribution. Task completion scoring is binary, with no partial credit for conversations that complete most of a task but fail at a final confirmation step. The bot-to-bot setup, for all its reproducibility advantages, diverges from production conditions where human callers produce disfluencies, background noise, and off-topic asides that no simulator currently replicates well.

The stated roadmap covers prosodic quality, noise robustness, multilingual settings, affect-aware evaluation, and domain expansion. The dataset and code are publicly available, which matters if EVA is to function as a shared evaluation substrate rather than an internal ServiceNow tool.

The core methodology, particularly the deterministic tool execution and the pass@k consistency measurement, is adaptable to other domains without waiting for official framework expansion. The consistency gap is not specific to airline rebooking conversations. It is a property of how LLMs behave under multi-turn audio constraints, and it is a number most voice agent teams are not currently measuring.