Evaluating Voice Agents End-to-End: What EVA Gets Right About a Hard Problem

Source: huggingface

Most voice agent evaluation frameworks measure the wrong thing. They test components in isolation: how accurately does the ASR transcribe speech, how well does the LLM follow instructions, how natural does the TTS sound. Each measurement is useful, but none of them tells you whether the complete system, running end-to-end over audio in a real multi-turn conversation, can actually complete a task while being tolerable to talk to. That combination, task accuracy and conversational experience evaluated together in a live audio context, is what EVA from ServiceNow AI is designed to measure.

The Architecture Decision That Defines EVA

The most consequential design choice in EVA is the bot-to-bot audio pipeline. A simulated user agent, configured with a specific goal and persona, calls the voice agent under test over real audio. The conversation happens in real time. The user simulator speaks using TTS, the agent responds through its full stack, and the interaction runs until the conversation concludes or the agent fails to progress.

This is meaningfully different from text-based agentic benchmarks like SWE-bench or WebArena, and also different from prior voice-agentic work like VoiceAgentBench and CAVA, which EVA explicitly positions against. The issue with those benchmarks is not that they are poorly designed; it is that they evaluate agentic capabilities in isolation from the complete conversational workflow. A voice agent that handles tool calls correctly is not the same thing as a voice agent that handles tool calls correctly while also managing turn-taking, avoiding interruptions, staying concise enough for spoken delivery, and maintaining context across five or six turns with a caller who does not have a transcript to reference.

The pipeline is built on Pipecat, the open-source Python framework for real-time voice applications. EVA supports two architecture types for the agent under test: cascade systems (STT to LLM to TTS) and audio-native systems (speech-to-speech or large audio language models feeding into TTS). Both architectures are supported because the comparison between them is a live research question, and any benchmark that bakes in one architecture type would answer the wrong question.

Six Metrics, Two Dimensions

EVA evaluates across six metrics grouped into two composite scores. EVA-A covers accuracy; EVA-X covers experience.

On the accuracy side:

Task Completion is deterministic. After each conversation, the expected final state of a per-scenario database is compared against the actual state. A flight rebooking scenario has a ground truth: the booking record should reflect the new flight, the seat assignment should carry over if requested, the fare difference should be charged correctly. Binary pass or fail, no LLM judge involved.

Faithfulness uses an LLM-as-judge to check whether the agent’s responses were grounded in its instructions, policies, and the tool call results it received. This catches hallucination and policy violation at the conversational level, not just in individual completions.

Speech Fidelity is the metric that has no equivalent in any prior voice agent benchmark. It uses a large audio language model as judge to evaluate whether the agent’s spoken output faithfully reproduced the intended content at the audio level. The focus is on entities that matter in voice contexts: confirmation codes, flight numbers, dollar amounts. A single misheard or mispronounced character in a confirmation code cascades into authentication failure, and text-level metrics would never catch it because the transcript might look correct even when the audio is wrong.
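Of the three accuracy metrics, Task Completion is the one that reduces to plain code. Here is a minimal sketch of the idea, not EVA's actual implementation: the nested-dict database shape, the function names, and the "<absent>" marker are all assumptions made for illustration.

```python
from typing import Any

# Hypothetical shape: table name -> record id -> record fields.
DbState = dict[str, dict[str, dict[str, Any]]]

MISSING = "<absent>"  # illustrative marker for a field present on one side only


def task_completed(expected: DbState, actual: DbState) -> bool:
    """Binary pass/fail: the final state must equal the ground truth exactly."""
    return expected == actual


def state_diff(expected: DbState, actual: DbState) -> list[str]:
    """Describe every mismatch, which helps when debugging a failed scenario."""
    problems = []
    for table in sorted(expected.keys() | actual.keys()):
        exp_rows = expected.get(table, {})
        act_rows = actual.get(table, {})
        for row_id in sorted(exp_rows.keys() | act_rows.keys()):
            if row_id not in act_rows:
                problems.append(f"{table}/{row_id}: missing from actual state")
            elif row_id not in exp_rows:
                problems.append(f"{table}/{row_id}: unexpected record")
            else:
                exp_row, act_row = exp_rows[row_id], act_rows[row_id]
                for name in sorted(exp_row.keys() | act_row.keys()):
                    want = exp_row.get(name, MISSING)
                    got = act_row.get(name, MISSING)
                    if want != got:
                        problems.append(
                            f"{table}/{row_id}.{name}: expected {want!r}, got {got!r}"
                        )
    return problems
```

The score itself stays binary. The diff is only a debugging aid for seeing which part of a rebooking (flight, seat, charge) the agent got wrong.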

On the experience side:

Conciseness evaluates whether responses were appropriately brief for spoken delivery. Callers cannot skim or re-read. A response that would be fine in a chat interface can be overwhelming spoken aloud, and agents trained primarily on text interactions tend to produce responses calibrated for reading.

Conversation Progression checks whether the agent moved the conversation forward: no unnecessary repetition, context retained across turns, the agent driving toward task completion rather than stalling or asking redundant clarifying questions.

Turn-Taking evaluates timing: whether the agent interrupted, whether it left excessive silence after the user finished speaking. This is a dimension that only exists in real-time audio interactions, and it directly affects user experience in ways that transcript-based evaluations cannot surface.
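To make the Turn-Taking idea concrete, here is a rough sketch that counts interruptions and long silences from turn timestamps. It is an illustration under stated assumptions, not EVA's scoring method: the Turn record, the function name, and the 2-second silence threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    speaker: str  # "user" or "agent"
    start: float  # seconds from the start of the call
    end: float


def turn_taking_events(turns: list[Turn], max_gap: float = 2.0) -> dict[str, int]:
    """Count agent interruptions and over-long silences after user turns.

    max_gap is an assumed threshold, chosen only for illustration.
    """
    counts = {"transitions": 0, "interruptions": 0, "long_silences": 0}
    ordered = sorted(turns, key=lambda t: t.start)
    for prev, cur in zip(ordered, ordered[1:]):
        # Only user -> agent handoffs say something about the agent's timing.
        if prev.speaker != "user" or cur.speaker != "agent":
            continue
        counts["transitions"] += 1
        gap = cur.start - prev.end
        if gap < 0:
            # The agent started speaking before the user had finished.
            counts["interruptions"] += 1
        elif gap > max_gap:
            counts["long_silences"] += 1
    return counts
```

None of these timestamps exist in a transcript, which is why this dimension only becomes visible when the evaluation runs over real audio.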

Each scenario runs three trials. Results are reported as both pass@k (the probability that at least one of the three runs succeeds, measuring peak performance) and pass^k (the probability that all three runs succeed, measuring consistency). The gap between these two numbers is often substantial, and it matters: a voice agent deployed to handle customer service calls cannot get away with succeeding only two-thirds of the time.
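The two numbers are easy to compute from per-scenario trial outcomes. The sketch below uses the standard combinatorial estimators for n trials with c successes. Whether EVA computes them exactly this way is an assumption, and the function names are illustrative. With three trials and k = 3, the estimators reduce to "any run passed" and "all runs passed", matching the description above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trials, drawn from n with c successes, passes."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k trials, drawn from n with c successes, pass."""
    return comb(c, k) / comb(n, k)


def summarize(results: dict[str, list[bool]], k: int = 3) -> tuple[float, float]:
    """Average pass@k and pass^k over scenarios.

    results maps a scenario id to that scenario's trial outcomes.
    """
    at_k = [pass_at_k(len(t), sum(t), k) for t in results.values()]
    hat_k = [pass_hat_k(len(t), sum(t), k) for t in results.values()]
    return sum(at_k) / len(at_k), sum(hat_k) / len(hat_k)
```

An agent that passes two of three trials on every scenario would score 1.0 on pass@3 and 0.0 on pass^3, which is exactly the gap the two metrics are meant to expose.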

The Dataset and What It Tests

The initial release covers the airline domain: 50 scenarios, 15 tools, English only. Tasks include IRROPS rebooking, voluntary itinerary changes, cancellations, same-day standby requests, and compensation vouchers. Each scenario contains a user goal (a detailed specification of exactly what the caller wants to achieve, including the decision tree they will follow), a user persona (speaking style, patience level, personality), a scenario database, and a ground truth representing the expected final database state.
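The four parts of a scenario map naturally onto a small record type. The sketch below shows one way to represent them. The class and field names are hypothetical and are not taken from the EVA dataset schema.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Persona:
    speaking_style: str
    patience: str
    personality: str


@dataclass
class Scenario:
    scenario_id: str
    user_goal: str  # what the caller wants, including their decision tree
    persona: Persona
    database: dict[str, Any]  # state the agent's tools read and write
    expected_final_state: dict[str, Any]  # ground truth for Task Completion

    def requires_change(self) -> bool:
        """True when success means the agent must actually modify the database."""
        return self.database != self.expected_final_state
```

Keeping the ground truth alongside the starting database is what allows Task Completion to be scored without an LLM judge.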

The airline domain is narrow by design. The scenarios are built to stress-test specific capabilities: temporal reasoning, policy adherence, constraint satisfaction across multiple steps, and named entity handling. A scenario that requires rebooking a flight while preserving ancillary services, like a seat selection and a checked bag, forces the agent to maintain state across multiple tool calls while managing a caller who may provide information in a non-linear order.

The dataset and code are public. The leaderboard at servicenow.github.io/eva covers 20 systems, both proprietary and open-source, spanning cascade and audio-native architectures.

The Tradeoff Finding

The most significant empirical result from the initial evaluation is that no tested system dominates on both EVA-A and EVA-X. Agents that achieve higher task completion rates tend to deliver worse conversational experience scores, and vice versa. The relationship holds across architectures, across proprietary and open-source systems, and across cascade and audio-native configurations.
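One way to make "no system dominates" precise is Pareto dominance over the two composite scores. The sketch below is illustrative only: the function names are assumptions, and any scores passed to it in an example would be placeholders rather than leaderboard results.

```python
Score = tuple[float, float]  # (EVA-A, EVA-X)


def dominates(a: Score, b: Score) -> bool:
    """a is at least as good as b on both axes and strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b


def pareto_front(scores: dict[str, Score]) -> list[str]:
    """Systems that no other system beats on both axes at once."""
    return sorted(
        name
        for name, score in scores.items()
        if not any(dominates(other, score) for other in scores.values())
    )


def has_single_winner(scores: dict[str, Score]) -> bool:
    """True if one system is at least as good as every other on both axes."""
    return any(
        all(s[0] >= o[0] and s[1] >= o[1] for o in scores.values())
        for s in scores.values()
    )
```

In these terms, the reported finding is that has_single_winner would be false for the tested systems: the front contains several systems, each trading accuracy against experience.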

This is not surprising once you think about it. Task completion rewards comprehensive behavior: confirming every detail, clarifying every ambiguity, repeating back information to ensure accuracy. Conversational experience penalizes exactly those behaviors when they are excessive. An agent optimized purely for task accuracy becomes verbose and robotic. An agent optimized for conciseness and natural flow is more likely to miss a step or fail to confirm a critical detail.

The tradeoff being consistent across all 20 tested systems is the finding that justifies EVA’s joint evaluation design. If some systems cleared both bars simultaneously, you could argue that accuracy and experience are not in tension and can be optimized independently. The data says otherwise, which means teams building voice agents for production deployment need to make explicit decisions about where to sit on that tradeoff curve, and they need a benchmark that makes the tradeoff visible.

The Prior Art Problem

To understand why EVA fills a gap, it helps to be specific about what prior frameworks measure.

Benchmarks like AudioBench, VoiceBench, and VoxEval evaluate speech understanding and audio comprehension: can the model transcribe accurately, answer questions about audio content, follow spoken instructions. These are single-turn or short-context evaluations. They do not run complete task-oriented conversations.

Speech quality benchmarks like EmergentTTS-Eval and SHEET assess perceived speech quality through listening tests. Useful for comparing TTS systems, but entirely disconnected from whether the agent can complete a task.

Conversational dynamics work like Full-Duplex-Bench and Talking Turns analyzes turn-taking, interruption handling, and backchanneling. These behaviors are relevant to EVA-X, but that work evaluates them in isolation from task-oriented tool use, which is where turn-taking decisions actually matter in a deployed voice agent.

VoiceAgentBench and CAVA both evaluate agentic voice capabilities, but they do not run the agent through a complete multi-turn conversational workflow from initial user request through multi-step tool orchestration to final resolution. EVA’s distinguishing claim is that this completeness is not optional; it is the only way to observe the accuracy-experience tradeoff in practice.

What Remains Hard

EVA’s authors are candid about the limitations. The LLM-as-judge metrics carry inherent biases and may favor response styles associated with certain providers, especially when the judge model and the evaluated model share a lineage. Scoring task completion as binary pass/fail gives no credit to agents that fail gracefully, which hides the difference between a catastrophic failure and a near-miss.

The prosody problem is acknowledged as unsolved. The current framework does not evaluate pronunciation quality, rhythm, or expressiveness in the agent’s speech. Attempts to use LALM-as-judge for prosodic quality found very low alignment with human judgments, so the metric was excluded from the initial release. This is an open research problem, and it matters: an agent that speaks accurately but sounds stilted will still produce bad user experiences.

The user simulator relies on a single commercial TTS provider for all user speech, which means voice characteristics may systematically favor certain ASR systems. The bot-to-bot pipeline also involves audio format conversions and real-time interfaces that may not perfectly represent production deployment conditions. Full reproduction requires commercial API access, which is a real constraint for independent researchers.

The roadmap includes robustness testing under noisy conditions, diverse accents, and multilingual inputs; affect-aware evaluation for callers expressing distress or frustration; and additional domain datasets beyond airlines. Fifty English scenarios in one domain is a starting point, not a final answer.

Why This Matters for Teams Building Voice Agents

The practical implication of EVA’s findings is that teams building production voice agents are likely making implicit tradeoff decisions without knowing it. Most evaluation setups measure task completion through text-based conversation simulations or unit tests against individual pipeline components. Neither approach reveals the experience-side degradation that comes from optimizing for accuracy, or the accuracy-side degradation that comes from tuning for a natural conversational style.

The Speech Fidelity metric in particular is something that almost no existing evaluation pipeline captures. If your voice agent quotes a booking reference number and the TTS mispronounces two digits, a text-level test will show the correct code in the transcript. The caller, who only hears the audio, will not be able to write it down correctly. This is the kind of failure mode that only appears at the audio level, in a complete end-to-end test, and EVA is currently the only benchmark that measures it.

The consistency gap revealed by pass@k versus pass^k is equally important. A voice agent that can complete a task 60% of the time is not a production-ready system regardless of how its peak performance looks on a leaderboard. EVA’s reporting methodology makes consistency a first-class metric rather than something inferred from a single-trial score.

EVA is not the final word on voice agent evaluation. The benchmark will need to generalize beyond the airline domain, prosody remains unmeasured, and the accuracy-experience tradeoff will need to be studied across many more system configurations before the underlying dynamics are well understood. But it establishes a foundation that the field needed: a complete, reproducible, jointly-scored benchmark for conversational voice agents that does not reduce the problem to its components.
