Task Completion Tells Half the Story: Inside EVA's Voice Agent Benchmark
Source: huggingface
When you evaluate a text-based AI agent, the main challenge is deciding whether its output is correct. For voice agents the problem compounds: you have to evaluate whether the system heard correctly, whether it reasoned correctly, whether it spoke correctly, and whether the conversation felt coherent to a human caller. These are four distinct properties, and optimizing for one can degrade the others.
This is the core problem that EVA (Evaluating Voice Agents), a new framework from ServiceNow AI, is designed to address. It takes the position that existing benchmarks have been measuring the wrong things, or measuring them in isolation, and it proposes a unified end-to-end evaluation methodology that covers both task accuracy and conversational experience in a single pipeline.
Why Single-Metric Benchmarks Fall Short
Most existing voice evaluation frameworks measure one thing well and leave everything else implicit. AudioBench, SD-Eval, VoxEval, and similar benchmarks test ASR quality in single-turn or non-interactive settings. EmergentTTS-Eval and SHEET focus on speech quality through MOS (Mean Opinion Score) evaluations. FD-Bench and Talking Turns target dialogue coherence but are disconnected from whether the agent actually accomplished any task.
The problem is that these dimensions interact. A cascade voice agent, the STT → LLM → TTS pipeline that dominates current production deployments, can transcribe with high accuracy and still fail tasks because the LLM misapplied a policy. An audio-native speech-to-speech model might sound natural in conversation but mishear a single digit in a flight confirmation number, triggering an authentication failure that ends the call entirely.
EVA’s design acknowledges this interdependence and requires you to measure both axes simultaneously, on real multi-turn conversations with tool use.
Architecture of the Evaluation Pipeline
The framework has five components that work together to simulate a complete voice interaction without human participants.
A User Simulator plays the role of a caller. It has a defined goal, a persona covering speaking style, patience level, and personality, and it communicates through a TTS engine to produce actual audio. The Voice Agent under test receives that audio and responds as it would in production. Behind the scenes, a Tool Executor provides deterministic responses to tool calls, backed by a per-scenario database that tracks state changes. Two types of Validators check that conversations reached completion and that tool execution was faithful to agent intent. Finally, a Metrics Suite processes the conversation recording, transcript, and tool logs to produce scores.
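To make the Tool Executor's role concrete, here is a minimal sketch of a deterministic tool backend over a per-scenario database. It is not EVA's implementation: the class name `ToolExecutor`, the two tools, and the database layout are all hypothetical, chosen only to show how state changes can be tracked and logged for later validation.

```python
import copy
from typing import Any, Callable


class ToolExecutor:
    """Hypothetical deterministic tool backend; not EVA's actual API."""

    def __init__(self, scenario_db: dict[str, Any]) -> None:
        # Deep-copy so every simulated call starts from the same initial state.
        self.db = copy.deepcopy(scenario_db)
        # Every call is logged so validators and metrics can inspect it later.
        self.log: list[dict[str, Any]] = []
        self._tools: dict[str, Callable[..., dict[str, Any]]] = {
            "get_booking": self._get_booking,
            "change_flight": self._change_flight,
        }

    def call(self, name: str, **args: Any) -> dict[str, Any]:
        """Dispatch a tool call by name and record it in the log."""
        if name not in self._tools:
            result = {"ok": False, "error": f"unknown tool: {name}"}
        else:
            result = self._tools[name](**args)
        self.log.append({"tool": name, "args": args, "result": result})
        return result

    def _get_booking(self, confirmation_code: str) -> dict[str, Any]:
        booking = self.db["bookings"].get(confirmation_code)
        if booking is None:
            return {"ok": False, "error": "booking not found"}
        return {"ok": True, "booking": copy.deepcopy(booking)}

    def _change_flight(self, confirmation_code: str, new_flight: str) -> dict[str, Any]:
        booking = self.db["bookings"].get(confirmation_code)
        if booking is None:
            return {"ok": False, "error": "booking not found"}
        if new_flight not in self.db["flights"]:
            return {"ok": False, "error": "flight not found"}
        # The only side effect: mutate the per-scenario state.
        booking["flight"] = new_flight
        return {"ok": True, "booking": copy.deepcopy(booking)}
```

Because the same inputs always produce the same outputs and state changes, any difference between two runs of a scenario can be attributed to the agent rather than to the backend.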
Supporting both cascade and audio-native agent architectures is an explicit design requirement. EVA integrates with Pipecat, an open-source framework for building real-time voice agents, which provides a consistent interface across pipeline types. This matters because the industry is currently in transition: most production systems still use cascade pipelines, but audio-native large audio language models are closing the gap, and a benchmark that only supports one architecture will miss half the field.
Two Dimensions, Six Metrics
EVA splits its measurement into EVA-A (accuracy) and EVA-X (experience), three metrics each.
EVA-A covers:
- Task Completion is deterministic. After the conversation ends, the framework compares the expected database state defined per scenario against the actual state after tool calls. No model judgment required. Either the booking was changed correctly or it was not.
- Faithfulness uses an LLM as judge to detect hallucinations, policy violations, and misrepresentations in what the agent told the caller. This catches cases where task completion appears to succeed but the agent fabricated information along the way.
- Speech Fidelity uses a large audio language model (LALM) as judge, evaluating the audio output directly for accuracy in named entities: confirmation codes, flight numbers, monetary amounts. This is the metric that text-based agent benchmarks structurally cannot capture.
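The deterministic Task Completion check can be sketched as a recursive comparison of two nested dictionaries. This is an illustration under assumptions: the function names `diff_state` and `task_completed` and the dictionary-shaped database are hypothetical, and EVA's real comparison logic may differ.

```python
from typing import Any


def diff_state(expected: Any, actual: Any, path: str = "") -> list[str]:
    """Return readable descriptions of where `actual` differs from `expected`."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        diffs: list[str] = []
        for key in sorted(set(expected) | set(actual), key=str):
            child = f"{path}.{key}" if path else str(key)
            if key not in actual:
                diffs.append(f"{child}: missing")
            elif key not in expected:
                diffs.append(f"{child}: unexpected")
            else:
                diffs.extend(diff_state(expected[key], actual[key], child))
        return diffs
    if expected != actual:
        return [f"{path}: expected {expected!r}, got {actual!r}"]
    return []


def task_completed(expected: Any, actual: Any) -> bool:
    """Binary outcome: the task passes only if the final states match exactly."""
    return not diff_state(expected, actual)
```

Returning the list of differences alongside the binary verdict is a design choice of this sketch; it makes a failed run easier to debug without changing the pass/fail result.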
EVA-X covers:
- Conciseness is judged by an LLM and measures whether responses are appropriately brief for spoken delivery. A caller cannot scan ahead or skim, so verbose responses are a quality failure even when the content is accurate.
- Conversation Progression measures forward momentum, context retention, and avoidance of repetition or stalling across turns.
- Turn-Taking evaluates whether the agent interrupts callers or leaves excessive silence at transition points.
The judge selection process for LLM and LALM evaluators is itself deliberate. Rather than arbitrarily picking a model, EVA benchmarks candidate judges against a curated evaluation dataset and selects the best-performing model per metric. This reduces the risk of systematic bias that emerges when a single judge is applied across all dimensions without validation.
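Per-metric judge selection can be illustrated with a short sketch. The selection criterion here, simple agreement rate with human labels, is an assumption; the source does not say which statistic EVA uses. The function name `select_judges` and the data layout are likewise hypothetical.

```python
from collections import defaultdict

Key = tuple[str, str]  # (metric name, item id)


def select_judges(
    gold: dict[Key, int],
    predictions: dict[str, dict[Key, int]],
) -> dict[str, str]:
    """Pick, for each metric, the judge that agrees most often with gold labels."""
    # agreement[metric][judge] = [matches, total]
    agreement: dict[str, dict[str, list[int]]] = defaultdict(
        lambda: defaultdict(lambda: [0, 0])
    )
    for judge, preds in predictions.items():
        for (metric, item_id), label in gold.items():
            counts = agreement[metric][judge]
            counts[1] += 1
            # A missing prediction counts as a disagreement.
            if preds.get((metric, item_id)) == label:
                counts[0] += 1

    best: dict[str, str] = {}
    for metric, by_judge in agreement.items():
        # Iterating judges in sorted order makes ties resolve alphabetically.
        best[metric] = max(
            sorted(by_judge), key=lambda j: by_judge[j][0] / by_judge[j][1]
        )
    return best
```

The point of the sketch is the shape of the output: a mapping from metric to judge, so that a model which is strong at judging faithfulness is not automatically trusted to judge conciseness.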
The Airline Domain Benchmark
The initial dataset, available on Hugging Face, covers 50 synthetic scenarios in the airline industry, with 15 tools spanning IRROPS (irregular operations) rebooking, voluntary itinerary changes, cancellations, same-day standby, and compensation vouchers. The domain choice is not arbitrary. Airline customer service calls involve temporal reasoning around connection times, strict policy enforcement around fare differences and change fees, named-entity handling for confirmation codes and gate numbers, and multi-step workflows where a single error early in the conversation invalidates subsequent steps.
Each scenario includes the user goal expressed as a decision tree, the user persona, the scenario database for tool queries, and the ground-truth expected final database state. This structure is significant. By defining ground truth at the database state level rather than at the response text level, EVA avoids the reference-matching problem that plagues text-based dialogue evaluation. There is no ground-truth transcript to compare against; correctness is defined entirely by outcomes.
What the Results Reveal
EVA evaluated 20 systems, including proprietary and open-source models and both cascade and audio-native architectures. Two findings stand out.
The first is a quantified accuracy-experience tradeoff. No single system configuration dominated both EVA-A and EVA-X scores. Agents with high task accuracy tended to deliver poor conversational experience, and vice versa. This is not surprising in principle, but it matters because it is now measured. Production teams optimizing solely on task completion are accepting a quality deficit in the user experience that they have likely never tracked.
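One way to state "no system dominated both axes" precisely is through Pareto dominance: system A dominates system B if A is at least as good on both scores and strictly better on at least one. The sketch below is illustrative only. The function name `pareto_front` is hypothetical, and it assumes each system has already been reduced to one aggregate EVA-A score and one aggregate EVA-X score, which the source does not specify.

```python
def pareto_front(scores: dict[str, tuple[float, float]]) -> list[str]:
    """Return the systems that no other system dominates on (accuracy, experience)."""

    def dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
        # At least as good on both axes, and not identical (so strictly better on one).
        return a[0] >= b[0] and a[1] >= b[1] and a != b

    return sorted(
        name
        for name, score in scores.items()
        if not any(dominates(other, score) for other in scores.values())
    )
```

If the returned list has a single entry, one system wins on both axes. A list with several entries is the tradeoff described above: each remaining system can only be beaten on one axis by giving something up on the other.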
The second finding concerns consistency. The framework reports both pass@k, the probability that at least one of k runs succeeds, and pass^k, the probability that all k runs succeed. The gap between these two measures was large across all evaluated systems. An agent might complete a task in two of three runs while failing the third due to a transcription error or an unexpected conversational detour. From a pass@3 perspective that looks acceptable; from a pass^3 perspective it is a reliability problem. In production, callers who hit the failure case do not receive a retry.
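The difference between the two measures is easy to see in code. The sketch below computes empirical estimates from exactly k recorded runs per scenario: a scenario counts toward pass@k if any run succeeded and toward pass^k only if every run succeeded. The function name `pass_metrics` and the input layout are hypothetical, and EVA may use a different estimator.

```python
def pass_metrics(results: dict[str, list[bool]]) -> tuple[float, float]:
    """Return (pass@k, pass^k) averaged over scenarios.

    `results` maps a scenario id to the outcomes of its k runs.
    """
    if not results:
        raise ValueError("no scenarios provided")
    run_counts = {len(runs) for runs in results.values()}
    if len(run_counts) != 1 or 0 in run_counts:
        raise ValueError("every scenario needs the same non-zero number of runs")

    n = len(results)
    pass_at_k = sum(any(runs) for runs in results.values()) / n   # at least one success
    pass_hat_k = sum(all(runs) for runs in results.values()) / n  # all runs succeed
    return pass_at_k, pass_hat_k


# Made-up outcomes for four scenarios with k = 3 runs each.
example = {
    "s1": [True, True, False],
    "s2": [True, True, True],
    "s3": [False, False, False],
    "s4": [True, False, True],
}
print(pass_metrics(example))  # (0.75, 0.25)
```

In this made-up example, three of four scenarios succeed at least once, but only one succeeds every time, which is the kind of gap the framework reports.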
The LLM-as-Judge Question
Using LLMs to evaluate LLM outputs is now standard practice, but it carries known risks. EVA acknowledges these: potential biases, and systematic preferences that emerge when a judge evaluates a model from the same training lineage. It also acknowledges a limitation on the deterministic side: binary task completion cannot award partial credit.
The LALM-as-judge approach for Speech Fidelity is less established and more interesting. Rather than transcribing audio and running text-level comparison, EVA evaluates audio output directly. This catches errors that survive ASR normalization, such as a system that correctly transcribes a spoken confirmation code internally but produces TTS output that distorts one character. Text-level evaluation would not catch that failure; audio-level evaluation can.
This is a meaningful methodological advance over prior frameworks. Speech Fidelity as an audio-native metric is the kind of evaluation that only makes sense when you treat voice as a first-class interface rather than a text interface with an acoustic wrapper.
Limitations Worth Noting
Fifty scenarios in a single domain is a narrow testbed. The user simulator, by construction, behaves more consistently and rationally than real callers. Disfluencies, emotional states, and background noise are outside the current scope. The framework operates in English only, and the single TTS provider used for the user simulator may favor ASR systems tuned on that provider’s voice characteristics, introducing an unintended selection bias in results.
The EVA GitHub repository and the project website outline a roadmap that addresses several of these gaps: prosodic quality assessment, robustness testing under noise and accent variation, multilingual scenarios, compound multi-step requests, and, eventually, a published leaderboard.
Why This Framework Matters
Voice agent development has been building on borrowed evaluation methodology, repurposing text-dialogue metrics and ASR benchmarks for systems that are neither. EVA makes the case that voice agents require their own evaluation paradigm, one that starts with the audio signal and measures all the way through to database state.
The dual-axis design, separating task accuracy from conversational experience while keeping both in a single evaluation run, is the right abstraction level. Neither metric alone is sufficient. A voice agent that completes tasks but sounds robotic and verbose will lose callers; a voice agent with excellent turn-taking and conciseness that fails to actually change the booking has solved the wrong problem.
There is real implementation overhead here: scenario databases, user simulators, tool definitions, and API access to both LLM and LALM judges. But for teams shipping voice agents into production, the alternative is operating without metrics that reflect what users actually experience. That is a more expensive trade-off in the long run.