The problem with evaluating voice agents is that the available metrics were borrowed from adjacent problems. Word Error Rate comes from speech recognition research, Mean Opinion Score from telephony quality assessment, and BLEU from machine translation. Each measures something real, but none of them answers the question a product team cares about: did the agent complete the user’s task, and was the conversation tolerable?
ServiceNow AI’s EVA framework, released this month on Hugging Face, is the first framework to treat both questions as first-class concerns in a single pipeline. Its design choices reflect a clear diagnosis of where prior evaluation approaches fell short.
What Prior Evaluations Were Measuring
WER tells you the percentage of words the speech recognition system got wrong. It is useful for benchmarking ASR systems in isolation, but it weights every word error equally, which conflates two very different failure modes. A system that mishears “window seat” as “window sheet” might have elevated WER but complete the task fine. A system that mishears a six-character confirmation code as five characters fails the task entirely, no matter how low its overall WER is. The difference in consequence between these two errors is enormous, yet WER treats them identically.
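A minimal sketch of that blind spot, assuming the standard word-level edit-distance definition of WER. The two utterances and the spelled-out confirmation code are hypothetical; they are chosen so that each hypothesis contains exactly one word error against a seven-word reference.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word: the booking still goes through.
harmless = word_error_rate(
    "i would like a window seat please",
    "i would like a window sheet please",
)

# One dropped character in a spelled-out code: the lookup fails.
fatal = word_error_rate(
    "code k 7 q 2 m x",
    "code k 7 q 2 m",
)

print(harmless == fatal)  # True: both are 1 error over 7 reference words
```

Both utterances score the same WER, even though only the second one breaks the task.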
MOS has the opposite problem: it measures perceived audio quality via human listening panels, which is expensive and disconnected from task success. A voice agent can produce natural-sounding speech and still forget to confirm the rebooking, contradict an earlier statement, or fail to retrieve the right flight record.
BLEU was designed to measure n-gram overlap between candidate and reference translations. In task-oriented dialogue, there is rarely a single correct response. Paraphrase is acceptable and often preferable; BLEU penalizes all of it equally.
EVA discards these as top-level metrics and builds evaluation from task completion upward. That sets it apart from benchmarks like AudioBench, SD-Eval, or VoiceAgentBench, which evaluate audio understanding or speech quality without running full multi-turn interactions against a live backend.
How EVA Works
The pipeline has five components: a user simulator, the voice agent under test, a tool executor with a per-scenario database, validators, and a metrics suite.
The user simulator receives a goal encoded as a decision tree of acceptable outcomes, and a persona specifying speaking style, patience level, and verbosity. It communicates via synthesized audio rather than text, meaning the agent under test processes actual speech. This matters for cascade systems (STT → LLM → TTS) because errors propagate through the pipeline; a confirmation code rendered by TTS in an unusual cadence will challenge the ASR stage downstream.
Each of the 50 scenarios is structured around four artifacts: the user goal decision tree, the user persona, a scenario-specific database that the tool executor draws from, and the ground-truth expected final state of that database after a successful interaction. The scenario isolation is important, since each run has its own backend state and cannot be contaminated by prior runs. The full dataset is available at huggingface.co/datasets/ServiceNow-AI/eva.
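The four artifacts can be pictured as a small record type. The sketch below is illustrative only: the class names, field names, and example values are hypothetical and are not EVA’s actual schema, and it assumes that the leaves of the goal tree are the outcomes the simulated user will accept.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Persona:
    # The three persona attributes named in the text.
    speaking_style: str
    patience: str
    verbosity: str


@dataclass
class GoalNode:
    """One node in the user goal decision tree (hypothetical structure)."""
    description: str
    children: list["GoalNode"] = field(default_factory=list)

    def leaves(self) -> list[str]:
        # Assumption: leaf nodes are the acceptable outcomes.
        if not self.children:
            return [self.description]
        return [leaf for child in self.children for leaf in child.leaves()]


@dataclass
class Scenario:
    goal: GoalNode
    persona: Persona
    initial_db: dict[str, Any]         # scenario-specific backend state
    expected_final_db: dict[str, Any]  # ground truth after a successful run


scenario = Scenario(
    goal=GoalNode("rebook cancelled flight", [
        GoalNode("accept same-day alternative"),
        GoalNode("no same-day option", [
            GoalNode("accept next-morning flight"),
            GoalNode("request refund"),
        ]),
    ]),
    persona=Persona(speaking_style="informal", patience="low", verbosity="terse"),
    initial_db={"bookings": {"K7Q2MX": {"flight": "UA302", "status": "cancelled"}}},
    expected_final_db={"bookings": {"K7Q2MX": {"flight": "UA514", "status": "confirmed"}}},
)

print(scenario.goal.leaves())
```

Keeping the initial and expected database states inside the scenario record is what makes per-run isolation straightforward: each run can start from a fresh copy of `initial_db`.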
After the conversation ends, the framework compares the actual final database state to the expected ground truth. That comparison is the Task Completion score; it is fully deterministic, with no language model in the evaluation path.
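A deterministic check of this kind can be sketched in a few lines. This is an illustration under assumptions, not EVA’s implementation: it treats the database as a nested dictionary, scores an exact match as 1 and anything else as 0, and uses a hypothetical `apply_tool_calls` helper to stand in for the tool executor.

```python
import copy


def apply_tool_calls(db: dict, calls: list[tuple[str, str, str, str]]) -> None:
    """Apply hypothetical (table, key, field, value) updates to db in place."""
    for table, key, field_name, value in calls:
        db[table][key][field_name] = value


def task_completion(actual_db: dict, expected_db: dict) -> int:
    """Return 1 if the final state matches the ground truth exactly, else 0."""
    return int(actual_db == expected_db)


initial_db = {"bookings": {"K7Q2MX": {"flight": "UA302", "status": "cancelled"}}}
expected_db = {"bookings": {"K7Q2MX": {"flight": "UA514", "status": "confirmed"}}}

# Each run works on its own deep copy, so runs cannot contaminate each other.
good_run = copy.deepcopy(initial_db)
apply_tool_calls(good_run, [
    ("bookings", "K7Q2MX", "flight", "UA514"),
    ("bookings", "K7Q2MX", "status", "confirmed"),
])

# The agent changed the flight but never confirmed the rebooking.
bad_run = copy.deepcopy(initial_db)
apply_tool_calls(bad_run, [("bookings", "K7Q2MX", "flight", "UA514")])

print(task_completion(good_run, expected_db), task_completion(bad_run, expected_db))  # 1 0
```

Because the check compares state rather than conversation text, an agent that sounds correct but skips the final write still fails.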
The remaining evaluation uses two kinds of judges. LLM-as-judge handles Faithfulness (detecting hallucinations and policy violations), Conciseness, Conversation Progression, and Turn-Taking. LALM-as-judge, a large audio language model operating directly on the audio stream rather than a transcript, handles Speech Fidelity: whether entities like confirmation codes, dollar amounts, and flight dates were conveyed accurately.
The LALM-as-judge approach for Speech Fidelity is the most technically distinct contribution. Transcribing to text first and then checking entity accuracy with an LLM would miss the audio-level errors that matter most in practice. A TTS system that renders “flight UA 302” in a particular cadence, and an ASR stage that processes it inconsistently depending on speaking rate, may produce identical-looking transcripts while introducing real confusion in how the confirmation number was communicated. Operating at the audio level covers the full class of entity transcription failures that cascade into authentication errors downstream.
The framework is built on top of Pipecat, an open-source voice agent framework, and supports both cascade architectures and audio-native Speech-to-Speech systems. That architecture-agnosticism is meaningful: S2S models and Large Audio Language Models process audio end-to-end without a separate ASR stage, and any evaluation framework that only tested cascade systems would miss an entire class of current architectures.
The Two Scores and What They Reveal
EVA produces two composite scores. EVA-A (Accuracy) aggregates Task Completion, Faithfulness, and Speech Fidelity. EVA-X (Experience) aggregates Conciseness, Conversation Progression, and Turn-Taking.
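The grouping of metrics into composites, together with the kind of judge behind each one, fits in a small lookup table. The sketch below only restates the assignments described above; the identifiers are hypothetical, and how EVA weights the components inside each composite is not specified here, so no aggregation formula is shown.

```python
from enum import Enum


class Judge(Enum):
    DETERMINISTIC = "database state comparison"
    LLM = "LLM-as-judge"
    LALM = "LALM-as-judge"


# metric name -> (composite score, judge type)
METRICS = {
    "task_completion": ("EVA-A", Judge.DETERMINISTIC),
    "faithfulness": ("EVA-A", Judge.LLM),
    "speech_fidelity": ("EVA-A", Judge.LALM),
    "conciseness": ("EVA-X", Judge.LLM),
    "conversation_progression": ("EVA-X", Judge.LLM),
    "turn_taking": ("EVA-X", Judge.LLM),
}


def metrics_in(composite: str) -> list[str]:
    """Names of the metrics that feed a given composite score."""
    return [name for name, (group, _) in METRICS.items() if group == composite]


def metrics_judged_by(judge: Judge) -> list[str]:
    """Names of the metrics scored by a given kind of judge."""
    return [name for name, (_, kind) in METRICS.items() if kind == judge]


print(metrics_in("EVA-A"))
print(metrics_judged_by(Judge.LLM))
```

Laid out this way, it is easy to see that only one of the six metrics is free of model-based judging, and that every EVA-X component depends on an LLM judge.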
After evaluating 20 systems spanning both cascade and audio-native architectures, the EVA authors found a consistent inverse correlation between the two scores. No evaluated system scored well on both axes simultaneously. High-accuracy systems tended toward verbose, mechanical conversations. High-experience systems tended to fail at task completion.
The tradeoff had been visible to practitioners but unmeasured across architectures. Product teams know a system can be reliable without being pleasant, and pleasant without being reliable, but there was no benchmark that quantified both simultaneously across a diverse set of systems. The scatter plot of 20 systems across the two axes, available in the EVA repository, makes the tradeoff frontier concrete rather than intuitive. The practical implication is that where a system sits on that frontier is an architectural decision; there is no current system that escapes it through tuning alone.
Consistency as a Separate Dimension
EVA reports two statistical metrics per scenario. pass@k gives the probability that at least one of k runs succeeds; pass^k gives the probability that all k runs succeed. The EVA experiments use k=3. Together the two statistics bracket a system’s behavior: pass@k is the optimistic bound, showing what the system is capable of, and pass^k is the pessimistic bound, showing what it does reliably.
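Both statistics are simple to compute once every scenario has been run k times. The sketch below assumes exactly k=3 recorded runs per scenario and averages the per-scenario values across scenarios; the scenario names and outcomes are made up for illustration, and EVA’s own estimator may differ in detail.

```python
def pass_at_k(runs: list[bool]) -> float:
    """1.0 if at least one of the k runs succeeded, else 0.0."""
    return float(any(runs))


def pass_hat_k(runs: list[bool]) -> float:
    """1.0 only if every one of the k runs succeeded, else 0.0."""
    return float(all(runs))


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


# Hypothetical task-completion outcomes, three runs per scenario.
results = {
    "rebook_after_cancellation": [True, True, True],
    "standby_list": [True, False, True],
    "compensation_claim": [False, False, True],
    "voluntary_change": [True, True, False],
    "cancel_booking": [False, False, False],
}

capability = mean([pass_at_k(runs) for runs in results.values()])
reliability = mean([pass_hat_k(runs) for runs in results.values()])

print(f"pass@3={capability:.2f}  pass^3={reliability:.2f}  gap={capability - reliability:.2f}")
# pass@3=0.80  pass^3=0.20  gap=0.60
```

Four of the five scenarios succeed at least once, but only one succeeds every time, which is exactly the capable-but-inconsistent profile the gap is meant to expose.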
The gap between them is diagnostic. A system with pass@3 of 0.80 and pass^3 of 0.20 can handle the scenario but fails to do so reliably. Across all 20 evaluated systems, this gap was large, meaning inconsistency is a universal characteristic of current voice agents rather than a property of specific architectures.
Single-run evaluation cannot surface this: one successful run is equally consistent with a system that succeeds 33% of the time and with one that succeeds 95% of the time. The pass@k / pass^k decomposition makes inconsistency measurable without requiring large sample sizes, and the gap between the two values is itself a signal about where engineering effort should go: a wide gap means the system is capable but brittle, which calls for robustness work rather than capability work.
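The two success rates mentioned above make a useful worked example. Assuming runs are independent with a fixed per-run success probability p (a simplifying assumption, not a claim about real systems), pass@k is 1 - (1 - p)^k and pass^k is p^k. The function names below are hypothetical.

```python
def expected_pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent runs succeeds."""
    return 1 - (1 - p) ** k


def expected_pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent runs succeed."""
    return p ** k


for p in (0.33, 0.95):
    at_k = expected_pass_at_k(p, 3)
    hat_k = expected_pass_hat_k(p, 3)
    print(f"p={p:.2f}  pass@3={at_k:.3f}  pass^3={hat_k:.3f}")
# p=0.33  pass@3=0.699  pass^3=0.036
# p=0.95  pass@3=1.000  pass^3=0.857
```

With three runs the two systems separate sharply: the unreliable one shows a gap of roughly 0.66 between the two statistics, the reliable one a gap of roughly 0.14.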
This framing has precedent in code generation benchmarks. The HumanEval benchmark popularized pass@k for evaluating language models on programming tasks, acknowledging that a model which occasionally produces correct code is different from one that reliably does. EVA applies the same reasoning to voice agents, where reliability failures are more consequential: a caller who gets disconnected mid-rebooking because the agent lost track of context experiences a real service failure, not a lost benchmark percentage point.
Scope and Known Limitations
The current dataset covers airline customer service in English: irregular operations rebooking, voluntary flight changes, cancellations, standby list management, and compensation handling. Fifty scenarios, 15 tools. The narrow domain means results may not generalize to healthcare, financial services, or technical support without new scenario sets.
The LLM judge carries same-provider bias. If the judge model and the evaluated model come from the same provider, the judge may score the evaluated outputs more favorably. The EVA paper acknowledges this without fully resolving it, which reflects the state of the field; no LLM-as-judge framework has a complete answer to provider familiarity effects.
The user simulator uses a single commercial TTS provider, potentially favoring ASR systems trained on that provider’s audio distribution. The stated roadmap includes noise robustness testing, multilingual support, and accent diversity, but those are absent from the current release. English-only evaluation with a single TTS voice is a real limitation for any team building agents for diverse caller populations.
What This Means for Teams Building Voice Agents
For most teams, the benchmark results are the most immediate value. The accuracy-experience frontier, quantified across 20 real systems, provides a calibration point that did not exist before. Current systems lean toward one axis or the other; no outlier sits comfortably in the top-right corner of the EVA-A / EVA-X space.
The LALM-as-judge approach and the pass@k / pass^k decomposition are both worth adopting in internal evaluation setups regardless of whether the full EVA pipeline fits a given context. Entity-level audio fidelity measured at the audio layer is the right metric for voice agents handling structured data like booking references and account numbers. Multi-run consistency statistics surface reliability gaps before they become production incidents, which is a better use of evaluation budget than adding more diverse test scenarios to a single-run benchmark.
Plugging a custom system into EVA starts at the repository. The framework’s architecture is designed to be extended, and the Pipecat-based agent interface means the adapter work for most modern voice agent stacks is not prohibitive.