Benchmarking Voice Agents Requires More Than a Transcript

The evaluation problem in voice AI has been accumulating debt quietly. We have mature benchmarks for LLM reasoning, code generation, retrieval, and multi-turn dialogue. But a voice agent deployed to rebook a flight or resolve a billing dispute is doing something qualitatively different from a text chatbot, and the tooling for measuring how well it performs that job has not kept pace.

ServiceNow AI Research’s EVA framework is a serious attempt to close that gap. It evaluates conversational voice agents end-to-end, in audio, across both task accuracy and user experience, using a bot-to-bot simulation architecture. The results surface failure modes and tradeoffs that text-level evaluations simply cannot see.

Why Text Benchmarks Fail for Voice

Most existing agent benchmarks operate on text. You feed a model a transcript or a structured task description, it calls tools, and you check the tool call sequence and final state against ground truth. This works well enough for pure reasoning tasks, but it collapses several important distinctions when the agent is deployed as a voice system.

A cascaded voice agent runs three sequential components: a speech-to-text (ASR) model, an LLM for reasoning and tool use, and a text-to-speech model for output. Errors compound at each stage. A flight confirmation code misheard by the ASR model produces a wrong entity in the LLM’s context, which cascades into a failed authentication check, which cascades into a failed task. The LLM never made a reasoning error, but the outcome is indistinguishable from one in which it did. A text-only benchmark would never surface this.
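To make the compounding concrete, here is a minimal sketch of a cascaded turn in which the lookup logic is correct but a single misheard character upstream still fails the task. Every name in it (`fake_asr`, `lookup_booking`, the booking record, the confusion of `B` for `D`) is a hypothetical stand-in, not part of EVA or of any real ASR system.

```python
from typing import Optional

# Hypothetical booking store; the code and passenger are invented for illustration.
BOOKINGS = {"QX7B2K": {"passenger": "A. Rivera", "flight": "SN 204"}}


def fake_asr(spoken: str, confusions: dict) -> str:
    """Stand-in for a speech-to-text model that confuses similar-sounding letters."""
    return "".join(confusions.get(ch, ch) for ch in spoken)


def lookup_booking(code: str) -> Optional[dict]:
    """The reasoning/tool stage: an exact-match lookup, which is logically correct."""
    return BOOKINGS.get(code)


def run_turn(spoken_code: str, confusions: dict) -> str:
    """Run ASR, then the lookup, and report what the caller would experience."""
    heard = fake_asr(spoken_code, confusions)
    record = lookup_booking(heard)
    if record is None:
        return f"auth_failed (heard {heard!r})"
    return f"authenticated {record['passenger']}"


clean = run_turn("QX7B2K", confusions={})
noisy = run_turn("QX7B2K", confusions={"B": "D"})  # 'B' misheard as 'D'
print(clean)
print(noisy)
```

The lookup function is identical in both runs. Only the transcription differs, yet the second run ends in an authentication failure that a transcript-level check of the reasoning stage would not explain.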

Audio-native architectures (speech-to-speech models, and large audio language models piped through TTS) trade the ASR error for different problems: weaker tool use, less reliable structured output, and inconsistent handling of long multi-turn context. Comparing these two architecture families on a text benchmark is not just imprecise; it’s misleading.

There’s also the experience dimension, which text benchmarks ignore entirely. Task completion rates do not show whether the agent asked for unnecessary confirmations, interrupted the user, responded too quickly after a long user turn, or gave a response so verbose it was awkward to listen to. For real deployments, these properties matter as much as accuracy, because users abandon interactions that feel broken even when the agent is technically getting the job done.

The Bot-to-Bot Architecture

EVA sidesteps the cost and variability of human evaluators by using a bot-to-bot simulation setup. A user simulator, a conversational AI with an assigned goal and persona, speaks to the voice agent under test using TTS. The agent responds in audio. Both sides exchange turns until the scenario resolves or fails.
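The turn exchange described above can be sketched as a simple loop. This illustrates the control flow only, under the assumption that each side can be modeled as a callable that takes the other side’s last utterance. The function names, the `max_turns` cap, and the `DONE` sentinel are hypothetical and are not EVA’s API; in the real setup the payloads are audio rather than strings.

```python
from typing import Callable

Speaker = Callable[[str], str]  # hypothetical: takes the last utterance, returns a reply


def run_conversation(user_sim: Speaker, agent: Speaker, opening: str,
                     max_turns: int = 10) -> list:
    """Alternate turns until the simulator signals completion or the cap is hit."""
    transcript = [("user", opening)]
    utterance = opening
    for _ in range(max_turns):
        reply = agent(utterance)
        transcript.append(("agent", reply))
        utterance = user_sim(reply)
        transcript.append(("user", utterance))
        if utterance == "DONE":  # hypothetical end-of-scenario sentinel
            break
    return transcript


# Toy stand-ins so the loop can be exercised without any audio stack.
def toy_agent(heard: str) -> str:
    return "Your flight is rebooked." if "rebook" in heard else "How can I help?"


def toy_user(heard: str) -> str:
    return "DONE" if "rebooked" in heard else "Please rebook my flight."


log = run_conversation(toy_user, toy_agent, opening="Hello")
for speaker, text in log:
    print(f"{speaker}: {text}")
```

The turn cap matters in practice: without it, a simulator and an agent that keep asking each other for clarification would never terminate, and the scenario needs to be scored as a failure instead.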

This is a pragmatic design choice with real costs. The TTS voice of the simulator does not produce the accent variation, background noise, or disfluencies of real users. The simulator’s conversational behavior, while persona-conditioned, is more predictable than a real person with genuine stakes in the outcome. EVA’s authors acknowledge this directly: the user simulator may not perfectly replicate real user behavior, and the use of a single commercial TTS provider may bias ASR results.

The benefit is scale and reproducibility. Human-in-the-loop evaluation is expensive, slow to parallelize, and produces noisy signal. Bot-to-bot evaluation lets you run 50 scenarios across 20 systems with k=3 trials each in a reasonable amount of time, and the results are repeatable enough to compare across configurations. EVA reports results from exactly this kind of comparison: 20 systems tested, including cascade architectures built on Pipecat as well as audio-native models.

Two Axes, Six Metrics

EVA splits evaluation into two independently scored dimensions: EVA-A for accuracy and EVA-X for experience.

EVA-A covers three sub-metrics. Task completion is fully deterministic, checking the database end-state against ground truth to give a binary pass or fail. Faithfulness uses LLM-as-judge to detect hallucinations, fabricated policies, and confabulated details. Speech fidelity adds a judge that is itself a large audio language model (LALM-as-judge), which evaluates the audio output directly, checking whether critical entities like confirmation codes, dollar amounts, and flight numbers were spoken correctly. This last metric is important because a TTS model can slur or mispronounce a string that the LLM produced correctly, and that error is invisible to any text-level evaluation.
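The deterministic part of EVA-A is easy to picture. The sketch below assumes the database end-state and the ground truth can both be represented as nested dictionaries, which is an assumption made for illustration; the record layout and field names are hypothetical, not EVA’s schema.

```python
def task_completed(final_state: dict, expected_state: dict) -> bool:
    """Binary pass/fail: the end-state must match ground truth exactly."""
    return final_state == expected_state


def state_diff(final_state: dict, expected_state: dict) -> dict:
    """Map each differing top-level key to (actual, expected) to help debug a failure."""
    keys = final_state.keys() | expected_state.keys()
    return {
        k: (final_state.get(k), expected_state.get(k))
        for k in sorted(keys)
        if final_state.get(k) != expected_state.get(k)
    }


# Hypothetical ground truth for a rebooking scenario.
expected = {"QX7B2K": {"flight": "SN 310", "seat": "14C", "status": "confirmed"}}
# The agent rebooked the flight but left the wrong seat in place.
actual = {"QX7B2K": {"flight": "SN 310", "seat": "22A", "status": "confirmed"}}

passed = task_completed(actual, expected)
diff = state_diff(actual, expected)
print(passed, diff)
```

Because the check is a plain equality on state, it needs no judge model and gives the same verdict every time it is run on the same end-state. A partially correct run, as in this example, still counts as a fail.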

EVA-X covers conciseness, conversation progression, and turn-taking, all evaluated by LLM-as-judge. Conciseness measures whether responses are appropriately compressed for spoken delivery. Conversation progression tracks whether the agent is moving the task forward and retaining context. Turn-taking checks response timing relative to user turn length.

The LLM-as-judge components inherit the standard concerns about that methodology: potential biases toward verbosity, sensitivity to prompt framing, and inconsistency across model versions. EVA addresses this partially by selecting the best-performing judge model per metric on curated evaluation datasets, rather than using a single model for everything. That’s a thoughtful design, though it introduces its own question about how judge model selection interacts with the systems being evaluated.

The Accuracy-Experience Tradeoff

The most consequential finding from EVA’s evaluation of 20 systems is that task accuracy and user experience trade off against each other in practice. Agents that perform well on EVA-A tend to deliver worse EVA-X scores. Agents that optimize for conversational quality often sacrifice task completion. No single configuration dominates both axes.
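“No single configuration dominates both axes” has a precise reading: no system is at least as good as every other system on both EVA-A and EVA-X. A minimal sketch of that dominance check follows. The system names and scores are invented placeholders used only to exercise the code; they are not EVA results.

```python
def dominates(a: tuple, b: tuple) -> bool:
    """a dominates b if it is at least as good on both axes and not identical."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b


def pareto_front(scores: dict) -> list:
    """Return the names of systems that no other system dominates."""
    return sorted(
        name for name, s in scores.items()
        if not any(dominates(other, s) for other in scores.values())
    )


# (accuracy, experience) pairs; placeholder numbers, NOT measured results.
scores = {
    "cautious_cascade": (0.80, 0.45),
    "fluent_s2s": (0.50, 0.85),
    "middle_ground": (0.65, 0.65),
    "weak_baseline": (0.40, 0.40),
}

front = pareto_front(scores)
print(front)
```

When the front contains more than one system, as it does here, choosing among them is a product decision about how much accuracy to give up for experience, not something the benchmark can settle on its own.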

This is not a surprising result if you think about how these systems are tuned. A cautious agent that asks for confirmation at each step, paraphrases user intent back before acting, and requests clarification when uncertain will score better on faithfulness and task completion. That same behavior produces verbose, slow, repetitive conversations that score poorly on conciseness and progression. An agent tuned for natural, efficient conversation will interrupt less often and make more forward-looking assumptions, which increases errors.

The tradeoff mirrors a familiar pattern in retrieval and generation: precision versus recall, coverage versus quality. What EVA makes concrete is that this tradeoff exists at deployment time, not just at training time, and that it is measurable across both dimensions simultaneously.

Named Entity Transcription as a First-Class Failure Mode

Among the dominant failure modes EVA identifies, named entity transcription deserves more attention than it typically gets in discussions of voice agent performance.

A single misheard character in a booking reference, a loyalty number, or a passenger name can cascade into a complete authentication failure. The agent cannot look up the record, cannot verify the traveler, and cannot proceed. From the task completion perspective, this is a total failure. From the reasoning perspective, the LLM may have performed perfectly; the breakdown happened in the ASR layer before the LLM ever saw the input.

This failure mode compounds in the airline domain EVA uses for its benchmark, where identifiers like confirmation codes are alphanumeric strings with no semantic redundancy. There is nothing in the surrounding context that lets the model recover a misheard character. Compare this to a misheard city name, where a city-aware model can infer the correction from partial audio. Named entities in booking systems are deliberately opaque, and ASR models have little signal to fall back on.
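The contrast can be illustrated with fuzzy matching from Python’s standard library. A misheard city name sits close to exactly one entry in a small vocabulary, so it can be snapped back. A misheard confirmation code is equally close to several valid codes, so there is nothing to snap to. The city list, the codes, and the `0.8` cutoff below are hypothetical choices for illustration.

```python
from difflib import get_close_matches

CITIES = ["Boston", "Austin", "Houston", "Denver"]
# Opaque identifiers: valid codes can differ by a single character.
VALID_CODES = ["QX7B2K", "QX7D2K", "QX7E2K"]

# Transcription of "Boston" with one wrong letter.
city_guess = get_close_matches("Bostin", CITIES, n=3, cutoff=0.8)

# Transcription of a code whose fourth character was misheard as 'P'.
code_guess = get_close_matches("QX7P2K", VALID_CODES, n=3, cutoff=0.8)

print(city_guess)          # a single candidate: the correction is unambiguous
print(sorted(code_guess))  # several equally good candidates: no safe correction
```

The string distance is the same in both cases, one character out of six. What differs is how densely valid values are packed around the misheard string, which is why similarity-based repair helps with natural-language entities and not with booking identifiers.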

The fact that EVA’s speech fidelity metric operates at the audio level, not the transcript level, means it can detect the symmetric problem on the output side: the LLM produces the right confirmation code, but the synthesized speech garbles a character and the user hears the wrong one. These are real failure modes in production voice systems that only show up when you evaluate in audio.

Consistency Is the Overlooked Dimension

EVA reports metrics across k=3 trials per scenario using two distinct aggregations: pass@k, the probability that at least one run out of k succeeds, and pass^k, the probability that all k runs succeed.
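With exactly k recorded trials per scenario, both aggregations reduce to simple counts: a scenario counts toward pass@k if any trial passed, and toward pass^k if all trials passed. The sketch below averages those indicators over scenarios. This is one straightforward estimator under that assumption, not necessarily the one EVA implements, and the scenario names and outcomes are invented.

```python
def pass_at_k(trials: dict) -> float:
    """Fraction of scenarios where at least one of the k trials succeeded."""
    return sum(any(runs) for runs in trials.values()) / len(trials)


def pass_hat_k(trials: dict) -> float:
    """Fraction of scenarios where all k trials succeeded (pass^k)."""
    return sum(all(runs) for runs in trials.values()) / len(trials)


# k=3 outcomes per scenario; placeholder data, NOT EVA results.
trials = {
    "rebook_flight": [True, True, True],
    "add_baggage": [True, False, True],
    "cancel_segment": [False, True, False],
    "change_seat": [False, False, False],
}

at_k = pass_at_k(trials)    # 3 of 4 scenarios passed at least once
hat_k = pass_hat_k(trials)  # 1 of 4 scenarios passed every time
print(at_k, hat_k)
```

On this toy data the same set of runs yields 0.75 under one aggregation and 0.25 under the other, which is the shape of the gap the two metrics are designed to expose.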

The gap between these two numbers across all tested systems is large. Even configurations with strong pass@3 scores show weak pass^3 scores, meaning they can succeed but not reliably. For a voice agent deployed in a real call center, this distinction matters enormously. A system that completes a rebooking successfully two out of three times is not a system you can deploy at scale. The variance is the product.
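The two-out-of-three example can be worked through numerically. Assuming each trial is an independent draw with the same success probability p (a simplifying assumption, since real runs of one scenario are likely correlated), pass@k is 1 - (1 - p)^k and pass^k is p^k.

```python
from fractions import Fraction


def expected_pass_at_k(p: Fraction, k: int) -> Fraction:
    """P(at least one of k independent trials succeeds)."""
    return 1 - (1 - p) ** k


def expected_pass_hat_k(p: Fraction, k: int) -> Fraction:
    """P(all k independent trials succeed)."""
    return p ** k


p = Fraction(2, 3)                 # succeeds two times out of three
at_3 = expected_pass_at_k(p, 3)    # 26/27, about 0.96
hat_3 = expected_pass_hat_k(p, 3)  # 8/27, about 0.30
print(at_3, float(at_3))
print(hat_3, float(hat_3))
```

Under these assumptions the same system looks near-perfect on pass@3 and completes all three runs less than a third of the time, so the choice of aggregation alone can flip the conclusion about deployability.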

This is an argument for pass^k as the primary benchmark metric for voice agents in production contexts, rather than pass@k, which measures peak capability. The two numbers tell very different stories about a system, and conflating them obscures the consistency gap that separates experimental performance from operational reliability.

What EVA Does Not Yet Cover

The benchmark is currently limited to 50 English-language airline scenarios. The domain is well chosen for complexity (multi-step workflows with real authentication and ancillary service management), but generalization to healthcare, financial services, or customer support in other languages requires additional scenario sets that do not yet exist.

Prosodic quality is absent from the current metric suite. Nothing measures whether the agent sounds natural, conveys appropriate affect, handles pauses well, or delivers correct content in a robotic-sounding voice. The roadmap includes pronunciation, rhythm, and expressiveness assessment, along with noise robustness and multilingual support.

The requirement for commercial API access also limits who can run the full benchmark. An open evaluation framework that requires proprietary TTS and LLM APIs for its judge components is not fully open in practice.

These are real constraints, and the EVA authors list them clearly. The framework is the first serious attempt at this kind of comprehensive end-to-end voice agent evaluation, with the dataset and source code both publicly available. The value is not that it measures everything, but that it measures the right things together: task completion and conversational quality in a single evaluation run, using audio as the medium rather than text as a proxy for it.