EVA Measures What Voice Agent Benchmarks Have Been Skipping

Most benchmarks for language models evaluate text. You pass in a text prompt, you get back a text response, and you score it against some reference or rubric. That works reasonably well for general-purpose chat models, but it misses nearly everything that matters when a model is answering phone calls.

ServiceNow AI’s EVA (Evaluating Voice Agents) is a framework that tries to fix this by evaluating voice agents the way they actually operate: through full audio conversations, with real tool calls, against measurable end states. The design choices embedded in EVA are worth unpacking carefully, because they surface some problems that won’t get solved by making LLMs smarter.

The Gap Text Benchmarks Cannot See

When you test a voice agent by feeding it text transcripts, you are skipping the entire audio stack. Speech recognition errors, TTS fidelity failures, latency-driven interruptions, and turn-taking mechanics all vanish. The model looks fine on paper; it falls apart on the phone.

EVA addresses this with a bot-to-bot audio architecture. A user simulator, operating through text-to-speech, conducts multi-turn spoken conversations with a voice agent built on Pipecat, an open-source Python framework for real-time voice pipelines. The agent calls tools backed by a deterministic database, and validators confirm the conversation ran correctly before scoring. Everything travels as audio.

The framework handles two broad agent architectures. Cascade systems chain STT (speech-to-text) to an LLM to TTS, passing text between stages. Audio-native systems use speech-to-speech models or Large Audio Language Models (LALMs) that process audio more directly, often piping into TTS only at the output stage. These architectures have meaningfully different failure profiles, which matters when you are trying to understand why a system failed rather than just that it did.

Two Axes of Evaluation

EVA splits scoring into two dimensions: EVA-A for accuracy and EVA-X for experience.

EVA-A covers three things:

Task Completion: deterministic comparison of the expected versus actual database state after the conversation ends. Did the agent actually rebook the flight correctly? Did the correct ancillary services carry over?
Faithfulness: LLM-as-judge evaluation that checks whether agent responses are grounded in the instructions, policies, user inputs, and tool outputs. This catches hallucinated confirmation codes and fabricated policy exceptions.
Speech Fidelity: LALM-as-judge evaluation that checks whether the TTS output faithfully reproduced the intended text, with particular attention to critical entities like flight numbers, amounts, and booking codes.
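
The Task Completion check is the easiest of the three to picture in code. What follows is a minimal sketch, not EVA's implementation: it assumes the database state can be snapshotted as a dictionary of records, and the function names, record layout, and sample booking are all made up for illustration.

```python
from typing import Any

Snapshot = dict[str, dict[str, Any]]  # hypothetical layout: record id -> fields


def diff_state(expected: Snapshot, actual: Snapshot) -> list[str]:
    """List every difference between the expected and actual end state."""
    mismatches = []
    for record_id in sorted(expected.keys() | actual.keys()):
        exp, act = expected.get(record_id), actual.get(record_id)
        if exp is None:
            mismatches.append(f"{record_id}: unexpected record")
        elif act is None:
            mismatches.append(f"{record_id}: missing record")
        else:
            for field in sorted(exp.keys() | act.keys()):
                if exp.get(field) != act.get(field):
                    mismatches.append(
                        f"{record_id}.{field}: expected {exp.get(field)!r}, "
                        f"got {act.get(field)!r}"
                    )
    return mismatches


def task_completed(expected: Snapshot, actual: Snapshot) -> bool:
    """The task passes only if the two snapshots match exactly."""
    return not diff_state(expected, actual)


# Made-up example: the rebooking went through, but the extra bag was dropped.
expected = {"booking-1": {"flight": "XY123", "seat": "12A", "extra_bag": True}}
actual = {"booking-1": {"flight": "XY123", "seat": "12A", "extra_bag": False}}

print(diff_state(expected, actual))
print(task_completed(expected, actual))
```

Because the comparison is exact, a rebooking that lands on the right flight but loses an ancillary service still counts as a failure, which is the behavior the questions above describe.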

EVA-X covers a different set:

Conciseness: whether responses are appropriately brief for spoken delivery. A response that reads fine as text can be exhausting to listen to.
Conversation Progression: whether the agent avoids repetition, retains context, and moves the conversation toward resolution rather than spinning in circles.
Turn-Taking: whether the agent handles timing correctly, neither interrupting the user nor leaving long silences that signal processing problems.
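
To make the Turn-Taking criterion concrete, here is a rough sketch of one way timing could be classified from timestamps. It is an assumption-heavy illustration: the source does not describe how EVA computes this metric, and the `Turn` structure, the `classify_turn` function, and the two-second silence threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    user_end: float     # seconds: when the user stopped speaking
    agent_start: float  # seconds: when the agent started speaking


def classify_turn(turn: Turn, max_gap: float = 2.0) -> str:
    """Label one exchange; max_gap is an arbitrary illustrative threshold."""
    gap = turn.agent_start - turn.user_end
    if gap < 0:
        return "interruption"   # agent spoke before the user finished
    if gap > max_gap:
        return "long_silence"   # agent left the user waiting
    return "ok"


# Made-up timestamps for three exchanges.
turns = [Turn(3.0, 3.4), Turn(8.0, 7.6), Turn(12.0, 15.5)]
labels = [classify_turn(t) for t in turns]
print(labels)
```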

The paper reports that agents performing well on EVA-A tend to score worse on EVA-X, and vice versa. No single system dominates both axes across the 20 configurations tested. This is the kind of finding that text-only benchmarks structurally cannot produce, because they have no experience axis to measure.

Consistency Over Capability

The metric design that deserves the most attention is the distinction between pass@k and pass^k.

pass@k is the probability that at least one of k runs completes the task successfully. pass^k is the probability that all k runs complete it successfully. EVA runs three trials per scenario, so k=3.
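
The two definitions can be sketched as empirical rates over per-scenario trial outcomes. This is a minimal illustration that assumes each scenario's results are stored as a list of booleans; the function names and the toy data are hypothetical, and EVA's own aggregation may differ.

```python
def pass_at_k(results: list[list[bool]]) -> float:
    """Fraction of scenarios where at least one trial succeeded."""
    return sum(any(trials) for trials in results) / len(results)


def pass_hat_k(results: list[list[bool]]) -> float:
    """Fraction of scenarios where every trial succeeded (pass^k)."""
    return sum(all(trials) for trials in results) / len(results)


# Toy data: four scenarios, three trials each (k=3).
results = [
    [True, True, True],     # consistent success
    [True, False, True],    # passes sometimes
    [False, False, True],   # passes once
    [False, False, False],  # never passes
]

print(pass_at_k(results))   # 0.75
print(pass_hat_k(results))  # 0.25
```

If runs were independent with a per-run success probability p, the two quantities would be 1 - (1 - p)^k and p^k. For p = 0.7 and k = 3 that works out to roughly 0.97 versus 0.34, which shows how far apart the numbers can sit for the same underlying agent.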

The gap between these two numbers is large across all configurations. Agents regularly pass at least once in three tries but fail to pass all three. In a benchmark context, this might look like reasonable performance. In deployment, it means your voice agent sometimes rebooks a flight correctly and sometimes does not, with no obvious signal to the user about which kind of conversation they are having.

This consistency problem is not unique to voice agents. It shows up in LLM evaluations broadly, but most benchmarks report single-run pass rates and hide it. The nature of real-time voice interactions makes consistency more urgent than it might be for a coding assistant or a document summarizer, because there is no affordance for the user to retry. They called once. If the agent failed that one time, the task did not get done.

Named Entity Transcription as the Dominant Failure Mode

The EVA authors identify named entity transcription errors as the most common cascade failure. A single misrecognized character in a booking reference or confirmation code is enough to break authentication, which terminates the task regardless of how well the LLM portion performs.

This is a particularly hard problem for cascade architectures. The STT model produces a transcription, and everything downstream trusts it. If the transcription returns a booking code such as JFKL934 with one character wrong, the LLM has no reliable way to tell it apart from the correct code. It sees a plausible string, passes it to the tool, and the tool fails. The conversation then has to be recovered from a failure state that the agent may not even recognize as a transcription error.
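
A small sketch shows why a single misrecognized character matters so much. Everything here is hypothetical: the booking store, the `authenticate` helper, and the misheard variant of the code are invented for illustration, and the similarity measure is Python's standard-library `difflib`, not anything EVA uses.

```python
from difflib import SequenceMatcher

# Hypothetical booking store keyed by confirmation code.
BOOKINGS = {"JFKL934": {"passenger": "A. Example"}}


def authenticate(code: str):
    """Exact-match lookup, as a database-backed tool would perform."""
    return BOOKINGS.get(code)


spoken = "my confirmation code is JFKL934"
heard = "my confirmation code is JFKL984"  # made-up one-character STT error

# Character-level similarity between what was said and what was transcribed.
similarity = SequenceMatcher(None, spoken, heard).ratio()

# The last token is the code the agent would pass to the tool.
code_heard = heard.split()[-1]
lookup = authenticate(code_heard)

print(f"similarity: {similarity:.2f}")
print(f"lookup result: {lookup}")
```

The transcript is more than 90 percent similar to what was actually said, yet the lookup returns nothing. Aggregate transcription accuracy hides exactly the errors that end the task.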

Audio-native architectures can theoretically do better here because they process audio representations more directly, preserving information that gets discarded when audio is quantized to text. Whether they actually do better in practice depends on how well the model learned to handle phonetically ambiguous entities during training. EVA provides the infrastructure to measure this concretely, which is part of what makes it useful as a research platform rather than just a leaderboard.

The Benchmark Domain Problem

The current EVA dataset covers airline customer service: irregular operations rebooking, voluntary itinerary changes, cancellations, same-day standby, and compensation vouchers, across 50 scenarios with 15 tools. The choice is deliberate. Airline conversations are operationally complex, policy-heavy, and involve the kinds of named entities (flight numbers, seat assignments, baggage allowances, booking references) that stress the audio stack hardest.

But 50 scenarios in one domain is a limited surface area. The authors acknowledge this and plan additional domains. The risk with single-domain benchmarks is that systems overfit to domain-specific patterns, either during training or through inference-time prompt engineering. An agent that performs well on airline scenarios may have been tuned specifically for airline conversations; EVA’s current scope cannot detect this.

This is the same tension that has appeared in LLM benchmarks like HELM and BIG-bench: a benchmark that is narrow enough to be tractable tends to get gamed, and a benchmark that is broad enough to resist gaming becomes expensive to run. EVA is currently on the tractable end of that spectrum.

What the Architecture Tells You

The bot-to-bot evaluation loop does something important beyond just generating audio: it makes the evaluation infrastructure itself a research artifact. The user simulator is parameterized by goal, persona, patience level, and speaking style. The tool executor uses dynamic database queries rather than static fixtures. The validators catch malformed conversations and trigger regeneration before they pollute the results.
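
The validate-then-regenerate idea can be sketched in a few lines. This is an illustration under stated assumptions, not EVA's code: the `SimulatedUser` fields mirror the four parameters named above, but the field values, the `run_until_valid` helper, the retry limit, and the stubbed conversation runner are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SimulatedUser:
    goal: str
    persona: str
    patience: str
    speaking_style: str


def run_until_valid(
    user: SimulatedUser,
    run_conversation: Callable[[SimulatedUser], dict],
    is_valid: Callable[[dict], bool],
    max_attempts: int = 3,
) -> tuple[dict, int]:
    """Re-run the conversation until a validator accepts it."""
    for attempt in range(1, max_attempts + 1):
        conversation = run_conversation(user)
        if is_valid(conversation):
            return conversation, attempt
    raise RuntimeError(f"no valid conversation after {max_attempts} attempts")


# Stub runner: the first run is malformed (no turns), the second is usable.
canned = iter([{"turns": 0}, {"turns": 6}])
user = SimulatedUser(
    goal="rebook a cancelled flight",
    persona="frequent flyer",
    patience="low",
    speaking_style="terse",
)
conversation, attempts = run_until_valid(
    user,
    run_conversation=lambda u: next(canned),
    is_valid=lambda c: c["turns"] > 0,
)
print(conversation, attempts)
```

Only conversations that pass validation reach scoring, so a broken simulator run costs a retry instead of contaminating the metrics.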

This is more rigorous than most evaluation setups, where conversations are generated once, artifacts are discarded, and reproducibility depends on saving the right logs. EVA’s architecture means that failures can be diagnosed at the component level: did the STT misread an entity, did the LLM fabricate a policy, did the TTS drop a digit, did the agent interrupt the user at the wrong moment.

The diagnostic metrics in EVA-A and EVA-X are specifically designed for this. They are not aggregated into a single score; they are reported separately so that developers can understand which part of their stack is causing problems. This is a different design philosophy from leaderboard-first benchmarks, which optimize for a single comparable number.

Where This Fits

Voice agents are being deployed fast. Twilio’s Voice Intelligence, Retell AI, Bland AI, and a growing list of others are running millions of spoken conversations. The models underneath are improving quickly. The evaluation infrastructure has been lagging.

EVA is not the first attempt to close this gap. Projects like VoiceBench and various spoken QA datasets have probed audio comprehension in isolation. What EVA adds is the end-to-end framing: the full stack, the tool calls, the database state, the user experience dimensions, and the consistency metrics all evaluated together in a single pass.

The code is on GitHub and the dataset is on Hugging Face. The framework is built on Pipecat, which is actively maintained and supports a range of STT, LLM, and TTS providers, so extending EVA to new architectures should be straightforward.

The findings from the current leaderboard are less interesting than the framework itself. Testing twenty systems in one domain gives you a directional signal, not conclusive rankings. What matters more is that the infrastructure exists to run this evaluation rigorously, and that it is measuring things text benchmarks cannot: whether the agent heard the user correctly, whether it said what it meant to say, and whether it can do the task reliably, not just occasionally.