How to Evaluate a Voice Agent End to End

Voice agents are now serious production systems. Airlines use them for rebooking. Banks use them for account queries. Healthcare companies use them for appointment scheduling. The underlying technology has improved enough that these deployments are not experiments; they are customer-facing infrastructure. The evaluation tooling has not kept pace.

Most voice agent benchmarks were designed for narrower problems. SUPERB measures individual speech processing tasks: ASR accuracy, speaker identification, emotion recognition. VoiceBench tests single-turn QA performance of LLM-based voice assistants. AIR-Bench focuses on audio instruction following. These are useful benchmarks for what they measure, but none of them evaluate a voice agent completing a multi-step task with tool calls, backend state mutations, and a policy it must follow, while also being judged on whether its spoken output is clear and well-timed.

ServiceNow AI’s EVA (Evaluation of Voice Agents) is a framework for end-to-end evaluation of conversational voice agents across both task accuracy and user experience. The core design decision is a bot-to-bot evaluation loop: a user simulator speaks to the agent in audio, the agent responds in audio, and the conversation continues until task completion or failure. The dataset, framework, and code are available under the MIT license.

Why Component-Level Evaluation Falls Short

A cascade voice agent pipeline typically runs speech-to-text, feeds the transcript into an LLM, generates a text response, and then runs text-to-speech on that response. An audio-native architecture might use a speech-to-speech or large audio language model (LALM). Either way, multiple systems are involved, and each one can introduce errors.

Component benchmarks evaluate each stage in isolation, giving you ASR word error rate, LLM task completion on text inputs, and TTS naturalness scores. What they do not tell you is what happens when errors compound across stages. An STT system might misread a flight confirmation code as a phonetically similar string. The LLM receives the garbled transcript, constructs a response based on the wrong code, and the TTS system faithfully reads that wrong answer back to the user. Each component can still score within spec on its own benchmark; the end-to-end system failed the customer.
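To make the compounding concrete, here is a minimal sketch of how a transcript can look good by word error rate and still break the task. The reservation table, the confirmation code, and the utterances are invented for illustration, and the WER function is a plain word-level edit distance rather than any particular benchmark's implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical backend and utterances, invented for this example.
RESERVATIONS = {"B7XQ2D": {"passenger": "A. Rivera", "flight": "XY123"}}

spoken = "please rebook confirmation code B7XQ2D onto the next flight"
transcribed = "please rebook confirmation code D7XQ2D onto the next flight"

wer = word_error_rate(spoken, transcribed)
code_heard = transcribed.split()[4]       # the word after "code"
lookup = RESERVATIONS.get(code_heard)     # what the tool call would find

print(f"WER = {wer:.3f}, reservation found = {lookup is not None}")
# WER = 0.111, reservation found = False
```

One substituted word out of nine is a respectable error rate for the STT stage, but the one word that changed is the only one the tool call depends on.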

The second problem is that component evaluations use text inputs and text outputs, even for systems that operate on audio. A benchmark that sends transcripts to the LLM layer and checks text outputs is measuring a different thing than what happens when the full audio pipeline runs. This matters most for proper nouns: names, confirmation codes, flight numbers, dollar amounts. Human speech introduces ambiguity that transcripts elide, and the places where that ambiguity is most costly are exactly the places that matter most in task completion.

EVA’s Architecture

EVA is built around five components: a user simulator, a voice agent, a tool executor, validators, and a metrics suite.

The user simulator is a conversational AI with a specific goal and persona. It communicates with the agent entirely via audio, using high-quality TTS for its turns. This forces the agent under evaluation to process speech rather than text, which surfaces pipeline errors that text inputs would hide. The validators check that the simulated user behaved faithfully to its goal, keeping the evaluation honest without requiring human annotation.

The voice agent is built with Pipecat, an open-source framework for real-time voice applications. EVA supports both cascade architectures (STT to LLM to TTS) and audio-native architectures (LALM or speech-to-speech models), evaluated on equal footing.

The tool executor provides deterministic responses to tool calls via Python functions backed by per-scenario databases. Each scenario has its own isolated backend state: a specific reservation, a set of available flights, a fare class, a seat inventory. Task completion is evaluated by comparing the expected database state against the actual database state after the conversation ends. The comparison is binary pass/fail, implemented in code, with no LLM judgment involved.
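The state comparison can be pictured with a small sketch. This assumes the per-scenario database can be serialized to a plain dictionary; the function names, table names, and field values below are hypothetical and are not taken from the EVA codebase.

```python
from typing import Any

MISSING = "<missing>"  # placeholder for a key present on only one side


def task_completed(expected: dict[str, Any], actual: dict[str, Any]) -> bool:
    """Binary pass/fail: the final state must match the expected state exactly."""
    return expected == actual


def state_diff(expected: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple]:
    """Top-level keys whose values differ, as (expected, actual) pairs."""
    keys = sorted(expected.keys() | actual.keys())
    return {
        key: (expected.get(key, MISSING), actual.get(key, MISSING))
        for key in keys
        if expected.get(key, MISSING) != actual.get(key, MISSING)
    }


# Hypothetical scenario: the agent rebooked correctly but also issued
# a voucher that the expected end state does not include.
expected_state = {
    "reservation": {"flight": "XY456", "seat": "14C", "status": "confirmed"},
    "vouchers": [],
}
actual_state = {
    "reservation": {"flight": "XY456", "seat": "14C", "status": "confirmed"},
    "vouchers": [{"amount": 100}],
}

print(task_completed(expected_state, actual_state))  # False
print(state_diff(expected_state, actual_state))
```

The pass/fail result stays strictly binary. The diff is only a debugging aid, and it shows why an extra side effect fails a scenario just as a missing one does.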

The EVA dataset covers 50 scenarios in the airline domain, using 15 tools. The scenarios include irregular operations rebooking, voluntary itinerary changes, cancellations, same-day standby requests, compensation voucher issuance, and adversarial user behavior. Each scenario specifies must-have and nice-to-have goals for the simulated user, enabling realistic negotiation behavior rather than a binary pass-or-quit interaction. The ground truth for each scenario was generated with LLM assistance and then reviewed by humans.

The Six Metrics: EVA-A and EVA-X

EVA splits its evaluation into two axes, each with three sub-dimensions.

EVA-A measures accuracy. Task Completion is deterministic: the database either reflects the expected outcome or it does not. Faithfulness is LLM-judged, checking whether the agent fabricated information, violated policy, misrepresented tool results, or hallucinated facts. Speech Fidelity is judged by a large audio language model on the agent’s actual audio output.

EVA-X measures experience. Conciseness evaluates whether the agent’s responses are appropriately brief for spoken delivery, because lengthy text-style responses are difficult to process in audio. Conversation Progression checks for context retention, avoidance of repetition, and forward movement toward task completion. Turn-Taking evaluates timing: whether the agent interrupts the user or leaves excessive silence.
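The text above does not say how Turn-Taking is computed, so the following is only one simple reading of it. It assumes each turn carries start and end timestamps and labels every user-to-agent handoff by the gap between them. The Turn class, the label names, and the two-second silence threshold are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # "user" or "agent"
    start: float  # seconds from the start of the conversation
    end: float


MAX_SILENCE_S = 2.0  # hypothetical threshold, not a value from EVA


def classify_handoffs(turns: list[Turn], max_silence_s: float = MAX_SILENCE_S) -> list[str]:
    """Label each user-to-agent handoff as 'ok', 'interruption', or 'long_silence'."""
    labels = []
    for prev, nxt in zip(turns, turns[1:]):
        if prev.speaker != "user" or nxt.speaker != "agent":
            continue  # only score the moments where the agent takes the floor
        gap = nxt.start - prev.end
        if gap < 0:
            labels.append("interruption")   # agent started before the user finished
        elif gap > max_silence_s:
            labels.append("long_silence")   # agent left the user waiting
        else:
            labels.append("ok")
    return labels


# Invented timestamps for illustration.
conversation = [
    Turn("user", 0.0, 3.0),
    Turn("agent", 3.4, 6.0),    # 0.4 s gap
    Turn("user", 6.5, 9.0),
    Turn("agent", 8.6, 11.0),   # starts 0.4 s before the user stops
    Turn("user", 11.5, 13.0),
    Turn("agent", 16.2, 18.0),  # 3.2 s gap
]

print(classify_handoffs(conversation))
# ['ok', 'interruption', 'long_silence']
```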

The Speech Fidelity metric is the most novel contribution in the framework. EVA feeds the agent’s audio response directly to an LALM judge and asks it to verify whether critical entities (confirmation codes, flight numbers, dollar amounts) were spoken correctly. No prior end-to-end voice agent benchmark evaluated the agent’s spoken output at the audio level; prior benchmarks checked text transcripts at every stage. The EVA paper argues, and the benchmark results support, that transcript-based evaluation misses a systematic failure mode: an agent can produce a phonetically plausible but factually incorrect utterance that passes transcript comparison and fails speech fidelity.

What the Results Show

EVA evaluated 20 systems across both proprietary and open-source models, covering both cascade and audio-native architectures.

The central empirical finding is that accuracy and experience trade off against each other. When results are plotted across the two axes, no system occupies the high-accuracy, high-experience quadrant. Systems optimized for task completion tend to produce longer, more thorough responses that score poorly on conciseness. Systems that produce pleasant, concise audio tend to leave tasks incomplete. The tradeoff has been observed in text-based agent evaluation before, but EVA makes it concrete by measuring both dimensions on the same scenarios in the same evaluation run.

The consistency finding is equally important. EVA runs each scenario three times per system and reports both pass@k, the probability that at least one of k runs succeeds, and pass^k, the probability that all k runs succeed. The gap between these two numbers is large across all evaluated systems. A system might complete a scenario in one of three runs and fail in the other two. That pattern looks acceptable on pass@3; it looks like a reliability problem on pass^3. For production voice agents, pass^k is closer to what users experience over time.
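The difference between the two numbers is easy to see in code. This sketch uses the simplest empirical version, where k equals the number of runs per scenario (three here): a scenario counts toward pass@k if any run succeeded and toward pass^k if every run succeeded. The scenario names and outcomes are invented, and EVA's exact estimator may differ.

```python
def pass_at_k(results: dict[str, list[bool]]) -> float:
    """Fraction of scenarios where at least one run succeeded."""
    return sum(any(runs) for runs in results.values()) / len(results)


def pass_hat_k(results: dict[str, list[bool]]) -> float:
    """Fraction of scenarios where every run succeeded (pass^k)."""
    return sum(all(runs) for runs in results.values()) / len(results)


# Hypothetical outcomes for one system: three runs per scenario.
runs_by_scenario = {
    "irrops_rebooking": [True, True, True],
    "cancellation": [True, False, False],
    "same_day_standby": [False, True, True],
    "voucher_issuance": [False, False, False],
}

print(f"pass@3 = {pass_at_k(runs_by_scenario):.2f}")   # pass@3 = 0.75
print(f"pass^3 = {pass_hat_k(runs_by_scenario):.2f}")  # pass^3 = 0.25
```

In this invented example the same system scores 0.75 on one metric and 0.25 on the other. Only the scenario that succeeded in all three runs counts toward pass^3.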

The dominant failure mode surfaced by Speech Fidelity is named entity transcription. A single character error in a confirmation code cascades into authentication failure. Transcript-based evaluation cannot surface this failure mode because the transcript may be phonetically accurate. Task completion metrics miss it if the conversation ends before the code is used in a tool call. The error is only visible at the audio output layer.

Limitations and What Comes Next

EVA is English-only and limited to a single domain. The user simulator uses one commercial voice provider, which may advantage ASR systems trained on that provider’s audio characteristics. The LLM-as-judge metrics carry the standard biases of that evaluation method, and the risk of systematic bias is heightened when the evaluated model and the judge share a provider. Task completion is binary, with no partial credit for scenarios where the agent completes most but not all requirements.

The roadmap addresses several of these. Planned additions include prosodic quality metrics covering pronunciation, rhythm, and expressiveness; robustness evaluation under noisy conditions and diverse accents; multilingual scenarios; and additional domains with distinct policy structures and named entity profiles. The tooling side is also in development: an error analysis application for systematic failure identification and structured summary generation.

The leaderboard is live at the EVA project site.

The framework does not solve the voice agent evaluation problem, but it frames it correctly. Evaluating a system in text while it operates in audio is not a conservative approximation; it measures something different. Getting evaluation right for voice requires running the full pipeline, in audio, end to end, with real tool calls and real backend state. EVA provides a concrete starting point for that, along with an empirical baseline showing how far current systems still have to go.