EVA: What End-to-End Evaluation Reveals That Component Benchmarks Hide in Voice Agents

The fundamental problem with evaluating voice agents has been structural, not methodological. Every benchmark in the field measures something real, but each one measures it in isolation from the rest of the system. You get STT accuracy figures, LLM task completion rates, and TTS naturalness scores, but none of those numbers predict whether a real conversation will succeed. ServiceNow AI’s EVA framework (GitHub), published March 24, 2026, is the first serious attempt to close that gap with a complete, end-to-end evaluation pipeline.

Why Component Metrics Are Insufficient

The existing benchmark landscape is well-developed for what it covers. MultiWOZ established joint goal accuracy and task success as standard task-oriented dialogue metrics, but it operates entirely on text. The DSTC series introduced speech tracks at DSTC11 and DSTC12, moving toward audio, but the evaluation remained component-level rather than end-to-end. VoiceBench and AudioBench evaluate audio instruction following and speech understanding across many tasks, and they surface something important: the best audio models degrade 15-40% relative to their text counterparts on reasoning tasks. SpeechBench extended this across 20+ task types in 2025 and confirmed the pattern holds broadly.

None of that work measures what happens when you chain all the components together and run a multi-turn task to completion. The reason is practical: end-to-end evaluation requires a lot of infrastructure. You need a realistic user who speaks, an agent that responds in audio, tools the agent can actually call, a way to verify the outcome deterministically, and metrics that cover both whether the task succeeded and whether the conversation was acceptable to experience. Most research groups do not build all of that.

EVA builds all of it.

The Pipeline Architecture

EVA’s design has five components working together. The user simulator combines an LLM with a TTS engine, parameterized with a goal and a persona, so it generates realistic spoken turns rather than typed ones. The voice agent under test is built on Pipecat, an open-source Python framework for real-time voice pipelines. Pipecat supports both cascade architectures (STT to LLM to TTS) and audio-native speech-to-speech models, and EVA evaluates both.

A cascade configuration in Pipecat looks like this:

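from pipecat.pipeline.pipeline import Pipeline

# transport, stt, llm, tts, and llm_context are assumed to be constructed earlier.
pipeline = Pipeline([
    transport.input(),          # Incoming audio from the caller
    stt,                        # Speech-to-Text (e.g., Deepgram)
    llm_context.user(),         # Add the transcribed user turn to the LLM context
    llm,                        # Language model (e.g., GPT-4o)
    tts,                        # Text-to-Speech (e.g., ElevenLabs)
    transport.output(),         # Outgoing audio to the caller
    llm_context.assistant(),    # Record the agent's reply in the LLM context
])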

The third component is a deterministic tool executor backed by a per-scenario database. This is what makes ground-truth verification possible: the tools (rebooking, cancellations, standby lists, compensation) operate against a controlled state, so you can compare the expected final state to the actual one without ambiguity. The last two components, validators and a metrics suite, then run over each completed conversation.
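To make the state comparison concrete, here is a minimal sketch of a deterministic tool executor and an end-state check. It is not EVA's implementation: the `ToolExecutor` class, the `rebook` tool signature, the `task_completed` helper, and the booking and flight values are all hypothetical, chosen only to illustrate the idea.

```python
from copy import deepcopy

# Hypothetical per-scenario database holding a single booking record.
INITIAL_STATE = {
    "bookings": {"ABC123": {"flight": "SN100", "status": "confirmed"}},
}


class ToolExecutor:
    """Applies tool calls to a private in-memory copy of the scenario state."""

    def __init__(self, initial_state):
        self.state = deepcopy(initial_state)

    def rebook(self, booking_code, new_flight):
        booking = self.state["bookings"].get(booking_code)
        if booking is None:
            return {"ok": False, "error": "booking not found"}
        booking["flight"] = new_flight
        return {"ok": True}


def task_completed(expected_state, actual_state):
    # Deterministic check: the final database must match exactly.
    return expected_state == actual_state


# Expected end state for a "move the booking to SN200" scenario.
expected = deepcopy(INITIAL_STATE)
expected["bookings"]["ABC123"]["flight"] = "SN200"

good_run = ToolExecutor(INITIAL_STATE)
good_run.rebook("ABC123", "SN200")

bad_run = ToolExecutor(INITIAL_STATE)
bad_result = bad_run.rebook("ABD123", "SN200")  # one misheard character

print(task_completed(expected, good_run.state))  # True
print(task_completed(expected, bad_run.state))   # False
```

Because every tool mutates only the scenario's own copy of the state, two runs of the same scenario can be compared against the same expected end state without any judgment call.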

The evaluation domain is airline customer service: 50 synthetic scenarios, 15 tools, English only. That scope is modest, which the EVA team acknowledges explicitly, but the architecture is designed to be extensible.

Six Metrics Across Two Dimensions

EVA separates accuracy from experience. The accuracy metrics (EVA-A) cover whether the agent did the right thing; the experience metrics (EVA-X) cover whether the conversation was pleasant to take part in.

On the accuracy side:

Task Completion is deterministic, comparing the expected database end state to the actual one after the conversation finishes.
Faithfulness uses an LLM judge to check whether the agent’s responses are grounded in actual policy and data, flagging hallucinations and policy violations.
Speech Fidelity uses a large audio language model as judge to verify that the agent’s spoken audio correctly conveyed critical named entities: booking codes, flight numbers, dollar amounts.

On the experience side:

Conciseness checks whether responses are appropriately brief for spoken delivery, where verbosity carries a higher cost than in text.
Conversation Progression evaluates context retention, avoidance of repetition, and whether the agent is driving toward resolution.
Turn-Taking checks for premature interruptions and excessive silence, the paralinguistic failures that make a conversation unpleasant even when the task succeeds.

EVA also reports consistency statistics at k=3: pass@k, the probability that at least one of k runs succeeds, and pass^k, the probability that all k runs succeed. The gap between those two numbers tells you something about behavioral consistency that a single-run evaluation cannot.
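The two statistics are easy to compute once each scenario has been run k times. The sketch below uses the simplest empirical estimator, counting scenarios directly, and the outcomes are made up for illustration; EVA's exact estimator may differ, and the function names `pass_at_k` and `pass_hat_k` are my own.

```python
def pass_at_k(runs_per_scenario):
    """Fraction of scenarios where at least one of the k runs succeeded."""
    return sum(any(runs) for runs in runs_per_scenario) / len(runs_per_scenario)


def pass_hat_k(runs_per_scenario):
    """Fraction of scenarios where all k runs succeeded (pass^k)."""
    return sum(all(runs) for runs in runs_per_scenario) / len(runs_per_scenario)


# Made-up task-completion outcomes: four scenarios, three runs each (k=3).
results = [
    [True, True, True],
    [True, False, True],
    [False, False, True],
    [False, False, False],
]

print(pass_at_k(results))   # 0.75
print(pass_hat_k(results))  # 0.25
```

In this toy data the agent can solve three of the four scenarios at least once, but solves only one of them every time, which is exactly the kind of gap a single run would hide.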

What the Results Reveal

Three findings stand out from EVA’s evaluation of 20 systems.

The first is the accuracy-experience tradeoff. Systems that maximize task completion score lower on EVA-X, and systems that optimize for conversational quality score lower on task completion. This tradeoff is invisible to benchmarks that measure only one dimension. An agent tuned to be thorough and cautious will confirm details multiple times, which drives up faithfulness and task completion but gets penalized for conciseness and conversation progression. An agent tuned for fluid, natural conversation may move through the interaction faster in ways that increase error risk. You cannot optimize your way out of this tradeoff without knowing it exists, and component-level metrics do not surface it.

The second finding is that named entity transcription is the dominant failure mode in cascade architectures. A single misheard character in a six-character booking code does not produce a graceful degradation. It cascades: the agent proceeds with a wrong identifier, the tool calls fail or return wrong results, and the conversation unravels. This is qualitatively different from the aggregate accuracy drops that audio benchmarks measure, because those drops are distributed across many output types. Named entity errors are concentrated and catastrophic. They explain why STT benchmark numbers, which report aggregate word error rates, are insufficient predictors of voice agent reliability. A system can have excellent WER on common vocabulary and still fail frequently on the sparse, high-stakes tokens that task-oriented conversations depend on.
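A small worked example shows how far aggregate WER and entity accuracy can diverge. This is an illustrative sketch, not EVA's scoring code: the utterance, the booking code, and the `entity_accuracy` helper are invented for the example, and WER is computed as a plain word-level edit distance.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def entity_accuracy(reference_entities, hypothesis):
    """Share of critical entities that appear verbatim in the transcript."""
    hyp_words = set(hypothesis.split())
    hits = sum(entity in hyp_words for entity in reference_entities)
    return hits / len(reference_entities)


reference = "i need to change my flight the booking code is QX7RT2 thank you"
hypothesis = "i need to change my flight the booking code is QX7RD2 thank you"

wer = word_error_rate(reference, hypothesis)
acc = entity_accuracy(["QX7RT2"], hypothesis)

print(f"WER: {wer:.3f}")          # WER: 0.077
print(f"Entity accuracy: {acc}")  # Entity accuracy: 0.0
```

One wrong word out of thirteen gives a WER under 8%, which looks healthy on a leaderboard, while the only token the task depends on is wrong.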

The third finding is the consistency gap. By definition pass@3 can never be lower than pass^3, so what matters is the size of the gap, and across all 20 systems it was substantial. Peak capability is not behavioral consistency. A system that succeeds on one of three runs is not a reliable system, regardless of what its maximum capability score suggests. This matters for production deployment in ways that single-run evaluations miss entirely.

Where EVA Falls Short

The EVA team is candid about the framework’s limitations. Prosodic quality remains unsolved: the LALM-as-Judge approach showed very low alignment with human judgments on prosody. Automated metrics for speech naturalness beyond named entity accuracy do not yet have validated proxies that correlate with human perception, and EVA flags this as an open problem rather than papering over it.

Task completion is binary, which penalizes partial progress with the same weight as complete failure. Latency is not part of the scoring at all, which is a significant omission for any production deployment where response time affects user experience at least as much as content quality. The 50-scenario scope and English-only constraint limit generalizability. LLM-as-Judge carries the known biases of that approach, including the possibility that same-provider models rate each other’s outputs differently than cross-provider evaluations would.

These are not objections to EVA’s contribution; they are the next set of problems to solve. The framework establishes what end-to-end evaluation requires and provides a concrete baseline to improve against.

What This Means for Building Production Voice Agents

EVA’s findings suggest a few things worth internalizing if you are building a voice agent today.

Your STT component is probably the most consequential single failure point for task-oriented conversations, but not in the way aggregate WER measures. You need to evaluate it specifically on the named entity vocabulary your domain depends on: identifier formats, codes, amounts, proper nouns. Domain-specific fine-tuning or post-processing for these token classes is likely to have higher return than general WER improvement.
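As one example of entity-focused post-processing, the sketch below collapses a spelled-out booking code into characters and rejects anything that does not fit the expected format. Everything here is an assumption for illustration: the six-character alphanumeric format, the `normalize_spelled_code` name, and the choice to handle NATO spelling words and digit words. It is not something EVA or Pipecat provides.

```python
import re

NATO = {
    "alpha": "A", "bravo": "B", "charlie": "C", "delta": "D", "echo": "E",
    "foxtrot": "F", "golf": "G", "hotel": "H", "india": "I", "juliett": "J",
    "juliet": "J", "kilo": "K", "lima": "L", "mike": "M", "november": "N",
    "oscar": "O", "papa": "P", "quebec": "Q", "romeo": "R", "sierra": "S",
    "tango": "T", "uniform": "U", "victor": "V", "whiskey": "W", "xray": "X",
    "yankee": "Y", "zulu": "Z",
}
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
# Assumed format: exactly six uppercase letters or digits.
CODE_PATTERN = re.compile(r"[A-Z0-9]{6}")


def normalize_spelled_code(utterance):
    """Return the code spelled out in the utterance, or None if it does not fit."""
    chars = []
    for token in utterance.lower().replace("-", "").split():
        if token in NATO:
            chars.append(NATO[token])
        elif token in DIGITS:
            chars.append(DIGITS[token])
        elif len(token) == 1 and token.isalnum():
            chars.append(token.upper())
        else:
            return None  # unexpected word: do not guess
    code = "".join(chars)
    return code if CODE_PATTERN.fullmatch(code) else None


print(normalize_spelled_code("quebec x-ray seven romeo tango two"))  # QX7RT2
print(normalize_spelled_code("quebec x-ray seven"))                  # None
```

Returning `None` instead of a best guess lets the agent ask the caller to repeat the code, which is cheaper than proceeding with a wrong identifier.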

The accuracy-experience tradeoff means you need to be explicit about which dimension you are optimizing for at each stage of development. Tuning for task completion without measuring experience will produce an agent that succeeds at tasks in ways users find unpleasant. The reverse produces an agent that feels good to talk to but fails at its job. Neither is acceptable in production, and the tradeoff cannot be resolved without measuring both.

Single-run evaluation is not sufficient for characterizing reliability. The gap between pass@3 and pass^3 in EVA’s results suggests that behavioral consistency requires explicit attention during development. Techniques like temperature control, output normalization, and conversation state management all affect consistency in ways that capability benchmarks do not capture.

EVA is available on GitHub and the full write-up is on HuggingFace. The 50-scenario airline domain is narrow, but the measurement architecture is sound and the failure modes it surfaces are real. Any team building production voice agents should be running something like it.