The Accuracy-Experience Tradeoff That Voice Agent Benchmarks Keep Missing
Source: huggingface
Evaluating voice agents is not the same problem as evaluating chatbots, and the tooling has not caught up. Most existing benchmarks measure either task completion in text (did the agent book the flight?) or audio quality in isolation (was the TTS output intelligible?). Neither captures what actually matters when a caller is on the line trying to rebook a missed connection: whether the agent finished the job, said something correct and coherent, and did it in a way that wasn’t exhausting to listen to.
EVA (Evaluating Voice Agents), released by ServiceNow Research in early 2026, is the first framework to score task accuracy and conversational experience together, end-to-end, in audio. The most important result it surfaces feels obvious in retrospect: systems that score highest on task accuracy tend to score lowest on conversational experience, and no system in their benchmark of 20 agents sits comfortably in the top-right corner, high on both axes.
Why End-to-End Audio Evaluation Matters
The standard way to build a voice agent today is the cascade pipeline: speech-to-text converts the caller’s audio to a transcript, an LLM processes that transcript and generates a response, and a text-to-speech engine reads it back. Each component can be evaluated in isolation, and that is usually how it is done. STT gets evaluated on word error rate (WER). The LLM gets evaluated on task completion or similarity scores against reference responses. TTS gets a mean opinion score (MOS).
The problem is that errors compound across the pipeline in ways that per-component benchmarks do not see. A named entity that the STT misheard becomes a hallucination at the LLM layer. An LLM response that looks concise in text becomes awkward when read aloud at 150 words per minute. Turn-taking cues that work in text chat do not translate to audio, where a 300ms pause means something different than it does on a screen.
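To make the read-aloud point concrete, here is a back-of-the-envelope sketch. The 150 words-per-minute rate comes from the paragraph above; the function name and the sample response are illustrative assumptions, not part of EVA.

```python
def spoken_seconds(text: str, words_per_minute: float = 150.0) -> float:
    """Estimate how long a response takes to read aloud at a fixed speaking rate."""
    word_count = len(text.split())
    return word_count / words_per_minute * 60.0


# Hypothetical agent response, used only to illustrate the arithmetic.
response = (
    "I have located your reservation and I can see that your original flight "
    "has departed. I am now going to look for alternative flights on the same "
    "route, and once I have found them I will read each option to you in full."
)

print(f"{len(response.split())} words, about {spoken_seconds(response):.1f} s of audio")
```

A reply that takes a few seconds to skim on a screen can hold the caller on the line several times longer, which is why length is better judged in seconds of audio than in characters.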
Audio-native models complicate things further. Models such as GPT-4o in audio mode, and more broadly the category of systems EVA calls LALMs (Large Audio Language Models), process and generate speech directly without intermediate transcription. Their failure modes are qualitatively different from those of cascade systems, but they have mostly been evaluated with the same text-centric metrics, applied to transcripts of their outputs.
EVA addresses this by running full audio conversations between a user simulator and the agent under test, capturing conversation recordings, transcripts, and tool call logs, and scoring all of it together.
The Five-Component Architecture
The bot-to-bot setup has five pieces.
The user simulator is a conversational AI given a specific goal and persona, operating entirely in audio through a TTS interface. It simulates realistic caller behavior including turn-taking patterns and plausible speech styles. Each scenario includes an explicit user goal (what the caller wants to achieve) and a user persona (how they communicate).
The voice agent is the system being evaluated. EVA uses Pipecat, an open-source Python framework from Daily, as the agent runtime. Pipecat supports both cascade architectures (STT to LLM to TTS) and audio-native pipelines, so the same evaluation harness handles both paradigms without requiring separate scaffolding.
The tool executor provides deterministic responses to API calls. Rather than hitting live airline backends, each scenario ships with a per-scenario database and custom Python functions that answer tool calls predictably. This makes results reproducible and removes external variability from the evaluation loop.
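A deterministic tool executor of this kind might look roughly like the sketch below. Everything in it is a hypothetical illustration rather than EVA's actual code: the database layout, the `get_booking` and `rebook` functions, and the `ToolExecutor` class are assumptions made for the example.

```python
import copy
from typing import Any, Callable, Dict, List

# Hypothetical per-scenario database: one booking and two flights.
SCENARIO_DB: Dict[str, Any] = {
    "bookings": {"XK7TQ2": {"passenger": "A. Rivera", "flight": "UA 1842"}},
    "flights": {"UA 1842": {"seats_left": 0}, "UA 2210": {"seats_left": 3}},
}


def get_booking(db: Dict[str, Any], confirmation_code: str) -> Dict[str, Any]:
    """Look up a booking; no network access and no randomness."""
    booking = db["bookings"].get(confirmation_code)
    if booking is None:
        return {"ok": False, "error": "booking_not_found"}
    return {"ok": True, "booking": dict(booking)}


def rebook(db: Dict[str, Any], confirmation_code: str, new_flight: str) -> Dict[str, Any]:
    """Move a booking to another flight if that flight has a free seat."""
    booking = db["bookings"].get(confirmation_code)
    flight = db["flights"].get(new_flight)
    if booking is None:
        return {"ok": False, "error": "booking_not_found"}
    if flight is None or flight["seats_left"] <= 0:
        return {"ok": False, "error": "no_seats"}
    flight["seats_left"] -= 1
    booking["flight"] = new_flight
    return {"ok": True, "booking": dict(booking)}


TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "get_booking": get_booking,
    "rebook": rebook,
}


class ToolExecutor:
    """Answers tool calls from a private copy of the scenario database."""

    def __init__(self, initial_db: Dict[str, Any]) -> None:
        self.db = copy.deepcopy(initial_db)  # each run starts from the same state
        self.log: List[Dict[str, Any]] = []  # tool call log kept for later scoring

    def call(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        result = TOOLS[name](self.db, **kwargs)
        self.log.append({"tool": name, "args": kwargs, "result": result})
        return result


executor = ToolExecutor(SCENARIO_DB)
print(executor.call("rebook", confirmation_code="XK7TQ2", new_flight="UA 2210"))
```

Because every run works on its own deep copy of the scenario database, two runs that issue the same tool calls see the same answers, and the final `executor.db` can be inspected once the conversation ends.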
Validators check that conversations are complete and that the simulator behaved correctly. Conversations that fail validation are regenerated, ensuring the evaluation set is clean before scoring runs.
The metrics suite is where the framework does its most interesting work.
Two Scores, Six Dimensions
EVA produces two top-level scores: EVA-A for accuracy and EVA-X for experience.
EVA-A has three components. Task completion is deterministic: compare the expected final database state (ground truth) against the actual state after the conversation ends. Either the flight got rebooked correctly or it did not. Faithfulness is evaluated by an LLM judge scanning the conversation for hallucinations, policy violations, fabricated information, and misrepresentations. Speech fidelity is evaluated by a LALM judge specifically checking spoken audio for critical named entities: confirmation codes, flight numbers, dollar amounts, and dates. This last metric exists because cascade systems frequently mangle exactly these tokens, and the damage is invisible if you only evaluate at the text level. An agent can produce a correct internal transcript but mispronounce the confirmation code in a way the caller mishears, and that is a real-world failure that WER benchmarks average away.
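The deterministic task-completion check can be pictured as a comparison of two nested states. The sketch below is an assumption about how such a check could be written, not EVA's implementation; `diff_states`, `task_completed`, and the example states are hypothetical.

```python
from typing import Any, List


def diff_states(expected: Any, actual: Any, path: str = "") -> List[str]:
    """Return human-readable mismatches between two nested dict states."""
    mismatches: List[str] = []
    if isinstance(expected, dict) and isinstance(actual, dict):
        for key in sorted(set(expected) | set(actual), key=str):
            child = f"{path}.{key}" if path else str(key)
            if key not in actual:
                mismatches.append(f"{child}: missing in actual state")
            elif key not in expected:
                mismatches.append(f"{child}: unexpected in actual state")
            else:
                mismatches.extend(diff_states(expected[key], actual[key], child))
    elif expected != actual:
        mismatches.append(f"{path}: expected {expected!r}, got {actual!r}")
    return mismatches


def task_completed(expected: Any, actual: Any) -> bool:
    """Binary outcome: the final state matches the ground truth exactly, or it does not."""
    return not diff_states(expected, actual)


# Hypothetical ground truth and post-conversation state for one scenario.
expected_state = {"bookings": {"XK7TQ2": {"flight": "UA 2210"}}}
actual_state = {"bookings": {"XK7TQ2": {"flight": "UA 1842"}}}

print(task_completed(expected_state, actual_state))
print(diff_states(expected_state, actual_state))
```

Returning the list of mismatches alongside the pass/fail result keeps the score binary while still showing which field the agent got wrong.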
EVA-X also has three components, all handled by LLM judges. Conciseness penalizes responses that are verbose in ways that become taxing to listen to. Conversation progression evaluates whether the agent avoids repetition, maintains context across turns, and drives the conversation toward resolution. Turn-taking scores timing behavior: whether the agent interrupts the caller, leaves excessive silence, or handles barge-in appropriately.
The distinction between EVA-A and EVA-X is not just organizational. They measure fundamentally different agent properties. An agent can complete a task while being insufferable to talk to, or it can be pleasant and natural while failing to actually change the booking. Task-only benchmarks report the former as a success. Experience-only benchmarks might report the latter as acceptable.
The Tradeoff Finding
Across 20 systems, both proprietary and open-source, both cascade and audio-native, the EVA-A versus EVA-X scatter plot shows a consistent inverse correlation. No system lands in the upper-right quadrant. Systems optimized for instruction-following and precise tool use tend to produce verbose, mechanical responses that score poorly on conciseness and conversation progression. Systems with more natural dialogue behavior tend to be less reliable at multi-step tool call sequences.
This is not a calibration artifact. It reflects a genuine tension in how these systems are built. Cascade systems tuned for accuracy are typically prompted to be explicit, to confirm details, and to narrate their reasoning, because that reduces hallucination and task errors. All of that verbosity punishes EVA-X scores. Audio-native models trained on more naturalistic dialogue patterns are better at sounding human but less reliable at structured task execution with multiple tool calls.
The practical implication is direct: if you are building a voice agent and optimizing only on task completion, you may be degrading your product’s usability in ways you are not measuring.
Named Entity Transcription as a Dominant Failure Mode
One of the more concrete findings is that named entity transcription is a top failure driver in cascade systems. Confirmation codes like XK7TQ2, flight numbers like UA 1842, and dollar amounts like $237.50 are exactly the categories where general-purpose ASR models perform worst. These are not failure modes you discover through standard WER benchmarks because WER is averaged across all tokens, and named entities are a small fraction of words but a large fraction of meaning.
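A small worked example shows how WER can look healthy while every critical entity is wrong. This is a sketch under stated assumptions: the sentences are invented for illustration, WER is computed as plain word-level edit distance over whitespace tokens, and `entity_accuracy` is a hypothetical helper that checks for exact token matches.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def entity_accuracy(entities: List[str], hypothesis: str) -> float:
    """Fraction of critical entities that appear verbatim as tokens in the hypothesis."""
    tokens = set(hypothesis.split())
    return sum(entity in tokens for entity in entities) / len(entities)


# Invented 20-word utterance; only the two entity tokens are transcribed incorrectly.
reference = (
    "your new confirmation code is XK7TQ2 and the fare difference is 237.50 "
    "dollars for the flight leaving early tomorrow morning"
)
hypothesis = (
    "your new confirmation code is XK7DQ2 and the fare difference is 237.15 "
    "dollars for the flight leaving early tomorrow morning"
)

print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
print(f"Entity accuracy: {entity_accuracy(['XK7TQ2', '237.50'], hypothesis):.2f}")
```

In this example two substituted words out of twenty give a WER of 0.10, yet entity accuracy is 0.00: both tokens the caller actually needs are wrong.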
The speech fidelity component of EVA-A was added specifically to capture this. The LALM judge listens to the agent’s audio output rather than reading a transcript, which means it catches mispronunciations that ASR would have re-corrected on the way back through the pipeline. That correction step is exactly what creates the invisibility problem: a cascade system mishears XK7TQ2 as XK7T Q2 and feeds the wrong code to the LLM, the LLM uses it, and the TTS reads it back wrong, yet the transcript-level evaluation sees something plausible and does not flag it.
Measuring Consistency with pass@k and pass^k
EVA borrows the pass@k notation from code generation benchmarks like HumanEval, applying it to conversational scenarios run multiple times. Each scenario runs three times (k=3). pass@k measures the probability that at least one of the k runs succeeds. pass^k measures the probability that all k runs succeed.
The gap between these two numbers reveals consistency problems that single-run evaluations miss. A system with a high pass@3 but a low pass^3 is solving scenarios occasionally but not reliably. For a production voice agent handling thousands of calls per day, reliability matters more than peak performance on any single run. The EVA results show a large consistency gap across all evaluated systems, which means that LLM inference stochasticity is a real operational problem rather than a theoretical edge case.
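The two metrics can be estimated directly from per-scenario run outcomes. The sketch below assumes exactly k runs per scenario and uses simple fractions across scenarios; EVA's exact estimator may differ, and the outcomes shown are hypothetical, not results from the benchmark.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]]) -> float:
    """Fraction of scenarios where at least one of the k runs succeeded."""
    return sum(any(runs) for runs in results.values()) / len(results)


def pass_hat_k(results: Dict[str, List[bool]]) -> float:
    """Fraction of scenarios where all k runs succeeded (pass^k)."""
    return sum(all(runs) for runs in results.values()) / len(results)


# Hypothetical outcomes for four scenarios, three runs each (k=3).
results = {
    "rebook_missed_connection": [True, True, True],
    "change_seat": [True, False, True],
    "cancel_with_refund": [False, False, True],
    "add_checked_bag": [False, False, False],
}

print(f"pass@3 = {pass_at_k(results):.2f}")
print(f"pass^3 = {pass_hat_k(results):.2f}")
```

With these made-up outcomes, pass@3 is 0.75 and pass^3 is 0.25. The same gap follows from basic probability: if each run succeeded independently with probability p, pass@k would be 1 - (1 - p)^k and pass^k would be p^k, so p = 0.7 gives roughly 0.97 against 0.34 at k=3.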
Limitations Worth Taking Seriously
The framework includes an honest limitations section. The airline domain has 50 scenarios, all in English, all using a single TTS voice. That is enough to demonstrate the methodology and surface the tradeoff, but it is not enough to make strong claims about generalization to other domains or languages. The user simulator cannot replicate the full range of real caller behavior: genuine disfluencies, emotional states, creative misunderstandings, or callers who leave the expected decision tree in ways nobody planned for.
LLM-as-judge evaluation carries known biases, and the risk increases when the judge model and the evaluated model share the same provider. The EVA authors acknowledge this without fully solving it, which is the honest position given the current state of the field.
The roadmap addresses some of these gaps: prosodic quality assessment (pronunciation, rhythm, expressiveness), robustness testing under noisy conditions and diverse accents, multilingual scenarios, and additional domain datasets with distinct policies. A continuous leaderboard is planned as well.
What This Framework Gets Right
The real contribution of EVA is not the specific metrics but the architectural decision to evaluate the whole pipeline in audio and to score accuracy and experience jointly. That is the gap it fills.
Prior dialogue evaluation work like the DSTC series (Dialogue State Tracking Challenge, running since 2013) focused on state tracking accuracy in text. MultiWOZ and its successors provided richer multi-domain coverage but stayed at the text level. More recent LLM-based agent benchmarks like AgentBench and ToolBench evaluate tool use in text settings. None of these capture the audio modality, the experience dimension, or the compound failure modes that emerge when a full voice pipeline faces a goal-oriented task.
The dataset is available on Hugging Face and the framework code is on GitHub. The airline domain is a reasonable starting point partly because airline rebooking has well-defined policies and state transitions, and partly because voice is still a primary channel for airline customer service, which means the cost of errors is concrete and familiar.
The accuracy-experience tradeoff that EVA reveals is not a problem the framework creates. It is a problem the framework finally makes visible. If you are shipping a voice agent evaluated only on task completion in text, you are flying with instruments that are not measuring the right things.