What Text Benchmarks Miss When You're Testing Voice Agents

Voice agents are in a strange position right now. The underlying language models are capable, speech recognition has become commodity infrastructure, and text-to-speech has reached a quality that would have seemed remarkable just a few years ago. Yet rigorous evaluation of these systems remains poorly solved. Most benchmarks treat the voice layer as a thin wrapper over a text LLM, which means they’re measuring the easy part and ignoring the hard part.

EVA (Evaluating Voice Agents), a framework from ServiceNow AI, addresses this directly by proposing a structured evaluation methodology that treats speech-specific failure modes as first-class concerns rather than afterthoughts.

The Problem With Porting Text Evaluation to Voice

Standard LLM evaluation frameworks like MT-Bench and AlpacaEval assess instruction following and multi-turn conversational quality using text. They’re useful for what they measure, but voice introduces a distinct class of problems that these benchmarks cannot probe.

ASR error propagation is the most consequential one. When a user says “set a timer for two” and the ASR transcribes it as “set a timer for too,” the downstream language model sees an input that looks like ordinary text but no longer carries the user’s meaning. How the agent handles this, whether it proceeds confidently with the wrong interpretation, asks for clarification, or infers intent from surrounding context, is a real competence dimension that no text benchmark surfaces. The agent’s behavior on clean transcripts tells you almost nothing about its behavior on realistic speech.

Turn structure is another gap. Text conversations have clean, discrete turns. Voice conversations don’t. Users interrupt mid-sentence, pause and restart, produce disfluencies like “uh” and “um,” and sometimes begin a request before knowing how to finish it. A voice agent that scores well on text benchmarks can fail badly in production because its context window is being fed inputs it was never evaluated against.

Latency compounds everything. Response generation time directly affects perceived conversational quality. Long latencies change user behavior: users speak again, creating overlapping inputs that require their own handling logic. A text benchmark has no way to represent this, because it doesn’t model the timing dimension of conversation at all.
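As a minimal sketch of bringing timing into a test harness, the snippet below wraps an agent call and flags replies that exceed a latency budget. The `timed_reply` helper, the `TimedReply` record, and the one-second default budget are all illustrative assumptions, not part of EVA; the right budget depends on the deployment.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class TimedReply:
    text: str
    latency_s: float
    over_budget: bool


def timed_reply(
    agent: Callable[[str], str],
    user_text: str,
    budget_s: float = 1.0,  # assumed budget, tune per product
    clock: Callable[[], float] = time.perf_counter,
) -> TimedReply:
    """Call the agent once and record how long the reply took."""
    start = clock()
    text = agent(user_text)
    latency = clock() - start
    # A reply over budget is one where the user may start talking again.
    return TimedReply(text=text, latency_s=latency, over_budget=latency > budget_s)
```

The `clock` parameter exists only so the measurement can be tested deterministically; in a real harness the default monotonic clock is enough.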

What EVA Measures

EVA structures evaluation around several dimensions that together cover the failure modes that matter in production voice deployments.

Task completion is tracked, but with attention to how efficiently the agent arrives at a successful outcome. An agent that completes a booking task after four clarification turns is not equivalent to one that resolves it in two, even if both register as successes on a binary completion metric. The path matters because in voice interactions, every unnecessary turn is a real user experience cost.

Robustness to ASR noise is evaluated by injecting realistic transcription errors into inputs and measuring behavioral degradation. This is more principled than either ignoring transcription quality entirely or testing only on perfect transcripts, which is what most text-centric evaluation pipelines do by default.
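The article does not describe how EVA generates its transcription errors, so the following is only a hypothetical noise model: it swaps words for homophones and occasionally drops a word. The `HOMOPHONES` table, the rates, and the function name are assumptions chosen for illustration; a realistic model would be fitted to the confusion patterns of the ASR system actually in use.

```python
import random
from typing import Optional

# Tiny illustrative confusion table (assumed, not from any ASR system).
HOMOPHONES = {
    "two": ["too", "to"],
    "four": ["for"],
    "eight": ["ate"],
    "their": ["there"],
    "write": ["right"],
}


def inject_asr_noise(
    text: str,
    sub_rate: float = 0.3,
    drop_rate: float = 0.05,
    seed: Optional[int] = None,
) -> str:
    """Return a copy of `text` with simulated transcription errors."""
    rng = random.Random(seed)  # seeded so test runs are reproducible
    out = []
    for word in text.split():
        if rng.random() < drop_rate:
            continue  # simulate a deleted word
        options = HOMOPHONES.get(word.lower())
        if options and rng.random() < sub_rate:
            out.append(rng.choice(options))  # simulate a homophone error
        else:
            out.append(word)
    return " ".join(out)
```

Running the same test utterances through the agent with and without this perturbation gives a first, rough view of how behavior degrades.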

Conversational repair handling, the situations where something went wrong and the user tries to correct it, is also assessed. “No, I meant Tuesday, not Thursday” is a standard example. Handling repair correctly requires tracking prior state, updating appropriately, and not re-litigating the entire conversation from scratch. Most agents handle happy-path conversations reasonably well; repair scenarios are where the gaps appear.
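One way to make the repair requirement testable, assuming the agent exposes its tracked slots as a dictionary (an assumption about the agent, not something EVA is stated to require), is to compare the state before and after the correction. The `repair_handled` name is hypothetical.

```python
from typing import Dict


def repair_handled(
    before: Dict[str, str],
    after: Dict[str, str],
    slot: str,
    expected: str,
) -> bool:
    """True if the corrected slot was updated and every other slot survived."""
    if after.get(slot) != expected:
        return False  # the correction was not applied
    # Everything the user did not correct must be carried over unchanged.
    return all(after.get(k) == v for k, v in before.items() if k != slot)


before = {"service": "haircut", "day": "Thursday", "time": "3pm"}
good = {"service": "haircut", "day": "Tuesday", "time": "3pm"}
restarted = {"day": "Tuesday"}  # agent threw away the rest of the booking

print(repair_handled(before, good, "day", "Tuesday"))       # True
print(repair_handled(before, restarted, "day", "Tuesday"))  # False
```

The second case is the "re-litigating from scratch" failure: the correction lands, but the earlier context is lost.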

Why the LLM-as-Judge Approach Gets Complicated Here

Modern evaluation frameworks frequently use language models as judges, an approach popularized by G-Eval and related work. It scales, it captures semantic quality that surface-level metrics like BLEU miss, and it handles open-ended responses where there’s no single correct answer. For text evaluation, the tradeoffs are well understood.

For voice evaluation, there’s an additional wrinkle. A judge model that sees only clean, well-formed text has no view of the input the voice agent was actually reasoning about. If an agent received a noisy transcript and produced a hedged or clarifying response, a text-only judge might rate that response poorly compared to a more confident one, without any awareness that the hedging was the correct behavior given what the agent actually received.

This means that voice evaluation pipelines need either judges that understand speech-specific context explicitly, or evaluation designs that make the noise conditions visible to the judge. Neither is trivial. The former requires judge models with specific training or prompting around ASR behavior. The latter requires careful instrumentation of the evaluation harness to surface inputs alongside outputs.
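The second option, making noise conditions visible to the judge, can be sketched as a prompt builder that places the reference utterance and the ASR transcript side by side. The prompt wording, the field names, and the 1-to-5 scale are assumptions for illustration; this is not EVA's judge prompt.

```python
def build_judge_prompt(
    spoken_reference: str,
    asr_transcript: str,
    agent_response: str,
) -> str:
    """Assemble a judge prompt that shows what the agent actually received."""
    noisy = spoken_reference.strip().lower() != asr_transcript.strip().lower()
    note = (
        "The transcript differs from what the user said, so a clarifying "
        "or hedged reply may be the correct behavior."
        if noisy
        else "The transcript matches what the user said."
    )
    return "\n".join(
        [
            "You are grading a voice agent's reply.",
            f"What the user said: {spoken_reference}",
            f"What the agent received (ASR transcript): {asr_transcript}",
            f"Note: {note}",
            f"Agent reply: {agent_response}",
            "Rate the reply from 1 to 5 given the transcript the agent saw.",
        ]
    )
```

This only works when the harness records the reference utterance and the transcript for every turn, which is the instrumentation cost mentioned above.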

The Historical Gap This Is Filling

The challenge of evaluating dialogue systems isn’t new. MultiWOZ, released in 2018, became a standard for task-oriented dialogue evaluation and drove significant progress on information-seeking and booking tasks. But MultiWOZ assumes text input and clean turn boundaries. It was designed for chat interfaces, not voice.

SUPERB addressed speech model evaluation but focused on isolated acoustic tasks: ASR accuracy, speaker identification, emotion recognition from audio. It wasn’t designed to evaluate an agent engaged in multi-turn task completion over a real conversation.

The gap between dialogue benchmarks and speech benchmarks has been present for years. Voice agents have been deployed into production, particularly in customer service and smart home contexts, without benchmarks that reflect what “performing well” means in those deployments. Teams have shipped systems that score well on whatever text evaluation they use in development and then encountered failure modes in production that the evaluation never covered.

EVA is a direct response to that pattern.

Practical Implications for Developers

For developers building voice agents, a few things follow from the EVA framing.

First, the evaluation harness used during development should match the conditions that exist in production. Testing on clean text transcripts optimizes for a condition that won’t occur in real user interactions. Building ASR error injection into the test pipeline early, even with simple noise models, is worth the setup cost because it changes which failure modes you see.
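A rough way to see the effect of the first point is to run the same cases with and without a perturbation and compare completion rates. In this sketch the agent is reduced to a function from utterance to a predicted intent string, and `robustness_gap` is a hypothetical name; both are simplifying assumptions, since a real harness would drive full multi-turn conversations.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str]  # (user utterance, expected intent)


def completion_rate(
    agent: Callable[[str], str],
    cases: List[Case],
    perturb: Callable[[str], str] = lambda s: s,
) -> float:
    """Fraction of cases where the agent returns the expected intent."""
    if not cases:
        raise ValueError("need at least one test case")
    hits = sum(agent(perturb(utt)) == expected for utt, expected in cases)
    return hits / len(cases)


def robustness_gap(
    agent: Callable[[str], str],
    cases: List[Case],
    perturb: Callable[[str], str],
) -> float:
    """Clean completion rate minus completion rate under perturbation."""
    return completion_rate(agent, cases) - completion_rate(agent, cases, perturb)
```

A large gap means the clean-transcript score is overstating how the agent will behave on real speech.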

Second, clarification question frequency is worth measuring explicitly. Agents that over-clarify frustrate users. Agents that under-clarify make confident errors. The right balance is domain-dependent and user-expectation-dependent, but you can’t tune it without measuring it. A simple proxy: how often does a successful task completion require more turns than the theoretical minimum for that task?
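Both numbers can be computed from conversation logs. The sketch below assumes each logged conversation has already been annotated with a success flag, a count of agent turns, a count of clarifying turns, and a per-task minimum; the `Conversation` record and its fields are hypothetical, and deciding what counts as a clarifying turn is left to the annotation step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Conversation:
    success: bool
    agent_turns: int
    clarifying_turns: int
    min_turns: int  # theoretical minimum for this task (assumed known)


def clarification_rate(convs: List[Conversation]) -> float:
    """Share of all agent turns that were clarification questions."""
    total = sum(c.agent_turns for c in convs)
    if total == 0:
        return 0.0
    return sum(c.clarifying_turns for c in convs) / total


def excess_turn_rate(convs: List[Conversation]) -> float:
    """Share of successful conversations that needed more than the minimum."""
    wins = [c for c in convs if c.success]
    if not wins:
        return 0.0
    return sum(c.agent_turns > c.min_turns for c in wins) / len(wins)
```

Tracking the two together helps separate over-clarifying agents (both rates high) from agents that take extra turns for other reasons.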

Third, conversational repair handling tends to be underweighted in development because it’s hard to test with cooperative testers. Users testing a system generally don’t interrupt themselves, don’t say “wait no” after getting a confirmation, and don’t change their mind mid-request. Real users do all of these things. Synthetic test cases that cover repair scenarios, interruptions, and mid-turn corrections are genuinely valuable work that most development processes skip because there’s no standard benchmark demanding it.
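Synthetic repair cases can be generated from templates rather than collected from testers. The templates and the case format below are illustrative assumptions; real user corrections are more varied, so this is a starting point rather than a substitute for production data.

```python
from itertools import permutations
from typing import Dict, List

# Assumed phrasings for a mid-conversation correction.
REPAIR_TEMPLATES = [
    "No, I meant {new}, not {old}.",
    "Wait, no, make that {new}.",
    "Uh, actually {new} instead of {old}.",
]


def repair_cases(slot: str, values: List[str]) -> List[Dict[str, str]]:
    """Build one test case per (old, new) value pair and template."""
    cases = []
    for old, new in permutations(values, 2):
        for template in REPAIR_TEMPLATES:
            cases.append(
                {
                    "slot": slot,
                    "initial": old,  # value the agent holds before the repair
                    "utterance": template.format(old=old, new=new),
                    "expected": new,  # value the agent should hold afterwards
                }
            )
    return cases


for case in repair_cases("day", ["Tuesday", "Thursday"])[:2]:
    print(case["utterance"], "->", case["expected"])
```

Each case pairs a correction utterance with the slot value the agent should end up holding, so the same list can drive automated checks.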

Where This Sits in the Broader Evaluation Landscape

The current moment in AI evaluation is characterized by an expanding catalog of benchmarks, each addressing a specific capability or failure mode, alongside legitimate skepticism about whether benchmarks remain valid once their contents leak into training data or models are tuned against them. HELM and related holistic evaluation efforts have tried to address benchmark proliferation by unifying evaluation across dimensions, but the focus has been predominantly text-centric.

Voice agent evaluation is at an earlier stage. There’s less consensus on the right dimensions, less standardization around test set construction, and less tooling for running evaluations at scale across realistic voice conditions. EVA represents a foundation rather than a final answer.

The most durable part of frameworks like this tends to be methodological: establishing what dimensions matter and how to operationalize them, even as the specific benchmarks evolve. The argument that voice agents require voice-native evaluation, with speech-specific noise, turn structure handling, and latency as first-class concerns, is the part that will outlast any specific leaderboard or score. The research field has spent years building increasingly sophisticated text evaluation infrastructure, and much of that work has to be rebuilt to apply to spoken conversation. EVA is a meaningful step toward that.