Voice Agent Reliability Is Not a Capability Problem

Most voice agent benchmarks evaluate the wrong thing. They measure whether an ASR model transcribes accurately under controlled conditions, whether an LLM reasons correctly over text, whether a TTS system produces intelligible speech. These are legitimate measurements. They are also poor predictors of whether the whole system actually completes a task reliably in production.

This is the central problem that ServiceNow AI’s EVA framework was built to address. Released in March 2026, EVA is an end-to-end evaluation pipeline that has voice agents conduct real audio conversations with a simulated caller and complete real tasks, then measures both what they accomplished and how the conversation felt. The framework’s most important finding is not which architecture wins; it is that all 20 systems tested show a large gap between peak capability and consistent reliability.

The Component Evaluation Trap

The evaluation landscape before EVA was fragmented along architectural lines. AudioBench, VoiceBench, and VoxEval test LLM capabilities in the audio modality, but they treat the model as the unit of evaluation. FD-Bench, Talking Turns, and Full-Duplex-Bench evaluate conversational dynamics like turn-taking and interruption handling in isolation. VoiceAgentBench and CAVA push further by testing tool-calling and instruction-following for voice agents, but they work from text transcripts rather than audio.

None of these test what a voice agent actually has to do. In production, a voice agent takes audio input, runs it through some combination of ASR, LLM reasoning, tool execution, and TTS, and delivers spoken output in real time. Errors at any step do not stay local; they compound. A misheard confirmation code corrupts the authentication step, which blocks every subsequent tool call, which causes the task to fail entirely. Measuring each component in isolation cannot capture this failure mode because the failure only exists in the full pipeline.
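A back-of-the-envelope sketch shows why strong component scores can still add up to a weak pipeline. The stage names and success rates below are illustrative assumptions, not EVA measurements, and the model is deliberately crude: it treats stage failures as independent and assumes every turn needs every stage to succeed.

```python
from math import prod

# Hypothetical per-turn success rates for each stage (illustrative only).
STAGE_SUCCESS = {"asr": 0.97, "llm": 0.95, "tool_call": 0.98, "tts": 0.99}


def turn_success(stages: dict[str, float]) -> float:
    """Probability that every stage succeeds on a single turn."""
    return prod(stages.values())


def conversation_success(stages: dict[str, float], turns: int) -> float:
    """Probability that every turn of the conversation succeeds end to end."""
    return turn_success(stages) ** turns


if __name__ == "__main__":
    print(f"per turn: {turn_success(STAGE_SUCCESS):.3f}")
    print(f"8 turns:  {conversation_success(STAGE_SUCCESS, 8):.3f}")
```

With these made-up numbers, each stage looks strong on its own, yet the chance of a clean turn is about 0.89 and the chance of eight clean turns in a row is roughly 0.41. Real pipelines can recover from some errors, so this is a pessimistic bound, not a prediction.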

EVA’s architecture forces evaluation at the pipeline level. A user simulator generates real audio through TTS and conducts a goal-directed conversation with the voice agent under test. The agent is built on Pipecat, an open-source real-time voice AI framework, and it calls into a deterministic tool executor that simulates a backend. Validators check task completion by comparing the expected and actual database state after the conversation ends. The entire evaluation runs without human annotation in the loop.
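The state-comparison idea can be sketched in a few lines. This is not EVA’s validator; the `DbState` shape, the `StateDiff` record, and the function names are hypothetical, chosen only to show what "compare expected and actual database state" can look like when the backend is deterministic.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical representation: table name -> record id -> field values.
DbState = dict[str, dict[str, dict[str, object]]]


@dataclass
class StateDiff:
    table: str
    record_id: str
    expected: dict[str, object] | None  # None if the record should not exist
    actual: dict[str, object] | None    # None if the record is missing


def diff_states(expected: DbState, actual: DbState) -> list[StateDiff]:
    """List every record that differs between expected and actual state."""
    diffs: list[StateDiff] = []
    for table in sorted(expected.keys() | actual.keys()):
        exp_rows = expected.get(table, {})
        act_rows = actual.get(table, {})
        for record_id in sorted(exp_rows.keys() | act_rows.keys()):
            exp, act = exp_rows.get(record_id), act_rows.get(record_id)
            if exp != act:
                diffs.append(StateDiff(table, record_id, exp, act))
    return diffs


def task_completed(expected: DbState, actual: DbState) -> bool:
    """A run passes only if the final state matches exactly."""
    return not diff_states(expected, actual)
```

Returning the list of differences, and not just a boolean, makes failed runs easier to debug: a missing record, an extra record, and a wrong field all show up as distinct entries.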

pass@k and pass^k: The Metrics That Matter

EVA builds its primary metrics on pass@k from the code evaluation literature. pass@k (with k=3) measures the probability that at least one of three independent runs on a scenario succeeds. pass^k (with k=3) is the stricter counterpart: the probability that all three runs succeed. The gap between these two numbers is a direct measure of consistency: a system with a large gap can sometimes solve a problem but cannot do so reliably.
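When each scenario is run exactly k times, both metrics reduce to simple counting: a scenario counts toward pass@k if any run succeeded and toward pass^k if every run succeeded. The sketch below assumes that setup; the scenario names and outcomes are invented for illustration and are not EVA results.

```python
def pass_at_k(results: dict[str, list[bool]]) -> float:
    """Fraction of scenarios where at least one of the k runs succeeded."""
    return sum(any(runs) for runs in results.values()) / len(results)


def pass_hat_k(results: dict[str, list[bool]]) -> float:
    """Fraction of scenarios where all k runs succeeded (pass^k)."""
    return sum(all(runs) for runs in results.values()) / len(results)


# Hypothetical outcomes for four scenarios, three runs each.
RESULTS = {
    "rebook_after_cancellation": [True, True, True],
    "standby_request": [True, False, True],
    "voucher_redemption": [False, False, True],
    "itinerary_change": [False, False, False],
}

if __name__ == "__main__":
    at_k, hat_k = pass_at_k(RESULTS), pass_hat_k(RESULTS)
    print(f"pass@3={at_k:.2f} pass^3={hat_k:.2f} gap={at_k - hat_k:.2f}")
```

In this toy data the agent can solve three of four scenarios at least once, but only one of four every time. The difference between 0.75 and 0.25 is the consistency gap.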

Across all 20 systems evaluated, this gap is large. That is the finding. It means that the current generation of voice agents has a reliability problem that is distinct from a capability problem. A system might score well on pass@3, demonstrating that it can complete complex airline rebooking scenarios. Its pass^3 score might still be poor, meaning that the same scenario on the same system will frequently fail when run again.

For anyone building production voice agents, this matters more than peak capability numbers. A customer who calls to rebook a flight after a weather cancellation is not well served by a system that handles the scenario correctly only 40% of the time. The benchmark makes this concrete in a way that component-level evaluations cannot.

The borrowing of pass@k from code evaluation is conceptually honest. The HumanEval paper introduced this framing precisely because a model that can produce a correct solution on one attempt out of ten is qualitatively different from one that produces it consistently. Voice agents face the same distinction, but with the added complexity that sources of non-determinism are distributed across the pipeline, not just in the LLM’s sampling.

The IRROPS Domain and Why Naming Is Hard

EVA’s current dataset covers 50 scenarios in the airline irregular operations (IRROPS) domain: flight rebooking, cancellations, standby requests, itinerary changes, voucher handling. The domain was chosen deliberately. IRROPS calls are cognitively demanding for voice agents because they involve temporal reasoning, complex policy constraints, and high-stakes named entities.

That last point deserves attention. A confirmation code like “XQRT7W” is a string of six characters where every character matters. An ASR model that transcribes “XQRT7W” as “XQRF7W” produces a result that looks nearly correct but is completely wrong from a system perspective. Authentication fails, the booking cannot be located, and the agent is stuck. EVA identifies named entity transcription errors as the dominant accuracy failure mode across evaluated systems.
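A small sketch makes the "nearly correct but completely wrong" point concrete. It is not EVA’s Speech Fidelity metric, which relies on an LALM judge; the helper names here are hypothetical, and the position-by-position comparison is just one simple way to show how a similarity score can look reassuring while the lookup still fails.

```python
def normalize_code(text: str) -> str:
    """Uppercase and drop separators so 'xqrt-7w' and 'XQRT7W' compare equal."""
    return "".join(ch for ch in text.upper() if ch.isalnum())


def char_accuracy(expected: str, heard: str) -> float:
    """Fraction of positions that match; any length difference counts as errors."""
    exp, got = normalize_code(expected), normalize_code(heard)
    longest = max(len(exp), len(got))
    if longest == 0:
        return 1.0
    return sum(a == b for a, b in zip(exp, got)) / longest


def entity_matches(expected: str, heard: str) -> bool:
    """What the backend cares about: the code is either right or it is not."""
    return normalize_code(expected) == normalize_code(heard)


if __name__ == "__main__":
    expected, heard = "XQRT7W", "XQRF7W"
    print(f"char accuracy: {char_accuracy(expected, heard):.2f}")
    print(f"usable for lookup: {entity_matches(expected, heard)}")
```

Five of six characters match, so an averaged accuracy score looks healthy, but the exact-match check that authentication depends on fails. Scoring high-stakes entities as all-or-nothing is the design choice that keeps this failure visible.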

The Speech Fidelity sub-metric in EVA’s accuracy dimension uses an LALM (large audio-language model) as a judge specifically for these high-stakes entities: confirmation codes, flight numbers, dollar amounts. This is a sensible decomposition because prosodic quality and semantic accuracy are different evaluation problems. Whether the agent sounds pleasant is a different question from whether it said the right number.

The LALM-as-Judge approach for prosodic quality, however, runs into an alignment problem that the EVA authors flag honestly. Human judgments of speech expressiveness and prosodic quality correlate poorly with LALM judge scores. This is not a minor caveat; it means that the experience-side metrics for speech quality are not yet grounded in what humans actually perceive. The EVA framework names this as an open problem rather than papering over it, which is the right call, but it does mean that the experience axis is better calibrated on conversation structure than on speech quality.

Cascade vs. Audio-Native Architectures

EVA evaluates two architectural families. Cascade architectures chain STT, LLM, and TTS sequentially. Audio-native architectures use a speech-to-speech model or large audio-language model that processes audio more directly. The evaluation framework is designed to be architecture-agnostic; both families plug into the same Pipecat-based interface.

The cascade vs. audio-native tradeoff has been actively debated since GPT-4o’s real-time voice API made audio-native pipelines practically accessible. The intuition for audio-native is that eliminating the ASR step removes a major failure mode: there is no intermediate transcript to get wrong because the model reasons directly over audio. The intuition for cascade is that text-based LLMs have stronger reasoning and tool-use capabilities built up over years of alignment work, and that ASR quality is good enough on clean speech.

EVA does not settle this debate cleanly, and probably no 50-scenario benchmark can. What it does show is that named entity errors remain a significant failure mode for cascade systems, and that audio-native systems have not yet demonstrated uniformly better task completion. The accuracy-experience tradeoff, which EVA documents across all 20 systems, cuts across both architectural families. Agents optimized for task completion tend to score worse on user experience dimensions, and vice versa, regardless of whether they are cascade or audio-native.

Per-Metric Judge Selection

One design decision in EVA worth examining is that LLM-as-Judge evaluation uses different models for different metrics, selected based on performance on a curated calibration set. This avoids a systematic bias that would arise from using a single provider’s model to judge another provider’s output on all dimensions.
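Per-metric selection can be sketched as follows. The selection criterion here, plain agreement rate with reference labels, is an assumption made for illustration; the source does not specify how EVA scores judges on its calibration set, and the judge and metric names below are placeholders.

```python
def agreement(judge_labels: list[int], reference_labels: list[int]) -> float:
    """Fraction of calibration items where the judge matches the reference."""
    if not reference_labels or len(judge_labels) != len(reference_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == r for j, r in zip(judge_labels, reference_labels))
    return matches / len(reference_labels)


def select_judges(
    judge_outputs: dict[str, dict[str, list[int]]],
    reference: dict[str, list[int]],
) -> dict[str, str]:
    """For each metric, pick the candidate judge with the highest agreement."""
    selected: dict[str, str] = {}
    for metric, per_judge in judge_outputs.items():
        ref = reference[metric]
        selected[metric] = max(
            per_judge, key=lambda name: agreement(per_judge[name], ref)
        )
    return selected


# Hypothetical calibration data: metric -> judge -> labels per calibration item.
JUDGE_OUTPUTS = {
    "conciseness": {"judge_a": [1, 0, 1, 0], "judge_b": [1, 0, 1, 1]},
    "conversation_progression": {"judge_a": [0, 1, 1, 0], "judge_b": [1, 1, 1, 0]},
}
REFERENCE = {"conciseness": [1, 0, 1, 1], "conversation_progression": [0, 1, 1, 0]}

if __name__ == "__main__":
    print(select_judges(JUDGE_OUTPUTS, REFERENCE))
```

In this toy data a different judge wins each metric, which is the situation that motivates per-metric selection in the first place. Ties go to whichever judge is listed first, so a real setup would want an explicit tie-breaking rule.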

This is a real problem in the evaluation literature. Early LLM-as-a-Judge work documented position and verbosity biases; subsequent work has reported provider-correlated bias as well. If you always use GPT-4o to evaluate Gemini responses, you might introduce biases that have nothing to do with the actual quality of the responses. EVA’s approach of selecting the best judge per metric is a practical mitigation, though it does make the evaluation setup harder to reproduce exactly as model capabilities shift over time.

The calibration set approach also assumes that there is a ground-truth signal to calibrate against, which is straightforward for task completion (the database state is either correct or not) but more difficult for subjective dimensions like conciseness or conversation progression.

What EVA Does Not Yet Cover

The framework is explicit about its current limitations. Fifty scenarios in one domain in one language is a narrow footprint. The user simulator does not replicate real caller behavior: disfluencies, hesitations, emotional distress, non-native accents. The single commercial TTS voice used for the user simulator may systematically favor ASR systems trained on similar voice characteristics.

The absence of multilingual evaluation is the most significant gap for anyone building voice agents for global deployments. ASR error rates vary substantially across languages, and named entity handling becomes harder in languages with more phonological ambiguity. The EVA roadmap includes new domains and eventually multilingual support, but the current version is English-only.

The 50-scenario dataset also cannot cover the long tail of edge cases that matter in production. Multi-step scenarios with memory requirements spanning many turns, simultaneous constraint satisfaction across ancillary services, and scenarios where policy changes mid-conversation are all harder than what the current benchmark includes. The bot-to-bot pipeline with format conversions may also introduce artifacts that do not reflect real production conditions.

Why the Methodology Matters More Than the Numbers

The practical contribution of EVA is not the specific benchmark numbers; those will change as models improve. The contribution is the methodology: end-to-end audio pipeline evaluation, bot-to-bot simulation, the pass@k versus pass^k decomposition, and the explicit separation of accuracy and experience axes.

For builders, the consistency framing is the most immediately useful. Evaluating voice agents with a single run per scenario will systematically overestimate reliability. Running three independent trials and tracking how often all three succeed gives a much more honest picture of what users will actually experience. That methodology does not require EVA specifically; it can be applied to any voice agent evaluation setup.
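A minimal harness for this looks like the sketch below. It assumes your own evaluation setup can be wrapped in a `run_once(scenario_id) -> bool` callable; that interface, the `make_flaky_agent` stand-in, and the scenario names are all hypothetical and exist only so the example runs on its own.

```python
import random
from typing import Callable


def run_trials(
    run_once: Callable[[str], bool], scenario_ids: list[str], k: int = 3
) -> dict[str, list[bool]]:
    """Run every scenario k times and record each pass/fail outcome."""
    return {sid: [run_once(sid) for _ in range(k)] for sid in scenario_ids}


def summarize(results: dict[str, list[bool]]) -> dict[str, float]:
    """Compare what a single run would report with the k-run metrics."""
    n = len(results)
    return {
        "single_run": sum(runs[0] for runs in results.values()) / n,
        "pass_at_k": sum(any(runs) for runs in results.values()) / n,
        "pass_hat_k": sum(all(runs) for runs in results.values()) / n,
    }


def make_flaky_agent(success_rate: float, seed: int) -> Callable[[str], bool]:
    """Stand-in for a real agent run: succeeds at random with a fixed rate."""
    rng = random.Random(seed)
    return lambda scenario_id: rng.random() < success_rate


if __name__ == "__main__":
    scenarios = [f"scenario_{i:02d}" for i in range(50)]
    results = run_trials(make_flaky_agent(0.7, seed=0), scenarios, k=3)
    for name, value in summarize(results).items():
        print(f"{name}: {value:.2f}")
```

By construction, the all-runs-succeed rate can never exceed the single-run rate, which in turn can never exceed the any-run-succeeds rate. Reporting all three side by side shows how much a single-run evaluation overstates what a repeat caller would experience.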

The EVA leaderboard and the dataset on HuggingFace will become more useful as more systems are evaluated and as the dataset expands. For now, the framework’s value is in establishing two things: end-to-end evaluation catches failure modes that component evaluation misses, and consistency is a problem distinct from capability, one the field has not adequately measured.