The Hidden Cost of Talking to an LLM: Voice Mode and the Model Tier Problem

Simon Willison noticed something worth paying attention to: ChatGPT’s voice mode is backed by a weaker model than what you get when you type. This is not a bug report or a complaint. It is an observation that, once you start pulling on the thread, reveals a set of trade-offs that run through the entire stack of real-time audio AI.

The distinction matters more than it might seem at first, because voice interfaces have become a common recommendation for accessibility, ambient computing, and hands-free workflows. If the model answering your spoken questions is materially less capable than the one reading your typed ones, that has real consequences for the people relying on it most.

How ChatGPT Voice Mode Actually Works

OpenAI ships two different voice architectures. The original voice mode was a pipeline: Whisper transcribed your speech to text, GPT-4o answered, and a text-to-speech system read the response back. The intelligence layer in the middle was the same model you got in the chat interface.

Advanced Voice Mode, which rolled out broadly in late 2024, replaced this with a native multimodal approach. Instead of converting audio to text before the model sees it, the audio is tokenized directly. The model receives audio tokens as input and emits audio tokens as output, without a transcription step in the middle. This is the same architecture exposed through the Realtime API, which OpenAI released in October 2024.

The model powering this is gpt-4o-realtime-preview, not gpt-4o. Those are different model checkpoints with different training objectives. The Realtime variant is optimized for low-latency streaming conversational audio. Standard GPT-4o is optimized for quality on complex reasoning tasks.

The Latency Constraint Is Load-Bearing

Conversational voice has a hard ceiling on acceptable latency. The general threshold cited in HCI research is around 200-300ms for a response to feel natural. Go much beyond 500ms and the interaction starts feeling broken. Standard chat interfaces routinely take 2-5 seconds to produce a full response, and users accept that because they can watch the text stream in.

Audio does not stream the same way. You cannot play back partial phonemes while the model finishes thinking. The system has to commit to producing audio output at a cadence that feels continuous, which means the model needs to begin generating almost immediately.

This is not just a serving infrastructure problem. It shapes what kind of model you can run. A model that does extensive chain-of-thought reasoning before answering, or that takes multiple passes over a context window, is incompatible with sub-300ms voice response. The Realtime model is likely shallower in its processing, finetuned to respond quickly rather than to reason carefully.

The OpenAI Realtime API documentation notes that audio tokens are significantly more expensive than text tokens: input audio is priced at roughly 40x the cost of text input tokens. This reflects the higher computational load of processing audio, and it also means there is strong economic pressure to run the most efficient model that still clears a quality bar, rather than the most capable model available.

This Pattern Is Not New

Voice assistants have always occupied a lower capability tier than their text counterparts. Siri, Google Assistant, and Alexa all ran smaller, faster models than what the same companies offered through text-based products. The constraints were different then (on-device latency, power budgets, network round trips) but the outcome was the same: the talking interface was weaker.

What changed with the GPT-4o generation is that the gap between voice and text became harder to see. The native audio model sounds fluent and coherent. It handles context well. It does not stumble on ambiguous speech the way older ASR pipelines did. But fluency is not the same as capability, and the surface quality of the conversation can mask the underlying model’s limitations on harder tasks.

Willison’s observation is a useful corrective. Benchmarks for LLMs are almost entirely text-based. When you evaluate GPT-4o on MMLU, GSM8K, or any of the standard reasoning suites, you are evaluating the text checkpoint. The Realtime checkpoint does not have public benchmark numbers, which means there is no easy way to quantify how much capability is being traded away for latency.

What Developers Building Voice Applications Should Know

If you are building on the Realtime API, or designing a product that routes users into voice mode by default, a few things follow from this.

First, task complexity matters more in voice than in text. Asking the model to look something up, summarize a document you have read aloud, or set a reminder, these are tasks where the weaker model probably does fine. Asking it to debug a complex system design, evaluate an argument with multiple premises, or produce precise structured output, those are tasks where the gap between the Realtime checkpoint and the full text model is more likely to surface.

Second, the chained pipeline approach (Whisper to GPT-4o to TTS) that OpenAI moved away from in their consumer product is still available through the API and still uses the full text model. If you need maximum reasoning quality and can tolerate higher latency and the complexity of stitching three services together, that architecture is still a legitimate choice. The latency on that pipeline has improved significantly since 2023, and for many non-conversational use cases it is the better option.

Third, the cost structure of the Realtime API should inform your product design. Audio tokens at that price point make it expensive to run long context windows or process lengthy documents through voice. Applications that require extensive context should push that context in through text system prompts, not through audio, even in a voice-first interface.

The Modality Tier Problem

There is a broader principle buried in this observation. As AI systems expand across modalities, images, audio, video, multimodal combinations, the quality of reasoning may not be uniform across all of them. The model that handles your typed request might be categorically more capable than the one handling your image or your voice, even if both are labeled as the same product.

OpenAI is not unique in this. Google’s Gemini family also has separate variants optimized for different latency and modality requirements. The naming conventions suggest equivalence (Gemini 2.0 Flash, Gemini 2.0 Pro) but the actual capability profiles differ across tasks and modalities.

From a user perspective, this creates a situation where the most accessible interface, speaking rather than typing, may consistently deliver worse results. That asymmetry is worth naming clearly, because product decisions get made on the assumption that the model is the model.

From a developer perspective, it means that choosing a modality is also choosing a capability level, and that choice should be explicit rather than accidental. The routing logic that decides whether a user gets a text response or a voice response is also, implicitly, deciding how much reasoning capacity they get access to.

Willison’s note is short. The implication runs longer than the post.