OpenAI's New Voice Stack: What Changes When the Model Can Reason Mid-Conversation
Source: openai
OpenAI announced a new set of realtime voice models for its API, pitched around three capabilities: reasoning over speech, live translation, and higher-quality transcription. The headline is that voice is no longer a thin wrapper around a text model with a TTS layer bolted on. The Realtime API now exposes a unified speech-in, speech-out path with reasoning behavior closer to what the chat models do on text.
I’ve been building a Discord bot that does voice channel transcription on the side, so I’ve spent more time than I’d like staring at audio buffers, VAD timings, and the gap between what a model hears and what it understands. This release is worth unpacking because the architecture choices behind it affect anyone building voice agents, not just OpenAI’s own demos.
What’s in the box
The Realtime API has been around since late 2024, originally launched with gpt-4o-realtime-preview (release notes). It used a websocket transport and accepted PCM16 audio at 24kHz, streaming model output as audio deltas back to the client. The original model could hold a conversation but had limited reasoning depth and couldn’t translate in any meaningful interpreter sense.
The new generation introduces:
- A realtime model that can reason within a turn before emitting audio, similar to how the o-series reasoning models work for text.
- A dedicated speech-to-speech translation mode that preserves prosody across languages instead of round-tripping through English.
- Updated transcription models that improve over whisper-large-v3 on noisy and accented speech.
The API surface stays close to the existing Realtime spec. You open a websocket to wss://api.openai.com/v1/realtime, send a session.update event with your config, and stream input_audio_buffer.append events with base64-encoded PCM. The server emits response.audio.delta events as the model speaks. What changed is that the session config now accepts reasoning parameters and a translation mode flag.
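// Node "ws" package: the browser WebSocket constructor cannot set headers.
const WebSocket = require('ws');
const ws = new WebSocket('wss://api.openai.com/v1/realtime?model=gpt-realtime', {
  headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` }
});
ws.on('open', () => {
  // Configure the session before streaming any audio.
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      modalities: ['audio', 'text'],
      voice: 'alloy',
      input_audio_format: 'pcm16',
      output_audio_format: 'pcm16',
      // Let the server decide when the user has finished speaking.
      turn_detection: { type: 'server_vad', threshold: 0.5 },
      // Also return a text transcript of the user's audio.
      input_audio_transcription: { model: 'gpt-4o-transcribe' }
    }
  }));
});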
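The streaming half of that flow is mostly base64 bookkeeping. Here is a minimal sketch of the two conversions, assuming a little-endian host (true of essentially every machine you will run Node on) and 16-bit PCM samples. The helper names toAppendEvent and fromAudioDelta are mine, not part of any SDK.

```javascript
// Sketch only: helper names are illustrative, not part of any SDK.

// Wrap a chunk of 16-bit PCM samples in an input_audio_buffer.append event.
function toAppendEvent(samples /* Int16Array */) {
  // View the same bytes as a Buffer so they can be base64-encoded.
  const bytes = Buffer.from(samples.buffer, samples.byteOffset, samples.byteLength);
  return { type: 'input_audio_buffer.append', audio: bytes.toString('base64') };
}

// Decode the base64 payload of a response.audio.delta event back into samples.
function fromAudioDelta(event) {
  const bytes = Buffer.from(event.delta, 'base64');
  // Copy into a fresh buffer at offset 0 so the 16-bit view is always aligned.
  const copy = new Uint8Array(bytes);
  return new Int16Array(copy.buffer, 0, copy.byteLength >> 1);
}
```

With the socket from the snippet above, sending a chunk would look like ws.send(JSON.stringify(toAppendEvent(chunk))). The copy in fromAudioDelta is deliberate: Node often hands out Buffers that sit at an odd offset inside a shared pool, and an Int16Array view over such a buffer throws.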
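The turn detection block is where most production headaches live. Server VAD (voice activity detection) decides when the user has started and stopped talking. The threshold is the activation level: set it too low and background noise or speaker echo registers as speech, so the model gets cut off by its own output; set it too high and quiet speakers never trigger a turn. The awkward beat after every sentence is governed separately by the silence window (silence_duration_ms in the turn detection config). The new release tightens the VAD model’s responsiveness, though I’d still recommend client-side VAD for anything latency-sensitive.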
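For tuning, it helps to treat the turn detection settings as one small object you can resend mid-session. A sketch, assuming the server_vad block accepts prefix_padding_ms and silence_duration_ms alongside threshold, which is my reading of the Realtime session schema. Check the field names and defaults against the current API reference before relying on them. The function name and the default values here are illustrative.

```javascript
// Sketch: values are illustrative starting points, not official recommendations.
function turnDetectionUpdate({
  threshold = 0.5,          // activation level, 0..1 (higher = needs louder audio)
  prefixPaddingMs = 300,    // audio kept from just before speech was detected
  silenceDurationMs = 500,  // how much silence ends the user's turn
} = {}) {
  const clamp = (x, lo, hi) => Math.min(hi, Math.max(lo, x));
  return {
    type: 'session.update',
    session: {
      turn_detection: {
        type: 'server_vad',
        threshold: clamp(threshold, 0, 1),
        prefix_padding_ms: prefixPaddingMs,
        silence_duration_ms: silenceDurationMs,
      },
    },
  };
}
```

A noisy call-center line might use something like turnDetectionUpdate({ threshold: 0.7, silenceDurationMs: 400 }). The point is to change one knob at a time and listen, because the threshold and the silence window fail in different ways.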
Reasoning inside a turn
The interesting architectural shift is reasoning. Older realtime models did one-shot generation: audio in, audio out, no intermediate compute budget. That works for chitchat but falls apart for anything that needs multi-step thinking, like “summarize the last meeting and tell me which action items are mine.”
The new model can take a reasoning pause before speaking. In practice that means a noticeable delay (somewhere in the 400-900ms range based on early reports) before the first audio delta arrives on hard queries, but the answer is meaningfully more accurate. You can configure the depth via reasoning.effort on the session, similar to the chat completions reasoning controls.
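Rather than trusting reported numbers, you can measure the pause yourself. This sketch times the gap between the server’s input_audio_buffer.speech_stopped event (emitted when server VAD decides the user is done) and the first response.audio.delta of the reply. The createFirstAudioTimer name and the injectable clock are my own scaffolding.

```javascript
// Sketch: measures time-to-first-audio per turn from the server event stream.
function createFirstAudioTimer(now = () => Date.now()) {
  let startedAt = null;
  const samples = [];
  return {
    // Feed every parsed server event through this.
    onEvent(event) {
      if (event.type === 'input_audio_buffer.speech_stopped') {
        startedAt = now();
      } else if (event.type === 'response.audio.delta' && startedAt !== null) {
        samples.push(now() - startedAt);
        startedAt = null; // only the first delta of each turn counts
      }
    },
    // Milliseconds from end of user speech to first audio, one entry per turn.
    samples: () => samples.slice(),
  };
}
```

Wire it in with ws.on('message', (data) => timer.onEvent(JSON.parse(data))) and compare the distribution across reasoning settings. Note that this figure includes network time in both directions, so it is an upper bound on the model’s own pause.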
This matters because the old workaround was function calling: the realtime model would detect intent, call a tool that ran a reasoning model on text, then read the result back. That pipeline added a full round-trip of latency and routinely broke prosody because the realtime model would change voice tone mid-response. Doing it in one model removes a class of integration bugs.
Translation that isn’t pivot-translation
Most “realtime translation” demos in the last two years used a pipeline: STT to source language text, MT (machine translation) to target language text, TTS in the target language. Google’s Interpreter Mode and Samsung’s Live Translate both work roughly this way (Meta’s SeamlessM4T paper covers the limitations well). The trouble is the model loses everything that isn’t text: tone, hesitation, emphasis, the speaker’s emotional register.
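To make the text bottleneck concrete, here is a toy version of that cascade. All three stages are hypothetical stubs standing in for real STT, MT, and TTS services, written synchronously for brevity where real stages would be network calls. It does no actual translation. It only shows what information survives each handoff.

```javascript
// Toy cascade: every stage is a hypothetical stub, not a real service.

// STT keeps the words and discards how they were said.
function stt(audio) {
  return audio.words;
}

// MT sees only text (the "translation" here is just a language tag).
function mt(text, targetLang) {
  return `[${targetLang}] ${text}`;
}

// TTS has no prosody to work from, so it falls back to a default voice.
function tts(text) {
  return { words: text, prosody: 'neutral' };
}

function cascadeTranslate(audio, targetLang) {
  return tts(mt(stt(audio), targetLang));
}
```

Feed it { words: 'is this seat taken', prosody: 'hesitant' } and the output comes back with prosody 'neutral'. Nothing downstream of stt ever had access to the hesitation, and no amount of tuning the later stages can recover it.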
OpenAI’s new translation mode operates closer to direct speech-to-speech, where the audio representation is translated without a hard text bottleneck. This is the same direction Meta’s SeamlessExpressive went, and the AudioPaLM work from Google before it. The practical result is that a tentative-sounding question in Japanese stays tentative-sounding in English, instead of being rendered in the model’s default neutral voice.
Whether OpenAI is actually doing end-to-end speech translation or a tightly coupled multitask model isn’t documented publicly. Based on the latency profile, I’d guess the latter: a shared encoder with parallel decoding heads for source-language transcription and target-language audio synthesis. That would also explain why the transcription quality improvements ship in the same release.
Transcription, quietly
The transcription model gets less attention but is the most useful update for a lot of existing apps. Whisper has been the open-weight default since 2022 and still ships in basically every voice product, including my Discord bot. The new gpt-4o-transcribe and gpt-4o-mini-transcribe models reportedly improve word error rate (WER) on the FLEURS benchmark, especially for non-English languages (FLEURS paper).
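If you want to check the WER claim against your own audio, the metric is simple to compute: word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. A minimal sketch follows. It splits on whitespace only, whereas real benchmark pipelines also normalize casing and punctuation first.

```javascript
// Word error rate: word-level Levenshtein distance / reference word count.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.trim().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.trim().split(/\s+/).filter(Boolean);
  if (ref.length === 0) throw new Error('reference must contain at least one word');

  // prev[j] = edit distance between the reference prefix so far and hyp[0..j)
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i]; // deleting all i reference words
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // match or substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

For example, wordErrorRate('turn the lights off', 'turn the light off') returns 0.25: one substitution across four reference words. WER can exceed 1 when the hypothesis contains many inserted words.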
The relevant trade-off: Whisper runs on a laptop, and the new transcription models don’t. If you’re building anything that needs to work offline or where audio can’t leave the device, whisper.cpp is still the right answer. For server-side transcription where latency matters more than cost, the new models look like a clear upgrade.
Where this lands in the ecosystem
The realtime voice space has gotten crowded fast. ElevenLabs Conversational AI ships agent infrastructure on top of bring-your-own LLM. Deepgram’s Voice Agent API bundles STT, LLM, and TTS with claimed sub-300ms turn-taking latency. Cartesia’s Sonic focuses on the TTS side with very fast first-token times. Each of these is essentially gluing together best-of-breed components, which gives flexibility but means you own the integration.
OpenAI’s bet is that integration is the bottleneck, and a single model that does everything natively wins on latency and coherence even if any individual component isn’t best-in-class. That’s a defensible position for general-purpose voice agents. It’s a weaker pitch for specialized cases: medical transcription where you need an EHR-tuned model, call-center analytics where you need speaker diarization with named-entity extraction, anything where the output isn’t a conversation.
Practical notes
A few things worth knowing if you’re integrating this:
- Pricing for realtime audio is still significantly higher than text. The audio modality is roughly an order of magnitude more expensive per token than chat completions (pricing page), so streaming long calls gets expensive fast.
- The websocket has a session length cap. Long-running agents need to handle reconnection and context rehydration; the SDKs help but don’t fully abstract this.
- Server VAD is fine for prototypes. For production, consider client-side VAD with Silero or WebRTC’s built-in VAD. You’ll get tighter latency and more control over barge-in behavior.
- Function calling still works in realtime sessions, and you’ll want it. The reasoning model is smart but it can’t read your database.
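To show the shape of client-side VAD, here is a deliberately crude energy gate. It is a stand-in for a real detector such as Silero, not a replacement for one: it measures loudness only and will misfire on music or keyboard noise. The rmsThreshold and hangoverFrames values are placeholders you would tune per microphone.

```javascript
// Energy gate: a simple stand-in for a real VAD, for illustration only.
function createEnergyVad({ rmsThreshold = 500, hangoverFrames = 10 } = {}) {
  let speaking = false;
  let quietFrames = 0;

  // Call once per frame of Int16 samples.
  // Returns 'speech_start', 'speech_end', or null when nothing changed.
  return function processFrame(frame) {
    let sumSquares = 0;
    for (const s of frame) sumSquares += s * s;
    const rms = Math.sqrt(sumSquares / Math.max(1, frame.length));

    if (rms >= rmsThreshold) {
      quietFrames = 0;
      if (!speaking) {
        speaking = true;
        return 'speech_start';
      }
      return null;
    }
    // Require several quiet frames in a row so short pauses don't end the turn.
    if (speaking && ++quietFrames >= hangoverFrames) {
      speaking = false;
      quietFrames = 0;
      return 'speech_end';
    }
    return null;
  };
}
```

The control you gain is in what you do with the two signals. On 'speech_start' while the model is talking, you can stop playback locally at once and send a response.cancel event. On 'speech_end', with server turn detection disabled, you commit the buffer and request a response yourself.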
The pattern I’d recommend for new builds: start with the Realtime API for the conversational layer, keep your tool implementations independent and stateless, and treat the model as a swappable component. The realtime voice space is moving fast enough that whatever you build today will be worth rewriting in eighteen months.
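Here is roughly what “independent and stateless” looks like in code. The sketch assumes function calls arrive as a response.output_item.done event whose item has type function_call with name, call_id, and a JSON-string arguments field, which matches my understanding of the Realtime event schema. The tool itself (get_order_status) and its data are invented for illustration, and real tools would usually be async.

```javascript
// Hypothetical tool registry: each tool is a pure function of its arguments.
const tools = {
  get_order_status: ({ orderId }) => ({ orderId, status: 'shipped' }), // stub data
};

// Turn a completed function-call item into the client events to send back.
// Knows nothing about sockets or models, so it survives a provider swap.
function handleFunctionCall(event, registry = tools) {
  const item = event.item;
  if (event.type !== 'response.output_item.done' || !item || item.type !== 'function_call') {
    return [];
  }
  let output;
  try {
    const tool = registry[item.name];
    if (!tool) throw new Error(`unknown tool: ${item.name}`);
    output = tool(JSON.parse(item.arguments));
  } catch (err) {
    // Report failures to the model instead of crashing the session.
    output = { error: String(err.message) };
  }
  return [
    {
      type: 'conversation.item.create',
      item: { type: 'function_call_output', call_id: item.call_id, output: JSON.stringify(output) },
    },
    { type: 'response.create' }, // ask the model to speak using the tool result
  ];
}
```

The socket code then shrinks to a loop: for each event returned by handleFunctionCall, call ws.send(JSON.stringify(event)). Because the handler returns plain objects instead of sending them, the same registry can be unit-tested offline and reused behind a different voice provider.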