
Translation That Fits: The Timing Problem at the Heart of AI Video Dubbing

Source: openai

Back in March 2026, OpenAI published a case study on how Descript uses OpenAI’s models to scale multilingual video dubbing. The headline framing was that the system optimizes translations for both meaning and timing. That conjunction is doing a lot of work, and it’s worth unpacking why those two goals are in tension at all, and what it takes to satisfy them together.

The core challenge in video dubbing is not translation. Translation, at this point, is largely solved. The hard part is translation that fits.

When a sentence takes 2.3 seconds to say in English, the Spanish or German or Hindi equivalent does not automatically take 2.3 seconds to say. Romance languages are typically 15 to 25 percent more verbose than English by word count. Mandarin can encode more meaning per syllable but has different prosodic rhythm. German compounds words in ways that affect utterance length unpredictably. The natural cadence of a sentence in one language doesn’t map cleanly onto another. This constraint has a name in professional dubbing: isochrony, the requirement that translated speech match the duration of the original.

How Professional Dubbing Studios Handle It

In traditional dubbing, isochrony is a human craft. The translation pipeline involves not just translators but dialogue adapters, sometimes called dubbing writers or lip-sync adapters, whose job is to produce target-language dialogue that fits the timing of the original. They work from the source transcript, a spotting list with timecodes for each segment, and the video itself. They rewrite, compress, expand, and restructure sentences until the translation fits.
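
As a rough illustration of what an adapter works from, a spotting list can be modeled as a sequence of timed segments. This is a minimal sketch, not any studio's or Descript's actual format: the `Segment` class, the `parse_timecode` helper, the speaker label, and the `HH:MM:SS.mmm` timecode layout are all assumptions made for the example (broadcast spotting lists often use frame-based timecodes instead).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One spotting-list entry: who speaks, when, and what they say."""
    speaker: str
    start_s: float  # where the line begins, in seconds from the start of the video
    end_s: float    # where the line ends, in seconds from the start of the video
    text: str

    @property
    def duration_s(self) -> float:
        # The time budget the translated line has to fit into.
        return self.end_s - self.start_s


def parse_timecode(tc: str) -> float:
    """Convert an 'HH:MM:SS.mmm' timecode (assumed format) into seconds."""
    hours, minutes, seconds = tc.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Hypothetical one-line spotting list.
spotting_list = [
    Segment(
        speaker="ANNA",
        start_s=parse_timecode("00:00:12.400"),
        end_s=parse_timecode("00:00:14.500"),
        text="We'll have the report ready by Friday, no question.",
    ),
]

for seg in spotting_list:
    print(f"{seg.speaker}: {seg.duration_s:.1f}s available for {seg.text!r}")
```

The per-segment duration is the number everything else hangs on: it is the budget the adapted line has to hit.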

For close-up shots, they also have to worry about phonetic synchrony: the visible mouth movements should roughly match what’s being said, at least on the sounds that the camera can pick up clearly. Bilabial consonants (p, b, m) are particularly visible; a scene where the actor is clearly saying something starting with a “p” sound will look wrong if the dub has them saying something that starts with an “s”.

This process is expensive and slow. A professional dub into a single language for a feature film costs between $15,000 and $100,000 depending on production quality, and takes weeks. For a YouTube creator or a software company localizing training content, that math doesn’t work.

What Changes With an LLM in the Loop

The shift that Descript’s pipeline represents is moving the isochrony constraint from a human adaptation step into the translation step itself. Instead of translating first and fitting later, the LLM receives the source text along with timing metadata for each segment and is instructed to produce a translation that respects those constraints.

This is a different kind of prompt than asking GPT-4 to translate a document. A segment-aware dubbing prompt might look something like this:

Translate the following English dialogue segment to Spanish.
The spoken duration of the original is 2.1 seconds.
The translation must be speakable in approximately 2.0 to 2.3 seconds at a natural pace.
Prioritize duration fidelity over literal word-for-word translation.
Preserve the speaker's intent and emotional register.

Segment: "We'll have the report ready by Friday, no question."

The model can compress: “El informe estará listo el viernes.” It can expand if the original was unusually dense. It can restructure entirely. And because it has context about what a natural speaking rate sounds like in the target language, it can make reasonable judgments about syllable count as a proxy for duration.
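
The prompt above can be generated per segment from the timing metadata. The sketch below is one way to do that, not Descript's implementation: the function name `build_dubbing_prompt` and the `under_s` and `over_s` window parameters are hypothetical, with defaults chosen only so that a 2.1 second segment reproduces the 2.0 to 2.3 second window in the example.

```python
def build_dubbing_prompt(
    segment_text: str,
    source_lang: str,
    target_lang: str,
    duration_s: float,
    under_s: float = 0.1,  # assumed: how much shorter than the original is acceptable
    over_s: float = 0.2,   # assumed: how much longer than the original is acceptable
) -> str:
    """Render a timing-aware translation prompt for one dialogue segment."""
    low = max(duration_s - under_s, 0.0)
    high = duration_s + over_s
    return "\n".join([
        f"Translate the following {source_lang} dialogue segment to {target_lang}.",
        f"The spoken duration of the original is {duration_s:.1f} seconds.",
        f"The translation must be speakable in approximately "
        f"{low:.1f} to {high:.1f} seconds at a natural pace.",
        "Prioritize duration fidelity over literal word-for-word translation.",
        "Preserve the speaker's intent and emotional register.",
        "",
        f'Segment: "{segment_text}"',
    ])


prompt = build_dubbing_prompt(
    "We'll have the report ready by Friday, no question.",
    source_lang="English",
    target_lang="Spanish",
    duration_s=2.1,
)
print(prompt)
```

Keeping the window asymmetric is a design choice rather than a requirement: a line that runs slightly long can sometimes borrow from the pause that follows it, while a line that runs short leaves visible dead air.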

A rough heuristic used in many systems is that conversational speaking rate is fairly stable within a language: somewhere between 120 and 180 words per minute for most speakers, though the exact figure varies by language and by speaker. Syllable count is more useful than word count because it correlates more directly with phonation time. For a system operating at scale, the LLM generates a candidate translation, the TTS model synthesizes it, and the actual audio duration is measured. If the duration falls outside an acceptable window, the system can loop: send the translation back with the measured duration and ask for an adjustment.

This feedback loop is computationally cheap relative to human labor. Running a few extra TTS inferences per segment to converge on a well-timed translation is a tractable engineering problem.
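
The generate, synthesize, measure, retry loop can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the pipeline from the case study: `fit_translation` is a hypothetical name, the `translate` and `synthesize_duration` callables stand in for real LLM and TTS calls, and the tolerance and retry limit are arbitrary. The two `fake_*` functions exist only so the example runs without any external service.

```python
from typing import Callable, Optional, Tuple

# translate(source_text, target_duration_s, previous_text, previous_duration_s) -> candidate
TranslateFn = Callable[[str, float, Optional[str], Optional[float]], str]
# synthesize_duration(candidate) -> measured audio length in seconds
DurationFn = Callable[[str], float]


def fit_translation(
    source_text: str,
    target_duration_s: float,
    translate: TranslateFn,
    synthesize_duration: DurationFn,
    tolerance_s: float = 0.15,  # assumed acceptable deviation
    max_attempts: int = 4,      # assumed retry budget
) -> Tuple[str, float, bool]:
    """Return (translation, measured_duration_s, within_tolerance)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    previous_text: Optional[str] = None
    previous_duration: Optional[float] = None
    best: Optional[Tuple[float, str, float]] = None  # (error, text, duration)

    for _ in range(max_attempts):
        # Feed the last attempt and its measured duration back to the translator.
        candidate = translate(source_text, target_duration_s,
                              previous_text, previous_duration)
        measured = synthesize_duration(candidate)
        error = abs(measured - target_duration_s)
        if best is None or error < best[0]:
            best = (error, candidate, measured)
        if error <= tolerance_s:
            return candidate, measured, True
        previous_text, previous_duration = candidate, measured

    # Nothing fit the window: fall back to the closest attempt.
    return best[1], best[2], False


# Stand-ins for the real model calls, for demonstration only.
def fake_translate(text, target_s, prev_text, prev_s):
    if prev_text is None:
        return "Tendremos el informe listo para el viernes, sin ninguna duda."
    return "El informe estará listo el viernes."


def fake_tts_duration(text):
    return 0.35 * len(text.split())  # pretend every word takes 0.35 seconds


text, seconds, ok = fit_translation(
    "We'll have the report ready by Friday, no question.",
    2.1, fake_translate, fake_tts_duration,
)
print(text, round(seconds, 2), ok)
```

Returning the closest attempt instead of failing when the budget runs out keeps the pipeline moving and leaves the hard cases for a human to fix later.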

The Voice Cloning Layer

Timing is one axis. Voice is another. A dubbed video where the audio sounds like a generic TTS voice reading a script is still jarring, even if the timing is right. What viewers want is the original speaker’s voice, speaking a language they may never have spoken.

Descript already had significant infrastructure here from their Overdub feature, which lets creators clone their own voice and use it to patch or extend recordings. Extending that to multilingual dubbing means generating speech in a cloned voice in a different language, sometimes called cross-lingual voice transfer.

This is a harder problem than same-language voice cloning. The acoustic characteristics of a voice, its timbre, resonance, and breathiness, transfer across languages reasonably well. But prosody, the patterns of stress, intonation, and rhythm, is deeply language-specific. A voice cloned from an English speaker will carry English prosodic patterns into the target language if the model isn’t careful, producing speech that sounds accented in an unusual and non-native way.

OpenAI’s TTS models support voice customization and have improved significantly in cross-lingual speaker conditioning, though the technical details of how they handle prosody transfer remain largely unpublished. The practical result, based on what Descript ships, is that the output is natural enough for most content use cases without requiring a professional voice actor.

Where This Sits in the Broader Landscape

Descript isn’t the only company in this space. ElevenLabs has a dubbing API that handles end-to-end voice-cloned translation. Papercup and Deepdub have been selling AI dubbing to media companies since around 2020 and 2021, with a focus on broadcast-quality output. Google added auto-dubbing to YouTube for creator content in 2023, initially from English into Spanish, Portuguese, and a handful of other languages.

What differentiates Descript’s approach is integration with an editing workflow rather than a batch-processing pipeline. In Descript, creators work in a text-based editing interface where the transcript is the primary editing surface. That architecture makes it natural to surface dubbed content as just another track alongside the original, editable in the same interface. If the timing on a particular segment is slightly off, a creator can adjust the transcript directly rather than filing a support ticket or re-submitting a job.

This matters for adoption. AI video tools often go unused not because they produce bad output at the median, but because fixing the bad cases requires leaving the tool entirely. Keeping remediation inside the editing environment is a product decision with real workflow consequences.

The Remaining Hard Problems

A few things the current generation of AI dubbing doesn’t handle well are worth naming.

Speaker diarization at scale, meaning the separation of multiple speakers in a single video with overlapping or adjacent speech, is still an error-prone step. Errors there cascade into attribution mistakes in the dubbed audio: a segment attributed to the wrong speaker gets synthesized in the wrong voice.

Emotional consistency is another gap. TTS models can be prompted for tone, but calibrating the intensity of an emotional delivery to match the original speaker, who was genuinely excited or distressed during recording, remains difficult. The dubbed voice tends to be flatter.

There are also edge cases around on-screen text. If a video has captions or lower-thirds in the source language, the dubbing pipeline doesn’t address those automatically. A fully localized video requires visual localization too, which is a separate problem.

None of these are blockers for most content. A developer tutorial, a company all-hands recording, a YouTube explainer: these work well with current AI dubbing because the speech is clear, the stakes are lower, and viewers are tolerant of minor artifacts. Professional broadcast content is a different bar.

The Direction This Points

The Descript and OpenAI collaboration is a useful data point for where AI-powered localization is heading. The constraint that has historically made dubbing expensive is not the translation work, which is fast, but the adaptation work that enforces isochrony and the voice work that requires trained talent. Both of those constraints are softening as LLMs get better at constrained generation and TTS models improve at cross-lingual speaker conditioning.

What remains is the integration problem: surfacing this capability in tools that creators already use, with enough editorial control that the hard cases can be fixed without breaking the workflow. That’s the part Descript is specifically trying to solve, and it’s the right place to solve it.
