Translation is a solved problem in the sense that we have models that can convert meaning across languages with reasonable accuracy. Dubbing is not a solved problem, and the difference matters more than most people realize.
Descript recently shared how they use OpenAI models to scale multilingual video dubbing — and the interesting part of that writeup is not that they use AI for translation. It’s that they explicitly optimize for both meaning and timing.
That distinction is the whole ballgame.
What Makes Dubbing Hard
When you translate a script for subtitles, you have flexibility. You can rephrase, trim, break lines differently. The reader’s eye adapts. But when you’re generating dubbed audio that needs to play over existing video, you’re constrained in a way that pure translation isn’t:
- The dubbed audio has to roughly match the on-screen speaker’s mouth movements
- Sentence length varies dramatically across languages — a phrase that takes 2 seconds in English might take 3.5 seconds in German or 1.2 seconds in Japanese
- Prosody matters: where the emphasis falls, the rhythm of speech, pauses
- The speaker’s emotional tone in the original needs to survive into the target language
Naively translated text fed into a TTS system will produce audio that feels off — not because the words are wrong, but because the timing is wrong. The dubbed voice will run long, clip awkwardly, or sound robotic because the natural speech rhythm of the target language doesn’t map cleanly onto the original video’s pacing.
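To make the timing constraint concrete, here's a minimal sketch of a duration check a pipeline might run before synthesis. The per-language speaking rates are illustrative placeholders, not measured values, and a real system would get duration estimates from the TTS engine itself rather than from word counts:

```python
# Assumed average speaking rates in words per second — rough illustrative
# figures only, not measured data. Word-based counting is especially crude
# for languages like Japanese, where segmentation is fuzzy.
SPEAKING_RATE_WPS = {
    "en": 2.5,
    "de": 2.2,
    "ja": 3.0,  # placeholder; word counts barely apply here
}

def estimated_duration_s(text: str, lang: str) -> float:
    """Crude duration estimate: word count divided by an assumed rate."""
    return len(text.split()) / SPEAKING_RATE_WPS[lang]

def fits_budget(text: str, lang: str, budget_s: float,
                tolerance: float = 0.15) -> bool:
    """True if the estimated spoken duration lands within ±tolerance
    (as a fraction) of the original clip's time budget."""
    return abs(estimated_duration_s(text, lang) - budget_s) <= tolerance * budget_s
```

Even a check this crude shows why naive translation fails: a faithful German rendering of a 2-second English phrase will often blow past the budget before synthesis ever runs.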
What Descript Is Actually Solving
Descript’s approach, at a high level, is to use OpenAI models not just to translate but to produce translations that fit the temporal constraints of the original audio. The model has to produce output that means roughly the same thing and occupies roughly the same duration when spoken aloud.
This is a genuinely difficult optimization target. You’re not just scoring semantic accuracy — you’re scoring something like “naturalness in the target language given a hard time budget.” Sometimes the best translation for meaning is not the best translation for timing, and the model has to make that tradeoff intelligently.
Add to that the voice cloning or voice matching layer — where Descript tries to preserve the character of the original speaker’s voice across languages — and you have a pipeline with a lot of moving parts that all have to work together.
Why This Matters for Video Creators
The practical upshot is that creators who previously couldn’t afford professional dubbing (which is expensive and slow) can now reach global audiences without settling for robotic-sounding output. The barrier to “good enough” dubbing has dropped dramatically.
For developers building on top of these kinds of pipelines, the interesting architectural challenge is evaluation. How do you measure whether a dubbed video is good? Automated metrics for translation quality don’t capture timing naturalness. Human evaluation is expensive. This seems like an area where the tooling is still catching up to what the models can produce.
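As one example of what a cheap automated proxy might look like, the sketch below measures timing fit per segment: the fraction of dubbed segments whose duration stays within a tolerance band of the original. This is an assumption about what's worth measuring, not an established metric, and it captures only the timing half of the problem:

```python
def timing_fit_rate(segments, tolerance=0.1):
    """segments: list of (original_duration_s, dubbed_duration_s) pairs.
    Returns the fraction of segments whose dubbed/original duration
    ratio falls within ±tolerance of 1.0."""
    ok = sum(
        1 for orig, dub in segments
        if abs(dub / orig - 1.0) <= tolerance
    )
    return ok / len(segments)
```

A metric like this says nothing about prosody or semantic fidelity, which is precisely the gap: the dimensions that matter most for dubbing quality are the ones the tooling still struggles to score automatically.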
Descript’s work here is a good example of a product team taking a genuinely hard problem — not just the AI part, but the product problem of what “good” even means — and engineering toward it deliberately.