Most people, when they hear “AI dubbing,” think the hard part is translation. Get the words right, match the speaker’s voice, done. But anyone who has watched a poorly dubbed film knows that accurate translation is almost beside the point if the timing is off. A perfectly translated sentence that runs two seconds longer than the original clip sounds wrong in a way that is immediately obvious to every human brain.
That is the actual problem Descript is solving. According to their case study with OpenAI, they are using OpenAI models not just to translate speech, but to optimize translations simultaneously for meaning and timing — so the dubbed audio fits naturally within the original video’s pacing.
Why Timing Is the Real Constraint
Different languages have different information densities. Spanish tends to run longer than English. German even more so. Japanese can be radically shorter or longer depending on register. If you take an English sentence, translate it into German, and then synthesize speech from the result, you might get audio that is 30% longer than the original clip. The result is a speaker whose mouth stops moving while words are still coming out.
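A quick back-of-envelope sketch makes the overrun concrete. The per-language speaking rates below are illustrative assumptions (real rates vary by voice, style, and segment), but the mechanism is the same: denser source text plus a slower-per-character target language means the dubbed audio spills past the clip.

```python
# Rough duration estimate for synthesized speech from text length.
# The rates are illustrative assumptions, not measured values: average
# characters spoken per second varies by language, voice, and style.
CHARS_PER_SECOND = {
    "en": 15.0,  # assumed average rate for English speech
    "de": 13.0,  # German: longer words, assumed slightly slower per character
}

def estimated_duration(text: str, lang: str) -> float:
    """Estimate how long `text` takes to speak, in seconds."""
    return len(text) / CHARS_PER_SECOND[lang]

en = "The meeting has been moved to next Thursday afternoon."
de = "Die Besprechung wurde auf nächsten Donnerstagnachmittag verschoben."

src = estimated_duration(en, "en")
tgt = estimated_duration(de, "de")
print(f"source ≈ {src:.1f}s, dubbed ≈ {tgt:.1f}s, overrun ≈ {tgt / src - 1:.0%}")
```

Even with generous error bars on the rates, the German rendering overshoots the English clip by a large margin, which is exactly the mouth-stops-moving failure mode.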
Professional dubbing studios solve this with localization writers — people who rewrite translated lines to fit the timing while preserving meaning. It is skilled, slow, expensive work. Doing it at scale for user-generated content (which is Descript’s market) is essentially impossible without automation.
What Descript appears to be doing is collapsing that rewriting step into the translation step itself. Instead of translating and then fitting, the model is generating translations that are already constrained by duration — producing output that means roughly the right thing and takes roughly the right amount of time to say.
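One way to collapse the two steps is to fold the timing budget directly into the translation request. The prompt builder below is a hypothetical sketch of that idea, not Descript's actual prompt or pipeline; the characters-per-second conversion is an assumed average used to turn a time budget into a rough length budget the model can act on.

```python
# Hypothetical prompt construction for duration-constrained translation.
# Sketches the idea of merging "translate" and "fit the timing" into one
# request; it is NOT Descript's actual prompt or pipeline.

def build_dubbing_prompt(source_text: str, target_lang: str,
                         budget_seconds: float,
                         chars_per_second: float = 14.0) -> str:
    """Ask for a translation that also fits a speaking-time budget.

    `chars_per_second` is an assumed speaking rate used to convert the
    time budget into an approximate character budget.
    """
    max_chars = int(budget_seconds * chars_per_second)
    return (
        f"Translate the following line into {target_lang}. "
        f"The translation must take at most {budget_seconds:.1f} seconds "
        f"to speak (roughly {max_chars} characters). "
        f"Prefer a natural paraphrase over a literal rendering if needed "
        f"to stay within that budget.\n\n"
        f"Line: {source_text}"
    )

prompt = build_dubbing_prompt("We'll get back to you tomorrow.", "German", 2.0)
print(prompt)
```

The interesting design choice is that the constraint is expressed to the model in its own terms (a paraphrase budget) rather than enforced afterward by truncation or audio manipulation.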
The Interesting Engineering Problem
From a systems perspective, this is a constrained generation problem. You are not just maximizing translation quality; you are optimizing across two objectives simultaneously:
- Semantic fidelity: the translated text should convey what the speaker actually said
- Duration fidelity: the synthesized speech should fit within the original segment’s time window
These objectives are often in tension. The most accurate translation might be too long. The shortest adequate paraphrase might lose nuance. Getting a model to navigate that tradeoff automatically, at scale, across dozens of language pairs, is genuinely non-trivial.
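The tradeoff above can be made mechanical with a candidate-reranking sketch: assume an upstream model proposes several translations, each with a (hypothetical) semantic-fidelity score, and pick the one whose combined meaning-plus-timing score is best. The penalty weights and the length-based duration estimate are illustrative assumptions, not values from the case study.

```python
# Minimal reranking sketch for the meaning-vs-duration tradeoff.
# Assumes candidate translations arrive with a semantic score in [0, 1];
# weights and the crude length-based duration estimate are illustrative.

def duration_penalty(est_seconds: float, budget_seconds: float) -> float:
    """Penalize overruns harshly, underruns mildly (silence is cheaper)."""
    ratio = est_seconds / budget_seconds
    return max(0.0, ratio - 1.0) * 3.0 + max(0.0, 1.0 - ratio) * 0.5

def pick_candidate(candidates, budget_seconds, chars_per_second=14.0):
    """candidates: list of (text, semantic_score) pairs; returns best text."""
    def total(candidate):
        text, semantic = candidate
        est = len(text) / chars_per_second  # rough spoken-duration estimate
        return semantic - duration_penalty(est, budget_seconds)
    return max(candidates, key=total)[0]

candidates = [
    ("A very faithful but noticeably longer rendering of the original line.", 0.95),
    ("A tighter paraphrase of the line.", 0.85),
]
print(pick_candidate(candidates, budget_seconds=2.5))
```

With a 2.5-second budget, the slightly less faithful paraphrase wins because the faithful rendering blows through the time window, which is the tension described above playing out numerically.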
I suspect there is also a third constraint operating here: the synthesized voice has to sound natural at the target duration, which means you cannot just speed up or slow down TTS output arbitrarily. Speech rate has perceptual limits. You have to actually write different words, not just play them faster.
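That third constraint implies a decision point: small timing gaps can be absorbed by an imperceptible tempo adjustment, but past some threshold you have to rewrite the words. The tolerance band below is an illustrative assumption about where stretching stops sounding natural, not a measured perceptual limit.

```python
# Sketch of the "stretch vs. rewrite" decision. The tolerance band is an
# illustrative assumption: small tempo tweaks are usually inaudible, but
# beyond some factor the voice sounds rushed or draggy.

def fit_strategy(audio_seconds: float, slot_seconds: float,
                 min_factor: float = 0.9, max_factor: float = 1.1) -> str:
    """Return how to fit synthesized audio into the original time slot."""
    factor = slot_seconds / audio_seconds  # tempo factor needed to fit
    if min_factor <= factor <= max_factor:
        return "time-stretch"  # inaudible tempo tweak is enough
    return "rewrite"           # need different words, not faster speech

print(fit_strategy(audio_seconds=2.1, slot_seconds=2.0))  # small gap
print(fit_strategy(audio_seconds=3.0, slot_seconds=2.0))  # too large
```

A 2.1-second take squeezed into a 2.0-second slot is a ~5% speedup and stays in the stretch band; a 3.0-second take would need a 33% speedup, so the only natural-sounding fix is different words.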
Why This Matters Beyond Descript
Descript is primarily a tool for podcasters and video creators, but the underlying capability is significant. The cost of creating multilingual video content has historically been a distribution barrier — small creators cannot afford professional dubbing, so their content stays monolingual. If AI can compress the cost of dubbing to near zero, the economics of global content distribution change substantially.
This is the kind of infrastructure-level shift that tends to be underreported because the product it enables (more dubbed YouTube videos) sounds mundane. But the underlying shift — language as a distribution constraint dissolving — is not mundane at all.
I am curious how well the timing optimization actually holds up across language pairs with extreme length divergence. That would be the real test of the approach.