
The Isochrony Problem: What Makes AI Dubbing Actually Hard

Source: OpenAI

Translating text is a solved problem. Dubbing a video is not, and the difference comes down to a constraint that translation engines traditionally ignore: time.

When you translate a sentence, the output can be any length. When you dub a video, the synthesized audio for that sentence must fit inside the original utterance’s time window. A 4-second English sentence that translates to 6 seconds of German is not just a translation problem; it is a scheduling problem with a hard deadline. Linguists call this constraint isochrony, and it is why professional dubbing studios bill by the hour and not by the word.

Descript’s AI dubbing pipeline, built on OpenAI’s APIs, is an attempt to automate that constraint satisfaction at scale. The architecture is worth examining not just for what Descript built, but for what the problem structure reveals about where AI tooling is actually useful and where it still requires careful engineering around the gaps.

Why Isochrony Is Harder Than It Looks

Different languages encode the same semantic content at different densities. German tends to run 20 to 30 percent longer than English for equivalent meaning, partly because of compound nouns and partly because of verb placement. Spanish runs slightly longer. Mandarin often runs shorter because each syllable carries more information. When you are building a dubbing pipeline, these differences are not occasional edge cases; they are systematic properties of language pairs that you have to handle for every segment of every video.

The naive solution is to speed up the synthesized audio. If the German translation runs 5 seconds but the original slot is 4 seconds, render it at 1.25x playback speed. This works within limits. Human speech can be sped up by about 20 percent before it starts sounding unnatural. Beyond that, the increased articulation rate makes the voice sound rushed and the consonants start to blur.
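The arithmetic behind that limit is simple enough to sketch. A minimal check, using the article's 5-seconds-into-4 example and treating the roughly 20 percent ceiling as a rule of thumb (function names are illustrative):

```python
def required_speedup(synth_seconds: float, slot_seconds: float) -> float:
    """Playback rate needed to fit synthesized audio into the original slot."""
    return synth_seconds / slot_seconds

def fits_by_speedup(synth_seconds: float, slot_seconds: float,
                    max_rate: float = 1.2) -> bool:
    """True if simple time compression stays under the naturalness ceiling."""
    return required_speedup(synth_seconds, slot_seconds) <= max_rate

# 5 s of German into a 4 s slot needs 1.25x playback, just past the
# ~20% threshold where speech starts to sound rushed.
print(required_speedup(5.0, 4.0))   # 1.25
print(fits_by_speedup(5.0, 4.0))    # False
```

Segments that fail this check are exactly the ones that need a shorter translation rather than a faster voice.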

A better solution is to generate duration-aware translations in the first place. This is where large language models can do something genuinely useful: rather than translating once and then wrestling with the output length, you can prompt the model to produce a translation that fits inside a target character count or estimated speech duration. The translation is not just semantically accurate; it is also prosodically constrained.

In practice, this means treating translation as a constrained optimization. Given a source segment of N seconds, produce target text whose synthesized speech duration falls within some tolerance of N seconds, while preserving meaning and sounding natural in the target language. The LLM does not directly control synthesis duration, so the pipeline has to estimate it, typically using character count as a proxy weighted by language-specific speech rate estimates.
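That character-count proxy can be sketched directly. The chars-per-second rates below are illustrative placeholders, not measured values; a real pipeline would calibrate them per language and per TTS voice:

```python
# Rough characters-per-second of synthesized speech, per language.
# Illustrative numbers only; a production pipeline would calibrate
# these against its actual TTS voices.
CHARS_PER_SECOND = {"en": 13.0, "de": 12.0, "es": 14.0, "zh": 3.5}

def estimate_duration(text: str, lang: str) -> float:
    """Estimate synthesized speech duration from character count."""
    return len(text) / CHARS_PER_SECOND[lang]

def within_budget(text: str, lang: str, slot_seconds: float,
                  tolerance: float = 0.15) -> bool:
    """True if the estimated duration falls within +/- tolerance
    of the source segment's duration budget."""
    est = estimate_duration(text, lang)
    return abs(est - slot_seconds) <= tolerance * slot_seconds
```

The tolerance band matters: too tight and the model is forced into stilted paraphrase, too loose and the time-stretching stage has to do work it cannot do naturally.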

The Four-Stage Pipeline

Descript’s implementation breaks the problem into four sequential stages, each with its own set of failure modes.

Transcription with timestamps. The pipeline starts with speech recognition that produces not just text but word-level timestamps. Whisper, OpenAI’s transcription model, has become standard infrastructure for this kind of work because its word-level alignment is accurate enough to use for downstream timing decisions. The timestamps are the scaffolding the rest of the pipeline builds on. If they are wrong, the synthesized audio will be out of phase with the video.
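The shape of that scaffolding is worth making concrete. Assuming word-level output of the form Whisper produces, as (word, start, end) tuples, grouping words into translation segments at pauses might look like the sketch below; the gap threshold is an assumption, not Descript's documented value:

```python
def segment_words(words, max_gap=0.6):
    """Group (word, start, end) tuples into segments, splitting where
    the silence between consecutive words exceeds max_gap seconds.
    Each segment carries the duration budget the translator must hit."""
    segments, current = [], []
    for w in words:
        if current and w[1] - current[-1][2] > max_gap:
            segments.append(current)
            current = []
        current.append(w)
    if current:
        segments.append(current)
    return [
        {
            "text": " ".join(w[0] for w in seg),
            "start": seg[0][1],
            "end": seg[-1][2],
            "budget": seg[-1][2] - seg[0][1],
        }
        for seg in segments
    ]
```

Every downstream stage consumes these budgets, which is why timestamp errors here propagate all the way to the final mix.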

Duration-aware translation. The timestamped transcript gives the pipeline segment boundaries. Each segment gets a duration budget. The translation stage, using GPT-4 class models, produces target-language text with awareness of that budget. The implementation details matter here: whether you pass the duration budget as a system prompt constraint, whether you generate multiple candidates and pick the best fit, and whether you allow the model to slightly paraphrase to hit the timing target.
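One of those choices, generating multiple candidates and picking the best fit, reduces to a small selection step once a duration estimator exists. A sketch, with a chars-per-second stand-in for the estimator:

```python
def pick_best_fit(candidates, slot_seconds, chars_per_second=12.0):
    """Choose the candidate translation whose estimated speech duration
    is closest to the segment's duration budget.
    candidates: list of translated strings for one segment."""
    def error(text):
        return abs(len(text) / chars_per_second - slot_seconds)
    return min(candidates, key=error)

# Three candidate German renderings of a 4-second English segment:
candidates = [
    "Eine sehr ausfuehrliche und entsprechend lange Uebersetzung des Satzes",
    "Eine passende Uebersetzung des Satzes hier",
    "Kurz",
]
print(pick_best_fit(candidates, 4.0))
```

A real implementation would also weigh semantic fidelity, not just fit, so the selection becomes a trade-off rather than a pure minimization.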

Voice-matched synthesis. OpenAI’s TTS API generates audio from the translated text, attempting to match the original speaker’s characteristics. The synthesis produces per-segment audio files timed to fit the original timestamps. When the translation runs slightly long or short despite the constraints, the synthesis stage can apply modest time-stretching to close the gap.
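That "modest time-stretching" can be framed as a small decision rule: stretch within a naturalness window, otherwise send the segment back for re-translation. A sketch with assumed bounds:

```python
def fit_segment(actual_seconds, slot_seconds,
                min_rate=0.85, max_rate=1.2):
    """Return the playback rate that makes the synthesized segment fit
    its slot, or None if the required stretch falls outside the assumed
    naturalness window and the segment should be re-translated instead."""
    rate = actual_seconds / slot_seconds
    if min_rate <= rate <= max_rate:
        return rate
    return None
```

The None branch is the feedback path back to the translation stage, which is what makes the pipeline a loop rather than a straight line.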

Source separation and mixing. The original audio track is a mix of speech and everything else: music, room tone, ambient sound. The dubbing pipeline has to isolate the speech layer, remove it, and replace it with the synthesized audio while keeping the background layer intact. Meta’s Demucs and similar models handle this source separation step. It is computationally expensive and imperfect; for recordings with significant reverberation or bleed between speech and music, the separation quality degrades visibly.
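After separation, the mix step is conceptually simple: lay each synthesized segment over the preserved background at its original timestamp. A toy sketch over raw sample lists; a real pipeline would use numpy buffers and handle resampling, gain matching, and crossfades:

```python
def mix(background, dubbed_segments, sample_rate=16000):
    """Overlay synthesized speech segments onto the separated background.
    background: list of float samples; dubbed_segments: list of
    (start_seconds, samples) pairs. Toy illustration only."""
    out = list(background)
    for start_seconds, samples in dubbed_segments:
        offset = int(start_seconds * sample_rate)
        for i, s in enumerate(samples):
            if offset + i < len(out):
                out[offset + i] += s
    return out
```

Note that everything here depends on the separation being clean; residual speech left in the background track will audibly double against the dub.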

What Descript’s Data Model Makes Possible

Descript’s core product treats audio and video as text. You edit a recording by editing its transcript. Delete a sentence in the transcript, and the corresponding audio is removed. This is not a dubbing feature; it is the fundamental data model of the product. But it turns out to be exactly the right data model for AI dubbing.

When a user loads a project in Descript, the transcript is already there, the timestamps are already computed, and the project is already structured around the relationship between text and time. Adding dubbing means adding a translation layer on top of an existing data structure, not building a new pipeline from scratch.

More importantly, the translated transcript is editable. If the AI translation produces something that is semantically wrong, uses domain-specific terminology incorrectly, or sounds awkward in the target language, the user can fix it directly in the interface and re-synthesize. That feedback loop is tighter than anything a black-box dubbing service offers. A traditional dubbing service returns finished audio; Descript returns editable text that generates audio on demand.

This distinction matters because AI translation is still far from perfect on specialized vocabulary. A software tutorial will produce terms like “pull request,” “runtime,” or “containerization” that a general-purpose translation model may handle inconsistently. A user who can see and edit the translated transcript can catch and fix those errors before they reach the final audio.

The Lip-Sync Gap

The current generation of audio-only dubbing pipelines has one visible limitation: when the speaker is on camera, the mouth movements do not match the synthesized audio. This mismatch ranges from barely noticeable to immediately obvious depending on camera angle and shot size. For wide shots and voiceover-style narration, audiences adapt quickly. For close-up dialogue scenes, the disconnect is more disruptive.

Several companies are working on a fifth pipeline stage that would re-render the speaker’s lip movements to match the new audio using video synthesis. HeyGen and Synthesia have been pushing this direction. The results are technically impressive but not yet consistently convincing at the pixel level; viewers who know what to look for will spot the artifacts.

Descript’s current positioning focuses on audio quality over video manipulation, which is the correct prioritization for their primary content types: podcasts, tutorials, product demos, and documentary-style explainer videos. These formats rarely have extended close-up dialogue sequences, so the lip-sync limitation has less impact on perceived quality.

The Cost Argument Is Real

Professional dubbing studios charge between $15 and $50 per finished minute depending on language pair, speaker count, and quality tier. A 20-minute tutorial video in three languages runs $900 to $3,000 before distribution. That cost has historically meant that localization was a project that required a budget and a business case.

AI pipeline costs are roughly two orders of magnitude lower. OpenAI’s TTS API is priced per million characters. A 10-minute video at average speaking pace contains around 1,500 words, or approximately 8,000 characters. Translation via the GPT-4 API adds a comparable increment. Three language versions of a 10-minute video cost somewhere between $1 and $5 in API calls. That math turns localization from a budget decision into a default assumption, and it changes who can afford to localize content at all.
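The back-of-envelope math can be written out. The prices below are assumptions for illustration only; per-million-character TTS pricing and per-call translation costs change, so check current rate cards:

```python
# Assumed prices for illustration only; real rate cards change.
TTS_PRICE_PER_MILLION_CHARS = 15.00   # dollars, assumed
TRANSLATION_PRICE_PER_VIDEO = 0.10    # dollars per language, assumed

def dubbing_cost(minutes, languages, words_per_minute=150,
                 chars_per_word=5.3):
    """Rough API cost for dubbing one video into several languages."""
    chars = minutes * words_per_minute * chars_per_word
    tts = chars / 1_000_000 * TTS_PRICE_PER_MILLION_CHARS * languages
    translation = TRANSLATION_PRICE_PER_VIDEO * languages
    return tts + translation

# A 10-minute video in three languages: well under a dollar of API
# spend, versus hundreds of dollars per language at studio rates.
print(round(dubbing_cost(10, 3), 2))
```

Even if the assumed prices are off by a factor of several, the gap against $15 to $50 per finished minute does not close.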

The quality ceiling is lower than professional studio work for narrative content with complex emotional range. But for developer tutorials, technical documentation videos, product walkthroughs, and educational content, the quality is well past the threshold of useful. The audience for a Spanish-language version of a software tutorial does not expect it to sound like a theatrical release.

The Competitive Landscape

ElevenLabs offers dubbing via API, aimed at developers building localization pipelines. Papercup and Deepdub are enterprise services targeting broadcast and streaming clients with higher quality requirements and correspondingly higher price points. Adobe is integrating AI voice tools into Premiere Pro and Audition, adding dubbing-adjacent capabilities to a professional editing workflow that already has a large installed base.

Descript’s advantage is not that it has the best models; the underlying API components are available to everyone. The advantage is that dubbing in Descript is a step within a workflow rather than a separate product. The user does not export to a dubbing service and import back; the translated, synthesized version of the project lives alongside the original and can be edited with the same tools. For content that requires any manual correction, which currently means most specialized content, that integration reduces friction significantly.

The constraint problem of dubbing is not fully solved. Duration matching is good but not perfect. Voice consistency across long videos can drift. Audio source separation fails on noisy recordings. These are active research areas, and the models will improve. What Descript has already demonstrated is that the pipeline is good enough to be genuinely useful for a large category of content, and that the editing-centric architecture makes it correctable when it is not.
