
Implementing Duration-Constrained Translation: The Prompt Engineering Behind AI Dubbing

Source: openai

Most developers building localization pipelines reach for the obvious architecture: transcribe, translate, synthesize. The pipeline chains cleanly, each step is independently testable, and you can swap any component without touching the others. The problem is that it produces consistently bad output, not because any step is implemented poorly, but because the constraints that matter most are distributed across steps that have no shared context.

Descript’s multilingual dubbing system, detailed in an OpenAI case study, resolves this by collapsing the constraint into the translation step itself. Duration is not a property you check after generating a translation. It is a parameter you provide during generation. The shift sounds simple. The implementation consequences are substantial.

The Constraint Structure

A dubbed video segment has three constraints that must be satisfied simultaneously. The translated phrase must mean approximately what the original said. The synthesized audio must fit within the time window occupied by the original speech. And the synthesized voice must sound like the original speaker, adapted to a phoneme inventory that may be structurally different from their native one.

In the naive pipeline, these constraints are addressed in sequence: translation maximizes semantic fidelity with no timing awareness, TTS synthesis takes whatever text arrives and renders it, and any timing overflow is either ignored or addressed with speed adjustments. Speed adjustment has a hard perceptual ceiling. Listeners detect unnaturally fast speech around 210 to 220 words per minute in English. Above that threshold, comprehension drops and the audio sounds machine-generated regardless of voice quality. You cannot solve a structural timing mismatch by speeding up playback.
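The arithmetic behind that ceiling is worth making concrete. A minimal sketch (assuming the ~215 wpm English ceiling cited above; the function names are illustrative, not from Descript's pipeline):

```python
def implied_wpm(word_count: int, rendered_seconds: float, budget_seconds: float) -> float:
    """Words per minute the audio would hit if time-compressed to fit the budget."""
    speedup = rendered_seconds / budget_seconds
    natural_wpm = word_count / rendered_seconds * 60.0
    return natural_wpm * speedup  # simplifies to word_count / budget_seconds * 60

def fits_by_speedup(word_count: int, rendered_seconds: float,
                    budget_seconds: float, ceiling_wpm: float = 215.0) -> bool:
    """True if speeding the audio up to fit stays under the perceptual ceiling."""
    return implied_wpm(word_count, rendered_seconds, budget_seconds) <= ceiling_wpm

# A 14-word phrase rendered at 4.2s, squeezed into a 3.36s slot:
# 14 / 3.36 * 60 = 250 wpm -- above the ceiling, so speed-up alone fails.
```

Note that the source duration drops out entirely: only word count and the target slot matter, which is why the fix has to happen at translation time, not playback time.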

The constraint is structural because languages differ dramatically in phoneme density. Spanish conveys the same semantic content as English using roughly 20 to 25 percent more syllables at natural speaking pace. German, with its compound nouns and verb-final clause structure, runs 25 to 30 percent longer. A pipeline that translates without duration awareness will overflow nearly every segment when targeting these languages, and no downstream adjustment can recover without losing either meaning or naturalness.

Structuring the Prompt

The practical implementation starts with the ASR output. Whisper and comparable models return segment-level transcripts with start and end timestamps. Most developers treat these as metadata for display. For a constrained dubbing pipeline, they are primary inputs to the translation step.

A segment record from Whisper looks like this:

{
  "id": 12,
  "seek": 8800,
  "start": 22.04,
  "end": 25.40,
  "text": "The key insight is that duration has to be baked in from the start.",
  "tokens": [...],
  "temperature": 0.0,
  "avg_logprob": -0.18,
  "compression_ratio": 1.41,
  "no_speech_prob": 0.01
}

From start and end, you have a duration budget: 3.36 seconds. That budget governs the translation. The prompt needs to make this explicit.

import anthropic
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    duration_seconds: float
    source_lang: str
    target_lang: str

def translate_constrained(segment: Segment, client: anthropic.Anthropic) -> str:
    # Estimate target character budget from duration.
    # English averages ~13 characters per second at natural pace.
    # Adjust the ratio based on target language phoneme density.
    density_adjustment = {
        "es": 0.82,  # Spanish runs longer; compress by ~18%
        "de": 0.78,  # German runs even longer
        "fr": 0.88,
        "zh": 1.15,  # Mandarin is denser; can afford more characters
        "ja": 1.05,
    }
    base_chars_per_second = 13.0
    adjustment = density_adjustment.get(segment.target_lang, 1.0)
    char_budget = int(segment.duration_seconds * base_chars_per_second * adjustment)

    system = (
        f"You are a professional dubbing localization writer. "
        f"Translate the following {segment.source_lang} text to {segment.target_lang}. "
        f"The synthesized speech of your translation must fit within exactly {segment.duration_seconds:.2f} seconds "
        f"when read at a natural, unhurried pace. "
        f"Aim for approximately {char_budget} characters in the target language. "
        f"Preserve the meaning as closely as possible, but shorten or rephrase if necessary to fit the duration. "
        f"Return only the translated text, no explanation."
    )

    message = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=256,
        messages=[{"role": "user", "content": segment.text}],
        system=system,
    )
    return message.content[0].text.strip()

The character budget estimate is a heuristic, not a guarantee. Synthesized speech duration depends on the TTS model, the specific voice, and prosodic decisions the model makes about emphasis and pause placement. The translation model does not have ground truth about how the downstream TTS will render a given phrase. What it can do is reason about relative phoneme density and produce a phrase that is likely to fit, given what it knows about the target language’s structure.

For higher precision, you can run a calibration pass: synthesize a set of reference phrases in the target language with your chosen TTS model, measure actual durations, and fit a per-voice, per-language characters-per-second estimate. This converts the heuristic into an empirically grounded parameter.
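That calibration pass is only a few lines. A sketch, assuming a `tts_synthesize` callable and a `get_audio_duration` helper like the ones used elsewhere in this article (both passed in here so the function stays voice-agnostic):

```python
def calibrate_chars_per_second(reference_phrases, tts_synthesize, get_audio_duration) -> float:
    """Fit an empirical characters-per-second rate for one voice/language pair
    by synthesizing reference phrases and measuring their actual durations."""
    total_chars = 0
    total_seconds = 0.0
    for phrase in reference_phrases:
        audio_path = tts_synthesize(phrase)
        total_chars += len(phrase)
        total_seconds += get_audio_duration(audio_path)
    return total_chars / total_seconds
```

The result replaces the hardcoded `base_chars_per_second * adjustment` product in `translate_constrained` for that specific voice and language, so the prompt's character budget reflects how the TTS model actually paces speech rather than a cross-language average.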

Adding Iterative Refinement

The single-shot approach works adequately for most segments. For segments where timing precision matters more, you can add a refinement loop: synthesize the initial translation, measure the output duration, and if it is outside an acceptable tolerance, prompt the model again with the measured error.

import subprocess
import json
import os

def get_audio_duration(audio_path: str) -> float:
    """Get duration of audio file using ffprobe."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_format", audio_path],
        capture_output=True, text=True
    )
    info = json.loads(result.stdout)
    return float(info["format"]["duration"])

def translate_with_refinement(
    segment: Segment,
    client: anthropic.Anthropic,
    tts_synthesize,  # callable: (text, lang) -> audio_path
    tolerance_seconds: float = 0.15,
    max_iterations: int = 3,
) -> tuple[str, str]:
    """
    Returns (translated_text, audio_path).
    Refines translation if synthesized audio does not fit within tolerance.
    """
    translation = translate_constrained(segment, client)

    for attempt in range(max_iterations):
        audio_path = tts_synthesize(translation, segment.target_lang)
        actual_duration = get_audio_duration(audio_path)
        error = actual_duration - segment.duration_seconds

        if abs(error) <= tolerance_seconds:
            return translation, audio_path

        # Prompt for correction with measured error.
        direction = "shorter" if error > 0 else "longer"
        overage_ms = int(abs(error) * 1000)

        message = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=256,
            system=(
                f"You are a dubbing localization writer. Revise this {segment.target_lang} translation "
                f"to be approximately {overage_ms}ms {direction} when spoken aloud. "
                f"The target duration is {segment.duration_seconds:.2f} seconds. "
                f"Current translation runs {actual_duration:.2f} seconds. "
                f"Preserve meaning as much as possible. Return only the revised text."
            ),
            messages=[{"role": "user", "content": translation}],
        )
        translation = message.content[0].text.strip()
        os.unlink(audio_path)  # Clean up intermediate audio.

    # Return last attempt even if outside tolerance.
    audio_path = tts_synthesize(translation, segment.target_lang)
    return translation, audio_path

This refinement loop converges quickly in practice. The model understands that it needs to compress or expand and typically reaches acceptable output within two iterations. The exception is segments where the duration is very short and the source phrase is semantically dense, leaving almost no room to maneuver without losing meaning.

The Voice Cloning Layer

Duration-aware translation solves the timing problem. It does not address speaker identity. A viewer watching dubbed content expects to hear something that sounds like the original speaker, not a generic synthetic voice. Descript addresses this through Overdub, which extracts a speaker embedding from the original audio and conditions the TTS synthesizer on it.

The technical challenge is that voice identity is not purely a phonetic property. It includes prosodic habits: how a speaker distributes emphasis, where they place pauses, how their pitch moves through a sentence. These patterns are partly language-specific and partly individual. A speaker embedding that captures timbre may not capture prosody, and cross-language synthesis may strip out the prosodic characteristics that make a voice recognizable even when the phoneme inventory changes completely.

For developers building outside Descript’s ecosystem, ElevenLabs and OpenAI’s TTS API offer voice cloning with varying degrees of cross-language support. ElevenLabs’ multilingual v2 model supports 29 languages with speaker identity preservation. The quality of identity preservation varies across language pairs; pairs that share phoneme inventory tend to preserve more speaker character than pairs that do not.

Where the Pipeline Breaks Down

Three failure modes appear consistently in duration-constrained dubbing pipelines.

ASR errors propagate into semantic loss. If Whisper misrecognizes a word, the translation model translates the misrecognized text. The timing constraint may be satisfied perfectly while the semantic content is wrong. For high-stakes content, human review of the ASR output before translation is worth the cost.
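Whisper's segment metadata gives you a cheap triage signal for routing segments to that review. A hedged sketch using the confidence fields from the segment record shown earlier; the thresholds mirror the heuristics Whisper itself uses to trigger decode fallback, but tune them for your content:

```python
def needs_human_review(segment: dict,
                       logprob_floor: float = -1.0,
                       no_speech_ceiling: float = 0.5,
                       compression_ceiling: float = 2.4) -> bool:
    """Flag segments whose ASR confidence signals suggest likely misrecognition:
    low average token log-probability, high no-speech probability, or a
    compression ratio that indicates repetitive/degenerate decoding."""
    return (
        segment["avg_logprob"] < logprob_floor
        or segment["no_speech_prob"] > no_speech_ceiling
        or segment["compression_ratio"] > compression_ceiling
    )
```

Gating only the flagged segments keeps human review cost proportional to actual ASR uncertainty instead of total content length.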

Named entities resist compression. When a segment contains a proper noun, product name, or technical term that must appear verbatim, the duration budget shrinks by the fixed cost of that term. In Spanish or German, a segment that is already tight with an English technical term in it may be nearly impossible to fit without losing surrounding context.
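One mitigation is to treat verbatim terms as a fixed cost deducted from the character budget before prompting, so the model knows how much room actually remains for the surrounding text. A minimal sketch (the helper name is illustrative):

```python
def remaining_char_budget(char_budget: int, verbatim_terms: list[str]) -> int:
    """Subtract the fixed character cost of must-appear-verbatim terms
    (plus one separating space each) from the segment's budget."""
    fixed_cost = sum(len(term) + 1 for term in verbatim_terms)
    return max(0, char_budget - fixed_cost)
```

A result of zero (or near zero) is itself a useful signal: the segment cannot be translated within its window and should be escalated rather than force-compressed.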

Segment boundary placement affects difficulty. If the ASR system places a segment boundary mid-phrase, the duration budget for each half may be individually difficult to satisfy. Whisper tends to segment at natural utterance boundaries, but speaker idiosyncrasies and background noise can cause misalignment. A post-processing pass that merges very short adjacent segments before translation reduces this problem.
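The merge pass is straightforward. A sketch that folds any segment shorter than a threshold into its predecessor, operating directly on Whisper-style segment dicts (threshold value is an assumption to tune):

```python
def merge_short_segments(segments: list[dict], min_seconds: float = 1.0) -> list[dict]:
    """Merge segments shorter than min_seconds into the preceding segment,
    concatenating text and extending the time window to cover both."""
    merged: list[dict] = []
    for seg in segments:
        duration = seg["end"] - seg["start"]
        if merged and duration < min_seconds:
            prev = merged[-1]
            prev["end"] = seg["end"]
            prev["text"] = prev["text"].rstrip() + " " + seg["text"].lstrip()
        else:
            merged.append(dict(seg))  # shallow copy so inputs stay untouched
    return merged
```

Merging backward preserves utterance order and gives the translation step a single, larger duration budget, which is almost always easier to satisfy than two fragmentary ones.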

The Pattern Beyond Dubbing

The architecture Descript has built is a specific instance of a general pattern: constrained generation where the constraint is temporal and the output will be rendered in a medium with fixed time slots.

Game localization encounters this for voiced dialogue. Characters speak over animations with pre-baked durations. A localization pipeline that produces accurate translations without duration awareness generates lines that require animation rework on a significant fraction of characters. Studios like Pole to Win employ human localization writers for this reason, doing the same constrained paraphrase work that Descript’s pipeline automates.

Accessibility audio description for video has the same structure. Narrators must describe visual content during natural dialogue gaps, which have fixed durations. Automated audio description that does not model duration produces descriptions that overflow into dialogue, which is worse than no description at all.

The implementation pattern transfers directly: extract timing windows from the source medium, compute duration budgets, pass budgets into the generation prompt, synthesize into the time slot, measure, refine if needed. The model, given a clear duration constraint and an honest estimate of the current error, can iterate toward a solution that a purely post-hoc approach cannot reach.
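That loop is thin enough to sketch in full. A hypothetical driver that ties the steps together; `translate_fn` would be a closure over `translate_with_refinement` from earlier, injected here so the orchestration stays independent of any particular model or TTS backend:

```python
def dub_segments(asr_segments: list[dict], translate_fn,
                 source_lang: str, target_lang: str) -> list[dict]:
    """For each ASR segment: extract the timing window, compute the duration
    budget, and run constrained translation + synthesis into that slot.
    translate_fn: (text, budget_seconds, source_lang, target_lang)
                  -> (translated_text, audio_path)"""
    results = []
    for seg in asr_segments:
        budget = seg["end"] - seg["start"]
        text, audio_path = translate_fn(seg["text"], budget, source_lang, target_lang)
        results.append({
            "start": seg["start"],
            "end": seg["end"],
            "translation": text,
            "audio_path": audio_path,
        })
    return results
```

Swapping `translate_fn` is also how the same driver serves game dialogue or audio description: only the source of the timing windows changes.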

The underlying reason this works is that large language models have absorbed a substantial amount of information about phoneme density and speech rate across languages. They cannot directly predict synthesized audio duration, but they can reason about relative density and produce compressed paraphrases that preserve meaning. That reasoning capability is what makes duration-as-constraint viable, rather than duration-as-filter. Translate first and discard what does not fit, and you waste most of the model’s output. Translate to fit from the start, and the model can direct its reasoning toward the problem that actually matters.
