What It Actually Takes to Translate a Real Codebase with an LLM

Source: lobsters

Daniel Janus recently wrote about translating a non-trivial codebase using Claude, and the account is worth spending time with. Not because the result is surprising, but because the friction points he hits illuminate something most “AI translates your codebase” demos carefully avoid: the hard part was never syntax.

Let me dig into what makes codebase translation genuinely difficult, where large language models change the calculus, and what a serious workflow for this looks like in practice.

The Old Approach and Its Ceiling

Before LLMs entered this space, language migration fell into two camps. You either wrote a mechanical transpiler targeting a specific source-destination pair, or you rewrote manually. Transpilers work well when the semantic gap between languages is narrow. Tools like js2coffee or the TypeScript compiler’s --allowJs pathway can carry you far when the type systems and runtime models are close enough to map one-to-one.

The ceiling appears when idioms diverge. A Clojure codebase built on transducers has no direct equivalent in Python. A Rust codebase leaning on the ownership system encodes invariants in the type system that simply evaporate when you cross to a garbage-collected language. AST-to-AST transforms can rename variables and restructure control flow, but they cannot reason about what a piece of code means well enough to replace an idiomatic pattern with its idiomatic counterpart in the target language.

This is the gap LLMs fill, and it is a real one.

Why Context Windows Changed the Problem

The first generation of LLM coding tools hit an obvious limit: a file-at-a-time translation cannot preserve cross-file semantics. If a function in utils.py relies on a convention established in core.py, translating each file independently produces a coherent-looking result that does not actually work.

Claude’s 200k token context window (and the extended context available via the API) changed what is feasible. At the usual estimate of about 0.75 words per token, a 200k token window holds roughly 150,000 words of English text, which is enough to fit a substantial portion of a mid-sized codebase in a single prompt. Janus’s experiment works partly because he can feed enough of the surrounding codebase to let the model maintain consistent naming, preserve call signatures, and understand the shape of the data flowing between modules.
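A quick way to check whether a set of files is even in range is a back-of-the-envelope token estimate. The sketch below assumes roughly four characters per token, which is a common rule of thumb and not a measurement from any real tokenizer; the function names and the output reserve are my own illustration.

```python
import math

CONTEXT_WINDOW = 200_000   # tokens, the figure quoted above
CHARS_PER_TOKEN = 4.0      # assumption: rough heuristic, not a tokenizer measurement


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Estimate the token count of a string from its character length."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_window(sources: dict[str, str], reserve_for_output: int = 20_000) -> bool:
    """Check whether all sources plus a reserve for the reply fit in the window.

    `sources` maps a file path to its contents. The reserve is a hypothetical
    allowance for instructions and the model's response.
    """
    total = sum(estimate_tokens(text) for text in sources.values())
    return total + reserve_for_output <= CONTEXT_WINDOW
```

For real planning, a count from the provider's own tokenizer is the number to trust; this only tells you whether you are in the right order of magnitude.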

This is not free. Longer contexts increase latency and cost, and model attention degrades somewhat across very long inputs. The effect is sometimes called lost-in-the-middle: information in the center of a long prompt receives less reliable attention than content near the edges. Putting the most critical context near the beginning and end of the prompt partially mitigates it.
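One way to act on the edge-placement idea is to assemble the prompt programmatically. This is a minimal sketch, not Janus's method: the `Snippet` type, the `critical` flag, and the `###` labels are all hypothetical choices of mine.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    path: str       # hypothetical file path, used only as a label
    text: str
    critical: bool  # True for the data model, shared conventions, and the like


def build_prompt(instructions: str, snippets: list[Snippet], target: Snippet) -> str:
    """Place critical context at the edges and supporting context in the middle."""
    critical = [s for s in snippets if s.critical]
    supporting = [s for s in snippets if not s.critical]

    def render(s: Snippet) -> str:
        return f"### {s.path}\n{s.text}"

    parts = [instructions]
    parts += [render(s) for s in critical]     # near the beginning
    parts += [render(s) for s in supporting]   # the middle, where recall is weakest
    parts.append(render(target))               # the file to translate, near the end
    parts.append(instructions)                 # restate the task last
    return "\n\n".join(parts)
```

Repeating the instructions at the end costs a few tokens and keeps the task statement out of the weakest part of the prompt.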

What “Non-Trivial” Actually Means

It is worth being precise about what makes a codebase hard to translate mechanically. The obstacles are roughly:

Idiomatic patterns. Every language community develops idioms that do not map cleanly to other languages. Python’s list comprehensions, Clojure’s threading macros, Haskell’s monadic composition, Rust’s iterator chains: each is idiomatic in its home language and awkward when transliterated literally. A good translation replaces the source idiom with the target idiom rather than producing a literal but unidiomatic rendering.
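To make the difference concrete, here is a small illustration of my own (not taken from Janus's codebase) with Python as the target. Suppose the source is a Clojure threading pipeline such as (->> xs (filter even?) (map #(* % %)) (reduce +)). The first function mirrors its shape step by step; the second says the same thing the way a Python programmer would.

```python
from functools import reduce


def sum_even_squares_literal(xs):
    # Transliteration: one Python call per step of the Clojure pipeline.
    return reduce(
        lambda a, b: a + b,
        map(lambda x: x * x, filter(lambda x: x % 2 == 0, xs)),
        0,
    )


def sum_even_squares_idiomatic(xs):
    # Idiomatic rendering: a generator expression fed to sum().
    return sum(x * x for x in xs if x % 2 == 0)
```

Both return the same values. Only the second is code a reviewer in the target language would accept without comment.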

Concurrency models. Python’s GIL, Go’s goroutines, JavaScript’s event loop, Rust’s async/await with its Send and Sync bounds: these are not interchangeable. Code that works correctly in one model can silently deadlock or race in another. No static transform can verify correctness across this boundary; you need the model to understand the semantics well enough to reach for the right primitive.

Type system differences. Moving from a dynamically typed language to a statically typed one requires reconstructing type information that was implicit. Moving the other direction requires deciding what to do with safety guarantees that simply do not exist at runtime. This is reasoning work, not text substitution.
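A small sketch of what reconstructing implicit type information looks like, kept in Python for brevity. The `User` record and both functions are hypothetical; the point is that the optional field has to be inferred from how the untyped code reads its input.

```python
from dataclasses import dataclass
from typing import Optional


# Source style: the shape of `user` is implicit in how the function uses it.
def display_name_untyped(user):
    return user.get("nickname") or user["name"]


# Reconstructed contract: `name` is required (indexed directly above),
# `nickname` may be absent (read with .get above).
@dataclass(frozen=True)
class User:
    name: str
    nickname: Optional[str] = None


def display_name(user: User) -> str:
    return user.nickname or user.name
```

Nothing in the original declared that nickname is optional. That fact lived in the choice of .get over indexing, and a translator, human or model, has to notice it.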

Library surface area. A Python NLP library might depend on NumPy broadcasting semantics, PyTorch tensor operations, or NLTK tokenization behavior. The Clojure or Rust equivalent either does not exist, has different semantics, or requires calling into the original library via FFI. The model has to make judgment calls about equivalence that are domain-specific.

A Workable Workflow

Janus’s approach, as far as I can reconstruct it from his writeup, is iterative rather than one-shot. This matches what works in practice. The one-shot approach, where you paste in a codebase and ask for a translation, produces something that looks plausible but is riddled with subtle errors. The better workflow:

  1. Translate the data model first. The core data structures define the contract everything else works against. Get these right and validated before touching the logic layer.

  2. Translate leaf functions before callers. Bottom-up translation lets you test each piece before it gets composed into something harder to debug. Functions with no dependencies on other untranslated code can be run against a test suite immediately.

  3. Keep a running reference implementation. The original codebase stays live and runnable throughout. Property-based testing frameworks like Hypothesis (Python) or test.check (Clojure) are useful here: you can run the same generated inputs against both implementations and compare outputs, catching semantic divergence before it compounds.

  4. Treat the model as a draft generator, not an oracle. Every generated function gets reviewed. The review is faster than writing from scratch, but it is not skippable. LLMs hallucinate plausible-looking API calls for libraries they have seen less of, especially for niche or recently updated dependencies.

  5. Checkpoint frequently. Translation tasks that take multiple sessions benefit from keeping a log of what has been translated, what tests pass, and what decisions were made. The model has no memory between sessions; you have to reconstruct context each time.
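The bottom-up ordering in step 2 is a topological sort of the call graph. A minimal sketch using the standard library's graphlib (Python 3.9+) follows; the call graph itself is made up for illustration, and extracting a real one from source is a separate problem.

```python
from graphlib import TopologicalSorter

# Hypothetical call graph: each function maps to the functions it calls.
CALLS = {
    "parse_config": set(),
    "normalize": set(),
    "load": {"parse_config"},
    "transform": {"normalize", "load"},
    "main": {"transform"},
}


def translation_order(calls):
    """Return function names so that every callee precedes its callers."""
    # TopologicalSorter treats each mapping value as the node's predecessors,
    # so callees are emitted before the functions that depend on them.
    return list(TopologicalSorter(calls).static_order())
```

Mutually recursive functions form a cycle, and static_order raises graphlib.CycleError for those. In that case the cycle has to be translated as one unit.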

Where the Model Earns Its Keep

The translation of boilerplate is fast and reliable. If you have fifty methods that each do a simple field access, map, and return, the model translates them correctly at speed. The cognitive overhead of doing this manually is real, even for trivial code, because attention is finite and reviewers get fatigued.

The translation of algorithmic core logic is where LLMs are genuinely impressive. Give Claude a well-described dynamic programming solution in Python and ask for idiomatic Rust, and it will often return correct Rust with an appropriate ownership structure. This is not always true, and it degrades for unusual algorithms or obscure library behavior, but the hit rate on clean algorithmic code is high enough to be practically useful.

Comment and documentation translation is a quiet win. Translating docstrings from one language’s convention to another, updating inline comments that reference language-specific behavior, and adapting README examples all come for free alongside the code translation. This kind of work is tedious manually and falls through the cracks in migration projects.

Where It Breaks Down

The failure modes are predictable once you see them a few times.

Code that relies on side effects in specific orderings is fragile. The model may preserve the behavior, but without being told explicitly about the ordering dependency, it may also refactor it away. Pure functions are much more reliably translated than stateful ones.
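One way to make an ordering dependency explicit is to record the sequence of effects as a trace and assert on it. The sketch below is an assumption about how you might do that, with a made-up `process_order` function standing in for real source code.

```python
def process_order(order, events):
    # Hypothetical source function. The ordering is load-bearing here:
    # stock is reserved before payment is charged, and the customer
    # is notified last.
    events.append(("reserve", order["sku"]))
    events.append(("charge", order["total"]))
    events.append(("notify", order["email"]))


def effect_sequence(fn, order):
    """Run fn and return the names of its side effects in the order they occurred."""
    events = []
    fn(order, events)
    return [name for name, _ in events]


# Golden trace captured from the original implementation.
EXPECTED_SEQUENCE = ["reserve", "charge", "notify"]
```

The recorded sequence serves two purposes: it can go into the prompt as an explicit constraint, and it can be checked against the translated version's trace.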

Performance-sensitive code requires human review regardless. The model will produce functionally correct code that is sometimes cache-hostile, or that uses a container with the wrong time complexity for the access pattern. Correctness and performance are separable concerns, and the model optimizes for correctness.
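A familiar instance of the wrong-container problem, shown in Python as an illustration of my own. Both functions return the same result, so a correctness test passes either way; only the cost differs.

```python
from collections import deque


def drain_list(items):
    queue = list(items)
    out = []
    while queue:
        out.append(queue.pop(0))     # O(n) per pop: shifts every remaining element
    return out


def drain_deque(items):
    queue = deque(items)
    out = []
    while queue:
        out.append(queue.popleft())  # O(1) per pop
    return out
```

The list version is quadratic overall and the deque version is linear. An equivalence test cannot tell them apart, which is why this class of problem needs either a human reviewer or a benchmark.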

Fairly obscure library behavior is a consistent source of bugs. If the source code depends on an undocumented behavior of a specific library version, the model does not know about it and will translate to the documented API. This is often fine. Sometimes it is not.

The Broader Shift

Projects like Janus’s represent something real changing in how migration work gets done. The traditional calculus was: migration is so expensive that it often does not happen, and teams accumulate technical debt in a language or framework they would prefer to leave. LLM-assisted translation does not eliminate that cost, but it compresses it enough to make migrations feasible that were not previously worth attempting.

The interesting engineering work shifts from writing translation code to writing verification infrastructure. Test coverage that was nice-to-have for the original codebase becomes load-bearing for the translation. Property-based tests that characterize the behavior of a function mathematically are more valuable than example-based tests that only cover known inputs. The investment in verification tooling that makes AI-assisted translation trustworthy is the same investment that makes any future refactoring safer. You end up with a better-tested codebase almost as a side effect.
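The core of that verification infrastructure is differential testing: generate inputs, run them through both implementations, and report the first disagreement. The sketch below uses only the standard library's random module as a stand-in for a real property-based framework such as Hypothesis, and both dedupe functions are hypothetical placeholders for an original and its translation.

```python
import random


def reference_dedupe(xs):
    """Stand-in for the original implementation: keep first occurrences, in order."""
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def translated_dedupe(xs):
    """Stand-in for the translated implementation under test."""
    return list(dict.fromkeys(xs))


def find_divergence(reference, candidate, trials=500, seed=0):
    """Return the first generated input on which the two disagree, or None."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        xs = [rng.randint(-5, 5) for _ in range(rng.randint(0, 12))]
        # Pass copies so neither implementation can mutate the other's input.
        if reference(list(xs)) != candidate(list(xs)):
            return xs
    return None
```

In a real migration the candidate usually lives in another language, so the comparison goes through a subprocess or FFI boundary. A real framework also adds input shrinking, which this sketch does not attempt.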

That seems like a reasonable trade.
