The Coordination Problem in LLM-Assisted Codebase Translation

For decades, automated code migration was a rule-based discipline. Python’s 2to3 tool, Facebook’s jscodeshift, Microsoft’s Roslyn-based migration helpers: each generation of tooling could handle more of the mechanical work, but they all operated on syntax trees, and syntax is only the surface of what code means. A program’s meaning lives in its dependencies, its idioms, its implicit assumptions about library behavior, and none of that survives a pure syntactic transformation.

Daniel Janus’s recent writeup on translating a non-trivial NLP codebase with Claude lands at an interesting moment. LLMs are now genuinely capable at the file level: you can paste a moderately complex Python module, ask for a Rust port, and get back something that compiles and captures most of the intent. The question the post addresses, and that codebase-scale translation always eventually surfaces, is what happens when the scope exceeds a single file and the decisions start to interact.

The limits of the rule-based approach

The 2to3 tool is the canonical example of rule-based migration at scale. It handles print statements, dictionary iteration methods, division semantics, and a few dozen other Python 2 to Python 3 differences with perfect accuracy inside its scope. That scope is narrow by design: each rule is a syntactic pattern with a deterministic substitution. The tool cannot reason about what code is for, only what it looks like.
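To make "a syntactic pattern with a deterministic substitution" concrete, here is a minimal sketch of one such rule written against Python's standard `ast` module. It is illustrative only: 2to3 itself is built on its own parser and fixer framework, and the `HasKeyRewriter` and `rewrite` names are made up for this example.

```python
import ast


class HasKeyRewriter(ast.NodeTransformer):
    """Rewrite `d.has_key(k)` into `k in d`, matching on shape alone."""

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # rewrite nested calls first
        is_has_key = (
            isinstance(node.func, ast.Attribute)
            and node.func.attr == "has_key"
            and len(node.args) == 1
            and not node.keywords
        )
        if not is_has_key:
            return node
        # The rule never asks whether the receiver is actually a dict.
        return ast.Compare(
            left=node.args[0],
            ops=[ast.In()],
            comparators=[node.func.value],
        )


def rewrite(source: str) -> str:
    """Apply the rule to a source string (needs Python 3.9+ for ast.unparse)."""
    tree = HasKeyRewriter().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


print(rewrite("x = cache.has_key(name)"))
```

The rule fires on any object that happens to expose a `has_key` method, including a custom class where `in` means something different. That is the narrowness described above: the pattern sees the call's shape and nothing else.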

Facebook’s TransCoder research project, published in 2020, was an early attempt at using neural sequence models for cross-language code translation. On competitive programming problems where the I/O contract is fully specified, it achieved reasonable accuracy. On real-world code with library dependencies, complex state management, and implicit behavioral assumptions, accuracy dropped sharply. The CodeBLEU metric, proposed the same year in separate work, scores n-gram, syntax-tree, and dataflow similarity between candidate and reference code, but it cannot capture whether a translated module integrates correctly with the ten other modules that depend on it.

LLMs represent a genuine step change from both approaches because they can reason about intent rather than just structure. The failure modes shift accordingly.

What makes a codebase non-trivial

A codebase is non-trivial for translation purposes when the translation decisions made for one file constrain the decisions available for other files. Three problems dominate in practice.

The first is dependency ordering. If encoder.py imports a class from tokenizer.py, you need to translate tokenizer.py first, because the translated encoder.py needs to reference whatever Tokenizer became in the target language. In a real codebase, this is a topological sort over the module dependency graph. When that graph has cycles (which is common in Python packages that use lazy imports to avoid circular import errors), no valid ordering exists and the problem stops being a simple sort. You have to break each cycle somewhere or translate its modules together as one unit, and the break points affect the translation of every module on either side.
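The ordering step can be sketched with the standard library's `graphlib` module. The module names in `IMPORTS` are hypothetical, and building the graph from real import statements is left out.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Optional

# Hypothetical module graph: each key maps to the modules it imports.
IMPORTS = {
    "tokenizer": set(),
    "vocab": {"tokenizer"},
    "encoder": {"tokenizer", "vocab"},
    "pipeline": {"encoder"},
}


def translation_order(imports: dict[str, set[str]]) -> list[str]:
    """Return modules ordered so that dependencies are translated first."""
    return list(TopologicalSorter(imports).static_order())


def find_cycle(imports: dict[str, set[str]]) -> Optional[list[str]]:
    """Return one import cycle if the graph has any, else None."""
    try:
        translation_order(imports)
    except CycleError as err:
        # graphlib reports the cycle as the second element of err.args;
        # the first and last node in that list are the same.
        return list(err.args[1])
    return None


print(translation_order(IMPORTS))
print(find_cycle({"a": {"b"}, "b": {"a"}}))
```

`graphlib` only reports a cycle; deciding where to break it, or whether to translate the cycle's modules together, stays a manual decision.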

The second problem is symbol consistency. When you translate a function called compute_sparse_attention in one module, every downstream module that calls it needs to use whatever name you settled on in the translation. LLMs will invent names independently for each file if you do not constrain them, and they will do so confidently. After translating twenty files independently, you can easily end up with computeSparseAttention, sparse_attention_compute, computeAttention, and SparseAttentionLayer all referring to what was one function in the source.
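One way to catch this drift after the fact is a crude audit over the translated output. The sketch below is a heuristic with hypothetical helper names (`words`, `divergent_names`): it flags identifiers built from the same words as a source symbol but spelled differently from the agreed target name.

```python
import re


def words(identifier: str) -> frozenset[str]:
    """Split a snake_case or camelCase identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return frozenset(p.lower() for p in parts)


def divergent_names(source_symbol: str, agreed: str, translated_code: str) -> set[str]:
    """Find identifiers that reuse the source symbol's words under another name."""
    target_words = words(source_symbol)
    identifiers = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", translated_code))
    return {
        name
        for name in identifiers
        if name != agreed and words(name) == target_words
    }


# Made-up fragment of translated output, for illustration only.
translated = """
let a = compute_sparse_attention(q);
let b = computeSparseAttention(q);
let c = sparse_attention_compute(q);
let d = computeAttention(q);
"""
print(divergent_names("compute_sparse_attention", "compute_sparse_attention", translated))
```

The heuristic catches reordered and re-cased variants but misses names that drop or add a word, such as `computeAttention` here, so it complements a naming constraint in the prompt and does not replace one.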

The third problem is semantic divergence at module boundaries. An NLP codebase will typically have modules that produce numpy arrays, consume them, and pass them between pipeline stages. The contract between modules is often implicit: caller and callee agree on shape, dtype, and memory layout through convention rather than types. When you translate such a codebase to a statically typed language, every one of those implicit contracts has to become explicit, and the LLM has to guess what the contracts were from local evidence.
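One way to gather better evidence is to write the suspected contract down and check it at module boundaries while the original Python test suite runs. The sketch below is an assumption about how that could look: `ArrayContract` is a hypothetical helper that reads the `ndim`, `dtype`, and `flags.c_contiguous` attributes numpy arrays expose, and the example uses a stand-in object so it runs without numpy installed.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class ArrayContract:
    """An explicit version of what caller and callee agreed on by convention."""

    ndim: int
    dtype: str
    c_contiguous: bool = True

    def violations(self, arr) -> list[str]:
        """Describe every way `arr` departs from the contract."""
        problems = []
        if arr.ndim != self.ndim:
            problems.append(f"expected {self.ndim} dimensions, got {arr.ndim}")
        if str(arr.dtype) != self.dtype:
            problems.append(f"expected dtype {self.dtype}, got {arr.dtype}")
        if self.c_contiguous and not arr.flags.c_contiguous:
            problems.append("expected C-contiguous memory layout")
        return problems


# Stand-in with the same attributes a numpy array would have.
fake_embeddings = SimpleNamespace(
    ndim=2, dtype="float64", flags=SimpleNamespace(c_contiguous=True)
)
embeddings_contract = ArrayContract(ndim=2, dtype="float32")
print(embeddings_contract.violations(fake_embeddings))
```

A contract that survives the whole test suite without violations is a much stronger basis for a static type in the target language than what the model can infer from a single file.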

The translation memory pattern

The most effective mitigation for symbol consistency is maintaining an explicit translation dictionary alongside the translation work: a growing mapping from source symbols to their chosen target equivalents. Before translating each file, you inject the current state of this dictionary into the prompt. After translation, you extract any newly introduced symbols and add their translations to the dictionary for the next file.

This mirrors how professional human translators handle long documents. Translation tooling in the localization industry calls this a “translation memory,” and the pattern transfers directly to code. A prompt preamble might look like:

The following symbol mappings have already been established in previously
translated files. Use these exact names when referencing these symbols:

  sparse_encoder          → SparseEncoder
  compute_attention       → computeAttentionWeights
  NLPPipeline             → NlpPipeline
  batch_size              → batchSize

Translate the following module, using these names for all referenced symbols.
For any new symbols you introduce, follow the naming conventions above.

Without this constraint, the reconciliation work at the end of a large translation project can exceed the translation work itself.
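The bookkeeping around that preamble can be sketched in a few lines. The function names are hypothetical, and the step that extracts newly introduced symbols from a translated file is assumed to happen elsewhere, for example by asking the model to list them.

```python
def render_preamble(memory: dict[str, str]) -> str:
    """Format the current translation memory as a prompt preamble."""
    width = max((len(src) for src in memory), default=0)
    lines = [
        "The following symbol mappings have already been established in previously",
        "translated files. Use these exact names when referencing these symbols:",
        "",
    ]
    lines += [f"  {src.ljust(width)}  → {dst}" for src, dst in memory.items()]
    return "\n".join(lines)


def merge_new_symbols(memory: dict[str, str], new_mappings: dict[str, str]) -> None:
    """Add new mappings, refusing to silently rename an established symbol."""
    conflicts = {
        src: (memory[src], dst)
        for src, dst in new_mappings.items()
        if src in memory and memory[src] != dst
    }
    if conflicts:
        raise ValueError(f"conflicting translations: {conflicts}")
    memory.update(new_mappings)


memory = {"sparse_encoder": "SparseEncoder", "batch_size": "batchSize"}
merge_new_symbols(memory, {"NLPPipeline": "NlpPipeline"})
print(render_preamble(memory))
```

Raising on conflicts is a deliberate choice in this sketch: a conflict means a file was translated against a different name than the one on record, and that is cheaper to resolve immediately than at the end of the project.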

Where Claude’s context window actually helps

Claude’s 200,000-token context window is genuinely useful for codebase translation, but not primarily because it lets you fit large files. Most files in a real codebase fit comfortably within 8,000 tokens. The benefit is that you can include substantial supporting material alongside the file being translated: the full translation memory, type definitions from dependent modules that the current file imports, relevant documentation, and representative test cases.

The more context the model has about what the code is supposed to do, the more idiomatic the translation tends to be. A module that uses a custom tokenizer will produce a better translation if the prompt includes the translated tokenizer interface rather than just a stub type declaration.
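Deciding which supporting material goes into a prompt can be treated as a small budgeting problem. The sketch below makes two loud assumptions: that roughly four characters equal one token, which is only a rule of thumb and should be replaced by the provider's own token counting, and that a simple priority order is good enough. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    label: str
    text: str
    priority: int  # lower numbers are included first


def estimate_tokens(text: str) -> int:
    """Rough estimate, assuming about four characters per token."""
    return max(1, len(text) // 4)


def assemble_context(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Pick items in priority order, skipping any that would exceed the budget."""
    chosen, used = [], 0
    for item in sorted(items, key=lambda i: i.priority):
        cost = estimate_tokens(item.text)
        if used + cost <= budget:
            chosen.append(item)
            used += cost
    return chosen


items = [
    ContextItem("documentation", "d" * 4000, priority=2),
    ContextItem("translation memory", "m" * 40, priority=0),
    ContextItem("test cases", "t" * 80, priority=3),
    ContextItem("tokenizer interface", "i" * 400, priority=1),
]
print([item.label for item in assemble_context(items, budget=200)])
```

With a budget of 200 estimated tokens, the oversized documentation is skipped while the smaller, lower-priority test cases still fit.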

There is also a valid case for multi-file translation: taking a cluster of tightly coupled files and asking for a coordinated translation in one pass. This works well when the cluster is small, the dependencies between files in the cluster are dense relative to their dependencies outside the cluster, and the total token count is manageable. It fails when the cluster is large enough that the model starts losing track of decisions it made for earlier files in the same pass.

The idiomatic versus literal tradeoff

Every large translation project eventually forces a position on this spectrum. A literal translation preserves the structure of the source code: the same functions, the same control flow, the same decomposition into modules. It is easy to verify by comparison, easy to debug, and easy to explain to reviewers who know the original codebase. It also tends to produce target-language code that looks foreign, uses patterns the target community does not use, and will eventually have to be rewritten anyway.

An idiomatic translation produces code that looks like it was written in the target language from the start. For an NLP codebase moving from Python to Rust, an idiomatic translation would use iterator chains and trait-based abstractions where the Python used list comprehensions and duck typing. The resulting code is more maintainable long-term, but verifying that it is semantically equivalent to the original requires substantially more work.

LLMs default to something in between: they will not preserve Python idioms in non-Python code, but they will not reach for advanced target-language features unless the prompt encourages it. For production migrations, specifying the desired position on this spectrum explicitly, in the system prompt or the per-file instructions, produces more consistent results than relying on the model’s defaults.

What requires human judgment

The translation process reliably surfaces decisions that the original code made implicitly. Error handling strategy is a common one: Python codebases often have exception flows that were designed for one context and silently swallowed in another, and neither the LLM nor any other automated tool is likely to notice unless there are tests that exercise the error paths. Numeric precision is another: the difference between integer and floating-point division, or between 32-bit and 64-bit floats, is easy to overlook in dynamically typed Python and consequential in statically typed targets. Memory sharing semantics, the difference between a view and a copy of a numpy array, often have no direct equivalent in the target language and have to be resolved case by case.
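The numeric and memory-sharing cases are easy to demonstrate with the standard library alone. In this sketch `struct` and `memoryview` stand in for 32-bit floats and numpy views, which is a simplification of how numpy behaves.

```python
import struct

# Integer versus true division: Python picks by operator, not operand type.
true_div = 7 / 2      # 3.5
floor_div = 7 // 2    # 3

# Python's // rounds toward negative infinity. Integer division in C, Java,
# and Rust truncates toward zero, so a literal port changes negative results.
python_result = -7 // 2          # -4
truncating_result = int(-7 / 2)  # -3, what a truncating target would give

# 32-bit versus 64-bit floats: round-trip 0.1 through a 4-byte float.
as_f32 = struct.unpack("f", struct.pack("f", 0.1))[0]
precision_lost = as_f32 != 0.1

# View versus copy: a memoryview shares memory, a bytes copy does not.
buffer = bytearray(b"\x01\x02\x03\x04")
view = memoryview(buffer)[1:3]
copy = bytes(buffer[1:3])
buffer[1] = 0xFF

print(python_result, truncating_result, precision_lost, view[0], copy[0])
```

None of these differences raises an error in either language. The translated code runs and returns a plausible value, which is why they tend to surface only under tests with negative inputs, tight tolerances, or in-place mutation.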

These are the places where translated code looks correct and is not. Catching them requires understanding both the source semantics and the target semantics, which is the part of translation that has not been automated. LLM-assisted translation shifts the labor: the mechanical work of producing a translated file moves from hours to minutes. The verification and refinement work, which depends on understanding what the code is supposed to do, does not shrink at the same rate.

The practical implication is that LLMs make codebase translation tractable for projects where it previously was not worth attempting, but they do not change what the final standard of correctness is. A translated codebase still has to pass its tests, produce the same outputs on the same inputs, and be understood by the people who will maintain it. The path to that standard is shorter with a good LLM and a disciplined process; it does not disappear.
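"The same outputs on the same inputs" can be checked mechanically with a small differential harness. This is a sketch under stated assumptions: `reference` and `candidate` are plain Python callables here, whereas in a real migration the candidate would wrap a call into the translated code through a subprocess or foreign-function binding, which is not shown.

```python
import math
from typing import Any, Callable, Iterable


def outputs_match(expected: Any, actual: Any, rel_tol: float = 1e-6) -> bool:
    """Compare outputs, recursing into sequences and tolerating float noise."""
    if isinstance(expected, (list, tuple)) and isinstance(actual, (list, tuple)):
        return len(expected) == len(actual) and all(
            outputs_match(e, a, rel_tol) for e, a in zip(expected, actual)
        )
    if isinstance(expected, float) and isinstance(actual, (int, float)):
        return math.isclose(expected, actual, rel_tol=rel_tol)
    return expected == actual


def find_mismatches(
    reference: Callable[..., Any],
    candidate: Callable[..., Any],
    inputs: Iterable[tuple],
    rel_tol: float = 1e-6,
) -> list[tuple]:
    """Run both implementations on each input and collect disagreements."""
    mismatches = []
    for args in inputs:
        expected = reference(*args)
        actual = candidate(*args)
        if not outputs_match(expected, actual, rel_tol):
            mismatches.append((args, expected, actual))
    return mismatches


# Toy example: the "translation" truncates where the original floors.
original = lambda a, b: a // b
translated = lambda a, b: int(a / b)
print(find_mismatches(original, translated, [(7, 2), (-7, 2)]))
```

The relative tolerance is a judgment call: too tight and harmless float differences drown out real bugs, too loose and a precision regression slips through. Values expected to be exactly zero need an absolute tolerance instead, which this sketch does not include.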

The techniques above are drawn from practical patterns in LLM-assisted migration work. Daniel Janus’s post covers one practitioner’s specific experience with a real NLP codebase and is worth reading alongside the general framing here.