What a Non-Trivial NLP Port Reveals About LLM-Driven Code Migration

The standard pitch for using LLMs in code translation is compelling in the abstract: instead of mechanical transpilers that convert syntax without understanding intent, you get something closer to how a skilled developer thinks about porting. The model reasons about what the code is doing and produces an idiomatic equivalent in the target language.

Daniel Janus tested this claim on real NLP code in a recent writeup worth reading alongside this analysis, and the results illustrate both where Claude’s capability is genuine and where the limits are hard.

NLP codebases are not a random selection of complexity. They combine properties that make translation particularly demanding: deep library coupling, data-dependent behavior, Unicode-intensive string processing, and pipeline architectures where errors compound across stages. Choosing NLP code as a test case is, intentionally or not, a good way to locate the actual boundary of what works.

The Library Equivalence Problem Is Harder Than It Looks

Most LLM code translation demonstrations use algorithmically pure code: sorting, graph traversal, data structure manipulation. These translate well because the only thing changing is syntax and idiom. The underlying algorithm is the same in any language.

NLP code is defined by its libraries. A Python tokenizer built on spaCy is not just using spaCy for convenience; it relies on specific tokenization decisions that spaCy makes, decisions encoded in language-specific rules and hand-curated exception lists. When you translate that code to another language, there is often no library with semantically equivalent behavior. The closest tokenizer in the target ecosystem may produce different output on Unicode edge cases, handle punctuation differently, or segment compound words by different rules.

This creates a choice with no clean resolution: find the closest available library and accept the behavioral differences, reimplement the logic directly, or call back to the original library via FFI or subprocess. The third option sounds like a compromise but is often the correct engineering decision when the source library’s behavior is what the downstream system depends on. An LLM can suggest these options but cannot make the decision for you, because the decision depends on whether behavioral fidelity or codebase coherence matters more in your specific context.
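As a rough illustration of the third option, here is a minimal sketch of the source-language side of a subprocess bridge: a worker that answers one JSON request per line, which a process written in the target language could spawn and talk to over stdin and stdout. The wire format, the function names, and the regex tokenizer are all assumptions made for this example; a real worker would call the original library where the stand-in tokenizer sits.

```python
import json
import re
from typing import Iterable, Iterator

# Stand-in for the source library's tokenizer (hypothetical). A real worker
# would call the original library here so its exact behavior is preserved.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens."""
    return _TOKEN_RE.findall(text)


def handle_line(line: str) -> str:
    """Turn one JSON request line into one JSON response line."""
    request = json.loads(line)
    tokens = tokenize(request["text"])
    # ensure_ascii=False keeps non-ASCII tokens readable on the wire.
    return json.dumps({"tokens": tokens}, ensure_ascii=False)


def serve(lines: Iterable[str]) -> Iterator[str]:
    """Answer each non-empty request line, in order."""
    for line in lines:
        if line.strip():
            yield handle_line(line)
```

A worker script built on this would loop over serve(sys.stdin) and print each reply with flush=True. The cost of this design is a process boundary and serialization on every call; what it buys is that the tokenization decisions stay exactly those of the source library.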

The pattern generalizes beyond NLP. Any codebase tightly coupled to ecosystem-specific libraries faces this problem during translation. NLP code makes it visible because the libraries are doing so much of the substantive work.

Unicode and the String Model Mismatch

Text processing code exposes a subtler problem: different language runtimes have different string models, and code that does any index arithmetic on strings will not translate correctly without careful review.

Python 3 strings are sequences of Unicode code points. JavaScript strings are sequences of UTF-16 code units, and length counts code units, not characters, which means a string containing a single emoji outside the Basic Multilingual Plane, such as U+1F600, has a length of 2. Rust’s String is validated UTF-8: its len() counts bytes, iterating with chars() yields Unicode scalar values, and iterating by grapheme cluster requires the unicode-segmentation crate. Go’s range over a string iterates by rune (code point), but indexing by position gives you bytes.
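The differences are easy to see by measuring one string three ways. The sketch below uses only Python's standard library; the function name is made up for this example. It reports the count Python's len() gives (code points), the count JavaScript's length would give (UTF-16 code units), and the count Rust's len() or Go's len() would give (UTF-8 bytes). Grapheme clusters are not covered, since the standard library has no segmentation support for them.

```python
def string_lengths(text: str) -> dict[str, int]:
    """Measure one string the way three different string models would."""
    return {
        # Python 3 len(): number of Unicode code points.
        "code_points": len(text),
        # JavaScript .length: number of UTF-16 code units (2 bytes each).
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Rust String::len() and Go len(s): number of UTF-8 bytes.
        "utf8_bytes": len(text.encode("utf-8")),
    }


# ASCII, precomposed e-acute, e plus combining acute, and an emoji (U+1F600).
for sample in ("cat", "\u00e9", "e\u0301", "\U0001F600"):
    print(repr(sample), string_lengths(sample))
```

For ASCII input all three numbers agree, which is why tests written against ASCII text cannot tell the models apart.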

An LLM translating string processing code will often produce syntactically correct output that fails on inputs containing multi-byte characters. The failure mode is not a crash or a type error; it is subtly wrong output that passes tests written against ASCII inputs. This is exactly the kind of error that ASCII-only unit tests miss and that tests run against realistic multilingual input catch.
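To make the failure concrete, consider a token offset computed in Python and then used as an index in a UTF-16 runtime. The helper below is a hypothetical sketch, not part of any library, and it assumes the text contains no lone surrogates. It converts a code-point index into the UTF-16 code-unit offset that points at the same character.

```python
def utf16_offset(text: str, index: int) -> int:
    """Convert a code-point index into text to a UTF-16 code-unit offset.

    Assumes text holds no lone surrogates, which cannot be encoded.
    """
    if not 0 <= index <= len(text):
        raise IndexError("index out of range")
    # Each UTF-16 code unit is 2 bytes, so the encoded prefix length / 2
    # is the number of code units that precede the character at index.
    return len(text[:index].encode("utf-16-le")) // 2


ascii_text = "abc"
emoji_text = "a\U0001F600b"

# On ASCII input the two offsets coincide, so an ASCII-only test passes.
print(utf16_offset(ascii_text, 2))
# With an emoji before the target character, the offsets diverge.
print(utf16_offset(emoji_text, 2))
```

A port that reuses the Python index unchanged would be correct on the first string and off by one on the second, without raising any error.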

The Anthropic documentation for long-context prompting recommends placing long supporting material near the top of the prompt, above the query and instructions. For translation tasks with Unicode-sensitive code, this means including explicit notes about the target runtime’s string model near the top of the prompt, before the source code. Without that framing, the model tends to produce code that mirrors the source language’s string semantics rather than the target’s.

Context Window Strategy for Large Codebases

A non-trivial codebase often does not fit in a single context window. Even with Claude’s 200,000-token context window, a codebase of tens of thousands of lines needs to be split into translation units. How you split it determines how consistent the output is.

The worst approach is translating file by file without attention to shared interfaces. You end up with inconsistent naming conventions, duplicated type definitions, and incompatible function signatures that need reconciliation in a second pass.

A better approach is to translate in dependency order, starting with the data types and interfaces the rest of the codebase depends on. Once those are stable, translate the implementations against them, providing the already-translated types as context. This is the same principle that makes human-driven port projects tractable: define the seam first, then fill in either side.
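Dependency order is a topological sort of the module graph. A minimal sketch using Python's standard graphlib module (available since Python 3.9) follows; the module names and the dependency map are hypothetical, and in a real project the map would come from parsing imports.

```python
from graphlib import TopologicalSorter

# Hypothetical module graph: each key depends on the modules in its set.
DEPENDENCIES: dict[str, set[str]] = {
    "types": set(),
    "tokenizer": {"types"},
    "pipeline": {"tokenizer", "types"},
}


def translation_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return modules so that each one comes after everything it depends on."""
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(dependencies).static_order())


print(translation_order(DEPENDENCIES))
```

A CycleError here is useful information in its own right: mutually dependent modules have to be translated together as one unit, or have their shared interface extracted first.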

The most effective prompts for this kind of work combine the source module being translated, the already-translated interfaces it depends on, the tests that cover it, and an explicit constraint that the translated code must pass those tests. The tests function as a specification. They tell the model what the code must do, not just what it currently does, which substantially improves output quality on edge cases.
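One way to assemble such a prompt is sketched below. The section titles, the ordering of the sections, and the wording of the instruction are assumptions made for illustration, not a format prescribed by any model vendor; the supporting material goes first and the instruction last.

```python
def build_translation_prompt(
    target_language: str,
    interfaces: str,
    tests: str,
    source_module: str,
) -> str:
    """Assemble a translation prompt: context first, instruction last."""
    sections = [
        ("Already-translated interfaces", interfaces),
        ("Tests the translation must pass", tests),
        ("Source module to translate", source_module),
    ]
    body = "\n\n".join(
        f"## {title}\n{content.strip()}" for title, content in sections
    )
    instruction = (
        f"Translate the source module to {target_language}. "
        "Use the interfaces above as given. "
        "The translated code must pass the tests above."
    )
    return f"{body}\n\n{instruction}"
```

Keeping the builder as a plain function makes the prompt reproducible per module, so a failed translation can be retried with exactly the same context plus the failing test output.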

Where the Model Needs Explicit Guidance

Several categories of decision do not translate reliably without explicit direction.

Error handling philosophy is one. Python NLP code often uses exceptions pervasively. A translation to Go or Rust requires deciding where errors should be propagated versus handled locally, and how to handle conditions that have no direct equivalent in the target language. A mechanical translation produces code that is syntactically valid but architecturally wrong: deeply nested in ways the target language’s idioms are designed to avoid. Asking explicitly for idiomatic error handling and providing examples from the target codebase improves this substantially.

Performance-sensitive code is another. An NLP pipeline that relies on NumPy for vectorized operations cannot be ported to a language that lacks an equivalent SIMD-aware array library unless the algorithm changes as well. The model will often produce a correct but slow equivalent. If performance is a requirement, you need to specify that as a constraint and review the output accordingly.

Stateful components need explicit attention. NLP pipelines frequently have shared state: vocabulary objects, model handles, configuration initialized once and passed through a pipeline. Translating these correctly requires understanding the ownership and lifetime semantics of the target language, which vary significantly between garbage-collected languages and languages with explicit ownership like Rust. Providing the model with the relevant ownership conventions for the target language, rather than letting it infer them, reduces the rate of subtle lifetime bugs in the output.

The Validation Gap

The gap between “syntactically translated” and “semantically equivalent” is where most LLM-assisted migration projects stall. Syntactic translation is fast. Validation is slow.

The useful mental model is to treat the translation itself as a draft and the test suite as acceptance criteria. If you start a migration project with inadequate test coverage, writing the tests before you begin is not extra work; it is the work. Tests written against the source code become the specification for the target code.

For NLP code specifically, the most valuable tests are golden-output tests, sometimes called differential or characterization tests: take a corpus of representative inputs, run them through the source pipeline, capture the outputs, and verify the translated pipeline produces the same outputs on the same inputs. This catches library equivalence failures, Unicode handling bugs, and pipeline stage mismatches in one test class.
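The capture-and-compare loop described above can be sketched in a few lines. Everything named here is hypothetical: the two stand-in pipelines are deliberately simple tokenizers that disagree on punctuation, and in practice the translated pipeline would run in the target language and read the same captured JSON.

```python
import json
import re
from typing import Callable

Pipeline = Callable[[str], list[str]]


def record_golden(pipeline: Pipeline, inputs: list[str]) -> str:
    """Run the source pipeline over the corpus and capture outputs as JSON."""
    records = [{"input": text, "output": pipeline(text)} for text in inputs]
    return json.dumps(records, ensure_ascii=False, indent=2)


def find_mismatches(pipeline: Pipeline, golden_json: str) -> list[dict]:
    """Replay captured inputs through a pipeline and collect differences."""
    mismatches = []
    for record in json.loads(golden_json):
        actual = pipeline(record["input"])
        if actual != record["output"]:
            mismatches.append(
                {
                    "input": record["input"],
                    "expected": record["output"],
                    "actual": actual,
                }
            )
    return mismatches


# Stand-in pipelines (hypothetical): they agree on plain words only.
def source_pipeline(text: str) -> list[str]:
    return re.findall(r"\w+|[^\w\s]", text)


def ported_pipeline(text: str) -> list[str]:
    return text.split()


golden = record_golden(source_pipeline, ["plain words", "Hello, world!"])
for mismatch in find_mismatches(ported_pipeline, golden):
    print(mismatch)
```

The corpus matters more than the harness. If it contains only ASCII text and simple punctuation, the comparison will pass for ports that are still wrong on the inputs the system actually sees.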

Facebook AI Research’s TransCoder project, one of the earlier systematic approaches to neural code translation, found that functional equivalence on held-out test cases was substantially harder to achieve than syntactic correctness. The ordering of those two metrics has not changed as models have improved: the gap has narrowed, but it remains meaningful. That gap is exactly where human review is essential, regardless of which model is doing the translation.

What This Kind of Project Is Actually For

LLM-assisted code translation is not a tool for developers who do not understand the target language. It is a productivity multiplier for developers who understand both languages and want to eliminate the mechanical conversion work. The design decisions, the library choices, the validation work, and the review of semantic correctness all require someone who knows what the code is supposed to do.

What changes with a capable model like Claude is the cost of the mechanical parts: boilerplate conversion, API lookup, syntactic adaptation. That cost is real, and reducing it meaningfully accelerates migration projects. It does not reduce the cognitive work of verifying that the translation is correct.

NLP codebases are a useful stress test precisely because they contain enough of the hard cases to reveal where the mechanical parts end and the judgment calls begin. Janus’s writeup is worth reading as a current-state record of where that line sits.