
Rust's Borrow Checker Is an AI Stress Test, and the Survey Results Show It

Source: lobsters

When Niko Matsakis published a summary of the Rust project’s internal survey on AI tools, the most striking thing was not the diversity of opinions. It was that the diversity maps almost perfectly onto Rust’s core design decisions. The people who find AI tools genuinely useful for Rust and the people who find them frustrating are, in large part, disagreeing about what kind of correctness matters and where the compiler sits in the loop.

That alignment is not a coincidence. It tells you something specific about the relationship between formal type systems and large language models.

What the Survey Is Actually Measuring

The Rust project surveyed its own contributors, not a random sample of Rust users. That distinction matters. These are people who understand the language deeply, have opinions about borrow checking, and are attuned to the difference between code that compiles and code that is correct. When they express skepticism about AI tools, it carries more signal than a general developer survey where respondents may not be able to distinguish a subtle lifetime error from working code.

The survey also comes from a project with an unusually strong culture of epistemic care. The Rust RFC process, the emphasis on edition-based compatibility, the detailed rationale behind type system decisions: these reflect a community that takes correctness seriously as a design value, not just a marketing claim. Collecting structured data from contributors rather than issuing a top-down position on AI is consistent with that culture.

What emerges is less a verdict on AI tools than a map of where those tools succeed and fail against a specific kind of formal constraint.

The Borrow Checker as a Discriminator

Large language models are trained to predict likely next tokens given prior context. For code, this means they are pattern-matching against the distribution of code they have seen. Python, JavaScript, and Go code is abundant in training data, and those languages offer relatively weak static guarantees. A model can generate syntactically valid, functionally reasonable code in those languages by approximating common patterns without needing to track complex cross-scope invariants.

Rust’s ownership and borrowing rules require something different. Whether a reference is valid at a given program point depends on control flow, move semantics, and the lifetimes of the values it points into. The compiler enforces these constraints using a region-based analysis that has no direct parallel in natural language structure. It is not the kind of reasoning that benefits straightforwardly from having seen more examples.

The result is a characteristic failure mode. LLMs produce Rust that looks locally plausible, uses idiomatic syntax, and passes a surface-level review, but fails to compile because a value was moved into a closure that is called more than once, or because a mutable borrow and an immutable borrow of the same data overlap in time:

// A common shape of AI-generated error
fn process(data: &mut Vec<i32>) {
    let first = &data[0];           // immutable borrow starts
    data.push(42);                  // error: cannot borrow `data` as mutable
                                    // because it is also borrowed as immutable
    println!("{}", first);
}

This specific failure is not a hallucination in the colloquial sense. The model is not inventing an API or a rule that does not exist; it simply cannot maintain a consistent model of ownership across a non-trivial scope, because that constraint is not encoded in the kinds of patterns that high-frequency training data teaches.
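For contrast, a minimal rewrite of the same function that satisfies the checker: because `i32` is `Copy`, reading the element by value means no borrow of `data` is live when the mutation happens.

```rust
// One valid rewrite: copying the i32 ends the borrow immediately,
// so nothing outlives the mutable use of `data` in `push`.
fn process(data: &mut Vec<i32>) {
    let first = data[0];   // copies the value; no outstanding borrow
    data.push(42);
    println!("{}", first);
}
```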

Training Data and the Compounding Problem

Rust has substantially less public training data than Python, JavaScript, or C++. The 2024 Stack Overflow Developer Survey showed Rust with strong enthusiasm but still used by a small fraction of professional developers compared to the dominant web and scripting languages. That translates directly to less code for models to learn from.

More importantly, Rust idioms are not always derivable from type-level principles. The conventions around ? operator propagation with custom error types, the subtleties of Deref coercions, the patterns for structuring async code with explicit Pin and Future combinators: these are learned behaviors that rely on having seen the pattern applied in context. A model with limited exposure to production Rust code has genuine gaps in these areas, not just reduced confidence.
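The `?`-propagation idiom mentioned above is mechanical once seen, but not derivable from types alone: `?` calls `From::from` on the error before returning it, so a custom error type needs one `From` impl per underlying error. A minimal sketch (the `ConfigError` type and `parse_port` function are illustrative, not from the survey):

```rust
use std::num::ParseIntError;

#[derive(Debug)]
enum ConfigError {
    BadPort(ParseIntError),
}

// This impl is what lets `?` convert the underlying error automatically.
impl From<ParseIntError> for ConfigError {
    fn from(e: ParseIntError) -> Self {
        ConfigError::BadPort(e)
    }
}

fn parse_port(s: &str) -> Result<u16, ConfigError> {
    let port: u16 = s.trim().parse()?; // ParseIntError -> ConfigError via From
    Ok(port)
}
```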

The compiler's own documentation provides some help. The Rust error index contains structured, machine-readable error explanations with examples of invalid code and the corrected version. This is unusually good signal for compiler-in-the-loop generation workflows, and it partially explains why iterative approaches (generate, compile, correct on error output, repeat) work better for Rust than naive single-shot generation.

The compiler’s diagnostic quality is a genuine asset here. When an LLM generates invalid Rust, rustc does not just say the code is wrong. It identifies the specific lifetime or ownership rule being violated, often with a suggested fix. A loop that feeds compiler output back into the model can converge on valid code for many common tasks, though it requires more rounds than the same workflow would need for Python.
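A compiler-in-the-loop harness leans on exactly this structure: rustc diagnostics carry a stable error code (`error[E0502]: …`) that a loop can extract and use to look up the corresponding error-index explanation before re-prompting. A minimal sketch of that extraction step, run here on a hard-coded diagnostic line rather than live rustc output:

```rust
// Pull the stable error code out of a rustc diagnostic line: the hook a
// generate-compile-correct loop would use to fetch the error-index entry.
fn error_code(diagnostic: &str) -> Option<&str> {
    let start = diagnostic.find("error[")? + "error[".len();
    let end = start + diagnostic[start..].find(']')?;
    Some(&diagnostic[start..end])
}
```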

The unsafe Problem Is Different in Kind

For safe Rust, the compiler acts as a verifier. An LLM’s errors are caught mechanically before the code runs. For unsafe Rust, that guarantee disappears. Code inside unsafe blocks can compile cleanly while violating the aliasing rules, creating invalid references, or constructing values with incorrect layout assumptions. The undefined behavior is latent, detectable only through testing, sanitizers, or careful manual review.

LLMs do not have a strong model of what constitutes undefined behavior in a systems context. They have seen examples of unsafe Rust code, including examples of correct unsafe patterns, but they also conflate patterns across language boundaries. A model that has seen transmute used correctly in a few cases may apply it in contexts where the type layout assumptions do not hold:

// The unsafe block compiles; the UB is not statically caught
let bytes: Vec<u8> = vec![0x68, 0x65, 0x6c, 0x6c, 0xf0]; // invalid UTF-8
let s: &str = unsafe {
    std::str::from_utf8_unchecked(&bytes) // UB: bytes are not valid UTF-8
};

This asymmetry between safe and unsafe Rust is probably the sharpest version of the concern that appears in the project survey. The case for using AI tools for safe Rust is defensible: the compiler verifies the output, and errors are caught before they cause harm. The case for trusting AI-generated unsafe Rust without careful review is much weaker, and in a systems programming context where memory safety violations have serious consequences, the asymmetry matters.

The Evolving Target

One thing that tends to get missed in these discussions is that the borrow checker itself is a moving target. The next-generation borrow checker, Polonius, models borrows as Datalog-style facts rather than using the current region-based analysis. It accepts a broader class of valid programs, particularly ones where a borrow is returned along only some control-flow paths, and eliminates a category of valid-but-rejected code that experienced Rust programmers have learned to work around.

The non-lexical lifetimes improvement that shipped with Rust 2018 already moved this boundary once. Code that required explicit scope manipulation before NLL was valid and idiomatic after it. Polonius will move it again. The patterns that LLMs have learned from pre-Polonius Rust code will partially apply to post-Polonius Rust and partially not.
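The boundary NLL moved is easy to exhibit. This function compiles today because the borrow of `v` ends at its last use; under the pre-2018 lexical analysis the borrow lasted to the end of the scope, so the `push` was rejected unless the programmer introduced an explicit inner scope.

```rust
fn nll_example() -> i32 {
    let mut v = vec![1, 2, 3];
    let first = &v[0];
    let copied = *first;  // last use: the borrow ends here under NLL
    v.push(4);            // accepted post-NLL, rejected by lexical lifetimes
    copied + v.len() as i32
}
```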

This creates an unusual dynamic: the language’s correctness surface area is expanding in ways that will eventually produce better-verified programs, but that also require new idiom exposure before models can reason about them reliably. Training data has to catch up to each language improvement before the corresponding idioms are available to models at scale.

Where the Tool Actually Helps

The practical consensus among Rust developers who have incorporated AI tools into their workflow tends to converge on a similar set of use cases: generating boilerplate impl blocks for standard traits like Display, Debug, or From; suggesting method chains over iterators; explaining crate APIs from documentation; drafting error type hierarchies; and producing skeleton structures for common patterns like the builder pattern or a state machine.

These are tasks where the compiler’s verification provides a safety net and where the pattern-matching strength of LLMs is genuinely useful. Trait implementations follow consistent structural patterns. Iterator chains are compositional and locally checkable. Builder types have a recognizable shape.
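A representative example of that first category: the kind of `Display` boilerplate where the output follows a fixed structural shape and the compiler fully checks it (the `FetchError` type is illustrative).

```rust
use std::fmt;

#[derive(Debug)]
enum FetchError {
    NotFound,
    Timeout(u64),
}

// Mechanical Display impl: exactly the shape LLMs reproduce reliably,
// and anything malformed fails to compile before review.
impl fmt::Display for FetchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FetchError::NotFound => write!(f, "resource not found"),
            FetchError::Timeout(ms) => write!(f, "timed out after {} ms", ms),
        }
    }
}
```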

For tasks that require reasoning about lifetimes across non-trivial scope boundaries, complex async combinators, FFI boundaries, or custom allocator implementations, the gap between what a model generates and what is actually valid is wider, and the cost of not catching an error is higher. The survey’s spread of opinions likely tracks this distinction, with the most skeptical contributors working in precisely the areas where the formal constraints bite hardest.

What This Says About AI and Language Design

The Rust project’s decision to survey its membership rather than issue a policy position reflects something worth naming. The project has a sophisticated enough model of its own design philosophy to ask a substantive question: how does a language whose value proposition is about formal correctness interact with tooling whose value proposition is about statistical pattern matching? The answer is nuanced enough that it required asking the people closest to the work.

Rust is useful as a lens here precisely because its constraints are explicit and formally verified. When an LLM fails at Rust, the failure is legible: the compiler says exactly what rule was violated. This makes Rust a better diagnostic for understanding where probabilistic code generation actually sits than languages with weaker static guarantees, where errors may be silent or require runtime observation.

The borrow checker is not just a hurdle for AI tools. It is a probe. It reveals the specific class of reasoning that statistical pattern matching cannot reliably approximate: reasoning about the temporal and spatial validity of references across a program’s control flow. That observation has implications for any language with similar formal properties, and for the broader question of what role LLMs can play in correctness-critical software development. The Rust project is in a good position to think carefully about that question, and the survey suggests it is doing exactly that.
