Why AI Code Assistants Fail at Lisp

A few weeks ago, working on a Common Lisp side project, I noticed how quiet the AI suggestions had gotten. Copilot had nothing useful. Claude’s completions were hesitant and frequently wrong in ways they simply are not when I am writing TypeScript. The background cadence of AI assistance that has become normal for most programming was gone.

Dan Haskin’s recent post names this directly: writing Lisp is AI-resistant, and that is a genuine productivity cost. His explanation, that there is simply less Lisp training data, is correct as far as it goes. But corpus size is not the whole story, and the rest of it says something about how these tools actually work.

Training Data Is the Floor, Not the Ceiling

The training data explanation is easy to sanity-check against popularity rankings, which are a rough proxy for how much public code exists in each language. According to GitHub’s Octoverse, Python, JavaScript, TypeScript, Java, and C# account for the overwhelming majority of repository activity year over year. Common Lisp does not appear in the top 25. On the separate TIOBE index, Clojure surfaces occasionally around position 20-25, which sounds decent until you compare it to Python’s dominance. Scheme and Racket are more obscure still.

When large code models are trained on datasets like The Stack v2, they are working from terabytes of source code in which Lisp variants represent a fraction of a percent. The coverage is there, in that these languages are not completely absent, but the density is orders of magnitude lower than what the model has for mainstream languages.

The effect on output quality is measurable. Benchmarks like MultiPL-E, which evaluates code generation across multiple languages by translating Python benchmarks such as HumanEval, generally show lower pass rates for less common languages than for popular ones. The correlation between a language’s presence in training data and a model’s accuracy in that language is strong, though not perfect. Corpus size is not the only variable, but it is the dominant one.

For Lisp specifically, this means the model has enough exposure to write syntactically valid s-expressions and use core standard library forms, but the confidence and reliability you get with Python completions is not there. It knows the grammar. It does not have the idioms.

The Macro Problem Is Harder Than Training Data Alone

There is a second problem that is specific to Lisp and does not get resolved by simply adding more training data: macros create new syntax.

In Python, the syntax is fixed. A statement like with open("file.txt") as f: always parses the same way, no matter which library you are using. The model learns the form once and applies that knowledge everywhere. In Common Lisp, any programmer can define a macro that looks syntactically identical to a function call but expands into arbitrary code with its own argument structure and semantics.
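A minimal sketch of what that looks like in practice. Both names here, add-squares and with-squared, are hypothetical and invented for this illustration. At the call site the two forms have the same shape, but only one of them evaluates its arguments the way a function call would.

```lisp
;; An ordinary function: every argument is evaluated before the call.
(defun add-squares (x y)
  (+ (* x x) (* y y)))

;; A macro with the same surface shape. Its first "argument" is not an
;; expression at all: it is a (variable value) pair that the macro takes
;; apart at expansion time.
(defmacro with-squared ((var value) &body body)
  `(let ((,var (expt ,value 2)))
     ,@body))

;; Call sites look alike:
;;   (add-squares 3 4)
;;   (with-squared (n 5) (+ n 1))
;; If WITH-SQUARED were a function, (n 5) would be evaluated as a call to
;; a function named N. Because it is a macro, N becomes a new local variable.
```

Nothing in the text of the call tells you which rule applies. You have to know the definition, or ask the image with macro-function, which is exactly the knowledge a completion model has to get from its training data.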

Consider the loop macro in Common Lisp:

(loop for x in list
      when (evenp x)
      collect (* x x))

This is not a function call with keyword arguments. It is a sub-language with its own keywords: for, in, when, collect, while, across, being the hash-keys of, sum, count, do. The iterate library offers a different set of keywords for the same class of problems. The series library takes yet another approach. Each library author has effectively written a mini-language, and the model needs to have seen substantial examples of each one to predict correct usage.
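To make the sub-language point concrete, here is the same computation written three ways using only standard Common Lisp. The function names are mine, chosen for the comparison. The first version uses loop keywords, and the other two use ordinary function calls and the simpler dolist macro.

```lisp
;; LOOP version: FOR, IN, WHEN, and COLLECT are loop keywords, not functions.
(defun squares-of-evens-loop (numbers)
  (loop for x in numbers
        when (evenp x)
          collect (* x x)))

;; Functional version: plain function calls, no special keywords.
(defun squares-of-evens-functional (numbers)
  (mapcar (lambda (x) (* x x))
          (remove-if-not #'evenp numbers)))

;; DOLIST version: explicit accumulation, reversed at the end because
;; PUSH builds the list back to front.
(defun squares-of-evens-dolist (numbers)
  (let ((result '()))
    (dolist (x numbers (nreverse result))
      (when (evenp x)
        (push (* x x) result)))))
```

All three return (4 16) for the input (1 2 3 4 5). A model that has mostly seen one of these styles gets little help from it when the surrounding code uses another.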

The same pattern repeats across the standard itself and popular Lisp libraries. The standard macros with-open-file, handler-case, and restart-case, and library macros such as cl-ppcre:register-groups-bind and uiop:with-temporary-file, all define their own syntactic conventions. The model has to independently learn each one from whatever training examples exist.
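A small sketch of how much the conventions vary, restricted to macros from the standard so that it runs without any libraries. The function name demo-binding-shapes is made up for this example. Each form introduces variables, but no two of them put the variable in the same place.

```lisp
(defun demo-binding-shapes ()
  (list
   ;; Shape: (with-output-to-string (var) body...)
   (with-output-to-string (out)
     (format out "~a-~a" 1 2))
   ;; Shape: (with-input-from-string (var string) body...)
   (with-input-from-string (in "42 rest")
     (read in))
   ;; Shape: (multiple-value-bind (vars...) values-form body...)
   (multiple-value-bind (quotient remainder) (floor 17 5)
     (list quotient remainder))
   ;; Shape: (handler-case form (condition-type (var) body...)...)
   ;; The protected form comes first and the clauses follow it.
   (handler-case (parse-integer "not a number")
     (parse-error () :caught))))
```

Calling (demo-binding-shapes) returns ("1-2" 42 (3 2) :caught). Four macros, four different answers to the question of where the binding goes and what comes after it.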

Compare this to Python, where even the most complex third-party libraries are built on the same fixed syntax. Learning how contextlib.contextmanager works transfers directly to using it anywhere. In Common Lisp, understanding one library’s resource-management macro does not automatically help you predict another library’s resource-management macro, because they may use completely different syntactic conventions.

The training data problem and the macro problem compound each other. It is not just that there is less Lisp code in general. It is that the Lisp code that does exist uses a wide variety of macro vocabularies, each of which needs to be learned separately from a small base.

What This Says About LLM Code Generation

The Lisp experience makes visible something that stays hidden when you work in popular languages: LLM code generation is primarily corpus-driven pattern matching, not reasoning from language specifications.

When Copilot writes accurate Python, it is not because it has formally understood Python’s semantics. It is because it has seen enough Python to pattern-match your current context to millions of similar situations in training data. That is genuinely useful, and it works well for common patterns in well-represented languages. But it is not comprehension.

Lisp removes the pattern-matching crutch. The model does not have enough training examples to fall back on similarity matching, and it has not built the kind of structural reasoning that would let it work from first principles. The result is unreliable output.

This is worth keeping in mind when evaluating AI code tools more generally. The strong completions you get for Python and TypeScript reflect training data density as much as any fundamental model capability. You see the same degradation when working with Rust features introduced in the last six months, or with niche libraries that have sparse public usage. Lisp is an extreme point on a spectrum, not a special case.

There is also a structural issue with how these models handle s-expression syntax specifically. Completing inside nested parentheses requires tracking syntactic depth across long spans of context, knowing whether you are in a function position or an argument position, and understanding what the surrounding form expects. These are things a human reader of Lisp learns to do through familiarity with the language’s structure, but a model trained mostly on C-style syntax does not have that same structural intuition baked in.
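A rough sketch of the bookkeeping involved. The helper names delimiter-p and open-forms are hypothetical, and the scanner is deliberately simplified: it ignores strings, comments, and character literals, so treat it as an illustration of the idea, not as a reader.

```lisp
(defun delimiter-p (ch)
  "True if CH ends a token in this simplified scanner."
  (member ch '(#\( #\) #\Space #\Newline #\Tab)))

(defun open-forms (text)
  "Return the operators of the forms still open at the end of TEXT,
innermost first. Assumes no strings, comments, or character literals."
  (let ((stack '())
        (i 0)
        (n (length text)))
    (loop while (< i n)
          do (let ((ch (char text i)))
               (cond ((char= ch #\()
                      ;; The token right after an open paren is the operator.
                      (let ((end (or (position-if #'delimiter-p text
                                                  :start (1+ i))
                                     n)))
                        (push (subseq text (1+ i) end) stack)
                        (setf i end)))
                     ((char= ch #\))
                      ;; A close paren finishes the innermost open form.
                      (pop stack)
                      (incf i))
                     (t (incf i)))))
    stack))
```

For the prefix "(defun f (x) (loop for y in x when (evenp y" this returns ("evenp" "loop" "defun"): the cursor is three forms deep, inside a call to evenp, inside a loop clause, inside a function body. A correct completion depends on all three layers, and the middle one is a macro with its own grammar.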

Whether This Improves

Larger models and expanding training sets will likely improve minority language support incrementally. Fine-tuning on specific Lisp dialects is another path; some developers have reported better results with models specifically tuned on Common Lisp or Clojure corpora, though these are not widely available or well-maintained.

The macro problem is more durable. The total volume of idiomatic Common Lisp code on the public internet is small enough that even a substantially larger training set covers it unevenly. And the nature of Lisp means novel macro usage, code written with a macro in a context the model has not seen, will remain a persistent failure case regardless of overall corpus growth.

Lisp’s capacity to define new syntax, to build languages that fit the problem domain rather than the implementation, is what makes it expressive and worth using for the people who use it. That same property means every serious Lisp codebase is partly written in a dialect specific to that project. The more idiomatic the code, the less the model can help, because idiomatic Lisp leans hardest on project-specific macros, which is exactly the code the model has seen least of.
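As a sketch of what a project-specific dialect can look like, here is a tiny command registry. Everything in it (*commands*, define-command, run-command, and the greet command) is hypothetical and invented for this example. It is not taken from any real library.

```lisp
;; Hypothetical project-local registry mapping command names to functions.
(defvar *commands* (make-hash-table :test #'equal))

;; Hypothetical project-local macro. It reads like DEFUN, but it stores an
;; anonymous function under a lowercase string key instead of defining a
;; named function.
(defmacro define-command (name (&rest args) &body body)
  `(setf (gethash ,(string-downcase (symbol-name name)) *commands*)
         (lambda ,args ,@body)))

(define-command greet (who)
  (format nil "hello, ~a" who))

(defun run-command (name &rest args)
  "Look up NAME in the registry and call it with ARGS."
  (apply (gethash name *commands*) args))
```

With these definitions, (run-command "greet" "world") returns "hello, world". A model that has never seen define-command has no way to know that greet is not a callable function, or that its name is looked up as a string. That knowledge lives only in this one codebase.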

Practical Adjustments

Lisp developers working today have a few workable approaches. Using the AI for algorithmic design in pseudocode and translating the logic manually preserves the thinking assistance without depending on Lisp-specific syntax knowledge. Asking for documentation generation rather than code generation often works better, since the model can explain concepts it cannot reliably write. Treating the AI as a design sounding board rather than a completion engine sidesteps the accuracy problem entirely.

The productivity gap described in the original post has concrete costs. Lisp’s minority status was already a liability in hiring and tooling. Losing the AI assistance multiplier that mainstream language developers now take for granted adds another layer to that. Whether this changes depends on whether the Lisp community generates significantly more public code, and whether AI tool developers treat minority language support as worth optimizing for specifically. Neither trend is moving fast.