
The Parenthesis Problem: Why AI Code Tools Fail at Lisp

Source: lobsters

Dan Haskin wrote recently about a frustration that resonates with anyone who writes Lisp seriously: AI coding assistants are nearly useless for Lisp development. He’s sad about it. So am I. But the reasons this happens are more layered than “not enough training data,” and tracing them out reveals something uncomfortable about what AI code assistance actually depends on.

The Training Data Gap Is Severe

The most obvious cause is representation in training corpora. GitHub’s data tells a blunt story. The 2024 Octoverse report tracks top languages by repository count and pull request activity; Lisp dialects don’t appear in the top fifteen. Clojure is the most visible of the family at roughly 0.3% of public repositories. Common Lisp and Scheme are smaller still. Python sits above 17%. JavaScript and TypeScript together account for over 30%.

This isn’t just a GitHub problem. Stack Overflow’s annual developer survey, which shapes how training data is weighted and curated, has listed Common Lisp in the “other” bucket for years. The Lisp that exists on the public internet is disproportionately old: comp.lang.lisp archives, pre-GitHub mailing list threads, and documentation for SBCL that predates modern indexing. LLMs trained on web crawls pick up a skewed sample of even that small corpus.

The practical result is that when you ask an AI assistant to help you write idiomatic Common Lisp, it frequently hallucinates. It invents functions that don’t exist in the standard, misuses the LOOP macro’s clause ordering, confuses DEFGENERIC with DEFMETHOD semantics, or suggests Alexandria utility functions with slightly wrong names. Clojure fares better because its ecosystem is younger and more GitHub-native, but even there, models consistently confuse clojure.core behavior with ClojureScript behavior, get core.async channel semantics wrong, and invent macro expansions that don’t compile.
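To make "slightly wrong names" concrete, here is a minimal Python sketch that checks a suggested symbol against a list of known exports. The KNOWN_ALEXANDRIA set is a small hand-picked subset of Alexandria's exported names, not the full list; check_symbol is a hypothetical helper, and hash-keys is an invented example of the kind of near-miss described above.

```python
import difflib

# Assumption: a small hand-picked subset of Alexandria's exported symbols.
KNOWN_ALEXANDRIA = {
    "hash-table-keys", "hash-table-values", "hash-table-alist",
    "alist-hash-table", "when-let", "if-let", "ensure-list",
    "flatten", "curry", "with-gensyms", "once-only",
}


def check_symbol(name, known=KNOWN_ALEXANDRIA):
    """Return (exists, suggestions): whether `name` is known, and close real names if not."""
    if name in known:
        return True, []
    return False, difflib.get_close_matches(name, sorted(known), n=3, cutoff=0.6)


real = check_symbol("hash-table-keys")
fake = check_symbol("hash-keys")  # plausible-looking, but not in the list
print(real)
print(fake)
```

The invented name is only a few characters from a real one, which is why this class of error reads as correct until the code is loaded.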

Tokenization Makes It Worse

The training data gap compounds with a tokenizer-level problem. Modern LLMs use byte-pair encoding (BPE) or similar subword tokenization schemes. These are learned from the training corpus, which means token boundaries reflect the statistical structure of that corpus. A tokenizer trained predominantly on Python, JavaScript, and Rust will have learned rich subword units for those languages: def, async, impl, interface, useState, and so on become single tokens. The structural elements and common forms of Lisp code, by contrast, tend to get fragmented.

In Lisp, structure is syntax. The parentheses in (defun square (x) (* x x)) carry the same grammatical weight that def, the colon, and return carry in Python’s def square(x): return x * x. But a BPE tokenizer tends to split that Lisp form into more tokens with less semantic density per token, because its merges were learned from a corpus with too little Lisp for (defun to become a meaningful unit. More tokens per expression means the model’s context window fills up faster and the contextual signal per token is weaker.
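The effect can be reproduced at toy scale. The Python sketch below trains a character-level BPE on an invented corpus that is almost entirely Python, then tokenizes the two square definitions from above. It is an illustration under stated assumptions, not a model of any production tokenizer: real tokenizers work on bytes, apply pre-tokenization rules, and learn tens of thousands of merges. The corpus, the merge count, and the function names are all made up for this example.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges, always taking the most frequent adjacent pair."""
    sequences = [list(text) for text in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def tokenize(text, merges):
    """Apply the learned merges, in training order, to a new string."""
    seq = list(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Invented corpus: 60 Python lines for every 1 Lisp line.
python_lines = [
    "def square(x): return x * x",
    "def cube(x): return x * x * x",
    "def double(x): return x + x",
]
lisp_line = "(defun square (x) (* x x))"
merges = train_bpe(python_lines * 20 + [lisp_line], num_merges=30)

py_tokens = tokenize("def square(x): return x * x", merges)
lisp_tokens = tokenize(lisp_line, merges)
print(len(py_tokens), py_tokens)
print(len(lisp_tokens), lisp_tokens)
```

Both strings are about the same length in characters, yet the Lisp form comes out as noticeably more tokens, because the only merges it benefits from are the ones it happens to share with the Python lines.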

This is not a fundamental barrier, but it means Lisp code generation requires more of the model’s capacity for less output, relative to well-represented languages. The returns on scale are lower.

The Macro System Is a Genuine Modeling Problem

Beyond data and tokenization, there is a deeper issue: Lisp’s macro system creates a language-within-a-language that changes at the project level. In Common Lisp, LOOP is a macro that introduces an entire sublanguage for iteration. WITH-OPEN-FILE, DEFINE-CONDITION, DEFCLASS, and every library’s WITH-* pattern are all macros that expand into arbitrary code at compile time.

For an LLM to generate correct Lisp using a project-specific macro, it would need to understand what that macro expands to, which requires either seeing the DEFMACRO form in context or having enough training examples of that macro’s usage to infer its behavior statistically. Neither condition is reliably satisfied. The model doesn’t have a Lisp compiler to run macroexpansion; it has pattern matching over token sequences. When the pattern isn’t in the training data, the generation fails silently: syntactically valid but semantically wrong.
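A small Python sketch shows why the call site alone is not enough. S-expressions are modeled as nested lists, and two hypothetical projects define a macro with the same name, with-resource, in different ways. Everything here is an assumption made for illustration: the macro, the acquire, release, and lock operators, and the expander itself, which ignores quoting and special forms that a real Lisp macroexpander has to respect.

```python
def expand(form, macros):
    """Recursively expand macro calls in `form` using the `macros` table."""
    if not isinstance(form, list) or not form:
        return form  # atoms and empty lists expand to themselves
    head = form[0]
    if isinstance(head, str) and head in macros:
        # Call the macro function on the unevaluated arguments, then re-expand.
        return expand(macros[head](*form[1:]), macros)
    return [expand(sub, macros) for sub in form]


def to_sexp(form):
    """Render a nested list back into Lisp-style text."""
    if isinstance(form, list):
        return "(" + " ".join(to_sexp(f) for f in form) + ")"
    return form


# Hypothetical project A: acquire the resource and always release it.
def with_resource_a(name, body):
    return ["let", [[name, ["acquire"]]],
            ["unwind-protect", body, ["release", name]]]


# Hypothetical project B: same surface syntax, but it only takes a lock.
def with_resource_b(name, body):
    return ["progn", ["lock", name], body]


call = ["with-resource", "conn", ["query", "conn"]]
expanded_a = to_sexp(expand(call, {"with-resource": with_resource_a}))
expanded_b = to_sexp(expand(call, {"with-resource": with_resource_b}))
print(to_sexp(call))
print(expanded_a)
print(expanded_b)
```

The call site is identical in both projects; only the definition determines whether conn is bound and cleaned up or merely locked. A model that has not seen the definition can only guess from the name.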

Compare this to Python, where the language doesn’t give library authors the ability to extend syntax in the same way. A library’s API is a set of functions and classes. The LLM can generalize from function-call patterns it has seen across thousands of Python libraries to new ones. Lisp libraries can introduce entirely new syntactic forms, and the LLM has no general mechanism to handle that.

Scheme’s hygienic macros via syntax-rules and syntax-case create the same problem. Racket’s macro system is even more expressive. The more powerful the macro system, the more the language can diverge from what the model has seen, and the worse code generation becomes.

Benchmarks Don’t Cover It

The standard code generation benchmarks reinforce this neglect. HumanEval uses Python exclusively. MultiPL-E, which extended HumanEval to eighteen languages at launch, included Lua, D, Julia, and Perl; Racket was the only Lisp-family entry, with Common Lisp and Clojure absent from that set. SWE-bench, which evaluates repository-level code changes, draws on Python repositories, with JavaScript covered only in a later multimodal variant. When benchmark performance on Lisp isn’t measured, there’s no signal pushing model developers to improve it.

This creates a feedback loop. Models perform poorly on Lisp, so developers don’t use AI assistance for Lisp, so there’s little demand for better Lisp performance, so benchmarks don’t include Lisp, so there’s no measured pressure to improve. The gap stays fixed while every other capability of these models advances.

Emacs Lisp Is the Partial Exception

One Lisp dialect where AI assistance is actually tolerable is Emacs Lisp. There’s a straightforward reason: .el files are everywhere on GitHub. Every Emacs configuration, every plugin, every package manager manifest is Emacs Lisp. The training corpus is comparatively large, the API surface (Emacs’s built-in functions) is well-documented and frequently discussed online, and the use cases are repetitive enough that models have learned useful patterns. Asking an AI to write a simple Emacs hook or advice often produces working code.

This is an accident of distribution, not language properties. Emacs Lisp is not simpler than Common Lisp; it’s less hygienic, has dynamic rather than lexical scope by default (though lexical binding has been available since Emacs 24), and has a decades-old API with significant inconsistency. It works with AI because it’s represented in training data, not because it’s easier to model.

What This Reveals About AI Code Assistance Generally

The Lisp situation exposes an assumption baked into the “AI accelerates all developers” narrative: that the acceleration is uniform. It isn’t. The productivity gains from AI code generation are concentrated in the languages and frameworks that dominate training corpora. Python developers writing web APIs, TypeScript developers building React applications, and Rust developers using standard library idioms are getting a genuine productivity benefit. The tail of the distribution, anything outside the top ten languages on GitHub, is largely untouched.

For Lisp specifically, the irony is pointed. Lisp was designed with metaprogramming as a first-class concern precisely because its authors understood that the right abstraction for a problem might not exist in the language yet. The macro system exists to let you build the language you need. That same property, extensibility at the syntactic level, makes it harder for statistical pattern matching over token sequences to generalize correctly. The language’s strength is part of what makes it resistant to this particular form of assistance.

Dan’s sadness is reasonable. Lisp development involves enough friction already: smaller ecosystems, fewer libraries, less tooling investment, fewer job postings. Watching every other language get a productivity multiplier from AI assistants while Lisp stays manual is genuinely dispiriting. The causes are structural, not temporary, and none of the obvious levers (more training data, better benchmarks, improved tokenization) are things individual developers can pull. The gap is likely to persist for as long as language representation in training data follows GitHub’s current distribution.

That doesn’t mean Lisp development is getting worse in absolute terms. SBCL’s compiler is excellent, SLIME and Sly are mature, and the Common Lisp ecosystem, though small, is more stable than most. The loss is relative: the rising floor for everyone else means the comparative cost of working in Lisp keeps increasing, even if nothing about Lisp itself changes.
