
The Linux Kernel's AI Disclosure Rule Is About Legal Risk, Not Code Quality

Source: lobsters

The Linux kernel project has formalized its stance on AI-assisted contributions in a new process document, coding-assistants.rst, added to the Documentation/process/ tree alongside foundational contributor guidelines such as submitting-patches.rst and coding-style.rst. The placement is deliberate. This is not a blog post from a maintainer, not an LKML reply, not an informal policy wiki entry. It is canonical process documentation, sitting in the same tree as the code itself.

The policy does not ban AI-generated contributions. That framing, while common in press coverage, misses what the document actually does. It requires contributors to disclose when they used AI coding assistants to generate or assist with submitted code, and it explicitly preserves maintainer discretion to reject such contributions based on quality or legal grounds. The distinction matters because the policy’s teeth are not in an outright prohibition; they are in the interaction between that disclosure requirement and the kernel’s existing legal accountability mechanism: the Developer Certificate of Origin.

What the DCO Actually Requires

Every patch submitted to the Linux kernel must include a Signed-off-by line. This is not just a formatting convention. The Developer Certificate of Origin is a legally meaningful assertion by the contributor that they have the right to submit the code under the kernel’s license, that the code is their original work or they have received permission to submit it, and that they understand it will become part of the public record under GPLv2.
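To make the mechanics concrete, here is a minimal Python sketch that pulls the sign-offs out of a commit message. The function name find_signoffs and the regular expression are assumptions made for this illustration, not kernel tooling; the kernel's own scripts/checkpatch.pl is what actually flags a missing sign-off.

```python
import re
from typing import List, Tuple

# Matches a trailer such as "Signed-off-by: Jane Doe <jane@example.org>".
# The pattern is a simplification chosen for this sketch.
SIGNOFF_RE = re.compile(
    r"^Signed-off-by: (?P<name>[^<>\n]+?) <(?P<email>[^<>\s]+@[^<>\s]+)>$",
    re.MULTILINE,
)


def find_signoffs(commit_message: str) -> List[Tuple[str, str]]:
    """Return (name, email) pairs for every Signed-off-by line, in order."""
    return [
        (m.group("name"), m.group("email"))
        for m in SIGNOFF_RE.finditer(commit_message)
    ]


if __name__ == "__main__":
    # Hypothetical commit message: author sign-off first, maintainer second.
    message = (
        "example: fix error path in probe\n"
        "\n"
        "Release the buffer when registration fails.\n"
        "\n"
        "Signed-off-by: Jane Doe <jane@example.org>\n"
        "Signed-off-by: Sam Maintainer <sam@example.org>\n"
    )
    for name, email in find_signoffs(message):
        print(name, email)
```

The order of the returned pairs mirrors the order of the trailers, which is how a patch records each person who handled it on the way up the tree.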

The DCO process is how the kernel establishes a chain of legal provenance for every line of code. Maintainers sign off as patches move up the tree. Linus Torvalds’ final merge is itself a form of sign-off. The whole system depends on each person in the chain being able to truthfully assert what the DCO says.

When a contributor uses an AI coding assistant and the output involves non-trivial code generation, the DCO assertion becomes complicated. Copilot, ChatGPT, and similar tools are trained on large corpora that include GPL-licensed code. The training data for these models is not fully disclosed by the companies that built them. A contributor submitting AI-generated code cannot fully verify whether that code is substantially derived from something in the training set, and if so, under what license that training source was available.

This is not a hypothetical concern. The class action lawsuit Doe v. GitHub, filed in 2022 against GitHub, Microsoft, and OpenAI, specifically alleged that Copilot reproduced open-source-licensed code, GPL code included, without attribution or license compliance. That litigation is ongoing, and the unresolved legal questions it raises are exactly what the kernel’s new policy is designed to address. When a contributor’s Signed-off-by covers code whose lineage they cannot trace, the integrity of the DCO chain is compromised.

The Quality Problem Is Secondary

Greg Kroah-Hartman became the public face of kernel maintainer frustration with AI-generated contributions when he publicly rejected a batch of trivially generated patches submitted to the staging tree in early 2023. His replies were direct: the patches were low-quality, clearly not reviewed by the submitter, and were wasting maintainer time. That episode was widely covered and became the entry point for most discussions of this topic.

But the quality complaint, while valid, is actually the lesser concern. Kernel maintainers are accustomed to rejecting bad patches. The review process exists precisely to catch errors. A maintainer can look at a patch, find a locking bug or an incorrect memory barrier, and send back a detailed rejection. The quality problem is manageable with existing tools.

The license contamination problem is not manageable with existing tools. There is no review process that can reliably identify whether a function generated by a language model is a derivative work of some GPL-licensed code in its training set. The model does not tell you. The contributor does not know. The maintainer cannot determine it from inspection. The only mechanism available is disclosure, which at least creates a record, and maintainer discretion to reject contributions where the risk is unacceptable.

This is why the kernel’s policy is structured around disclosure rather than prohibition. A blanket ban on AI tools would be unenforceable and probably counterproductive. Contributors would simply not disclose. A disclosure requirement, backed by the legal weight of the DCO and the community norm that lying in a Signed-off-by has consequences, at least puts contributors on record.

Why the Kernel Had to Formalize This First

Most major open source projects have not yet adopted formal AI contribution policies. CPython has discussed it without reaching a formal written policy. The Rust compiler project, LLVM, PostgreSQL, and others have seen informal discussions but nothing in their official process documentation. The Linux kernel moved first, and the reasons are specific to its situation.

First, the kernel is GPLv2-only. Not GPLv2-or-later, not LGPL, not dual-licensed. The license choice was deliberate and Torvalds has been clear about it for decades. The kernel’s license is a load-bearing part of its governance structure, and any contamination risk is treated with proportional seriousness.

Second, the kernel’s scale of contribution means the risk is proportionally larger. Thousands of patches are submitted across hundreds of subsystems. Even a low percentage of license-problematic AI-generated patches would represent a real exposure at that volume.

Third, the kernel’s maintainer hierarchy is mature and has the organizational capacity to formalize policy. Smaller projects cannot afford the overhead. Younger projects do not yet have the institutional structure. The kernel has both, and it has Jonathan Corbet, who maintains the Documentation/ tree and has spent years ensuring that process documentation is actually up to date.

Fourth, the stakes of getting it wrong are higher. The Linux kernel is the foundation of an enormous fraction of production infrastructure. Companies with significant legal exposure depend on the kernel’s license hygiene. Those companies have lawyers who pay attention to this, and the kernel community knows it.

What Good AI-Assisted Kernel Contribution Looks Like

The policy does not say contributors cannot use AI tools. It says they must disclose when they do, must review and understand every line they submit, and must be able to stand behind the DCO assertion. In practice, this means AI assistance for low-level mechanical tasks, such as generating boilerplate, reformatting code to match kernel style, or getting a first pass at a straightforward implementation, can be legitimate if the contributor has genuinely reviewed and understood the output.
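A contributor could turn those obligations into a small pre-submit check. The sketch below is hypothetical: the helper name presubmit_problems is invented for this example, and the "Assisted-by:" trailer is an assumed disclosure format, so check coding-assistants.rst for the wording the kernel actually expects. No script can confirm that the contributor reviewed and understood the code; this only catches missing paperwork.

```python
import re
from typing import List

# A human sign-off such as "Signed-off-by: Jane Doe <jane@example.org>".
SIGNOFF_RE = re.compile(r"^Signed-off-by: .+ <.+@.+>$", re.MULTILINE)

# ASSUMPTION: AI use is disclosed with a commit trailer of this name.
# The trailer name is illustrative, not taken from the article.
DISCLOSURE_RE = re.compile(r"^Assisted-by: \S.*$", re.MULTILINE)


def presubmit_problems(commit_message: str, used_ai_tool: bool) -> List[str]:
    """List the paperwork problems found in a commit message.

    `used_ai_tool` is the contributor's own honest answer; the script
    cannot detect AI use, it only checks that the answer was recorded.
    """
    problems = []
    if not SIGNOFF_RE.search(commit_message):
        problems.append("missing Signed-off-by line")
    if used_ai_tool and not DISCLOSURE_RE.search(commit_message):
        problems.append("AI tool used but no disclosure trailer")
    return problems


if __name__ == "__main__":
    # Hypothetical message and tool name, for illustration only.
    message = (
        "example: add seq_file show handler\n"
        "\n"
        "Assisted-by: ExampleAssistant\n"
        "Signed-off-by: Jane Doe <jane@example.org>\n"
    )
    print(presubmit_problems(message, used_ai_tool=True))
```

Making the contributor pass used_ai_tool explicitly is a deliberate choice in this sketch: it keeps responsibility for the disclosure with the person who signs off.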

The kernel’s coding style is strict and its conventions are non-obvious. An AI tool that can generate a first draft of a seq_file implementation that gets the locking right, formats the output correctly, and handles error paths properly would be genuinely useful for a contributor who then reviews it carefully. That use case is not what the policy targets.

What the policy targets is the pattern Kroah-Hartman identified: contributors using AI to generate patches they have not understood, then submitting them either to inflate contribution counts or simply because the barrier to generating plausible-looking patches is now near zero. The disclosure requirement makes that pattern risky. A contributor who discloses AI use and submits something they obviously have not reviewed is in a worse position than one who submits a bad patch without AI involvement, because the disclosure makes clear they are not taking responsibility for the work.

The Broader Open Source Implication

The Software Freedom Conservancy and the Free Software Foundation have both raised concerns about AI-generated code in GPL-licensed projects. The FSF’s position is that current AI coding tools cannot reliably produce code suitable for contribution to GPL projects, precisely because of the training data provenance problem. That position may be too conservative for practical purposes, but the underlying concern is sound.

Open source infrastructure was built on legal clarity. The GPL is enforceable because there is a clear chain of authorship and licensing for every line of code in a covered project. The kernel’s DCO mechanism and the Contributor License Agreements (CLAs) used by the Apache Software Foundation and many other projects all exist to maintain that clarity. AI-generated code inserts opacity into that chain.

The kernel’s policy is one project’s answer to a question that the entire open source ecosystem is going to have to answer. The answer it gives, transparency over prohibition, with legal accountability for what contributors sign off on, is a reasonable starting point. Other projects will likely converge on something similar, because the alternative, ignoring the problem until a lawsuit forces the issue, has obvious downsides.

For contributors, the practical guidance is simple: if you use AI tools, say so, read what they produced, and only sign off on code you actually understand and can stand behind. That standard was always implied by the DCO. The kernel has now made it explicit.
