When Vibe Coding Meets OpenBSD's ext4 Ambitions

The phrase “vibe coding” entered the developer lexicon in early 2025 courtesy of Andrej Karpathy, who used it to describe a workflow where you describe what you want to an LLM, accept the output without reading it carefully, and iterate by describing what broke. It is a reasonable approach for glue scripts, throwaway prototypes, and CRUD endpoints. It is a genuinely interesting stress test when applied to kernel filesystem code and then submitted to OpenBSD’s maintainers for review.

That is what this LWN article documents: an attempt to add ext4 read support to OpenBSD using LLM-assisted development, and then to put that code in front of a community that famously rewrites submissions on principle and considers “clever” a pejorative.

What ext4 Actually Requires

OpenBSD already has ext2fs support in sys/ufs/ext2fs/, which handles read and write access to ext2-formatted volumes and tolerates ext3 by ignoring the journal. That baseline matters because ext4 is not one self-contained addition on top of ext3, the way the journal was on top of ext2. It is an accumulation of feature flags, and the disruptive ones are bits in the superblock’s s_feature_incompat field, which a driver must understand before it may mount the volume at all.

The one that breaks everything is EXT4_FEATURE_INCOMPAT_EXTENTS (bit 6, value 0x0040). When this flag is set, files with the EXT4_EXTENTS_FL inode flag store their block addresses not through the traditional indirect block tree but through an extent B-tree. The root of that tree lives in the 60-byte i_block field of the inode, which ext2/ext3 use for direct, indirect, double-indirect, and triple-indirect block pointers. An ext4 extent tree starts with an ext4_extent_header (magic 0xF30A, entry count, max entries, depth), followed by either ext4_extent leaf nodes or ext4_extent_idx internal nodes pointing deeper into the tree.
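The header checks this layout implies can be sketched in a few lines of C. This is a minimal illustration, not OpenBSD or Linux code: the names ext_header and ext_header_parse are hypothetical, the depth cap of 5 is an assumption borrowed from the Linux driver, and the field offsets follow the published ext4 on-disk layout (a 12-byte header followed by 12-byte entries).

```c
#include <stddef.h>
#include <stdint.h>

#define EXT4_EXT_MAGIC 0xF30A /* eh_magic, stored little-endian on disk */
#define EXT_MAX_DEPTH  5      /* assumed cap on eh_depth for this sketch */
#define EXT_ENTRY_SIZE 12     /* the header and every entry are 12 bytes */

/* Host-order copy of the on-disk extent header (hypothetical name). */
struct ext_header {
    uint16_t magic;      /* eh_magic */
    uint16_t entries;    /* eh_entries: valid entries after the header */
    uint16_t max;        /* eh_max: capacity of this node */
    uint16_t depth;      /* eh_depth: 0 means the entries are leaf extents */
    uint32_t generation; /* eh_generation */
};

/* Decode little-endian fields byte by byte, so host endianness and
 * alignment do not matter. */
static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
        ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Parse and validate the header at the start of a tree node.
 * node_size is 60 for the root in i_block, or the filesystem block
 * size for any other node. Returns 0 on success, -1 if the node
 * cannot be trusted.
 */
int ext_header_parse(const uint8_t *node, size_t node_size,
    struct ext_header *out)
{
    size_t capacity;

    if (node_size < EXT_ENTRY_SIZE)
        return -1;

    out->magic = le16(node);
    out->entries = le16(node + 2);
    out->max = le16(node + 4);
    out->depth = le16(node + 6);
    out->generation = le32(node + 8);

    /* Entries that physically fit after the header. */
    capacity = node_size / EXT_ENTRY_SIZE - 1;

    if (out->magic != EXT4_EXT_MAGIC || out->depth > EXT_MAX_DEPTH)
        return -1;
    if (out->max > capacity || out->entries > out->max)
        return -1;
    return 0;
}
```

For the 60-byte root the capacity works out to four entries, so a header claiming five is rejected before any entry is read.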

Each leaf extent maps a contiguous logical range of file blocks to a contiguous physical range on disk, using a 48-bit physical block number split across two fields. A single extent can describe thousands of blocks that the indirect scheme would list one pointer at a time, which shrinks the mapping metadata for large files considerably. It also means that code written against the ext2 block addressing model will silently misread every extent-mapped file on a modern ext4 volume, because modern mkfs.ext4 enables extents by default.
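A small sketch of that mapping, assuming the leaf fields have already been decoded into host byte order. The struct and function names here are hypothetical. The 32768 threshold for unwritten extents follows the ext4 on-disk format documentation: a raw length above it marks an extent that is allocated but must be read back as zeros.

```c
#include <stdint.h>

#define EXT_INIT_MAX_LEN 32768u /* raw lengths above this mark an unwritten extent */

/* Host-order copy of one leaf extent (hypothetical name). */
struct ext_leaf {
    uint32_t first_lblock; /* ee_block: first logical block covered */
    uint16_t raw_len;      /* ee_len: block count, plus the unwritten marker */
    uint16_t start_hi;     /* ee_start_hi: upper 16 bits of the physical block */
    uint32_t start_lo;     /* ee_start_lo: lower 32 bits of the physical block */
};

/* Combine the two on-disk fields into one 48-bit physical block number.
 * The cast must happen before the shift, or the high bits are lost. */
uint64_t ext_leaf_start(const struct ext_leaf *e)
{
    return ((uint64_t)e->start_hi << 32) | e->start_lo;
}

/* Nonzero if the extent is allocated but unwritten (reads as zeros). */
int ext_leaf_unwritten(const struct ext_leaf *e)
{
    return e->raw_len > EXT_INIT_MAX_LEN;
}

/* Number of blocks covered, with the unwritten marker stripped. */
uint32_t ext_leaf_len(const struct ext_leaf *e)
{
    return ext_leaf_unwritten(e) ?
        (uint32_t)e->raw_len - EXT_INIT_MAX_LEN : e->raw_len;
}

/*
 * Translate a logical block to a physical block within one extent.
 * Returns 0 and stores the result, or -1 if lblock is outside the extent.
 * Callers should check ext_leaf_unwritten() before reading the block.
 */
int ext_leaf_map(const struct ext_leaf *e, uint32_t lblock, uint64_t *pblock)
{
    if (lblock < e->first_lblock ||
        lblock - e->first_lblock >= ext_leaf_len(e))
        return -1;
    *pblock = ext_leaf_start(e) + (lblock - e->first_lblock);
    return 0;
}
```

The range check subtracts only after confirming lblock is not below the extent's first block, so the unsigned arithmetic cannot wrap.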

Beyond extents, a correct read-only ext4 implementation needs to handle:

Journal recovery: If EXT4_FEATURE_INCOMPAT_RECOVER is set in the superblock, the filesystem was not cleanly unmounted. Mounting without replaying the journal means reading potentially stale or inconsistent metadata. A read-only driver can skip replay, but it should at minimum refuse to mount a volume with this flag set unless the user asks for an explicit override.
Metadata checksums: EXT4_FEATURE_RO_COMPAT_METADATA_CSUM (value 0x0400) indicates that block group descriptors, inode tables, and other structures carry CRC32c checksums. Silently ignoring these means passing corrupt data upward without detection.
64-bit block numbers: EXT4_FEATURE_INCOMPAT_64BIT extends block group descriptors from 32 to 64 bytes to support block counts beyond 2^32. The s_desc_size field controls which size to use.
Flexible block groups: EXT4_FEATURE_INCOMPAT_FLEX_BG allows the metadata for multiple block groups to be clustered together rather than distributed uniformly across the disk. Code that assumes a fixed layout between the superblock copy and the block group descriptor table will compute wrong offsets.
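The mount-time gate these requirements imply might look like the following sketch. The flag values are the ones defined in the Linux ext4 headers. The function names, the return codes, and the choice of which flags count as supported are illustrative assumptions, and the descriptor size bounds (a power of two between 64 and 1024) mirror what the Linux driver enforces as far as I am aware.

```c
#include <stdint.h>

/* s_feature_incompat bits, values as in the Linux ext4 headers. */
#define EXT4_FEATURE_INCOMPAT_FILETYPE 0x0002
#define EXT4_FEATURE_INCOMPAT_RECOVER  0x0004
#define EXT4_FEATURE_INCOMPAT_EXTENTS  0x0040
#define EXT4_FEATURE_INCOMPAT_64BIT    0x0080
#define EXT4_FEATURE_INCOMPAT_FLEX_BG  0x0200

/* Assumption: the set this hypothetical read-only driver implements. */
#define SUPPORTED_INCOMPAT \
    (EXT4_FEATURE_INCOMPAT_FILETYPE | EXT4_FEATURE_INCOMPAT_EXTENTS | \
     EXT4_FEATURE_INCOMPAT_64BIT | EXT4_FEATURE_INCOMPAT_FLEX_BG)

enum { MOUNT_OK = 0, MOUNT_UNSUPPORTED = -1, MOUNT_NEEDS_RECOVERY = -2 };

/*
 * Decide whether a volume may be mounted read-only.
 * Any incompat bit the driver does not know must fail the mount;
 * an unreplayed journal fails it unless the caller forces the mount.
 */
int check_incompat(uint32_t incompat, int force)
{
    uint32_t known = SUPPORTED_INCOMPAT | EXT4_FEATURE_INCOMPAT_RECOVER;

    if (incompat & ~known)
        return MOUNT_UNSUPPORTED;
    if ((incompat & EXT4_FEATURE_INCOMPAT_RECOVER) && !force)
        return MOUNT_NEEDS_RECOVERY;
    return MOUNT_OK;
}

/*
 * Size of one block group descriptor, or -1 if s_desc_size is invalid.
 * Without the 64BIT feature the size is fixed at 32 bytes and
 * s_desc_size is ignored.
 */
int group_desc_size(uint32_t incompat, uint16_t s_desc_size)
{
    if (!(incompat & EXT4_FEATURE_INCOMPAT_64BIT))
        return 32;
    if (s_desc_size < 64 || s_desc_size > 1024 ||
        (s_desc_size & (s_desc_size - 1)) != 0)
        return -1;
    return s_desc_size;
}
```

The important design choice is the first test in check_incompat: it rejects everything outside an allow-list rather than testing for specific flags it knows to be unsupported, so a feature added to ext4 later fails closed.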

None of these are obscure corner cases. Almost any ext4 volume created with default options on a recent Linux system has extents and flexible block groups enabled, and volumes made with newer releases of e2fsprogs have metadata checksums as well. An implementation that does not handle these correctly is not a partial implementation. It is an incorrect one that will silently serve wrong data.

Why LLMs Struggle With This

Large language models are good at generating plausible code for well-documented problem domains. ext4 is well-documented: the kernel source, the ext4 wiki, and e2fsprogs provide thorough references. An LLM trained on this material can produce structs that look right, traverse extent trees in a loop that looks right, and check feature flags in conditionals that look right.

The gaps tend to appear at the seams: error paths, integer widths, bounds checking on tree depth, what to do when a checksum fails, how to handle a feature flag that is not yet implemented. These are the parts of filesystem code where bugs cause data corruption or kernel panics rather than wrong output, and they are precisely the parts where “plausible-looking” is not the same as “correct.”
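Integer widths are a good example of such a seam. The sketch below contrasts a plausible-looking byte offset calculation with a checked one. Both function names are hypothetical and neither is taken from the submission under discussion.

```c
#include <stdint.h>

/*
 * Plausible but wrong: the block number is held in 32 bits and the
 * multiplication is done in 32 bits, so the result wraps on any
 * volume larger than 4 GiB.
 */
uint32_t block_offset_narrow(uint32_t pblock, uint32_t block_size)
{
    return pblock * block_size;
}

/*
 * Checked version: 64-bit block number, 64-bit result, and an explicit
 * test that the multiplication cannot overflow.
 * Returns 0 on success, -1 if the offset is not representable.
 */
int block_offset(uint64_t pblock, uint32_t block_size, uint64_t *offset)
{
    if (block_size == 0 || pblock > UINT64_MAX / block_size)
        return -1;
    *offset = pblock * block_size;
    return 0;
}
```

With 4096-byte blocks, block 0x200000 sits 8 GiB into the volume. The narrow version returns offset 0 for it, which is exactly the kind of result that passes a test on a small image and reads the wrong data on a real disk.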

Consider extent tree traversal. The specification allows trees up to 5 levels deep. A correct implementation needs a depth limit check to prevent infinite loops or stack overflows on corrupt filesystems. It needs bounds checking on the entry count in each header against the space available in the block. It needs to verify the magic number at each level. These requirements are documented, but they are easy for an LLM to omit because the happy-path code is longer and more interesting, and the edge-case code does not appear in most code examples online.
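Put together, a defensive lookup could be sketched as follows. Everything here is illustrative: read_block_fn and mem_disk are hypothetical stand-ins for the kernel's buffer cache, the linear scan assumes entries are sorted by logical block as the format requires, and the depth cap of 5 is the limit mentioned above. Holes and unwritten extents are reported separately from corruption, because the former are legitimate and read as zeros.

```c
#include <stddef.h>
#include <stdint.h>

#define EXT_MAGIC     0xF30A
#define EXT_MAX_DEPTH 5      /* assumed cap on tree depth */
#define EXT_ENTRY     12     /* size of the header and of every entry */
#define EXT_INIT_MAX  32768u /* ee_len above this marks an unwritten extent */

/* Hypothetical block reader: returns one block, or NULL on I/O error. */
typedef const uint8_t *(*read_block_fn)(void *disk, uint64_t pblock);

/* Hypothetical in-memory disk, standing in for the buffer cache. */
struct mem_disk { const uint8_t *data; size_t block_size; uint64_t nblocks; };

const uint8_t *mem_read_block(void *arg, uint64_t pblock)
{
    const struct mem_disk *d = arg;
    return pblock < d->nblocks ? d->data + pblock * d->block_size : NULL;
}

static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
        ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Validate a node header; return its entry count, or -1 if corrupt. */
static int node_entries(const uint8_t *n, size_t size, unsigned want_depth)
{
    unsigned entries = le16(n + 2), max = le16(n + 4);

    if (le16(n) != EXT_MAGIC || le16(n + 6) != want_depth)
        return -1;
    if (max > size / EXT_ENTRY - 1 || entries > max)
        return -1;
    return (int)entries;
}

/* Returns 0 (mapped), 1 (hole or unwritten: reads as zeros), -1 (error). */
int ext_lookup(const uint8_t *root, size_t root_size, size_t block_size,
    read_block_fn read_block, void *disk, uint32_t lblock, uint64_t *pblock)
{
    const uint8_t *node = root, *hit;
    size_t size = root_size;
    unsigned depth;
    uint32_t first, len;
    int i, n;

    if (root_size < EXT_ENTRY || block_size < EXT_ENTRY)
        return -1;
    depth = le16(root + 6);
    if (depth > EXT_MAX_DEPTH)
        return -1;

    for (;;) {
        /* Each child must declare exactly one level less than its
         * parent, so a cycle in a corrupt image cannot loop forever. */
        if ((n = node_entries(node, size, depth)) < 0)
            return -1;
        hit = NULL;
        for (i = 0; i < n; i++) {
            const uint8_t *e = node + EXT_ENTRY * (i + 1);
            if (le32(e) <= lblock)
                hit = e; /* last entry starting at or before lblock */
        }
        if (hit == NULL)
            return 1;
        if (depth == 0)
            break;
        /* Index entry: ei_leaf_lo at offset 4, ei_leaf_hi at offset 8. */
        node = read_block(disk,
            ((uint64_t)le16(hit + 8) << 32) | le32(hit + 4));
        if (node == NULL)
            return -1;
        size = block_size;
        depth--;
    }

    /* Leaf entry: ee_block, ee_len, ee_start_hi, ee_start_lo. */
    first = le32(hit);
    len = le16(hit + 4);
    if (len > EXT_INIT_MAX || lblock - first >= len)
        return 1;
    *pblock = (((uint64_t)le16(hit + 6) << 32) | le32(hit + 8)) +
        (lblock - first);
    return 0;
}
```

Note how little of this is the lookup itself. Most lines exist to reject input, which is the proportion OpenBSD-style review expects and the proportion that happy-path examples rarely show.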

OpenBSD’s reviewers look for exactly this. The project’s history of security audits and its coding style guide both reflect a culture of treating error paths as first-class code. A missing bounds check or a panic() where a graceful error return belongs will not survive review there.

The Useful Part of the Experiment

This is not an argument that LLM-assisted development is useless for systems code. The more interesting reading is that vibe coding a filesystem driver and then submitting it to rigorous review is a reasonable way to generate a high-coverage first draft that human reviewers can then work through systematically.

The OpenBSD ext2fs code that would serve as the base for ext4 is around 2,000 lines. A correct ext4 extension adding extent support, 64-bit addressing, flexible block groups, and checksum verification would add several hundred lines of careful, structured code. Generating a first pass that gets the struct layouts right, the basic traversal logic right, and the feature flag checks in the right places is genuinely useful even if the error paths need rewriting and the bounds checks need adding. The alternative is a developer spending days writing the same structs from the kernel source by hand.

The question is whether the person doing the submission understood the difference between “this compiles and mounts a test image” and “this is correct and safe.” Filesystem code in a kernel has to serve two cases: the common case, which is straightforward reads from well-formed volumes, and the adversarial case, which is a maliciously crafted image or a disk that partially failed. A driver that panics the kernel or misreads data when given a corrupt extent tree entry has a security vulnerability, not just a bug.

OpenBSD as the Right Stress Test

The choice of OpenBSD as the target is what makes this experiment worth paying attention to. A submission to a project with lighter review standards would either get merged with its problems intact, or get quietly ignored. OpenBSD’s review process will find the problems and say so in public, on the mailing list, in the kind of specific technical language that makes clear exactly what the LLM got wrong and why it matters.

That feedback is useful independent of whether the patch eventually lands. It documents the gap between “code that handles the documented happy path” and “code that handles the full problem space,” and it does so in the most concrete possible terms: here is the specific invariant this code violates, here is the specific input that would trigger the bug, here is what the correct code looks like.

For anyone evaluating how much to trust LLM-generated systems code, that gap documentation is more informative than any benchmark. Vibe coding produces a first draft quickly. The cost is that the first draft needs the kind of review that is rare outside projects like OpenBSD, and the review needs to be done by someone who understands what they are looking at.

OpenBSD may or may not end up with ext4 read support from this submission. The more durable outcome is a concrete, public record of where the automated generation process produced code that looked right but was not. That is a useful data point for anyone who wants to use LLMs to write low-level code without pretending the output needs less scrutiny than it does.