The Architecture That Makes PostgreSQL a Kernel Regression Canary

Source: hackernews

An AWS engineer posted to LKML last week reporting that PostgreSQL throughput roughly halves on Linux 7.0 compared to the 6.x series. The kernel developers who responded were careful not to dismiss the report, but they were also honest: finding a clean fix may not be straightforward. That combination, a severe regression and a murky fix path, is worth understanding in detail.

The short version is that PostgreSQL applies unusual and simultaneous pressure to several Linux kernel subsystems that most workloads touch only lightly or one at a time. Each kernel cycle that rewrites memory management, the scheduler, or futex handling introduces risk specifically for database-shaped workloads, and that risk is not always caught before release.

Why PostgreSQL Keeps Getting Hit

PostgreSQL’s architecture was designed for correctness and portability, not for friendliness to a specific kernel’s internal abstractions. Three aspects of that architecture create concentrated exposure to kernel changes.

The first is the shared buffer pool. PostgreSQL allocates a large shared anonymous mmap segment for shared_buffers, often 32 GB or more in production. Every backend process maps that same segment into its own address space. On a server running 200 connections, you have 200 separate mm_struct instances, each with its own page tables, all pointing at the same physical pages. Any kernel operation that must synchronize across address spaces (TLB shootdowns, VMA splits, page reclaim, NUMA rebalancing) scales with the number of processes mapping that segment rather than with the amount of memory in use.
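To get a feel for why the process count matters, here is a rough back-of-the-envelope sketch in Python. It assumes x86-64 with 8-byte page table entries and counts only leaf entries. It ignores the upper page table levels and the fact that a backend only populates entries for pages it has actually touched, so treat the result as a worst-case bound. The function name and the 200-backend figure are illustrative.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def leaf_page_table_bytes(segment_bytes: int, page_bytes: int, pte_bytes: int = 8) -> int:
    """Worst-case leaf page-table size for ONE process mapping the whole segment.

    Assumes 8-byte entries (x86-64) and ignores upper page-table levels.
    """
    entries = -(-segment_bytes // page_bytes)  # ceiling division
    return entries * pte_bytes


shared_buffers = 32 * GIB
backends = 200  # illustrative connection count from the text

per_proc_4k = leaf_page_table_bytes(shared_buffers, 4 * KIB)
per_proc_2m = leaf_page_table_bytes(shared_buffers, 2 * MIB)

print(f"4 KiB pages: {per_proc_4k / MIB:.0f} MiB per backend, "
      f"{per_proc_4k * backends / GIB:.1f} GiB across {backends} backends")
print(f"2 MiB pages: {per_proc_2m / KIB:.0f} KiB per backend, "
      f"{per_proc_2m * backends / MIB:.0f} MiB across {backends} backends")
```

The memory is shared, but the bookkeeping that describes it is not, which is why per-process costs multiply with the connection count.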

The second is the process-per-connection model. PostgreSQL does not use threads. Each client connection is a separate OS process. At high connection counts, the scheduler runs hundreds of tasks that work in short bursts and block frequently, each touching the shared segment. They wake up, perform a few microseconds of CPU work, access shared memory, and block on a lock or I/O. This pattern stresses context-switching overhead and wake-up latency in ways that a long-running compute job rarely does.
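The wake, work briefly, block again cycle can be approximated with two processes bouncing a byte over a pair of pipes. The sketch below is a crude stand-in, not a PostgreSQL benchmark: the function name and round-trip count are made up for illustration, it needs a Unix-like system with os.fork, and the number it prints includes Python interpreter overhead. It is mostly useful for comparing the same machine across two kernels.

```python
import os
import time


def pipe_ping_pong(round_trips: int = 2000) -> float:
    """Return the mean round-trip time in microseconds between two processes.

    Each round trip forces both processes to block and be woken once,
    a rough analogue of a backend blocking on a lock or on I/O.
    """
    parent_r, child_w = os.pipe()   # child -> parent
    child_r, parent_w = os.pipe()   # parent -> child

    pid = os.fork()
    if pid == 0:
        # Child: echo every byte straight back, then exit without cleanup.
        os.close(parent_r)
        os.close(parent_w)
        for _ in range(round_trips):
            os.write(child_w, os.read(child_r, 1))
        os._exit(0)

    # Parent: close the child's ends, then time the exchange.
    os.close(child_r)
    os.close(child_w)
    start = time.perf_counter()
    for _ in range(round_trips):
        os.write(parent_w, b"x")
        os.read(parent_r, 1)
    elapsed = time.perf_counter() - start

    os.close(parent_r)
    os.close(parent_w)
    os.waitpid(pid, 0)
    return elapsed / round_trips * 1e6


if __name__ == "__main__":
    print(f"mean round trip: {pipe_ping_pong():.1f} us")
```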

The third is lock coordination through futex. PostgreSQL’s lock manager, buffer manager, and WAL writer put waiting backends to sleep on POSIX semaphores, which glibc implements on Linux with the futex system call. Uncontended acquisitions stay in user space, but under heavy load contended waits and wake-ups go through the kernel’s futex path constantly. Any change to the futex implementation that adds latency or contention on that path compounds across every contended lock acquisition in the system.

These three properties together mean that a pgbench run is effectively a stress test of the scheduler, the VM subsystem, and the futex path, all running simultaneously, all at scale. That is why PostgreSQL finds regressions that most other benchmarks miss.

What Changed in Linux 7.0

No single confirmed culprit has been identified yet, but the list of plausible candidates from the 6.x to 7.0 window is long.

EEVDF replacing CFS. Linux 6.6 introduced the Earliest Eligible Virtual Deadline First scheduler, replacing the Completely Fair Scheduler that had been the default since 2007. EEVDF has different characteristics for latency-sensitive, frequently-blocking workloads. PostgreSQL backends are exactly that: short CPU bursts interleaved with waits on locks and I/O. EEVDF’s virtual deadline model changes wake-up ordering and timeslice allocation in ways that cascade across hundreds of concurrent processes. Database throughput regressions were reported when 6.6 shipped, and the sched_ext extensible scheduler framework, which was merged in 6.12, may interact with those changes in ways that compound the problem on specific AWS instance topologies.

Per-VMA locking. Starting with Linux 6.4, the page-fault path progressively moved off the per-address-space mmap_lock read-write semaphore and onto finer-grained locks at the individual VMA level. The old mmap_lock was a known scalability bottleneck, particularly on multi-core systems where many threads sharing one address space took page faults simultaneously. Per-VMA locking should help. But the interaction of this new locking model with PostgreSQL’s many-processes-to-one-segment topology is complex, and an incorrect assumption in the implementation could introduce serialization where parallelism was expected.

Folio conversion. The kernel has been migrating from struct page to struct folio across the memory subsystem for several releases. This is largely correct and beneficial work, but incomplete conversions or edge-case bugs in the folio path for large anonymous shared mappings could produce exactly the kind of per-operation overhead that becomes visible at PostgreSQL’s access rates.

Multi-size transparent huge pages. Linux 6.8 through 6.11 expanded transparent huge page support beyond the traditional 2 MB granularity. THP collapse and split behavior under large anonymous mappings, exactly the shared_buffers segment, can introduce latency spikes that appear as throughput variability in pgbench results, especially when the kernel’s THP heuristics disagree with PostgreSQL’s access patterns.

Any one of these could be the proximate cause. The harder scenario, and the one kernel developers seem to be worried about, is that the regression emerges from the interaction of several of them.

Why the Fix Is Not Easy

Kernel regressions that trace to a single bad commit are fixable. You bisect, identify the commit, and either revert it or write a targeted patch that corrects the specific behavior. That process, while tedious, is well understood.

This regression may not work that way. Each of the candidate changes above is individually correct and ships with legitimate benchmark improvements for other workloads. Reverting EEVDF to fix PostgreSQL would harm the latency profile of other applications. Reverting per-VMA locking would reintroduce the mmap_lock scalability problem it was designed to solve. The challenge is finding a configuration of the new behavior that works well for database-shaped workloads without degrading what the original change was designed to improve.

There is also the bisect problem. Running a meaningful pgbench benchmark requires booting a custom kernel on suitable hardware, warming the buffer pool, running under load for long enough to average out noise, and comparing results across kernel versions. That is not a five-minute process, and the regression may bisect to a different commit depending on which AWS instance type and PostgreSQL configuration you test against.
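The cost is easy to underestimate, so here is a small sketch of the arithmetic. It treats the commit range as linear, so the step count is only an approximation of what git bisect does on real merge-heavy history. Both inputs are hypothetical placeholders, not measurements from the LKML thread.

```python
import math


def bisect_steps(commits: int) -> int:
    """Approximate number of kernels to build and test for a range of `commits`."""
    return math.ceil(math.log2(commits))


def bisect_hours(commits: int, minutes_per_step: float) -> float:
    """Total machine time in hours, assuming every step costs the same."""
    return bisect_steps(commits) * minutes_per_step / 60


# Hypothetical inputs, not measurements.
commits_in_range = 15_000
minutes_per_step = 90  # build + reboot + buffer warm-up + several pgbench runs

print(bisect_steps(commits_in_range), "steps")
print(f"{bisect_hours(commits_in_range, minutes_per_step):.1f} hours of machine time")
```

The step count grows only logarithmically with the range, so the per-step cost dominates. Every noisy result that forces a rerun adds another full step.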

Past regressions of this class followed a consistent arc: an LKML thread with perf data, a long discussion about which subsystem is responsible, a patch that addresses the immediate symptom, follow-up patches for edge cases, and eventual stabilization over two or three kernel releases. The Spectre and Meltdown mitigations in 4.15 caused noticeable PostgreSQL slowdowns due to increased syscall overhead; it took workload-specific tuning recommendations and targeted kernel patches before the situation stabilized. The MGLRU and VMA maple tree changes in 6.1 produced similar reports. Each time, the fix was neither fast nor simple.

What to Do Now

If you run PostgreSQL in production on Linux, do not upgrade to 7.0 without running pgbench against your workload first. The regression is severe enough that it will be obvious in a short benchmark run.
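A single pgbench run is noisy, so it is worth repeating each configuration a few times and comparing medians. The sketch below assumes you have captured the text output of each run. It relies only on the "tps = ..." summary line that pgbench prints and takes the first such line if there are several. The sample figures are made up to stand in for three runs on each kernel.

```python
import re
import statistics

TPS_RE = re.compile(r"^tps = ([0-9.]+)", re.MULTILINE)


def parse_tps(pgbench_output: str) -> float:
    """Pull the first 'tps = N' figure out of one pgbench run's text output."""
    match = TPS_RE.search(pgbench_output)
    if match is None:
        raise ValueError("no 'tps = ...' line found")
    return float(match.group(1))


def relative_change(baseline_runs: list[float], candidate_runs: list[float]) -> float:
    """Relative change of median TPS; -0.5 means throughput halved."""
    base = statistics.median(baseline_runs)
    cand = statistics.median(candidate_runs)
    return (cand - base) / base


# Made-up numbers standing in for three runs on each kernel.
old_kernel = [parse_tps(f"tps = {v} (without initial connection time)")
              for v in (41250.0, 40800.0, 41900.0)]
new_kernel = [parse_tps(f"tps = {v} (without initial connection time)")
              for v in (20400.0, 21100.0, 20650.0)]

change = relative_change(old_kernel, new_kernel)
print(f"median TPS change: {change:+.1%}")
```

Using the median rather than the mean keeps one outlier run, for example one that hit a checkpoint, from dominating the comparison.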

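If you are already running 7.0 or are evaluating it, several kernel parameters are worth testing. Setting transparent_hugepage=madvise rather than always stops the kernel from applying THP automatically to mappings that have not asked for it; note that the shared_buffers segment is shmem-backed, so the separate shmem_enabled knob in the same sysfs directory governs huge pages for it. Disabling NUMA automatic balancing with echo 0 > /proc/sys/kernel/numa_balancing stops the kernel from sampling access patterns through hinting faults and migrating pages and tasks at inopportune times, particularly on multi-socket AWS instance types. The old CFS knobs kernel.sched_latency_ns and kernel.sched_min_granularity_ns no longer exist on EEVDF kernels; the closest remaining tunable is base_slice_ns under /sys/kernel/debug/sched/, and adjusting it can shift EEVDF’s timeslice behavior in ways that reduce context-switching overhead for short-burst workloads.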
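Before changing anything, it helps to record what the machine is currently set to, so each benchmark result can be tied to a known configuration. The following read-only sketch reports the THP mode and the NUMA balancing state. The helper names are illustrative, and either file may be missing depending on how the kernel was built, in which case the value is reported as unavailable.

```python
import re
from pathlib import Path
from typing import Optional

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")
NUMA_BALANCING = Path("/proc/sys/kernel/numa_balancing")


def active_choice(sysfs_text: str) -> str:
    """Return the bracketed option from text like 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", sysfs_text)
    if match is None:
        raise ValueError(f"no bracketed option in {sysfs_text!r}")
    return match.group(1)


def read_or_none(path: Path) -> Optional[str]:
    """Read a kernel tunable, returning None if it is absent or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def report() -> dict:
    """Collect the current values without modifying anything."""
    thp = read_or_none(THP_ENABLED)
    return {
        "transparent_hugepage": active_choice(thp) if thp else None,
        "numa_balancing": read_or_none(NUMA_BALANCING),
    }


if __name__ == "__main__":
    for name, value in report().items():
        print(f"{name}: {value if value is not None else 'unavailable'}")
```

Saving this output next to each pgbench result makes it much easier to tell later which change actually moved the numbers.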

On the PostgreSQL side, if you are running 18 or later with io_method = io_uring, switching back to io_method = worker (the default) or io_method = sync isolates whether the regression involves the io_uring submission path. Pre-allocating huge pages via vm.nr_hugepages rather than relying on transparent huge page promotion gives the kernel fewer decisions to make about the shared buffer segment.
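Sizing vm.nr_hugepages is simple division, sketched below under the assumption of 2 MiB huge pages, the usual default on x86-64. PostgreSQL’s shared memory segment is somewhat larger than shared_buffers alone, so the 1 GiB allowance here is a hypothetical placeholder. On PostgreSQL 15 and later, the server can report the real figure through the shared_memory_size_in_huge_pages parameter, which is a better source than any estimate.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def hugepages_needed(shared_memory_bytes: int, hugepage_bytes: int = 2 * MIB) -> int:
    """Smallest number of huge pages that covers the shared memory segment."""
    return -(-shared_memory_bytes // hugepage_bytes)  # ceiling division


shared_buffers = 32 * GIB
# Hypothetical allowance for the rest of PostgreSQL's shared memory
# (lock tables, WAL buffers, and so on); measure it rather than trusting this.
other_shared = 1 * GIB

print("shared_buffers alone:", hugepages_needed(shared_buffers))
print("with allowance:      ", hugepages_needed(shared_buffers + other_shared))
```

The pool only helps if PostgreSQL is configured to use it, which is what the huge_pages setting in postgresql.conf controls.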

None of these are guaranteed to recover the lost performance. They are diagnostic steps as much as workarounds.

The Broader Pattern

This is not a story about Linux 7.0 being broken or PostgreSQL being fragile. The Linux kernel has an enormous mandate: perform well across cloud workloads, embedded systems, desktop environments, and everything in between. Database-shaped workloads represent a small fraction of that space, and changes that benefit the majority often have side effects at the database corner of the performance landscape.

PostgreSQL’s architecture makes it an accurate sensor for those side effects. The combination of shared anonymous memory, high process counts, and futex-heavy coordination creates a workload profile that exercises kernel code paths that most applications never stress simultaneously. When PostgreSQL performance drops 50% after a kernel upgrade, something significant changed in a hot path. The work of figuring out exactly what changed, and fixing it without breaking everything else, is what makes kernel development genuinely hard.

The LKML thread will produce a diagnosis eventually. It may take a few kernel point releases before a clean fix lands. In the meantime, the PostgreSQL community and AWS will work around it, as they have before.
