When the Kernel Moves: Linux 7.0, PostgreSQL, and the Architecture Problem That Never Goes Away
Source: hackernews
A report on LKML from an AWS engineer caught attention this week: PostgreSQL throughput dropped by roughly 50% when moving from Linux 6.x to 7.0. Kernel developers acknowledged the regression and noted that a fix may not come quickly. The Hacker News thread filled with database administrators recognizing the pattern.
A 50% drop is not scheduler drift. Routine major-version transitions produce 5-15% variance. Half your throughput gone is a behavioral shift in a critical code path, and understanding why PostgreSQL keeps finding itself on the wrong side of these shifts requires looking at its architecture against the kernel’s workload assumptions.
Why PostgreSQL Stresses the Kernel Differently
Three architectural choices in PostgreSQL combine to create an unusually demanding kernel workload, and each of the major subsystems that changed between Linux 6.x and 7.0 happens to touch one of them.
First, the process-per-connection model. Every client connection spawns a separate OS process. Under moderate load, a production server runs 200-500 backend processes plus background workers: WAL writer, checkpointer, autovacuum, background writer, parallel query workers. The kernel sees hundreds of unrelated tasks with no visibility into their shared coordination semantics. They are, from the scheduler’s perspective, independent.
Second, a large shared anonymous mapping. PostgreSQL’s shared_buffers, the user-space page cache shared by all backends and typically sized at 25-40% of physical RAM, is a single large anonymous mmap region mapped into every backend’s address space simultaneously. On a 192 GB instance this region might be 50-70 GB. Every TLB shootdown affecting one page in that region propagates to every process with the mapping. Every VMA operation the kernel performs on that region multiplies across hundreds of mm_struct instances, and each backend builds its own page tables for the pages it touches. The kernel has no concept that these are all the same physical pages being manipulated in coordination.
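To make that multiplication concrete, here is a back-of-the-envelope sketch of worst-case page-table overhead for a shared mapping. It assumes x86-64 with 8-byte page-table entries, counts only the leaf entries, and assumes every backend eventually touches every page. The function name and the figures (64 GiB of shared_buffers, 500 backends) are illustrative, not measurements from the report.

```shell
#!/usr/bin/env bash
# Worst-case leaf page-table footprint for a shared mapping.
# Assumptions: 8-byte entries (x86-64), every backend touches every page.
# Usage: pte_overhead_mib <mapping_gib> <backends> <page_size_kib>
pte_overhead_mib() {
  local mapping_gib=$1 backends=$2 page_kib=$3
  local pages=$(( mapping_gib * 1024 * 1024 / page_kib ))  # pages in the mapping
  local per_proc_bytes=$(( pages * 8 ))                    # one 8-byte entry per page
  echo $(( per_proc_bytes * backends / 1024 / 1024 ))      # total across backends, in MiB
}

# 64 GiB mapping, 500 backends, 4 KiB base pages
pte_overhead_mib 64 500 4      # prints 64000
# Same mapping backed by 2 MiB huge pages
pte_overhead_mib 64 500 2048   # prints 125
```

Under these assumptions the page tables alone can approach the size of the buffer pool itself with 4 KiB pages, which is one reason explicit huge pages are commonly recommended for large shared_buffers.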
Third, futex-heavy IPC. PostgreSQL’s locking stack, LWLocks and heavyweight locks, runs through POSIX semaphores backed by futex syscalls. Backends wake, acquire a lock, do a small burst of work, then block again. Under a loaded system this happens millions of times per second. Any increase in futex wake-up latency or serialization in the futex path translates directly into throughput loss.
These three pressures hit exactly the subsystems that changed significantly across the 6.x series and into 7.0.
Four Candidate Kernel Changes
No single commit has been confirmed as the culprit. The regression likely emerges from an interaction, which is part of what makes it hard to fix. But the main candidates are well-identified.
VMA maple tree and per-VMA locking (6.1-6.7). The red-black tree backing VMA lookups was replaced with a maple tree, a B-tree variant optimized for range operations, starting in 6.1. Layered on top was per-VMA locking: from 6.4 onward, page faults can take a fine-grained per-VMA lock instead of the process-wide mmap_lock reader-writer semaphore, which remains in place for operations that modify the address space. The goal was reducing contention under concurrent mmap operations and page faults, which it accomplishes for the target workload. For PostgreSQL’s pattern, hundreds of processes sharing one massive VMA region, the interaction between the new lock granularity and operations that must touch all mapped mm_struct instances may introduce overhead that the old coarse lock did not have. Connection-churning workloads (web applications without a connection pooler that create and destroy processes at high frequency) are specifically exposed because each connection’s process creates and destroys many VMAs.
EEVDF scheduler (6.6). The Earliest Eligible Virtual Deadline First scheduler replaced CFS entirely in 6.6. EEVDF assigns virtual deadlines to tasks and schedules the task with the earliest eligible deadline. It improves latency distribution for interactive workloads and reduces tail latency on mixed systems. Its wake-up preemption behavior and timeslice allocation differ from CFS in ways that matter for IPC-heavy workloads. PostgreSQL backends are short-duration CPU consumers: they wake from a lock wait, do a small amount of work, then immediately block again. CFS was tuned over 15 years to handle this profile. EEVDF was never specifically validated against it. The failure mode is a lock convoy: the scheduler’s wake-up ordering causes many backends to do a small amount of work before re-contending on the same lock, with context-switch overhead dominating. A 50% throughput drop is consistent with a severe convoy.
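One rough way to look for a convoy signature is to sample per-backend context-switch counters from /proc/PID/status. The sketch below only parses that file. The helper name is hypothetical, and a high voluntary count (a backend blocking on a lock or I/O) is suggestive rather than proof of a convoy.

```shell
#!/usr/bin/env bash
# Print "voluntary nonvoluntary" context-switch counts from a
# /proc/<pid>/status-style file. Missing fields are reported as 0.
# Usage: ctxt_switches <status_file>
ctxt_switches() {
  awk '/^voluntary_ctxt_switches:/    {v=$2}
       /^nonvoluntary_ctxt_switches:/ {n=$2}
       END {print v+0, n+0}' "$1"
}

# Example (not run here): sample every postgres process twice and compare.
#   for pid in $(pgrep -x postgres); do
#     echo "$pid $(ctxt_switches "/proc/$pid/status")"
#   done
```

Taking two samples a few seconds apart on the old and new kernel under the same benchmark gives a per-backend switch rate that can be compared directly.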
Multi-size THP and folio conversion (6.8-7.0). Multi-size Transparent Huge Pages, landed across 6.8-6.11, allows THP promotion at intermediate sizes between 4K and 2M. The ongoing folio conversion project is rewriting the kernel’s internal page abstraction from struct page to struct folio. The conversion is not complete. Incompletely or incorrectly optimized paths in shared anonymous memory during the transition are a plausible regression source, and THP promotion behavior on large anonymous mappings has caused PostgreSQL problems before the current regression.
MGLRU (6.1, default on many distribution kernels). Multi-Gen LRU page reclaim was merged in 6.1 as an opt-in feature (CONFIG_LRU_GEN_ENABLED) and has since been enabled by default on a number of distribution kernels. Its hot/cold classification heuristics differ from the classic two-list approach. PostgreSQL’s buffer manager already does its own page lifecycle management; it knows which pages are worth keeping better than the kernel does. When MGLRU’s classification diverges from PostgreSQL’s buffer replacement policy, the result can be excess page faults even under abundant RAM.
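Whether MGLRU is actually active on a given host can be checked through /sys/kernel/mm/lru_gen/enabled, which reports a hexadecimal bitmask (for example 0x0007 when all components are on). The helper below is a small sketch with a hypothetical name; it treats a missing file as a kernel built without the feature.

```shell
#!/usr/bin/env bash
# Report MGLRU state: "enabled", "disabled", or "absent".
# Usage: mglru_state [path]   (path defaults to the real sysfs file)
mglru_state() {
  local f=${1:-/sys/kernel/mm/lru_gen/enabled}
  if [ ! -r "$f" ]; then
    echo "absent"                       # no sysfs file: feature not built in
    return
  fi
  if [ $(( $(cat "$f") )) -ne 0 ]; then # file holds a hex bitmask such as 0x0007
    echo "enabled"
  else
    echo "disabled"
  fi
}
```

Writing 0 to the same file as root disables MGLRU at runtime, which makes it a cheap variable to eliminate when comparing kernels.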
Why the Fix Is Structurally Hard
The difficulty is not technical ignorance. The kernel developers know these subsystems well. The difficulty is structural.
No single commit has been isolated. A regression that only manifests when the folio infrastructure, updated NUMA balancing, and EEVDF interact in a specific way under a particular AWS NUMA instance topology has no clean commit to revert. The investigation requires building and running a full PostgreSQL benchmark at each bisection step, across a major version range. Each iteration is expensive.
Moreover, every change that is a candidate for reversion was an intentional improvement for a different workload. The maple tree and per-VMA locking genuinely reduce contention on systems with high mmap churn. EEVDF measurably improves interactive latency. MGLRU improves memory reclaim for mixed workloads. Accepting a patch that makes PostgreSQL fast again at the cost of regressing those workloads is not acceptable to kernel maintainers, and correctly so.
The fix may also require changes in both the kernel and PostgreSQL. PostgreSQL’s conservative patch policy and its independent release cycle mean that cross-project coordination under a regression timeline is slow by default.
Mitigations Available Now
While the investigation continues, several operational levers are available.
THP behavior is the first thing to control:
# Aggressive option: disable THP entirely
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Conservative option: apply THP only to regions that opt in via madvise()
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
And in postgresql.conf:
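huge_pages = try # explicit hugetlb pages; independent of THP, needs vm.nr_hugepages reserved
Under madvise mode, the kernel applies THP only to regions that opt in with madvise(MADV_HUGEPAGE), which removes the kernel’s automatic collapse and split behavior from the equation for everything else. The huge_pages setting is a separate mechanism: it requests explicitly reserved huge pages for shared memory, and those pages are not managed by THP at all.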
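It is worth confirming which THP mode is actually active before and after changing it. The sysfs file lists all modes and marks the current one in square brackets (for example "always [madvise] never"). The helper below is a minimal sketch that extracts the bracketed value; its name is made up for illustration.

```shell
#!/usr/bin/env bash
# Print the active THP mode, i.e. the bracketed token in the sysfs file.
# Usage: thp_mode [path]   (path defaults to the real sysfs file)
thp_mode() {
  sed -n 's/.*\[\(.*\)\].*/\1/p' \
    "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}

# Example (not run here):
#   thp_mode                                              # "enabled" setting
#   thp_mode /sys/kernel/mm/transparent_hugepage/defrag   # "defrag" setting
```

Settings written with echo do not survive a reboot, so the same check is useful in a boot-time health script.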
On multi-socket or NUMA instances, disabling NUMA auto-balancing prevents the kernel from migrating processes across NUMA nodes at inopportune times:
echo 0 > /proc/sys/kernel/numa_balancing
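This mitigation only matters when the machine exposes more than one NUMA node. A quick sketch for counting nodes from sysfs follows; the function name is hypothetical, and the directory argument exists only so the logic can be exercised against a test directory.

```shell
#!/usr/bin/env bash
# Count NUMA nodes by listing node<N> directories under sysfs.
# Usage: numa_node_count [base_dir]
numa_node_count() {
  local base=${1:-/sys/devices/system/node}
  # A non-matching glob makes ls fail quietly, so the count falls back to 0.
  ls -d "$base"/node[0-9]* 2>/dev/null | wc -l | tr -d ' '
}

# Example (not run here):
#   [ "$(numa_node_count)" -gt 1 ] && cat /proc/sys/kernel/numa_balancing
```

On a single-node instance the auto-balancing setting has nothing to migrate between, so it can be dropped from the list of suspects.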
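For EEVDF tuning, note that the CFS-era sysctls kernel.sched_latency_ns and kernel.sched_min_granularity_ns no longer exist: the scheduler knobs moved to debugfs in 5.13, and since 6.6 the remaining slice control is base_slice_ns. Trying a longer base slice can help isolate whether the scheduler is the primary contributor (this requires debugfs to be mounted, and the value below is an illustrative starting point, not a recommendation):
cat /sys/kernel/debug/sched/base_slice_ns
echo 6000000 > /sys/kernel/debug/sched/base_slice_ns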
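For PostgreSQL 18+, which introduced asynchronous I/O, testing io_method = worker (or sync) against io_method = io_uring in postgresql.conf can determine whether part of the regression is in the io_uring submission path, which changed substantially across recent kernel versions. The io_uring option is only available when PostgreSQL was built with liburing support.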
Reducing max_connections and routing through PgBouncer or pgpool-II reduces the number of competing processes and directly lowers scheduler pressure, which is useful regardless of which subsystem is responsible.
A Recurring Pattern
This is not new. The specific kernel changes are new; the pattern is not.
When CFS shipped in Linux 2.6.23 in 2007, database workloads regressed measurably from the O(1) scheduler. Group scheduling features in 2.6.24 made things worse before they improved. The Spectre and Meltdown mitigations in 4.15 increased syscall overhead through kernel page-table isolation (KPTI); PostgreSQL’s process-per-connection model made it far more sensitive than threaded applications because each backend crosses the user-kernel boundary more frequently per unit of work, and resolution took multiple releases of retpoline tuning and IBRS mode selection. THP compaction stalls have produced throughput variance across the 4.x-5.x range, leading to the standing recommendation, which predates this regression, to use madvise mode. Linux 5.14 NUMA balancing changes produced regressions on multi-socket deployments.
The common thread is that PostgreSQL’s architecture, sensible and well-understood for what it is, places extreme and unusual demands on exactly the kernel subsystems most likely to change for good reasons. Process-per-connection is not going away; neither is the large shared anonymous buffer pool. The kernel’s general-purpose workload assumptions will continue to evolve.
Kernel developers will likely resolve this through a new sysctl or scheduler hint giving workloads a way to request different wake-up behavior, or a new madvise flag giving applications more control over their shared memory treatment. A PostgreSQL-side adaptation, such as batching lock operations or restructuring backend coordination, is a multi-release engineering effort. A clean patch that makes PostgreSQL fast on Linux 7.0 without touching either the database or the responsible kernel change is unlikely in the near term.
The practical near-term advice is to pin kernel versions on production PostgreSQL deployments until the investigation concludes, apply the THP and NUMA mitigations, and watch the LKML thread for bisection results. The longer-term structural situation is unchanged from what it has always been: databases and general-purpose kernels have always required deliberate co-evolution, and no amount of upstream goodwill removes the need to verify major kernel upgrades against production workloads before deploying them.
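As one way to enforce the pinning advice, a deployment or startup script can refuse to proceed on a kernel major version that has not been validated. The sketch below is an assumption about how such a guard might look; the function names and the ceiling of 6 are illustrative, and distribution-level package holds are a separate, complementary mechanism.

```shell
#!/usr/bin/env bash
# Extract the major version from a kernel release string such as
# "6.8.0-1010-aws" (everything before the first dot).
kernel_major() {
  echo "${1%%.*}"
}

# Succeed only if the release's major version is <= the validated ceiling.
# Usage: kernel_allowed <release> <max_validated_major>
kernel_allowed() {
  [ "$(kernel_major "$1")" -le "$2" ]
}

# Example (not run here): block startup on an unvalidated kernel.
#   kernel_allowed "$(uname -r)" 6 || {
#     echo "kernel $(uname -r) not validated for PostgreSQL" >&2
#     exit 1
#   }
```

Raising the ceiling then becomes an explicit, reviewable change made after the benchmark has been rerun on the new kernel.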