
Half the Throughput: Linux 7.0, PostgreSQL, and a Recurring Architecture Problem

Source: hackernews

The report landed on the kernel mailing list with a straightforward finding: upgrading to Linux 7.0 cut PostgreSQL throughput roughly in half on production-class AWS instances. Discussion on Hacker News filled quickly, the kernel team acknowledged the regression, and coverage noted that a fix “may not be easy.” The usual work of bisecting commits and running pgbench across kernel versions is underway.

The headline number deserves closer examination. This is not the first time something like this has happened, and the reasons it keeps happening are more structural than accidental.

PostgreSQL’s Unusual Kernel Footprint

Most applications look predictable to the kernel: a single process with private memory, or a thread pool sharing one address space. The abstractions in Linux, from virtual memory management through the process scheduler, are shaped around those patterns.

PostgreSQL is different in several compounding ways.

Process per connection. Each client connection gets its own OS process. A moderately loaded server runs 200 to 500 backend processes simultaneously. This design has genuine advantages in isolation and crash containment, but it means the kernel is managing the full lifecycle, scheduling state, and memory accounting for hundreds of processes where a different architecture would use threads or goroutines sharing a single address space.

A large shared anonymous mapping. shared_buffers is a single large anonymous mmap region, typically 25 to 40% of physical RAM, mapped into every backend process. On a 192 GB instance that is roughly 48 to 77 GB visible through 200+ separate mm_struct instances. A change to one page in that region, such as a migration or a huge-page split, can require TLB shootdowns in every process that has it mapped. Per-mapping costs such as VMA bookkeeping are paid once per process, so they multiply across all those mappings.
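One per-process cost that is easy to put numbers on is page-table memory, since each backend keeps its own page tables for the shared region. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes x86-64 with 8-byte leaf entries, a 64 GiB shared_buffers, 200 backends, and the worst case where every backend has touched every page. The function name is made up for illustration.

```shell
#!/bin/sh
# Rough leaf page-table cost of mapping shared_buffers in every backend.
# Assumptions (not from the regression report): x86-64, 8-byte entries,
# leaf level only, every backend has faulted in every page.

# pte_bytes MAPPING_BYTES PAGE_BYTES -> bytes of leaf page-table entries
pte_bytes() {
    echo $(( $1 / $2 * 8 ))
}

GIB=$(( 1024 * 1024 * 1024 ))
MIB=$(( 1024 * 1024 ))

shared_buffers=$(( 64 * GIB ))   # illustrative size for a 192 GB host
backends=200                     # illustrative connection count

per_proc_4k=$(pte_bytes "$shared_buffers" 4096)
per_proc_2m=$(pte_bytes "$shared_buffers" $(( 2 * MIB )))

echo "4 KiB pages: $(( per_proc_4k / MIB )) MiB per backend, $(( per_proc_4k * backends / GIB )) GiB across $backends backends"
echo "2 MiB pages: $(( per_proc_2m / 1024 )) KiB per backend"
```

Under these assumptions the 4 KiB case comes to 128 MiB per backend and 25 GiB in total, against 256 KiB per backend with 2 MiB pages. That gap is one reason huge-page behavior matters so much for this workload.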

Futex-heavy coordination. PostgreSQL’s lightweight locks (LWLocks), which protect shared structures such as the buffer mapping table and lock manager partitions, take an atomic fast path when uncontended but sleep on POSIX semaphores when contended, and on Linux those semaphores are built on futexes. Lock acquisitions happen millions of times per second under load, so any increase in futex syscall overhead, or new serialization in the futex wake-up path, shows up directly as throughput loss.

These three characteristics together produce a kernel workload profile that differs substantially from what standard benchmarks like LMBench exercise and from what most server applications look like under realistic load.

The Candidate Causes

No single confirmed culprit has emerged from the mailing list thread. The “fix may not be easy” acknowledgment reflects genuine uncertainty about which subsystem introduced the regression and whether it is one change or an interaction between several. Multiple developments across the Linux 6.x and 7.0 timeline are plausible contributors.

VMA Maple Tree and Per-VMA Locking

Starting in Linux 6.1, the kernel replaced the red-black tree that tracked VMAs with a maple tree, a B-tree variant optimized for range operations. From 6.4 onward it progressively moved page-fault handling off the single mmap_lock reader-writer semaphore and onto fine-grained per-VMA locks. For most workloads this is a strict improvement: fewer contention points under concurrent mmap operations. For PostgreSQL, which has hundreds of processes sharing one massive VMA region, the interaction between the new per-VMA lock granularity and the cost of operations that touch all mapped mm_struct instances may introduce overhead that did not exist under the old coarse lock. TLB shootdowns and VMA splits in this model are no longer localized events.

EEVDF Scheduler

Linux 6.6 replaced the Completely Fair Scheduler with the Earliest Eligible Virtual Deadline First scheduler. EEVDF’s virtual deadline model changes how wake-up latency is distributed across runnable tasks. PostgreSQL backends are short-duration CPU consumers: they wake from a lock wait or I/O, do a small amount of work, then block again immediately. CFS was tuned over fifteen years with this kind of latency-sensitive, frequently-blocking workload in mind. EEVDF changes timeslice and wake-up preemption behavior in ways the CFS era never required validating against, and there were documented regressions for database-style workloads shortly after 6.6 shipped.

Multi-Size THP and Folio Conversion

Multi-size Transparent Huge Pages, developed across Linux 6.8 through 7.0, allows THP promotion at intermediate sizes between 4K and 2M. THP promotion and splitting behavior on large anonymous mappings has caused PostgreSQL problems before; the earlier recommendation to use transparent_hugepage=madvise rather than always exists for this reason. The parallel folio conversion project, which is rewriting the kernel’s internal page abstraction from struct page to struct folio, is still in progress; incomplete optimizations in shared anonymous memory paths are a plausible regression source while the conversion is underway.

MGLRU

Multi-Generational LRU was merged in Linux 6.1. It is off by default in the upstream configuration, though several distribution kernels enable it. Its hot/cold classification heuristics differ from the classic two-list approach. PostgreSQL’s buffer manager does its own page lifecycle management and tracks which pages matter; when MGLRU’s classification diverges from that, the result can be excess page faults even under abundant RAM, because the kernel ages and evicts pages independently of PostgreSQL’s own buffer replacement policy.
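Whether MGLRU is actually active on a given host can be read from sysfs, which is worth checking before counting it as a suspect. A minimal sketch, assuming the kernel exposes /sys/kernel/mm/lru_gen/enabled as a hexadecimal bitmask whose lowest bit is the main switch; the helper name is hypothetical.

```shell
#!/bin/sh
# Report whether Multi-Generational LRU is switched on.
# The sysfs file holds a hex bitmask such as 0x0007; bit 0 is the main switch.

# mglru_state MASK -> "enabled" or "disabled"
mglru_state() {
    if [ $(( $1 & 1 )) -eq 1 ]; then
        echo enabled
    else
        echo disabled
    fi
}

lru_file=/sys/kernel/mm/lru_gen/enabled
if [ -r "$lru_file" ]; then
    echo "MGLRU: $(mglru_state "$(cat "$lru_file")")"
else
    echo "MGLRU: not available on this kernel"
fi
```

Running the same benchmark with the feature switched off (writing n to that file as root) is a cheap way to rule it in or out.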

A 50% throughput drop implies something changed in a hot path. That kind of regression typically points to memory management or lock contention rather than I/O scheduling, since I/O bottlenecks tend to manifest as latency spikes rather than a sustained halving of throughput. The VMA locking changes and EEVDF are the strongest candidates based on the shape of the regression and prior patterns.

Why There Is No Simple Revert

This is not the first time PostgreSQL has encountered a performance cliff from a major kernel change. The Meltdown mitigation merged in Linux 4.15, kernel page-table isolation (KPTI), increased the cost of every syscall, and the Spectre mitigations that followed added further overhead. PostgreSQL’s process-per-connection model made it significantly more sensitive to that overhead than threaded applications were, because each backend process crosses the user-kernel boundary far more frequently per unit of work. The resolution took multiple releases of retpoline tuning and IBRS mode selection before the overhead was absorbed across common hardware configurations.

An earlier THP interaction produced the standing advice to restrict THP on database hosts and to back shared memory with explicit huge pages instead, which PostgreSQL requests through its huge_pages setting (try by default). The EEVDF scheduler generated bug reports from database operators in the 6.6 cycle. Each case shared a structure: the kernel made a change that was correct or necessary for the general workload, and PostgreSQL’s architecture meant it experienced the change differently from most other software.

In the current situation, with multiple subsystem changes as potential contributors and no confirmed single commit, the investigation is harder. The kernel team needs pgbench data across bisection points for each candidate area. Interaction effects between subsystems make clean attribution difficult. Even once the contributing change is identified, the fix options are constrained: reverting a change that benefits other workloads is rarely acceptable, so the resolution typically involves a new madvise hint, a new sysctl, or an interface change that gives PostgreSQL a way to opt into different kernel behavior. Getting that agreed upon and merged takes months.

Mitigations Available Now

There are configuration changes that reduce exposure for deployments that cannot wait on an upstream fix.

Setting transparent_hugepage=madvise at the OS level restricts THP promotion to regions that explicitly request it, removing unsolicited THP operations from PostgreSQL’s shared memory segment. The recommendation predates this regression; the reasoning behind it is more relevant now.
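The active THP mode is the bracketed word in /sys/kernel/mm/transparent_hugepage/enabled. The sketch below reads it back so the setting can be verified after a reboot or a kernel upgrade; the helper name is an assumption for illustration.

```shell
#!/bin/sh
# Print the active mode from a transparent_hugepage "enabled" line,
# e.g. "always [madvise] never" -> "madvise".
thp_active_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}

thp_file=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp_file" ]; then
    echo "THP mode: $(thp_active_mode "$(cat "$thp_file")")"
else
    echo "THP sysfs file not found (kernel may be built without THP)"
fi

# To change the mode at runtime (as root):
#   echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
```

A runtime write does not survive a reboot; the transparent_hugepage=madvise kernel command-line parameter makes it persistent.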

On multi-socket or NUMA-topology AWS instances, disabling automatic NUMA balancing prevents the kernel from migrating processes across nodes based on observed access patterns:

echo 0 > /proc/sys/kernel/numa_balancing

That migration can increase shared memory access latency in ways that compound an existing regression on high-core-count instances.

For PostgreSQL 18 or later, which introduced the io_method setting, testing io_method = worker (the default) or io_method = sync against io_method = io_uring can determine whether part of the regression is in the io_uring submission path, which changed substantially across recent kernel versions. This is a low-cost way to narrow the search space.

Adjusting scheduler time-slice parameters can shift behavior for the process-per-connection workload shape. The CFS-era knobs, sched_latency_ns and sched_min_granularity_ns, moved from sysctl to debugfs in Linux 5.13 and were dropped when EEVDF landed. On EEVDF kernels the remaining slice tunable is base_slice_ns under debugfs:

echo 6000000 > /sys/kernel/debug/sched/base_slice_ns

The value shown is illustrative and serves as a comparison point against the default, not a guaranteed fix. The defaults were not specifically tuned for hundreds of short-burst processes sharing a large memory region.
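Which slice knob exists depends on the kernel version, so it is worth probing before scripting any change. A minimal sketch, assuming the three locations below (the EEVDF debugfs name, the CFS debugfs name from 5.13 onward, and the older sysctl path); the helper is hypothetical, and debugfs normally requires root and a mounted /sys/kernel/debug.

```shell
#!/bin/sh
# Print the first path from the argument list that exists; fail if none do.
first_existing() {
    for p in "$@"; do
        if [ -e "$p" ]; then
            printf '%s\n' "$p"
            return 0
        fi
    done
    return 1
}

# Candidate locations, newest kernel layout first.
knob=$(first_existing \
    /sys/kernel/debug/sched/base_slice_ns \
    /sys/kernel/debug/sched/min_granularity_ns \
    /proc/sys/kernel/sched_min_granularity_ns) || knob=""

if [ -n "$knob" ] && [ -r "$knob" ]; then
    echo "$knob = $(cat "$knob")"
else
    echo "no readable scheduler slice knob found (debugfs needs root)"
fi
```

Recording the reported value alongside each benchmark run keeps before-and-after comparisons honest when hosts run different kernels.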

For production deployments: do not upgrade database hosts to Linux 7.0 without first running a representative pgbench workload at realistic connection counts and scale factors. This has been sound practice for major kernel upgrades on database infrastructure for years, and this regression makes it a requirement.
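A comparison of that kind can be reduced to two small helpers: one that pulls the tps figure out of pgbench’s summary, and one that expresses a candidate kernel’s result as a percentage change from the baseline. This is a sketch only; the database name, scale factor, and client counts in the usage comment are placeholders, not values from the regression report.

```shell
#!/bin/sh
# Read pgbench output on stdin and print the first reported tps figure.
pgbench_tps() {
    awk '/^tps = / { print $3; exit }'
}

# tps_change_pct BASELINE CANDIDATE -> percentage change, one decimal place
tps_change_pct() {
    awk -v base="$1" -v cand="$2" \
        'BEGIN { printf "%.1f\n", (cand - base) / base * 100 }'
}

# Illustrative usage, run once per kernel on otherwise identical hosts:
#   pgbench -i -s 1000 benchdb
#   pgbench -c 256 -j 32 -T 300 -M prepared benchdb | pgbench_tps
# Then compare the two figures:
#   tps_change_pct "$tps_old_kernel" "$tps_new_kernel"
```

Several runs per kernel, with the spread reported alongside the mean, make the result far easier to act on upstream than a single number.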

The Structural Tension

The Phoronix coverage frames this as a PostgreSQL problem. It is, but it is equally a story about how the kernel’s model of a normal process has grown increasingly distant from PostgreSQL’s architecture over time. The kernel optimizes for processes with private memory and for workloads where its scheduling and memory management assumptions hold across a relatively uniform process shape. PostgreSQL was built around a process model and a shared memory design that predates most of the subsystems now involved in these regressions.

None of that is a criticism. PostgreSQL’s architecture predates Linux itself. The process model and shared memory approach deliver real benefits in isolation, crash containment, and operating system portability. The kernel’s changes to EEVDF, the maple tree, multi-size THP, and folio conversion each address genuine problems in the general case. The tension is structural and neither side caused it.

What it means in practice is that these regressions will keep surfacing every few years as major kernel subsystems turn over. The engineers who caught this one, traced it to a kernel version, and reported it upstream with benchmark data are doing the only work that leads anywhere. The fix just takes longer than anyone would like.
