
When the Kernel Pulls the Floor Out: PostgreSQL, Linux 7.0, and a Familiar Kind of Regression


An AWS engineer posted to the Linux kernel mailing list recently with a result that should get the attention of anyone running PostgreSQL in production: throughput dropped by roughly half on Linux 7.0 compared to the 6.x series. The phrase in the Phoronix headline, “fix may not be easy,” is not hedging for drama. In kernel development, that phrase means something specific, and understanding what it means here requires a bit of context about how PostgreSQL and Linux interact at a level most operators never think about.

Why 50% Is Different

Kernel regressions happen. Schedulers get retuned, memory management heuristics shift, I/O paths get restructured. Database workloads are among the most sensitive to these changes, and 5-15% throughput swings are not unusual across major kernel versions. A 50% drop is different. It signals that something fundamental changed, not a tunable that drifted, but a behavioral shift in a hot path that PostgreSQL exercises constantly.

PostgreSQL’s architecture puts unusual pressure on several kernel subsystems simultaneously. Each client connection is a separate OS process. These backends share a large anonymous mmap segment that holds the shared_buffers pool, and they coordinate through POSIX semaphores backed by futex system calls. They write sequentially to WAL and do random I/O against data files. The WAL writer, background writer, checkpointer, and autovacuum workers run as separate processes with distinct memory access patterns, while the lock manager and buffer manager are shared-memory structures that every backend touches. When you run pgbench at scale, you are stressing the scheduler, the VM subsystem, the futex path, and the block layer, often all at once.
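The process model described above can be imitated in a few lines. This is only a toy sketch, not PostgreSQL code: the function name run_backends and the single shared counter are invented for illustration, and it assumes a Unix system where the fork start method is available. Several forked processes update one anonymous shared mapping while serializing on a semaphore, which is the same combination of kernel facilities (fork, shared mmap, futex-backed semaphores) that the paragraph lists.

```python
import mmap
import multiprocessing as mp
import struct


def run_backends(n_backends: int, increments: int) -> int:
    """Fork n_backends processes that each bump a shared counter.

    Hypothetical helper for illustration; assumes a Unix host with fork.
    """
    ctx = mp.get_context("fork")
    # Anonymous mapping; on Unix, mmap defaults to MAP_SHARED, so forked
    # children see the same pages (a tiny stand-in for shared_buffers).
    shared = mmap.mmap(-1, 8)
    # A semaphore with one permit, used here as a mutex around the counter.
    lock = ctx.Semaphore(1)

    def backend() -> None:
        for _ in range(increments):
            with lock:
                (value,) = struct.unpack_from("q", shared, 0)
                struct.pack_into("q", shared, 0, value + 1)

    procs = [ctx.Process(target=backend) for _ in range(n_backends)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    (total,) = struct.unpack_from("q", shared, 0)
    shared.close()
    return total
```

Calling run_backends(4, 1000) exercises the scheduler, the shared mapping, and the semaphore wait path together. Every increment passes through one lock, so the sketch also shows how easily shared state becomes a point of contention.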

A regression that cuts throughput in half on a workload like that is almost certainly not isolated to one subsystem. More likely, a single kernel change disrupted a hot path that multiple subsystems depend on, or introduced a serialization point that converts parallel work into sequential work.
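To see how small a serialization point has to be to produce a loss of this size, Amdahl’s law gives a rough model. It is an illustrative simplification, not something taken from the regression report, and the worker count of 64 is an arbitrary assumption. If a fraction s of each unit of work must run serially, throughput with n workers relative to a single worker is 1 / (s + (1 - s) / n).

```python
def relative_throughput(serial_fraction: float, workers: int) -> float:
    """Amdahl's law: throughput relative to one worker."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def serial_fraction_for_loss(workers: int, remaining: float) -> float:
    """Serial fraction that leaves `remaining` (e.g. 0.5) of ideal throughput.

    Ideal throughput is `workers` (no serial work). Solving
    1 / (s + (1 - s) / n) = remaining * n for s gives the expression below.
    """
    return (1.0 / remaining - 1.0) / (workers - 1)


# Assumed example: 64 concurrent backends, throughput cut in half.
s = serial_fraction_for_loss(64, 0.5)
print(f"serial fraction needed: {s:.4f}")
print(f"relative throughput: {relative_throughput(s, 64):.1f} of 64")
```

Under this model, a newly introduced serial section of about 1.6% of the work (1/63) is enough to halve throughput at 64 workers. That is why a single contended lock in a hot path can look like a catastrophic regression.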

The Pattern Is Not New

This is not the first time Linux kernel evolution has caught PostgreSQL in a difficult position.

When Linux 6.6 shipped with the EEVDF scheduler replacing CFS, several database workloads reported throughput regressions. EEVDF was a well-reasoned replacement with better theoretical properties for interactive and mixed workloads, but its timeslice allocation and wakeup preemption behaved differently under the lock-heavy, process-per-connection model that PostgreSQL uses. Fixes came gradually, through tuning guidance and targeted kernel patches improving EEVDF’s behavior under contention-heavy workloads.

The multi-size Transparent Huge Page work, which landed across Linux 6.8 through 6.11, caused regressions in workloads using large anonymous mmap regions, precisely the kind PostgreSQL uses for shared_buffers. THP collapse and split behavior under those regions added latency that appeared as throughput variability in pgbench results.

The VMA maple tree conversion, which replaced red-black trees with maple trees for virtual memory area management starting in Linux 6.1, also produced regressions in connection-churning workloads, where processes with many VMAs were being created and destroyed at high frequency. PostgreSQL with short-lived connections (common in web workloads without a connection pooler) fits that profile.

The MGLRU page reclaim algorithm was merged in 6.1, where it is enabled by default in some distribution kernels, and it introduced new heuristics for classifying pages as hot or cold. PostgreSQL’s buffer manager does its own page lifecycle management; it knows which pages are worth keeping far better than the kernel can infer from access patterns alone. When MGLRU’s classification diverges from what PostgreSQL actually needs in memory, the result is excessive page faults and degraded I/O performance, even when the system has plenty of RAM.

Each of these cases followed a similar arc: a Linux subsystem was improved for the general case, database workloads fell into a behavioral pattern the improvement did not account for, and the fix required either kernel-side tuning, PostgreSQL-side workarounds, or both.

Why the Fix Is Hard

In kernel development, a regression with a clear bisect point is a manageable problem. You find the commit, understand the trade-off it made, and either revert it, fix the regression in the new code, or document the workload as needing a different configuration. The process is painful but well-understood.
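The search that git bisect automates is a binary search over an ordered commit range. The sketch below shows the idea under the usual assumptions: the oldest commit is known good, the newest is known bad, and the regression persists in every commit after it first appears. The is_bad callback is a hypothetical stand-in for “build this kernel, boot it, run the benchmark, compare against a threshold.”

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit in a list ordered oldest to newest.

    Assumes commits[0] is good, commits[-1] is bad, and that once the
    regression appears it stays (the same assumption git bisect makes).
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression is after mid
    return commits[bad]


# Hypothetical history: 1000 commits, regression introduced at c0637.
history = [f"c{i:04d}" for i in range(1000)]
tests_run = []


def check(commit: str) -> bool:
    tests_run.append(commit)
    return commit >= "c0637"


print(first_bad(history, check), "found in", len(tests_run), "benchmark runs")
```

About ten benchmark runs cover a thousand commits, which is why a clean bisect makes a regression tractable. The method stops working when no single commit flips the result from good to bad, which is the harder case discussed next.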

The harder case is when the regression emerges from the interaction of multiple changes, none of which is individually wrong. Linux 7.0 incorporates changes accumulated across a long 6.x development period. If the PostgreSQL regression only manifests when, say, the new memory folios infrastructure, updated NUMA balancing behavior, and sched_ext scheduling extensions interact in a specific way under the AWS instance topology, there is no single commit to revert. The fix requires understanding the interaction, which may require changes across multiple subsystems.

The other hard case is when the responsible change was necessary for a good reason, such as a security fix, a correctness improvement, or essential infrastructure work that cannot be unwound without breaking the thing it was meant to fix. In that scenario, the kernel is not going to revert the change for PostgreSQL’s benefit; PostgreSQL has to adapt, and database-level changes to accommodate kernel behavior are significantly harder to develop, test, and deploy than a kernel patch.

The LKML thread likely contains pgbench output, perf stat data, and possibly a bisect result pointing toward a subsystem. What it probably does not contain is a patch, or even a consensus on where to look. That is what “fix may not be easy” means in practice: the people who understand the relevant subsystems well enough to fix it need time to agree on what they are actually looking at.

What Operators Can Do Right Now

If you are running PostgreSQL on a kernel in the 7.0 range and seeing unexpected throughput degradation, there are a few things worth testing before the upstream fix materializes.

Transparent huge pages are a consistent source of database regressions. Setting transparent_hugepage=madvise at the OS level (rather than always) and configuring PostgreSQL with huge_pages = try or huge_pages = off can isolate whether THP is a factor. Under madvise, the kernel applies THP only to regions that a process opts in with madvise(MADV_HUGEPAGE), which removes the kernel’s automatic collapse/split behavior from the equation for everything else. Explicit huge pages requested through huge_pages are a separate mechanism: they come from a pool reserved in advance (vm.nr_hugepages) and are not managed by THP.
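Before changing anything, it helps to confirm which THP mode a host is actually running. The kernel reports it in /sys/kernel/mm/transparent_hugepage/enabled as a list of modes with the active one in square brackets, for example "always [madvise] never". The helper names below are made up for this sketch.

```python
from pathlib import Path
from typing import Optional

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> str:
    """Extract the active mode from text like 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no bracketed mode in {text!r}")


def current_thp_mode(path: Path = THP_ENABLED) -> Optional[str]:
    """Return the active THP mode, or None if it cannot be determined."""
    try:
        return parse_thp_mode(path.read_text())
    except (OSError, ValueError):
        return None


print("THP mode:", current_thp_mode())
```

If the result is "always", switching to "madvise" for a test run is a cheap way to rule THP in or out.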

NUMA balancing is worth disabling on multi-socket or NUMA instances: echo 0 > /proc/sys/kernel/numa_balancing. PostgreSQL processes accessing shared memory from the wrong NUMA node pay a memory latency penalty, and the kernel’s automatic NUMA balancing can make this worse under some workloads by migrating processes at inopportune times.
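The current setting can be read from the same file before and after the change. A minimal sketch, with a hypothetical function name, that treats any nonzero value as enabled and returns None where the file is missing (for example on kernels built without NUMA balancing):

```python
from pathlib import Path
from typing import Optional

NUMA_BALANCING = Path("/proc/sys/kernel/numa_balancing")


def numa_balancing_enabled(path: Path = NUMA_BALANCING) -> Optional[bool]:
    """True if automatic NUMA balancing is on, False if off, None if unknown."""
    try:
        return int(path.read_text().strip()) != 0
    except (OSError, ValueError):
        return None


print("NUMA balancing enabled:", numa_balancing_enabled())
```

Recording this value alongside each benchmark run makes it easier to tell later which configuration produced which number.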

If you are using PostgreSQL 18 or later with io_method = io_uring, switching back to io_method = worker (the default) or io_method = sync is worth testing. The io_method setting and its io_uring path are new in PostgreSQL 18, and any regression in the io_uring submission or completion path in Linux 7.0 would affect only that code path. The setting takes effect only after a server restart.

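For scheduler-related regressions, note that the CFS-era tunables sched_latency_ns and sched_min_granularity_ns no longer exist under EEVDF. The closest remaining knob is the base slice, exposed through debugfs at /sys/kernel/debug/sched/base_slice_ns on kernels 6.6 and later, which shifts EEVDF’s time-slicing behavior. There are no universal values, but the PostgreSQL wiki on server configuration and community benchmarks from kernel regression reports in previous cycles can provide starting points.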

The Deeper Tension

This kind of regression reveals something real about the relationship between Linux and the software that depends on it. The kernel is not a stable ABI in the behavioral sense. System call semantics are stable; how the kernel schedules processes, manages memory, and handles I/O is not. That behavior evolves with every release, for good reasons, and the general case benefits. But PostgreSQL is not the general case. Its process model was designed in an era when assumptions about process scheduling, memory management, and I/O performance were very different, and it has been successful enough that hundreds of thousands of production deployments depend on it.

Running PostgreSQL at scale puts AWS in a strong position to catch these regressions early and raise them upstream, which is what this report represents. The kernel community benefits from the signal; AWS benefits from the fix; everyone running PostgreSQL on Linux benefits eventually. But the gap between the report and the fix, in cases like this, can be measured in kernel release cycles, and for operators on the receiving end of a 50% throughput drop, that is not an abstract problem.
