Linux 7.0 Cut PostgreSQL Performance in Half and the Fix Has No Easy Path
Source: hackernews
A report landed on the Linux kernel mailing list recently that stopped a few people cold: a PostgreSQL deployment on Linux 7.0 was running at roughly half the throughput observed on 6.x. The engineer behind the report works at AWS, which means this was measured at scale, on production-class hardware, under real workloads. Phoronix covered the report with a headline note that the fix may not be easy, which is the part that deserves more attention than the performance number itself.
A 50% regression sounds like a bug. Usually it is. But when the kernel developers say the fix is not easy, they are usually saying something more uncomfortable: the change that caused the regression is not wrong. It is correct for most workloads, and PostgreSQL is not most workloads.
PostgreSQL’s Architecture Is Unusual From the Kernel’s Perspective
PostgreSQL uses a process-per-connection model. Every client connection spawns its own OS process. Those processes share a large shmem region for the buffer pool, coordinate through semaphores and futexes for lock management, and communicate via signals for events like checkpoint triggers and vacuum coordination. The postmaster process orchestrates all of it.
This design dates to an era when OS threads were unreliable and fork() was the sensible building block for concurrent server software. It delivers real benefits: a crashed backend process does not take down the whole instance, memory isolation between connections is free, and the system inherits decades of OS process tooling. The cost is that PostgreSQL is unusually demanding on the kernel’s inter-process coordination paths.
From the scheduler’s perspective, a busy PostgreSQL instance looks like a large population of processes that sleep frequently, wake up briefly to do work, exchange signals, contend on shared memory locks, then sleep again. That pattern is different from CPU-bound workloads, different from event-loop servers like nginx, and different from thread-based database servers like MySQL with InnoDB, which handle connections as threads inside a single process. Kernel changes that improve performance for typical workloads can easily penalize PostgreSQL’s specific pattern of short bursts, frequent wakeups, and high IPC density.
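One rough way to see this pattern on a live host is to read the context-switch counters the kernel keeps for each backend in /proc/<pid>/status. The sketch below assumes the backends run under the process name postgres (packaging can differ), and the helper name ctxt_switches is made up for illustration. A high count of voluntary switches relative to involuntary ones is consistent with the sleep, wake, sleep cycle described above.

```shell
#!/bin/sh
# ctxt_switches STATUS_FILE
# Print "voluntary nonvoluntary" context-switch counts from a
# /proc/<pid>/status style file. (Hypothetical helper name.)
ctxt_switches() {
    awk '/^voluntary_ctxt_switches:/    { v = $2 }
         /^nonvoluntary_ctxt_switches:/ { n = $2 }
         END { print v, n }' "$1"
}

# Assumption: backends are named "postgres". Prints nothing if none are running.
for pid in $(pgrep -x postgres 2>/dev/null); do
    printf '%s %s\n' "$pid" "$(ctxt_switches "/proc/$pid/status")"
done
```

Sampling the same PID twice, a few seconds apart, gives a switch rate that can be compared across kernel versions.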
Why a Kernel Change Can Produce a 50% Drop Without Touching Any Database Code
The mechanism is not mysterious. Consider what happens when a connection wakes up to process a query:
- The backend process is scheduled onto a CPU.
- It acquires shared memory locks, potentially waking other processes that were waiting.
- Those processes are scheduled, possibly migrating between CPU cores or NUMA nodes.
- Each contended lock acquisition and release involves a futex operation, a kernel boundary crossing; uncontended locks are typically resolved in user space.
- Buffer pool access may trigger page faults or TLB shootdowns across processes sharing the shmem mapping.
Each of those steps is sensitive to scheduler policy. If the kernel changes how it makes wake-up decisions, or how it handles priority between a sleeping process and the one that just signaled it, the latency of each lock operation can increase. Multiply that latency across thousands of lock operations per second across hundreds of connections, and a modest per-operation regression compounds into a severe throughput drop.
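The compounding effect can be made concrete with a back-of-envelope model. Every number below is an illustrative assumption, not a measurement from the report: a transaction that performs 50 kernel-mediated lock operations plus 100 microseconds of other work, with the per-operation cost rising from 2 to 6 microseconds. The function name txn_per_sec is hypothetical.

```shell
#!/bin/sh
# txn_per_sec LOCK_OPS_PER_TXN PER_OP_US OTHER_WORK_US
# Per-backend transactions per second under a simple serial cost model.
# All inputs are illustrative assumptions, not measured values.
txn_per_sec() {
    awk -v ops="$1" -v op_us="$2" -v other_us="$3" \
        'BEGIN { printf "%d\n", 1000000 / (ops * op_us + other_us) }'
}

before=$(txn_per_sec 50 2 100)   # 50 * 2 us + 100 us = 200 us per transaction
after=$(txn_per_sec 50 6 100)    # 50 * 6 us + 100 us = 400 us per transaction
echo "per-backend txn/s before: $before, after: $after"
```

Under these assumed numbers, an extra 4 microseconds per lock operation is enough to halve per-backend throughput, which is why a small change in wake-up latency can surface as a headline-sized regression.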
The EEVDF scheduler that replaced CFS in Linux 6.6 changed the fundamental fairness algorithm. EEVDF schedules based on virtual deadlines rather than virtual runtime, which improves latency distribution for interactive workloads and reduces worst-case tail latency on mixed systems. But its wake-up behavior differs from CFS in ways that affect IPC-heavy workloads. When process A signals process B, the scheduler must decide how quickly B gets CPU time. CFS and EEVDF make different trade-offs there, and PostgreSQL’s throughput depends heavily on that decision.
Memory management is the other likely contributor. The memory folio work that has been landing across Linux 5.x through 7.0 changes how the kernel manages compound pages internally. PostgreSQL’s shared buffer pool, which is a large contiguous shmem allocation, is exactly the kind of region that folio management touches during reclaim and migration events. Transparent Huge Page promotion decisions, changes to mmap_lock contention patterns during memory reclaim, and how the kernel accounts for pages shared across many processes can all show up as PostgreSQL latency.
This Has Happened Before
The Linux 5.14 release cycle produced a well-documented PostgreSQL scheduler regression. Members of the PostgreSQL performance team reported significant throughput drops on multi-core systems, traced to a change in how the CFS scheduler handled wake-up preemption. The regression was real enough that some distributions shipped with workaround kernel parameters.
Linux 6.2 brought another round of reports around memory management changes. The pattern in both cases was the same: a change that improved general-purpose performance metrics had an asymmetric effect on PostgreSQL’s specific usage pattern, and the fix took months to arrive.
The vm.nr_hugepages and THP settings have been tuning targets for PostgreSQL administrators for years precisely because of this sensitivity. Disabling THP (echo never > /sys/kernel/mm/transparent_hugepage/enabled) is a standing recommendation in PostgreSQL performance guides, and it matters more as the kernel’s THP management becomes more aggressive.
Why AWS Reporting This Matters
AWS runs PostgreSQL at a scale that produces reliable regression signals. Their engineers benchmark kernel upgrades systematically, on NUMA hardware configurations that amplify scheduler and memory management effects. A regression that is barely visible on a single-socket laptop with four connections can be severe on a 192-core NUMA instance running five hundred backends.
The report landing on lkml is significant not just as a data point but as a negotiation. AWS has enough engineering weight in the kernel community to get the report taken seriously. The HN thread has the predictable range of responses: operators who have seen this pattern before, kernel developers pointing out that the underlying change is correct on its own terms, and PostgreSQL developers noting that the multi-process architecture is a recurring surface for this kind of collision. None of them are wrong.
What the Fix Looks Like, Realistically
Kernel regressions of this type rarely get clean reverts. The change that caused the problem usually improves something else, and reverting it would trade one regression for another. The more likely outcomes are:
A scheduler hint or tunable. The kernel could expose a parameter that lets workloads request CFS-like wake-up behavior, or a per-cgroup scheduling policy that approximates the old behavior for latency-sensitive IPC workloads. This is the path of least resistance and the most historically common resolution.
A PostgreSQL-side adaptation. The database could reduce IPC frequency by batching lock operations, using more efficient shared memory primitives, or restructuring how backends coordinate. These changes are in progress in various forms but represent multi-release engineering work.
Hardware-level mitigation. On AWS specifically, changing the instance type, CPU topology, or NUMA configuration can sometimes sidestep scheduler effects. This is not a fix, but it is an operational lever.
What will not happen quickly is a clean patch that makes PostgreSQL fast on Linux 7.0 without touching either PostgreSQL or the kernel change in question. The mismatch is structural. PostgreSQL’s design assumes a kernel that schedules IPC-heavy processes with low latency. Linux’s kernel has competing priorities across a much broader workload distribution.
What to Do Before Upgrading
If you are running PostgreSQL in production and considering a Linux 7.0 upgrade, benchmark your specific workload first. The regression is not uniform. Read-heavy workloads with low connection counts are less exposed than write-heavy OLTP workloads with high concurrency. Workloads that spend most time in sequential scan or index scan are less sensitive to lock overhead than workloads with high contention on shared data structures.
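A minimal way to run that comparison is to drive the same pgbench workload on both kernels and compare the reported throughput. The sketch below is an assumption-laden starting point, not a benchmarking methodology: the database name bench, the client and thread counts, the duration, and the helper names run_bench, extract_tps, and pct_change are all placeholders. pgbench's built-in TPC-B-like script is also only a stand-in for your real workload.

```shell
#!/bin/sh
# run_bench LABEL
# Run a fixed pgbench workload and save its output to LABEL.log.
# Assumed values: database "bench", 64 clients, 16 threads, 300 seconds.
run_bench() {
    pgbench -c 64 -j 16 -T 300 bench > "$1.log"
}

# extract_tps FILE
# Print the value from the last "tps = ..." line of saved pgbench output.
extract_tps() {
    awk '/^tps = / { tps = $3 } END { print tps }' "$1"
}

# pct_change BASELINE CANDIDATE
# Percentage change from BASELINE to CANDIDATE, one decimal place.
pct_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

# Intended use (not executed here):
#   old kernel:  pgbench -i -s 100 bench && run_bench old
#   new kernel:  run_bench new
#   compare:     pct_change "$(extract_tps old.log)" "$(extract_tps new.log)"
```

Repeating each run several times and varying the client count helps separate a real regression from run-to-run noise, since the exposure depends on concurrency.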
The key tunables to evaluate on 7.0:
# Disable THP if not already done
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Check NUMA balancing behavior
cat /proc/sys/kernel/numa_balancing
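# Review scheduler latency tuning (needs root and debugfs mounted). EEVDF kernels expose
# base_slice_ns; the CFS-era sched_min_granularity_ns and sched_wakeup_granularity_ns knobs are gone.
cat /sys/kernel/debug/sched/base_slice_ns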
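The location of these knobs has shifted between kernel versions, so a read-only check that reports whichever ones exist is safer than assuming a path. The sketch below only reads values and changes nothing. The list of paths is an assumption to verify against your own kernel, and the helper name show_knob is hypothetical.

```shell
#!/bin/sh
# show_knob PATH
# Print "PATH = value" if the file is readable, otherwise say it is unavailable.
show_knob() {
    if [ -r "$1" ]; then
        printf '%s = %s\n' "$1" "$(cat "$1")"
    else
        printf '%s: not available on this kernel (or needs root)\n' "$1"
    fi
}

# Candidate paths; which ones exist depends on kernel version and config.
for knob in \
    /sys/kernel/mm/transparent_hugepage/enabled \
    /sys/kernel/mm/transparent_hugepage/defrag \
    /proc/sys/kernel/numa_balancing \
    /sys/kernel/debug/sched/base_slice_ns \
    /proc/sys/kernel/sched_min_granularity_ns
do
    show_knob "$knob"
done
```

Capturing this output before and after a kernel upgrade gives a quick record of which defaults changed underneath the database.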
The lkml thread will be worth tracking. If the kernel developers identify the specific commit responsible, the workaround will likely emerge from there before any official fix lands in a stable release.
This is not a new category of problem. It is the same problem that has surfaced with every major Linux scheduler change for the past decade. Linux 7.0 brought it back at an unusually visible magnitude, which is useful. The more clearly this kind of regression gets reported and documented, the more pressure builds toward a durable solution, whether that is a scheduler policy for IPC-heavy workloads, a PostgreSQL architectural shift away from process-per-connection, or something in between that neither project has landed yet.