How a Linux 7.0 Kernel Change Left PostgreSQL Running at Half Speed

Source: hackernews

The report landed on LKML with the kind of specificity that gets attention: an AWS engineer running PostgreSQL workloads on Linux 7.0 observed transaction throughput cut roughly in half compared to Linux 6.x. Phoronix covered the thread, and the community response included something that rarely accompanies a performance regression report: an acknowledgment that the fix may not be simple.

That acknowledgment changes the character of the story. A bug that halves performance is serious, but bugs get fixed. A regression whose fix “may not be easy” suggests the underlying change was deliberate, that the new behavior is arguably correct from the kernel’s perspective, and that PostgreSQL happens to be on the wrong side of a trade-off. Resolving that kind of problem takes longer and involves harder conversations.

Why PostgreSQL Is So Exposed to Kernel Changes

Most applications sit far enough above the kernel that scheduler tuning or VM behavior changes show up as noise in their benchmarks. PostgreSQL does not have that buffer. It builds a significant portion of its performance model directly on top of OS primitives, and several of them are the same primitives that receive the most aggressive development attention each kernel release cycle.

The shared buffer pool is the clearest example. PostgreSQL allocates a large region of shared memory, typically configured at 25% of total RAM via shared_buffers, and maps it into the address space of every backend process. All concurrent connections read and write through this mapping. Changes to how the kernel manages virtual memory areas, particularly anything affecting how mappings are split, merged, or locked, propagate immediately into every buffer access PostgreSQL makes.
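To make the shared-mapping idea concrete, here is a minimal sketch, not PostgreSQL's actual code: several forked processes write into one anonymous shared mapping and the parent reads the results back. The function name and the one-page size are illustrative assumptions, and it relies on Unix fork semantics.

```python
import mmap
import os
import struct


def shared_counter_demo(workers: int = 4) -> int:
    """Fork `workers` children that each write one slot of a shared mapping."""
    # fileno=-1 creates an anonymous mapping. On Unix, mmap defaults to
    # MAP_SHARED, so forked children see the same pages as the parent.
    buf = mmap.mmap(-1, mmap.PAGESIZE)
    pids = []
    for slot in range(workers):
        pid = os.fork()
        if pid == 0:
            # Child: write slot number + 1 into its own 4-byte slot, then
            # exit immediately without running any parent cleanup code.
            struct.pack_into("<I", buf, slot * 4, slot + 1)
            os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    # Parent: read back what every child wrote through the shared mapping.
    values = struct.unpack_from(f"<{workers}I", buf, 0)
    buf.close()
    return sum(values)
```

Every process here reaches the same physical pages through its own page tables, which is why kernel changes to mapping management are felt once per process rather than once per machine.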

The process model compounds this exposure. PostgreSQL spawns a separate OS process for each connection rather than using threads. This is a deliberate architecture decision with real advantages for isolation and stability, but it means that anything touching per-process virtual address space management scales with connection count rather than CPU count. The mmap_lock, which protects modifications to a process’s address space, becomes a contention point when many PostgreSQL backends are running and the kernel needs to modify their mappings concurrently.

WAL behavior adds another dimension. PostgreSQL’s durability guarantees depend on specific fsync semantics. When kernel changes affect dirty page tracking, writeback timing, or how fsync interacts with the block layer, checkpoint latency and WAL commit throughput change in ways that can cause sustained regressions or unpredictable latency spikes. These interactions are notoriously hard to reproduce in synthetic benchmarks because they depend on memory pressure, I/O queue depth, and checkpoint timing all converging.
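The cost of a durable commit can be sketched as an append followed by fsync. This is an illustrative stand-in under stated assumptions, not PostgreSQL's WAL writer: the helper names and the log path are hypothetical, and records are assumed small enough that a single write call is not split.

```python
import os
import time


def append_and_sync(fd: int, record: bytes) -> float:
    """Append one record and wait for fsync; return elapsed seconds."""
    start = time.perf_counter()
    os.write(fd, record)
    # fsync blocks until the kernel reports the file's data as flushed,
    # so dirty-page tracking and block-layer behavior show up in this time.
    os.fsync(fd)
    return time.perf_counter() - start


def timed_commits(path, records):
    """Write each record as its own 'commit' and collect per-commit latency."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        return [append_and_sync(fd, record) for record in records]
    finally:
        os.close(fd)
```

Running the same loop on two kernels and comparing the latency distributions, not just the mean, is one rough way to see whether the fsync path moved. A quiet test machine will not show the memory-pressure interactions described above.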

On top of all this, the scheduler matters considerably on the high-core-count, multi-socket instances where production PostgreSQL typically runs. PostgreSQL’s backend processes wake up for short bursts of work: lock acquisitions, buffer pin operations, WAL writes, vacuum passes. Scheduling decisions about preemption timing, CPU affinity, and wakeup latency determine whether working sets stay cache-warm or get scattered across NUMA nodes between operations.

What a 50% Regression Usually Means

A regression of this magnitude typically points to serialization somewhere that was previously parallel. A lock that was rarely contended becomes a bottleneck under the new kernel’s behavior. An operation that ran concurrently across many backends now queues. A change in CPU placement causes cache misses to multiply across all workers, compounding until throughput craters.
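A rough way to see how little serialization it takes is Amdahl's law, used here as a simplifying model rather than anything taken from the report. The sketch below asks what serial fraction would leave a given share of ideal N-way throughput. The 64-worker figure is an arbitrary assumption for illustration.

```python
def speedup(serial_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup over a single worker."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def serial_fraction_for_loss(workers: int, retained: float) -> float:
    """Serial fraction s that leaves `retained` (0..1) of ideal N-way speedup.

    Solves 1 / (s + (1 - s) / N) = retained * N for s, which gives
    s = (1 / retained - 1) / (N - 1).
    """
    return (1.0 / retained - 1.0) / (workers - 1)


if __name__ == "__main__":
    s = serial_fraction_for_loss(workers=64, retained=0.5)
    print(f"serial fraction: {s:.4f}")            # 1/63, about 0.0159
    print(f"speedup at 64 workers: {speedup(s, 64):.1f}")  # 32.0
```

Under this model, serializing under 2% of the work across 64 workers is enough to halve throughput. Real lock contention tends to be worse than the model because waiters also pay for queueing and cache-line traffic.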

PostgreSQL has encountered several Linux changes that produced regressions in this range, and each one is instructive about the failure mode.

The mmap_lock contention problem became acute across the Linux 5.x series. High page fault rates combined with many concurrent PostgreSQL processes created lock contention that effectively serialized work that should have been parallel. The lock, protecting virtual address space modifications, was held too broadly and too often. The kernel community spent several release cycles addressing it through lock splitting, reduced hold times, and new locking granularity. The fix was not a single patch but an extended effort across multiple subsystems.

Transparent Huge Pages compaction produced a different pattern: not a steady throughput regression but unpredictable latency spikes. The kernel compacts memory to assemble 2 MB pages, and that work can stall processes that are waiting on an allocation. The usual mitigation was operational: disable THP system-wide, or switch it to madvise mode so that only regions explicitly marked with madvise(MADV_HUGEPAGE) receive huge pages. This became standard advice for production PostgreSQL deployments. The kernel fix came later, through better compaction heuristics, but the operational workaround preceded it by years.
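On Linux, the active THP mode is exposed in /sys/kernel/mm/transparent_hugepage/enabled, with the current choice shown in brackets. The sketch below reads and parses that file. The function names are illustrative, and it returns None where the file does not exist.

```python
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> str:
    """Return the bracketed (active) option, e.g. 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode found in {text!r}")


def current_thp_mode(path: Path = THP_ENABLED):
    """Read the active THP mode, or None if the sysfs file is absent."""
    try:
        return parse_thp_mode(path.read_text())
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    print(current_thp_mode())
```

Checking this value on each database host before and after a kernel upgrade is a cheap way to rule out a changed default as the cause of new latency spikes.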

The EEVDF scheduler, which replaced the Completely Fair Scheduler in Linux 6.6, required tuning across subsequent releases as workloads that depended on CFS’s specific latency and fairness characteristics encountered different behavior from EEVDF. Database workloads, combining many latency-sensitive short operations with longer background tasks, are structurally exposed to scheduler fairness assumptions in a way that most web application workloads are not.

Why This Fix Is Structurally Hard

If the responsible Linux 7.0 change were simply a mistake, the fix would be straightforward: revert it. The “may not be easy” framing points the other way. It implies that the kernel made a deliberate trade-off, that some workloads benefit from the new behavior, and that PostgreSQL is regressing because it depended on behavior the kernel has now changed.

This creates a harder negotiation. Reverting the change regresses whoever the change was meant to help. Adding a new sysctl or tunable to restore old behavior for specific workloads is possible, but the kernel community is cautious about accumulating application-specific knobs. The preferred resolution is usually a more general fix that works for everyone, or an adaptation in the application itself, neither of which is quick.

PostgreSQL adapting its architecture is the slowest path. The project’s process-per-connection model, its shared buffer design, and its WAL implementation have been stable for decades. They are correct and extensively battle-tested. Changes to any of them require years of testing, careful migration planning, and community consensus. The PostgreSQL team does not make architectural changes in response to a single kernel release.

In practice, these situations usually resolve through a middle path: the kernel community identifies a more targeted fix that addresses the regression without fully reverting the original change, or they expose a tunable that lets performance-sensitive workloads opt into old behavior. This is how the THP compaction story concluded, and how several of the mmap_lock issues were handled. It takes time, and interim releases often ship with the regression in place.

Why AWS Filing the Report Matters

An AWS engineer filing an LKML report is not procedurally different from anyone else filing one, but the context changes how it lands. AWS runs PostgreSQL at a scale where kernel regressions surface in operational metrics before anyone goes looking for them. They employ engineers who work across both the application layer and kernel internals. When they report a specific regression with production numbers attached, the kernel community has good reason to trust both the reproducibility and the severity.

AWS also has skin in the game in a way that matters for resolution. They contribute to upstream kernel development, they maintain their own kernel builds for Amazon Linux, and they have engineering resources to participate in the debugging and patch review cycle. A regression that AWS reports is more likely to receive sustained attention than one filed without that organizational backing.

For anyone following the discussion on Hacker News, the practical implication is immediate: if you run PostgreSQL in production and are on, or moving to, Linux 7.0, benchmark against your prior kernel version before assuming your workload is unaffected. A 50% regression in the reporter’s workload does not mean 50% in every workload. Connection count, query patterns, memory pressure, and checkpoint timing all affect where exactly PostgreSQL intersects the changed kernel behavior. Some deployments may see smaller regressions; some may see larger ones.
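One way to run that comparison is to execute the same pgbench job under each kernel and compare the reported throughput. The sketch below only does the arithmetic on captured output. It assumes pgbench's usual "tps = ..." summary line, and the sample numbers are made up for illustration.

```python
import re

# pgbench prints a summary line starting with "tps = <number>". Older
# versions print two such lines; this sketch takes the first one.
TPS_RE = re.compile(r"^tps = ([0-9.]+)", re.MULTILINE)


def parse_tps(pgbench_output: str) -> float:
    """Extract the first reported tps value from pgbench output text."""
    match = TPS_RE.search(pgbench_output)
    if match is None:
        raise ValueError("no 'tps = ...' line found")
    return float(match.group(1))


def regression_pct(baseline_tps: float, candidate_tps: float) -> float:
    """Percent throughput lost relative to the baseline (negative = gain)."""
    return (baseline_tps - candidate_tps) / baseline_tps * 100.0


if __name__ == "__main__":
    # Hypothetical captured output from two runs of the same workload.
    old_kernel = "tps = 1000.000000 (without initial connection time)\n"
    new_kernel = "tps = 500.000000 (without initial connection time)\n"
    print(regression_pct(parse_tps(old_kernel), parse_tps(new_kernel)))
```

A single run per kernel is rarely enough. Repeating each configuration several times, at a connection count close to production, gives a comparison that is harder to explain away as noise.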

The Structural Relationship Between Databases and the Kernel

Database engines and the Linux kernel have always had a complicated relationship, and not because either side is doing something wrong. PostgreSQL pushes on exactly the parts of the kernel that receive the most active development: memory management, I/O scheduling, process scheduling, and synchronization primitives. Every release cycle, kernel developers improve these subsystems in ways that benefit the median workload. Databases are often not the median workload.

The median workload does not hold shared memory mappings across dozens of concurrent processes. It does not fsync on every transaction commit. It does not run long-lived background processes that mix I/O-bound and CPU-bound work in patterns that frustrate scheduler heuristics. PostgreSQL does all of these things, and it does them in configurations that can strain any abstraction layer the kernel provides.

The long-term response from both sides has been more communication: kernel developers running database benchmarks in their CI, database projects publishing configuration guidance for kernel tunables, and cloud providers sitting at the intersection of both communities. This report fits that pattern. The regression was found, it was reported with enough detail to be useful, and the conversation about resolution has started.

The resolution will arrive in one form or another. The question, as always, is whether it requires a kernel patch, a PostgreSQL configuration change, or architectural work in one or both codebases, and which of those paths proves viable first.
