
Half the Throughput: PostgreSQL, Linux 7.0, and Why Kernel Regressions Are Hard to Undo

Source: hackernews

An AWS engineer posted to the Linux kernel mailing list with a finding that should concern anyone running PostgreSQL in production: throughput on certain workloads dropped by roughly half after upgrading to Linux 7.0. Phoronix covered the report, and the lkml thread has been generating significant discussion. The headline number alone is alarming. What makes it worth examining more carefully is the qualifier attached to the coverage: a fix may not be easy.

That qualifier is not unusual in kernel development, but it deserves unpacking. When someone says a kernel regression is hard to fix, they usually mean one of a few things: the change that caused it was intentional, reverting it would hurt other workloads, or the regression exposes a fundamental mismatch between how the kernel now behaves and how the affected software expects it to behave. All of those scenarios are harder to resolve than a plain bug.

Why PostgreSQL Is Particularly Exposed

PostgreSQL interacts with the Linux kernel in ways that most applications do not. It uses a multi-process architecture rather than a multi-threaded one: the postmaster forks a separate backend process for each client connection, and those processes coordinate through shared memory rather than a shared heap. This means PostgreSQL’s hot paths involve inter-process communication, shared memory access under concurrency, and dense process wakeup and sleep cycles, all of which sit directly on top of kernel primitives.

The shared buffer pool, PostgreSQL’s primary caching layer, is a large anonymous shared memory region that all backend processes access concurrently. Buffer management uses lightweight locks (LWLocks), which first try to take the lock with an atomic operation and, when that fails under contention, put the backend to sleep on a semaphore; on Linux those semaphores are typically POSIX semaphores that glibc builds on futexes. The lock manager for row-level and table-level locks relies on process signaling and wait queues. WAL adds another layer: WAL writers, WAL senders, and backends synchronizing flushes create a pattern of short-duration sleeps and wakeups that runs continuously under any write workload.
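The try-first, sleep-on-contention shape of such a lock can be sketched in a few lines of Python. This is only an illustration of the pattern, not PostgreSQL's LWLock code: the function name spin_then_sleep_acquire and the max_spins parameter are invented here, and a threading.Lock stands in for the shared-memory state and semaphore that real backends use.

```python
import threading


def spin_then_sleep_acquire(lock, max_spins=100):
    """Acquire `lock`, trying cheap non-blocking attempts before blocking.

    Returns "spun" if a non-blocking attempt succeeded, or "slept" if the
    caller had to block. On Linux, the blocking path is where the kernel
    (futex wait, scheduler wakeup) enters the picture.
    """
    for _ in range(max_spins):
        if lock.acquire(blocking=False):
            return "spun"
    # Contended: give up the CPU and wait to be woken by the releaser.
    lock.acquire()
    return "slept"


if __name__ == "__main__":
    lock = threading.Lock()
    print(spin_then_sleep_acquire(lock))  # lock is free, no sleep needed
    # Release from another thread shortly, forcing the slow path here.
    threading.Timer(0.05, lock.release).start()
    print(spin_then_sleep_acquire(lock, max_spins=10))
    lock.release()
```

The uncontended path never leaves user space. Only the contended path depends on how quickly the kernel wakes the waiter, which is why a wakeup-latency change shows up mainly under heavy lock contention.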

The consequence is that PostgreSQL throughput is sensitive to anything that changes wakeup latency, scheduler behavior, shared memory access patterns, or futex performance. A kernel change that adds a few microseconds of overhead to process wakeup, or increases contention on a memory management lock, can cascade into visible throughput losses at scale, particularly on high-connection-count OLTP workloads of the kind AWS commonly runs.
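A back-of-the-envelope model shows how small per-wakeup costs add up. The sketch below assumes a deliberately simplified backend that processes transactions serially, with each transaction paying a fixed amount of useful work plus a fixed number of sleep and wakeup cycles. Every number in it is hypothetical and chosen only to make the arithmetic visible; none comes from the lkml report.

```python
def per_backend_tps(service_us, wakeups_per_txn, wakeup_us):
    """Transactions per second for one backend in a toy serial model.

    service_us      -- useful work per transaction, in microseconds
    wakeups_per_txn -- sleep/wakeup cycles a transaction goes through
    wakeup_us       -- latency added by each wakeup, in microseconds
    """
    txn_us = service_us + wakeups_per_txn * wakeup_us
    return 1_000_000 / txn_us


# Hypothetical numbers: 150 us of work and 50 wakeups per transaction.
baseline = per_backend_tps(service_us=150, wakeups_per_txn=50, wakeup_us=1)
regressed = per_backend_tps(service_us=150, wakeups_per_txn=50, wakeup_us=5)

if __name__ == "__main__":
    print(baseline)              # 5000.0
    print(regressed)             # 2500.0
    print(regressed / baseline)  # 0.5
```

In this made-up case, four extra microseconds per wakeup are enough to halve per-backend throughput, because the wakeup cost is paid fifty times per transaction. Real workloads overlap work across backends and are not this simple, but the multiplication effect is the same.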

A Pattern With Precedent

This is not the first time a kernel update has taken a significant bite out of PostgreSQL performance. The history is instructive.

Contention on mmap_sem, the read-write semaphore protecting each process address space (renamed mmap_lock in Linux 5.8), was a recurring problem on high-core-count systems through the 5.x kernel series. PostgreSQL backends accessing the buffer pool in parallel would occasionally serialize on the lock, producing latency spikes that showed clearly in pgbench numbers. Reducing that contention required substantial work on the mmap_lock path and was spread across multiple kernel releases.

Transparent Huge Pages have caused intermittent compaction stalls in PostgreSQL workloads for years. When the kernel decides to compact memory to form a huge page, it can stall a backend process for milliseconds, which under a high-connection workload produces throughput variance that is difficult to reproduce and diagnose. The standard workaround for PostgreSQL on Linux has long been to set /sys/kernel/mm/transparent_hugepage/enabled to madvise rather than always, giving applications explicit control over huge page usage rather than letting the kernel make that decision asynchronously.
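Checking which THP mode is active is a one-file read: the kernel lists all modes in that file and marks the active one with square brackets, for example "always [madvise] never". The following is a small sketch of such a check; the helper names parse_thp_mode and current_thp_mode are made up for this example.

```python
import re
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the bracketed (active) mode, e.g. 'madvise', or None."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def current_thp_mode(path=THP_ENABLED):
    """Read the active THP mode, or None if the file is unavailable
    (non-Linux system, or a kernel built without THP support)."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    mode = current_thp_mode()
    if mode == "always":
        print("THP is 'always'; consider 'madvise' for PostgreSQL hosts")
    else:
        print(f"THP mode: {mode}")
```

Changing the mode means writing the desired value to the same file as root. That change does not survive a reboot unless it is also applied through the boot configuration.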

NUMA automatic balancing has caused its own regressions. The kernel’s NUMA balancer periodically migrates memory pages toward the CPU that accesses them most frequently, which sounds beneficial in theory but generates significant overhead when PostgreSQL’s shared buffer pool is accessed from processes running on different NUMA nodes. The balancer’s page migration stalls can dominate in OLTP workloads where the buffer pool is heavily shared across sockets.

The introduction of EEVDF (Earliest Eligible Virtual Deadline First) as a replacement for CFS in Linux 6.6 changed wakeup preemption behavior in ways that affected latency-sensitive workloads. While EEVDF improved fairness and interactive latency for many workloads, the transition surfaced regressions in specific profiles. PostgreSQL’s sleep and wake pattern, with many long-lived backend processes that sleep briefly on locks and then immediately do work, is one of the more demanding tests for a scheduler’s wakeup decisions.

Each of these cases followed a similar arc: regression reported on lkml or by a major cloud provider, investigation period spanning weeks to months, partial fix via kernel parameter or targeted code change, and eventual resolution across one or two subsequent kernel releases.

The Structural Difficulty of the Fix

When a kernel engineer says a fix is not straightforward, the statement usually reflects one of three structural problems.

First, the regression may be a consequence of a deliberate design decision. Kernel subsystems are maintained by teams with specific goals, and a change that improves memory management or scheduler behavior for a broad class of workloads can have unintended consequences for database-style access patterns. Reverting the change means accepting worse behavior for the workloads the change was intended to help. These tradeoffs require careful analysis and often result in targeted workarounds rather than clean reversions.

Second, reproducing the regression in a controlled development environment is genuinely difficult. PostgreSQL’s performance characteristics at AWS scale reflect a combination of hardware (large core counts, NUMA topology, NVMe storage), workload (specific query mixes, connection counts, shared buffer sizing), and kernel configuration that is hard to replicate outside of production-scale infrastructure. Kernel developers working on a fix need a reliable reproduction case. The closer that case is to production conditions, the more confidence they have in any proposed fix, and production-scale reproduction cases are not always easy to hand off.

Third, the fix may require changes in multiple places. If the regression comes from an interaction between a kernel subsystem change and PostgreSQL’s locking or memory access patterns, the right solution might involve changes in both the kernel and PostgreSQL itself. The PostgreSQL community and the Linux kernel community do communicate, but coordinating across both projects under a regression timeline is inherently slow. Kernel releases run on a roughly nine-to-ten-week cycle; PostgreSQL has its own release calendar and a conservative patch policy.

The Workload Specificity Problem

Performance regressions of this magnitude typically do not affect all PostgreSQL workloads equally. A throughput drop of roughly 50% almost certainly reflects a specific workload profile, most likely something resembling pgbench’s default TPC-B-like mode, which is highly sensitive to lock contention and process wakeup latency. Read-heavy workloads that fit in the buffer pool and avoid lock contention may see a smaller effect. Write-heavy workloads that stress WAL and checkpoint behavior will have a different regression profile.

This specificity matters because it shapes how the kernel community interprets and prioritizes the report. A regression that manifests primarily under high-concurrency OLTP conditions on large-core-count machines is real and serious, but it is also a narrower target than a general slowdown. The eventual fix may well be a kernel parameter or a configuration path rather than a change to the default behavior, which would require PostgreSQL operators to actively apply it rather than receiving it passively through a kernel update.

What Operators Should Do Now

For anyone running PostgreSQL in production and considering a kernel upgrade, this report is a reason to benchmark before rolling out Linux 7.0 at scale. The standard approach is pgbench with a workload profile that approximates production: connection count, read/write ratio, and shared buffer sizing all affect whether a given workload is sensitive to the regression. Running it against your current kernel and against Linux 7.0 on equivalent hardware will tell you whether your specific setup is affected.
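Comparing the two runs comes down to pulling the "tps = ..." line out of each pgbench report and computing the relative change. The sketch below does that under a few assumptions: the helper names are invented for this example, the sample reports contain fabricated numbers rather than measured results, and the pgbench invocation in the comment uses placeholder values for client count, thread count, and duration.

```python
import re

# pgbench prints one or two lines beginning with "tps = <number>",
# depending on its version; the last one excludes connection setup time.
TPS_LINE = re.compile(r"^tps = ([0-9]+(?:\.[0-9]+)?)", re.MULTILINE)


def extract_tps(pgbench_output):
    """Return the last reported tps figure from a pgbench report."""
    matches = TPS_LINE.findall(pgbench_output)
    if not matches:
        raise ValueError("no 'tps = ...' line found in pgbench output")
    return float(matches[-1])


def relative_change(baseline_tps, candidate_tps):
    """Fractional change from baseline; -0.5 means half the throughput."""
    return (candidate_tps - baseline_tps) / baseline_tps


if __name__ == "__main__":
    # Reports would come from something like (placeholder values):
    #   pgbench -c 64 -j 16 -T 300 mydb
    # run once per kernel. The figures below are fabricated samples.
    old_kernel = "tps = 4000.000000 (without initial connection time)\n"
    new_kernel = "tps = 2000.000000 (without initial connection time)\n"
    change = relative_change(extract_tps(old_kernel), extract_tps(new_kernel))
    print(f"throughput change: {change:+.1%}")
```

A single pair of runs is rarely conclusive. Repeating each configuration several times and comparing the spread guards against mistaking run-to-run noise for a regression.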

If the regression shows up, the available options are to hold on the current kernel until a fix lands, experiment with kernel tunables (scheduler parameters, NUMA balancing settings, THP configuration), or follow the upstream discussion where the AWS engineer’s lkml report has opened an active investigation thread. The Hacker News discussion has also surfaced additional operator observations worth reading.
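When experimenting with tunables, it helps to record their values on each kernel so that a benchmark difference can be tied to a known configuration. A minimal sketch follows, assuming the THP and NUMA balancing files below are the ones of interest. The list is illustrative rather than complete, scheduler knobs are left out because their location varies between kernel versions, and the function names are invented here.

```python
from pathlib import Path

# Illustrative list only; extend it with whatever knobs you are testing.
TUNABLES = [
    "/sys/kernel/mm/transparent_hugepage/enabled",
    "/sys/kernel/mm/transparent_hugepage/defrag",
    "/proc/sys/kernel/numa_balancing",
]


def snapshot(paths=TUNABLES):
    """Map each path to its current contents, or None if unreadable."""
    values = {}
    for path in paths:
        try:
            values[path] = Path(path).read_text().strip()
        except OSError:
            values[path] = None
    return values


def diff_snapshots(before, after):
    """Return {path: (old, new)} for every tunable whose value differs."""
    keys = sorted(set(before) | set(after))
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }


if __name__ == "__main__":
    for path, value in snapshot().items():
        print(f"{path}: {value}")
```

Saving one snapshot per kernel alongside the benchmark results makes it easier to tell a kernel regression apart from a default that changed between versions.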

The broader pattern here, one that surfaces every few kernel cycles, is that PostgreSQL and Linux have a relationship that requires active maintenance from both communities. The kernel evolves aggressively, and databases are among the most demanding consumers of the primitives it exposes. Regressions like this are the expected result of that dynamic, not an unusual failure. What matters is how quickly they get diagnosed, documented, and resolved, and whether production operators have enough warning to avoid a surprise in the meantime.
