When a Kernel Upgrade Cuts Your Database in Half: The PostgreSQL-Linux Scheduler Problem

Source: hackernews

An AWS engineer posted to the Linux kernel mailing list describing a roughly 50% throughput drop in PostgreSQL workloads after upgrading to Linux 7.0. Phoronix has coverage of the report, and the HN thread has the usual mix of “have you tried tuning X” and legitimate concern from people running PostgreSQL in production. What makes this one worth paying attention to is the second half: kernel developers responded that a fix may not be straightforward.

This is not an isolated incident. It is part of a recurring dynamic between PostgreSQL’s architecture and the Linux scheduler, and understanding that dynamic is more useful than waiting for a patch.

Why PostgreSQL Is Unusually Sensitive to Scheduler Changes

Most application servers handle concurrency with threads. PostgreSQL does not. Each client connection gets a dedicated OS process, forked from the postmaster. There are also background workers: the autovacuum launcher and workers, the WAL writer, the checkpointer, the background writer, and any parallel query workers. On a busy server, you can easily have hundreds of OS processes competing for CPU and waiting on synchronization primitives.
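
To get a feel for how many scheduling entities a server really has, you can count the PostgreSQL processes directly. The sketch below is one way to do it, assuming a Linux /proc filesystem and a server binary named postgres. The helper name count_procs_named is made up for this example.

```shell
# Count processes whose command name (the "comm" field) matches exactly.
# Assumption: Linux-style /proc layout. The optional second argument exists
# only so the helper can be pointed at a different root, e.g. for testing.
count_procs_named() {
  name=$1
  procroot=${2:-/proc}
  # Processes may exit while we read, so read errors are ignored.
  # grep -c prints 0 (and exits non-zero) when nothing matches; "|| true"
  # keeps the function's exit status clean in that case.
  cat "$procroot"/[0-9]*/comm 2>/dev/null | grep -c -x -- "$name" || true
}
```

Running count_procs_named postgres on the database host prints a single number that covers the postmaster, the client backends, and the background workers.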

This matters because the Linux scheduler does not know that 200 of your processes are all PostgreSQL backends. From the kernel’s perspective, they are independent, unrelated tasks. When the scheduler makes decisions about who runs next, those decisions ripple through PostgreSQL’s internal lock system in ways the kernel cannot see.

PostgreSQL’s lock stack has several layers. At the bottom are spinlocks, used for very short critical sections. Above that are LWLocks (lightweight locks), which protect shared-memory structures such as the shared buffer pool, the WAL insertion points, and the lock manager itself. At the top are regular heavyweight locks, which protect things like table-level and row-level access. A backend that has to wait for an LWLock or a heavyweight lock blocks in the kernel (LWLock waiters sleep on a semaphore, which on Linux is built on futexes), and that is where the scheduler gets directly involved.

When many backends contend on a single LWLock, they queue up. The lock holder releases it and wakes waiters from PostgreSQL’s own wait queue, but the kernel decides when each woken process actually gets a CPU. If the scheduler’s wakeup and preemption behavior is suboptimal for this access pattern, you get lock convoys: a stream of processes each doing a small amount of work and then re-contending on the same lock, with context switches dominating actual work. A 50% throughput drop is entirely consistent with a severe convoy scenario.
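
One rough way to spot convoy-like behavior is to compare how often a backend gives up the CPU voluntarily (it blocked, for example on a lock) with how often it was preempted, and to look at what active backends are waiting on. The helpers below are a minimal sketch, assuming Linux's per-process status counters and a psql that can connect using its default environment settings. The names ctxt_switches and pg_wait_summary are hypothetical, and a high voluntary count alone does not prove lock contention, since I/O waits are counted the same way.

```shell
# Print "voluntary nonvoluntary" context-switch totals for one PID.
# Assumption: Linux /proc/PID/status format. The optional second argument
# overrides the /proc root.
ctxt_switches() {
  awk '/^voluntary_ctxt_switches:/    { v = $2 }
       /^nonvoluntary_ctxt_switches:/ { n = $2 }
       END { print v + 0, n + 0 }' "${2:-/proc}/$1/status"
}

# Summarize what active backends are currently waiting on.
# Assumption: PostgreSQL 9.6 or later, where pg_stat_activity exposes
# wait_event_type and wait_event.
pg_wait_summary() {
  psql -X -A -t -c "SELECT wait_event_type, wait_event, count(*)
                    FROM pg_stat_activity
                    WHERE state = 'active' AND wait_event IS NOT NULL
                    GROUP BY 1, 2 ORDER BY 3 DESC"
}
```

Sampling ctxt_switches for the same backend PID a second apart gives a per-second rate, and pg_wait_summary shows whether active backends are piling up on the same LWLock wait event.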

The EEVDF Background

Linux 6.6 replaced the Completely Fair Scheduler (CFS) with EEVDF (Earliest Eligible Virtual Deadline First) as the default process scheduler. EEVDF is designed to improve latency fairness by assigning each task a virtual deadline and running the eligible task whose deadline is earliest, where a task is eligible if it has not yet received more than its fair share of CPU time. In theory this is better for mixed workloads and interactive responsiveness.
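
The selection rule can be illustrated with a toy model. This is not the kernel's implementation: it assumes all tasks have equal weight, treats a task as eligible when its virtual runtime is at or below the average, and computes the virtual deadline as virtual runtime plus requested slice. The function name eevdf_pick and the input format are invented for this sketch.

```shell
# Toy EEVDF pick. Input on stdin: one task per line as "name vruntime slice".
# Prints the name of the eligible task with the earliest virtual deadline.
eevdf_pick() {
  awk '
    { name[NR] = $1; vr[NR] = $2; slice[NR] = $3; sum += $2 }
    END {
      if (NR == 0) exit 1
      avg = sum / NR
      best = ""
      for (i = 1; i <= NR; i++) {
        if (vr[i] > avg) continue      # ahead of its fair share: not eligible
        d = vr[i] + slice[i]           # virtual deadline (equal weights)
        if (best == "" || d < bestd) { best = name[i]; bestd = d }
      }
      print best
    }'
}
```

Fed the three tasks "A 100 3", "B 90 30", and "C 120 3", the model rules out C as ineligible and picks A, even though B has the lowest virtual runtime, because A's short slice gives it the earlier deadline. That difference from a pure lowest-vruntime pick is the kind of ordering change that can matter to lock waiters.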

In practice, EEVDF changed wakeup ordering and preemption behavior in subtle ways. Some PostgreSQL users and database vendors reported minor regressions after 6.6, but nothing as dramatic as what is now being reported for 7.0. Either something changed further in the 7.0 scheduler, or a combination of scheduler and memory management changes pushed past a threshold.

The key diagnostic question is whether the regression is in the scheduling path itself, in futex wakeup behavior, in NUMA memory placement, or in something else entirely, like huge page handling or I/O path changes. Without the bisect result from the LKML thread, it is not possible to say definitively. But the pattern fits scheduler-driven lock contention.

Historical Precedents

This kind of regression has happened at several major kernel transitions.

The introduction of CFS in Linux 2.6.23 (2007) caused measurable regressions in some database workloads coming from the O(1) scheduler. The CFS group scheduling features added in 2.6.24 made things worse in some cases before they got better.

In the Linux 4.x era, changes to transparent huge page behavior affected PostgreSQL’s shared_buffers, which is a large shared memory segment. When the kernel started aggressively promoting pages to huge pages and then demoting them under pressure, the latency spikes were visible in PostgreSQL query times.

Linux 5.14 introduced changes to NUMA balancing that caused regressions on multi-socket systems running PostgreSQL, particularly on workloads with mixed read/write patterns where the memory access patterns changed enough to confuse the NUMA placement heuristics.

In each case, the kernel change was well-intentioned and improved average performance across a broad set of workloads. The database regressions were edge cases from the kernel’s perspective, even if they were severe from a PostgreSQL operator’s perspective.

Why the Fix Is Hard

Kernel regression fixes for database workloads run into a structural problem: the scheduler has to be fair to all workloads simultaneously. A change that improves PostgreSQL’s lock-convoy behavior might degrade latency for interactive workloads or hurt throughput for other server applications. Scheduler changes are among the most carefully reviewed patches in the kernel, and Linus Torvalds has historically been very reluctant to accept changes that fix one workload at the cost of another.

The investigation process itself takes time. A proper bisect across a major kernel version requires running a full PostgreSQL benchmark suite at each step, which is not fast. Identifying the exact commit is step one. Understanding why that commit causes the regression requires reading the change carefully and reasoning about how it interacts with PostgreSQL’s synchronization primitives. Proposing a fix requires even more care to avoid regressions elsewhere.
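
The per-step check in such a bisect can be sketched as a small script that turns a benchmark result into a good or bad verdict. The helpers below assume pgbench as the benchmark and its usual "tps = ..." output line. The function names, the client counts, the database name bench, and the threshold of 1000 are all placeholders, and the known-good tag is deliberately left unspecified.

```shell
# Extract the first "tps = N" figure from pgbench output on stdin.
parse_tps() {
  awk '/^tps = / { print $3; exit }'
}

# Succeed (exit 0) when measured TPS is at least the threshold, fail otherwise.
tps_is_good() {
  awk -v tps="$1" -v min="$2" 'BEGIN { exit !(tps + 0 >= min + 0) }'
}

# Sketch of a per-step check, run on the freshly booted test kernel.
# git bisect run treats exit 0 as good, 125 as "skip this commit",
# and other codes from 1 to 127 as bad.
#   tps=$(pgbench -c 64 -j 16 -T 60 bench | parse_tps)
#   [ -n "$tps" ] || exit 125
#   tps_is_good "$tps" 1000
#
# Manual driving, since each step needs a kernel build and a reboot:
#   git bisect start
#   git bisect bad v7.0
#   git bisect good <last-known-good-tag>
#   ... build, boot, benchmark, then: git bisect good   (or: git bisect bad)
```

A threshold roughly halfway between the good and bad baselines keeps run-to-run noise from flipping a verdict, which matters because a single wrong answer sends the bisect down the wrong branch.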

There is also the possibility that the fix needs to happen partly on the PostgreSQL side. PostgreSQL has historically adjusted its use of spinlocks, semaphores, and lock-free structures in response to kernel behavior. The reworking of LWLocks around atomic operations in 9.5, the switch to POSIX semaphores on Linux in version 10, and various tweaks to the LWLock implementation over the years reflect this. If the root cause is how PostgreSQL’s waiters sleep and wake in high-contention scenarios, a userspace change to the waiting strategy could help without requiring kernel changes.

What Operators Can Do Now

If you are running PostgreSQL on Linux and have not yet moved to 7.0, the straightforward advice is to wait. Run your own benchmarks on a staging instance before committing to the upgrade. This is always good practice for major kernel version changes, but the current report makes it especially important.

If you are already on Linux 7.0 and seeing regressions, there are a few kernel parameters worth examining:

# Check current huge page settings
cat /sys/kernel/mm/transparent_hugepage/enabled

# Check NUMA balancing state
cat /proc/sys/kernel/numa_balancing

# List the scheduler sysctls still exposed under /proc (since Linux 5.13 most
# scheduler tunables live in debugfs instead, under /sys/kernel/debug/sched/)
ls /proc/sys/kernel/sched_*

Disabling transparent huge pages (echo never > /sys/kernel/mm/transparent_hugepage/enabled) has historically helped in some PostgreSQL regression scenarios. Disabling NUMA auto-balancing is worth testing on multi-socket systems. Neither of these is guaranteed to address a scheduler-driven regression, but they are low-risk experiments.
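
Before changing either setting, it helps to record the current values so the experiment can be undone. The helper below is a small sketch that reads the active transparent huge page mode, which the kernel marks with square brackets. The name thp_active_mode is made up for this example, and the commented commands change runtime state only.

```shell
# Print the active THP mode, i.e. the bracketed word in a line such as
# "always madvise [never]". The optional argument overrides the sysfs path.
thp_active_mode() {
  sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}

# Runtime-only changes (run as root); both are lost on reboot:
#   echo never > /sys/kernel/mm/transparent_hugepage/enabled
#   sysctl -w kernel.numa_balancing=0
# To undo, write back the values recorded beforehand.
```

Change one setting at a time and rerun the same benchmark after each change, otherwise it is impossible to tell which of the two made a difference.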

On the PostgreSQL configuration side, reducing max_connections and using a connection pooler like PgBouncer can reduce the number of competing processes, which directly reduces scheduler pressure. This is a valid architectural response regardless of the kernel version.

The Broader Pattern

What makes this frustrating is that it is predictable in the abstract and hard to prevent in practice. PostgreSQL’s multi-process model was designed for correctness and isolation, and it works extremely well for that purpose. The tradeoff is that each process is an independent scheduling entity, which means PostgreSQL’s internal lock semantics become entangled with OS scheduling decisions in ways that are invisible to both the kernel and the database.

The kernel team optimizes for a workload distribution that PostgreSQL does not fit cleanly into. Major version bumps tend to be where accumulated scheduler changes produce threshold effects, where individually minor behavioral changes add up to something measurable.

The AWS report is on LKML and the HN thread reached 319 points, so it will get attention. The mailing list thread should produce a bisect, and someone will eventually produce a patch. The question is whether the patch lands in a 7.0.x stable update quickly or requires waiting for 7.1. In the meantime, staying on a known-good 6.x kernel is the conservative path for production PostgreSQL deployments.
