
When a Kernel Upgrade Cuts Your Database in Half

Source: hackernews

An AWS engineer posted to the Linux Kernel Mailing List with a stark finding: upgrading to Linux 7.0 cut PostgreSQL performance roughly in half on their workloads, and the path to a fix is unclear. For anyone running production databases on Linux, this is worth understanding in detail, not just as a news item but as a window into a recurring and underappreciated problem.

The Pattern Is Not New

Database performance regressions caused by kernel changes have a long history. PostgreSQL in particular sits in a complicated relationship with the Linux kernel because it depends on a wide surface of OS behavior: memory mapping semantics, page cache eviction policy, process scheduling, file system flush behavior, and futex-based locking. Any of these can shift under a new kernel release.

A well-known historical example is the long-standing advice to disable Transparent Huge Pages (THP) on Linux hosts running PostgreSQL. The advice exists because THP interacts poorly with PostgreSQL’s shared memory usage patterns, causing latency spikes and memory pressure that don’t show up in benchmarks designed for other workloads. That workaround emerged from years of production pain, not from a clean kernel fix.
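As a rough illustration of what checking that setting involves, here is a minimal Python sketch that reads the active THP mode. It assumes the standard sysfs path used by mainline kernels; the function names are made up for this example, and some distribution kernels have exposed the setting elsewhere.

```python
from pathlib import Path
from typing import Optional

# Assumed location: the standard sysfs path on mainline Linux kernels.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> Optional[str]:
    """Return the active mode from a line such as 'always madvise [never]'.

    The kernel marks the active choice with square brackets.
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def current_thp_mode(path: Path = THP_ENABLED) -> Optional[str]:
    """Read the active THP mode, or None if the file is missing or unreadable."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    mode = current_thp_mode()
    print(f"transparent_hugepage/enabled: {mode or 'unavailable'}")
```

A check like this could run as part of host provisioning, so a kernel or image upgrade that silently resets the mode gets noticed before the database starts.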

The Linux 6.6 introduction of the EEVDF scheduler to replace CFS raised similar concerns in the database community. EEVDF changes how processes compete for CPU time, and workloads that rely on tight latency budgets or careful process group behavior can behave differently even when throughput numbers stay comparable. Database benchmarks can look fine while tail latencies shift noticeably.
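To make the throughput-versus-tail distinction concrete, here is a small Python sketch with two made-up latency samples. Both have the same mean, so a single client issuing these requests back to back would see identical throughput, yet the 99th percentile differs sharply. The numbers are invented for illustration and are not measurements from any kernel.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-query latencies in milliseconds (100 queries each).
before = [10.0] * 100              # perfectly steady
after = [9.0] * 98 + [59.0] * 2    # slightly faster typical case, two stalls

for name, sample in (("before", before), ("after", after)):
    print(
        f"{name}: mean={statistics.mean(sample):.1f} ms "
        f"p99={percentile(sample, 99):.1f} ms"
    )
```

A benchmark that reports only transactions per second would score these two runs the same, which is why a percentile should be recorded alongside the average.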

A 50% throughput regression is a different order of magnitude from those concerns. That number implies something in a critical path changed fundamentally, not incrementally.

What Actually Causes These Regressions

PostgreSQL uses a multi-process architecture. There is a postmaster, backend processes per connection, background workers, autovacuum processes, and the WAL writer, among others. These processes communicate through a shared memory region obtained from the kernel via SysV shared memory or a shared mmap mapping, and they synchronize using a combination of spinlocks, lightweight locks, and semaphores, which on Linux are built on futexes.

This architecture means PostgreSQL is acutely sensitive to several kernel subsystems:

Memory management. PostgreSQL’s shared_buffers is a large mmap region shared across all backend processes. Changes to how the kernel handles page table entries, TLB shootdowns on multi-core machines, or page fault behavior on first access can all affect throughput. On AWS instances with high core counts and NUMA topology, changes to how the kernel handles memory locality can be especially punishing.

Lock contention. The kernel’s futex implementation underpins much of PostgreSQL’s locking whenever a backend has to sleep on a contended lock and be woken later. If a kernel change increases futex syscall overhead, introduces new serialization points, or changes how contended locks are queued and woken, the effect multiplies across every backend process. A small per-lock overhead becomes catastrophic at high concurrency.

I/O path. PostgreSQL calls fsync frequently for durability. The interaction between PostgreSQL’s write patterns and the kernel’s writeback and journaling behavior varies significantly across kernel versions. Changes to io_uring, the block layer, or how the VFS handles concurrent fsyncs can shift throughput dramatically.

Scheduler behavior. With dozens or hundreds of backend processes, how the scheduler interleaves them matters. If wake-up latency increases, or if processes that should be scheduled together get spread across NUMA nodes, the shared memory contention picture changes.
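To give a sense of scale for the memory management point, here is a back-of-the-envelope Python calculation of leaf page table size for a large shared mapping. Every number is an assumption chosen for illustration: a 64 GiB shared region, 500 backends that each touch all of it, 8-byte page table entries as on x86-64, and no sharing of page tables between processes. Upper-level tables are ignored.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
PTE_BYTES = 8  # size of one page table entry on x86-64


def leaf_page_table_bytes(region_bytes, page_bytes, backends):
    """Worst-case leaf page table memory if every backend maps the whole region."""
    entries_per_backend = region_bytes // page_bytes
    return entries_per_backend * PTE_BYTES * backends


region = 64 * GIB   # hypothetical shared memory size
backends = 500      # hypothetical connection count

small = leaf_page_table_bytes(region, 4 * 1024, backends)  # 4 KiB pages
huge = leaf_page_table_bytes(region, 2 * MIB, backends)    # 2 MiB huge pages

print(f"4 KiB pages: {small / GIB:.1f} GiB of page tables")
print(f"2 MiB pages: {huge / MIB:.0f} MiB of page tables")
```

Under these assumptions the 4 KiB case needs about 62.5 GiB of page tables against 125 MiB with 2 MiB pages. The exact figures matter less than the ratio: any kernel change that makes page table work slightly more expensive is multiplied by a very large number of entries.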

A 50% drop most plausibly points toward one of the first two categories, though that is inference, not diagnosis. I/O regressions at that magnitude typically show up as latency spikes rather than throughput halving. Scheduler issues tend to show up mainly under high concurrency rather than across the board. Memory management or lock path changes are the more likely culprits for a broad, consistent throughput regression.
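The claim that a small per-lock overhead turns catastrophic at high concurrency can be shown with a toy model. The Python sketch below assumes each transaction does some independent work plus a short critical section under one global lock. It is a deliberate simplification with made-up timings, not a model of PostgreSQL’s actual lock manager.

```python
def max_tps(backends, parallel_us, serial_us):
    """Upper bound on transactions per second in the toy model.

    Each transaction spends parallel_us microseconds on independent work and
    serial_us microseconds holding a single global lock.
    """
    # What the backends could deliver if the lock never made anyone wait.
    unconstrained = backends / ((parallel_us + serial_us) / 1e6)
    # The lock admits one holder at a time, which caps total throughput.
    lock_cap = 1e6 / serial_us
    return min(unconstrained, lock_cap)


# Hypothetical timings: the critical section grows from 2 us to 4 us.
for backends in (4, 512):
    old = max_tps(backends, parallel_us=200, serial_us=2)
    new = max_tps(backends, parallel_us=200, serial_us=4)
    print(f"{backends:>3} backends: {old:,.0f} -> {new:,.0f} tps "
          f"({(new - old) / old:+.1%})")
```

With 4 backends the extra 2 microseconds costs about 1%. With 512 backends the lock is the bottleneck, so doubling the critical section halves the ceiling from 500,000 to 250,000 transactions per second. That shape, invisible at low concurrency and severe at high concurrency, is what makes lock path changes a natural suspect.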

Why the Fix Is Hard

The phrase “fix may not be easy” in the Phoronix headline covering the report reflects something real about how kernel regressions work, especially for workloads that were not in the test matrix when a change landed.

Kernel developers run regression suites, but those suites cannot cover every production workload profile. A change that improves memory reclaim behavior under memory pressure might do so by shifting when and how page tables are modified in ways that hurt a steady-state database workload that isn’t under memory pressure at all. The change passes all the tests because the tests don’t model PostgreSQL at scale on a 96-core instance.

Finding the specific commit that caused the regression requires bisecting across a kernel version boundary, which means building and running a complex database benchmark against potentially dozens of kernel builds. AWS has the infrastructure to do this, but it is still a significant investment. Once the offending commit is identified, the question becomes whether it can be reverted, whether PostgreSQL can be modified to avoid triggering the slow path, or whether the kernel needs a new mechanism to let processes hint at their usage pattern.
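The search itself is a binary search, which is why the number of builds grows with the logarithm of the commit range. Here is a minimal Python sketch of the idea. The commit names and the `is_bad` callback are hypothetical stand-ins for "build this kernel, boot it, run the database benchmark, compare against a threshold", and the sketch assumes the regression appears at a single commit and persists afterwards.

```python
from typing import Callable, Sequence, Tuple


def first_bad(builds: Sequence[str],
              is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Find the first bad build by bisection.

    Assumes builds[0] is known good, builds[-1] is known bad, and every build
    after the first bad one is also bad. Returns the first bad build and the
    number of benchmark runs needed to find it.
    """
    lo, hi = 0, len(builds) - 1  # lo is known good, hi is known bad
    runs = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        runs += 1
        if is_bad(builds[mid]):
            hi = mid
        else:
            lo = mid
    return builds[hi], runs


# Hypothetical history of 16,384 commits between two releases,
# with the regression introduced at commit 9000.
history = [f"commit-{i}" for i in range(16385)]
culprit, runs = first_bad(history, lambda b: int(b.split("-")[1]) >= 9000)
print(culprit, runs)
```

In this made-up history the culprit is found after 14 benchmark runs. In practice noisy results force repeat runs and some intermediate commits fail to build or boot, which is how the count climbs into the dozens.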

The last option is where things get politically complicated in kernel development. Adding a new madvise flag, a new prctl option, or a new sysctl to accommodate one application’s workload requires kernel maintainers to agree that the new interface is worth the maintenance burden. That conversation can take months.

In the meantime, the workaround is usually a sysctl setting or a compile-time configuration that was not the intended default. Users on production systems end up carrying a configuration patch that the kernel community may eventually fold in as a default, or may not.

The AWS Scale Dimension

That this report comes from an AWS engineer matters beyond the technical details. AWS runs PostgreSQL at a scale that surfaces regressions invisible to smaller deployments. A performance issue that shows up as a 2% deviation in a lab can become a 50% throughput regression on large instances with high connection counts and sustained write workloads.

AWS also ships managed database services, including Amazon RDS and Aurora PostgreSQL, where kernel versions are chosen and controlled by the provider. A regression in Linux 7.0 that affects PostgreSQL creates a real operational problem: either delay the kernel upgrade across a large fleet, apply workarounds that may not be fully validated, or accept degraded customer performance.

This is also why reports like this one matter to the kernel community. An engineer with access to production-scale benchmarks filing a detailed LKML report is significantly more useful than a synthetic benchmark showing a small deviation. The specifics of the AWS workload, the instance types, and the concurrency profile give kernel developers something concrete to reproduce and debug.

What Operators Should Do

If you are running PostgreSQL on self-managed Linux infrastructure, the practical advice is conservative:

Stay on a kernel version with a known good performance profile for your workload. For production PostgreSQL clusters, that typically means running a recent LTS kernel that has accumulated several patch releases, not the latest upstream release. Linux 7.0 is new enough that regressions like this one are being actively discovered.

If you need to move to Linux 7.0 for other reasons, benchmark before and after with a workload representative of your production traffic. Tools like pgbench with a realistic scale factor and connection count are a starting point, but they won’t catch everything. Throughput numbers can look fine while lock wait times or checkpoint behavior change in ways that affect production differently.
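A before-and-after comparison can be as simple as extracting the `tps = ...` figure that pgbench prints in its summary and computing the relative change. The Python sketch below assumes that output format. The sample text and its numbers are invented for illustration and are not results from any real run.

```python
import re

# pgbench summaries include a line beginning with "tps = <number>".
# Older releases print two such lines; this sketch takes the first.
TPS_RE = re.compile(r"^tps = ([0-9.]+)", re.MULTILINE)


def parse_tps(pgbench_output: str) -> float:
    """Extract the first reported tps value from pgbench output."""
    match = TPS_RE.search(pgbench_output)
    if match is None:
        raise ValueError("no 'tps = ...' line found in pgbench output")
    return float(match.group(1))


def relative_change(baseline: float, candidate: float) -> float:
    """Fractional change from baseline to candidate (-0.5 means halved)."""
    return (candidate - baseline) / baseline


# Made-up summaries standing in for runs on the old and new kernel.
old_kernel = "latency average = 8.000 ms\ntps = 8000.0 (without initial connection time)\n"
new_kernel = "latency average = 16.000 ms\ntps = 4000.0 (without initial connection time)\n"

change = relative_change(parse_tps(old_kernel), parse_tps(new_kernel))
print(f"throughput change: {change:+.1%}")
```

Several runs per kernel, compared by median, give a more trustworthy answer than a single pair, since pgbench results vary from run to run.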

Monitor kernel version changelogs for anything touching the memory management subsystem, the scheduler, or the futex implementation. Those are the areas where PostgreSQL is most sensitive. The LWN weekly kernel development summaries are a reliable way to track what changed and why.
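One way to narrow a long changelog is to filter commit subject lines by their subsystem prefix, since kernel commits conventionally start with one (for example `sched/fair:` or `mm/vmscan:`). The Python sketch below assumes the subjects have already been collected into a list, for instance from `git log --format=%s`. The prefix list and the sample subjects are illustrative assumptions, not real commits.

```python
# Top-level prefixes to watch; an illustrative choice based on the areas above.
WATCHED_PREFIXES = ("mm", "sched", "futex")


def touches_watched_subsystem(subject: str) -> bool:
    """True if a commit subject such as 'sched/fair: ...' has a watched prefix."""
    if ":" not in subject:
        return False
    prefix = subject.split(":", 1)[0].strip().lower()
    top_level = prefix.split("/", 1)[0]
    return top_level in WATCHED_PREFIXES


# Hypothetical subject lines, invented for this example.
subjects = [
    "sched/fair: adjust wakeup placement",
    "mm/vmscan: rework reclaim batching",
    "futex: simplify wait queue handling",
    "drm/amdgpu: fix display flicker",
    "Documentation: update maintainer list",
]

for subject in filter(touches_watched_subsystem, subjects):
    print(subject)
```

A filter like this only shortens the reading list. Changes that affect PostgreSQL can land under other prefixes, so it complements the LWN summaries and does not replace them.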

The Broader Lesson

Database software and the kernel exist in a relationship that neither side fully controls. PostgreSQL cannot easily change its architecture to avoid every kernel quirk; the multi-process shared memory model is fundamental to how it works and how it achieves isolation between connections. The kernel cannot optimize for every application; its job is to be a general-purpose platform.

What makes regressions like this one difficult is that they are not bugs in the traditional sense. The kernel change that caused this probably solved a real problem for some other workload. The challenge is building enough shared context between database developers and kernel developers that trade-offs can be made with full information, and that interfaces exist for applications to communicate their needs.

That collaboration does happen. PostgreSQL’s huge page support, its use of posix_fadvise to tell the kernel which blocks it is about to read, and its tuning of O_DIRECT behavior all reflect years of that kind of work. The Linux 7.0 regression will likely result in something similar: a new tuning knob, a kernel default that accounts for database workloads, or a PostgreSQL configuration option that routes around the slow path.

The frustrating part is that it takes a production-scale report from an AWS engineer to get the process started.
