
PostgreSQL at Half Speed: What a Linux 7.0 Regression Reveals About Database-OS Coupling

Source: hackernews

An AWS engineer posted to LKML last week with a straightforward, alarming observation: PostgreSQL throughput drops by roughly half when running on Linux 7.0. Phoronix covered the report, and the Hacker News thread lit up with 319 points and nearly a hundred comments from people who have been quietly dreading exactly this kind of news. The response from kernel developers was not reassuring: a clean fix may not be easy.

That last part deserves attention. Kernel regressions happen. What is less common is a maintainer looking at a 50% throughput drop in a major production database and saying the path forward is unclear. Understanding why requires looking at how PostgreSQL actually uses the Linux kernel, and what has been changing in the kernel’s most fundamental subsystems over the past several releases.

PostgreSQL Is Not a Normal Workload

Most applications sit several layers above the kernel. They call libc, which calls syscalls, which touch the kernel for a moment before returning. PostgreSQL is different in a few important ways.

First, PostgreSQL allocates a large shared memory segment, configured via shared_buffers in postgresql.conf, that all backend processes map into their address space simultaneously. On a well-tuned production server this might be 32 GB or more. Every query that touches a cached page is directly reading from and writing to this shared anonymous mapping. The kernel’s virtual memory subsystem is not incidental to PostgreSQL performance; it is in the hot path of every single operation.
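The sharing model described above can be illustrated with a toy sketch. This is not PostgreSQL code: it assumes a Unix-like system, and the function name and slot layout are invented for the example. It creates one shared anonymous mapping, forks a few children, and shows that writes made in each child are visible to the parent because every process maps the same physical pages.

```python
import mmap
import os
import struct


def shared_mapping_demo(workers: int = 4) -> int:
    """Fork `workers` children that each write one slot of a shared mapping.

    Hypothetical illustration only; PostgreSQL's real buffer pool is far
    more elaborate than this.
    """
    # On Unix, mmap(-1, n) creates a MAP_SHARED | MAP_ANONYMOUS mapping,
    # so forked children see the same pages rather than private copies.
    buf = mmap.mmap(-1, mmap.PAGESIZE)
    pids = []
    for i in range(workers):
        pid = os.fork()
        if pid == 0:
            # Child: write into its own 8-byte slot, then exit immediately
            # so it never runs the parent's code path.
            try:
                struct.pack_into("q", buf, i * 8, i + 1)
            finally:
                os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    # Parent: read back what the children wrote through their own mappings.
    total = sum(struct.unpack_from("q", buf, i * 8)[0] for i in range(workers))
    buf.close()
    return total


if __name__ == "__main__":
    print(shared_mapping_demo(4))
```

Each child here has its own address space, which is the same shape as one PostgreSQL backend per connection mapping a common buffer pool, only at a scale of one page instead of tens of gigabytes.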

Second, PostgreSQL uses one operating system process per connection. This is a design choice with well-understood tradeoffs, but it means that at 200 active connections, you have 200 processes all competing for scheduler time, all taking TLB misses on the same shared mapping, and all potentially contending on the kernel data structures that track virtual memory areas. The scheduler and the VM subsystem interact constantly and at scale.
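A back-of-the-envelope calculation gives a feel for why the per-process model multiplies VM cost. The sketch below assumes x86-64 with 8-byte page table entries, counts only leaf-level entries, and takes the worst case where every backend has touched every page of the shared mapping. The 32 GB and 200-connection figures come from the text above; the helper name is made up for illustration.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
KIB = 1024


def leaf_page_table_bytes(mapping_bytes: int, page_bytes: int,
                          entry_bytes: int = 8) -> int:
    """Bytes of leaf page table entries needed to map `mapping_bytes`.

    Assumes 8-byte entries (x86-64) and ignores upper page table levels.
    """
    return (mapping_bytes // page_bytes) * entry_bytes


if __name__ == "__main__":
    shared = 32 * GIB
    backends = 200
    small = leaf_page_table_bytes(shared, 4 * KIB)   # 4 KiB base pages
    huge = leaf_page_table_bytes(shared, 2 * MIB)    # 2 MiB huge pages
    print(f"4 KiB pages: {small // MIB} MiB per backend, "
          f"{small * backends // MIB} MiB across {backends} backends")
    print(f"2 MiB pages: {huge // KIB} KiB per backend, "
          f"{huge * backends // MIB} MiB across {backends} backends")
```

Under these assumptions, 4 KiB pages cost 64 MiB of page tables per backend, or 12,800 MiB across 200 backends, while 2 MiB pages cost 128 KiB per backend. Real systems rarely reach the worst case, but the ratio explains why huge pages matter so much for this workload.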

Third, PostgreSQL passes access-pattern hints to the kernel (chiefly posix_fadvise() for prefetching), relies on fsync() semantics for WAL durability, and as of PostgreSQL 18 can use io_uring for asynchronous I/O. Each of these is a surface where a kernel change can land badly.

What Changed in the Linux 6.x to 7.0 Window

The Linux kernel has spent the past several releases restructuring the virtual memory subsystem in ways that are genuinely necessary but that carry regression risk for any workload that hammers the VM.

The most significant change in recent memory was the replacement of the red-black tree that tracked a process’s VMAs (Virtual Memory Areas) with a maple tree, completed across Linux 6.1 through 6.5. This was not a superficial change: the maple tree is a B-tree variant designed for better cache behavior and reduced lock contention, but touching the core data structure for process address spaces always carries risk. The VMA tree is consulted on every page fault, every mmap call, and every address space operation.

Layered on top of that was the per-VMA locking work. Previously, all modifications to a process’s address space went through a single reader-writer semaphore called mmap_lock. This was a known scalability bottleneck for multi-threaded processes that do a lot of mapping and unmapping. The per-VMA locking work, which first landed in Linux 6.4 and was extended over the following releases, moved toward finer-grained locking at the individual VMA level. For PostgreSQL, which has many processes, each with its own mm_struct but all mapping the same large shared segment, the interaction of this new locking model with shared anonymous memory is a plausible location for a regression.

The EEVDF scheduler, which replaced the Completely Fair Scheduler in Linux 6.6, is another candidate. CFS and EEVDF have different latency characteristics, particularly around how they handle the mix of CPU-bound and I/O-bound work that characterizes a busy database. PostgreSQL backends doing buffer lookups consume CPU in very short bursts: they wake up, touch memory, do a little work, then block on I/O or wait for locks. EEVDF’s virtual deadline model can change wake-up latency in ways that cascade across hundreds of concurrent processes.

The folio conversion project, which has been rewriting the kernel’s internal page abstraction from struct page to struct folio across the page cache and anonymous memory paths, is a third possibility. An incomplete or incorrectly optimized conversion in the shared memory path could introduce overhead that is invisible on most workloads but shows up dramatically under PostgreSQL’s access patterns.

Why the Fix Is Not Easy

When a userspace application has a performance regression, you instrument it, find the bottleneck, and fix the code. When the regression is in the kernel, the process is slower and more constrained.

Bisecting the specific commit requires building and booting many kernel versions, running a reproducible benchmark each time, and narrowing the window. For a regression in a complex subsystem like memory management, the introducing commit might not be the logical place to apply the fix. The regression might be an emergent property of several related changes that are each individually correct.
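The bisection loop itself is simple; the expense is in each build, boot, and benchmark step. Here is a minimal sketch of the search, assuming an ordered list of builds where the first is known good and the last is known bad. The `is_bad` callback is a hypothetical stand-in for "build this kernel, boot it, run the benchmark, compare against the baseline". In practice `git bisect` manages this bookkeeping.

```python
from typing import Callable, Sequence


def first_bad(builds: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest bad build.

    Assumes builds are ordered oldest to newest, builds[0] is good,
    builds[-1] is bad, and the regression does not come and go.
    """
    lo, hi = 0, len(builds) - 1  # invariant: builds[lo] good, builds[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid   # regression is at mid or earlier
        else:
            lo = mid   # regression is after mid
    return builds[hi]
```

The number of benchmark runs grows with the logarithm of the commit range, so a window of a few thousand commits needs roughly a dozen build-and-boot cycles. The search also assumes a single clean transition from good to bad, which is exactly what fails when the regression emerges from several interacting changes.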

More fundamentally, the kernel’s memory management changes over the past several releases were made deliberately, and they do help other workloads. A fix that reverts the problematic behavior for PostgreSQL might reintroduce the scalability problem the original change was solving for something else. Kernel maintainers do not accept patches that trade one workload’s performance for another without very careful analysis.

PostgreSQL’s shared memory model also exposes interactions that are genuinely unusual. The same physical pages are mapped into dozens or hundreds of separate mm_struct instances simultaneously. When the kernel reclaims pages, does TLB shootdowns, or splits VMAs, it has to do so across all those mappings at once. This multiplies the cost of operations that would be cheap for a normal process, and it means that a change designed for typical single-process or multi-threaded workloads can behave unexpectedly when it meets PostgreSQL’s topology.

A Recurring Pattern

This is not the first time a kernel change has hit PostgreSQL hard. The Meltdown and Spectre mitigations in Linux 4.15 increased syscall overhead and hurt workloads with high syscall rates; PostgreSQL was affected noticeably. Transparent huge page changes have caused periodic regressions by interfering with shared_buffers mapping, causing unexpected THP promotion storms or fragmentation stalls. Each time, the resolution came through a combination of kernel-side fixes and PostgreSQL-side workarounds, often huge_pages = try settings or explicit madvise tuning.

The workaround path may be available here too. If the regression can be isolated to a specific behavior, PostgreSQL can often adapt its memory hints or I/O patterns. MADV_NOHUGEPAGE on the buffer pool, explicit huge page pre-allocation via HugeTLB instead of THP, or disabling certain kernel features via sysctl are all tools that database operators have used before. They are not satisfying answers, but they are answers.
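Before reaching for any of these workarounds, it helps to know what the host is currently doing. On Linux, the transparent huge page policy is exposed in /sys/kernel/mm/transparent_hugepage/enabled as a line such as "always [madvise] never", with the active mode in brackets. A small sketch for reading it follows; the function names are invented for this example.

```python
THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(text: str) -> str:
    """Extract the bracketed (active) mode from the sysfs file contents."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode found in {text!r}")


def read_thp_mode(path: str = THP_ENABLED_PATH) -> str:
    """Read the current THP policy; requires a Linux host with THP support."""
    with open(path, encoding="ascii") as f:
        return active_thp_mode(f.read())


if __name__ == "__main__":
    print(read_thp_mode())
```

Recording this value alongside each benchmark run makes it easier to tell a kernel regression apart from a change in huge page policy between hosts.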

For now, the practical guidance is straightforward: do not upgrade production PostgreSQL instances to Linux 7.0 without running your own pgbench workload against it first. The regression appears to be real and significant, the scope is not yet fully understood, and the kernel-side fix timeline is uncertain. AWS has the scale to benchmark continuously and catch this early; most shops do not, which is exactly why reports like this LKML thread are valuable before 7.0 reaches general availability on major distributions.
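One way to make that comparison repeatable is to capture the pgbench summary on the old and new kernels and compare the reported throughput. The sketch below assumes the summary contains a line beginning with "tps = ", which is how pgbench reports throughput. The function names and the 10% tolerance are arbitrary choices for illustration, not a recommendation from the original report.

```python
import re

# Matches summary lines such as "tps = 1234.5 (without initial connection time)".
_TPS_LINE = re.compile(r"^tps = ([0-9]+(?:\.[0-9]+)?)", re.MULTILINE)


def parse_tps(pgbench_output: str) -> float:
    """Return the last tps figure found in pgbench's summary output."""
    matches = _TPS_LINE.findall(pgbench_output)
    if not matches:
        raise ValueError("no 'tps = ...' line found in pgbench output")
    return float(matches[-1])


def throughput_ratio(baseline_output: str, candidate_output: str) -> float:
    """Candidate throughput as a fraction of the baseline (1.0 = unchanged)."""
    return parse_tps(candidate_output) / parse_tps(baseline_output)


def is_regression(baseline_output: str, candidate_output: str,
                  tolerance: float = 0.10) -> bool:
    """True if the candidate is more than `tolerance` slower than baseline."""
    return throughput_ratio(baseline_output, candidate_output) < 1.0 - tolerance
```

Run the same pgbench invocation (for example with fixed `-c`, `-j`, and `-T` values) on both kernels against an identically initialized database, and repeat each run several times. A single pair of numbers is easily skewed by cache state or background activity.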

The deeper takeaway is that PostgreSQL and the Linux kernel are more tightly coupled than most applications are to their operating system. PostgreSQL does not just run on Linux; it depends on specific behaviors in the VM, the scheduler, and the I/O subsystems that most applications never need to think about. Every major kernel restructuring is a quiet compatibility test against some of the world’s most carefully written C, and sometimes the kernel fails it.
