· 6 min read ·

The Shared Memory Assumption SQLite WAL Makes (and Docker Breaks)

Source: simonwillison

When you run SQLite with WAL mode enabled, three files live on disk instead of one: database.db, database.db-wal, and database.db-shm. Most guides mention the first two; the third one is where things get interesting, and where multi-container setups quietly fall apart.

Simon Willison documented the concrete problem: two Docker containers mounting the same volume, both opening a WAL-mode SQLite database. The result is either stale reads or, in the worst case, silent data corruption. The failure isn’t loud. There is no exception thrown. The database continues to accept reads and writes. It just stops being reliable.

Understanding why this happens requires getting into what the -shm file actually is.

WAL Mode and the Shared Memory Index

In the default rollback journal mode, SQLite modifies the database file directly and writes the original page content to a -journal file first, so it can roll back on crash. Writes block readers; readers block writers. It is safe and correct across nearly any filesystem, but contention is high.

WAL mode, introduced in SQLite 3.7.0 in 2010, inverts this. Writes go to the -wal file sequentially, never touching the main database file until a checkpoint runs. Readers continue reading the original database file, consulting the WAL only for pages that have been modified since the last checkpoint. One writer can proceed while many readers run concurrently. This is a significant improvement for read-heavy workloads.

The coordination mechanism that makes this work is the WAL index, stored in the -shm file. It contains:

  • A header with the current maximum valid WAL frame, a change counter, page size, and checksums.
  • A hash table mapping database page numbers to WAL frame positions, so readers can quickly find the latest version of any given page.
  • Five read mark slots, one per concurrent reader snapshot, recording the highest WAL frame each reader is allowed to see.
  • Lock byte ranges that processes use for fcntl() coordination.

The -shm file is explicitly documented by SQLite as a shared memory construct dressed as a file. Every process that opens the database mmap()s this file with MAP_SHARED. On a local filesystem, all processes mapping the same inode share the same physical memory pages in the kernel page cache. A write by one process is immediately visible to all others. This is how the WAL index stays consistent across concurrent readers and writers without locks on every read.

The SQLite documentation is direct about the scope of this design: “All processes using a WAL database must be on the same host machine.” This is not a soft recommendation. It is a load-bearing architectural assumption.

Where Docker Breaks the Assumption

Containers on the same Linux host share a kernel. In the ideal case, two containers both mapping the same -shm file from a local volume (ext4, xfs) would share the same kernel page-cache pages, and WAL coordination would work. The reality is more complicated.

Docker Desktop on macOS and Windows is the most common failure scenario. Docker Desktop runs all containers inside a Linux VM. Your local filesystem is exposed to that VM through a virtual filesystem layer: variously virtiofs, gRPC-FUSE, or the older osxfs. When a container mmap()s a file through this layer, the resulting memory region is not shared with any other container at the page-cache level. Each container has its own private mapping. One container’s writes to the WAL index are invisible to the other. The WAL coordination protocol becomes incoherent.

Overlay and network filesystems in Linux Kubernetes environments have the same problem. If your Docker volume is backed by NFS, an overlay filesystem, or any FUSE implementation, MAP_SHARED does not guarantee cross-process shared memory. SQLite’s documentation lists “Using WAL mode on a network filesystem” explicitly as a corruption vector.

Same-host Linux with local volumes is the closest to safe, but it is fragile. The behavior depends on whether the Linux kernel coalesces the page-cache entries for the two container processes mapping the same file inode. On a simple setup with a bind-mounted local directory, this often works. But “often works” is not a reliability guarantee you want for your database.

The insidious part is that WAL mode does not use traditional file locks for its index coordination; it uses byte-range locks (fcntl()) on specific offsets within the -shm file as a signaling mechanism. These lock calls succeed even when the underlying mmap regions are not truly shared. The processes believe they are coordinating; they are not.

The Rollback Journal Comparison

The original rollback journal mode uses POSIX advisory locks (fcntl() or flock()) on the main database file itself, not on a shared memory mapping. These locks work across separate processes and survive context switches. They do not require shared memory. This is why rollback journal mode is significantly more robust in multi-container setups:

PRAGMA journal_mode=DELETE;   -- or TRUNCATE or PERSIST

The trade-off is concurrency: rollback journal mode serializes readers and writers. For a low-traffic internal tool or a side project where two containers occasionally read the same SQLite file, this is often an acceptable trade. For anything with concurrent read throughput requirements, it hurts.

If you need WAL mode’s concurrency characteristics and multi-container access, you need a different architecture.

Solutions That Work

Single writer, multiple readers via architecture. If you can structure your containers so that only one container ever opens the database for writing, WAL mode works safely for that container. Read-only containers can open the database in read-only URI mode (?mode=ro), but they will still need the -shm file to be accessible and correctly populated by the writer. This requires careful coordination and is fragile if your architecture changes.

LiteFS is the most robust solution purpose-built for this problem. Developed by Fly.io, LiteFS is a FUSE-based filesystem that presents a virtual SQLite volume to each container. One container is elected as the primary; all writes go through it. The primary streams WAL frames to replica containers over HTTP. Replicas are read-only from SQLite’s perspective. This is eventually consistent, but reads within a transaction are always consistent, and the replication lag is typically sub-millisecond on the same host.

LiteFS uses Consul (or the Fly.io lease API) for primary election, so it handles container restarts and failover. The key insight it exploits is that WAL frames are a natural replication log; you can stream them as change events. The system is in production use at scale and is the correct answer if you need distributed SQLite with multi-container writes.

Litestream is a lighter tool from the same author (Ben Johnson) that streams WAL frames to object storage (S3, GCS, Azure Blob) for continuous backup and point-in-time recovery. It does not solve the multi-writer problem, but it solves the backup problem that often drives people toward multi-container SQLite setups in the first place. If your goal was “survive container restart without data loss,” Litestream with a single-writer architecture is simpler than LiteFS.

A proper client-server database. For concurrent writes from multiple containers, PostgreSQL or MySQL are the correct tools. They are designed for exactly this access pattern. The operational overhead is real but so is the correctness guarantee. SQLite’s own documentation recommends client-server databases for “high concurrency” or “very large datasets” — the multi-writer, multi-container case falls squarely into that category.

A Note on Silent Failures

What makes this problem particularly worth understanding is that it does not announce itself. A WAL-mode SQLite database opened by two containers with non-shared page-cache mappings will continue to accept reads and writes. Both containers will believe they are operating on a consistent database. The WAL index in each container’s private mapping will diverge from the other. Checkpoints will write inconsistent data back to the main database file. You may not see errors until you try to open the database file with a fresh process and find it unreadable, or until a user reports that data they saved has disappeared.

The safe rule is conservative: do not let more than one OS-level process from separate address spaces open a WAL-mode SQLite database. On Docker Desktop on Mac or Windows, that means every container is a separate address space regardless of the volume configuration. On Linux with local volumes, you can get away with it in some configurations, but the behavior is not guaranteed by SQLite’s documented contract.

If you are building something with SQLite in Docker and you reach for WAL mode because you read that it improves performance, that is a reasonable instinct. Just check first whether your containers are truly sharing memory when they map that -shm file. If they are not, you are not getting WAL mode’s benefits anyway; you are only getting its failure modes.

Was this interesting?