· 7 min read ·

What SQLite WAL Mode Assumes About the OS, and Why Docker Breaks It

Source: simonwillison

Most developers who use SQLite in production eventually switch to WAL mode. The reasons are straightforward: writers don’t block readers, reads don’t block writes, and sequential WAL appends are faster than the random page overwrites required by the default rollback journal. For a typical application running in a single process or container, WAL is almost always the right choice.

But when you mount a SQLite database file in two Docker containers and both try to use it with WAL enabled, something breaks. Simon Willison documented exactly this scenario: WAL-mode databases shared across Docker containers produce unreliable behavior, up to and including silent corruption. Understanding why requires looking at what WAL mode actually creates on disk and what it assumes about the environment.

What WAL Mode Creates on Disk

When SQLite switches a database to WAL mode (PRAGMA journal_mode=WAL), it creates two auxiliary files alongside the main .db file:

  • database.db-wal: the write-ahead log, where new writes are appended as “frames”
  • database.db-shm: the WAL index, also called shared memory

The WAL file is conceptually simple. Instead of modifying database pages in place, SQLite appends changed pages as frames to this log. A reader that wants page 42 checks the WAL first; if a committed frame for that page exists there, it reads from the WAL. Otherwise it reads from the main database file. Periodically, a process called checkpointing copies WAL frames back into the main database and the WAL gets truncated.

The -shm file is where the trouble lives.

What Shared Memory Actually Means

The WAL index exists to make reads fast. Without it, every read would require scanning the entire WAL file to find the most recent committed version of any given page. The WAL index is a hash map from page number to WAL frame offset, maintained as a series of 32KB blocks in the -shm file.

SQLite opens this file and calls mmap() on it. Memory-mapping the WAL index means all cooperating processes share the same in-memory data structure without explicit inter-process communication. When one process updates the WAL index after writing a new frame, every other process that has the file mapped sees the change immediately, because they are all reading from the same physical memory pages.

This works because of how the operating system kernel manages mapped files. When two processes on the same Linux host both call mmap() on the same file, the kernel maps them to the same physical page frames. An update from one process is immediately visible to the other. The coherency guarantee is provided by the kernel’s virtual memory subsystem, not by the filesystem itself.

The SQLite documentation on WAL is explicit about this dependency: “WAL mode requires that the VFS support shared memory primitives.” The VFS (virtual file system layer) must be able to open the -shm file and memory-map it in a way that is coherent across all processes accessing the database. On a single host with a local filesystem, that requirement is met. In other environments, it is not.

Docker Volumes and Where the Model Fails

Two containers on the same Docker host, sharing a volume via a bind mount of a local directory, are closer to safe than they might appear. They share the same kernel. If both containers mmap() the -shm file, the kernel may map them to the same physical pages, and the shared memory semantics largely hold. This is the least dangerous multi-container configuration.

Things degrade quickly from there.

Named Docker volumes backed by NFS, or Kubernetes PersistentVolumeClaims using NFS or networked storage drivers, introduce network-layer file access. NFS does not support coherent mmap() across hosts. Two hosts that open and memory-map the same NFS-exported file get independent mappings backed by separate kernel caches; neither knows the other’s mapping exists. The WAL index becomes inconsistent across processes, readers see stale data, and the database is silently corrupted.

SQLite’s documentation names this directly: WAL mode must not be used with databases on NFS. The restriction exists because the shared-memory coherency guarantee that WAL depends on is simply absent over the network.

POSIX file locking presents a related failure. WAL mode uses byte-range fcntl() locks on specific byte offsets within the -shm file to coordinate readers, writers, and checkpointers. There are eight “WAL lock” bytes starting at offset 120 in the shared memory region: a write lock, a checkpoint lock, a recovery lock, and six read-mark locks. On a local filesystem, these locks are tracked per-inode in the kernel and work correctly between processes sharing that inode. On NFS, locking is delegated to the lockd daemon and the Network Lock Manager protocol, both of which have a long and documented history of implementation failures, particularly around crash recovery.

The failure is insidious because it is intermittent. Under low concurrency, nothing obviously breaks. Corruption accumulates in edge cases: concurrent writes that bypass each other’s locks, a container that crashes mid-write leaving the WAL in a partially committed state, a checkpointer that truncates WAL frames another reader was still consuming.

The Checkpoint Starvation Problem

There is an additional failure mode specific to WAL that compounds in multi-container setups. Checkpointing requires that no active reader is positioned at an earlier point in the WAL than the frame being overwritten. Each reader records its “read mark” in the WAL index, and the checkpointer must respect all active read marks before it can reclaim WAL space.

In a correctly functioning single-host WAL setup, read marks advance quickly as transactions complete and the WAL stays bounded. But if multiple containers maintain independent views of the WAL index, and that index is not truly shared between them, the checkpointer reads stale WAL index data. It concludes that old WAL frames are still in use by readers that may have long since finished. The WAL grows without bound.

This failure does not produce an error. It produces degrading read performance as the WAL grows, because every read must search through progressively more WAL frames to find the current version of each page. Eventually the process requires a restart or the database needs to be rebuilt. The symptom looks like a memory leak or a slow disk, not a database concurrency bug.

Concrete Solutions

The safest approach for a database mounted in multiple containers is to use rollback journal mode:

PRAGMA journal_mode=DELETE;

Rollback journal mode uses POSIX locks on the main database file itself, requires no shared memory, and does not create a -shm file. Concurrent read/write performance is worse than WAL: writers hold an exclusive lock on the database file for the duration of a write transaction, blocking readers. For most workloads that share a single SQLite file across containers, this tradeoff is acceptable. The alternative is unpredictable corruption.

If WAL mode matters for write throughput and you are running on a single Docker host with bind-mounted local volumes, a strict single-writer architecture reduces risk substantially:

# Writer container: open read-write, WAL enabled
conn = sqlite3.connect("/data/app.db")
conn.execute("PRAGMA journal_mode=WAL")

# Reader containers: open strictly read-only
conn = sqlite3.connect("file:/data/app.db?mode=ro", uri=True)

Read-only connections in WAL mode do not update read marks in ways that interfere with checkpointing, and they do not hold write locks. They still need a consistent WAL index to function correctly, so this is not a complete solution for NFS-backed volumes, but it substantially reduces the coordination surface for local bind mounts.

For more demanding setups, Litestream offers a structurally different model. A sidecar process streams WAL frames to external storage (S3 or an S3-compatible endpoint) in near real-time. Each container runs against its own local copy of the database, synchronized via replication rather than sharing a single file. This eliminates the shared-volume problem by abandoning the shared-file architecture entirely. It adds operational complexity but removes the entire class of WAL-coherency failures.

For genuinely concurrent multi-container writes, rqlite wraps SQLite in a Raft consensus layer, providing distributed coordination that WAL mode was never designed to supply. Network overhead is real and measurable, but the correctness guarantees are also real.

The Assumption That Containers Violated

SQLite’s WAL mode was designed in 2010 for a world where “multiple concurrent processes” meant multiple processes on a single operating system sharing a unified kernel and virtual memory subsystem. The original WAL design document reflects this scope throughout: all the shared-memory semantics, all the byte-range locking protocols, all the coherency guarantees depend on a single host with coherent mmap() support and reliable local fcntl() locks.

Containers have made the single-host assumption less universal without making it less necessary. A service that once ran as multiple threads in a single process now commonly runs as multiple containers, each with its own filesystem namespace, even when they share the same physical host. Docker volumes make it trivially easy to mount the same database file in multiple containers, but a mounted volume is not the same thing as a local inode from the perspective of a process relying on mmap() coherency.

The outcome is a class of bugs that surfaces gradually, under specific concurrency conditions, and often does not produce clear error messages. SQLITE_BUSY and SQLITE_IOERR_LOCK are the optimistic cases, the ones where the locking layer detects the conflict. The pessimistic case is SQLITE_CORRUPT or a database disk image is malformed message appearing weeks after the underlying condition was introduced.

The fix is not to avoid WAL mode. It remains the right default for single-container SQLite deployments. The fix is to treat the Docker volume as a distinct architectural element rather than a transparent file path. Once that distinction is clear, the decision about journal mode, single-writer architecture, or replication-based approaches follows from the actual deployment topology rather than the application’s default settings.

Was this interesting?