The Shared Memory Assumption Behind SQLite WAL Mode and the Docker Configurations That Break It
Source: simonwillison
When multiple containers share a SQLite database over a Docker volume, the question of whether WAL mode works is not binary. It depends on the exact container runtime, the filesystem type backing the volume, and the host operating system. Simon Willison documented his experience with this recently, and the mechanism behind the answer is worth examining closely.
SQLite’s write-ahead logging mode, introduced in version 3.7.0 in July 2010, creates three files alongside your database: the main .db file, a .db-wal file where new writes are appended as “frames,” and a .db-shm file that serves as a WAL index. The -wal and -shm names are familiar to anyone who has left a SQLite database open by accident. The -shm file is the one that causes problems in container environments, and understanding why requires a close look at what it is.
The WAL Index Is Not a Regular Database File
The -shm file is an ordinary file on disk. SQLite opens it and then calls mmap() with MAP_SHARED, treating it as a shared memory region rather than storage. The file contains:
- A header written twice for crash safety, encoding the database size, a change counter, the last valid WAL frame number, and salt values for verifying WAL frame integrity
- Hash tables mapping database page numbers to the most recent WAL frame holding that page, so readers can find the current version of any page without scanning the entire WAL
- Eight read-lock slots, each recording the highest WAL frame a reader is currently positioned at
When a writer appends a new frame to the WAL, it updates the -shm hash tables so subsequent readers find the new frame. When a reader opens a read transaction, it pins a read-lock slot to its current frame position, which prevents the checkpointer from overwriting WAL frames the reader still needs.
All of this coordination happens through shared memory, not through file I/O. The reason is performance. File I/O through the kernel incurs system call overhead on every access; a mapped memory region is pointer arithmetic after the initial mmap() call. For a high-frequency operation like updating a hash table entry on every read, the difference is significant.
The correctness guarantee is this: when two processes both call mmap(MAP_SHARED) on the same file, the kernel backs both mappings with the same physical memory pages from its page cache. A write to that memory from one process is immediately visible to the other because they are writing to and reading from the same physical addresses. The synchronization is provided by the CPU’s cache coherency protocol and the processor’s memory model, not by any SQLite-level protocol.
This is why SQLite’s documentation on WAL mode states that all processes using a WAL database must be on the same host machine. It is a load-bearing architectural requirement, not advisory guidance.
Where Docker Does and Does Not Preserve the Guarantee
Linux containers are not virtual machines. They are processes in separate Linux namespaces, sharing one kernel. When two containers both open the same -shm file on a local filesystem (ext4, xfs, btrfs), they share the same kernel and the same page cache. The kernel coalesces their mmap() calls to the same physical pages, and the coherency guarantee holds. This is the scenario where multi-container SQLite WAL access can work.
Docker Desktop on macOS and Windows is different. All containers run inside a Linux VM, and file access from the host goes through a virtual filesystem layer: variously virtiofs, gRPC-FUSE, or the older osxfs depending on the version and configuration. When a container mmap()s a file through this layer, the mapping is backed by the VM’s kernel page cache. Two containers sharing a volume through this layer have independent mappings, not shared physical pages, and one container’s writes to the WAL index are not visible to the other.
NFS-backed Docker volumes and Kubernetes persistent volume claims over NFS have the same problem across nodes, with additional complications. NFS does not guarantee mmap() coherency across clients. Two hosts mapping the same NFS file get separate page caches. Byte-range locking on NFS requires the lockd daemon and the NLM protocol, which have a long history of reliability problems and are frequently disabled with nolock mount options. SQLite’s “How to Corrupt Your Database” page lists WAL mode on a network filesystem as an explicit corruption vector.
The Failure Modes Are Subtle
When the coherency guarantee fails, the failure profile is not an immediate crash or a thrown exception. It is gradual and often silent.
A reader in one container holds a stale view of the -shm WAL index. It reads pages from the main database file, bypassing frames in the WAL that represent more recent committed writes. The data returned is internally consistent but not current. No error is produced.
The checkpointer reads the read-lock slots in the -shm file to determine which WAL frames are still in use by active readers. If those slots reflect a stale state, the checkpointer either over-estimates reader positions and refuses to overwrite WAL frames that no longer have active readers, causing the WAL to grow without bound; or it under-estimates them and overwrites frames a reader is actively depending on, which can produce SQLITE_CORRUPT.
The SQLITE_CORRUPT error, when it does appear, often surfaces far from the write that caused the problem. WAL growth without an error signal is harder still to notice, since it manifests as degrading read performance over time rather than a clear failure.
The Pragmas That Help in the Safe Case
For the scenario that works, same Linux host with a local volume and containers sharing one kernel, the configuration that reduces contention is:
PRAGMA journal_mode=WAL;
PRAGMA synchronous=NORMAL;
PRAGMA busy_timeout=5000;
The busy_timeout pragma is the most practically important. Without it, a write lock contention returns SQLITE_BUSY immediately, and the calling code must retry manually. Setting a timeout tells SQLite to spin on the lock for up to 5000 milliseconds before surfacing the error. For containers with infrequent writes, this absorbs most contention without additional retry logic in application code.
If the workload allows a single-writer architecture, enforcing it explicitly is worthwhile:
# Writing container: read-write connection
conn = sqlite3.connect("/data/app.db")
conn.execute("PRAGMA journal_mode=WAL")
# Reading containers: read-only, cannot acquire write locks
conn = sqlite3.connect("file:/data/app.db?mode=ro", uri=True)
Read-only connections in WAL mode do not acquire write locks and do not disturb the checkpoint process.
Rollback Journal Mode As the Safe Default
For any configuration outside the same-host same-kernel case, the safe default is rollback journal mode:
PRAGMA journal_mode=DELETE;
The rollback journal uses POSIX fcntl() byte-range locks on the main database file, not shared memory. These locks are scoped to the kernel’s lock table and work correctly across containers on the same host. Their failure modes on network filesystems are at least detectable, returning errors rather than silently producing wrong data. Writers hold an exclusive lock for the duration of a write transaction, which blocks concurrent readers, but for many workloads this is an acceptable trade-off for correctness.
If you switch from WAL mode to rollback journal mode, remove the -shm and -wal files afterward when no connections are open:
rm app.db-shm app.db-wal
Leaving them in place is harmless since SQLite handles stale WAL files on open, but removing them avoids confusion about the current journal mode.
Solutions for Genuine Multi-Container Write Concurrency
If you need concurrent writes from multiple containers and cannot guarantee same-host local filesystem access, several tools address this at the architecture level.
LiteFS presents a FUSE-based virtual filesystem to each container. One container is elected primary and handles all writes. The primary streams WAL frames to replicas over HTTP. Replicas are read-only from SQLite’s perspective. It solves the coordination problem by centralizing writes behind an abstraction layer rather than relying on shared memory coherency.
Litestream streams WAL frames to object storage in near-real-time for disaster recovery and replica restoration. It handles replication to object storage for a single-writer deployment; it does not coordinate live concurrent writes from multiple containers, which is a different problem despite the surface similarity.
libSQL and Turso extend the WAL mechanism with a remote replication layer, coordinating writes through a dedicated server process rather than shared files. This removes the host-colocation requirement entirely, at the cost of latency and operational complexity.
For genuinely high-concurrency write workloads, the SQLite documentation’s own guidance points toward client-server databases. PostgreSQL and MySQL coordinate concurrent writes through a server process specifically designed for that purpose. The shared-volume multi-container pattern is asking SQLite to operate outside its design scope.
The question of whether WAL mode works across Docker containers has a precise answer: on the same Linux host with a local volume, it works because containers share a kernel and therefore share a page cache. On Docker Desktop for Mac or Windows, or on any networked filesystem, it does not work reliably, and the failures will not announce themselves with clear error messages. The -shm file is the point of failure, and the guarantee it depends on is a property of the kernel, not the filesystem.