SQLite WAL Mode Across Containers: The Shared Memory Guarantee That Network Volumes Cannot Provide
Source: simonwillison
SQLite’s WAL mode has been running reliably in production for years, but container orchestration has introduced a failure mode subtle enough to produce silent data corruption before anything looks wrong. Simon Willison posted a practical investigation of this exact problem last week. The underlying mechanism deserves closer attention than the standard “don’t use SQLite on NFS” warning usually gets, because understanding what breaks tells you precisely which container deployment patterns are safe and which are not.
What WAL Mode Does to Your Filesystem
When you enable WAL mode on a SQLite database (PRAGMA journal_mode=WAL), SQLite works with three files: the main .db file and two companions created alongside it:
- The main database file (.db): Updated only during checkpoints, not on every write.
- The write-ahead log (.db-wal): All committed writes are appended here as frames. Reads consult this file to find recent page versions before falling back to the main database file.
- The shared memory index (.db-shm): A small file (typically 32KB) that serves as an index over the WAL, so readers can locate specific frames without scanning the entire log from the beginning.
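The file layout is easy to observe directly. The following is a minimal sketch using Python's standard sqlite3 module; the function name, table, and scratch path are illustrative assumptions, not anything from Willison's investigation.

```python
import os
import sqlite3
import tempfile


def wal_sidecar_files() -> dict:
    """Enable WAL on a scratch database and report which sidecar files exist."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "app.db")  # hypothetical database name
        conn = sqlite3.connect(path)
        try:
            # The pragma returns the journal mode that is now in effect.
            mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
            conn.execute("CREATE TABLE t (x INTEGER)")
            conn.execute("INSERT INTO t VALUES (1)")
            conn.commit()
            # Check while the connection is still open: the last connection
            # to close normally checkpoints and removes both sidecar files.
            return {
                "mode": mode,
                "wal": os.path.exists(path + "-wal"),
                "shm": os.path.exists(path + "-shm"),
            }
        finally:
            conn.close()
```

Calling wal_sidecar_files() on a local disk should report the mode as "wal" with both sidecar files present while the connection is open.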
The first two files are straightforward. The third one carries the constraint that matters for containers.
SQLite accesses the .db-shm file via mmap(), mapping it directly into each process’s virtual address space. This is not an implementation detail buried in the source; the official WAL documentation states it plainly: WAL mode requires that the VFS support shared memory. Shared memory is the mechanism that allows multiple processes on the same machine to maintain a coherent, low-overhead view of the WAL index without each reader scanning the entire WAL file independently.
The Coherence Requirement
The choice of mmap() over regular file reads comes down to performance and coherence. Mapping the shm file into virtual memory means updates to the WAL index by one process are immediately visible to other processes on the same host, because the OS kernel manages coherence of mmap-ed regions backed by the same physical file on the same machine. When process A writes a new frame to the WAL and updates the shm index, process B’s view of the mmap region reflects that update without any explicit inter-process communication. Both processes share physical memory pages through the VM subsystem, and the kernel enforces coherence between them.
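The same-host guarantee can be sketched without SQLite at all. The example below is an illustration under simplifying assumptions: it creates two independent shared mappings of one scratch file inside a single process, standing in for two processes on the same machine. The file name and function name are hypothetical.

```python
import mmap
import os
import tempfile


def demo_shared_mapping() -> bytes:
    """Write through one mapping of a file and read it back through another."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "index.shm")  # stand-in for a .db-shm file
        with open(path, "wb") as f:
            f.write(b"\x00" * mmap.PAGESIZE)

        # Two separate file handles and two separate mappings, as two
        # processes on the same host would each have.
        with open(path, "r+b") as fa, open(path, "r+b") as fb:
            writer = mmap.mmap(fa.fileno(), mmap.PAGESIZE)
            reader = mmap.mmap(fb.fileno(), mmap.PAGESIZE)
            try:
                writer[0:5] = b"frame"
                # No flush and no message passing: both mappings are backed
                # by the same pages in this machine's kernel.
                return bytes(reader[0:5])
            finally:
                writer.close()
                reader.close()
```

On a local filesystem the second mapping sees the write immediately. Nothing in this code asks for that behavior; it comes from both mappings being served by one kernel, which is exactly the property that goes away once two hosts are involved.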
The moment you move to a network filesystem, that guarantee disappears. NFS, SMB/CIFS, Amazon Elastic File System (EFS), Azure Files, and most cloud-provider shared volume solutions do not provide coherent mmap semantics across different host machines. Each host maintains its own memory-mapped region, and the kernel on one host has no mechanism to invalidate or update its mapping when another host writes to the underlying file.
The result is that a second container reads the WAL using a stale index. It may read the wrong frame for a given database page and return data that was already overwritten. It may follow a frame pointer that points to an inconsistent location in the WAL file. SQLite may notice the inconsistency through its own sanity checks and return SQLITE_CORRUPT, or it may not detect it at all and return silently wrong query results. The second failure mode is considerably worse than the first.
What SQLite’s Own Documentation Says
The WAL documentation’s limitations section is unambiguous: “All processes using a database must be on the same host computer; WAL does not work over a network filesystem.” This is not a soft recommendation. Elsewhere the documentation notes that SQLite will refuse to enable WAL mode if the VFS reports it does not support shared memory. That check is about the VFS, not the filesystem underneath it: the default VFS implements the shared memory methods no matter where the file lives, so the check does not reliably catch a database sitting on a network mount.
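It helps to see what the pragma actually reports. PRAGMA journal_mode=WAL returns the mode that took effect, so an application can tell when WAL was refused, but a "wal" answer says nothing about whether the file is on a local disk. A minimal sketch, with hypothetical helper names:

```python
import os
import sqlite3
import tempfile


def request_wal(conn: sqlite3.Connection) -> str:
    """Ask for WAL and return the journal mode SQLite actually selected."""
    return conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]


def demo_journal_modes() -> tuple:
    # An in-memory database cannot use WAL, so the request is declined.
    mem = sqlite3.connect(":memory:")
    try:
        mem_mode = request_wal(mem)
    finally:
        mem.close()

    # A file-backed database accepts WAL. SQLite does not inspect the
    # filesystem type here, so the same answer would come back on NFS.
    with tempfile.TemporaryDirectory() as tmp:
        disk = sqlite3.connect(os.path.join(tmp, "app.db"))
        try:
            disk_mode = request_wal(disk)
        finally:
            disk.close()

    return mem_mode, disk_mode
```

Checking the returned value is still worthwhile, since it catches the cases SQLite does refuse. It is just not a substitute for knowing where the volume comes from.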
The file locking documentation adds another layer to the problem: even the POSIX advisory locks that SQLite uses for writer coordination are unreliable on many NFS configurations. There is no architectural path to making WAL mode safe across hosts; the shared memory requirement is structural.
Same-Host Containers Are a Different Case
Cross-host container access and same-host container access are different cases, and Willison’s investigation surfaces the distinction clearly.
When two containers on the same Docker host mount the same local volume, they are accessing the same underlying filesystem through the same kernel. The kernel can maintain mmap coherence between the two container processes because they share the same physical host. SQLite’s write serialization prevents concurrent writes from corrupting the WAL, and the shm coherence holds because both processes are managed by the same kernel, regardless of their container namespace boundaries.
Same-host container access to a WAL-mode SQLite database generally works. The failure appears specifically when containers run on different physical or virtual hosts sharing a networked volume.
Container orchestration makes this easy to stumble into. A Kubernetes Deployment with replicas: 2 where both pods share a single PVC backed by an NFS provisioner is exactly the cross-host scenario where WAL mode will silently corrupt data. The development environment, where everything runs on a single Docker host, gives no indication that anything is wrong.
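Because the development environment gives no warning, one option is a startup guard that refuses to open a WAL database on a network mount. The sketch below is Linux-specific and assumes the /proc/mounts format; the set of filesystem type names is a non-exhaustive assumption, the helper names are hypothetical, and symlinks and escaped spaces in mount points are not handled.

```python
import posixpath

# Assumed, non-exhaustive list of network filesystem type names.
NETWORK_FS_TYPES = {
    "nfs", "nfs4", "cifs", "smb3", "9p", "ceph",
    "glusterfs", "fuse.glusterfs", "fuse.sshfs", "lustre",
}


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the deepest mount point containing path."""
    target = posixpath.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if target == mount_point or target.startswith(prefix):
            # ">=" lets a later mount on the same mount point win.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type


def is_network_fs(path, mounts_text=None):
    """True if path appears to live on a network filesystem."""
    if mounts_text is None:
        with open("/proc/mounts") as f:  # Linux only
            mounts_text = f.read()
    return fs_type_for(path, mounts_text) in NETWORK_FS_TYPES
```

An application could call is_network_fs() on its database path before connecting and exit with a clear error if it returns True. A guard like this is best-effort: an unrecognized filesystem type passes silently, so it complements the deployment rule rather than replacing it.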
Rollback Journal Mode: A Different Problem, Not a Solution
Switching to rollback journal mode (PRAGMA journal_mode=DELETE) removes the mmap dependency. Rollback journal uses only POSIX advisory byte-range locks on the main database file for coordination. There is no shared memory, no mmap coherence requirement.
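The absence of the shared memory file can be checked the same way as its presence. A minimal sketch, again with an illustrative function name and scratch path:

```python
import os
import sqlite3
import tempfile


def rollback_mode_files() -> dict:
    """Write in rollback journal mode and report which sidecar files exist."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "app.db")  # hypothetical database name
        conn = sqlite3.connect(path)
        try:
            mode = conn.execute("PRAGMA journal_mode=DELETE").fetchone()[0]
            conn.execute("CREATE TABLE t (x INTEGER)")
            # sqlite3 opens an implicit transaction before this INSERT and
            # leaves it open until commit().
            conn.execute("INSERT INTO t VALUES (1)")
            observed = {
                "mode": mode,
                # A -journal file is usually present while the write
                # transaction is open; it is removed on commit.
                "journal": os.path.exists(path + "-journal"),
                "wal": os.path.exists(path + "-wal"),
                "shm": os.path.exists(path + "-shm"),
            }
            conn.commit()
            return observed
        finally:
            conn.close()
```

No -wal or -shm file appears at any point, which confirms that nothing is being memory-mapped for coordination. The file locks are still there, though, and they carry their own problem on network filesystems.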
This is a meaningful structural difference, but it does not make cross-host concurrent access safe. POSIX advisory locking is also unreliable on many network filesystems. NFS has a long history of lock manager failures, lock revocation on network partition, and silent lock loss under load. The SQLite locking documentation warns about this explicitly. Rollback journal mode on NFS is differently broken rather than fixed; it lacks the mmap problem but retains the POSIX lock reliability problem.
The practical guidance is the same for both modes: multiple containers on different hosts should not write to the same SQLite file through a shared network volume, regardless of journal mode.
Architectures That Hold Up
The cleanest safe architecture is the single-writer pattern. One container owns the SQLite database and handles all writes. Other containers either communicate through an HTTP API or maintain their own read-only local copies.
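The read-only side of this pattern can be sketched with SQLite's URI filenames. In the example below, a plain file copy stands in for whatever mechanism delivers the replica's local copy; the file names, table, and helper names are hypothetical.

```python
import os
import pathlib
import shutil
import sqlite3
import tempfile


def open_read_only(path: str) -> sqlite3.Connection:
    """Open a local database copy so that any write attempt fails."""
    uri = pathlib.Path(path).as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def demo_single_writer() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        primary = os.path.join(tmp, "primary.db")
        replica = os.path.join(tmp, "replica.db")

        # The single writer owns the primary file.
        writer = sqlite3.connect(primary)
        writer.execute("CREATE TABLE notes (body TEXT)")
        writer.execute("INSERT INTO notes VALUES ('hello')")
        writer.commit()
        writer.close()

        # Stand-in for restoring a replica's own local copy.
        shutil.copyfile(primary, replica)

        reader = open_read_only(replica)
        try:
            rows = reader.execute("SELECT body FROM notes").fetchall()
            try:
                reader.execute("INSERT INTO notes VALUES ('nope')")
                write_rejected = False
            except sqlite3.OperationalError:
                write_rejected = True
        finally:
            reader.close()
        return rows, write_rejected
```

Opening replicas with mode=ro turns an accidental write into an immediate error instead of a silent divergence from the primary.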
Litestream fits this model precisely. It runs as a sidecar alongside the single writer, streaming WAL frames continuously to object storage: S3, GCS, Azure Blob, SFTP, and others. Replica containers restore from object storage on startup and serve reads from their own local copy of the database. Replication lag is typically under a second. Because each container has its own file, there is no shared volume at all, and the entire problem disappears by design.
# Litestream restore on container startup
litestream restore -config /etc/litestream.yml /data/db.sqlite
# Then launch the application alongside litestream replicate
litestream replicate -config /etc/litestream.yml &
exec ./myapp
LiteFS, built by Fly.io, takes a more involved approach. It presents a FUSE-based virtual filesystem that intercepts SQLite file operations and replicates them to other nodes using its own protocol instead of relying on shared memory between hosts. Replicas can serve reads with low latency, and leader election is handled through Consul leases. The operational complexity is higher than Litestream, but the read scaling story is stronger for deployments where replica read latency matters.
For any workload that genuinely requires concurrent write access from multiple containers on different hosts, a client-server database is the appropriate tool. PostgreSQL handles concurrent connections across an arbitrary number of clients without any of these constraints. SQLite’s architecture was designed for concurrent processes on a single host, and the WAL mechanism reflects that design. Litestream and LiteFS extend the single-writer model in useful ways, but they are explicitly replication systems built on top of SQLite’s single-host design, not replacements for it.
Why This Pattern Keeps Appearing
The danger in the Docker shared volume scenario is that it can look like it works. A local Docker volume on a single development machine does work. A bind mount on a single production host works. The failure appears only when the volume is backed by a network filesystem, which is a common configuration in cloud deployments where shared storage is often implemented as EFS or a similar NFS-backed service.
That gap between a passing development setup and silent production corruption is the real problem here. The mechanism is knowable and the fix is clear, but neither is visible from the surface behavior. Going one level deeper, to the mmap call and the kernel coherence guarantee it relies on, is what makes the failure mode predictable rather than mysterious.