
The -shm File Is the Whole Story: SQLite WAL Mode and Shared Docker Volumes

Source: simonwillison

Most developers reach for SQLite’s WAL mode as a straightforward performance upgrade. Writers no longer block readers, readers no longer block writers, and throughput improves noticeably for read-heavy workloads. It ships as a one-liner: PRAGMA journal_mode=WAL. What is easy to miss in the documentation is that WAL mode changes SQLite’s entire inter-process coordination model, and that change has real consequences when you run multiple containers sharing the same database file on a Docker volume.

Simon Willison’s investigation into this exact scenario is worth reading as a starting point, but the deeper story is in the mechanics SQLite uses to make WAL work at all.

What WAL Mode Actually Creates

When you enable WAL mode on a SQLite database, the engine stops modifying the database file in place during a write transaction. Instead, it appends changes to a separate write-ahead log, the .db-wal file. Readers continue reading from the original database file; they consult the WAL only for pages that have been modified since the last checkpoint. Periodically, a checkpoint operation copies the WAL contents back into the main database file.

This gives you two files most developers know about. The third file, .db-shm, is what makes this entire scheme work between processes, and it is what makes the Docker scenario interesting.
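As a rough illustration of the three files, the following sketch uses Python's standard sqlite3 module. The scratch directory and the app.db file name are placeholders, not anything from the original investigation.

```python
import os
import sqlite3
import tempfile

# Hypothetical scratch location; any directory on a local filesystem works.
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "app.db")

conn = sqlite3.connect(db_path)
# The pragma returns the journal mode now in effect ("wal" on success).
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

conn.execute("CREATE TABLE notes (body TEXT)")
conn.execute("INSERT INTO notes VALUES ('hello')")
conn.commit()

# While a connection is open, the sidecar files sit next to the database.
files = sorted(os.listdir(workdir))
print(mode, files)

conn.close()
```

On a typical Linux build the listing should contain app.db, app.db-shm, and app.db-wal. The two sidecar files are normally removed again when the last connection closes cleanly.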

The -shm file is the WAL index. It is a shared memory region that all processes attached to the database use to find which frames in the WAL correspond to which database pages. Without it, every reader would have to scan the entire WAL from the beginning on every query, which would be catastrophically slow. The WAL index is a hash table and a set of read marks that allow each reader to efficiently determine what to read from the WAL versus the main file.

Critically, SQLite does not just treat the -shm file as a regular file. It memory-maps it with MAP_SHARED, so all processes see the same physical memory pages. Modifications made by one process are immediately visible to all others without any explicit read/write syscall. The entire coordination protocol is built on top of this assumption.
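The MAP_SHARED behavior can be seen without SQLite at all. This is a minimal sketch, assuming a Unix system: the scratch file merely stands in for a -shm file, and both mappings live in one process for brevity, although the same holds for two processes on one kernel.

```python
import mmap
import os
import tempfile

# A scratch file standing in for the -shm file (not a real WAL index).
fd_a, path = tempfile.mkstemp()
os.ftruncate(fd_a, mmap.PAGESIZE)
fd_b = os.open(path, os.O_RDWR)  # a second, independent descriptor

# Two separate MAP_SHARED mappings of the same inode (Unix-only flag).
view_a = mmap.mmap(fd_a, mmap.PAGESIZE, flags=mmap.MAP_SHARED)
view_b = mmap.mmap(fd_b, mmap.PAGESIZE, flags=mmap.MAP_SHARED)

view_a[0:4] = b"\x2a\x00\x00\x00"  # store through the first mapping
seen = bytes(view_b[0:4])          # load through the second, no read() call

print(seen == b"\x2a\x00\x00\x00")

view_a.close()
view_b.close()
os.close(fd_a)
os.close(fd_b)
os.remove(path)
```

The store through one mapping is visible through the other because both are backed by the same page-cache pages, which is the property SQLite's WAL index relies on.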

The Locking Model in WAL Mode

In rollback-journal mode (the default, journal_mode=DELETE), SQLite uses POSIX advisory byte-range locks on the main database file to coordinate between readers and writers. These locks are well understood, work correctly over many filesystems, and fail gracefully in recognizable ways.
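To make "advisory byte-range lock" concrete, here is a small sketch using Python's fcntl module on a scratch file. The file, the offsets, and the other_process_can_lock helper are all made up for illustration; this is not how SQLite itself is structured. A child process is used for the probe because POSIX locks never conflict within a single process.

```python
import fcntl
import subprocess
import sys
import tempfile

# Child-process probe: try a non-blocking exclusive lock on a single byte.
PROBE = """
import fcntl, sys
with open(sys.argv[1], "r+b") as f:
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, int(sys.argv[2]))
    except OSError:
        sys.exit(1)
"""

def other_process_can_lock(path: str, offset: int) -> bool:
    """Hypothetical helper: True if a separate process can lock that byte."""
    result = subprocess.run([sys.executable, "-c", PROBE, path, str(offset)])
    return result.returncode == 0

# Hypothetical scratch file; the offsets below are arbitrary.
scratch = tempfile.NamedTemporaryFile(delete=False)
scratch.write(b"\0" * 16)
scratch.flush()

fcntl.lockf(scratch, fcntl.LOCK_EX, 1, 4)              # lock one byte at offset 4
same_byte = other_process_can_lock(scratch.name, 4)    # conflicts with our lock
other_byte = other_process_can_lock(scratch.name, 5)   # a different byte is free
fcntl.lockf(scratch, fcntl.LOCK_UN, 1, 4)              # release
after_unlock = other_process_can_lock(scratch.name, 4)

print(same_byte, other_byte, after_unlock)
```

Each byte behaves as an independent lock, which is what lets a locking protocol assign different meanings to different offsets in one file.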

In WAL mode, SQLite switches to a different scheme. Byte-range locks are still used, but they are placed on the -shm file itself, not the main database file. The lock layout within the -shm file is documented in the WAL format specification:

  • Byte 120: the WRITE lock
  • Byte 121: the CKPT (checkpoint) lock
  • Byte 122: the RECOVER lock
  • Bytes 123-127: five READ locks, one per read mark slot

Processes acquire and release these locks to signal their intent. A reader grabs one of the READ lock slots and writes its current WAL position into the corresponding read mark in the shared memory. A writer checks all read marks before checkpointing to avoid overwriting WAL frames that a reader might still need.
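The layout can be written down as constants, along with a sketch of reading the read marks out of a WAL-index header. The lock offsets come from the list above. The read-mark location (five 32-bit values at bytes 100 to 119, in native byte order) and the 136-byte header size are assumptions taken from my reading of the WAL format document, so verify them before relying on this.

```python
import struct

# Lock byte offsets, as listed above.
WRITE_LOCK, CKPT_LOCK, RECOVER_LOCK = 120, 121, 122
READ_LOCKS = [123, 124, 125, 126, 127]

# Assumption (WAL format document, not this article): five 32-bit read
# marks at bytes 100..119, stored in the host's native byte order.
READ_MARK_OFFSET = 100
HEADER_SIZE = 136

def parse_read_marks(header: bytes) -> list:
    """Return the five read marks from the start of a -shm file."""
    return list(struct.unpack_from("=5I", header, READ_MARK_OFFSET))

# Synthetic header for demonstration; a real one would come from
# open(db_path + "-shm", "rb").read(HEADER_SIZE).
fake_header = bytearray(HEADER_SIZE)
struct.pack_into("=5I", fake_header, READ_MARK_OFFSET, 0, 3, 8, 0, 0)

marks = parse_read_marks(bytes(fake_header))
print(dict(zip(READ_LOCKS, marks)))
```

Pairing each READ lock byte with its read mark shows the one-to-one relationship: a reader holding the lock at byte 123 + i is vouching for the WAL position stored in mark i.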

The key constraint: this protocol only functions correctly if the shared memory region is truly shared. All processes must be looking at the same physical memory pages when they read and write the WAL index. If two processes have separate mappings of the -shm file that are not kept coherent by the kernel, the entire coordination scheme breaks. Readers will miss updates. Writers will checkpoint frames still in use. Data corruption becomes possible.

What Docker Does to the Filesystem

Docker containers on the same host share the Linux kernel. They are not separate virtual machines. This matters because mmap() coherency for MAP_SHARED mappings is a kernel guarantee, not a filesystem guarantee. When two processes on the same Linux kernel both mmap() the same inode with MAP_SHARED, the kernel ensures that writes from one process are visible to the other through their respective mappings. This is the foundation of POSIX shared memory semantics.

For Docker containers using bind mounts or named volumes backed by a local filesystem (ext4, xfs, btrfs), the inodes are the same inodes the host kernel manages. Two containers mapping the same -shm file are mapping the same kernel inode. The shared memory semantics hold.

The situation with Docker’s overlay2 storage driver is more nuanced. Overlay2 presents a layered filesystem where each container has a writable upper layer merged over shared lower layers. Files written to the upper layer are real inodes on the host filesystem, but they belong to that one container’s overlay mount. A volume mount, by contrast, bypasses the overlay entirely and reaches the host filesystem directly. The question for WAL mode is therefore whether every process that opens the database is mapping the same underlying -shm inode.

With named volumes and bind mounts, they are. The volume is mounted directly, not through the overlay. Both containers reach the same host filesystem path, the same inode, and the same kernel page cache. The MAP_SHARED coherency guarantee applies.
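One way to check the "same inode" claim is to compare device and inode numbers. The helper below is a sketch with a made-up name, demonstrated on a hard link and a copy in a scratch directory; in practice you would run the stat call against the -shm path from inside each container and compare the results.

```python
import os
import shutil
import tempfile

def same_inode(path_a: str, path_b: str) -> bool:
    """True if both paths resolve to the same (device, inode) pair."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)

# Demonstration on scratch files (hypothetical names).
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "app.db-shm")
linked = os.path.join(workdir, "linked-shm")
copied = os.path.join(workdir, "copied-shm")

with open(original, "wb") as f:
    f.write(b"\0" * 32)
os.link(original, linked)      # second name for the same inode
shutil.copy(original, copied)  # new inode with identical content

link_matches = same_inode(original, linked)
copy_matches = same_inode(original, copied)
print(link_matches, copy_matches)
```

Identical content is not enough: only the hard link shares an inode, and only a shared inode gives two mappings the same page-cache pages.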

Where It Actually Breaks

The scenario that reliably breaks SQLite WAL mode is network filesystems. NFS, CIFS/SMB, and most distributed storage backends do not provide the same MAP_SHARED coherency guarantees that local filesystems do. The SQLite documentation is explicit about this in its section on WAL mode limitations:

All processes using a database must be on the same host computer; WAL does not work over a network filesystem.

The reason is that MAP_SHARED over NFS does not guarantee that writes from one client are immediately visible through another client’s mapping. NFS has its own caching and attribute revalidation mechanisms that can introduce staleness. Two Docker containers on different hosts sharing an NFS-backed volume will not see coherent shared memory, and WAL mode will misbehave.

This also applies to popular cloud storage backends: EFS (AWS Elastic File System, which is NFS under the hood), Azure Files with NFS protocol, and most “managed NFS” offerings. People reach for these because they want a simple shared volume across containers or across availability zones, and SQLite looks attractive for its simplicity. WAL mode plus EFS is a combination that can appear to work under low concurrency and then fail under load, often without any immediate error.

There is also a subtler issue even on local filesystems. The POSIX advisory lock behavior on the -shm file depends on the VFS implementation. Certain container runtimes or volume plugins that implement their own filesystem layer (FUSE-based drivers, for instance) may not implement fcntl() byte-range locks correctly, particularly the semantics around lock inheritance after fork() and the behavior of locks held by a process that exits. SQLite handles some of these edge cases in its VFS layer, but it cannot compensate for a filesystem that lies about lock semantics.
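An application can at least refuse to enable WAL when the database sits on a filesystem type it does not trust. The sketch below parses text in the format of Linux's /proc/self/mounts. The function names and the deny-list are illustrative and incomplete, and the sample mount table is invented.

```python
import os
from typing import Optional

# Assumption: an illustrative, incomplete deny-list of filesystem types.
RISKY_FS_TYPES = {"nfs", "nfs4", "cifs", "smb3", "9p", "ceph"}

def fs_type_for(path: str, mounts_text: str) -> Optional[str]:
    """Return the fs type of the longest mount point containing path."""
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        inside = path == mount_point or path.startswith(prefix)
        # ">=" lets a later mount on the same point shadow an earlier one.
        if inside and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type

def wal_looks_risky(path: str, mounts_text: str) -> bool:
    fs_type = fs_type_for(path, mounts_text)
    return fs_type is None or fs_type in RISKY_FS_TYPES or fs_type.startswith("fuse")

# Invented sample; on Linux, read the real table from /proc/self/mounts.
SAMPLE_MOUNTS = """\
overlay / overlay rw,relatime 0 0
fs-example.invalid:/ /app/data nfs4 rw,relatime 0 0
/dev/sda1 /var/lib/local ext4 rw,relatime 0 0
"""

print(fs_type_for("/app/data/app.db", SAMPLE_MOUNTS))
print(wal_looks_risky("/var/lib/local/app.db", SAMPLE_MOUNTS))
```

A check like this is a heuristic. It cannot tell whether a given driver implements locking correctly, and it ignores the escaping that /proc uses for mount points containing spaces.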

A Concrete Failure Mode

Consider two containers, both running web servers backed by the same SQLite database in WAL mode, sharing a volume:

services:
  web1:
    image: myapp
    volumes:
      - dbdata:/app/data
  web2:
    image: myapp
    volumes:
      - dbdata:/app/data
volumes:
  dbdata:

On a local Docker host with a named volume backed by ext4, this works. Both containers map the same -shm inode through the same kernel, POSIX locks coordinate access correctly, and WAL mode functions as designed.

Now deploy this to a Kubernetes cluster where the PersistentVolumeClaim is backed by an NFS-based storage class (common in many managed Kubernetes offerings). If the pods land on the same node, it will usually still appear to work, because they share a kernel and its page cache, although locking now goes through the NFS client. If the pods land on different nodes, it is broken. The -shm mapping is not coherent across nodes, lock semantics are unreliable over NFS, and the database is likely to corrupt eventually.

This is a failure mode that appears only under specific scheduling conditions in production, which is the worst possible kind.

WAL Checkpoint Behavior Adds More Complexity

Even when the shared memory semantics are correct, WAL mode introduces checkpoint behavior that requires careful thought in a multi-container setup. The WAL file grows until a checkpoint folds it back into the main database. By default, SQLite triggers an automatic checkpoint when the WAL reaches 1000 pages. Any connection can checkpoint; there is no designated checkpoint process.

In a single-process setup, this is fine. In a multi-container setup, both containers may attempt to checkpoint simultaneously, or one container may checkpoint the WAL while the other has open read transactions that span frames the checkpoint needs to overwrite. SQLite handles this correctly through the read mark mechanism, but it means a checkpoint may silently stop short, copying only the frames up to the oldest read mark, if a long-running reader in another container holds a read mark at an early WAL position.

For applications that write frequently, this means the WAL file can grow without bound if one read transaction stays open across many writes. PRAGMA wal_checkpoint, PRAGMA wal_autocheckpoint, and PRAGMA journal_size_limit give you manual handles on this, but most application code does not reach for them.
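The effect of a lingering reader on checkpoints can be reproduced with two connections. This sketch runs both in one process for simplicity, with a scratch database and auto-checkpointing switched off so that only the explicit PRAGMA wal_checkpoint calls do any work.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "app.db")

# isolation_level=None puts the sqlite3 module in autocommit mode, so the
# explicit BEGIN/COMMIT below are the only transaction boundaries.
writer = sqlite3.connect(db_path, isolation_level=None)
writer.execute("PRAGMA journal_mode=WAL").fetchall()
writer.execute("PRAGMA wal_autocheckpoint=0").fetchall()  # manual checkpoints only
writer.execute("CREATE TABLE t (x INTEGER)")
writer.execute("INSERT INTO t VALUES (0)")

# A reader pins a snapshot by starting a read transaction and leaving it open.
reader = sqlite3.connect(db_path, isolation_level=None)
reader.execute("BEGIN")
reader.execute("SELECT count(*) FROM t").fetchall()

for i in range(1, 6):
    writer.execute("INSERT INTO t VALUES (?)", (i,))

# PASSIVE copies what it can; frames newer than the reader's mark stay put.
busy, wal_frames, checkpointed = writer.execute(
    "PRAGMA wal_checkpoint(PASSIVE)"
).fetchall()[0]
blocked = checkpointed < wal_frames

reader.execute("COMMIT")  # release the snapshot

# With no readers left, TRUNCATE can finish and reset the WAL to zero bytes.
writer.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()
wal_size_after = os.path.getsize(db_path + "-wal")

print(blocked, wal_size_after)
```

The PASSIVE checkpoint reports no error while leaving frames behind, which is why a stuck reader in another container tends to show up only as a steadily growing -wal file.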

The Practical Guidance

For local Docker volumes on a single host, WAL mode across containers sharing a volume is safe. The kernel coherency guarantee holds, POSIX locks work correctly, and the performance benefits of WAL apply.

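For any scenario where containers may run on different hosts, do not use WAL mode with a shared volume. Rollback-journal mode (journal_mode=DELETE) has weaker concurrency but does not depend on shared memory coherency. It still relies on file locking, though, which network filesystems do not always implement reliably, so treat it as the lesser risk rather than a guarantee. Better still, architect the system so that only one process owns the SQLite database and other services talk to it through an API, which is also why tools like Litestream and Turso replicate or serve the database rather than share its file across hosts.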
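Switching away from WAL has to be explicit, because the journal mode is stored in the database file and applies to every later connection. A minimal sketch with a scratch database:

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "app.db")

conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.commit()
conn.close()

# A fresh connection inherits WAL mode from the file itself.
conn = sqlite3.connect(db_path)
persisted = conn.execute("PRAGMA journal_mode").fetchone()[0]

# Switching back requires that no other connection has the database open.
switched = conn.execute("PRAGMA journal_mode=DELETE").fetchone()[0]
print(persisted, switched)

conn.close()
```

Because the setting persists, any single container that turns WAL on turns it on for every other container that opens the same file.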

For read-heavy workloads where you genuinely need multiple readers but can tolerate a single writer, WAL mode on a local volume with one writer container and multiple reader containers is a reasonable architecture. The reader containers open the database read-only, which in WAL mode means they acquire a READ lock slot, read their data, and release it. No writes, no checkpointing from their side. Note that even a read-only connection takes locks on the -shm file and updates its read mark, so the volume itself should normally stay mounted read-write in the reader containers.
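A read-only connection can be requested with SQLite's URI syntax. In this sketch the writer and reader share a process and a scratch database for brevity; in the architecture described above they would be separate containers.

```python
import os
import sqlite3
import tempfile
from pathlib import Path

db_path = Path(tempfile.mkdtemp()) / "app.db"

writer = sqlite3.connect(str(db_path))
writer.execute("PRAGMA journal_mode=WAL")
writer.execute("CREATE TABLE t (x INTEGER)")
writer.execute("INSERT INTO t VALUES (1)")
writer.commit()

# mode=ro is SQLite URI syntax: this connection may read but never write.
reader = sqlite3.connect(db_path.as_uri() + "?mode=ro", uri=True)
rows = reader.execute("SELECT x FROM t").fetchall()

try:
    reader.execute("INSERT INTO t VALUES (2)")
    write_rejected = False
except sqlite3.OperationalError:
    write_rejected = True
    reader.rollback()  # clear the implicit transaction the module opened

print(rows, write_rejected)
```

Opening read-only at the connection level makes the single-writer rule something SQLite enforces, not just a convention between services.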

The broader lesson is that SQLite’s simplicity is real, but it is bounded by assumptions about the process model and filesystem semantics that container orchestration can violate. WAL mode is not a drop-in upgrade when the database is shared across process boundaries that span more than one kernel. The -shm file is not an implementation detail; it is the mechanism, and understanding what it requires is the difference between a system that works and one that fails in ways that are hard to reproduce and harder to diagnose.
