SQLite WAL Mode and Docker Volumes: Why the -shm File Is the Thing That Actually Matters

Source: simonwillison

Simon Willison recently documented his findings about SQLite’s WAL mode working across Docker containers sharing a volume, and the results are worth examining from first principles rather than just accepting the conclusion. Whether this works, and more importantly when it stops working, depends entirely on what SQLite’s coordination primitives actually require from the operating system.

Three Files, One Database

When you enable WAL mode on a SQLite database, you get three files instead of one:

  • database.db: the main database file, updated only during checkpoints
  • database.db-wal: the write-ahead log, where new writes land first
  • database.db-shm: the shared memory index, a 32KB coordination file

The WAL file gets most of the attention in documentation and tutorials. Writes append to it, readers check it before falling back to the main database, and periodic checkpoints merge the WAL’s contents back. But the -shm file is where inter-process coordination actually happens, and it is the one that determines whether sharing a SQLite database across multiple containers is safe.

Enable WAL mode with:

PRAGMA journal_mode=WAL;

This setting is persistent. Once a database is in WAL mode, it stays in WAL mode across connections and process restarts until explicitly changed back.
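A minimal sketch of that persistence, using Python’s standard sqlite3 module. The function name, the notes table, and the temporary directory are illustrative assumptions, not part of the original write-up. It enables WAL on one connection, closes it, and then checks what a fresh connection reports.

```python
import os
import sqlite3
import tempfile


def wal_persistence_demo():
    """Enable WAL once, then confirm a fresh connection still reports it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "database.db")

        # First connection switches the database file into WAL mode.
        first = sqlite3.connect(path)
        mode_set = first.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        first.execute("CREATE TABLE notes (body TEXT)")
        first.commit()
        first.close()

        # Second connection never issues the PRAGMA with a value,
        # it only asks which mode is active.
        second = sqlite3.connect(path)
        mode_seen = second.execute("PRAGMA journal_mode").fetchone()[0]
        second.execute("INSERT INTO notes VALUES ('hello')")
        second.commit()

        # While a connection is open, the -wal and -shm files sit
        # next to the main database file.
        files_while_open = sorted(os.listdir(tmp))
        second.close()
        return mode_set, mode_seen, files_while_open
```

On a local filesystem both mode values should come back as wal, and the directory listing should show all three files while the second connection is still open.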

What the -shm File Contains

The SQLite WAL documentation describes the -shm file as a “shared memory” file. SQLite memory-maps it into every process that opens the database in WAL mode. The file holds the WAL index (32KB for typical databases, extended in further 32KB blocks as the WAL grows): a data structure that records how many frames are in the WAL, which frames represent committed transactions, and read marks that prevent the checkpointer from reclaiming frames still needed by active readers.

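Critically, it also contains a locking region. On Unix, SQLite uses POSIX byte-range advisory locks via fcntl(F_SETLK) on specific offsets within the -shm file to coordinate concurrent access:

  • Byte 120: write lock (exclusive, held by the single active writer)
  • Byte 121: checkpoint lock (exclusive, held by a running checkpointer)
  • Byte 122: recovery lock (held while the WAL index is being rebuilt)
  • Bytes 123-127: five read-mark locks (taken as shared locks, so any number of readers can hold the same slot)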

This is the locking protocol that makes WAL mode’s concurrency model work: multiple readers can proceed simultaneously, and a single writer can run concurrently with readers, because the locks are fine-grained and non-blocking for reads. The question for Docker is whether these primitives retain their semantics across container boundaries.
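The concurrency model described above can be observed directly. The sketch below is an illustration under assumptions (the events table and function name are made up for the example): a reader holds a transaction open while a writer commits on a second connection, and the writer is not blocked.

```python
import os
import sqlite3
import tempfile


def snapshot_demo():
    """Show a writer committing while a reader holds an open transaction."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "app.db")

        # isolation_level=None puts the connection in autocommit mode, so
        # transactions start and end only where we say so explicitly.
        writer = sqlite3.connect(path, isolation_level=None)
        writer.execute("PRAGMA journal_mode=WAL")
        writer.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, body TEXT)")
        writer.execute("INSERT INTO events (body) VALUES ('first')")

        reader = sqlite3.connect(path, isolation_level=None)
        reader.execute("BEGIN")
        before = reader.execute("SELECT COUNT(*) FROM events").fetchone()[0]

        # The reader's transaction is still open; in WAL mode this write
        # proceeds anyway.
        writer.execute("INSERT INTO events (body) VALUES ('second')")

        # The reader keeps seeing its snapshot until it ends the transaction.
        during = reader.execute("SELECT COUNT(*) FROM events").fetchone()[0]
        reader.execute("COMMIT")
        after = reader.execute("SELECT COUNT(*) FROM events").fetchone()[0]

        reader.close()
        writer.close()
        return before, during, after
```

The function returns (1, 1, 2): the reader’s view stays fixed at one row for the life of its transaction and picks up the second row afterwards.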

Containers Are Just Processes

Docker containers are Linux processes with namespace isolation layered on top. Network namespaces, PID namespaces, mount namespaces, and user namespaces all separate them from the host and from each other. But they all run on the same kernel. There is no hypervisor between a container and the host kernel, unlike a virtual machine.

When you mount a named Docker volume or a host bind-mount into multiple containers, those containers access the same underlying filesystem through the mount. Named volumes on a single host live under /var/lib/docker/volumes/ on the host filesystem. Bind mounts expose host directories directly. Either way, the containers are reading and writing to the same inodes on the same physical storage.

Two properties follow from this that matter for SQLite WAL:

The page cache is shared. When any process memory-maps a file on a local filesystem, the kernel maps physical page cache frames into the process’s virtual address space. All processes on the same host that map the same file share the same physical frames. There is no copying, no replication, no separate coherence protocol. A write to the mmap’d region by one process is immediately visible to all other processes that have mapped the same file, because they are pointing at the same memory. Docker containers on the same host share this property entirely.
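The shared-mapping behaviour is easy to see with two ordinary processes, which is all containers on one host are. This sketch assumes a Unix host and a local temporary file standing in for the -shm file; the function name is hypothetical. A child process maps the file on its own and writes to it, and the parent reads the bytes back through its separate mapping.

```python
import mmap
import os
import tempfile


def shared_mapping_demo():
    """Write through one process's mapping, read through another's."""
    with tempfile.NamedTemporaryFile() as scratch:
        scratch.write(b"\0" * mmap.PAGESIZE)
        scratch.flush()

        parent_fd = os.open(scratch.name, os.O_RDWR)
        # On Unix, mmap.mmap defaults to a MAP_SHARED mapping.
        parent_map = mmap.mmap(parent_fd, mmap.PAGESIZE)

        pid = os.fork()
        if pid == 0:
            # Child: open and map the file independently, then write.
            child_fd = os.open(scratch.name, os.O_RDWR)
            child_map = mmap.mmap(child_fd, mmap.PAGESIZE)
            child_map[0:5] = b"hello"
            os._exit(0)

        os.waitpid(pid, 0)
        # The parent never called read(); it sees the child's write
        # because both mappings refer to the same page cache pages.
        seen = bytes(parent_map[0:5])
        parent_map.close()
        os.close(parent_fd)
        return seen
```

On a local filesystem this returns b"hello". The same experiment across two hosts on a network filesystem carries no such guarantee.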

POSIX advisory locks work across container boundaries. The kernel tracks fcntl locks per owning process and per inode. The owner is the kernel’s own record of the process, not the PID number a container sees inside its PID namespace, so namespace isolation does not change how fcntl locks are managed; the kernel’s lock manager is not namespaced. When a container process holds an exclusive lock on byte 120 of the -shm file, a conflicting F_SETLK request from a process in another container is refused, exactly as it would be between two processes on the same bare-metal host.
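A rough illustration of byte-range lock conflicts between processes, assuming a Unix host. It uses a scratch file rather than a real -shm file, and the helper names are hypothetical. Python’s fcntl.lockf wraps the same fcntl() locking calls; byte 120 is the write-lock offset mentioned earlier, and byte 121 is used only as an example of a neighbouring, unlocked byte.

```python
import fcntl
import os
import tempfile

WRITE_LOCK_BYTE = 120  # offset taken from the text above


def try_exclusive_lock(fd, offset):
    """Try a non-blocking exclusive lock on one byte; True on success."""
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, offset, os.SEEK_SET)
        return True
    except OSError:
        return False


def child_can_lock(path, offset):
    """Fork a child that tries to lock `offset`; report whether it got it."""
    pid = os.fork()
    if pid == 0:
        fd = os.open(path, os.O_RDWR)
        os._exit(0 if try_exclusive_lock(fd, offset) else 1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0


def lock_conflict_demo():
    with tempfile.NamedTemporaryFile() as scratch:
        scratch.write(b"\0" * 128)
        scratch.flush()
        fd = os.open(scratch.name, os.O_RDWR)
        try:
            # Parent takes the lock; fcntl locks are not inherited by fork.
            assert try_exclusive_lock(fd, WRITE_LOCK_BYTE)
            same_byte = child_can_lock(scratch.name, WRITE_LOCK_BYTE)
            other_byte = child_can_lock(scratch.name, WRITE_LOCK_BYTE + 1)
            return same_byte, other_byte
        finally:
            os.close(fd)
```

lock_conflict_demo() returns (False, True): the second process is refused on the locked byte but succeeds on the byte next to it, which is the fine-grained behaviour the WAL protocol relies on.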

A simple Docker Compose configuration that shares a SQLite database between two services:

services:
  writer:
    image: my-app
    volumes:
      - db-data:/data
    environment:
      DATABASE_PATH: /data/app.db

  reader:
    image: my-app
    volumes:
      - db-data:/data
    environment:
      DATABASE_PATH: /data/app.db

volumes:
  db-data:

Both services see the same named volume. On a single Docker host, this maps to the same directory on the host filesystem, with shared page cache and shared lock management.
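Inside each service, the application code might open the shared file along these lines. This is a sketch under assumptions: open_shared_db is a hypothetical helper, the five-second busy timeout is an arbitrary choice, and only the DATABASE_PATH variable comes from the Compose file above.

```python
import os
import sqlite3
from pathlib import Path


def open_shared_db(read_only=False):
    """Open the database named by DATABASE_PATH for this container."""
    path = os.environ.get("DATABASE_PATH", "/data/app.db")
    if read_only:
        # A file: URI with mode=ro asks SQLite for a read-only connection.
        uri = Path(path).resolve().as_uri() + "?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
    else:
        conn = sqlite3.connect(path)
        # Harmless if the database is already in WAL mode.
        conn.execute("PRAGMA journal_mode=WAL")
    # Wait up to 5 seconds for a competing lock instead of failing at once.
    conn.execute("PRAGMA busy_timeout=5000")
    return conn
```

The writer service would call open_shared_db() and the reader service open_shared_db(read_only=True). Even a read-only connection needs access to the -shm file, so the reader’s mount generally has to allow writes to the directory.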

The Network Filesystem Boundary

The model breaks completely at the network filesystem boundary. If the Docker volume is backed by NFS, CIFS, AWS EFS, or any network-attached storage, the -shm file’s mmap no longer provides shared memory in any meaningful sense.

On NFS, each host has its own page cache. When process A on host 1 writes to its mmap of the -shm file, that write lives in host 1’s page cache. Process B on host 2 has its own mapping backed by host 2’s page cache. The NFS client will eventually flush dirty pages to the server and eventually fetch updates from the server, but “eventually” is not the synchronous coherence that SQLite requires. Processes on different hosts can disagree about which WAL frames are committed and which read marks are in use, because the index contents they see in their respective page caches have diverged.

SQLite’s own WAL documentation is blunt about this. Among the disadvantages of WAL mode, it states that all processes using a database must be on the same host computer, and that WAL does not work over a network filesystem.

NFSv4 added a stateful lock manager that improves on NFSv3’s notoriously unreliable advisory locks, but even NFSv4 locking is not equivalent to local POSIX locking in terms of reliability and performance characteristics. The coherence requirements of SQLite’s WAL index are strict enough that the risk of silent corruption on NFS is real rather than theoretical.
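One defensive measure is to check what kind of filesystem backs the database path before enabling WAL. The sketch below is an assumption-laden illustration for Linux: it parses /proc/mounts-style text, the function names are hypothetical, and the set of filesystem types is a sample rather than an exhaustive list (EFS, for instance, normally appears as nfs4).

```python
import os

# Illustrative, not exhaustive.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "smb3", "ceph", "fuse.glusterfs"}


def filesystem_type(path, mounts_text):
    """Return the fstype of the longest mount point containing `path`."""
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        # Compare with trailing slashes so /data does not match /database.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_network_filesystem(path, mounts_file="/proc/mounts"):
    """True if `path` appears to live on one of the listed network types."""
    with open(mounts_file) as f:
        return filesystem_type(path, f.read()) in NETWORK_FS_TYPES
```

An application could call is_network_filesystem(db_path) at startup and refuse to switch to WAL mode when it returns True. The check does not resolve symlinks or decode escaped spaces in mount points, so treat it as a guard rail, not a proof.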

What Breaks in Kubernetes

Kubernetes introduces a wrinkle that the single-host Docker case avoids. A Deployment with two replicas does not guarantee that both pods land on the same node. Kubernetes schedules pods across nodes based on resource availability and affinity rules. If two pods sharing a SQLite WAL database end up on different nodes, the correctness of the whole arrangement depends entirely on what storage backend is underneath the volume.

ReadWriteMany (RWX) persistent volumes, which allow multiple pods on multiple nodes to mount the same volume simultaneously, are almost always backed by a network filesystem: NFS, CephFS, GlusterFS, AWS EFS. That is the only practical way to make a volume writable from multiple nodes at once. Using SQLite WAL mode on an RWX volume is not safe.

ReadWriteOnce (RWO) volumes, typically backed by block storage (EBS, GCE PD, a local disk), are safer, but only if all pods using the volume run on the same node. Kubernetes does not guarantee this by default. A pod restart might reschedule the pod to a different node, which would fail to mount the RWO volume anyway (since another node holds it), but the failure mode is at least explicit rather than silent corruption.

The reliable Kubernetes pattern for SQLite is a StatefulSet with a single replica and a RWO volume. One pod, one node, one writer at a time. If other components need to read the data, they communicate with that pod over a network protocol rather than sharing the file directly.

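apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sqlite-app
spec:
  replicas: 1  # one pod, so every process sharing the file is on one node
  serviceName: sqlite-app
  selector:
    matchLabels:
      app: sqlite-app
  template:
    metadata:
      labels:
        app: sqlite-app
    spec:
      containers:
      - name: app
        image: my-app
        volumeMounts:
        - name: db-storage
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: db-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi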

With replicas: 1, you get the WAL concurrency benefits within that single pod’s processes without exposing yourself to the multi-node coherence problem.

Alternatives for Multi-Process Replication

Litestream takes a different approach that sidesteps the shared-file problem entirely. It runs as a sidecar alongside your SQLite process, reads the WAL stream, and replicates it continuously to S3-compatible object storage or SFTP targets. Recovery happens by fetching the replicated WAL and restoring it. Litestream never shares the live database file across multiple simultaneous writers; it is purely a disaster recovery and read-replica mechanism rather than a live multi-writer solution.

rqlite takes yet another approach: it wraps SQLite in a Raft consensus layer, giving you a distributed SQLite cluster where all writes go through the Raft leader and are replicated to followers before committing. This is genuinely distributed, but it adds significant operational complexity and network round-trip latency to every write.

For the common case of a web application with one writer and occasional background workers on the same Docker host, none of this complexity is necessary. WAL mode on a shared local volume works, the kernel semantics guarantee it, and the concurrency characteristics of WAL mode (one writer, many readers, no blocking between them) are well-suited to typical web workloads.

The Practical Baseline

For a single Docker host with containers sharing a named volume or bind mount, SQLite in WAL mode works correctly. The shared page cache and unified POSIX lock management are what make it work, and those properties hold as long as you stay on one host with a local filesystem. You can verify the mode is active on any connection with PRAGMA journal_mode;, which returns wal when WAL mode is enabled.

The boundary condition is worth internalizing now rather than discovering through a corruption incident later: local filesystem on a single host, safe; network filesystem at any scale, unsafe. Knowing exactly why the single-host case works is what lets you recognize immediately when a configuration change, a cloud migration, or a Kubernetes deployment has moved you out of the territory where the guarantees hold.
