Sub-Millisecond VM Sandboxes: How CoW Memory Forking Makes Firecracker Startup Nearly Free
Source: hackernews
The Warm-State Sandboxing Pattern Has a Long History
The idea that you should pay the initialization cost once and amortize it across many executions predates microVMs by decades. Apache’s prefork model, which dates to the mid-1990s and later became the prefork MPM in Apache 2.x, did exactly this: a parent process loaded the configuration and modules, then called fork() to spawn workers. Each worker inherited the parent’s virtual address space via copy-on-write semantics, meaning the kernel did not actually copy a page until a worker wrote to it. FastCGI process managers took a similar approach for scripting languages that had expensive startup paths. PHP-FPM’s pools of pre-forked workers follow the same “warm pool” model, where a set of workers sits ready to handle requests without paying the startup cost again.
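The prefork pattern can be sketched in a few lines. This is a minimal illustration, not how Apache or any FastCGI manager is implemented: the helper name run_in_forked_worker and the warm dictionary are hypothetical, and the pipe exists only so the parent can inspect what the child saw.

```python
import os
import pickle


def run_in_forked_worker(warm_state, mutate):
    """Fork a worker that inherits warm_state, apply mutate in the child,
    and return the child's view of the state to the parent."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: shares the parent's pages until it writes to them.
        try:
            os.close(read_fd)
            mutate(warm_state)
            with os.fdopen(write_fd, "wb") as out:
                pickle.dump(warm_state, out)
        finally:
            os._exit(0)  # never fall back into the parent's code path
    # Parent: read the child's result, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as src:
        child_view = pickle.load(src)
    os.waitpid(pid, 0)
    return child_view


# "Expensive" initialization, paid once in the parent (hypothetical state).
warm = {"config": "loaded", "requests_served": 0}


def handle_request(state):
    state["requests_served"] += 1


child_view = run_in_forked_worker(warm, handle_request)
print(child_view["requests_served"], warm["requests_served"])
```

The child sees its own modification, while the parent’s copy is untouched: the write landed on a private copy of the page, which is the property the rest of this article relies on.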
What zeroboot announced on March 17, 2026 is the same insight applied one level deeper in the stack, at the microVM boundary rather than the process boundary. Instead of forking a process with Python already imported, it forks the entire virtual machine state with Python and numpy already loaded into guest memory. The result is VM startup in under 1ms, which is fast enough to treat a Firecracker VM as a request-scoped sandbox rather than a warm pool entry.
Copy-on-Write at the MMU Level
To understand why this works, you need a clear picture of what the kernel and the MMU do when a process calls mmap() with MAP_PRIVATE. The Linux mmap(2) man page describes MAP_PRIVATE as creating “a private copy-on-write mapping,” but that description understates the mechanism. When you map a file with MAP_PRIVATE, the kernel does not read the file into physical memory immediately. It records the mapping and fills in page table entries lazily, on first access, pointing them at the file’s page cache pages and marking them read-only. Those pages are shared, directly, with every other process that has the same file mapped.
When any mapping holder writes to one of those pages, the MMU raises a page fault. The kernel’s fault handler allocates a fresh physical page, copies the contents of the shared page into it, updates the faulting process’s page table to point at the new private page, and returns. From that point forward the process has its own private copy of that page; the shared version in the page cache is unmodified. Processes that never write to a page never trigger this copy, and consume no private physical memory for it beyond the page table entries themselves; the clean page exists once, in the page cache.
This is why MAP_PRIVATE is the right primitive for zeroboot’s approach. The snapshot’s memory.bin file is mapped MAP_PRIVATE. A new KVM VM is created, and its guest physical address space is backed by that mapping. Pages the guest never writes to are shared, at no extra cost, with the snapshot file’s pages in the kernel page cache. Pages the guest does write to are faulted in as private copies. For a short-lived code execution sandbox, the fraction of pages written is small; most of guest memory is the Python and numpy runtime, which the executing code reads but does not modify.
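The MAP_PRIVATE behavior described above can be observed from user space without any VM involved. The sketch below uses a small temporary file as a stand-in for memory.bin (an assumption for illustration; a real snapshot is hundreds of megabytes) and maps it twice, as two sandboxes would.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

# Stand-in for a snapshot memory file: two pages of known content.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"A" * PAGE + b"B" * PAGE)
    snapshot_path = f.name


def map_private(path):
    """Map the file copy-on-write; writes stay private to this mapping."""
    with open(path, "rb") as fh:  # read-only access is enough for MAP_PRIVATE
        return mmap.mmap(fh.fileno(), 0, flags=mmap.MAP_PRIVATE,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)


sandbox_a = map_private(snapshot_path)
sandbox_b = map_private(snapshot_path)

# Dirty one page in sandbox A; this triggers a private copy of that page.
sandbox_a[0:5] = b"DIRTY"

a_first = bytes(sandbox_a[0:5])
b_first = bytes(sandbox_b[0:5])
with open(snapshot_path, "rb") as fh:
    file_first = fh.read(5)

print(a_first, b_first, file_first)

sandbox_a.close()
sandbox_b.close()
os.unlink(snapshot_path)
```

Only the first mapping sees the write. The second mapping and the file on disk still hold the original bytes, which is why one snapshot file can back any number of sandboxes.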
How Firecracker Snapshot Files Are Structured
Firecracker’s snapshot and restore documentation describes two files produced by the snapshot API: a vmstate file and a memory file. The vmstate file is compact; it contains the serialized state of all virtualized hardware, including vCPU register state (general-purpose registers, control registers, MSRs, FPU state), device state for virtio devices, the KVM clock, interrupt controller state, and the guest’s GDT and IDT. For a minimal Firecracker VM this is on the order of a few hundred kilobytes.
The memory file is the full guest physical memory dump. For a Python process with numpy loaded, this might be 300-500 MB of guest RAM. What zeroboot’s trick depends on is that this file does not need to be read on startup; it only needs to be mapped. The KVM_SET_USER_MEMORY_REGION ioctl, documented in the KVM API reference, is what wires this together. It tells KVM to use a range of host virtual memory as the backing for a range of guest physical addresses. When you pass a MAP_PRIVATE mapping of memory.bin as the userspace_addr, KVM uses those pages as guest RAM, and the CoW semantics flow through transparently. KVM does not know or care that the host pages are backed by a file with copy-on-write semantics; it just sees host virtual addresses, and the MMU handles the rest.
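To make the ioctl concrete, here is a sketch of how the argument to KVM_SET_USER_MEMORY_REGION is laid out. The struct layout and the ioctl number follow the Linux kvm.h header on x86-64 and arm64; the helper names and the example addresses are hypothetical, and register_guest_memory is shown for shape only, since it needs a VM file descriptor obtained from /dev/kvm via KVM_CREATE_VM, which is not done here.

```python
import fcntl
import struct

# struct kvm_userspace_memory_region from <linux/kvm.h>:
#   __u32 slot; __u32 flags; __u64 guest_phys_addr;
#   __u64 memory_size; __u64 userspace_addr;
REGION_FORMAT = "=IIQQQ"
REGION_SIZE = struct.calcsize(REGION_FORMAT)

KVMIO = 0xAE
_IOC_WRITE = 1


def _iow(type_, nr, size):
    """Encode an _IOW ioctl number (generic Linux encoding, as on x86 and arm64)."""
    return (_IOC_WRITE << 30) | (size << 16) | (type_ << 8) | nr


KVM_SET_USER_MEMORY_REGION = _iow(KVMIO, 0x46, REGION_SIZE)


def pack_memory_region(slot, guest_phys_addr, memory_size, userspace_addr, flags=0):
    """Serialize one guest memory slot description."""
    return struct.pack(REGION_FORMAT, slot, flags,
                       guest_phys_addr, memory_size, userspace_addr)


def register_guest_memory(vm_fd, region):
    """Hand the region to KVM. vm_fd must come from KVM_CREATE_VM (not shown)."""
    fcntl.ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, region)


# Hypothetical example: 512 MiB of guest RAM at guest physical address 0,
# backed by a host mapping assumed to live at 0x7F0000000000.
example = pack_memory_region(slot=0, guest_phys_addr=0,
                             memory_size=512 << 20,
                             userspace_addr=0x7F0000000000)
print(hex(KVM_SET_USER_MEMORY_REGION), len(example))
```

Nothing in the struct says whether userspace_addr is anonymous memory or a MAP_PRIVATE file mapping, which is the point of the paragraph above: KVM only sees a host virtual address range.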
Restore then consists of loading vmstate, replaying the device and vCPU state via the appropriate KVM ioctls, and calling KVM_RUN. The first instruction the guest executes is whatever was at the vCPU’s RIP when the snapshot was taken. Because the process was captured mid-execution with the Python runtime fully initialized and numpy already imported, the guest resumes without any initialization work.
Comparison With CRIU
CRIU (Checkpoint/Restore In Userspace) solves an adjacent problem: checkpointing and restoring Linux processes rather than VMs. It operates at the OS level, dumping all process state including memory maps, file descriptors, sockets, pipes, and the kernel-side objects those descriptors reference. The restore side reconstructs this state by re-creating the process hierarchy, re-opening files at the right offsets, re-establishing pipes and socket pairs, and mapping memory back from dump files.
CRIU is impressive engineering, but it has important differences from the zeroboot approach. First, CRIU restores a specific process snapshot, so each restore produces a distinct process with its own identity; there is no sharing of physical pages between restored instances unless you deliberately use userfaultfd or some other mechanism. Second, CRIU’s restore path is slow relative to what zeroboot claims; practical CRIU restore times are in the hundreds of milliseconds to low seconds range for real workloads, depending on memory size and file descriptor complexity. Third, CRIU operates on processes, not VMs, so you do not get the hardware isolation that Firecracker’s KVM boundary provides. For untrusted code execution, process isolation is insufficient; you want the guest kernel barrier.
The critical difference is that CRIU actually restores state, while zeroboot’s approach maps shared state and only copies pages on write. That distinction is what produces sub-millisecond startup.
Lambda SnapStart and the Same Idea at AWS Scale
AWS Lambda SnapStart, announced in late 2022 for Java functions, uses Firecracker snapshots in a similar way. A Lambda function is initialized, a snapshot is taken after initialization completes (the equivalent of after Python and numpy are loaded), and subsequent cold starts restore from that snapshot rather than initializing from scratch. AWS reported cold start times dropping from multiple seconds to under 200ms for heavily initialized Java runtimes.
The architectural difference is that Lambda SnapStart makes a new copy of the snapshot for each restore. It is not using CoW page sharing between concurrent invocations; each invocation gets its own full restore. This is partly a consequence of the isolation model (each Lambda invocation must be fully isolated from others) and partly because Lambda’s restore path was optimized for latency rather than memory efficiency. The SnapStart approach still gets significant benefit from skipping initialization work, but it does not compress memory usage across concurrent invocations the way that shared CoW pages would.
Zeroboot’s approach potentially does get that compression, and it does not depend on after-the-fact page deduplication to do so: every VM maps the same snapshot file, so the sharing is there from the start. Pages that no invocation has dirtied remain shared in the kernel’s page cache, so running ten concurrent sandboxes does not require ten copies of the Python runtime in physical RAM.
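A back-of-the-envelope comparison shows the size of the effect. The function below is a simplified model, not a measurement: it assumes every clean page is resident exactly once in the page cache and that each sandbox dirties the same amount of memory. The 400 MB and 2 MB figures are the illustrative numbers this article uses when discussing dirty page tracking.

```python
def fleet_memory_mb(snapshot_mb, dirty_mb_per_vm, vm_count):
    """Return (full-copy footprint, shared-CoW footprint) in MB."""
    full_copies = snapshot_mb * vm_count
    # One shared copy of the snapshot plus each VM's private dirty pages.
    shared_cow = snapshot_mb + dirty_mb_per_vm * vm_count
    return full_copies, shared_cow


full, shared = fleet_memory_mb(snapshot_mb=400, dirty_mb_per_vm=2, vm_count=10)
print(full, shared)  # 4000 420
```

Under these assumptions ten sandboxes cost roughly one snapshot plus their dirty pages, rather than ten snapshots.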
gVisor and the Overhead Trade
gVisor takes a fundamentally different approach to sandboxed execution. Rather than a hardware VM, gVisor interposes on system calls in userspace: a user-space kernel (the Sentry) intercepts every syscall the application makes and services it itself, so the application never talks to the host kernel directly. The isolation comes from restricting what syscalls the Sentry itself can make, with interception implemented by a platform such as ptrace or KVM.
gVisor’s cold start is fast because there is no VM to boot; you just start the Sentry process and exec the guest binary. The trade-off is runtime overhead from syscall interposition, which can be significant for I/O-heavy workloads. Firecracker with CoW snapshot restore gets hardware-level isolation at near-zero startup cost, and it pays no ongoing interposition overhead, because the guest kernel handles syscalls directly through KVM’s normal virtualization path. For workloads that are syscall-intensive, that ongoing overhead difference matters considerably.
ASLR, TLB Shootdowns, and Dirty Page Tracking
Several implementation details in zeroboot’s approach deserve attention because they affect both correctness and performance.
ASLR is the first complication. KVM_SET_USER_MEMORY_REGION takes a host virtual address, so the address registered with KVM must be the one where memory.bin is actually mapped in the restoring process. With ASLR enabled, mmap() will generally return a different address in each process, so a restorer cannot assume the address that was in use when the snapshot was taken; registering a stale address would cause KVM to map the wrong memory. The simplest fix is to register whatever address mmap() returns when the new VM is created, since the guest only ever sees guest physical addresses. If the restore path does depend on a fixed host layout, the options are to disable ASLR for the restore process via personality(ADDR_NO_RANDOMIZE), or to call mmap() with a fixed address hint and verify that the kernel honored it.
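The hint-and-verify option looks roughly like this. The sketch calls libc’s mmap through ctypes because Python’s mmap module does not expose the address argument. It maps anonymous memory rather than a snapshot file to stay self-contained, and the hint address is an arbitrary value assumed to be free in the current process.

```python
import ctypes
import mmap

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.munmap.restype = ctypes.c_int
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

MAP_FAILED = ctypes.c_void_p(-1).value


def map_at_hint(hint, length):
    """Ask the kernel for a mapping at `hint`; report whether it complied.

    Without MAP_FIXED the address is only a hint, so the caller must check.
    """
    addr = libc.mmap(hint, length,
                     mmap.PROT_READ | mmap.PROT_WRITE,
                     mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS, -1, 0)
    if addr is None or addr == MAP_FAILED:
        raise OSError(ctypes.get_errno(), "mmap failed")
    return addr, addr == hint


HINT = 0x10_0000_0000            # hypothetical, page-aligned address
LENGTH = 16 * mmap.PAGESIZE

addr, honored = map_at_hint(HINT, LENGTH)
print(hex(addr), honored)
libc.munmap(addr, LENGTH)
```

A restorer that needs the exact address should treat honored == False as an error rather than continuing with a mapping at an unexpected location.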
TLB shootdowns are a performance concern for multi-vCPU VMs. When a guest writes to a page and the kernel replaces the shared page with a private copy, the kernel must invalidate TLB entries on every CPU that might have cached the old mapping for that VM. For a single-vCPU Firecracker VM, the common case for small Python sandboxes, this is rarely an issue. For multi-vCPU VMs, the cost of TLB invalidation scales with the number of CPUs involved, and frequent first writes to shared pages can produce measurable overhead from shootdown interrupts.
Dirty page tracking is related. KVM can track which guest physical pages have been written via the KVM_GET_DIRTY_LOG ioctl, which relies on the kernel setting page table entries as read-only and using write-protect faults to record modifications. This mechanism is primarily used for live migration, but it is relevant here because understanding which pages a typical sandbox execution dirties is what determines the memory amplification factor. If a sandbox execution writes to 2 MB of pages out of 400 MB of guest RAM, the CoW overhead is minimal. If it writes to 100 MB, the benefit is much smaller. The actual dirty fraction depends heavily on the workload, and profiling it with KVM dirty logging is the right way to characterize the savings.
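KVM_GET_DIRTY_LOG returns a bitmap with one bit per guest page, so turning a dirty log into a dirty fraction is a bit-counting exercise. The sketch below works on a synthetic bitmap rather than one read from KVM, and assumes 4 KiB guest pages; the function names are illustrative.

```python
PAGE_SIZE = 4096  # assumed guest page size in bytes


def count_dirty_pages(bitmap):
    """Count set bits in a dirty bitmap (one bit per guest page)."""
    return sum(bin(byte).count("1") for byte in bitmap)


def dirty_fraction(bitmap, guest_ram_bytes):
    """Fraction of guest RAM that was written, according to the bitmap."""
    return count_dirty_pages(bitmap) * PAGE_SIZE / guest_ram_bytes


guest_ram = 400 * 1024 * 1024            # 400 MiB of guest RAM
total_pages = guest_ram // PAGE_SIZE     # 102400 pages

# Synthetic bitmap: the first 512 pages (2 MiB) are dirty, the rest clean.
bitmap = b"\xff" * (512 // 8) + b"\x00" * ((total_pages - 512) // 8)

dirty_pages = count_dirty_pages(bitmap)
fraction = dirty_fraction(bitmap, guest_ram)
print(dirty_pages, fraction)
```

With 2 MiB dirtied out of 400 MiB, the sandbox privately owns about half a percent of its RAM; the rest stays shared with the snapshot.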
Fork-Based Servers and Unikernels as Adjacent Points in the Design Space
Zeroboot’s approach sits in an interesting part of the sandboxing design space. Fork-based servers (Apache prefork, Unicorn for Ruby, Gunicorn for Python) share physical memory via OS fork and get copy-on-write for free, but provide only process-level isolation. The security boundary is weak enough that any kernel exploit reachable from a worker process breaks the sandbox entirely. Unikernels like MirageOS or Unikraft go the other direction: they compile application and OS into a single image with minimal attack surface, accept a hardware VM boundary, but typically pay a boot cost because there is no pre-warmed state to restore. Neither approach combines hardware isolation with copy-on-write memory sharing across concurrent instances.
Zeroboot occupies that gap: Firecracker provides hardware isolation via KVM, and MAP_PRIVATE snapshot mapping provides CoW memory sharing. The key insight is that the snapshot file is an immutable artifact that the kernel can safely share across any number of mmap consumers, and KVM does not require exclusive ownership of the host memory it backs guest RAM with. The hardware isolation boundary does not care whether the backing pages are private or shared on the host side; from the guest’s perspective, it has exclusive access to its RAM.
What This Pattern Enables
The practical implication of sub-millisecond VM startup is that the latency budget for serverless code execution changes. When a Firecracker VM takes 125ms to boot, you have to maintain warm pools and pre-provision capacity to hide that cost from user-facing latency. When startup takes 0.5ms, you can afford to create a new VM per request and still keep total execution latency in a range that is acceptable for interactive workloads. The pool management complexity, the bin-packing problem of deciding how many warm VMs to keep alive, and the resource waste of idle warmed VMs all shrink in proportion to how cheap startup becomes.
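The shift in the latency budget is easy to quantify. The calculation below uses the 125ms and 0.5ms startup figures from the paragraph above and an assumed 50ms of actual code execution per request, a number chosen purely for illustration.

```python
def startup_share(startup_ms, execution_ms):
    """Fraction of total request latency spent starting the sandbox."""
    return startup_ms / (startup_ms + execution_ms)


EXECUTION_MS = 50.0  # hypothetical per-request execution time

cold_boot = startup_share(125.0, EXECUTION_MS)
cow_restore = startup_share(0.5, EXECUTION_MS)
print(round(cold_boot, 3), round(cow_restore, 4))
```

Under that assumption a cold boot accounts for roughly 71% of the request’s latency, while a CoW restore accounts for about 1%, which is why a per-request VM stops needing a warm pool to hide behind.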
For code execution sandboxes in particular, where the threat model demands fresh isolation per execution and where workloads vary unpredictably, this is a significant operational simplification. The security property becomes easier to maintain because you are not tempted to reuse sandboxes to avoid startup cost. The snapshot is the canonical clean state; every execution starts from it, and there is no risk of state leakage between executions because CoW guarantees each execution gets its own private dirty pages.
The zeroboot project demonstrates the technique clearly enough that the important question is no longer whether it works but where the remaining costs are. Network namespace setup, the KVM file descriptor creation, and the overhead of loading vmstate and replaying hardware state are the components of restore latency that remain. As those components are optimized, the approach applies to increasingly latency-sensitive workloads. The Linux kernel, KVM, and the MMU are doing most of the work; the application-level code is mostly about orchestrating them correctly.