Sub-Millisecond VM Sandboxes: Applying fork() Semantics at the Hypervisor Level
Source: hackernews
The serverless cold-start problem is well-trodden territory, but most solutions attack it from the wrong end. You can optimize container images, trim dependencies, use provisioned concurrency, or switch to a compiled language. Fewer approaches go one level deeper and apply fork() semantics to the entire virtual machine.
That is the core idea behind zeroboot, a project by Adam Miribyan published in March 2026. It uses Firecracker microVMs not as lightweight containers to boot per-request, but as a substrate to snapshot once and then clone repeatedly using Copy-on-Write memory. The result is sandbox startup in under a millisecond, with Python and numpy already initialized and waiting.
The Cold-Start Numbers That Make This Worth Solving
To understand why this matters, the baseline numbers are worth stating plainly. A minimal Firecracker microVM boot takes roughly 125ms according to AWS’s own benchmarks. That covers kernel boot, init, and getting to a usable userspace. Python startup adds another layer: importing numpy alone can cost 50-200ms depending on the hardware and whether the libraries are warm in the filesystem cache. For a sandbox that runs untrusted user code, you pay all of this on every invocation.
If you are running a code execution service, a multi-tenant compute platform, or something like a Discord bot that lets users run Python snippets, that latency budget is painful. You end up maintaining pools of pre-warmed VMs, which means paying for idle capacity, or you eat the cold-start cost and make your users wait.
How Firecracker Snapshots Work
Firecracker’s snapshot mechanism was designed for fast resume, not for the fork-style cloning that zeroboot uses it for. But the primitives are there.
To take a snapshot, you pause the microVM and send a PUT /snapshot/create request with paths for the VM state file and the memory file:
{
  "snapshot_type": "Full",
  "snapshot_path": "/snapshots/python-numpy.vmstate",
  "mem_file_path": "/snapshots/python-numpy.mem"
}
Firecracker serializes the vCPU register state, the device state (virtio queues, network interfaces, block devices), and dumps the guest physical memory to the mem file. The vmstate file is relatively small. The mem file is a raw dump of everything the guest has in RAM, which for a Python process with numpy loaded might be 200-400MB.
Restoring a snapshot uses PUT /snapshot/load:
{
  "snapshot_path": "/snapshots/python-numpy.vmstate",
  "mem_file_path": "/snapshots/python-numpy.mem",
  "enable_diff_snapshots": false
}
The conventional use of this API is sequential: boot a VM, take a snapshot, restore it when you need a fresh instance. One snapshot, one VM. The zeroboot approach breaks that assumption.
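As a rough illustration, the two requests above can be driven from Python over Firecracker's Unix-domain API socket using only the standard library. This is a sketch, not zeroboot's code: the socket path and the helper names (UnixHTTPConnection, api_call, snapshot_plan, restore_plan) are hypothetical, and the PATCH /vm pause step and the resume_vm field reflect my reading of Firecracker's snapshot API, so verify them against the version you run.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)


def api_call(socket_path, method, path, body):
    """Send one JSON request to a Firecracker API socket; return the HTTP status."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request(method, path, body=json.dumps(body),
                     headers={"Content-Type": "application/json"})
        return conn.getresponse().status
    finally:
        conn.close()


def snapshot_plan(vmstate_path, mem_path):
    """Requests for the source VM: pause it, then write a full snapshot."""
    return [
        ("PATCH", "/vm", {"state": "Paused"}),
        ("PUT", "/snapshot/create", {
            "snapshot_type": "Full",
            "snapshot_path": vmstate_path,
            "mem_file_path": mem_path,
        }),
    ]


def restore_plan(vmstate_path, mem_path):
    """Request for a fresh Firecracker process: load the snapshot and resume."""
    return [
        ("PUT", "/snapshot/load", {
            "snapshot_path": vmstate_path,
            "mem_file_path": mem_path,
            "enable_diff_snapshots": False,
            "resume_vm": True,  # assumption: resume immediately after load
        }),
    ]
```

With a running VMM, each tuple would be passed to api_call along with that process's socket path (for example a hypothetical /tmp/firecracker.sock). The load request goes to a new, not-yet-booted Firecracker process rather than to the VM that produced the snapshot.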
The Fork Analogy
Unix fork() works by marking a process’s writable memory pages as copy-on-write at the kernel level. The parent and child share the same physical pages. When either one writes to a page, the kernel catches the page fault, copies the page, and maps the private copy into the faulting process’s address space. Pages that are never written are never copied. The fork itself is cheap; the cost of divergence is proportional to how much state you actually modify.
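The semantics are easy to see from userspace. The following sketch (the function name fork_and_modify is made up for illustration, and it assumes a Unix-like host) forks, lets the child overwrite a buffer it inherited, and shows that the parent's copy is untouched. It demonstrates the isolation that copy-on-write provides, not the page-level sharing itself, which is invisible from Python.

```python
import os


def fork_and_modify(data):
    """Fork, let the child overwrite `data`, return (parent_view, child_view)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: this write lands on the child's private copy of the page.
        try:
            os.close(read_fd)
            data[:5] = b"child"
            os.write(write_fd, bytes(data))
        finally:
            os._exit(0)
    # Parent: read what the child saw, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        child_view = pipe.read()
    os.waitpid(pid, 0)
    return bytes(data), child_view


if __name__ == "__main__":
    parent_view, child_view = fork_and_modify(bytearray(b"parent state"))
    print(parent_view, child_view)
```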
The zeroboot approach applies exactly this model at the hypervisor level. The snapshot memory file is the shared parent state. Each new sandbox maps that file with MAP_PRIVATE, which tells the Linux kernel to use CoW semantics for any writes. When KVM maps guest physical memory regions backed by a MAP_PRIVATE file mapping, the CoW mechanics operate at the host page level. A guest that reads a numpy array is touching pages backed by the original snapshot file, shared in physical memory with every other running sandbox. A guest that writes to a variable gets a private copy of that page, allocated on demand by the page fault handler.
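The same behavior can be reproduced with an ordinary file, without KVM. In this sketch a temporary file stands in for the snapshot memory file, and two MAP_PRIVATE mappings stand in for two sandboxes; the names clone_view and demo are illustrative only. It assumes a Unix host, since MAP_PRIVATE is not available through Python's mmap module on Windows.

```python
import mmap
import os
import tempfile


def clone_view(path):
    """Map a file copy-on-write: writes stay private to the returned mapping."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, flags=mmap.MAP_PRIVATE,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)


def demo():
    """Return (clone_a_bytes, clone_b_bytes, on_disk_bytes) after clone A writes."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"A" * mmap.PAGESIZE * 4)  # stand-in for the snapshot mem file
        path = f.name
    try:
        clone_a = clone_view(path)
        clone_b = clone_view(path)
        clone_a[0:5] = b"dirty"  # triggers a private copy of the first page
        seen_a = bytes(clone_a[0:5])
        seen_b = bytes(clone_b[0:5])
        with open(path, "rb") as f:
            on_disk = f.read(5)
        clone_a.close()
        clone_b.close()
        return seen_a, seen_b, on_disk
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

Only the first mapping sees its own write; the second mapping and the file on disk still hold the original bytes. Note that the file is opened read-only, which is enough for a private writable mapping.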
The parallel to fork() is not just conceptual. The same kernel copy-on-write page-fault machinery does the work in both cases. The difference is the layer of indirection: instead of a process forking another process, a VMM is creating a new KVM VM whose memory is backed by a CoW view of a file.
Lazy vs. Eager Restore
Getting a 300MB memory snapshot into guest memory takes time. The mmap() call itself returns almost immediately, but the pages still have to be brought in at some point. With eager restore, the VMM copies all snapshot pages into guest memory before the vCPU starts executing. With lazy restore, you start the vCPU immediately and handle page faults as they arrive.
userfaultfd (uffd) is the Linux mechanism that makes lazy restore practical. Instead of the kernel resolving page faults in the usual way, you create a userfaultfd file descriptor and register a memory range with it; the kernel then delivers fault notifications for that range to the descriptor. A userspace thread reads these notifications and copies the appropriate page from the snapshot file into place, which wakes the faulting thread.
For a Python sandbox, lazy restore is a significant win. The guest vCPU starts executing almost immediately. It touches only the pages it actually needs (Python’s interpreter loop, the numpy extension module’s code segment, and the initialized heap), and those pages are faulted in on demand. Pages the sandbox never touches are never loaded.
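To make the difference concrete, here is a toy model of lazy restore in plain Python. It does not use userfaultfd; the class LazyGuestMemory and its fields are invented for illustration, and a dictionary lookup stands in for the page fault. The point is only that the number of pages loaded tracks what the guest touches, not the snapshot size.

```python
PAGE_SIZE = 4096


class LazyGuestMemory:
    """Toy model: pages are copied from the snapshot on first access only."""

    def __init__(self, snapshot):
        self.snapshot = snapshot
        self.resident = {}  # page number -> bytearray copy of that page
        self.faults = 0

    def read(self, addr):
        page = addr // PAGE_SIZE
        if page not in self.resident:
            # Simulated page fault: load just this page from the snapshot.
            self.faults += 1
            start = page * PAGE_SIZE
            self.resident[page] = bytearray(self.snapshot[start:start + PAGE_SIZE])
        return self.resident[page][addr % PAGE_SIZE]


if __name__ == "__main__":
    snapshot = bytes(range(256)) * (PAGE_SIZE * 256 // 256)  # 256 pages, 1 MiB
    mem = LazyGuestMemory(snapshot)
    mem.read(0)
    mem.read(10)                   # same page as address 0, no new fault
    mem.read(PAGE_SIZE * 5 + 1)    # a second page
    print(mem.faults, len(mem.resident))
```

An eager restore in this model would populate all 256 pages up front; the lazy version ends with two.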
The REAP paper (ASPLOS 2021) explored this space with a record-and-prefetch scheme: recording which guest pages a workload accesses when it first runs from a snapshot, then prefetching exactly those pages on later restores, avoiding both the full eager copy and the latency of individual page faults during execution. It is a middle path that works well when your workload is predictable, which Python sandbox execution generally is.
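The record-then-prefetch idea can be sketched in a few lines. This is a simplified model, not REAP's implementation: the function run_with_prefetch and the page-number traces are hypothetical, and a "fault" here is just a set lookup miss.

```python
def run_with_prefetch(trace, prefetched=frozenset()):
    """Replay a page-access trace; return (fault_count, pages_faulted_in)."""
    resident = set(prefetched)
    faulted = set()
    for page in trace:
        if page not in resident:
            # Page was neither prefetched nor touched before: demand fault.
            resident.add(page)
            faulted.add(page)
    return len(faulted), faulted


if __name__ == "__main__":
    # First run: nothing prefetched, so every distinct page faults once.
    first_faults, working_set = run_with_prefetch([0, 1, 1, 7, 3])
    # Later run: prefetch the recorded working set before starting the vCPU.
    later_faults, _ = run_with_prefetch([0, 1, 7, 3, 9], prefetched=working_set)
    print(first_faults, later_faults)
```

In the later run only page 9, which was not in the recorded working set, still incurs a demand fault.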
Memory Efficiency Across Many Sandboxes
The physical memory arithmetic here is favorable. Suppose your Python-plus-numpy snapshot has 300MB of guest memory. With eager restore and no sharing, 10 concurrent sandboxes need 3GB of RAM. With CoW-backed snapshots, those 10 sandboxes share the physical pages for the read-only portions of the snapshot. The Python interpreter code, the numpy shared library code and read-only data, the standard library modules loaded at startup: none of that gets duplicated.
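The arithmetic is simple enough to write down. In this sketch the 2MB of private dirty memory per sandbox is an assumed figure chosen for illustration, not a measurement, and the CoW estimate pessimistically counts the whole snapshot as resident once.

```python
def footprint_mb(snapshot_mb, sandboxes, dirty_mb_per_sandbox):
    """Return (eager_total_mb, cow_total_mb) for a fleet of sandboxes."""
    # Eager, unshared: every sandbox holds a full private copy.
    eager = snapshot_mb * sandboxes
    # CoW: one shared copy of the snapshot plus each sandbox's dirtied pages.
    cow = snapshot_mb + sandboxes * dirty_mb_per_sandbox
    return eager, cow


if __name__ == "__main__":
    eager, cow = footprint_mb(snapshot_mb=300, sandboxes=10, dirty_mb_per_sandbox=2)
    print(f"eager: {eager} MB, CoW: {cow} MB")  # eager: 3000 MB, CoW: 320 MB
```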
A sandbox that runs a simple computation, reading a numpy array and returning a result, may dirty only a small fraction of the snapshot’s pages: a stack frame here, a small heap allocation there. KVM’s dirty page tracking via the KVM_GET_DIRTY_LOG ioctl can tell you exactly which pages each sandbox wrote. The actual private memory overhead per sandbox is proportional to what it does, not to the total snapshot size.
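Once a dirty bitmap has been fetched, turning it into a per-sandbox figure is straightforward. The sketch below only decodes a bitmap that is already in hand; it does not issue the ioctl. It assumes one bit per 4KB guest page, with bit 0 of byte 0 being the first page of the memory slot, which matches a little-endian host. The function names are illustrative.

```python
PAGE_SIZE = 4096


def dirty_page_numbers(bitmap):
    """Decode a one-bit-per-page bitmap into a list of dirty page numbers."""
    pages = []
    for byte_index, byte in enumerate(bitmap):
        for bit in range(8):
            if byte & (1 << bit):
                pages.append(byte_index * 8 + bit)
    return pages


def private_bytes(bitmap):
    """Private memory attributable to a sandbox, from its dirty bitmap."""
    return len(dirty_page_numbers(bitmap)) * PAGE_SIZE


if __name__ == "__main__":
    bitmap = bytes([0b00000101, 0b00000000, 0b10000000])
    print(dirty_page_numbers(bitmap), private_bytes(bitmap))
```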
This is meaningfully different from running 10 separate Python processes. There, Linux shares the file-backed code pages of the interpreter and numpy’s shared libraries between processes, but each process still builds its own private heap, module objects, and initialized state at startup, so the duplication is reduced but not eliminated.
Prior Art and Related Work
This pattern has been explored at several layers of the stack. Catalyzer (ASPLOS 2020) applied fork-style creation to gVisor sandboxes, creating new sandboxes from a pre-initialized template using the same CoW memory semantics. The paper demonstrated sub-millisecond sandbox creation for containerized workloads, which is essentially the same result zeroboot achieves for microVMs.
AWS Lambda SnapStart uses Firecracker snapshots to eliminate Java cold starts. AWS initializes a Java Lambda function, snapshots the VM, and restores from that snapshot on subsequent invocations. The framing is different from zeroboot, focusing on per-function warmup rather than per-request CoW cloning, but the underlying mechanism is the same snapshot API.
FAASM approaches the problem from the WebAssembly angle, using shared memory regions for WebAssembly module state to allow fast function instantiation. The tradeoff space is different: WASM sandboxes like those in Spin or wasmtime have fast cold starts by default because WASM modules are much smaller than a Python interpreter, but they require your code to target WASM and they lose the ability to run arbitrary Python with arbitrary native extensions.
The Tradeoffs
The approach is not without complications. Snapshot staleness is the obvious one: the snapshot captures a moment in time, so any mutable global state in the Python process is shared across all clones at the moment of their creation. If your Python code modifies a global at import time, and plenty of libraries do, every sandbox starts with that modification already applied. This is usually fine since the snapshot represents a fully-initialized state, but it requires care when the snapshot is created and updated.
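A process-level analogue shows the effect. In this sketch a module-level value is computed once, before any clone exists, and fork() stands in for restoring from the snapshot; BOOT_TOKEN and token_seen_by_clone are made-up names, and the code assumes a Unix-like host.

```python
import os

# Computed once at "snapshot time", i.e. before any clone is created.
BOOT_TOKEN = os.urandom(8).hex()


def token_seen_by_clone():
    """Fork a child and return the BOOT_TOKEN value the child observes."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: it inherits the parent's memory, including BOOT_TOKEN.
        try:
            os.close(read_fd)
            os.write(write_fd, BOOT_TOKEN.encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        seen = pipe.read().decode()
    os.waitpid(pid, 0)
    return seen


if __name__ == "__main__":
    print(BOOT_TOKEN, token_seen_by_clone(), token_seen_by_clone())
```

Every clone reports the same token, even though it came from a random source. Anything that is supposed to be unique per sandbox has to be generated after the clone starts, not before the snapshot is taken.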
The security implications of shared physical pages deserve attention. The shared pages are mapped read-only at the hardware level until written, so a sandbox cannot corrupt another sandbox’s view of the shared data: any write produces a private copy. But every sandbox starts from byte-identical memory, so anything captured in the snapshot is present, and identical, in every clone. That matters for workloads where isolation between tenants has to go beyond simple memory protection.
Dirty page tracking overhead is real but bounded. To support KVM_GET_DIRTY_LOG, KVM has to record written pages in a bitmap, which typically means write-protecting guest pages and taking a fault on the first write to each page after logging starts or the log is collected. Repeated writes to an already-dirty page cost nothing extra. For workloads that touch a lot of distinct pages, this is measurable. For typical sandbox workloads that do some computation and return a result, it is negligible.
Why Python Specifically Benefits
The cold-start problem is particularly acute for Python because Python’s startup costs are high and hard to avoid. The interpreter itself takes time to initialize. The import system does a lot of work by design: resolving module paths, reading and compiling .py files or loading .pyc caches, and executing module-level code. numpy’s C extension does substantial initialization: setting up BLAS, probing CPU features for SIMD dispatch, registering array types. You cannot skip any of this; you can only pay for it once.
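You can see where the time goes with CPython's -X importtime option, which prints per-module import timings to stderr. The helper below is a small sketch that runs a fresh interpreter and parses that output; the function name import_cost_us is invented, the parsing assumes the "import time: self | cumulative | name" line format, and json is used as the example module so the snippet runs without numpy installed.

```python
import subprocess
import sys


def import_cost_us(module):
    """Return {module_name: cumulative_microseconds} for a cold `import module`."""
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True, text=True, check=True,
    )
    costs = {}
    for line in proc.stderr.splitlines():
        if not line.startswith("import time:"):
            continue
        fields = line[len("import time:"):].split("|")
        if len(fields) != 3:
            continue
        try:
            cumulative = int(fields[1])
        except ValueError:
            continue  # header line, not a timing row
        costs[fields[2].strip()] = cumulative
    return costs


if __name__ == "__main__":
    costs = import_cost_us("json")
    print(f"json: {costs.get('json')} us cumulative")
```

Substituting numpy for json on a machine that has it installed gives a sense of the cost that the snapshot approach pays once instead of on every invocation.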
The snapshot approach pays the initialization cost exactly once and amortizes it across every sandbox that runs. The first boot is slow. Every subsequent clone is fast. For a long-running service, the amortization window is effectively infinite.
The zeroboot project demonstrates that sub-millisecond startup is achievable with production-grade isolation using Firecracker’s battle-tested VMM. The implementation is not complex in concept; the complexity is in the details of snapshot management, uffd integration, and handling the edge cases around mutable state. But the core idea, fork() for VMs, is one of those things that feels obvious once you see it.