
FreeBSD Jails at Twenty-Five: The Isolation Design That Container Runtimes Keep Rediscovering

Source: hackernews

FreeBSD 4.0 shipped jails in 2000. Docker launched in 2013. The fourteen-year gap is not a curiosity; it is an explanation for why their security properties look so different. Jails were designed as a security boundary first, as a single coherent kernel primitive, by engineers who understood what guarantees they needed to provide. Linux’s container model assembled the same functional space from at least seven independent namespace types, cgroups, and a stack of security policies, added piecemeal over a decade and a half, and integrated by container runtimes that must compose them correctly on every invocation.

The Hacker News thread on Dragas’s post about FreeBSD hit 500 points and 254 comments last week, and jails came up repeatedly. The technical reasons are worth unpacking beyond the headline observation that jails predate Docker.

What a Jail Actually Is

A jail, from the kernel’s perspective, is a named partition with an enforced security boundary and its own identity. A process running inside a jail cannot see processes outside it, cannot access the host filesystem beyond its root, cannot directly manipulate kernel state, and, with VNET enabled, cannot touch the host’s network interfaces, routing tables, or packet filter state.

The kernel interface that creates a jail is jail(2), introduced in a patch by Poul-Henning Kamp. At creation time, you specify the root path, the hostname, and the set of IP addresses the jail may bind. What makes this architecturally distinct from a Linux container is that all of these constraints are enforced against a single kernel data structure: struct prison. Every security-relevant syscall in the FreeBSD kernel passes through a prison_check_* function that determines whether the calling process is permitted to proceed. The isolation is not a composition of independently developed mechanisms. It is a coherent set of checks against a single structure, reviewable in a single code path.

Tools like bastille and iocage layer ZFS dataset management on top of this foundation. Each jail lives in its own ZFS dataset, snapshotable and cloneable at near-zero cost:

# Create a jail using bastille on FreeBSD 14.2
bastille bootstrap 14.2-RELEASE
bastille create webserver 14.2-RELEASE 10.0.0.10

# bastille places the jail in a ZFS dataset
zfs list | grep webserver
# zroot/bastille/jails/webserver

# Snapshot before a configuration change
bastille snapshot webserver
# Roll back if needed
bastille rollback webserver

The jail and its filesystem state are one unit. A failed deployment is two commands away from reversal.

VNET: Per-Jail Network Stacks

The original jails design shared the host network stack, restricting jails to a subset of IP addresses on shared interfaces. FreeBSD 8.0, released in 2009, introduced VNET, which gives each jail a complete, independent network stack: its own interface list, routing table, ARP cache, TCP and UDP connection tables, and PF firewall state. A VNET jail shares no network state with the host or with other jails.

The mechanism that connects a VNET jail to the outside world is epair(4), a virtual Ethernet pair. Creating an epair produces two linked interfaces; one end goes into the jail, the other remains in the host and can be bridged to a physical interface:

# Create a virtual ethernet pair
ifconfig epair create
# Kernel creates epair0a and epair0b

# Create a bridge and add the host-side end
# (a physical interface can be added to the same bridge)
ifconfig bridge create
ifconfig bridge0 addm epair0b up

# In jail.conf: move the jail-side end into the jail and configure it
vnet;
vnet.interface = "epair0a";
exec.start += "ifconfig epair0a 10.0.0.10/24";
exec.start += "route add default 10.0.0.1";

Inside the VNET jail, PF can be loaded and configured as if the jail were a separate machine. Two VNET jails can use overlapping address spaces without conflict because their routing tables are independent. This is functionally equivalent to what Linux network namespaces provide, but VNET arrived four years before Docker launched and integrates with the rest of the jail security model as a unified feature rather than a separately composed mechanism.

The Composition Problem in Linux Containers

Linux mount namespaces arrived in kernel 2.4.19 in 2002. Network namespaces stabilized around 3.0 in 2011. PID namespaces came in 2.6.24 in 2008, and user namespaces were not completed until 3.8 in 2013. cgroups for resource accounting arrived in 2.6.24 in 2008. Docker launched in 2013 and assembled all of these, plus seccomp syscall filtering and AppArmor or SELinux policies depending on configuration, into a container runtime.

Each mechanism is individually defensible. The problem is the composition. Container runtimes must configure all of them correctly, in the right order, handling interactions between them, on every container start. The runc codebase that underlies Docker and most Kubernetes node runtimes has accumulated security patches for cases where the interaction between namespace setup and capability-dropping left a window for privilege escalation. User namespaces in particular, which allow unprivileged users to create what appears to be a privileged environment inside a namespace, have been a recurring source of CVEs because the boundary between apparent and real kernel privilege requires correct handling across multiple subsystems simultaneously.

FreeBSD’s security advisories for jails exist too. When the isolation is a single coherent primitive with a defined boundary, though, the audit surface is smaller and the failure modes are more tractable. A bug in a prison_check_* function is a jail security issue, full stop. In Linux containers, a bug might exist in the namespace code, in the cgroup code, in the interaction between user namespaces and capabilities, in the seccomp filter application sequence, or in the OCI runtime’s orchestration of all of the above. Isolating which layer failed, and why, is structurally harder.

Capsicum: The Complementary Sandboxing Story

Jails isolate at the process and privilege level. Capsicum, which shipped in FreeBSD 9.0 in 2012 and has been part of the default base system configuration since 10.0 in 2014, operates at the file descriptor level.

A process that calls cap_enter(2) enters capability mode: syscalls that reach into global namespaces, such as open(2) on a filesystem path, fail with ECAPMODE, and the process can only operate on file descriptors it held before entering capability mode, plus descriptors derived from them. Each fd can have a specific set of capability rights attached via cap_rights_limit(2):

/* Restrict an open fd to read and seek, no write */
#include <sys/capsicum.h>
#include <err.h>

cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_SEEK);
if (cap_rights_limit(fd, &rights) < 0)
        err(1, "cap_rights_limit");
if (cap_enter() < 0)   /* process can no longer open new paths */
        err(1, "cap_enter");

Base system daemons including ping, tcpdump, dhclient, and openssh use Capsicum to sandbox themselves after startup. They open the files and sockets they need during initialization, then restrict themselves to capability mode so that a compromise of the daemon cannot reach arbitrary filesystem paths or make unconstrained privileged calls.

Linux has seccomp-BPF for syscall filtering and Landlock since kernel 5.13 for filesystem access control. They address the problem differently and require more scaffolding to achieve comparable fd-level granularity. More relevant to the architectural argument: Capsicum is maintained by the same team that maintains the kernel it integrates with, reviewed against the same source tree, and ships in the base system without requiring out-of-tree module loading or third-party policy tooling.

Where Jails Fit in 2026

Kubernetes on FreeBSD is not a supported configuration. Teams with existing k8s workloads and CI/CD pipelines built around OCI images have no straightforward migration path to FreeBSD nodes. Podman on FreeBSD is advancing, and the runj project implements an OCI runtime backed by jails, but the ecosystem is not there yet for organizations that have standardized on Helm and container registries.

FreeBSD jails remain compelling for the workloads that the original post describes: hosting services where control, auditability, and a coherent security model matter more than ecosystem breadth. A bastille-managed jail cluster on ZFS, with snapshotable environments and VNET isolation, provides meaningful isolation at near-zero overhead, with a security model that can be understood by reading the jail(2) man page and the struct prison definition rather than by reconstructing it from five kernel subsystem changelogs and a container runtime specification.

The container industry has spent a decade building toward the security properties that FreeBSD jails started with in 2000, and the gap has narrowed considerably. It has not closed, and the reason it exists at all is architectural: building isolation by composing independent mechanisms produces more capable and flexible systems, but it also produces more surface area and more interaction bugs than designing the boundary as a single primitive from the start. FreeBSD made a different trade, and twenty-five years of production use is a reasonable basis for evaluating whether it was the right one.
