
Two Ways to Shrink a Process: Capsicum's Capability Rights vs. seccomp's Syscall Filters

Source: lobsters

Process sandboxing in 2026 mostly means seccomp on Linux. It is what Chrome, Firefox, Docker, and systemd all reach for. But there is a parallel tradition, represented by Capsicum on FreeBSD and pledge/unveil on OpenBSD, that takes a fundamentally different position on where restriction should happen. Vivian Voss’s comparison is a good entry point to this topic, and it prompted me to dig into the architectural difference more carefully, because the two approaches are not just different implementations of the same idea.

Where the Filter Lives

seccomp-bpf, merged into Linux 3.5 in 2012, works at the syscall boundary. When a process installs a filter, the kernel runs a small cBPF program on every subsequent syscall before executing it. That program receives a struct seccomp_data containing the syscall number, the calling architecture, and up to six argument registers:

struct seccomp_data {
    int   nr;                   /* syscall number */
    __u32 arch;                 /* e.g., AUDIT_ARCH_X86_64 */
    __u64 instruction_pointer;
    __u64 args[6];              /* syscall arguments */
};

The program returns a verdict: SECCOMP_RET_ALLOW, SECCOMP_RET_KILL_PROCESS, SECCOMP_RET_ERRNO (with the errno value carried in the low 16 bits of the return value), and several others. If the process tries to call execve and the filter says kill, the kernel kills the process before execve runs.
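To make the verdict mechanism concrete, here is a minimal sketch of a raw cBPF filter that lets every syscall through except one, which fails with EPERM. The helper name install_deny_filter and the NATIVE_AUDIT_ARCH macro are made up for illustration, and the sketch assumes Linux on x86-64 or aarch64 with kernel headers recent enough to define SECCOMP_RET_KILL_PROCESS.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Assumption: only these two architectures are handled in this sketch. */
#if defined(__x86_64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "sketch covers x86-64 and aarch64 only"
#endif

/* Hypothetical helper: make syscall `denied_nr` fail with EPERM and allow
 * everything else. Returns 0 on success, -1 on failure. */
static int install_deny_filter(int denied_nr)
{
    struct sock_filter prog[] = {
        /* Kill the process if the calling architecture is unexpected. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_AUDIT_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* Compare the syscall number against the denied one. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)denied_nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog fprog = {
        .len = (unsigned short)(sizeof(prog) / sizeof(prog[0])),
        .filter = prog,
    };

    /* Needed so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}
```

After install_deny_filter(SYS_getppid) succeeds, syscall(SYS_getppid) should return -1 with errno set to EPERM while other syscalls keep working. A production filter would normally be an allowlist rather than a single-syscall denylist, and the filter cannot be removed once installed.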

Capsicum, first published in the USENIX Security 2010 paper by Watson, Anderson, Laurie, and Kennaway, appeared as an experimental feature in FreeBSD 9.0 and has been enabled by default since FreeBSD 10.0. It works at the resource layer instead. Rather than filtering syscalls, it restricts what operations are permitted on file descriptors the process already holds. Two primitives do the work:

/* Narrow the rights on an open fd */
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_SEEK, CAP_FSTAT);
cap_rights_limit(fd_input, &rights);

/* Enter capability mode — severs access to global namespaces */
cap_enter();

After cap_enter(), the process loses access to all global namespaces: no open("/etc/passwd"), no kill(pid, ...), no connect() to a new socket by address. Any syscall that requires resolving a global name fails with ECAPMODE. The only things the process can do are operations on file descriptors it already holds, subject to the rights attached to each fd via cap_rights_limit(2).

The FreeBSD rights(4) man page enumerates around 60 capability rights: CAP_READ, CAP_WRITE, CAP_SEEK, CAP_CONNECT, CAP_ACCEPT, CAP_MMAP_R, CAP_MMAP_X, CAP_LOOKUP (needed to use a directory fd as the base for openat), and so on. Rights on an fd can only be narrowed, never widened. A process with CAP_READ on a file descriptor cannot grant itself CAP_WRITE on that same fd.
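The narrowing rule is easy to state as code. The following is a toy model in portable C, not the FreeBSD implementation: the toy_fd type, the R_* bits, and both functions are hypothetical names invented for this sketch, and a single 64-bit mask stands in for the real cap_rights_t.

```c
#include <stdint.h>

/* Toy rights bits. These are NOT the FreeBSD CAP_* values. */
enum {
    R_READ  = 1u << 0,
    R_WRITE = 1u << 1,
    R_SEEK  = 1u << 2,
};

/* Toy descriptor: just the rights currently attached to it. */
struct toy_fd {
    uint64_t rights;
};

/* Model of narrowing: succeeds only when `wanted` is a subset of the rights
 * the descriptor already has. On failure the rights are left unchanged.
 * (FreeBSD's cap_rights_limit(2) reports ENOTCAPABLE in the widening case.) */
static int toy_rights_limit(struct toy_fd *fd, uint64_t wanted)
{
    if ((wanted & ~fd->rights) != 0)
        return -1;              /* attempt to widen: refused */
    fd->rights = wanted;
    return 0;
}

/* Model of the per-operation check: every needed bit must be present. */
static int toy_rights_check(const struct toy_fd *fd, uint64_t needed)
{
    return (fd->rights & needed) == needed ? 0 : -1;
}
```

Starting from a descriptor with R_READ | R_WRITE | R_SEEK, limiting to R_READ | R_SEEK succeeds, a later check for R_WRITE fails, and a second limit call asking for R_READ | R_WRITE is refused because R_WRITE is no longer held.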

What Each Model Cannot Do

This is where the comparison becomes more than academic.

seccomp-bpf can inspect syscall argument registers, but it cannot dereference pointers. The args[6] in seccomp_data contains the raw 64-bit values passed in registers to the syscall. For a call like openat(dirfd, "/etc/passwd", O_RDONLY), the BPF program sees the integer value of the pointer to the string "/etc/passwd", not the string itself. This means seccomp fundamentally cannot implement a policy like “allow opening files only under /var/data”. That policy requires inspecting the path, and inspecting the path requires dereferencing a user pointer inside the kernel, which cBPF cannot safely do.
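The flip side is that integer arguments are fully visible to the filter. As a rough sketch of that distinction, the filter below inspects args[0] of socket(2), which is a plain integer, and refuses AF_INET while leaving other domains alone. The helper name deny_inet_sockets and the NATIVE_AUDIT_ARCH macro are invented for this example; it assumes Linux on little-endian x86-64 or aarch64 and only looks at the low 32 bits of the argument.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Assumption: only these two architectures are handled in this sketch. */
#if defined(__x86_64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "sketch covers x86-64 and aarch64 only"
#endif

/* Hypothetical helper: socket(AF_INET, ...) fails with EACCES,
 * every other syscall and socket domain is allowed. */
static int deny_inet_sockets(void)
{
    struct sock_filter prog[] = {
        /* Kill the process if the calling architecture is unexpected. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_AUDIT_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* Anything that is not socket(2) goes straight to ALLOW. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_socket, 0, 3),
        /* args[0] is the domain, an integer the filter can compare directly.
         * Low 32 bits only; this offset assumes a little-endian machine. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_INET, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog fprog = {
        .len = (unsigned short)(sizeof(prog) / sizeof(prog[0])),
        .filter = prog,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}
```

The same technique applied to the path argument of openat would compare a user-space address, not the characters it points to, which is why no variation of this filter can express a path-prefix policy.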

Capsicum handles this naturally. If you want a sandboxed process to be able to open files within /var/data but nowhere else, you open /var/data as a directory fd before calling cap_enter(), limit it to CAP_LOOKUP, CAP_READ, and CAP_FSTAT, and then resolve every later path relative to that fd with openat():

cap_rights_t rights;

/* Open the directory while the global namespace is still reachable */
int dir_fd = open("/var/data", O_RDONLY | O_DIRECTORY);
cap_rights_init(&rights, CAP_LOOKUP, CAP_READ, CAP_FSTAT);
cap_rights_limit(dir_fd, &rights);
cap_enter();

/* Now only files within /var/data are reachable */
int f = openat(dir_fd, "input.txt", O_RDONLY);  /* OK */
int g = open("/etc/passwd", O_RDONLY);           /* -1, errno == ECAPMODE */

The restriction is enforced by the kernel’s fd machinery, not by inspecting path strings. There is no TOCTOU window, no symlink confusion, no race between the policy check and the operation.

Capsicum has its own limitation: it offers no programmable per-syscall policy. Capability mode does refuse a fixed, kernel-defined set of syscalls that touch global namespaces, but an application cannot add to or trim that list; what it controls is which operations succeed on the fds it holds. If an attacker exploits a memory-safety bug in a Capsicum-sandboxed process, they are constrained to fd operations within the capability rights of that process. They cannot call execve or ptrace, because those require global namespaces that capability mode severed. The restriction comes from the model, not from an application-supplied list of forbidden syscalls.

seccomp can explicitly block execve, ptrace, clone with specific flags, and dozens of other dangerous calls regardless of what fds the process holds. This is a different guarantee, and for some threat models it matters more.

OpenBSD’s Middle Path

OpenBSD took a third approach with pledge(2) (OpenBSD 5.9, 2016) and unveil(2) (OpenBSD 6.4, 2018). pledge restricts syscall categories using promise strings:

pledge("stdio rpath inet dns", NULL);

Unlike seccomp, the mapping from promise string to permitted syscalls is fixed in the kernel and maintained by the OpenBSD team. Users cannot write arbitrary filters. unveil handles the filesystem namespace:

unveil("/var/data", "r");
unveil("/tmp",      "rwc");
unveil(NULL, NULL);  /* lock */

Together, pledge and unveil cover approximately the same space as Capsicum: syscall class restriction plus filesystem namespace restriction. The difference is that unveil works on path strings rather than file descriptors, which makes it easier to retrofit onto existing applications but introduces the usual path-based TOCTOU exposure that Capsicum avoids entirely. The OpenBSD man pages describe the reasoning behind the design trade-offs.

Attack Surface

The attack surface difference between Capsicum and seccomp-bpf deserves attention. seccomp’s filter mechanism runs BPF through a verifier and either an interpreter or a JIT compiler. The Linux kernel seccomp documentation describes the architecture in detail. The BPF JIT has had multiple security vulnerabilities over the years, including several that allowed sandboxed processes to escalate privilege by exploiting bugs in the JIT itself. A restricted process can generate complex BPF programs via nested filter installations; that complexity means more surface area.

Capsicum’s enforcement is simpler. Capability mode is a single flag on the process credential. Rights checking is a bitmask comparison against the cap_rights_t stored alongside each fd in the file descriptor table. There is no verifier, no interpreter, no JIT. The FreeBSD cap_rights_limit(2) path in the kernel is a handful of lines. Runtime overhead is negligible.

The Trusted Helper Problem

Both models run into the same practical wall: real applications need capabilities beyond a minimal sandbox. They need DNS resolution, password database lookups, logging to syslog. FreeBSD ships Casper for this. It began as a system daemon in FreeBSD 10 and has been provided as a library, libcasper(3), since FreeBSD 11: the application starts a helper process before calling cap_enter(), and the sandboxed process then holds a socket fd connected to that helper and requests privileged operations through that channel. Each Casper service validates requests and performs only the allowed subset.

Linux added an analogous mechanism in kernel 5.0: SECCOMP_RET_USER_NOTIF, which forwards a filtered syscall to a userspace supervisor process via a file descriptor. Container managers such as LXD use this to implement a broker pattern where the container process cannot perform an operation directly, but has it forwarded to a supervisor outside the container for validation. The kernel seccomp documentation describes the notifier fd mechanism.

The convergence here is telling. Both capability-based and syscall-filter-based sandboxing ended up needing the same architectural solution: a trusted co-process that holds wider authority and validates requests from the sandboxed component. The mechanisms for getting there differ, but the pattern is identical.
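The shared pattern can be sketched in portable POSIX C without either sandboxing API. In this illustration a helper process holds the wider authority, checks each request against an allowlist, and hands back an open descriptor over a Unix socket with SCM_RIGHTS. All function names and the one-entry allowlist are hypothetical, and this is not how Casper or the seccomp notifier is implemented; a real program would enter its sandbox at the marked point.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Send one status byte, plus the descriptor itself when fd >= 0. */
static int send_fd(int sock, int fd)
{
    char ok = (fd >= 0);
    struct iovec iov = { .iov_base = &ok, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } ctl;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

    if (fd >= 0) {
        memset(&ctl, 0, sizeof(ctl));
        msg.msg_control = ctl.buf;
        msg.msg_controllen = sizeof(ctl.buf);
        struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
        c->cmsg_level = SOL_SOCKET;
        c->cmsg_type = SCM_RIGHTS;
        c->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(c), &fd, sizeof(int));
    }
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor; returns -1 if the helper refused or on error. */
static int recv_fd(int sock)
{
    char ok = 0;
    int fd = -1;
    struct iovec iov = { .iov_base = &ok, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } ctl;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctl.buf,
                          .msg_controllen = sizeof(ctl.buf) };

    if (recvmsg(sock, &msg, 0) != 1 || !ok)
        return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c != NULL && c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS)
        memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}

/* Helper side: serve one request, opening only allowlisted names. */
static void helper_serve_one(int sock)
{
    char name[64] = {0};
    int fd = -1;

    if (read(sock, name, sizeof(name) - 1) > 0 &&
        strcmp(name, "/dev/null") == 0)        /* hypothetical allowlist */
        fd = open(name, O_RDONLY);
    send_fd(sock, fd);
    if (fd >= 0)
        close(fd);
}

/* Sandboxed side: fork a helper, then ask it to open `name`. */
static int broker_open(const char *name)
{
    int sv[2], fd = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0) { close(sv[0]); close(sv[1]); return -1; }
    if (pid == 0) { close(sv[0]); helper_serve_one(sv[1]); _exit(0); }
    close(sv[1]);
    /* A real program would call cap_enter() or install its seccomp filter here. */
    if (write(sv[0], name, strlen(name)) == (ssize_t)strlen(name))
        fd = recv_fd(sv[0]);
    close(sv[0]);
    waitpid(pid, NULL, 0);
    return fd;
}
```

With this sketch, broker_open("/dev/null") returns a usable descriptor, while broker_open("/etc/passwd") returns -1 because the helper declines names outside its allowlist. Passing a descriptor rather than file contents keeps the helper small and lets the sandboxed side keep using ordinary fd operations.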

The Portability Gap

The most significant practical limitation of Capsicum is portability. FreeBSD ships it in the base system and has progressively applied it to base utilities: tcpdump, dhclient, rwhod, tftp, sort. A Google-maintained capsicum-linux patch set exists but has never been merged into the mainline Linux kernel. The reasons are partly political, partly technical: Linux already has seccomp, namespaces, and LSMs, and the kernel community has not found the object-capability model compelling enough to justify another sandboxing primitive.

seccomp-bpf runs on every significant Linux distribution, which means it runs in every cloud VM, every container, every Android device. Chrome’s renderer sandbox on Linux has used seccomp-bpf since around 2012. Firefox’s content process sandbox uses it. The Docker default seccomp profile blocks around 44 syscalls. systemd’s SystemCallFilter= directive compiles service-unit syscall lists to BPF at runtime. The practical deployment scale of seccomp-bpf dwarfs anything Capsicum has achieved outside FreeBSD.

For applications targeting only FreeBSD, Capsicum is the cleaner model. The security invariants are simpler, the attack surface is smaller, and the filesystem restriction story is solved correctly. For cross-platform work or anything running on Linux, seccomp-bpf is what the ecosystem has standardized on, and libseccomp makes it considerably less painful than writing raw cBPF.

The deeper lesson from comparing the two is that sandboxing models make explicit trade-offs between expressiveness and simplicity. Capsicum chose simplicity: a fixed rights vocabulary attached to file descriptors, a clean capability-mode transition, and minimal kernel machinery. seccomp chose expressiveness: arbitrary BPF programs, argument-level filtering, and a programmable policy interface. Both choices have costs. If you are sandboxing a program and wondering which approach to reach for, the answer is mostly determined by your OS target, but understanding why they differ will save you from misapplying either one.
