Two Theories of Process Sandboxing: What Capsicum and seccomp Actually Disagree On
Source: lobsters
Process sandboxing has two competing philosophies, and most developers using Linux only ever encounter one of them. Vivian Voss’s comparison of Capsicum and seccomp is a good starting point for understanding the surface-level API differences, but the more interesting question is why the two mechanisms are structured so differently in the first place. The answer has to do with how each system reasons about authority.
The Ambient Authority Problem
POSIX processes inherit what security researchers call “ambient authority”: the ability to reference global namespaces (the filesystem, the network, the PID namespace) using nothing more than a name. Any process can call open("/etc/passwd", O_RDONLY) as long as it has the file permission. The process doesn’t need to hold a token that grants this access; the name itself is sufficient.
This is convenient but hard to reason about at security boundaries. When you want to confine a process, you need to enumerate everything it shouldn’t be able to name, which is a much larger set than what it legitimately needs. Both Capsicum and seccomp are attempts to deal with this problem, but they attack it from opposite directions.
Capsicum: Strip the Namespaces, Keep the Handles
Capsicum, developed by Robert Watson and Ben Laurie and introduced in the USENIX Security 2010 paper, ships as part of FreeBSD’s base system and takes the capability-model approach. Rather than filtering what a process is allowed to ask for, it removes the ability to ask by name at all.
The central call is cap_enter(). After it returns, the process is in capability mode: the kernel rejects every syscall that would look something up in a global namespace. A call such as open("/etc/passwd", O_RDONLY) fails with ECAPMODE regardless of file permissions. The only resources the process can use are the file descriptors it already holds, and the kernel enforces each descriptor's rights mask on every operation.
#include <sys/capsicum.h>
#include <err.h>

// Before entering capability mode, narrow each fd's rights
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_FSTAT);
if (cap_rights_limit(config_fd, &rights) < 0)
    err(1, "cap_rights_limit(config_fd)");
cap_rights_init(&rights, CAP_WRITE, CAP_SEEK);
if (cap_rights_limit(output_fd, &rights) < 0)
    err(1, "cap_rights_limit(output_fd)");

// Irreversible: global namespaces are now inaccessible
if (cap_enter() < 0)
    err(1, "cap_enter");
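The cap_rights_t type is a structure holding an array of 64-bit words (it was a single 64-bit mask in FreeBSD 9) and is manipulated through the cap_rights_*() helpers. Rights apply with per-FD granularity: CAP_READ, CAP_WRITE, CAP_SEEK, CAP_FSTAT, CAP_LOOKUP (for directory traversal), CAP_CONNECT, CAP_ACCEPT, and dozens more. Rights can only be narrowed, never expanded: cap_rights_limit() fails with ENOTCAPABLE if asked to add a right the descriptor does not already have. A child process that receives an FD via SCM_RIGHTS gets exactly the rights the parent passed down, and can restrict them further but not broaden them.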
Filesystem access inside capability mode goes through directory file descriptors using the *at() family:
// dirfd was opened and limited before cap_enter()
int fd = openat(dirfd, "config.txt", O_RDONLY);
// Traversal above dirfd (via "..") is blocked even with CAP_LOOKUP
The sandbox is structurally complete in the sense that there is no configuration space to get wrong and no list of forbidden syscalls to maintain. Global namespace access is simply absent from the process’s ability set.
seccomp: Filter the Syscalls, Leave the Namespaces
Linux’s seccomp mechanism takes the opposite approach. Processes retain ambient authority; seccomp interposes on each syscall and runs a BPF program to decide whether to permit it. The seccomp filter documentation describes the filter mode introduced in Linux 3.5 by Will Drewry.
The BPF program receives a struct seccomp_data containing the syscall number, architecture identifier, instruction pointer, and the six integer arguments. It returns a disposition: SECCOMP_RET_ALLOW, SECCOMP_RET_KILL_PROCESS, SECCOMP_RET_ERRNO, SECCOMP_RET_TRAP, and others. In practice, raw BPF is painful to write correctly. libseccomp provides a cleaner interface:
#include <seccomp.h>
#include <sys/prctl.h>
#include <err.h>

// Default action: kill the process on any syscall not allowed below
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
if (ctx == NULL)
    errx(1, "seccomp_init");
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 0);
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
if (seccomp_load(ctx) < 0)
    errx(1, "seccomp_load");
seccomp_release(ctx);
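Unless the process holds CAP_SYS_ADMIN, it must set PR_SET_NO_NEW_PRIVS before it can install a seccomp filter. Without that rule, an unprivileged process could install a filter and then execve() a setuid binary; the filter would follow the more-privileged program and let the caller make selected syscalls fail inside it, a confused-deputy situation. libseccomp sets the flag itself by default when seccomp_load() runs, so the explicit prctl() call above is redundant but harmless.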
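For comparison with the libseccomp version, here is a rough sketch of what a hand-written filter looks like. It is deliberately tiny: it checks the architecture field of struct seccomp_data, denies ptrace with EPERM, and allows everything else. The names SANDBOX_ARCH, deny_ptrace_filter, and install_deny_ptrace are invented for this example, and only x86-64 and arm64 are covered.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#if defined(__x86_64__)
#define SANDBOX_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SANDBOX_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

struct sock_filter deny_ptrace_filter[] = {
    // [0] load seccomp_data.arch; [1] skip the kill if it matches
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SANDBOX_ARCH, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    // [3] load the syscall number; [4] branch on ptrace
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ptrace, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

// Sets no_new_privs, then attaches the filter to the calling thread.
// Returns 0 on success, -1 with errno set on failure. Irreversible.
int install_deny_ptrace(void)
{
    size_t n = sizeof(deny_ptrace_filter) / sizeof(deny_ptrace_filter[0]);
    assert(n <= BPF_MAXINSNS);
    struct sock_fprog prog = {
        .len = (unsigned short)n,
        .filter = deny_ptrace_filter,
    };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Every jump offset is counted by hand, which is where most raw-BPF mistakes come from. A production deny-list on x86-64 would also have to account for x32 syscall numbers, which share the same architecture value; this sketch ignores them.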
The Path Filtering Gap
The structural limitation that seccomp cannot work around is that BPF programs cannot safely dereference user-space pointers. When the process calls openat(dirfd, path, flags), the path argument is a pointer into user-space memory. The BPF filter sees only the pointer value, not the string it points at. Even if the filter could read the string at inspection time, the user-space process could race to change it before the kernel processes the syscall.
This means seccomp cannot express a policy like “allow open only for /etc/resolv.conf”. It can say “allow openat with flags O_RDONLY”, which permits reading any file the process has permission to read. The policy operates on syscall shape, not resource identity.
Capsicum sidesteps this entirely because path lookup is prohibited; the kernel never reaches the question of which path is acceptable. The process can only reach resources it already holds handles to.
What Linux Does Instead
Linux has been gradually assembling a more complete picture of process confinement by combining mechanisms.
Landlock, merged in Linux 5.13, addresses the path-filtering gap directly. Unlike seccomp, Landlock operates at the VFS layer and reasons about inodes, not pointer-strings. A process creates a ruleset, attaches rules anchored to opened directory file descriptors, and then applies it:
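#define _GNU_SOURCE            // for O_PATH
#include <fcntl.h>
#include <linux/landlock.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Access types the ruleset governs; handled but not granted means denied
struct landlock_ruleset_attr ra = {
    .handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE |
                         LANDLOCK_ACCESS_FS_WRITE_FILE,
};
int ruleset_fd = syscall(SYS_landlock_create_ruleset, &ra, sizeof(ra), 0);

// Grant read access to files beneath /etc
struct landlock_path_beneath_attr pa = {
    .allowed_access = LANDLOCK_ACCESS_FS_READ_FILE,
    .parent_fd = open("/etc", O_PATH | O_CLOEXEC),
};
syscall(SYS_landlock_add_rule, ruleset_fd, LANDLOCK_RULE_PATH_BENEATH, &pa, 0);
close(pa.parent_fd);

prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
syscall(SYS_landlock_restrict_self, ruleset_fd, 0);
close(ruleset_fd);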
The design is closer to Capsicum in spirit: it operates on kernel objects rather than string arguments, and rules are anchored to real inodes rather than path patterns. The scope is narrower, though. Landlock covers filesystem access and, as of Linux 6.7, TCP port binding and connection. It does not narrow rights on already-open FDs and provides no capability delegation mechanism.
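Because Landlock's coverage grows with the kernel version, programs usually probe what the running kernel supports before building a ruleset. A minimal sketch of that probe follows. The function names landlock_abi_version and landlock_has_tcp_rules are invented here; the fallback definitions are assumptions for systems whose userspace headers predate Landlock, and the mapping of the TCP rules to ABI version 4 should be checked against the kernel documentation.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__has_include)
#  if __has_include(<linux/landlock.h>)
#    include <linux/landlock.h>
#  endif
#endif

// Fallbacks for older headers (assumed values; prefer the real headers)
#ifndef SYS_landlock_create_ruleset
#define SYS_landlock_create_ruleset 444
#endif
#ifndef LANDLOCK_CREATE_RULESET_VERSION
#define LANDLOCK_CREATE_RULESET_VERSION (1U << 0)
#endif

// Returns the Landlock ABI version, or 0 if Landlock is unavailable
// (old kernel, disabled at boot, or blocked by an outer sandbox).
int landlock_abi_version(void)
{
    long v = syscall(SYS_landlock_create_ruleset, NULL, (size_t)0,
                     LANDLOCK_CREATE_RULESET_VERSION);
    if (v < 0)
        return 0;
    assert(v <= INT_MAX);
    return (int)v;
}

// TCP bind/connect rules arrived with ABI version 4 (Linux 6.7).
int landlock_has_tcp_rules(void)
{
    return landlock_abi_version() >= 4;
}
```

A caller would typically mask out the access bits the running kernel does not know about before creating the ruleset, so the same binary degrades gracefully on older kernels.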
OpenBSD’s pledge and unveil are the most practical synthesis available in any production OS today. pledge restricts which categories of syscalls are permitted using named promises like "stdio rpath inet dns", while unveil restricts which filesystem paths are visible to the process. Together they provide both syscall-level and path-level confinement with a notably small API surface:
unveil("/etc/ssl/cert.pem", "r");
unveil("/var/cache/myapp", "rwc");
unveil(NULL, NULL); // finalize: no more unveil calls allowed
pledge("stdio rpath wpath cpath inet", NULL);
The tradeoff is coarser granularity compared to Capsicum’s per-FD rights, but the OpenBSD base system has pledged and unveiled nearly every program, which demonstrates the practical adoption ceiling you can reach with a simpler interface.
Composability and the Delegation Problem
One difference that doesn’t get enough attention is how each mechanism handles delegation to child processes.
With Capsicum, a parent can pass a narrowed FD to a child via SCM_RIGHTS. The child receives exactly the rights the parent specified, and the parent can transmit different subsets to different children for genuinely least-privilege child processes. The capability flows from parent to child with explicit narrowing at each step.
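The transfer mechanism itself is ordinary POSIX and looks roughly like the sketch below. The helper names send_fd and recv_fd are invented for this example. The code shows only how a descriptor crosses a Unix-domain socket; under Capsicum the sender would call cap_rights_limit() on the descriptor before sending it, and those rights travel with it.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

// Control buffer sized and aligned for exactly one descriptor.
union fd_cmsg {
    struct cmsghdr hdr;
    char buf[CMSG_SPACE(sizeof(int))];
};

// Send fd over a connected AF_UNIX socket. Returns 0 or -1.
int send_fd(int sock, int fd)
{
    assert(fd >= 0);
    char byte = 0;                       // at least one data byte is required
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg ctl;
    memset(&ctl, 0, sizeof(ctl));
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctl.buf, .msg_controllen = sizeof(ctl.buf),
    };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

// Receive one descriptor. Returns the new fd, or -1 on error.
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg ctl;
    memset(&ctl, 0, sizeof(ctl));
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctl.buf, .msg_controllen = sizeof(ctl.buf),
    };
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET ||
        c->cmsg_type != SCM_RIGHTS || c->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file, so authority moves by explicit hand-off rather than by name.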
With seccomp, filters stack with AND semantics: a child can only be more restricted than its parent, never less. There is no mechanism for a parent to say “this child subprocess is allowed to do something I’ve restricted myself from doing”. For that pattern, the newer SECCOMP_RET_USER_NOTIF return code (Linux 5.0) provides a supervisor notification mechanism where a privileged parent process receives a file descriptor and can inspect, intercept, and respond to sandboxed syscalls on the child’s behalf. LXD uses this to handle mount() calls from inside containers. It works, but the architecture is more complex than Capsicum’s straightforward FD delegation.
The Hardware Endpoint
The theoretical endpoint of the Capsicum lineage is CHERI, also led by Robert Watson’s group at Cambridge, which moves the capability model into hardware. CHERI introduces fat 128-bit pointer registers that embed bounds and permission bits, enforced by the CPU at every memory access. The monotonicity principle that governs cap_rights_limit() at the FD level applies to every pointer in the address space: capabilities can only be derived from existing capabilities with equal or lesser permissions, and they cannot be synthesized from integers.
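The derivation rule can be illustrated with a toy software model. This is not the CHERI instruction set or its capability encoding: struct toy_cap, its fields, and toy_derive() are invented for illustration, and the model simply marks an illegal derivation as untagged (invalid).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PERM_LOAD  (1u << 0)
#define TOY_PERM_STORE (1u << 1)

// Toy capability: a region, a permission set, and a validity tag.
struct toy_cap {
    uint64_t base;
    uint64_t length;
    uint32_t perms;
    bool     tag;      // true only for capabilities derived legitimately
};

// Derive a child from parent. The child is tagged only if the parent is
// tagged, the child region lies inside the parent's, and no permission
// is added. Nothing here can turn a bare integer into a tagged value.
struct toy_cap toy_derive(struct toy_cap parent, uint64_t base,
                          uint64_t length, uint32_t perms)
{
    // Invariant of this model: the parent region does not wrap around.
    assert(parent.length <= UINT64_MAX - parent.base);
    bool in_bounds = base >= parent.base &&
                     length <= parent.length &&
                     base - parent.base <= parent.length - length;
    bool perms_ok = (perms & ~parent.perms) == 0;
    struct toy_cap child = { base, length, perms,
                             parent.tag && in_bounds && perms_ok };
    return child;
}
```

The shape is the same as narrowing FD rights: a derived value can cover less memory and carry fewer permissions than its parent, never more.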
CheriBSD runs on the Arm Morello development board and extends Capsicum’s software-level confinement with hardware-backed spatial and temporal memory safety. The reasoning that makes Capsicum’s sandbox structurally complete at the process API level applies, on CHERI hardware, to individual memory regions within the process.
Where This Leaves You
Chrome, Firefox, Docker, systemd, and Android all use seccomp-bpf on Linux, and it works effectively at scale. The overhead is low (usually too small to measure outside syscall-heavy workloads), the tooling is mature, and deployment doesn’t require application refactoring. The path-argument limitation is a real constraint, but for syscall-number-based confinement, seccomp is practical and composable.
Capsicum’s model is more principled. Once a process calls cap_enter(), there is no ambient authority to enumerate, no policy to audit for missing syscall entries, no new kernel feature that silently expands the attack surface. The cost is application refactoring: pre-opening all necessary resources, replacing fork() and PID-based signalling with pdfork() and pdkill() on process descriptors, and routing restricted operations through libcasper services. For existing codebases, that’s a significant undertaking.
The difference isn’t just API ergonomics. It’s the kind of claim you can make when you’re done. seccomp lets you say “this process cannot call these syscalls”. Capsicum lets you say “this process has no ambient authority and can only operate on the specific resources I gave it”. Both claims are useful. They’re just answers to different questions.