Structural Confinement vs. Syscall Filtering: What Capsicum and seccomp Reveal About OS Sandboxing
Source: lobsters
Process sandboxing is one of those topics where the implementation choices encode deep beliefs about where trust should live and how security should be enforced. Capsicum and seccomp-bpf both shipped around 2012, both confine what a Unix process can do, and both are in active production use. But their designs are so different that comparing them exposes the core tension in OS security: structural confinement versus policy filtering.
Vivian Voss’s comparison of the two is a useful starting point. This post tries to go further into why the architectural differences matter, where each approach breaks down, and what the Linux ecosystem has been quietly doing to close the gap.
The Capsicum Model: Authority From File Descriptors
Capsicum, designed by Robert Watson and Jonathan Anderson at Cambridge and introduced in FreeBSD 9.0, is built on object-capability security. The central premise is that every resource a process can access should be represented as a file descriptor carrying an explicit rights mask. Global ambient authority, meaning the ability to open arbitrary paths, create sockets, signal arbitrary processes, or call sysctl, is revoked when the process calls cap_enter(2).
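#include <sys/capsicum.h>
#include <err.h>
#include <fcntl.h>
/* Open everything needed BEFORE sealing the process */
int fd = open(argv[1], O_RDONLY);
if (fd < 0)
    err(1, "open");
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_FSTAT);
if (cap_rights_limit(fd, &rights) < 0)
    err(1, "cap_rights_limit");
/* One-way door. No absolute paths, no new sockets, no signalling other processes. */
if (cap_enter() < 0)
    err(1, "cap_enter");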
After cap_enter(), calling open("/etc/passwd", ...) fails with ECAPMODE. The only way to access a file is through a file descriptor that was opened before entering capability mode, or through openat(2) relative to a directory FD that carries CAP_LOOKUP. The rights on that directory FD constrain what can be done within it: CAP_CREATE permits O_CREAT, CAP_UNLINKAT permits deletion, and so on.
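The directory-FD pattern can be sketched as follows. This is a minimal illustration, assuming FreeBSD's <sys/capsicum.h>; open_beneath_sandboxed is a hypothetical helper name, and the non-FreeBSD branch is only a stub so the file compiles elsewhere. Note that the helper enters capability mode, which is irreversible for the calling process.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#if defined(__FreeBSD__)
#include <sys/capsicum.h>
#endif

/*
 * Hypothetical helper: limit dirfd to lookup + read, enter capability
 * mode, then open `name` relative to dirfd. Returns an fd or -1.
 */
int open_beneath_sandboxed(int dirfd, const char *name)
{
#if defined(__FreeBSD__)
    cap_rights_t rights;

    /* The directory FD may only be used for lookups and read-only opens. */
    cap_rights_init(&rights, CAP_LOOKUP, CAP_READ, CAP_FSTAT);
    if (cap_rights_limit(dirfd, &rights) < 0)
        return -1;
    if (cap_enter() < 0)
        return -1;
    /* Relative lookup beneath dirfd; an absolute path would fail here. */
    return openat(dirfd, name, O_RDONLY);
#else
    /* Stub for platforms without Capsicum (assumption of this sketch). */
    (void)dirfd;
    (void)name;
    errno = ENOSYS;
    return -1;
#endif
}
```

A caller would open the directory first, for example with open("/etc/ssl/certs", O_RDONLY | O_DIRECTORY), and pass the resulting descriptor in. The file descriptor that comes back carries no more rights than the directory FD allowed.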
The rights system is monotonically decreasing. A sandboxed process can only lose rights, never gain them. If you pass an FD to a child or a subprocess, it inherits at most the rights the parent had. This property makes capability-based confinement composable in a way that policy-based systems struggle to match.
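The monotonic property is easy to model. The sketch below is a toy model, not FreeBSD's implementation: the R_* bits, struct cap_fd, cap_limit, and cap_allows are all hypothetical names chosen for illustration. It shows a limit operation that accepts only subsets of the rights currently held.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical rights bits for illustration; not FreeBSD's CAP_* values. */
enum { R_READ = 1u << 0, R_WRITE = 1u << 1, R_LOOKUP = 1u << 2 };

struct cap_fd {
    int      fd;
    uint32_t rights;
};

/* Succeeds only if `wanted` is a subset of the rights currently held. */
bool cap_limit(struct cap_fd *c, uint32_t wanted)
{
    if ((wanted & ~c->rights) != 0)
        return false;       /* would add a right: refuse, change nothing */
    c->rights = wanted;
    return true;
}

/* Per-operation check: every needed bit must be present in the mask. */
bool cap_allows(const struct cap_fd *c, uint32_t needed)
{
    return (c->rights & needed) == needed;
}
```

Starting from R_READ | R_WRITE, limiting to R_READ succeeds, and a later attempt to limit to R_READ | R_WRITE is refused. Because the only state change is an intersection, handing the descriptor to another process can never widen what that process may do with it.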
For operations that capability mode blocks entirely, such as DNS resolution, syslog, and group database lookups, FreeBSD ships Casper (libcasper(3) in current releases, a standalone casperd(8) daemon in older ones): helper services that run outside the sandbox and provide capability-safe versions of common library functions over pre-opened socket FDs. Programs like tcpdump, dhclient, and ssh on FreeBSD have been Capsicumized this way.
The seccomp Model: A BPF Filter on Every Syscall
Linux’s seccomp, in its original 2005 form (SECCOMP_MODE_STRICT), was nearly useless for general sandboxing: it allowed exactly four syscalls (read, write, _exit, and sigreturn) and was designed for computational grids running untrusted binaries. Will Drewry’s seccomp-bpf extension, merged in Linux 3.5 in 2012 and motivated directly by Chrome’s sandboxing needs, made it genuinely useful.
Instead of revoking ambient authority, seccomp installs a BPF bytecode filter that runs before every syscall. The filter inspects a seccomp_data struct containing the syscall number, architecture, and raw argument values, then returns an action:
#include <stddef.h>        /* offsetof */
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/audit.h>
#include <sys/prctl.h>
#include <sys/syscall.h>   /* __NR_* syscall numbers */
/* Always validate arch first to prevent 32-bit compat bypass on x86_64 */
static struct sock_filter filter[] = {
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};
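The array is only the program; it does nothing until it is attached to the process. The following is a minimal sketch of that step, assuming x86-64 or AArch64 and a kernel with seccomp filtering enabled. forbidden_syscall_is_fatal, ALLOW_SYSCALL, and SKETCH_AUDIT_ARCH are hypothetical names, and the fork exists only so the kill can be observed from outside.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: this sketch only knows two architectures. */
#if defined(__x86_64__)
#define SKETCH_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SKETCH_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

/* Expands to: if (nr == n) return ALLOW; else fall through to next test. */
#define ALLOW_SYSCALL(n) \
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (n), 0, 1), \
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)

static struct sock_filter allowlist[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_AUDIT_ARCH, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    ALLOW_SYSCALL(__NR_read),
    ALLOW_SYSCALL(__NR_write),
    ALLOW_SYSCALL(__NR_exit_group),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};

/*
 * Returns 1 if a child running under the filter was killed by SIGSYS
 * when it made a syscall outside the allowlist, 0 if it survived, and
 * -1 if the filter could not be installed (or fork/waitpid failed).
 */
int forbidden_syscall_is_fatal(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct sock_fprog prog = {
            .len = (unsigned short)(sizeof(allowlist) / sizeof(allowlist[0])),
            .filter = allowlist,
        };
        /* NO_NEW_PRIVS lets an unprivileged process install a filter. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(2);           /* filter not active: exit normally */
        syscall(__NR_getppid);  /* not on the allowlist */
        _exit(0);               /* reached only if the filter did not fire */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSYS)
        return 1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 2)
        return -1;
    return 0;
}
```

Once installed, the filter cannot be removed, and it is inherited across fork and exec. Further filters can only be stacked on top, so the policy can get stricter but never looser.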
Writing raw BPF is painful. libseccomp wraps all of this in a portable C API that handles architecture normalization and program optimization:
#include <seccomp.h>
#include <sys/mman.h>      /* PROT_WRITE, PROT_EXEC */
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
/* Restrict mmap to non-writable, non-executable mappings */
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 1,
SCMP_A2(SCMP_CMP_MASKED_EQ, PROT_WRITE | PROT_EXEC, 0));
seccomp_load(ctx);
seccomp_release(ctx);
The ecosystem around seccomp is substantial: Chrome, Firefox, OpenSSH, Android, Docker, systemd, Flatpak, and QEMU all use it. Docker’s default seccomp profile blocks around 44 of the 300-plus syscalls. systemd lets service units declare SystemCallFilter= in plain text. Adoption has been broad.
Where the Designs Diverge
The fundamental difference is what each approach models as the unit of protection.
Capsicum’s unit is the capability, a file descriptor with an attached rights mask. Whether you call read(2), pread(2), or readv(2) is irrelevant; what matters is whether the FD you’re using carries CAP_READ. The protection is structural: it describes what the process holds, and the kernel enforces consequences from that description automatically.
seccomp’s unit is the syscall invocation. Whether you can read a file depends not on what FDs you hold but on whether read (or pread64, or readv, or any of the other syscalls that read data) appears in your allowlist. This is a policy layer, not a structural one, and maintaining a correct policy is substantially harder than it looks.
The pointer-dereferencing limitation is where seccomp most visibly strains. A BPF filter can compare the raw value of openat()’s second argument, the pathname pointer, against a constant, but it cannot dereference that pointer to inspect the string. You cannot write a seccomp filter that allows openat(dirfd, "/etc/ssl/certs", ...) but blocks openat(dirfd, "/etc/shadow", ...) by examining the path. The BPF program runs in the kernel before the syscall with access only to the raw register values. Even if it could read user memory, another thread could rewrite the string between the check and the kernel’s own read of it.
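The limitation can be shown with a toy model. In the sketch below, struct syscall_view is a hypothetical mirror of the fields a filter can read (syscall number, architecture, six raw 64-bit arguments), and filter_allows stands in for a BPF program that can only compare integers. None of these names come from the kernel; the point is only that the verdict depends on the address, not on what is stored there.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the fields a seccomp filter can read. */
struct syscall_view {
    int      nr;
    uint32_t arch;
    uint64_t args[6];   /* raw register values, pointers included */
};

/* A "filter" that can only compare integers, as classic BPF does. */
bool filter_allows(const struct syscall_view *v, uint64_t expected_ptr)
{
    return v->args[1] == expected_ptr;
}

/* Returns true if the verdict changed after the path string changed. */
bool verdict_tracks_path(void)
{
    char path[32] = "/etc/ssl/certs";
    struct syscall_view v = { .nr = 0, .arch = 0, .args = { 0 } };

    v.args[1] = (uint64_t)(uintptr_t)path;
    bool before = filter_allows(&v, (uint64_t)(uintptr_t)path);

    strcpy(path, "/etc/shadow");    /* same address, different string */
    bool after = filter_allows(&v, (uint64_t)(uintptr_t)path);

    return before != after;
}
```

verdict_tracks_path() returns false: the filter says yes both times, because the pointer it sees is identical even though the path it points at is not.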
Capsicum sidesteps this entirely. The question of which files a process can access is answered by which directory FDs it holds with which rights, not by runtime inspection of path strings.
Where Capsicum Strains
Capsicum’s structural elegance comes with a real cost: retrofitting it to an existing application requires significant restructuring. Every resource must be acquired before cap_enter(). For programs with lazy initialization, complex plugin loading, or many code paths that opportunistically open files or create connections, this is an architectural refactoring, not a configuration change. tcpdump required nontrivial work. Long-running server daemons that handle configuration reloads mid-flight need careful rethinking.
The casper daemon reduces this burden for common cases, but it increases the trusted computing base. Any vulnerability in a casper service is a vulnerability in the sandbox. Applications that need DNS, syslog, and group lookups each get a separate channel to a separate privileged service, all of which must be audited.
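seccomp has no equivalent restructuring burden. You can call seccomp_load() at any point in your program’s lifecycle, and because filters are inherited across fork and exec, a launcher can install one before starting an unmodified target. For sandboxing existing applications without source changes (Docker containers, Flatpak wrappers, systemd service units), seccomp is the only one of the two that works.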
The io_uring Problem
io_uring, introduced in Linux 5.1, creates a shared memory ring buffer between userspace and the kernel through which I/O operations are submitted without individual syscalls. A seccomp filter cannot inspect io_uring operations because they do not manifest as syscalls in the traditional sense. The io_uring_enter syscall submits batches of operations, and the filter sees only that one syscall, not the individual operations within it.
This is a genuine gap. Docker’s default seccomp profile disables io_uring entirely. Linux 6.x has improved the interaction between io_uring and seccomp, but the fundamental tension remains: a syscall-filtering model does not naturally compose with a batch-submission model.
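Disabling io_uring with a filter comes down to refusing its three syscalls. The following is a rough sketch of such a denylist in raw BPF, assuming x86-64 or AArch64; it is not Docker's actual profile. io_uring_is_denied and SKETCH_AUDIT_ARCH are hypothetical names, and the fallback syscall numbers are an assumption for older headers on those two architectures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define SKETCH_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SKETCH_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

/* Fallbacks for old headers (numbers valid on x86-64 and AArch64). */
#ifndef __NR_io_uring_setup
#define __NR_io_uring_setup    425
#define __NR_io_uring_enter    426
#define __NR_io_uring_register 427
#endif

static struct sock_filter deny_io_uring[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_AUDIT_ARCH, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* Reject x32-style numbers so the denylist cannot be sidestepped. */
    BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, 0x40000000u, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_io_uring_setup, 2, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_io_uring_enter, 1, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_io_uring_register, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/*
 * Returns 1 if io_uring_setup fails with EPERM under the filter,
 * 0 if it does not, -1 if the filter could not be installed.
 */
int io_uring_is_denied(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct sock_fprog prog = {
            .len = (unsigned short)(sizeof(deny_io_uring) / sizeof(deny_io_uring[0])),
            .filter = deny_io_uring,
        };
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(2);
        long rc = syscall(__NR_io_uring_setup, 1, NULL);
        _exit(rc == -1 && errno == EPERM ? 0 : 1);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 2)
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

The filter is all-or-nothing: it can refuse io_uring_enter, but it has no way to permit reads through the ring while refusing, say, connects, because the individual operations sit in shared memory the filter never sees.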
Capsicum has no equivalent problem. If a process holds an FD with CAP_READ and CAP_WRITE, it can use whatever kernel interface it has access to for reading and writing that FD, including async interfaces, because the rights check is at the capability layer, not the syscall layer.
Linux’s Slow Convergence on FD-Based Confinement
The Linux kernel has been quietly building toward Capsicum’s FD-centric model through Landlock, an LSM that reached mainline in Linux 5.13. Landlock restricts filesystem access with rulesets: a ruleset is itself an FD, each rule is anchored to an open directory FD, and the resulting policy is enforced inside the kernel’s filesystem access checks. It is usable without privileges:
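#include <fcntl.h>
#include <linux/landlock.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
struct landlock_ruleset_attr attr = {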
.handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE |
LANDLOCK_ACCESS_FS_WRITE_FILE,
};
int ruleset_fd = syscall(SYS_landlock_create_ruleset, &attr, sizeof(attr), 0);
struct landlock_path_beneath_attr path = {
.allowed_access = LANDLOCK_ACCESS_FS_READ_FILE,
.parent_fd = open("/etc/ssl", O_PATH | O_CLOEXEC),
};
syscall(SYS_landlock_add_rule, ruleset_fd, LANDLOCK_RULE_PATH_BENEATH, &path, 0);
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
syscall(SYS_landlock_restrict_self, ruleset_fd, 0);
Landlock ABI versions have shipped steadily: filesystem rules in v1 (5.13), cross-directory rename and link in v2 (5.19), truncation in v3 (6.2), TCP bind/connect in v4 (6.7), ioctl restrictions on device files in v5 (6.10), and IPC scoping for signals and abstract UNIX sockets in v6 (6.12). The signal scoping in v6 directly parallels Capsicum’s restrictions on kill() in capability mode.
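Because the ABI grows over time, a program has to ask the running kernel which version it supports and trim its ruleset to match. The following is a minimal sketch of that probe, assuming kernel headers that provide <linux/landlock.h>. landlock_abi_version and fs_access_for_abi are hypothetical helper names, and the fallback defines are assumptions for older headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <stdint.h>
#include <linux/landlock.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers (assumed to match the kernel UAPI). */
#ifndef __NR_landlock_create_ruleset
#define __NR_landlock_create_ruleset 444
#endif
#ifndef LANDLOCK_ACCESS_FS_REFER
#define LANDLOCK_ACCESS_FS_REFER (1ULL << 13)
#endif
#ifndef LANDLOCK_ACCESS_FS_TRUNCATE
#define LANDLOCK_ACCESS_FS_TRUNCATE (1ULL << 14)
#endif

/* Highest Landlock ABI the running kernel supports, or -1 if none. */
int landlock_abi_version(void)
{
    return (int)syscall(__NR_landlock_create_ruleset, NULL, 0,
                        LANDLOCK_CREATE_RULESET_VERSION);
}

/* Drop filesystem access bits that the given ABI does not know about. */
uint64_t fs_access_for_abi(int abi, uint64_t wanted)
{
    if (abi < 2)
        wanted &= ~LANDLOCK_ACCESS_FS_REFER;     /* added in v2 */
    if (abi < 3)
        wanted &= ~LANDLOCK_ACCESS_FS_TRUNCATE;  /* added in v3 */
    return wanted;
}
```

The masked value would then go into handled_access_fs. Passing a bit the kernel does not recognise makes ruleset creation fail, so this best-effort trimming is what lets one binary run across kernel versions, at the cost of weaker confinement on older ones.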
Landlock plus seccomp together approximate Capsicum’s protection model, but with considerably more surface area to configure and more distinct APIs to understand. The conceptual unification that Capsicum offers, where everything flows from the rights attached to FDs, does not exist on Linux. There is no Linux equivalent of cap_enter() as a unified semantic point of no return.
The out-of-tree Capsicum-Linux port has attempted to bridge this gap using namespaces and seccomp as a backend, but as of early 2025 it remains unmerged. The kernel community’s preference has been to build incrementally through Landlock rather than adopt the full Capsicum model.
Overhead
Capsicum’s per-syscall overhead is a bitmask AND in the FD lookup path, which is already hot. The 2010 Capsicum paper by Watson, Anderson, Laurie, and Kennaway measured under 1% overhead on syscall-heavy workloads. The check is O(1) and architecture-independent.
seccomp’s overhead scales with filter complexity. A simple five-instruction BPF filter adds a few nanoseconds per syscall. A complex allowlist with 50 or more entries and multiple stacked filters can add 20 to 40 percent overhead on syscall-heavy microbenchmarks, though real workloads that spend most of their time in userspace see far less. The kernel has JIT-compiled seccomp filters since the 3.x series on architectures with a BPF JIT, and Linux 5.11 added a per-syscall bitmap cache that lets syscalls a filter always allows skip the BPF program entirely, so the actual performance gap on typical applications is small. Chrome measured roughly 1 to 2 percent overhead from its renderer sandbox.
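Numbers like these are easy to check on your own hardware. The sketch below times a cheap syscall in a tight loop. ns_per_syscall is a hypothetical helper, getppid is chosen only because it does almost no work in the kernel, and results will vary with CPU, kernel version, and mitigations.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <time.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Average wall-clock nanoseconds per getppid() syscall over `iters` calls. */
double ns_per_syscall(long iters)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getppid);   /* raw syscall, bypassing any libc caching */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
                (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

Running it once in an unfiltered process and once in a child that has installed the filter under test gives a per-syscall delta. That delta is the worst case, since a real program does far more than make empty syscalls.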
Choosing Between Them
On FreeBSD, Capsicum is the right default for new privilege-separated code. The model is cleaner, the overhead is lower, and the protection is structural rather than policy-based. The restructuring cost is real but pays off in auditable, composable confinement.
On Linux, the answer is Landlock plus seccomp via libseccomp, with SECCOMP_RET_USER_NOTIF for cases that require dynamic policy decisions at runtime. The combination is less elegant but more widely supported, and the gap with Capsicum has been narrowing release by release.
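For portable sandboxing code that runs on both platforms, the practical approach is to use Capsicum where it is available and fall back to seccomp-bpf on Linux, which is what OpenSSH does, choosing the backend at build time. (cap_getmode(2) reports whether a process is already in capability mode; it is a run-time check on FreeBSD rather than a portability probe.) The policies will differ in expressivity, but both provide meaningful confinement.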
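One way to structure the selection step is a build-time probe for the relevant headers. This is a minimal sketch, assuming a compiler that supports __has_include (GCC, Clang, or any C23 compiler); it is not how OpenSSH's build system does it, and the enum, macros, and function names are hypothetical.

```c
#include <stddef.h>

/* Build-time probe: prefer Capsicum, then seccomp-bpf, else nothing. */
#if defined(__has_include)
#  if __has_include(<sys/capsicum.h>)
#    define HAVE_CAPSICUM 1
#  elif __has_include(<linux/seccomp.h>)
#    define HAVE_SECCOMP_BPF 1
#  endif
#endif

enum sandbox_kind { SANDBOX_NONE, SANDBOX_CAPSICUM, SANDBOX_SECCOMP_BPF };

/* Which backend this binary was compiled to use. */
enum sandbox_kind sandbox_backend(void)
{
#if defined(HAVE_CAPSICUM)
    return SANDBOX_CAPSICUM;
#elif defined(HAVE_SECCOMP_BPF)
    return SANDBOX_SECCOMP_BPF;
#else
    return SANDBOX_NONE;
#endif
}

/* Human-readable name, e.g. for a startup log line. */
const char *sandbox_backend_name(enum sandbox_kind k)
{
    switch (k) {
    case SANDBOX_CAPSICUM:    return "capsicum";
    case SANDBOX_SECCOMP_BPF: return "seccomp-bpf";
    default:                  return "none";
    }
}
```

Each backend would then sit behind a common pair of calls, such as an init function that pre-opens resources and an enter function that seals the process. The rest of the program never needs to know which mechanism is underneath.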
The deeper lesson from putting these two systems side by side is that good sandboxing is hard not because of missing primitives but because of the gap between what applications assume about the environment and what a correctly sandboxed process is allowed to do. Both Capsicum and seccomp solve the enforcement problem well. The structuring problem, making existing applications operate correctly within confinement, is where the real work happens, and neither system makes that easy.