Two Axes of Process Sandboxing: What Capsicum and seccomp Each Get Right
Source: lobsters
Most Linux developers encounter seccomp-bpf before they encounter Capsicum, if they encounter Capsicum at all. Vivian Voss’s comparison of the two is a useful side-by-side, but the deeper question is architectural: the two mechanisms do not just differ in API surface; they address different dimensions of the privilege problem entirely.
Operations vs. Objects
seccomp-bpf, added to Linux in kernel 3.5 through work by Will Drewry at Google, installs a BPF filter that runs on every syscall entry. The filter receives the syscall number and up to six integer arguments; it returns a verdict: allow, kill, return a specific errno, deliver SIGSYS, or forward to a userspace supervisor via seccomp-notify (added in Linux 5.0). Using libseccomp, a basic filter looks like:
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
seccomp_load(ctx);
The process can now call openat(). The filter is satisfied. But openat(AT_FDCWD, "/etc/shadow", O_RDONLY) still succeeds if the process’s UID permits it, because seccomp can only inspect the syscall number and its scalar arguments. The path string lives in user memory, and the filter sees only the pointer value: seccomp’s classic BPF cannot dereference it, and a supervisor that reads the string out of band is exposed to TOCTOU races. The process retains full ambient authority over every resource it could access before the filter was installed, for every syscall the filter allows through.
Capsicum, described in the Watson et al. 2010 USENIX Security paper and shipped in FreeBSD 9.0, solves the problem at a different layer. Rather than filtering which syscalls can be invoked, it cuts the process off from global namespaces. After cap_enter(), calls that name an object through a global namespace, such as open() on a path or connect() to an arbitrary address, fail with ECAPMODE. The process can only operate on file descriptors it already holds, or on descriptors derived from them, and each fd carries an explicit set of capability rights that can be attenuated but never amplified:
int fd = open("/var/myapp/data.db", O_RDWR);
// Strip this fd down to read and stat, no write, no seek
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_FSTAT);
cap_rights_limit(fd, &rights);
cap_enter(); // One-way; cannot be undone
// Works: fd has CAP_READ
read(fd, buf, sizeof(buf));
// Fails ENOTCAPABLE: fd has no CAP_WRITE
write(fd, buf, sizeof(buf));
// Fails ECAPMODE: global namespace is gone
open("/etc/passwd", O_RDONLY);
Authority is no longer ambient. It lives in the file descriptor. A library handed this fd can read from it; it cannot write, seek, or open anything else, regardless of the process’s UID or what path string it passes.
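The "attenuated but never amplified" rule can be modelled in a few lines of portable C. This is a toy sketch, not FreeBSD's implementation: the toy_fd type, the R_* bits, and both function names are invented for illustration, and the real cap_rights_t is a larger opaque structure.

```c
#include <stdint.h>

/* Toy rights bits, invented for this sketch. */
enum {
    R_READ  = 1u << 0,
    R_WRITE = 1u << 1,
    R_SEEK  = 1u << 2,
    R_FSTAT = 1u << 3
};

/* A stand-in for a descriptor that carries its own rights. */
struct toy_fd {
    uint32_t rights;
};

/* Narrow the rights on fd. Succeeds only when `wanted` is a subset of the
 * current rights; any attempt to add a right is refused and changes nothing. */
int toy_rights_limit(struct toy_fd *fd, uint32_t wanted)
{
    if ((wanted & ~fd->rights) != 0)
        return -1; /* would amplify: refuse */
    fd->rights = wanted;
    return 0;
}

/* An operation is permitted only if every right it needs is present. */
int toy_permits(const struct toy_fd *fd, uint32_t needed)
{
    return (fd->rights & needed) == needed;
}
```

Because the only mutation is a subset check followed by assignment, the rights on a descriptor can shrink over its lifetime but never grow, which is the property the prose above relies on.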
The Confused Deputy Problem
Norm Hardy described the confused deputy problem in 1988. A program that acts as a server holds its own privileges, granted by its installation, independently of any given client invocation. When a client supplies crafted input, the program can be made to exercise its own privileges on the client’s behalf, touching resources the client cannot access directly. A compiler with write access to both user output files and the system billing database can be made to overwrite the billing database if an attacker controls the output filename.
In POSIX, this surfaces anywhere a program takes a filename argument and calls open() on it. The call uses the process’s ambient filesystem access, not any authority granted per-invocation by the caller. A sandboxed component that accepts a path and opens it still carries the full ambient authority of the process for that operation.
Capsicum addresses this structurally. The caller passes a directory fd scoped to a permitted subtree:
int dirfd = open("/var/myapp/uploads", O_DIRECTORY | O_RDONLY);
cap_rights_t dir_rights;
cap_rights_init(&dir_rights, CAP_LOOKUP, CAP_READ);
cap_rights_limit(dirfd, &dir_rights);
cap_enter();
// Capability mode confines the lookup beneath dirfd: absolute paths and
// ".." components that would leave /var/myapp/uploads are rejected
int f = openat(dirfd, user_supplied_path, O_RDONLY);
In capability mode the kernel confines path resolution to the subtree beneath the directory descriptor; CAP_LOOKUP is the right that lets dirfd serve as a starting point for lookups at all. An openat() call with ../../etc/shadow fails with ENOTCAPABLE because the lookup would escape that subtree. The component’s behavior is determined by the rights on the fds it was given, not by the process’s position in the global filesystem hierarchy.
seccomp has no mechanism for this. You can block openat() entirely or allow it; if you allow it, every permitted call carries the process’s full ambient authority to open any path the UID can reach.
What seccomp Does Well
seccomp’s strength lies in a different dimension: reducing the kernel’s attack surface. Syscalls like ptrace(), perf_event_open(), keyctl(), userfaultfd(), and kexec_load() have been entry points for numerous privilege escalation exploits. Blocking them at the seccomp layer removes entire classes of kernel vulnerability from consideration, independent of which files the process can access. A process that is fully legitimate in its resource usage might still exploit a kernel bug through an obscure syscall; seccomp can block that syscall unconditionally.
seccomp also allows argument-level filtering on scalar values. A filter can allow socket(AF_INET, ...) while blocking socket(AF_PACKET, ...) and socket(AF_NETLINK, ...), eliminating raw socket access without affecting TCP or UDP. Docker’s default seccomp profile blocks roughly 40 syscalls considered dangerous in container contexts; Chrome’s renderer process on Linux installs a tailored seccomp-bpf filter combined with user namespaces.
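Capsicum offers no programmable syscall filter: capability mode disables a fixed set of calls that reach global namespaces, but an application cannot tune that set. Preventing mmap(PROT_EXEC) or ptrace() in a Capsicum sandbox requires additional mechanisms. The Watson paper acknowledges this directly; the two approaches are designed to complement each other.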
The Linux Response
Linux has been incrementally building toward object-level restrictions. Landlock, merged in Linux 5.13 (2021), lets a process restrict its own filesystem access using path-based rules without requiring elevated privileges:
// The C library may not provide wrappers for these calls; each landlock_*()
// name below stands for a thin syscall(SYS_landlock_*, ...) helper
struct landlock_ruleset_attr attr = {
    .handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE |
                         LANDLOCK_ACCESS_FS_WRITE_FILE,
};
int ruleset_fd = landlock_create_ruleset(&attr, sizeof(attr), 0);
struct landlock_path_beneath_attr path_attr = {
    .allowed_access = LANDLOCK_ACCESS_FS_READ_FILE,
    .parent_fd = open("/var/myapp", O_PATH | O_CLOEXEC),
};
landlock_add_rule(ruleset_fd, LANDLOCK_RULE_PATH_BENEATH, &path_attr, 0);
// Required before an unprivileged process may restrict itself
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
landlock_restrict_self(ruleset_fd, 0);
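After landlock_restrict_self(), the access types named in handled_access_fs (here, reading and writing files) are denied everywhere except where a rule grants them, regardless of which syscall is used. With this ruleset the process can read files beneath /var/myapp and can write nowhere; access types left out of handled_access_fs, such as listing directories or executing files, remain governed by ordinary permissions. Linux 6.7 added TCP bind and connect port restrictions; 6.12 added scoping for abstract UNIX sockets and signals. Landlock is path-hierarchy based rather than per-fd rights attenuation, so it does not close the confused deputy problem the way Capsicum does. But it moves enforcement to the object layer rather than the operation layer, which is the meaningful architectural step.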
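The Landlock listing above calls landlock_create_ruleset() and its siblings as if they were library functions. A C library may not ship such wrappers (the kernel's own sample code defines its own), so they are typically reached through syscall(2). The sketch below shows one way that could look. The ll_ names are invented for this example, and it assumes kernel headers recent enough to provide linux/landlock.h.

```c
#define _GNU_SOURCE
#include <linux/landlock.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper names; the ll_ prefix avoids clashing with any
 * wrappers a libc might declare. */
static inline int ll_create_ruleset(const struct landlock_ruleset_attr *attr,
                                    size_t size, __u32 flags)
{
    return (int)syscall(SYS_landlock_create_ruleset, attr, size, flags);
}

static inline int ll_add_rule(int ruleset_fd, enum landlock_rule_type type,
                              const void *rule_attr, __u32 flags)
{
    return (int)syscall(SYS_landlock_add_rule, ruleset_fd, type, rule_attr, flags);
}

static inline int ll_restrict_self(int ruleset_fd, __u32 flags)
{
    return (int)syscall(SYS_landlock_restrict_self, ruleset_fd, flags);
}

/* Returns the Landlock ABI version (>= 1), or -1 with errno set when the
 * running kernel lacks Landlock or has it disabled. */
int ll_abi_version(void)
{
    return ll_create_ruleset(NULL, 0, LANDLOCK_CREATE_RULESET_VERSION);
}
```

Probing the ABI version first lets a program degrade gracefully: it can request only the access types the running kernel understands, or skip Landlock entirely when the probe returns -1.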
OpenBSD took a third path with pledge(2) (2015) and unveil(2) (2018). pledge groups syscalls into coarse promise categories (stdio, rpath, wpath, inet, dns, and others); a process declares which groups it needs, and the kernel kills a process that violates its promises with SIGABRT. unveil restricts the visible filesystem namespace to declared paths. Together they approximate much of Capsicum’s security surface with considerably lower porting effort, trading per-fd granularity for a simpler API.
Practical Trade-offs
Capsicum’s porting cost is real. POSIX interfaces that perform path resolution internally, including getaddrinfo(), openlog(), and dlopen(), do not work in capability mode without going through Casper (libcasper), FreeBSD’s framework of helper services that perform such operations on the sandbox’s behalf from outside it. Applications like tcpdump, OpenSSH, and dhclient on FreeBSD have been retrofitted, but the process typically requires several hundred lines of changes per application, and cross-platform software must maintain separate sandboxing paths for Linux and FreeBSD.
seccomp is considerably easier to add to an existing application. Tracing the syscall set with strace, constructing a policy with libseccomp, and loading the filter at the right point (with seccomp_load(), or a direct prctl() or seccomp() call) is tractable for most programs. The ecosystem is deep: Docker profiles, systemd’s SystemCallFilter= in unit files, Flatpak, and bubblewrap all rely on it. The result is not as strong as Capsicum for confused-deputy scenarios, but it is available on Linux and widely understood.
Production sandboxing architectures layer both models. Chrome on Linux combines seccomp-bpf with user and mount namespaces in each renderer process. gVisor combines a strict seccomp-bpf filter on its Sentry process with a full Linux kernel ABI implemented in Go, so guest processes never reach the host kernel directly. The syscall filter handles operation-space risk; the namespace and filesystem restrictions handle resource-space risk. Neither layer can substitute for the other.
The Design Divide
Capsicum and seccomp reflect a genuine fork in how to express security policy. seccomp asks which kernel operations a process should be permitted to invoke. Capsicum asks which objects a process should be permitted to act upon, and with what rights. Each question is tractable; neither subsumes the other.
The confused deputy problem has been documented since 1988, and POSIX’s ambient authority model means that a syscall-only filter leaves it unaddressed. Capsicum’s contribution was demonstrating that retrofitting capability-based authority onto a POSIX system is feasible, not just theoretically appealing. Landlock and seccomp-notify on Linux are converging on similar properties from a different direction, fifteen years later. A complete sandbox addresses both dimensions.