
Capability Models vs. Syscall Filters: What Capsicum and seccomp Reveal About Process Sandboxing

Source: lobsters

Process sandboxing has two dominant approaches in the Unix world, and they disagree at a fundamental level about what the problem actually is. Capsicum, developed at the University of Cambridge and first shipped as an experimental feature in FreeBSD 9.0 (released in January 2012), is a capability-based framework: it strips a process of ambient authority and lets it operate only through file descriptors with explicitly limited rights. seccomp, originally added to the Linux kernel in 2.6.12 (2005) as a strict mode that permitted only a handful of syscalls, and extended with BPF-based filtering in 3.5 (2012), is a syscall filter: it intercepts system calls and decides, per call, whether to allow or deny them. Vivian Voss’s comparison covers the mechanics well. What’s worth digging into is why the two approaches diverge so sharply, and what that divergence costs in practice.

What Ambient Authority Actually Means

The classic Unix security model gives every process a user identity and lets it access any resource that identity can reach. This is ambient authority: the process doesn’t need to explicitly hold a ticket to /etc/passwd to open it; it just opens it, and the kernel checks whether the current uid is allowed. Capsicum’s insight is that ambient authority is the real problem. A sandboxed process running as the same user still has the ability to open arbitrary files, look up paths, and call execve unless something intervenes.

Capabilities solve this by making authority explicit and object-bound. Instead of a process identity that implicitly grants access to a namespace, you have file descriptors that carry their own permission sets. Before a process enters capability mode, it opens the files and sockets it needs, restricts those descriptors to the minimum required rights, and then calls cap_enter(). After that call, the process can no longer reach global namespaces: open() on an arbitrary path fails with ECAPMODE, and any operation on a descriptor that exceeds its granted rights fails with ENOTCAPABLE.

#include <sys/capsicum.h>
#include <err.h>

/* Restrict a socket to read and write only */
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE);
if (cap_rights_limit(sock_fd, &rights) < 0)
    err(1, "cap_rights_limit");

/* Irrevocably enter capability mode */
if (cap_enter() < 0)
    err(1, "cap_enter");

/* From here: open() fails with ECAPMODE and global path lookups fail,
   but read() and write() on sock_fd still work */

The critical property is that cap_enter() is irrevocable. The process cannot leave capability mode, and there is no call that restores the authority it gave up. The rights on a descriptor can be further restricted but never expanded. This is a one-way ratchet, which is exactly what a sandbox needs to be.

How seccomp Works Instead

seccomp takes the opposite approach: it doesn’t touch ambient authority at all. The process still runs as the same user with the same file namespace. What seccomp does is intercept system calls at the kernel boundary and evaluate them against a BPF program installed by the process itself.

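#include <err.h>
#include <errno.h>
#include <seccomp.h>

/* Default action: allow everything not matched by a rule below */
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
if (ctx == NULL)
    errx(1, "seccomp_init");

/* Deny specific syscalls with EPERM */
if (seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(open), 0) < 0 ||
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(openat), 0) < 0 ||
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(execve), 0) < 0)
    errx(1, "seccomp_rule_add");

/* Load the filter into the kernel, then free the userspace context */
if (seccomp_load(ctx) < 0)
    errx(1, "seccomp_load");
seccomp_release(ctx);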

The filter sees the syscall number, the calling architecture, the instruction pointer, and up to six raw argument values. It cannot dereference pointers. This is a deliberate kernel design choice to prevent time-of-check/time-of-use (TOCTOU) races: if the filter checked a filename by reading from userspace, another thread could change the filename between the check and the actual syscall. So the filter sees integer arguments only. For openat, it can check the flags integer; it cannot read the path string.

This limitation has real consequences. You cannot write a seccomp filter that says “allow openat only for paths under /tmp”. You can say “deny openat entirely” or “allow openat only when its flags request read-only access”. Any decision that depends on what a pointer argument points to requires a different mechanism.
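To make the integer-only constraint concrete, here is a minimal sketch of a raw seccomp-bpf filter that allows openat only when the access mode in its flags argument is O_RDONLY. The function names, the FILTER_ARCH macro, and the fork-based self-check are illustrative assumptions, not part of any library, and the sketch assumes a little-endian x86-64 or AArch64 Linux system with kernel headers from 4.14 or later. It is not a complete policy: a real filter would also cover open, openat2, and creat, and would reject x32 syscall numbers on x86-64.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define FILTER_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define FILTER_ARCH AUDIT_ARCH_AARCH64
#else
#error "set FILTER_ARCH to the AUDIT_ARCH_* value for this architecture"
#endif

/* Fail openat() with EPERM unless (flags & O_ACCMODE) == O_RDONLY.
 * Returns 0 on success, -1 on error. The filter cannot be removed. */
int install_readonly_openat_filter(void)
{
    struct sock_filter insns[] = {
        /* 0-2: kill the process if the syscall ABI is not the expected one */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, FILTER_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* 3-4: anything other than openat jumps straight to ALLOW */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat, 0, 3),
        /* 5-7: low 32 bits of the flags argument (args[2]), masked to the
         * access mode; the path pointer in args[1] is never dereferenced */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[2])),
        BPF_STMT(BPF_ALU | BPF_AND | BPF_K, O_ACCMODE),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, O_RDONLY, 0, 1),
        /* 8-9: verdicts */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    /* Required before an unprivileged process may install a filter */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == 0 ? 0 : -1;
}

/* Hypothetical self-check: installs the filter in a child process and
 * returns the child's exit status (0 means the filter behaved as intended). */
int probe_readonly_openat_filter(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_readonly_openat_filter() != 0)
            _exit(2);
        if (syscall(SYS_openat, AT_FDCWD, "/dev/null", O_RDONLY) < 0)
            _exit(3);   /* read-only open should still succeed */
        if (syscall(SYS_openat, AT_FDCWD, "/dev/null", O_WRONLY) != -1 ||
            errno != EPERM)
            _exit(4);   /* write open should be refused with EPERM */
        _exit(0);
    }
    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The filter only ever looks at register values. The same program has no way to express “only under /tmp”, because that would mean reading the string behind the path pointer.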

The Return Actions and Their Uses

seccomp filters return one of several actions, and the diversity of actions is part of what makes seccomp useful beyond simple blocking:

  • SECCOMP_RET_ALLOW: let the call proceed
  • SECCOMP_RET_ERRNO (combined with an errno value n): fail the call with that error without executing the syscall
  • SECCOMP_RET_TRAP: send SIGSYS to the process
  • SECCOMP_RET_TRACE: notify an attached ptrace tracer
  • SECCOMP_RET_KILL_PROCESS: terminate the process immediately
  • SECCOMP_RET_USER_NOTIF: notify a userspace supervisor (added in Linux 5.0)

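The SECCOMP_RET_USER_NOTIF action is particularly interesting because it enables what’s called a supervisor model: a separate process receives a notification for each matching syscall, inspects it, and either performs the operation on the target’s behalf or returns an error. Container managers such as LXD use this to emulate calls like mknod and mount for unprivileged containers. The notification carries the raw argument values, so pointer arguments still refer to the sandboxed process’s memory. The supervisor has to read that memory itself (for example through /proc/<pid>/mem), and because the target can rewrite it afterwards, the kernel documentation cautions against relying on user notification alone to enforce security policy.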

Application Structure and Adoption Friction

The architectural difference between the two models shows up most clearly in how applications have to be structured to use them.

Capability mode requires what’s called a “capability-clean” initialization phase. The application opens all the file descriptors it will ever need, arranges for any forked helper processes to receive the right descriptors, and then calls cap_enter(). This is fundamentally a refactoring exercise. It is hard to retrofit Capsicum onto a program that opens files lazily or discovers its needed resources at runtime based on configuration; doing so usually means delegating directory descriptors for use with openat() or moving the late lookups into a helper process outside the sandbox. OpenSSH’s privilege separation predates Capsicum but maps naturally onto it: the monitor process holds privileged resources, and the unprivileged child runs in a Capsicum sandbox on FreeBSD and in a seccomp sandbox on Linux.

seccomp, by contrast, can often be applied with relatively little application restructuring. The common pattern is to determine the set of syscalls your process needs, generate a filter, and install it at an appropriate point. Tools like libseccomp abstract the raw BPF bytecode. Docker’s default seccomp profile blocks about 44 syscalls without any application awareness at all; the container just runs and the profile is applied by the container runtime.

This explains a significant portion of seccomp’s wider adoption. Chromium uses seccomp-bpf for its renderer and GPU process sandboxes on Linux. Firefox uses it for content processes. systemd uses it for service unit hardening via SystemCallFilter=. Most of these were added incrementally, without restructuring the application around a capability model.

Landlock: Linux Moving Closer to Capabilities

Landlock, merged into Linux 5.13 in June 2021, is worth mentioning here because it occupies an interesting middle position. It’s a filesystem access control mechanism that works through a ruleset of file descriptors. A process creates a landlock_ruleset_attr specifying which filesystem actions it wants to restrict, adds path-based rules using open file descriptors as anchors, and then applies the ruleset with landlock_restrict_self().

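#include <fcntl.h>
#include <linux/landlock.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc provides no Landlock wrappers, so the syscalls are invoked directly */
struct landlock_ruleset_attr ruleset_attr = {
    .handled_access_fs =
        LANDLOCK_ACCESS_FS_READ_FILE |
        LANDLOCK_ACCESS_FS_WRITE_FILE,
};
int ruleset_fd = syscall(SYS_landlock_create_ruleset, &ruleset_attr,
    sizeof(ruleset_attr), 0);

/* Allow reading files beneath /tmp; other handled access is denied */
struct landlock_path_beneath_attr path_beneath = {
    .allowed_access = LANDLOCK_ACCESS_FS_READ_FILE,
    .parent_fd = open("/tmp", O_PATH | O_CLOEXEC),
};
syscall(SYS_landlock_add_rule, ruleset_fd, LANDLOCK_RULE_PATH_BENEATH,
    &path_beneath, 0);
close(path_beneath.parent_fd);

/* no_new_privs is required unless the caller has CAP_SYS_ADMIN */
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
syscall(SYS_landlock_restrict_self, ruleset_fd, 0);
close(ruleset_fd);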

This is closer to Capsicum’s model for filesystem access. The rules are anchored to actual file descriptors, not path strings, which avoids the TOCTOU problem. The restriction is self-applied and cannot be relaxed. Where seccomp works at the syscall boundary, Landlock works at the VFS layer and can express path-based constraints that seccomp cannot. The two are composable: you can apply both a seccomp filter and a Landlock ruleset to the same process.
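The difference between anchoring to a descriptor and anchoring to a path string can be shown with plain POSIX calls, independent of Landlock or Capsicum. The sketch below is an illustration under assumed names (dirfd_survives_rename and the temporary directory pattern are hypothetical): it opens a directory, renames it, and then creates a file relative to the descriptor, which still refers to the same directory even though the original path no longer exists.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 0 if a directory descriptor keeps working after the directory
 * is renamed out from under its original path, -1 otherwise. */
int dirfd_survives_rename(void)
{
    char old_path[] = "/tmp/anchor-demo-XXXXXX";
    char new_path[sizeof(old_path) + 8];
    int result = -1;

    if (mkdtemp(old_path) == NULL)
        return -1;
    snprintf(new_path, sizeof(new_path), "%s.moved", old_path);

    int dir_fd = open(old_path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dir_fd < 0) {
        rmdir(old_path);
        return -1;
    }

    if (rename(old_path, new_path) != 0) {
        close(dir_fd);
        rmdir(old_path);
        return -1;
    }

    /* The path string is now stale, but the descriptor still names the
     * same directory object, so a lookup relative to it succeeds. */
    int fd = openat(dir_fd, "note.txt", O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd >= 0) {
        close(fd);
        if (access(old_path, F_OK) != 0 &&
            unlinkat(dir_fd, "note.txt", 0) == 0)
            result = 0;
    }

    close(dir_fd);
    rmdir(new_path);
    return result;
}
```

A check made against the string "/tmp/anchor-demo-…" would have been answered about a different object than the one eventually used; a check made against the descriptor cannot drift in that way.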

What Each Model Gets Right

The Capsicum model has a cleaner theoretical foundation. It makes the security boundary explicit in the structure of the program. The set of resources a sandboxed process can access is exactly the set of file descriptors it held before entering capability mode, restricted to the rights granted on each. There’s no ambiguity about what the process can reach.

seccomp’s strength is in the breadth of what it can restrict without application awareness. It can deny syscalls that have no file descriptor analogue at all, like ptrace, kexec_load, or perf_event_open. These are dangerous primarily because of their privilege-escalation potential rather than their access to specific resources, and descriptor rights alone don’t address them. Capsicum covers this ground differently: capability mode permits only a fixed, kernel-defined set of system calls that don’t touch global namespaces, and everything else fails with ECAPMODE. That list isn’t tunable per application, whereas a seccomp policy can be as narrow or as broad as each program warrants.

The practical outcome is that production sandboxes often combine approaches. Chrome on Linux uses seccomp-bpf for syscall filtering alongside a separate broker process that mediates file system access, approximating Capsicum’s capability model with more moving parts. The broker handles open file requests from the renderer, checks permissions, and passes back file descriptors, which is manually implementing the pre-delegation pattern that Capsicum builds directly into the kernel.

The Cost of Retrofitting

The adoption pattern for Capsicum is informative. It has been in FreeBSD for over a decade, has a clean API, and has well-documented examples in the base system. The projects that have adopted it, including OpenSSH, ISC BIND, and parts of the FreeBSD base system, are ones where developers could justify the architectural refactor. The projects that haven’t tend to be ones where lazy resource acquisition is too deeply embedded in the design.

This isn’t an argument against Capsicum’s model. It’s an argument that security mechanisms that require upfront architectural discipline will see slower adoption than ones that can be bolted on. seccomp’s deployment story on Linux is partly a story about the barrier to entry being low enough that container runtimes and system managers could apply it without touching application code.

For new code, particularly daemons and network-facing services where the resource access pattern is known at startup, the capability model is worth the discipline it requires. The explicit pre-delegation forces you to think about what resources the process actually needs, which is useful independent of the security benefit. For existing code at scale, seccomp plus Landlock gets you meaningful confinement without a rewrite. The two ecosystems have landed on different defaults, and both defaults are defensible given their constraints.
