
The Pointer seccomp Can't Read: Capsicum, Syscall Filters, and the Capability Gap Linux Is Still Closing

Source: lobsters

When you sandbox a process, you are making a decision about which layer of the kernel to enforce policy at. That choice shapes everything: what you can express, what the policy can and cannot inspect, and where the model eventually breaks down under pressure from new kernel features. Capsicum and seccomp-BPF represent two different answers to that question, and the differences are not cosmetic.

seccomp: Filtering at Syscall Dispatch

seccomp-BPF, added in Linux 3.5 (2012) by Will Drewry, sits at the point where a process enters the kernel for a system call. Before the syscall dispatcher runs, the kernel evaluates a classic BPF program against a seccomp_data struct:

struct seccomp_data {
    int   nr;                   // syscall number
    __u32 arch;                 // AUDIT_ARCH_X86_64, etc.
    __u64 instruction_pointer;
    __u64 args[6];              // raw register values
};

The filter returns a verdict: allow, kill, trap, return an error, or notify a supervisor. For common use, you either write a BPF program by hand using sock_filter arrays, or use libseccomp to generate one:

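#include <seccomp.h>

// Default action: kill the process on any syscall not explicitly allowed
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
seccomp_load(ctx);     // installs the filter; negative errno on failure
seccomp_release(ctx);  // frees the context; the loaded filter stays active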

This works, and it is used everywhere: Chrome, Firefox, Android (since Oreo), Docker’s default container profile, systemd’s SystemCallFilter= in unit files. The adoption is broad because Linux is broad.

But the model has a ceiling that is built into its design.

The Pointer the Filter Cannot Read

The args[6] array in seccomp_data contains raw register values at syscall entry. For scalar arguments, like flags or file descriptors, this is sufficient. For pointer arguments, it is not.

When a process calls open("/etc/passwd", O_RDONLY), the first argument is a pointer to the string "/etc/passwd" in userspace memory. The seccomp BPF program sees that argument as an integer, the virtual address of the string. It cannot dereference that pointer. Classic BPF has no instruction for reading arbitrary memory, and the kernel rejects any seccomp filter whose loads fall outside seccomp_data. Even if the filter could read the string, the check would be racy: another thread could rewrite the path after the filter ran and before the kernel copied it in.
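A small sketch makes the point concrete. The helper name as_syscall_arg is hypothetical; it stands in for what ends up in args[] for a pointer argument, which is only the address. Two buffers holding the same path produce two different values, so nothing the filter compares against can identify the path.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: for a pointer argument, the value the filter
 * sees in seccomp_data.args[n] is the address widened to 64 bits. */
static uint64_t as_syscall_arg(const char *path)
{
    return (uint64_t)(uintptr_t)path;
}

/* Returns 1 when two buffers hold the same path but would appear to a
 * seccomp filter as two different argument values. */
static int same_path_different_arg(void)
{
    char a[] = "/etc/passwd";
    char b[] = "/etc/passwd";

    assert(sizeof(a) == sizeof(b));   /* identical contents, same length */
    return strcmp(a, b) == 0 && as_syscall_arg(a) != as_syscall_arg(b);
}
```

Calling same_path_different_arg() returns 1: the strings compare equal, the addresses do not.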

This means seccomp cannot restrict filesystem access based on path. It can block the open syscall entirely, or allow it entirely. It can check that a particular flag is absent from the flags argument, because flags are scalar. But it cannot say “allow open only for paths under /var/run/myapp.”

The same limitation applies to connect(). The sockaddr structure is behind a pointer. seccomp cannot inspect the IP address or port a process is trying to connect to. It can block connect entirely, or allow it entirely.

This is not a fixable implementation detail. It is a consequence of where in the kernel the enforcement happens: at the syscall dispatch boundary, before the kernel has copied in the arguments or resolved them to the objects the syscall will act on.

Capsicum: Object Capabilities

Capsicum, designed by Robert Watson, Jonathan Anderson, Ben Laurie, and Kris Kennaway and published at USENIX Security 2010, takes a different position. Instead of filtering which syscalls a process can call, Capsicum restricts what kernel objects the process can access and what operations it can perform on each.

The model builds on a property POSIX file descriptors already have: they are unforgeable references. A process cannot manufacture an fd pointing to /etc/passwd out of thin air. It can only get one from the kernel, which means the kernel controls whether that reference exists at all. Capsicum extends this: a process can enter capability mode, and from that point the kernel strips all ambient authority:

// Pre-open everything the process will need
int data_fd = open("/var/db/myapp/data", O_RDONLY);
int log_fd  = open("/var/log/myapp.log", O_WRONLY | O_APPEND);

// Restrict rights on each fd
cap_rights_t r;
cap_rights_init(&r, CAP_READ, CAP_SEEK, CAP_FSTAT);
cap_rights_limit(data_fd, &r);

cap_rights_init(&r, CAP_WRITE);
cap_rights_limit(log_fd, &r);

// No going back
cap_enter();

// open("/etc/passwd", O_RDONLY) now fails with ECAPMODE
// write(data_fd, ...) now fails with ENOTCAPABLE (data_fd lacks CAP_WRITE)

After cap_enter(), path-based open() fails because opening by path requires access to the global filesystem namespace. The process can still open paths relative to a directory fd it already holds, using openat(), provided that directory fd carries the appropriate rights and the lookup stays beneath that directory. All access flows through previously established, explicitly rights-limited references.
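The directory-relative pattern uses only POSIX calls, so it can be sketched portably. The function names read_beneath and demo_roundtrip and the /tmp path are illustrative assumptions. Outside capability mode, plain openat() does not by itself stop a name like "../x" from escaping the directory; that confinement is what capability mode adds.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Reads up to len-1 bytes of `name`, resolved relative to dir_fd, and
 * NUL-terminates the result. Returns the byte count or -1. */
static ssize_t read_beneath(int dir_fd, const char *name, char *buf, size_t len)
{
    assert(len > 0);
    int fd = openat(dir_fd, name, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, len - 1);
    close(fd);
    if (n >= 0)
        buf[n] = '\0';
    return n;
}

/* Creates a scratch directory, writes a file through the directory fd,
 * and reads it back the same way. Returns 1 on success, 0 or -1 otherwise. */
static int demo_roundtrip(void)
{
    char dir[] = "/tmp/openat-demo-XXXXXX";
    char buf[32];

    if (mkdtemp(dir) == NULL)
        return -1;
    int dir_fd = open(dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dir_fd < 0)
        return -1;

    /* From here on, no absolute path is needed: only dir_fd and a name. */
    int fd = openat(dir_fd, "state", O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0 || write(fd, "ready", 5) != 5)
        return -1;
    close(fd);

    ssize_t n = read_beneath(dir_fd, "state", buf, sizeof(buf));

    unlinkat(dir_fd, "state", 0);
    close(dir_fd);
    rmdir(dir);
    return n == 5 && strcmp(buf, "ready") == 0;
}
```

In a Capsicum program, dir_fd would be opened and rights-limited before cap_enter(), and read_beneath() would keep working afterwards because it never names anything in the global namespace.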

The rights model is also richer than a simple allow/deny. cap_rights_t encodes distinctions like CAP_MMAP_R vs CAP_MMAP_W vs CAP_MMAP_X, and companion calls narrow things further: cap_ioctls_limit() restricts a descriptor to a specific set of ioctl commands, and cap_fcntls_limit() does the same for fcntl. Rights are monotonically decreasing: a process can call cap_rights_limit() to further restrict an fd, but never to expand it.
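The monotonic rule is a subset check. The following is a toy model of it, not FreeBSD's implementation: the real cap_rights_t is an opaque structure rather than a single word, and every name here (toy_fd, toy_rights_limit, the R_* flags) is invented for illustration. A request succeeds only if it asks for nothing the descriptor does not already have.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a few Capsicum rights. */
enum {
    R_READ  = 1u << 0,
    R_WRITE = 1u << 1,
    R_SEEK  = 1u << 2,
    R_FSTAT = 1u << 3
};

/* Toy descriptor: just the set of rights it currently carries. */
struct toy_fd {
    uint32_t rights;
};

/* Models the documented behaviour of cap_rights_limit(): narrowing
 * succeeds (returns 0); asking for a right the descriptor lacks is
 * refused (returns -1, where the real call fails with ENOTCAPABLE)
 * and leaves the descriptor unchanged. */
static int toy_rights_limit(struct toy_fd *fd, uint32_t wanted)
{
    assert(fd != NULL);
    if ((wanted & ~fd->rights) != 0)
        return -1;          /* would expand the set */
    fd->rights = wanted;    /* wanted is a subset, so this only shrinks */
    return 0;
}
```

Starting from R_READ | R_WRITE | R_SEEK, limiting to R_READ | R_SEEK succeeds, and a later attempt to get R_WRITE back is refused.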

For process management, Capsicum provides pdfork(), which returns a process descriptor fd instead of a raw PID. PIDs are global names; holding a process descriptor and using pdkill() does not require access to the PID namespace. This removes the last common ambient authority foothold for process control.

io_uring: seccomp’s Latest Structural Problem

The mismatch between seccomp’s enforcement point and where actual I/O policy needs to live has gotten more visible with io_uring, added in Linux 5.1. io_uring lets a process submit I/O operations through a shared ring buffer without making individual syscalls for each operation. The process calls io_uring_enter() once, and the kernel processes a batch of reads, writes, and socket operations from the ring.

From seccomp’s perspective, the process called io_uring_enter(). That is the one syscall it sees. The actual I/O operations that follow, potentially opening files and sending data, happen inside the kernel, outside the syscall dispatch boundary where seccomp’s filter runs.

Docker’s default seccomp profile now blocks the io_uring syscalls, in part for this reason. Container security profiles that allow io_uring have to accept that the filter cannot meaningfully inspect what operations will be submitted through the ring. Capsicum’s model has no equivalent weakness, because its enforcement is at the object layer. FreeBSD does not ship io_uring, but the same reasoning applies to any batched or asynchronous interface: if a process does not hold an fd with CAP_READ rights, no read submitted against that fd can succeed, however it is submitted.

Why seccomp Won the Platform War Anyway

Capsicum ships in FreeBSD and is used throughout the FreeBSD base system. tcpdump, dhclient, unbound, and OpenSSH’s privilege-separated child all call cap_enter(). Chromium uses Capsicum for its renderer sandbox on FreeBSD.

There was a Linux port, capsicum-linux, maintained at Google by David Drysdale. It was never merged into mainline Linux, and development stalled around 2015. The reasons were a combination of review bandwidth, the complexity of retrofitting the rights model onto the existing Linux VFS and socket layers, and the fact that seccomp already existed and was good enough for most stated requirements.

“Good enough” is doing a lot of work in that sentence. seccomp’s limitations are real, but they are tolerable for the most common sandboxing use cases: blocking dangerous syscalls entirely, restricting a renderer to read/write/mmap, preventing a service from calling ptrace or kexec_load. Chrome’s seccomp policy is around 1,600 lines of C++ that generate BPF programs, which gives a sense of how much complexity gets absorbed trying to express something more fine-grained. But it ships and it works.

Moving an application to Capsicum also requires non-trivial refactoring. The “pre-open everything, then enter capability mode” pattern is easy to describe and hard to retrofit into code that interleaves I/O setup with I/O operation. FreeBSD’s libcasper provides a service framework that lets capability-mode processes delegate specific privileged operations (DNS lookups, syslog writes, password database reads) to a privileged helper over a capability-safe socket pair, which substantially reduces the porting burden. But it is still more work than adding SystemCallFilter=@system-service to a systemd unit file.

Landlock: Linux Moving Toward the Object Layer

Landlock, developed by Mickaël Salaün and merged in Linux 5.13 (2021), represents the clearest acknowledgment that seccomp’s enforcement layer is insufficient for filesystem policy. Landlock operates at the VFS layer, not at syscall dispatch. It enforces filesystem access rules on the objects being accessed, not on the syscall that happens to access them:

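struct landlock_ruleset_attr attr = {
    // Every access type this ruleset governs. Anything listed here is
    // denied unless a rule grants it, and rules may only grant listed types.
    .handled_access_fs =
        LANDLOCK_ACCESS_FS_READ_FILE  |
        LANDLOCK_ACCESS_FS_READ_DIR   |
        LANDLOCK_ACCESS_FS_WRITE_FILE |
        LANDLOCK_ACCESS_FS_EXECUTE,
};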
int ruleset_fd = syscall(SYS_landlock_create_ruleset,
                          &attr, sizeof(attr), 0);

struct landlock_path_beneath_attr path_attr = {
    .allowed_access = LANDLOCK_ACCESS_FS_READ_FILE |
                      LANDLOCK_ACCESS_FS_READ_DIR,
    .parent_fd = open("/etc", O_PATH | O_CLOEXEC),
};
syscall(SYS_landlock_add_rule, ruleset_fd,
        LANDLOCK_RULE_PATH_BENEATH, &path_attr, 0);

prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
syscall(SYS_landlock_restrict_self, ruleset_fd, 0);

This solves the pointer problem for filesystem access because Landlock is not inspecting args[0] of an open() call. It is a Linux Security Module whose hooks run inside the VFS, and it evaluates access once the kernel’s path resolution has produced the file being opened, regardless of which syscall triggered that resolution. An open submitted through io_uring runs through the same VFS code and gets the same check. The check is made when the file is opened, so descriptors obtained before the ruleset was enforced keep working.

Landlock is also evolving quickly. TCP bind and connect restrictions appeared in ABI v4 (Linux 6.7), ioctl restrictions on device files in v5 (Linux 6.10), and scoping of signals and abstract UNIX sockets in v6 (Linux 6.12). The scope is narrower than Capsicum’s full-object capability model, and Landlock does not eliminate ambient authority the way cap_enter() does. But the direction is clear: Linux is incrementally building the enforcement mechanisms that Capsicum demonstrated were necessary fifteen years ago.
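Because features arrive per ABI version, programs usually probe the running kernel before building a ruleset. A minimal sketch of that probe follows. The helper names are hypothetical, and the fallback definitions are assumptions for systems whose headers predate Landlock (444 is the syscall number on x86-64 and arm64).

```c
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__has_include)
#  if __has_include(<linux/landlock.h>)
#    include <linux/landlock.h>
#  endif
#endif

/* Fallbacks for older headers (assumed values, see lead-in). */
#ifndef SYS_landlock_create_ruleset
#define SYS_landlock_create_ruleset 444
#endif
#ifndef LANDLOCK_CREATE_RULESET_VERSION
#define LANDLOCK_CREATE_RULESET_VERSION (1U << 0)
#endif

/* Returns the Landlock ABI version of the running kernel (>= 1), or 0
 * if Landlock is missing, disabled, or blocked by an outer sandbox. */
static int landlock_abi_version(void)
{
    long v = syscall(SYS_landlock_create_ruleset, NULL, 0,
                     LANDLOCK_CREATE_RULESET_VERSION);
    return v < 0 ? 0 : (int)v;
}

/* Filesystem rules exist in every ABI version. */
static int abi_has_fs_rules(int abi)
{
    assert(abi >= 0);
    return abi >= 1;
}

/* TCP bind/connect rules arrived with ABI v4 (Linux 6.7). */
static int abi_has_tcp_rules(int abi)
{
    assert(abi >= 0);
    return abi >= 4;
}
```

A caller would typically compute the ABI version once, then mask out of its ruleset any access types the kernel does not know about, so the same binary degrades gracefully on older kernels instead of failing outright.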

The source article by Vivian Voss covers the practical comparison well, particularly for developers who want to understand the mechanics before choosing an approach. What the comparison ultimately shows is that seccomp is a policy tool shaped by where it was easy to insert enforcement, and Capsicum is a policy tool shaped by where enforcement should actually live. For most Linux deployments today, the right answer involves using both: seccomp for syscall-level restriction of dangerous operations, combined with Landlock for filesystem path policy. The combination approximates what Capsicum provides by default, at the cost of managing two separate policy systems with different mental models.

FreeBSD got this right in one abstraction. Linux is catching up, one kernel version at a time.
