eBPF as a Deployment Safety Net: What GitHub's Approach Reveals About the Technology
Source: lobsters
Most deployment safety tooling operates at the application layer: health checks, canary traffic routing, error rate thresholds. These are useful, but they only see what the application chooses to expose. GitHub’s engineering team recently published a description of how they moved one layer deeper, using eBPF to observe kernel-level behavior during rollouts. The specifics of their implementation are worth understanding, because they point at a broader pattern that any team running Linux infrastructure can adopt.
What eBPF Actually Does Here
eBPF (extended Berkeley Packet Filter) is a Linux kernel subsystem that lets you run sandboxed programs inside the kernel without modifying kernel source or loading kernel modules. Programs are written in a restricted C subset, compiled to BPF bytecode, then verified by the kernel’s BPF verifier before being loaded. The verifier enforces termination (kernels before 5.3 reject loops outright; later kernels accept only loops the verifier can prove are bounded), memory safety, and type correctness. If a program passes verification, it is typically JIT-compiled to native machine code and runs at near-native speed.
For deployment safety, the relevant program types are tracepoints and kprobes. Tracepoints are stable, explicitly defined hooks in the kernel source; kprobes are dynamic hooks that can attach to almost any kernel function. GitHub’s use case centers on syscall tracing: when a process running the new binary calls openat, execve, connect, socket, or any other syscall, an eBPF program fires, records the event, and sends it to userspace via a ring buffer.
The ring buffer (BPF_MAP_TYPE_RINGBUF, introduced in Linux 5.8) is the modern replacement for perf event arrays. It supports variable-length records, avoids per-CPU complexity, and provides ordering guarantees that matter when you are reconstructing a sequence of syscalls. Userspace polls the ring buffer using epoll or a busy-poll loop, depending on latency requirements.
Building a Syscall Profile
The core idea in GitHub’s approach is behavioral fingerprinting. Before a deployment, you establish what syscalls the existing binary makes under normal load: which file paths it opens, which network addresses it connects to, which subprocesses it spawns. During and after a deployment, you compare the new binary’s observed behavior against that baseline.
A minimal eBPF program to trace openat syscalls looks like this:
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Record sent to userspace for each openat call. */
struct event {
    u32 pid;
    char filename[256];
};

/* Ring buffer shared with userspace; max_entries is its size in bytes. */
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_openat")
int trace_openat(struct trace_event_raw_sys_enter *ctx)
{
    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0; /* buffer full: drop the event */

    /* Upper 32 bits hold the tgid, i.e. the userspace process ID. */
    e->pid = bpf_get_current_pid_tgid() >> 32;
    /* args[1] is the userspace pointer to the pathname. */
    bpf_probe_read_user_str(e->filename, sizeof(e->filename),
                            (const char *)ctx->args[1]);
    bpf_ringbuf_submit(e, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
This attaches to the stable sys_enter_openat tracepoint. The vmlinux.h header is generated from the running kernel’s BTF (BPF Type Format) data, which is what enables CO-RE (Compile Once, Run Everywhere): the program references kernel structs symbolically, and the BPF loader patches field offsets at load time to match the running kernel. This means the same binary works across kernel versions without recompilation, which matters at GitHub’s scale where fleet heterogeneity is unavoidable.
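On the userspace side, each record submitted by the program above arrives as a raw byte buffer that has to be decoded back into the event struct. The following is a minimal sketch of such a handler, not GitHub’s implementation. The callback has the shape that libbpf’s ring_buffer__new() expects; the open_stats accumulator and the name handle_open_event are hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Must match the layout of struct event in the BPF program. */
struct event {
    uint32_t pid;
    char filename[256];
};

/* Hypothetical accumulator: counts events and remembers the last one seen. */
struct open_stats {
    unsigned long total;
    uint32_t last_pid;
    char last_filename[256];
};

/*
 * Ring buffer callback: int (*)(void *ctx, void *data, size_t size).
 * Returning 0 keeps consumption going; a negative value stops polling.
 */
int handle_open_event(void *ctx, void *data, size_t size)
{
    struct open_stats *stats = ctx;
    const struct event *e = data;

    if (size < sizeof(*e))
        return 0; /* short record: skip it rather than read out of bounds */

    stats->total++;
    stats->last_pid = e->pid;
    memcpy(stats->last_filename, e->filename, sizeof(stats->last_filename));
    /* Force termination instead of trusting the producer. */
    stats->last_filename[sizeof(stats->last_filename) - 1] = '\0';
    return 0;
}
```

With libbpf, a handler like this would be registered through ring_buffer__new(map_fd, handle_open_event, &stats, NULL) and driven by repeated calls to ring_buffer__poll(). A real collector would feed each event into a per-service profile instead of keeping only the last one.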
For the Go side of the stack, GitHub almost certainly uses cilium/ebpf, a pure Go library that handles ELF loading, map creation, program attachment, and ring buffer consumption without cgo dependencies. The library generates Go bindings from compiled BPF objects using bpf2go, embedding the BPF bytecode directly into the Go binary.
Why This Beats Seccomp for This Use Case
Seccomp is the obvious comparison. It also operates at the syscall boundary and is well-understood. But seccomp is a static enforcement mechanism: you define an allowlist (or denylist) at process start, and the kernel enforces it for the lifetime of the process. It answers the question “is this syscall permitted?” not “is this pattern of syscalls expected?”
For deployment safety, the interesting question is the second one. A new version of a service might legitimately call openat on a configuration file, but if it suddenly starts reading /etc/shadow or spawning unexpected child processes, that is a signal worth catching before 100% of traffic moves over. Seccomp would only catch it if your policy explicitly denies those calls, which requires you to know in advance what “unexpected” looks like. eBPF-based tracing lets you discover what unexpected looks like by observing it.
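There is also the matter of blocking versus observing. Seccomp is built to enforce: strict mode, or a filter whose action is to kill, terminates the process on a policy violation. Filter mode can instead log a violation (SECCOMP_RET_LOG, Linux 4.14+), but a seccomp filter sees only the syscall number and raw argument values and cannot dereference pointers, so it cannot tell you which path was opened. During a canary deployment, killing the process on the first anomaly is often not what you want; you want to collect data, assess whether the anomaly is benign, and make an informed decision about whether to continue the rollout. eBPF gives you that observation-first model.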
The Falco project takes a similar approach for security monitoring, maintaining a ruleset of suspicious syscall patterns and alerting on them in real time. GitHub’s use is more narrowly scoped to deployment verification rather than continuous runtime security, but the underlying mechanics are the same.
The Verification Pipeline
Tracing syscalls is only part of the system. The other part is turning a stream of events into a deployment decision. At a high level, the pipeline looks like:
- During a baseline period (old binary under production load), an eBPF collector builds a profile: sets of observed syscalls, file paths, network destinations, and subprocess names, aggregated per service.
- When a deployment begins (typically a canary phase with a small percentage of traffic), a new collector runs against the new binary.
- A comparison layer diffs the two profiles, weighted by frequency. Frequent syscalls in the baseline that are missing in the canary, or syscalls that show up repeatedly in the canary but were absent in the baseline, are surfaced as anomalies.
- A rollout controller consumes those anomalies and either auto-pauses the deployment, pages an on-call engineer, or proceeds based on configured thresholds.
The hardest part of this pipeline is not the eBPF program itself but the signal-to-noise problem. Production services are noisy. They fork threads, open files transiently, make one-off network calls. A naive diff would flag dozens of differences on every deployment. GitHub’s solution, inferred from the article, involves frequency weighting and stability filtering: an event that occurred only once in the baseline and once in the canary is not interesting; an event that occurred ten thousand times in the baseline and zero times in the canary is.
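The frequency-weighting idea can be sketched in a few lines. This is an illustration under stated assumptions, not the method from GitHub’s article: profiles are modelled as flat arrays of key and count pairs, the key format is invented, and an event is flagged only when it accounts for at least min_share of all events in one profile and never appears in the other. All type and function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

struct profile_entry {
    const char *key;     /* hypothetical format, e.g. "openat:/etc/app.conf" */
    unsigned long count; /* times the event was observed */
};

enum anomaly_kind { ANOMALY_MISSING, ANOMALY_NEW };

struct anomaly {
    const char *key;
    enum anomaly_kind kind;
};

static unsigned long lookup(const struct profile_entry *p, size_t n,
                            const char *key)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(p[i].key, key) == 0)
            return p[i].count;
    return 0;
}

static unsigned long total(const struct profile_entry *p, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i].count;
    return sum;
}

/* Collect anomalies from one direction: frequent in `from`, absent in `to`. */
static size_t one_way(const struct profile_entry *from, size_t nfrom,
                      const struct profile_entry *to, size_t nto,
                      double min_share, enum anomaly_kind kind,
                      struct anomaly *out, size_t found, size_t max_out)
{
    unsigned long sum = total(from, nfrom);
    for (size_t i = 0; i < nfrom && found < max_out; i++) {
        /* Compare shares, not raw counts: the canary sees less traffic. */
        double share = sum ? (double)from[i].count / (double)sum : 0.0;
        if (share >= min_share && lookup(to, nto, from[i].key) == 0)
            out[found++] = (struct anomaly){ from[i].key, kind };
    }
    return found;
}

/* Returns the number of anomalies written to out (at most max_out). */
size_t diff_profiles(const struct profile_entry *base, size_t nbase,
                     const struct profile_entry *canary, size_t ncanary,
                     double min_share, struct anomaly *out, size_t max_out)
{
    size_t found = one_way(base, nbase, canary, ncanary, min_share,
                           ANOMALY_MISSING, out, 0, max_out);
    return one_way(canary, ncanary, base, nbase, min_share,
                   ANOMALY_NEW, out, found, max_out);
}
```

Comparing shares rather than raw counts matters because a canary handles only a fraction of production traffic. One-off events fall below the threshold on both sides and are never reported, which is the stability filtering described above in its simplest form. A fuller version would presumably also flag large shifts in frequency, not only outright presence or absence.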
This kind of profiling also benefits from understanding process trees. eBPF can track clone and execve to build a tree of which processes spawned which children, allowing the system to scope events to a specific service’s process group rather than the entire host. The bpf_get_current_task() helper gives access to the kernel’s task struct, from which you can walk the process tree.
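Scoping can also be done in userspace once fork and clone events have been collected. The sketch below assumes the collector has already recorded child and parent PID pairs in a small table; the proc_table type, its fixed capacity, and the function names are hypothetical. An in-kernel variant would instead read the parent from the task struct, for example with libbpf’s BPF_CORE_READ macro.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 1024

/* Hypothetical table of observed fork/clone events. */
struct proc_table {
    uint32_t pid[MAX_TRACKED];
    uint32_t ppid[MAX_TRACKED];
    size_t len;
};

/* Record that `parent` created `child`. Returns false if the table is full. */
bool proc_table_add(struct proc_table *t, uint32_t child, uint32_t parent)
{
    if (t->len == MAX_TRACKED)
        return false;
    t->pid[t->len] = child;
    t->ppid[t->len] = parent;
    t->len++;
    return true;
}

static bool parent_of(const struct proc_table *t, uint32_t pid,
                      uint32_t *parent)
{
    for (size_t i = 0; i < t->len; i++) {
        if (t->pid[i] == pid) {
            *parent = t->ppid[i];
            return true;
        }
    }
    return false;
}

/* True if `pid` is `root` itself or a recorded descendant of `root`. */
bool in_service_tree(const struct proc_table *t, uint32_t pid, uint32_t root)
{
    uint32_t cur = pid;

    /* Bound the walk by the table size so a cyclic table cannot loop forever. */
    for (size_t depth = 0; depth <= t->len; depth++) {
        uint32_t next;

        if (cur == root)
            return true;
        if (!parent_of(t, cur, &next))
            return false;
        cur = next;
    }
    return false;
}
```

Events whose PID fails in_service_tree() for the service’s root process can be dropped before they reach the profile. This sketch ignores PID reuse and process exit, both of which a long-running collector would have to handle.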
Performance Overhead
The question anyone running production infrastructure asks is: what does this cost? BPF programs run in kernel context, which means every syscall that triggers a tracepoint adds some overhead. For a service making millions of syscalls per second, even a few hundred nanoseconds per event adds up.
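To make “adds up” concrete, here is the arithmetic for one illustrative case. The figures are assumptions chosen for round numbers, not measurements: 2 million traced syscalls per second on a host, and 200 nanoseconds of probe cost per event.

```latex
\[
2 \times 10^{6}\ \tfrac{\text{events}}{\text{s}}
\times 200\ \tfrac{\text{ns}}{\text{event}}
= 4 \times 10^{8}\ \tfrac{\text{ns}}{\text{s}}
= 0.4\ \text{CPU-seconds per second}
\]
```

Under those assumptions the probes alone consume about 40 percent of one core, which is why unbounded tracing of hot syscalls is rarely acceptable in production.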
In practice, GitHub’s approach likely applies rate limiting or sampling at the BPF layer. The bpf_ktime_get_ns() helper lets you implement token-bucket rate limiting inside the BPF program itself, skipping the ringbuf write if the bucket is empty. This keeps overhead bounded regardless of syscall frequency. Sampling one in N events is another option; for frequency-based profiling, you do not need every event, just a statistically representative sample.
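The token-bucket logic itself is small. The sketch below is written as plain C with the timestamp passed in, so it can be exercised outside the kernel; in a BPF program the timestamp would come from bpf_ktime_get_ns() and the bucket state would live in a BPF map, where concurrent updates from several CPUs need extra care. The struct layout and function name are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

struct token_bucket {
    uint64_t capacity;       /* maximum burst, in events */
    uint64_t tokens;         /* events currently allowed */
    uint64_t refill_per_sec; /* sustained rate; must be non-zero */
    uint64_t last_ns;        /* time up to which tokens have been credited */
};

/* Returns true if the event may be emitted, false if it should be skipped. */
bool token_bucket_allow(struct token_bucket *tb, uint64_t now_ns)
{
    uint64_t elapsed = now_ns - tb->last_ns;
    /* Time needed to fill an empty bucket; waiting longer adds nothing. */
    uint64_t full_ns = tb->capacity * NSEC_PER_SEC / tb->refill_per_sec;

    if (elapsed >= full_ns) {
        tb->tokens = tb->capacity;
        tb->last_ns = now_ns;
    } else {
        uint64_t refill = elapsed * tb->refill_per_sec / NSEC_PER_SEC;
        if (refill > 0) {
            tb->tokens += refill;
            if (tb->tokens > tb->capacity)
                tb->tokens = tb->capacity;
            /* Advance only by the time that produced whole tokens. */
            tb->last_ns += refill * NSEC_PER_SEC / tb->refill_per_sec;
        }
    }

    if (tb->tokens == 0)
        return false;
    tb->tokens--;
    return true;
}
```

The caller would skip the ring buffer reservation whenever this returns false. Capping the elapsed time at full_ns also keeps the multiplication from overflowing after a long idle period.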
Published benchmarks from the Linux kernel BPF maintainers put tracepoint overhead at roughly 100 to 300 nanoseconds per event on modern hardware, depending on the probe site and the program complexity. The openat syscall without BPF tracing takes on the order of 500 nanoseconds to a few microseconds, so by those figures a probe adds anywhere from a few percent on a slow call to more than half of the call’s own cost when a heavy probe meets a fast call. For most services, which spend the bulk of their time outside the traced syscalls, the absolute cost is still small.
Broader Implications
What GitHub has built is essentially a runtime behavioral contract for their services. The contract is not manually written; it is learned from observation. This sidesteps the maintenance burden of seccomp profiles, AppArmor policies, or other static security configurations, which tend to drift out of sync with the actual application over time.
The interesting extension of this pattern is using it not just for deployment gating but for continuous drift detection. If a service’s syscall profile changes significantly between Monday and Friday without a corresponding deployment, that is worth investigating. It could be a gradual configuration change, a dependency update, or a compromised process. The same eBPF infrastructure that powers deployment safety can power this broader observability.
Projects like Tetragon (Cilium’s runtime security component) and Pixie (originally from New Relic, now a CNCF project) explore similar territory, combining syscall tracing with network observability to give platform teams a full picture of service behavior without requiring application instrumentation. GitHub’s deployment-specific use case is a more focused slice of what this technology enables.
The kernel support required has all been available upstream since 2020: CO-RE requires BTF support (kernel 5.2+), the LSM BPF hooks that would allow enforcement rather than just observation require kernel 5.7+, and ring buffers require kernel 5.8+. For a fleet running Ubuntu 22.04 or later, or RHEL 9+, all of these are available without patching.
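A collector can run a rough preflight check against those version thresholds before loading anything. The sketch below parses a kernel release string, such as the one uname(2) returns, and maps it to the three features using the version numbers quoted above. The type and function names are hypothetical, and a version comparison is only a heuristic: enterprise kernels backport features, and BTF also depends on how the kernel was built.

```c
#include <stdbool.h>
#include <stdio.h>

struct kver {
    int major;
    int minor;
};

/* Parse the leading "major.minor" of a string such as "5.15.0-91-generic". */
bool parse_kver(const char *release, struct kver *out)
{
    return sscanf(release, "%d.%d", &out->major, &out->minor) == 2;
}

bool kver_at_least(struct kver v, int major, int minor)
{
    return v.major > major || (v.major == major && v.minor >= minor);
}

struct ebpf_features {
    bool btf;     /* needed for CO-RE */
    bool lsm;     /* LSM BPF hooks, for enforcement */
    bool ringbuf; /* BPF_MAP_TYPE_RINGBUF */
};

/* Thresholds are the upstream versions cited in the text. */
struct ebpf_features features_for(struct kver v)
{
    struct ebpf_features f = {
        .btf = kver_at_least(v, 5, 2),
        .lsm = kver_at_least(v, 5, 7),
        .ringbuf = kver_at_least(v, 5, 8),
    };
    return f;
}
```

A more direct probe for BTF is to check whether /sys/kernel/btf/vmlinux exists on the host, and the most reliable test for any feature is simply to attempt the load and handle the failure.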
The tooling to build this kind of system has matured considerably. libbpf provides the C-side skeleton generation; cilium/ebpf handles the Go side; bpftool lets you inspect loaded programs and maps at runtime; and bpftrace gives you a scripting layer for quick exploration before you commit to a compiled program. The barrier to experimenting with eBPF-based deployment observability is lower now than it has ever been, and GitHub’s writeup is a useful reference point for what a production implementation looks like.