eBPF as a Deployment Safety Net: What GitHub Built and Why It Matters

GitHub ships code many times per day. That cadence is only possible if you trust your deployment pipeline to catch problems early, and trust is the hard part. Static analysis, test suites, and code review all help, but they operate on code that hasn’t yet run against real traffic. The gap between passing CI and “this is safe to roll out to everyone” is where production incidents live.

GitHub’s engineering blog recently described how they filled that gap using eBPF, the Linux kernel’s programmable tracing and networking subsystem. The short version: they attach eBPF programs to running Ruby processes during canary deployments, collect behavioral signals at the kernel level, and use those signals to make automated rollout or rollback decisions. The longer version is more interesting, because this approach involves tradeoffs that aren’t obvious until you’ve thought carefully about what alternatives exist.

What eBPF Actually Is, Briefly

eBPF started as “extended Berkeley Packet Filter,” a mechanism for filtering network packets in the kernel without writing kernel modules. The name has outlived its original scope. Today eBPF is a general-purpose in-kernel virtual machine that lets you run sandboxed programs in response to kernel events: system calls, function calls, network packets, hardware performance counters, and more.

The key property is safety. Before any eBPF program runs, the kernel’s verifier statically analyzes it to prove it will terminate, won’t access invalid memory, and won’t crash the kernel. Programs are JIT-compiled to native code and run with minimal overhead. This is what makes eBPF viable for production observability: you get kernel-level visibility without the risk of loading an actual kernel module.

eBPF programs communicate with userspace through maps, which are typed key-value stores shared between the kernel program and userspace readers. A tracing program might increment a counter in a map every time a particular function is called; a userspace daemon reads those counters and ships them to a metrics backend.
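As a rough sketch of the userspace half of that arrangement, the snippet below turns two successive snapshots of a cumulative counter map into per-interval deltas. It assumes the map has already been read into a plain dict keyed by PID; the function name `counter_deltas` and the snapshot shape are illustrative, not part of any eBPF library.

```python
from typing import Dict


def counter_deltas(previous: Dict[int, int], current: Dict[int, int]) -> Dict[int, int]:
    """Return how much each cumulative counter grew between two snapshots.

    Keys are hypothetical PIDs; values are cumulative event counts, as a
    kernel-side program would store them in a hash map.
    """
    deltas = {}
    for key, value in current.items():
        before = previous.get(key, 0)
        # A value lower than the previous one means the entry was reset
        # (for example, the PID was reused), so count from zero again.
        deltas[key] = value - before if value >= before else value
    return deltas


snapshot_a = {101: 5, 102: 0}
snapshot_b = {101: 9, 102: 0, 103: 2}
print(counter_deltas(snapshot_a, snapshot_b))
```

Shipping deltas rather than raw totals is one reasonable choice here: it keeps the metrics backend insulated from counter resets when processes restart.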

The Problem with Existing Instrumentation

GitHub’s application stack is primarily Ruby on Rails. Ruby has mature APM tooling: Datadog’s Ruby agent, Scout APM, and several others. These tools work by monkey-patching Ruby classes and methods at load time, wrapping them with timing and error-capturing logic. They’re effective for steady-state observability.

For deployment safety, though, they have a structural limitation: they’re part of the application. The instrumentation ships with the code. If a bug affects the instrumentation itself, or if the startup sequence changes, or if a library the agent patches gets replaced, you lose visibility precisely when you need it most. You’re also paying the overhead of that instrumentation permanently, which influences capacity planning in ways that can mask the true cost of new code.

There’s a subtler problem too. Traditional APM agents observe what the application explicitly exposes: HTTP request durations, database query times, custom metrics developers remembered to add. They don’t observe what the kernel sees: every read() and write() call, every allocation that tips into swap, every signal delivered, every file descriptor leak. The kernel’s view is comprehensive in a way that any userspace agent’s view isn’t.

Uprobes: Attaching to Userspace Functions from the Kernel

The eBPF mechanism GitHub uses for this is called uprobes, short for userspace probes. A uprobe instruments a function in a userspace binary: the kernel replaces the instruction at the function’s entry point with a breakpoint (a single-byte int3 on x86-64), and when that breakpoint fires, the associated eBPF program runs in kernel context before execution resumes in the original function.

For a compiled language like C or Go, identifying functions to probe is straightforward because the binary has stable symbol addresses. Ruby is more complicated because it’s an interpreted language. The Ruby VM is a C program, and the eBPF probes attach to C-level functions inside the Ruby interpreter binary rather than to Ruby methods directly. GitHub probes functions like rb_longjmp, which is called whenever Ruby raises an exception, and various VM entry points that correspond to executing Ruby method calls.

This is what makes the approach architecturally interesting. You don’t need to touch the application code at all. You attach probes to the Ruby interpreter binary, which is the same binary across all your hosts. A single eBPF program specification can be deployed to an entire fleet and activated at will without restarting any application processes.

A minimal example of what a uprobe-based exception counter looks like, using the bcc Python bindings:

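from time import sleep

from bcc import BPF

# Kernel-side program: count probe hits per process ID.
program = """
#include <uapi/linux/ptrace.h>

BPF_HASH(exception_counts, u32, u64);

int count_exception(struct pt_regs *ctx) {
    // The upper 32 bits of the helper's return value hold the process ID (TGID).
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    // increment() creates the entry at zero if it does not exist yet.
    exception_counts.increment(pid);
    return 0;
}
"""

b = BPF(text=program)
# Attach to rb_longjmp in the Ruby interpreter. If Ruby was built as a shared
# library, point `name` at libruby.so instead of the ruby executable.
b.attach_uprobe(
    name="/usr/bin/ruby",
    sym="rb_longjmp",
    fn_name="count_exception"
)

# Keep the script alive and print the counters once per second;
# the probe is detached when this process exits.
try:
    while True:
        sleep(1)
        for pid, count in b["exception_counts"].items():
            print(f"pid={pid.value} exceptions={count.value}")
except KeyboardInterrupt:
    pass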

In production you’d use libbpf and CO-RE (Compile Once, Run Everywhere) rather than bcc, because CO-RE programs can be distributed as pre-compiled bytecode that adapts to different kernel versions at load time. But the conceptual shape is the same: identify a kernel or userspace function, attach a small program to it, accumulate data in a map.

The Deployment Safety Loop

GitHub’s canary deployment system routes a fraction of production traffic to hosts running the new version of the application. The eBPF layer adds a behavioral comparison: it collects exception rates, error rates, and other kernel-observable signals from both the canary hosts and the baseline hosts running the current version.

This comparison is statistically more robust than a simple threshold. Any production system has baseline noise: transient errors from upstream dependencies, occasional timeouts, background job failures. Comparing canary behavior to baseline behavior on the same traffic, at the same time, removes most of that noise. What you’re left with is a cleaner signal: is the canary generating more exceptions than the baseline, controlling for the traffic it’s seeing?
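To make “controlling for the traffic it’s seeing” concrete, here is a minimal sketch that normalizes exception counts by request counts before comparing cohorts. The `CohortSample` type, the `relative_delta` function, and the sample numbers are all made up for illustration; GitHub’s actual metric definitions aren’t described here.

```python
from dataclasses import dataclass


@dataclass
class CohortSample:
    """Counts gathered from one cohort over the same time window."""
    exceptions: int
    requests: int

    @property
    def rate(self) -> float:
        # Exceptions per request; 0.0 when the cohort saw no traffic.
        return self.exceptions / self.requests if self.requests else 0.0


def relative_delta(canary: CohortSample, baseline: CohortSample) -> float:
    """Fractional increase of the canary's exception rate over the baseline's."""
    if baseline.rate == 0.0:
        # No baseline exceptions: any canary exception is an unbounded increase.
        return float("inf") if canary.rate > 0.0 else 0.0
    return (canary.rate - baseline.rate) / baseline.rate


baseline = CohortSample(exceptions=200, requests=100_000)  # 0.2% of requests
canary = CohortSample(exceptions=15, requests=5_000)       # 0.3% of requests
print(f"{relative_delta(canary, baseline):.2f}")
```

The canary raised far fewer exceptions in absolute terms (15 versus 200), yet its per-request rate is half again as high. Comparing raw counts would have hidden that.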

When the signal crosses a threshold, the deployment system halts the rollout and can trigger automatic rollback. The decision happens in near-real-time because eBPF maps are updated by the kernel synchronously with the events they track, with no userspace collection latency beyond the polling interval.

This is the loop:

1. Start canary deployment to a small host cohort.
2. eBPF programs activate on canary and baseline hosts.
3. A userspace daemon polls eBPF maps and computes the behavioral delta.
4. If the delta exceeds the rollback threshold, halt and revert.
5. If the delta stays within bounds over the observation window, proceed with the full rollout.
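The control flow of that loop can be sketched in a few lines. This is an illustration under simplifying assumptions, not GitHub’s implementation: the delta is a single number per poll, the observation window is counted in polls, and `run_canary` and `on_rollback` are hypothetical names.

```python
from typing import Callable, Iterable


def run_canary(
    deltas: Iterable[float],
    rollback_threshold: float,
    window_polls: int,
    on_rollback: Callable[[], None] = lambda: None,
) -> str:
    """Consume one behavioral delta per poll and decide the rollout's fate.

    Returns "rollback" as soon as a delta exceeds the threshold, "proceed"
    once `window_polls` polls have all stayed within bounds, and
    "inconclusive" if the data ran out before the window completed.
    """
    polls_seen = 0
    for delta in deltas:
        if delta > rollback_threshold:
            on_rollback()  # hypothetical hook: halt the rollout and revert
            return "rollback"
        polls_seen += 1
        if polls_seen >= window_polls:
            return "proceed"
    return "inconclusive"


print(run_canary([0.02, 0.05, 0.01], rollback_threshold=0.25, window_polls=3))
print(run_canary([0.02, 0.40, 0.01], rollback_threshold=0.25, window_polls=3))
```

The first call prints `proceed` and the second prints `rollback`. A real system would likely want something more forgiving than rolling back on a single bad poll, which is part of why the delta computation deserves so much attention.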

Step 3 is where most of the engineering complexity lives. Computing a meaningful delta requires deciding what metrics matter, how to weight them, how long the observation window should be, and what threshold constitutes “too different.” These are statistical questions as much as engineering questions.
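One textbook way to frame “too different” is a two-proportion z-test on the error rates of the two cohorts. The sketch below is a swapped-in standard technique, not something the source attributes to GitHub, and it assumes both cohorts served at least one request.

```python
import math


def two_proportion_z(canary_errors: int, canary_total: int,
                     baseline_errors: int, baseline_total: int) -> float:
    """z-statistic for the hypothesis that both cohorts share one error rate.

    Assumes canary_total and baseline_total are both greater than zero.
    """
    p_canary = canary_errors / canary_total
    p_baseline = baseline_errors / baseline_total
    pooled = (canary_errors + baseline_errors) / (canary_total + baseline_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / canary_total + 1 / baseline_total))
    if std_err == 0.0:
        return 0.0  # no errors (or only errors) in both cohorts: nothing to distinguish
    return (p_canary - p_baseline) / std_err


def one_sided_p_value(z: float) -> float:
    """P(Z >= z) under the standard normal: the chance of a gap this large by luck."""
    return 0.5 * math.erfc(z / math.sqrt(2))


z = two_proportion_z(15, 5_000, 200, 100_000)
print(f"z={z:.2f} p={one_sided_p_value(z):.4f}")
```

With these made-up counts the canary’s error rate is 50% higher than the baseline’s, yet the z-statistic comes out around 1.5, short of the conventional 1.645 cutoff for a one-sided 5% test. A small canary cohort needs either a longer observation window or a larger effect before the difference is distinguishable from noise.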

Why Not OpenTelemetry or a Service Mesh?

The question worth asking is why this is better than existing alternatives. OpenTelemetry provides standardized instrumentation that applications can emit directly. Service meshes like Istio or Linkerd can observe traffic at the network layer without application changes. Both approaches have active communities and broad tooling support.

The kernel-level eBPF approach has a specific advantage that neither of those provides: it sees application behavior that never manifests as network traffic. An exception that’s caught internally and converted to a default return value is invisible to a service mesh. A Ruby GC pause that causes a request queue to back up shows up in latency metrics, but its cause doesn’t. eBPF attached to interpreter internals can observe the exception before it’s handled, the GC cycle as it’s happening, and the system calls that back the I/O the application is doing.

OpenTelemetry requires the application to emit spans and metrics. That’s fine for steady-state observability, but for deployment safety you want signals the application doesn’t control, because the thing you’re trying to detect is the application behaving differently than expected. An eBPF observer outside the application process has a harder-to-fool view.

The service mesh comparison is more interesting. Cilium, which is itself eBPF-based, can provide deep network observability that overlaps with what GitHub is doing here. But Cilium is primarily a CNI plugin focused on networking and security policy; it doesn’t attach to Ruby interpreter internals. The two approaches are complementary rather than competing.

Performance Overhead in Practice

The standard question about any production tracing tool is: what does it cost? eBPF has a reputation for low overhead, and that reputation is generally accurate but context-dependent.

Uprobe overhead scales with the call frequency of the probed function. Attaching a probe to a function called millions of times per second, like a tight inner loop, will have measurable overhead. Attaching to exception-handling paths, which should be infrequent on a healthy application, has negligible cost. GitHub is probing error paths specifically, which means the overhead is roughly proportional to the error rate, which should be low.

The eBPF verifier also constrains what programs can do: no unbounded loops (bounded loops have been accepted since kernel 5.3), a 512-byte stack, a cap on program complexity (raised from 4,096 instructions to one million verified instructions in kernel 5.2), and restricted helper function access. These constraints prevent programs from becoming performance problems by limiting how much work they can do per invocation.

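Uprobes are trap-based: every hit takes a breakpoint exception and a round trip through the kernel, which makes them noticeably more expensive per event than kprobes or tracepoints. Published measurements vary with hardware and kernel version, but a cost on the order of a microsecond per hit is a common ballpark, so a probe on a function called 1 million times per second can consume a large share of a CPU core. For error-path probing, where the probe fires a few thousand times per second rather than millions, the same arithmetic puts the cost at a fraction of a percent of one core.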
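The scaling argument is simple enough to write down. In the sketch below the per-hit cost is a placeholder assumption, not a measurement; substitute a figure benchmarked on your own hardware and kernel.

```python
def probe_cpu_fraction(hits_per_second: float, cost_per_hit_ns: float) -> float:
    """Fraction of one CPU core spent handling probe hits."""
    return hits_per_second * cost_per_hit_ns / 1e9


ASSUMED_COST_NS = 1_000.0  # placeholder assumption, not a measured value

for rate in (100, 5_000):
    fraction = probe_cpu_fraction(rate, ASSUMED_COST_NS)
    print(f"{rate:>6} hits/s -> {fraction:.4%} of a core")
```

Because the relationship is linear, the useful number to monitor is the hit rate of each probed function: a deploy that suddenly raises far more exceptions also raises the cost of observing them.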

The Broader Ecosystem Context

GitHub isn’t alone in using eBPF for production observability. Pixie, which New Relic acquired, uses eBPF to provide automatic instrumentation for Kubernetes workloads without requiring application code changes. Parca uses eBPF for continuous profiling. Tetragon, from the Cilium project, uses eBPF for security observability and enforcement at the kernel level.

The pattern these tools share is worth naming explicitly: eBPF lets you add observability as infrastructure rather than as application code. The instrumentation is deployed and managed separately from the application, it can be updated without application restarts, and it observes the application from outside its own abstraction boundaries.

For deployment safety specifically, that separation is the point. You want the thing watching your deployment to be independent from the thing being deployed.

What This Takes to Build

Implementing this isn’t a weekend project. eBPF programming requires understanding the verifier’s constraints, the CO-RE portability model, the differences between kprobes and uprobes, and how to navigate kernel version differences. The Ruby-specific part requires understanding which C-level functions in the interpreter correspond to the Ruby-level behaviors you want to observe, and those mappings can change between Ruby versions.

The statistical layer on top, the part that computes meaningful deltas and makes rollback decisions, is a separate engineering challenge entirely. Getting the thresholds wrong means either missing real regressions (too permissive) or blocking good deployments (too strict).

GitHub has the engineering depth to build and maintain this infrastructure. For smaller teams, the practical path is probably tools like Pixie or Parca that provide eBPF-based observability without requiring you to write eBPF programs yourself. But GitHub’s approach demonstrates what becomes possible when you treat the kernel as a first-class observability surface rather than an opaque substrate beneath your application.

The deployment pipeline is the last line of defense before production. Building that defense out of kernel-level instrumentation, rather than application-level self-reporting, is a meaningful architectural choice with real consequences for the quality of signal you get when something goes wrong.