When 75 Nanoseconds Is All You Have: CERN's FPGA ML Pipeline

The Large Hadron Collider generates roughly 40 terabytes of raw detector data every second. Permanent storage can absorb maybe 1 to 3 gigabytes per second. The gap between those two numbers is the fundamental engineering problem of experimental particle physics, and CERN’s approach to closing it is one of the more technically unusual deployments of machine learning anywhere in science.

The answer involves neural networks translated into FPGA firmware, running inference in under 100 nanoseconds, at 40 million events per second, with no CPU in the data path. The models are tiny by modern standards, but the constraints that make them tiny are instructive for anyone thinking seriously about inference at the edge.

The Trigger Problem

The LHC collides proton bunches at 40 MHz. Each bunch crossing can produce dozens of simultaneous inelastic collisions. The detectors around the interaction points, primarily ATLAS and CMS, record the resulting shower of particles across hundreds of millions of sensor channels. The physics of interest (Higgs decays, potential supersymmetric particles, rare B-meson processes) represents a tiny fraction of total events. Most crossings produce only low-energy background collisions that are well understood and scientifically uninteresting.

The trigger system’s job is to decide, for every single one of those 40 million crossings per second, whether the event is worth keeping. The Level-1 trigger must make that decision within a fixed latency budget. For the CMS detector in Run 3, the L1 trigger has roughly 4 microseconds to issue an accept or reject signal. For ATLAS, the budget is around 2.5 microseconds. Either way, the decision must be made before the detector’s front-end pipeline buffers overflow.

At 40 MHz input, even 4 microseconds of latency means the firmware must be pipelined so that a new event enters the pipeline every 25 nanoseconds, regardless of where earlier events are in their processing stages. The target output rate is around 100 kHz, a reduction factor of about 400. That 100 kHz stream then goes to the High-Level Trigger, a CPU farm, which applies more sophisticated reconstruction before writing to disk at 1 to 3 kHz.
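The arithmetic behind those figures can be checked in a few lines. This is only a back-of-envelope sketch using the numbers quoted above; the constant and function names are made up for illustration.

```python
# Back-of-envelope trigger arithmetic using the figures quoted in the text.
# All names here are illustrative, not taken from any trigger software.
BUNCH_CROSSING_HZ = 40_000_000   # 40 MHz bunch-crossing rate
L1_LATENCY_NS = 4_000            # ~4 microseconds (the CMS figure above)
L1_OUTPUT_HZ = 100_000           # ~100 kHz Level-1 accept rate


def bunch_spacing_ns(rate_hz: int) -> int:
    """Time between consecutive bunch crossings, in nanoseconds."""
    return 1_000_000_000 // rate_hz


def events_in_flight(latency_ns: int, spacing_ns: int) -> int:
    """Number of events inside the pipeline at any instant."""
    return latency_ns // spacing_ns


def reduction_factor(input_hz: int, output_hz: int) -> int:
    """How many input events there are per accepted event."""
    return input_hz // output_hz


spacing = bunch_spacing_ns(BUNCH_CROSSING_HZ)
print(spacing)                                            # 25
print(events_in_flight(L1_LATENCY_NS, spacing))           # 160
print(reduction_factor(BUNCH_CROSSING_HZ, L1_OUTPUT_HZ))  # 400
```

With a 4 microsecond budget, about 160 events are in different stages of processing at the same moment, which is why the firmware has to be pipelined and cannot simply be fast.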

The Level-1 trigger is implemented entirely in custom ASICs and FPGAs. No software CPUs. No operating system. The compute happens in configurable digital logic clocked at 200 to 250 MHz.

Why Not GPUs

Modern GPUs can do millions of inferences per second for typical batch sizes. That sounds adequate until you examine the latency profile. A GPU inference call has overhead measured in microseconds to milliseconds: memory transfers, CUDA kernel launch latency, scheduling. Even the fastest GPU inference pipelines have end-to-end latencies of several microseconds for a single input with no batching.

The L1 trigger cannot batch events and wait. Each crossing is independent. You cannot accumulate a batch of 1,000 events and send them to a GPU in parallel, because the first event in that batch would already be 25 microseconds old by the time the batch filled. The decision would be useless.

FPGAs solve this because they execute logic spatially, not sequentially. A fully pipelined FPGA design processes one new input per clock cycle, with each stage of computation happening simultaneously on different pieces of hardware. The latency from input to output is fixed and deterministic, set by the number of pipeline stages multiplied by the clock period. At 250 MHz with 20 pipeline stages, you get 80 nanoseconds of latency, with new results produced every 4 nanoseconds.
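As a rough illustration of the two timing arguments above, the sketch below computes the fixed latency of a fully pipelined design and the time a batch would take to fill at the collision rate. The function names are hypothetical, and the model ignores routing, I/O, and serialization overhead.

```python
def clock_period_ns(clock_mhz: float) -> float:
    """Duration of one clock cycle in nanoseconds."""
    return 1_000.0 / clock_mhz


def pipeline_latency_ns(stages: int, clock_mhz: float) -> float:
    """Input-to-output latency of a fully pipelined design.

    Assumes one pipeline stage per clock cycle and nothing else in the path.
    """
    return stages * clock_period_ns(clock_mhz)


def batch_fill_time_us(batch_size: int, spacing_ns: float = 25.0) -> float:
    """Time to accumulate a batch when one event arrives every spacing_ns."""
    return batch_size * spacing_ns / 1_000.0


print(clock_period_ns(250.0))           # 4.0  (a new result every 4 ns)
print(pipeline_latency_ns(20, 250.0))   # 80.0 (ns from input to output)
print(batch_fill_time_us(1_000))        # 25.0 (us just to fill a GPU batch)
```

The batch fill time alone is several times the whole Level-1 budget, before the GPU has done any work.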

ASICs would be faster still, and CERN has explored that direction. A 2021 paper by Di Guglielmo and colleagues (arXiv:2101.05108) demonstrated taking a neural network through the same design flow used for FPGAs and synthesizing it into an ASIC tape-out for detector front-end data compression. But ASICs take years to fabricate and cannot be updated after deployment. FPGAs can be reprogrammed, which matters when the collision conditions change or a new physics analysis requires a different trigger strategy.

hls4ml: Turning Keras into Firmware

The tool that makes FPGA deployment practical for physicists who are not digital hardware engineers is hls4ml, developed primarily by researchers at CERN and Fermilab. It takes a trained neural network from Keras, PyTorch via ONNX, or scikit-learn, and emits C++ code annotated with vendor-specific pragmas that FPGA high-level synthesis tools like Xilinx Vivado HLS compile into register-transfer-level logic.

The foundational paper from Duarte and colleagues in 2018 (arXiv:1804.06913) benchmarked a three-hidden-layer network with 64, 32, and 32 neurons per layer on a Xilinx Virtex-7 XC7VX690T FPGA. The network classified jets by particle type. At 16-bit fixed-point precision, the synthesized firmware achieved an end-to-end latency of around 75 nanoseconds, using roughly 3 percent of the chip’s LUTs and 5 percent of its DSPs. That leaves most of the chip available for other logic or additional networks.

Fixed-point arithmetic is the core of why this works at all. Floating-point multipliers are expensive in FPGA logic. Fixed-point multipliers, where the binary point is defined at compile time, are far cheaper. hls4ml maps each weight and activation to an ap_fixed<W,I> type, where W is the total bit width and I is the number of integer bits. The default is ap_fixed<16,6>, but tuning this per layer is where real resource savings come from.
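To make the ap_fixed<W,I> notation concrete, here is a minimal pure-Python emulation of a signed fixed-point value, where I counts the integer bits including the sign bit. It is a sketch for building intuition, not hls4ml or Vivado HLS code. It truncates toward minus infinity and, by default, saturates on overflow; the saturation default is an assumption made for readability, and passing saturate=False emulates two's-complement wraparound instead.

```python
import math


def quantize_ap_fixed(x: float, width: int = 16, integer: int = 6,
                      saturate: bool = True) -> float:
    """Emulate storing x in a signed fixed-point type with `width` total bits
    and `integer` integer bits (sign bit included).

    Illustrative helper only; the name and defaults are not a library API.
    """
    frac_bits = width - integer
    scale = 1 << frac_bits                 # number of steps per unit
    raw = math.floor(x * scale)            # truncate toward minus infinity
    lo = -(1 << (width - 1))               # most negative raw integer
    hi = (1 << (width - 1)) - 1            # most positive raw integer
    if saturate:
        raw = max(lo, min(hi, raw))        # clamp to the representable range
    else:
        raw = (raw - lo) % (1 << width) + lo   # two's-complement wraparound
    return raw / scale


print(quantize_ap_fixed(0.3))                           # 0.2998046875
print(quantize_ap_fixed(0.3, width=8, integer=3))       # 0.28125
print(quantize_ap_fixed(100.0))                         # 31.9990234375
print(quantize_ap_fixed(4.0, 8, 3, saturate=False))     # -4.0
```

Dropping from 16 to 8 bits coarsens the grid from steps of 1/1024 to steps of 1/32 and shrinks the range from about ±32 to about ±4, which is why the choice of W and I has to follow the actual distribution of weights and activations in each layer.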

The Co-Design Loop

The interesting engineering is not in hls4ml itself but in the workflow it enables. Physics collaborations have converged on what they call co-design: the model architecture and the hardware constraints are optimized together from the beginning, not sequentially.

A conventional ML deployment pipeline trains a model to high accuracy, then tries to compress it to fit the target hardware. That works badly when the hardware constraints are as tight as those of a Level-1 trigger. A model that achieves good physics performance at float32 precision often degrades significantly when naively quantized to 8 bits.

The co-design approach instead begins with the target resource budget and latency constraint. The designer runs hls4ml synthesis estimates during the architecture search phase, before full synthesis, to get approximate LUT and DSP counts. The quantization precision becomes a first-class hyperparameter. QKeras, a quantization-aware extension of Keras, allows training with simulated fixed-point operations in the forward pass so that gradients reflect the quantization noise the model will face during deployment.
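The idea behind quantization-aware training can be shown on a toy problem without any framework. The sketch below is not QKeras code: it fits a single weight by gradient descent while the forward pass only ever sees the quantized weight, and the backward pass treats the quantizer as the identity (the straight-through estimator). The 8-bit format with 3 integer bits, the learning rate, and the data are arbitrary choices for illustration.

```python
def fake_quantize(w: float, width: int = 8, integer: int = 3) -> float:
    """Round w to the nearest point on a signed fixed-point grid, saturating."""
    scale = 1 << (width - integer)
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    raw = max(lo, min(hi, round(w * scale)))
    return raw / scale


def train_quantized_weight(xs, target_w, lr=0.05, epochs=200):
    """Fit y = w * x with the quantized weight used in the forward pass."""
    w = 0.0                               # full-precision "shadow" weight
    for _ in range(epochs):
        for x in xs:
            w_q = fake_quantize(w)        # forward pass uses the quantized value
            error = w_q * x - target_w * x
            grad = 2.0 * error * x        # straight-through: d(w_q)/dw taken as 1
            w -= lr * grad
    return w, fake_quantize(w)


w_float, w_deployed = train_quantized_weight([1.0, 2.0, -1.0, 0.5], target_w=0.7)
print(w_float, w_deployed)
```

The deployed weight lands on a grid point adjacent to the target of 0.7, because the optimizer has been minimizing the loss of the quantized model all along and not the loss of a float model that is rounded afterwards.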

Coelho and colleagues formalized this in 2021 (arXiv:2006.10159) with heterogeneous quantization: different layers in the same network can have different bit widths, searched automatically. Input layers might use ap_fixed<16,6> while internal activations use ap_fixed<8,3>. The AutoQKeras framework treats per-layer precision as a discrete hyperparameter in a Keras Tuner search. The results were substantial: up to 50x reduction in resource usage with less than 2 percent accuracy loss on jet classification benchmarks.
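A crude way to see why per-layer precision matters is to count weight bits. The sketch below uses the hidden-layer sizes quoted earlier (64, 32, 32); the 16 inputs and 5 outputs are assumptions added to make the example complete. Total weight bits is only a rough proxy and is not an estimate of LUT or DSP usage, which depends on how the synthesis tool maps multipliers.

```python
# Dense layer shapes as (inputs, outputs). Hidden sizes follow the text;
# the 16-input, 5-output ends are assumptions for this sketch.
LAYER_SHAPES = [(16, 64), (64, 32), (32, 32), (32, 5)]


def weight_count(shapes) -> int:
    """Total number of weights, ignoring biases."""
    return sum(n_in * n_out for n_in, n_out in shapes)


def weight_bits(shapes, bit_widths) -> int:
    """Total bits needed to hold all weights at the given per-layer widths."""
    if len(shapes) != len(bit_widths):
        raise ValueError("need exactly one bit width per layer")
    return sum(n_in * n_out * bits
               for (n_in, n_out), bits in zip(shapes, bit_widths))


uniform = weight_bits(LAYER_SHAPES, [16, 16, 16, 16])
mixed = weight_bits(LAYER_SHAPES, [16, 8, 8, 8])

print(weight_count(LAYER_SHAPES))  # 4256
print(uniform)                     # 68096
print(mixed)                       # 42240
```

An automated search explores many such per-layer assignments and keeps the ones that preserve accuracy, something this sketch does not attempt to model.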

Beyond Dense Networks

Dense (fully connected) layers are the easiest case for FPGA synthesis because the computation pattern is regular: matrix multiplication with fixed dimensions. CERN’s physics requirements pushed hls4ml to support more complex architectures.

Boosted Decision Trees were an early priority because many existing physics analyses used BDTs rather than neural networks. The Conifer library, spun out from hls4ml, handles XGBoost, scikit-learn gradient boosting, and TMVA BDTs. A tau lepton identification BDT for the CMS Run 3 Level-1 trigger achieves inference at around 80 nanoseconds on Xilinx UltraScale+ hardware. That BDT is running in production at the LHC today.

Graph Neural Networks address a structural problem: particles in a detector do not arrive on a regular grid. A calorimeter tower sum is spatial, but a set of reconstructed particle tracks is a variable-length point cloud with non-trivial topology. The GarNet architecture, a simplified message-passing graph network, was demonstrated on FPGAs by Iiyama and colleagues (arXiv:2003.06396) for calorimeter clustering, achieving around 200 to 500 nanoseconds depending on configuration. Variable-length inputs require special handling: the firmware pads to a maximum size at synthesis time, a co-design constraint that forces the physicist to commit to a maximum number of input objects.
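The padding constraint can be sketched as follows. This is a generic illustration, not the GarNet firmware interface: the function name is hypothetical, and the validity mask and the choice to truncate extra objects are assumptions about how a fixed-size input might be prepared.

```python
def pad_objects(objects, max_objects: int, n_features: int):
    """Pad or truncate a variable-length list of objects to a fixed size.

    Returns (padded, mask): padded has exactly max_objects rows of
    n_features values, and mask marks real rows with 1 and padding with 0.
    """
    padded, mask = [], []
    for obj in objects[:max_objects]:      # anything beyond the cap is dropped
        if len(obj) != n_features:
            raise ValueError("every object needs n_features values")
        padded.append(list(obj))
        mask.append(1)
    while len(padded) < max_objects:       # fill the remainder with zeros
        padded.append([0.0] * n_features)
        mask.append(0)
    return padded, mask


rows, valid = pad_objects([[1.0, 2.0], [3.0, 4.0]], max_objects=4, n_features=2)
print(rows)   # [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
print(valid)  # [1, 1, 0, 0]
```

The cap is a physics decision as much as a hardware one: set it too low and busy events lose information, set it too high and every event pays for logic that mostly processes zeros.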

Autoencoders are perhaps the most scientifically interesting case. Govorkova and colleagues (arXiv:2201.05349) deployed an autoencoder on a Xilinx Virtex UltraScale+ XCVU9P running at 40 MHz for unsupervised anomaly detection. The encoder compresses 16 input features down to a 2-dimensional bottleneck. Events with high reconstruction error are flagged as potentially anomalous, possibly indicating new physics that no specific supervised classifier was trained to find. At 8-bit precision the encoder fits in around 10 percent of the chip’s LUTs and 8 percent of its DSPs, with approximately 100 nanoseconds of latency. This is meaningful: the standard trigger relies on recognizing known signatures. An autoencoder running in parallel can flag events that look unlike anything in the training data, providing a model-independent discovery channel.
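The selection logic reduces to computing a reconstruction error and comparing it with a threshold. The toy below is a sketch of that idea only: it uses 4 inputs where the deployed model uses 16, the encoder and decoder weights are set by hand and not trained, and the threshold value is hypothetical.

```python
def encode(x):
    """Toy encoder: compress 4 features to 2 by averaging adjacent pairs."""
    return [(x[0] + x[1]) / 2.0, (x[2] + x[3]) / 2.0]


def decode(z):
    """Toy decoder: expand 2 latent values back to 4 features."""
    return [z[0], z[0], z[1], z[1]]


def reconstruction_error(x) -> float:
    """Mean squared error between the input and its reconstruction."""
    x_hat = decode(encode(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def is_anomalous(x, threshold: float) -> bool:
    """Flag events the autoencoder cannot reconstruct well."""
    return reconstruction_error(x) > threshold


typical = [1.0, 1.0, 2.0, 2.0]    # matches the structure the model encodes
unusual = [3.0, -1.0, 2.0, 2.0]   # breaks the pairwise correlation

print(reconstruction_error(typical))        # 0.0
print(reconstruction_error(unusual))        # 2.0
print(is_anomalous(unusual, threshold=0.5)) # True
```

In a trigger, the threshold is what sets the accept rate, so it is tuned against the bandwidth budget and not against a labeled signal sample.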

What the Numbers Actually Mean

For context, a modern GPU inference call for a small model takes roughly 50 to 500 microseconds end-to-end including data transfer. An FPGA-based hls4ml inference for the same model architecture takes 75 to 300 nanoseconds. Depending on which ends of those ranges are compared, that is between a few hundred and several thousand times faster, with deterministic latency and no OS jitter.

The tradeoff is that the FPGA model is fixed at synthesis time. Changing the network weights requires re-synthesis and reprogramming, which takes hours. The fixed-point precision of every layer must be specified before synthesis. The maximum input size must be known at compile time. These constraints push the co-design workflow hard toward front-loaded design work: getting the architecture right before the synthesis cycle starts.

The High-Luminosity LHC upgrade, expected to begin operation around 2029, will raise the pileup to approximately 200 simultaneous interactions per bunch crossing, up from around 50 in Run 3. Both ATLAS and CMS are redesigning their Level-1 triggers for this environment, and both upgrade designs explicitly incorporate hls4ml-based inference in the trigger firmware. The CMS Phase-2 Global Trigger uses a two-stage FPGA architecture with ML models for object identification running at every bunch crossing.

Broader Implications

The techniques developed at CERN for ultra-low-latency inference have obvious relevance outside particle physics. Any application where inference must happen in nanoseconds with deterministic timing (real-time industrial control, high-frequency trading, network packet classification, radar signal processing) faces the same fundamental constraints: GPUs are too slow, CPUs have too much jitter, and ASICs are too inflexible.

hls4ml has been used beyond physics for exactly this reason. The co-design methodology it embodies, training models with hardware constraints as explicit objectives instead of compressing them as an afterthought, is a more principled approach to edge deployment than post-training quantization, regardless of application domain.

The models are small in parameter count because they must be, not because anyone is making a theoretical point about model efficiency. A three-layer dense network with a few thousand weights is not going to win an ImageNet benchmark. But it can identify a tau lepton in 80 nanoseconds inside a detector hall 100 meters underground, deciding in real time whether the collision is worth saving. For that specific job, it is the right tool, sized correctly for the work.