
Nanosecond Inference: How CERN Compiles Neural Networks Into FPGA Firmware

Source: hackernews

The Large Hadron Collider produces roughly 40 million proton-proton collision events per second. Each raw event from a detector like CMS or ATLAS carries somewhere between one and two megabytes of sensor data. That puts the raw throughput in the range of 40 to 80 terabytes per second, a number that no storage or network infrastructure on Earth can absorb directly. The entire physics programme depends on a real-time filtering system that discards the vast majority of collisions while reliably preserving the rare ones that might indicate new physics.
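The arithmetic behind that range is short enough to check directly. The sketch below uses only the event rate and event size quoted above; the function and constant names are illustrative, and decimal units (1 TB = 1e6 MB) are assumed.

```python
# Back-of-envelope raw data rate, using only the figures quoted in the text.
EVENT_RATE_HZ = 40e6          # roughly 40 million collisions per second
EVENT_SIZE_MB = (1.0, 2.0)    # one to two megabytes per raw event


def raw_rate_tb_per_s(event_rate_hz, event_size_mb):
    """Raw throughput in terabytes per second (assumes 1 TB = 1e6 MB)."""
    return event_rate_hz * event_size_mb / 1e6


low, high = (raw_rate_tb_per_s(EVENT_RATE_HZ, size) for size in EVENT_SIZE_MB)
print(f"{low:.0f} to {high:.0f} TB/s")
```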

This is the trigger problem, and CERN’s recent work on embedding compressed neural networks into FPGA firmware is the most technically interesting answer to it in decades.

The Latency Hierarchy

The trigger system operates in two tiers. The Level-1 (L1) trigger is implemented entirely in custom hardware, FPGAs and ASICs, with a fixed latency budget of roughly 2.5 to 4 microseconds. It must reduce 40 MHz of collision events down to approximately 100 kHz for further processing. Every clock cycle in this pipeline is accounted for in advance. There is no dynamic scheduling, no heap allocation, no OS involvement. The circuit either makes a keep-or-discard decision within its allotted time or the event is gone forever.
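To put those numbers in proportion, here is a rough calculation from the rates and latency budget quoted above. The variable names are illustrative, and the "events in flight" figure simply assumes one new event per collision period while earlier decisions are still pending.

```python
COLLISION_RATE_HZ = 40e6      # input rate to L1
L1_OUTPUT_RATE_HZ = 100e3     # approximate L1 accept rate
L1_LATENCY_US = (2.5, 4.0)    # fixed latency budget in microseconds

# Fraction of events L1 may keep, and the matching rejection factor.
keep_fraction = L1_OUTPUT_RATE_HZ / COLLISION_RATE_HZ
rejection_factor = COLLISION_RATE_HZ / L1_OUTPUT_RATE_HZ

# Time between collisions at 40 MHz, in nanoseconds.
crossing_period_ns = 1e9 / COLLISION_RATE_HZ

# New events keep arriving while a decision is pending, so the pipeline
# has to hold this many events at once (assumption: one per period).
events_in_flight = [round(us * 1000 / crossing_period_ns) for us in L1_LATENCY_US]

print(f"keep 1 event in {rejection_factor:.0f}")
print(f"{crossing_period_ns:.0f} ns between collisions")
print(f"{events_in_flight[0]} to {events_in_flight[1]} events in flight")
```

Seen this way, L1 keeps about one event in 400 and must be fully pipelined: it cannot finish one event before starting the next.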

The High-Level Trigger (HLT) sits downstream: a farm of roughly ten thousand CPU cores running software reconstruction with latencies in the hundreds of milliseconds. That is a more familiar computing environment, where you can run PyTorch or TensorFlow normally. The interesting and difficult constraint is L1. Getting a neural network to run in that environment, where a single algorithm's share of the microsecond-scale budget is measured in tens to hundreds of nanoseconds, requires a fundamentally different approach from loading a model into GPU memory.

What hls4ml Actually Does

The library that makes this possible is hls4ml, which stands for High-Level Synthesis for Machine Learning. It was first described in a 2018 paper by Duarte et al. published in JINST, and has been developed since then through a collaboration involving CERN, Fermilab, SLAC, MIT, Caltech, and several universities under the FastML Science umbrella.

The workflow goes: train a neural network in Keras, TensorFlow, or PyTorch, or start from an ONNX export, then pass the model to hls4ml, which translates it into synthesizable HLS C++. That C++ is fed into Xilinx Vitis HLS or Intel Quartus, which compile it down to an FPGA bitstream. The resulting hardware circuit implements the neural network’s forward pass as a fixed digital circuit, with the learned weights embedded as constants directly into the arithmetic logic.
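As a toy picture of what "translating a model into C++ with the weights embedded" means, the sketch below emits C++-like source text for a single dense layer, writing each weight as a literal. It is not hls4ml's code generator, its templates, or its output; the function name `emit_dense_layer`, the `data_t` typedef, and the example weights are all made up for illustration.

```python
def emit_dense_layer(weights, biases, total_bits=6, int_bits=2, name="dense"):
    """Emit C++-like source for y = W x + b with every weight as a literal.

    weights: list of rows, one row per output; biases: one value per output.
    """
    n_out, n_in = len(weights), len(weights[0])
    lines = [
        "// Toy output for illustration; not generated by hls4ml.",
        '#include "ap_fixed.h"',
        f"typedef ap_fixed<{total_bits},{int_bits}> data_t;",
        f"void {name}(const data_t x[{n_in}], data_t y[{n_out}]) {{",
    ]
    for j, (row, bias) in enumerate(zip(weights, biases)):
        # One multiply per weight, all visible to the compiler as constants.
        terms = [f"data_t({w}) * x[{i}]" for i, w in enumerate(row)]
        terms.append(f"data_t({bias})")
        lines.append(f"    y[{j}] = " + " + ".join(terms) + ";")
    lines.append("}")
    return "\n".join(lines)


source = emit_dense_layer([[0.5, -0.25], [0.125, 0.75]], [0.0, -0.5])
print(source)
```

Because no weight is read from an array at runtime, a synthesis tool is free to specialise each multiplier for its constant, which is the property the next section turns on.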

The phrase “burned into silicon” from the original article is a useful shorthand, though slightly imprecise. FPGAs are reprogrammable: the bitstream can be reloaded between LHC fills or during maintenance, which is the key operational advantage over a true ASIC. What is accurate is that the weights, once quantized, are not stored in RAM and fetched at runtime. They become hardcoded constants folded into the LUT and DSP configurations of the FPGA fabric. A 4-bit weight in a multiply-accumulate unit does not require a memory read; it is part of the circuit itself.
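One way to see why a constant weight needs no memory read: multiplying by a known integer reduces to a fixed pattern of shifts and adds. The sketch below illustrates that idea for a non-negative integer weight. It is an analogy, not a description of what any particular synthesis tool emits, and whether a given multiplier ends up in LUTs or a DSP block is the tool's decision.

```python
def constant_multiplier(weight):
    """Build a multiply-by-weight function from shifts and adds only.

    Returns the function and the list of shift amounts, one per set bit
    of the weight. Sketch only: handles non-negative integer weights.
    """
    if weight < 0:
        raise ValueError("sketch handles non-negative weights only")
    shifts = [bit for bit in range(weight.bit_length()) if (weight >> bit) & 1]

    def multiply(x):
        # The weight is never looked up; it survives only as this shift pattern.
        return sum(x << s for s in shifts)

    return multiply, shifts


times5, shifts = constant_multiplier(0b0101)  # a 4-bit weight with value 5
print(shifts)      # [0, 2]
print(times5(7))   # 35
```

A weight of 5 becomes "x plus x shifted left by two". Changing the weight changes the wiring, which is why new weights mean a new bitstream rather than a memory write.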

Quantization as Compression

The step that makes any of this feasible is aggressive fixed-point quantization. Standard trained networks use 32-bit floating-point weights. An FPGA at L1 cannot afford the DSP resources or routing complexity to implement 32-bit floating-point arithmetic across a full neural network within a few dozen nanoseconds. hls4ml solves this by converting weights and activations to narrow fixed-point representations, typically 4 to 8 bits, using the HLS ap_fixed<N,I> type, where N is the total bit width and I is the number of integer bits.
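To make the ap_fixed<N,I> notation concrete, here is a small Python emulation of storing a value in that type. It assumes the signed variant with its default behaviour, as far as I understand it: extra fractional bits are truncated toward minus infinity, overflow wraps around, and I counts the sign bit. The helper name `to_ap_fixed` is made up for this sketch.

```python
import math


def to_ap_fixed(x, total_bits, int_bits):
    """Emulate storing x in a signed ap_fixed<total_bits, int_bits>.

    Assumes truncation toward minus infinity for bits below the LSB
    and wrap-around on overflow.
    """
    frac_bits = total_bits - int_bits
    raw = math.floor(x * 2 ** frac_bits)   # drop bits below the LSB
    raw &= (1 << total_bits) - 1           # keep only total_bits bits
    if raw >= 1 << (total_bits - 1):       # reinterpret as two's complement
        raw -= 1 << total_bits
    return raw / 2 ** frac_bits


# ap_fixed<6,2>: 4 fractional bits, step 1/16, range -2.0 to 1.9375
print(to_ap_fixed(0.3, 6, 2))    # 0.25
print(to_ap_fixed(-0.3, 6, 2))   # -0.3125
print(to_ap_fixed(2.0, 6, 2))    # -2.0 (wrapped: just past the top of the range)
```

The last line shows why the choice of I matters as much as N: a value just outside the representable range does not clip under these defaults, it wraps to the opposite sign.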

A typical trigger model, a three-to-five-layer fully connected network with 32 to 64 nodes per layer, quantized to 6-bit fixed point, can achieve inference latency around 50 to 75 nanoseconds on a Xilinx Virtex UltraScale+ device running at 250 MHz. At that clock, a single cycle is 4 nanoseconds, so the model fits in roughly 12 to 18 cycles. Models at the larger end of that range, with five layers and 64 nodes per layer, run closer to 100 to 200 nanoseconds, still well within the L1 budget of 2,500 nanoseconds.
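The conversion between nanoseconds and clock cycles can be sketched as follows, using the 250 MHz clock (4 ns period) and the 2,500 ns budget from the text. The helper name `cycles_for` is illustrative.

```python
def cycles_for(latency_ns, clock_mhz):
    """Clock cycles that fit in latency_ns at the given clock frequency."""
    period_ns = 1000.0 / clock_mhz
    return latency_ns / period_ns


CLOCK_MHZ = 250
L1_BUDGET_NS = 2500

for latency_ns in (50, 75, 100, 200):
    cycles = cycles_for(latency_ns, CLOCK_MHZ)
    share = latency_ns / L1_BUDGET_NS
    print(f"{latency_ns:>3} ns = {cycles:5.2f} cycles, {share:.0%} of the L1 budget")
```

Even the 200 ns case uses under a tenth of the shortest L1 budget, which is what leaves room for the rest of the trigger chain around it.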

The resource footprint is comparatively small. A 3-layer MLP in this class might consume 2,000 to 10,000 LUTs and 50 to 500 DSP blocks on a Virtex UltraScale+ device like the VU9P, which has 1.2 million LUTs and 6,840 DSPs in total. Multiple independent models can be instantiated in parallel on a single FPGA, either for redundancy or to run different classifiers on different detector subsystems simultaneously.
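A quick utilisation estimate from those figures is sketched below. The copy count is only an upper bound under a strong simplifying assumption: it divides raw LUT and DSP totals and ignores routing, I/O, timing closure, and everything else the firmware needs. The dictionary and function names are illustrative.

```python
VU9P = {"lut": 1_200_000, "dsp": 6_840}     # device totals quoted in the text
MODEL_LOW = {"lut": 2_000, "dsp": 50}       # small end of the quoted range
MODEL_HIGH = {"lut": 10_000, "dsp": 500}    # large end of the quoted range


def utilisation(model, device):
    """Fraction of each device resource one model instance would occupy."""
    return {key: model[key] / device[key] for key in device}


def max_copies(model, device):
    """Upper bound on parallel copies, limited by the scarcest resource."""
    return min(device[key] // model[key] for key in device)


for label, model in (("low", MODEL_LOW), ("high", MODEL_HIGH)):
    used = utilisation(model, VU9P)
    print(f"{label}: {used['lut']:.2%} of LUTs, {used['dsp']:.2%} of DSPs, "
          f"at most {max_copies(model, VU9P)} copies")
```

In both cases DSP blocks, not LUTs, are the limiting resource, which is one reason narrow bit widths that let multipliers move out of DSPs are attractive.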

The training side of this pipeline uses QKeras, a quantization-aware training library for Keras. Training with quantization in the loop means the model learns to work within fixed-point constraints, rather than having quantization imposed post-hoc on a full-precision model. The accuracy penalty is measurable but acceptable for the physics use cases in question.

What Gets Run at L1

The most mature application is muon transverse momentum estimation. At L1, the muon trigger system receives coarse “stubs” from the muon detectors: hit patterns with limited spatial resolution. Older systems used lookup tables to map stub patterns to pT estimates. A small neural network replaces those LUTs with a parameterized function that achieves better resolution over the full pT range, occupies similar or smaller FPGA resources, and runs in comparable latency.

Jet tagging is the next tier: identifying whether a jet originated from a bottom quark (b-tagging), a tau lepton, or a boosted heavy boson. Full b-tagging at L1 previously required coarse approximations. With hls4ml, small convolutional networks and graph neural networks can be deployed in firmware, improving the physics reach of the trigger without exceeding the latency budget. The 2021 CNN extension paper demonstrated that convolutional architectures could be synthesized with acceptable resource costs.

The most speculative but scientifically significant application is unsupervised anomaly detection. An autoencoder trained on ordinary QCD backgrounds learns to reconstruct typical events efficiently. Events that reconstruct poorly, with high reconstruction loss, are candidates for new physics that no existing signal model anticipated. Running an autoencoder at L1 in firmware means that a detector could flag genuinely anomalous events without requiring physicists to pre-specify what they are looking for. The original anomaly detection paper from 2018 showed this was feasible in principle; subsequent work has demonstrated implementations with hls4ml achieving 100 to 300 nanosecond latency depending on autoencoder complexity.
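The selection logic itself is simple, and can be sketched with a deliberately tiny stand-in for the autoencoder: a linear encoder and decoder with one latent value, whose "learned" direction is hard-coded. The direction, the threshold, and the two example events are all hypothetical. A real trigger autoencoder would be a trained, quantized network, and its threshold would be tuned to the allowed output rate.

```python
DIRECTION = (0.6, 0.8)   # hypothetical "learned" unit vector for typical events
THRESHOLD = 1.0          # hypothetical cut on reconstruction loss


def encode(event):
    """Compress an event to a single latent value (projection)."""
    return sum(x * d for x, d in zip(event, DIRECTION))


def decode(latent):
    """Expand the latent value back to event space."""
    return tuple(latent * d for d in DIRECTION)


def reconstruction_loss(event):
    """Mean squared error between an event and its reconstruction."""
    recon = decode(encode(event))
    return sum((x - r) ** 2 for x, r in zip(event, recon)) / len(event)


def is_anomalous(event):
    return reconstruction_loss(event) > THRESHOLD


typical = (3.0, 4.0)     # lies along DIRECTION, so it reconstructs well
unusual = (4.0, -3.0)    # orthogonal to DIRECTION, so it reconstructs badly

for event in (typical, unusual):
    print(event, round(reconstruction_loss(event), 3), is_anomalous(event))
```

Nothing in `is_anomalous` refers to a signal model; the only physics input is what counts as typical, which is the point of the approach.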

For boosted decision trees, the sister library conifer provides an analogous synthesis path: BDTs trained with XGBoost or scikit-learn get compiled to FPGA firmware with similar latency characteristics.

The HL-LHC Pressure

All of this work is happening under a specific deadline. The High-Luminosity LHC, currently scheduled to begin Run 4 operations around 2029, will increase the number of simultaneous proton-proton interactions per bunch crossing by a factor of five to seven compared to current conditions. Every algorithm in the trigger chain will face substantially more combinatorial complexity on the same timescales.

The CMS Phase-2 Level-1 Trigger Technical Design Report, published by CERN, describes a system that assumes ML inference in firmware as a baseline capability rather than a research demonstration. The Serenity ATCA board, developed collaboratively by Bristol and CERN groups, hosts Xilinx VU13P FPGAs and is designed as the hardware platform for these HL-LHC trigger upgrades. Graph neural networks for tracking at L1, currently the most resource-intensive target, are an active area in hls4ml development, as tracking in high-pileup conditions essentially requires reasoning about variable-size point clouds with learned edge features.

The practical bottleneck at this point is less the FPGA compilation and more the training methodology. Models that work well in floating-point can degrade significantly under 4-bit quantization if they were not trained with quantization awareness from the start. The workflow of QKeras plus hls4ml addresses this, but it requires physics analysts to adopt a different training discipline from the one most HEP software frameworks have historically expected.

Why This Matters Beyond Particle Physics

The hls4ml toolchain is not CERN-specific software. It is an open-source library that addresses a general problem: deploying neural network inference in latency-constrained FPGA environments where a CPU or GPU is not an option. The same approach applies to any domain where decisions must be made within tens of nanoseconds by a deterministic hardware pipeline, including high-frequency trading, radar signal processing, real-time control systems, and network packet classification.

The FastML Science group publishes openly and the toolchain has been adopted outside particle physics. The intellectual contribution is the synthesis path itself: the recognition that a neural network is, at its core, a fixed sequence of multiply-accumulate operations over constant weights, and that this maps cleanly onto the LUT and DSP fabric of modern FPGAs once you accept the quantization constraint.

CERN’s trigger problem is extreme in scale, but the engineering answer it has produced is general enough to be useful anywhere latency budgets are measured in nanoseconds rather than milliseconds.
