Neural Networks at 25 Nanoseconds: The Engineering Discipline Behind CERN's FPGA AI
Source: hackernews
The Large Hadron Collider produces proton-proton collisions at 40 MHz. Every 25 nanoseconds, a new bunch crossing occurs, generating roughly one megabyte of raw detector data per event. Before any permanent storage or offline analysis can happen, that stream must be reduced by a factor of roughly 40,000. The system responsible for the first cut is called the Level-1 Trigger, and its latency budget is 12.5 microseconds. No software running on a general-purpose CPU can meet that budget at that input rate. In practice, the only programmable hardware that can is an FPGA.
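The figures in this paragraph can be turned into a quick back-of-envelope calculation. The sketch below uses only the numbers quoted above and assumes 1 MB means 10**6 bytes; the variable names are illustrative.

```python
# Back-of-envelope numbers derived from the figures quoted in the text.
BUNCH_CROSSING_RATE_HZ = 40e6   # 40 MHz
EVENT_SIZE_BYTES = 1e6          # ~1 MB per event (assumption: 1 MB = 10**6 bytes)
L1_LATENCY_BUDGET_S = 12.5e-6   # Level-1 Trigger latency budget

# Time between consecutive bunch crossings.
bunch_spacing_s = 1.0 / BUNCH_CROSSING_RATE_HZ

# Raw data rate if every crossing were read out in full.
raw_rate_bytes_per_s = BUNCH_CROSSING_RATE_HZ * EVENT_SIZE_BYTES

# Number of crossings that arrive while one decision is still being made;
# the detector has to buffer this many events in its front-end pipelines.
crossings_in_flight = L1_LATENCY_BUDGET_S / bunch_spacing_s

print(f"bunch spacing:       {bunch_spacing_s * 1e9:.1f} ns")
print(f"raw data rate:       {raw_rate_bytes_per_s / 1e12:.1f} TB/s")
print(f"crossings in flight: {crossings_in_flight:.0f}")
```

The last number is the reason the latency budget is fixed: every extra microsecond of decision time means more events held in on-detector buffers.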
CERN’s use of AI models burned directly into FPGA fabric for this filtering task is not a novelty project. It is the logical endpoint of a hard constraint chain that starts in physics and ends in lookup tables.
What the Trigger System Actually Has to Do
The LHC’s two main general-purpose detectors, CMS and ATLAS, each have a two-stage trigger architecture. The Level-1 Trigger (L1T) runs entirely in custom electronics, FPGAs included, and must reduce the 40 MHz collision rate to around 100 kHz within that 12.5 microsecond window. The High Level Trigger (HLT) then runs on a commodity CPU farm and reduces the surviving events further to roughly 1-2 kHz for permanent storage.
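The division of labor between the two stages is easier to see as rejection factors. A minimal sketch using the rates quoted above (the function name is illustrative):

```python
def rejection_factor(rate_in_hz, rate_out_hz):
    """How many input events there are for each event that survives a stage."""
    return rate_in_hz / rate_out_hz

# Level-1 Trigger: 40 MHz in, about 100 kHz out.
l1_factor = rejection_factor(40e6, 100e3)

# High Level Trigger: about 100 kHz in, roughly 1-2 kHz out.
hlt_factor_min = rejection_factor(100e3, 2e3)
hlt_factor_max = rejection_factor(100e3, 1e3)

print(f"L1T keeps about 1 in {l1_factor:.0f} crossings")
print(f"HLT keeps about 1 in {hlt_factor_min:.0f} to 1 in {hlt_factor_max:.0f} of those")
print(f"overall: 1 in {l1_factor * hlt_factor_min:.0f} to 1 in {l1_factor * hlt_factor_max:.0f}")
```

Most of the rejection happens in the stage with the least time per decision, which is what makes the L1T the harder engineering problem.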
The L1T is not making fine-grained physics decisions. It is looking for coarse signatures: high transverse momentum jets, missing energy consistent with a neutrino, muons above a threshold. Historically this was done with hardcoded threshold logic and lookup tables. The question CERN’s machine learning group has been pursuing for years is whether learned functions can do the same job better, specifically whether a neural network can identify interesting topologies that fixed thresholds would miss.
The answer is yes, but the hardware imposes conditions that bear almost no resemblance to typical deep learning workflows.
hls4ml: Compiling Keras to FPGA Fabric
The tool that makes this tractable is hls4ml, a library developed jointly by CERN, Fermilab, MIT, and collaborators starting around 2018. The name is short for High-Level Synthesis for Machine Learning. The basic idea is straightforward: you train a small neural network in Keras or PyTorch, hand it to hls4ml, and it generates synthesizable HLS C++ that the Xilinx Vivado or Intel Quartus toolchains can compile down to an FPGA bitstream.
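The conversion step might look roughly like the following. This is a sketch rather than a recipe from the CERN teams: the precision string, output directory, and FPGA part name are assumptions chosen for illustration, and details of the configuration dictionary can differ between hls4ml versions.

```python
def build_hls_config(precision="ap_fixed<6,2>", reuse_factor=1):
    """Model-level hls4ml configuration as a plain dictionary (illustrative values)."""
    return {
        "Model": {
            "Precision": precision,       # fixed-point type applied to the whole model
            "ReuseFactor": reuse_factor,  # 1 = every multiplication gets its own multiplier
            "Strategy": "Latency",
        }
    }


def convert_to_hls(keras_model, output_dir="hls4ml_prj", part="xcvu9p-flga2104-2L-e"):
    """Convert a trained Keras model into an hls4ml project.

    Requires hls4ml and a Keras/TensorFlow install; output_dir and part are
    placeholder choices, not values taken from a CMS trigger board.
    """
    import hls4ml  # imported lazily so build_hls_config() works without it

    hls_model = hls4ml.converters.convert_from_keras_model(
        keras_model,
        hls_config=build_hls_config(),
        output_dir=output_dir,
        part=part,
    )
    # compile() builds a C++ emulation of the fixed-point model;
    # hls_model.build() would go on to run HLS synthesis.
    hls_model.compile()
    return hls_model
```

Calling `convert_to_hls(model)` on a small trained Keras model returns an object whose `predict` method runs the fixed-point emulation, which is useful for checking accuracy before committing to a synthesis run.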
What makes hls4ml interesting is the set of transformations it applies along the way. The generated HLS is not a loop over layers. Every layer is unrolled into a pipeline, with operations on individual neurons mapped to DSP slices, LUTs, and block RAMs. When the design is fully unrolled, the pipeline accepts a new input every clock cycle, which is an initiation interval of 1, at typical clock frequencies between 200 and 360 MHz. A single network instance can therefore sustain 200-360 million inferences per second, comfortably above the 40 MHz bunch-crossing rate.
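The relationship between clock frequency, initiation interval, and throughput is simple enough to write down. A minimal sketch, with an illustrative function name:

```python
def inferences_per_second(clock_hz, initiation_interval=1):
    """Sustained throughput of a pipeline that accepts one input every
    `initiation_interval` clock cycles."""
    return clock_hz / initiation_interval

BUNCH_CROSSING_RATE_HZ = 40e6

for clock_hz in (200e6, 360e6):
    rate = inferences_per_second(clock_hz)
    headroom = rate / BUNCH_CROSSING_RATE_HZ
    print(f"{clock_hz / 1e6:.0f} MHz clock: {rate / 1e6:.0f} M inferences/s, "
          f"{headroom:.0f}x the bunch-crossing rate")

# Sharing multipliers across several cycles raises the initiation interval
# and lowers throughput in proportion.
shared = inferences_per_second(200e6, initiation_interval=4)
```

Throughput and latency are separate quantities here: the initiation interval sets how often a new input can enter, while the pipeline depth sets how long each input takes to come out.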
A small multilayer perceptron with three hidden layers of 64 units each, operating at 200 MHz with 6-bit integer quantization, can complete one inference in under 100 nanoseconds. That fits comfortably inside the L1T latency budget even accounting for data transport delays across the trigger backplane.
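To see how little of the budget such a network uses, the latency can be expressed in clock cycles and as a fraction of the 12.5 microsecond window. The sketch below takes the 100 ns figure as an upper bound; the helper name is illustrative.

```python
def latency_in_cycles(latency_ns, clock_mhz):
    """Convert a latency in nanoseconds to clock cycles at the given frequency."""
    return latency_ns * clock_mhz / 1000.0

INFERENCE_LATENCY_NS = 100.0   # upper bound quoted in the text
CLOCK_MHZ = 200.0
L1_BUDGET_NS = 12_500.0        # 12.5 microseconds

cycles = latency_in_cycles(INFERENCE_LATENCY_NS, CLOCK_MHZ)
budget_fraction = INFERENCE_LATENCY_NS / L1_BUDGET_NS

print(f"pipeline depth: at most {cycles:.0f} cycles")
print(f"share of L1T latency budget: {budget_fraction:.1%}")
```

The rest of the budget is not free: it is shared with data transport, object reconstruction, and the final trigger decision logic.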
The Quantization Constraint Is Not Optional
FPGA resources are finite and shared across the entire trigger board. A typical Xilinx UltraScale+ device used in CMS has around 1.7 million LUTs and 6,000 DSP slices. Those resources also have to serve the rest of the trigger logic, the data routing, the serializers, and the monitoring infrastructure. A neural network cannot consume most of the chip.
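A rough count shows how quickly a fully unrolled network eats into that budget. The sketch below assumes a three-hidden-layer network of 64 units each with 16 inputs and 5 outputs (the input and output sizes are assumptions for illustration) and, pessimistically, one DSP slice per multiplication.

```python
def dense_multiplies(layer_sizes):
    """Number of weight multiplications in a fully connected network."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Assumed shape: 16 inputs, three hidden layers of 64 units, 5 outputs.
LAYER_SIZES = [16, 64, 64, 64, 5]
DSP_BUDGET = 6000  # approximate DSP count quoted in the text

multiplies = dense_multiplies(LAYER_SIZES)
print(f"multiplications per inference: {multiplies}")
print(f"fits in the DSP budget if fully unrolled: {multiplies <= DSP_BUDGET}")
```

Under these assumptions the dense network alone would need more multipliers than the whole device provides, before any other trigger logic is placed. Lower precision, pruning, and multiplier reuse are the levers that bring it back within reach.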
This means quantization is not a performance optimization. It is a feasibility requirement. hls4ml supports arbitrary fixed-point widths, and networks deployed at CERN typically use 6-bit or 8-bit integer arithmetic for weights and activations, sometimes lower. At 6-bit precision, a multiply-accumulate operation fits entirely within a single DSP48 primitive without any LUT overflow. At 32-bit float, the same operation requires multiple DSPs and significant routing overhead.
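What a fixed-point format does to a value can be sketched in a few lines. The example below models a signed format with 6 total bits, 2 of them integer bits including the sign, which is one plausible reading of "6-bit" and an assumption here. It uses round-to-nearest and saturation for clarity; the `ap_fixed` type in Xilinx HLS defaults to truncation and wraparound unless told otherwise.

```python
def quantize_fixed(x, total_bits=6, int_bits=2):
    """Map x onto a signed fixed-point grid, rounding to nearest and saturating."""
    frac_bits = total_bits - int_bits
    scale = 1 << frac_bits                    # number of steps per unit
    q_min = -(1 << (total_bits - 1))          # most negative integer code
    q_max = (1 << (total_bits - 1)) - 1       # most positive integer code
    code = round(x * scale)
    code = max(q_min, min(q_max, code))       # saturate instead of wrapping
    return code / scale

for value in (0.3, -0.7, 1.0, 5.0, -3.0):
    print(f"{value:+.4f} -> {quantize_fixed(value):+.4f}")
```

With these settings the grid has a step of 1/16 and covers the range from -2.0 to 1.9375, so weights have to be scaled or trained to live inside that range.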
Quantization-aware training is used to recover accuracy lost from low-precision arithmetic. The model trains with simulated quantization noise so that by the time weights are committed to fixed-point, the learned representations have already adapted. In practice, a well-trained 6-bit network recovers to within a percent or two of its 32-bit baseline on physics tasks, which is acceptable given that the alternative is a fixed threshold with no adaptability at all.
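The mechanism can be illustrated with a toy example: a single weight trained with fake quantization in the forward pass and a straight-through estimator in the backward pass. This is a didactic sketch in plain Python, not the training setup used at CERN; the grid step, learning rate, and data are all made up for illustration.

```python
def fake_quantize(w, step=0.25):
    """Snap a weight to the nearest multiple of `step`."""
    return round(w / step) * step

def train_quantized_weight(target_slope=0.37, steps=200, lr=0.05):
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    xs = [-1.0, -0.5, 0.5, 1.0]
    ys = [target_slope * x for x in xs]
    w = 0.0  # latent full-precision weight kept by the optimizer
    for _ in range(steps):
        wq = fake_quantize(w)  # forward pass uses the quantized value
        # Gradient of the mean squared error with respect to wq.
        grad = sum(2.0 * (wq * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Straight-through estimator: treat d(wq)/d(w) as 1 and update w.
        w -= lr * grad
    return w, fake_quantize(w)

latent, deployed = train_quantized_weight()
print(f"latent weight: {latent:.3f}, deployed weight: {deployed}")
```

Because the target slope of 0.37 is not on the grid, the latent weight settles near the boundary between the two neighboring grid values, 0.25 and 0.5, and the deployed weight is one of them. In a real network, the other weights adapt during training to compensate for this kind of rounding.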
Pruning goes further. Sparse weight matrices translate directly to fewer multiply-accumulate operations, which means fewer DSP slices consumed. The hls4ml workflow supports structured pruning, where entire neurons are zeroed out so their wiring can be elided in synthesis. A network pruned to 50% sparsity can fit in roughly half the DSP budget of the equivalent dense model.
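A minimal sketch of structured pruning on a single layer follows. It ranks neurons by the L2 norm of their incoming weights and drops the weakest ones; the ranking criterion and the toy weights are assumptions for illustration, and real workflows typically prune gradually during training.

```python
import math

def prune_neurons(weights, keep_fraction=0.5):
    """Keep the neurons with the largest incoming-weight L2 norm.

    `weights` holds one row of incoming weights per neuron. Returns the
    surviving rows and their original indices.
    """
    n_keep = max(1, round(len(weights) * keep_fraction))
    norms = [math.sqrt(sum(w * w for w in row)) for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [weights[i] for i in kept], kept

def mac_count(weights):
    """Multiply-accumulate operations needed to evaluate the layer."""
    return sum(len(row) for row in weights)

# Toy layer: 4 neurons, 3 inputs each.
layer = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [0.5, 0.6, -0.4],
    [0.0, 0.03, 0.0],
]
pruned, kept = prune_neurons(layer, keep_fraction=0.5)
print(f"kept neurons {kept}: {mac_count(layer)} MACs -> {mac_count(pruned)} MACs")
```

Removing a neuron also removes the corresponding input column of the next layer, so the savings compound across layers.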
What the Models Are Actually Classifying
The physics tasks mapped to FPGA AI at CMS include jet classification, anomaly detection for new physics signatures, and track-based vertex reconstruction. Each has a different profile in terms of what can be learned versus what the fixed trigger logic handles well.
Jet classification is the most mature application. A jet is a collimated spray of hadrons produced when a quark or gluon is knocked out of a proton. Whether a jet originated from a bottom quark (b-tagging) versus a light quark changes the physics interpretation substantially. Classical b-tagging relies on displacement of the secondary vertex, computable in hardware. A neural network can exploit softer features: the distribution of constituent particle momenta, angular correlations, and substructure variables. Deployed at L1T, even a shallow network with a few hundred parameters improves b-jet efficiency at fixed fake rate compared to the purely threshold-based approach.
Anomaly detection is more speculative but scientifically important. An autoencoder trained on Standard Model collision signatures will assign high reconstruction error to events that do not look like anything it was trained on. If you can run that autoencoder inside the Level-1 Trigger and flag high-error events for preservation, you create a model-agnostic path to recording potential new physics without knowing in advance what to look for. The CMS CICADA project (Calorimeter Image Convolutional Anomaly Detection Algorithm) does exactly this, fitting a convolutional autoencoder into the L1T FPGA fabric and using its output as an additional trigger path.
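The trigger logic around an autoencoder reduces to a score and a threshold. In the sketch below, `toy_reconstruct` is a hypothetical stand-in for a trained autoencoder (it can only reproduce flat inputs), and the threshold and input vectors are made-up values for illustration.

```python
def reconstruction_error(x, reconstruct):
    """Mean squared error between an input and its reconstruction."""
    x_hat = reconstruct(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def toy_reconstruct(x):
    """Hypothetical stand-in for a trained autoencoder: it has only learned
    to reproduce inputs whose entries are all close to their mean."""
    mean = sum(x) / len(x)
    return [mean] * len(x)

def anomaly_trigger(x, threshold, reconstruct=toy_reconstruct):
    """Accept the event when the model fails to reconstruct it well."""
    return reconstruction_error(x, reconstruct) > threshold

typical_event = [1.0, 1.1, 0.9, 1.0]   # resembles the training data
unusual_event = [0.0, 4.0, 0.0, 0.0]   # one isolated large deposit

THRESHOLD = 0.5  # illustrative; in practice tuned to a target trigger rate
print(anomaly_trigger(typical_event, THRESHOLD))
print(anomaly_trigger(unusual_event, THRESHOLD))
```

In a real trigger path the threshold is set by the rate the path is allowed to consume, not by a physics model of the signal.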
Graph neural networks are a more recent development. Particle physics events are naturally represented as graphs: hits in a tracker form nodes, and spatial proximity or curvature compatibility defines edges. GNNs on FPGAs require more aggressive approximations because dynamic graph construction is expensive in hardware. The practical approach is to fix the graph topology at synthesis time, computing edge features over a predetermined neighborhood structure, which makes the computation fully pipelined but limits adaptability to different event topologies.
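Fixing the topology means the neighbor lists are constants rather than data. The sketch below shows one message-passing step over a hard-coded neighborhood; the four-node chain and the sum aggregation are assumptions chosen to keep the example small.

```python
# Hypothetical fixed topology: four tracker hits in a chain.
# In firmware this table would be a compile-time constant.
NEIGHBORS = {
    0: (1,),
    1: (0, 2),
    2: (1, 3),
    3: (2,),
}

def message_pass(features):
    """One message-passing step: each node adds the sum of its neighbors'
    features to its own. Every loop bound is known in advance, so the
    whole step can be unrolled into a fixed pipeline."""
    updated = []
    for node, neighbors in NEIGHBORS.items():
        aggregated = sum(features[n] for n in neighbors)
        updated.append(features[node] + aggregated)
    return updated

print(message_pass([1.0, 2.0, 3.0, 4.0]))
```

The cost of this choice is visible in the code: an event whose hits do not match the assumed neighborhood structure still gets processed as if they did.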
The Toolchain Gap Between Training and Deployment
One of the underappreciated engineering challenges here is the verification gap. You train a model in TensorFlow with 32-bit floats, convert it to 6-bit fixed-point HLS, synthesize it, implement it, and program an FPGA. At each step there is a potential numerical divergence. hls4ml provides a co-simulation flow where the HLS C++ model and the original floating-point model run on the same inputs and their outputs are compared, but this is not the same as testing against real LHC data on real hardware.
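The comparison at the heart of that check can be sketched without any FPGA tooling: run a floating-point layer and a quantized copy on the same inputs and record the worst disagreement. The layer, inputs, and 4 fractional bits below are illustrative assumptions, and the quantizer is a simple round-to-nearest model rather than a bit-accurate emulation of any HLS type.

```python
def dense(x, weights, bias):
    """Floating-point fully connected layer: one row of weights per output."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def quantize(value, frac_bits=4):
    """Round to the nearest multiple of 2**-frac_bits (no saturation)."""
    scale = 2 ** frac_bits
    return round(value * scale) / scale

def max_abs_divergence(inputs, weights, bias):
    """Largest output difference between the float layer and its quantized copy."""
    wq = [[quantize(w) for w in row] for row in weights]
    bq = [quantize(b) for b in bias]
    worst = 0.0
    for x in inputs:
        reference = dense(x, weights, bias)
        xq = [quantize(xi) for xi in x]
        candidate = [quantize(v) for v in dense(xq, wq, bq)]
        worst = max(worst, max(abs(r - c) for r, c in zip(reference, candidate)))
    return worst

WEIGHTS = [[0.3, -0.7], [0.55, 0.1]]
BIAS = [0.05, -0.2]
INPUTS = [[1.0, 0.5], [-0.25, 0.75]]

divergence = max_abs_divergence(INPUTS, WEIGHTS, BIAS)
print(f"worst-case divergence: {divergence:.4f}")
```

A check like this only bounds the numerical error on the inputs it was given, which is why it complements hardware testing rather than replacing it.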
The CMS collaboration maintains a hardware testbed where proposed trigger algorithms run on actual FPGA boards receiving simulated collision data through the standard trigger backplane. Only algorithms that pass this hardware validation make it into the online trigger table. The latency measurement at this stage is the definitive one: if the algorithm adds more than its allocated budget to the L1T pipeline, it is not deployable regardless of its physics performance.
This creates a workflow discipline that differs substantially from typical ML deployment. The model is not deployed to a server where you can hotfix it. It is synthesized into an FPGA bitstream, and changing it requires a re-synthesis run that takes hours, followed by hardware validation. Teams at CMS therefore maintain libraries of pre-synthesized network variants at different operating points, selecting among them at the start of each data-taking period based on the planned luminosity and physics priorities.
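Selecting from such a library amounts to a constrained lookup. The sketch below is a guess at the shape of that decision, not a description of CMS tooling: the `Variant` fields, the variant names, and every number in the library are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    name: str
    latency_ns: float          # measured on hardware
    dsp_used: int              # DSP slices consumed after synthesis
    signal_efficiency: float   # physics performance at a fixed fake rate

def select_variant(variants, latency_budget_ns, dsp_budget):
    """Pick the best-performing variant that fits both budgets, or None."""
    feasible = [
        v for v in variants
        if v.latency_ns <= latency_budget_ns and v.dsp_used <= dsp_budget
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda v: v.signal_efficiency)

# Hypothetical library; all values are placeholders for illustration.
LIBRARY = [
    Variant("dense_8bit", latency_ns=120.0, dsp_used=900, signal_efficiency=0.82),
    Variant("dense_6bit", latency_ns=95.0, dsp_used=600, signal_efficiency=0.80),
    Variant("pruned_6bit", latency_ns=80.0, dsp_used=300, signal_efficiency=0.77),
]

choice = select_variant(LIBRARY, latency_budget_ns=100.0, dsp_budget=700)
print(choice.name if choice else "no variant fits")
```

Latency and resource limits act as hard filters here, and physics performance is only compared among the variants that pass them, which mirrors the rule that an algorithm over budget is not deployable regardless of how well it performs.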
Why This Matters Beyond Particle Physics
The extreme constraint regime at CERN, where latency budgets are measured in nanoseconds and model capacity is measured in kilobytes, is increasingly relevant in other domains. High-frequency trading systems face similar real-time inference requirements. Autonomous radar processing in aviation and defense has comparable latency demands. Network intrusion detection at line rate on programmable NICs is structurally the same problem.
hls4ml has already seen adoption outside physics through the FastML collaboration, which is actively porting the toolchain to domains including medical imaging and communications. The core insight, that quantized neural networks expressed as fully unrolled HLS pipelines can match the latency of hand-coded logic while adapting to learned data distributions, transfers cleanly.
The HL-LHC upgrade, currently scheduled for around 2029, will increase the number of collisions by a factor of five to seven. The bunch-crossing rate at the L1T input stays at 40 MHz; the increase comes from many more simultaneous proton-proton collisions in each crossing, with a corresponding increase in the data volume per event. The models running in the trigger today will not survive that transition without further compression, and the hardware will need to be substantially more capable. That pressure is already shaping the next generation of firmware development, with teams exploring new FPGA families, model architectures specifically designed for high-ratio compression, and hybrid ASIC-FPGA configurations where the most stable algorithmic components are committed to custom silicon.
What CERN has built over the past several years is not just a clever deployment of a trendy technology. It is a complete engineering discipline: a toolchain for converting trained models to synthesizable hardware, a quantization methodology calibrated to FPGA DSP primitives, a validation framework for hardware-in-the-loop testing, and an operational workflow for managing bitstream libraries across a live physics experiment. That discipline is the transferable artifact, and the broader engineering community is only beginning to absorb it.