The memory in your computer is not ready to use when power arrives. Before any kernel initializes, before the bootloader runs, the memory controller and your DRAM modules go through a negotiation that takes between 200 milliseconds and a full second on a cold boot. If you have noticed that a cold start is slower than a warm restart, you have seen the difference between a full training run and a system that cached its results from last time.
This process is called DDR4 training. The systemverilog.io breakdown of DDR4 initialization and calibration walks through the operational sequence well. What it leaves underexplored is why DDR4 needs so much of this relative to its predecessor, and what changed in the underlying physics to demand it. The answer is in signal integrity, and tracing it through the training phases changes how you think about RAM: as engineered hardware rather than a commodity.
Why 3200 MT/s Changes the Physics
DDR4 runs at up to 3200 megatransfers per second per pin under the official JEDEC spec. At that rate, each unit interval, the time slot for a single bit, is 312.5 picoseconds. Setup and hold times at the DRAM input are roughly 50 ps. That leaves almost no room for signal distortion before the input comparator inside the DRAM chip makes a wrong decision.
DDR3 at 1600 MT/s had a 625 ps unit interval. That is not double the breathing room; every timing budget in the system scales with it simultaneously, so the difference in margin is multiplicative across setup time, propagation delay variation, and voltage noise.
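The arithmetic behind that squeeze is simple enough to sketch. The unit-interval figures below follow directly from the data rates; the combined 100 ps setup-plus-hold cost is an illustrative figure based on the roughly 50 ps numbers quoted above, not a JEDEC budget:

```python
# Unit interval shrinks linearly with data rate, but the margin left
# after fixed costs shrinks much faster.

def unit_interval_ps(mt_per_s: int) -> float:
    """Duration of one bit slot in picoseconds at a given MT/s rate."""
    return 1e6 / mt_per_s  # 1e12 ps/s divided by (mt_per_s * 1e6 transfers/s)

SETUP_HOLD_PS = 100.0  # illustrative: ~50 ps setup + ~50 ps hold at the DRAM input

for rate in (1600, 3200):  # DDR3-1600 vs DDR4-3200
    ui = unit_interval_ps(rate)
    margin = ui - SETUP_HOLD_PS  # what remains for skew, jitter, and noise
    print(f"DDR-{rate}: UI = {ui:.1f} ps, distortion budget = {margin:.1f} ps")
```

Halving the unit interval from 625 ps to 312.5 ps cuts the leftover distortion budget from 525 ps to 212.5 ps, a factor of about 2.5, which is the multiplicative effect the paragraph above describes.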
DDR4 also changed its signaling standard from SSTL (Stub Series Terminated Logic) to POD (Pseudo Open Drain). Under SSTL, the transmitter and far-end termination cooperate to pull signals toward the reference voltage from both sides. Under POD, the signal pulls low to logic 0 through the output driver, while pull-up to VDDQ is handled by on-die termination inside the DRAM or the controller. This asymmetry means the optimal input reference voltage, Vref, is no longer simply VDDQ divided by two. It depends on the specific ODT impedance values, PCB trace characteristics, and temperature. DDR3 could hardcode Vref at 0.75V and operate reliably. DDR4 must measure it.
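A back-of-the-envelope DC analysis shows why the POD midpoint moved. At logic 0 the pull-down driver forms a resistive divider against the ODT to VDDQ; at logic 1 the line sits at VDDQ; the ideal Vref is the midpoint of those two levels. The 34-ohm and 60-ohm values below are typical settings chosen for illustration:

```python
VDDQ = 1.2  # DDR4 nominal I/O rail, volts

def pod_vref_fraction(r_on: float, r_tt: float) -> float:
    """Ideal Vref (as a fraction of VDDQ) for POD signaling.

    Logic 0: the driver (r_on) pulls low against the ODT (r_tt) to VDDQ,
    so V_low = VDDQ * r_on / (r_on + r_tt).
    Logic 1: the driver is off and the ODT holds the line at VDDQ.
    The ideal Vref sits midway between the two levels.
    """
    v_low = VDDQ * r_on / (r_on + r_tt)
    return (v_low + VDDQ) / 2 / VDDQ

# A 34-ohm driver against 60-ohm ODT puts Vref near 68% of VDDQ,
# well above the 50% midpoint that DDR3's SSTL could assume.
print(f"{pod_vref_fraction(34, 60):.3f}")
```

Change either impedance and the optimal Vref moves, which is exactly why the value has to be trained per system rather than hardcoded.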
ZQ Calibration: Establishing a Baseline
Every DDR4 training sequence begins with ZQ calibration. On each module, an external 240-ohm resistor connects to the ZQ pin on every DRAM chip. The DRAM’s internal calibration engine uses this known reference to tune its output driver impedance and on-die termination resistance toward target values, typically RZQ/7 (approximately 34 ohms) for output drive and one of several RTT values (60, 120, or 240 ohms) for termination.
The ZQCL command (ZQ Calibration Long) takes 1,024 clock cycles when issued after power-up, and 512 cycles when re-issued during normal operation. Periodic ZQCS commands (ZQ Calibration Short, 128 cycles) can re-run during operation to compensate for thermal drift. The JEDEC spec JESD79-4B mandates these timings. Skipping ZQ calibration means every subsequent training phase is measuring against an unstable reference, so write leveling and Vref training both depend on it completing first.
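The calibration engine is essentially a feedback loop: compare the internal driver-leg network against the external 240-ohm reference, adjust, repeat. A simplified model of that search (the leg count and per-leg resistance here are invented for illustration; a real engine uses an analog comparator on the ZQ pin divider rather than computed resistances):

```python
R_ZQ = 240.0    # external reference resistor on the ZQ pin, ohms
R_LEG = 8160.0  # hypothetical resistance of one driver leg, ohms

def parallel_impedance(legs_enabled: int) -> float:
    """Impedance of N identical legs switched on in parallel."""
    return R_LEG / legs_enabled

def zq_calibrate(max_legs: int = 127) -> int:
    """Find the leg count whose impedance best matches the external reference."""
    return min(range(1, max_legs + 1),
               key=lambda n: abs(parallel_impedance(n) - R_ZQ))

legs = zq_calibrate()
# Once the unit matching 240 ohms is known, enabling 7x as many legs
# yields the RZQ/7 (~34 ohm) output-drive target.
print(legs, round(parallel_impedance(legs * 7), 1))
```

The point of the external resistor is visible in the model: every derived impedance (RZQ/7 drive, the RTT termination values) is a ratio against one precision component, so process and temperature variation inside the die cancels out.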
Fly-By Topology and Write Leveling
DDR4, like DDR3, uses fly-by PCB routing on multi-chip DIMMs. Rather than fanning out the command, address, control, and clock signals through a balanced tree that reaches every DRAM chip at the same moment, fly-by routes them down a chain, touching each chip sequentially. This reduces PCB stub count and the signal integrity problems that stubs create at high frequencies. The cost is skew: the data strobes (DQS) are routed point-to-point and arrive together, while the clock arrives at each chip at a different time because each chip sits at a different position along the chain.
Write leveling finds this skew. The controller sets MR1[7] to enable write leveling mode, then drives DQS (the data strobe) while watching for each DRAM chip to return a feedback bit on DQ[0] indicating when the DQS rising edge aligns to the clock. The controller sweeps DQS delay in increments of about 1/32 to 1/64 of a clock period and records the transition point for each DRAM chip and each byte lane separately. On a fully populated DDR4 DIMM, DQS-to-CK skew across all chips can reach ±500 ps, which at DDR4-3200 spans more than one full unit interval. Without correcting for this, write data to different chips would land in different clock cycles.
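The per-chip loop reduces to a sweep that watches for the feedback bit to flip. In this sketch the chip skew values are invented and the feedback function simulates what the DRAM returns; a real controller reads the bit off the DQ bus instead of computing it:

```python
CLOCK_PS = 625.0          # DDR4-3200: one clock period (the UI is half of this)
STEP_PS = CLOCK_PS / 32   # sweep granularity, ~1/32 of a clock

def dram_feedback(dqs_delay_ps: float, chip_skew_ps: float) -> int:
    """Model the DRAM sampling CK with the delayed DQS edge: returns 1
    once the DQS edge lands at or past the chip's CK arrival time."""
    return 1 if (dqs_delay_ps % CLOCK_PS) >= chip_skew_ps else 0

def write_level(chip_skew_ps: float) -> float:
    """Sweep DQS delay until the feedback flips 0 -> 1; that delay aligns
    the strobe to the clock as seen at this particular chip."""
    prev = dram_feedback(0.0, chip_skew_ps)
    delay = 0.0
    while delay < CLOCK_PS:
        delay += STEP_PS
        cur = dram_feedback(delay, chip_skew_ps)
        if prev == 0 and cur == 1:
            return delay
        prev = cur
    return 0.0  # no transition found: a training failure in real hardware

# Each chip along the fly-by chain gets its own trained delay.
for skew in (100.0, 250.0, 400.0):
    print(f"chip skew {skew} ps -> trained DQS delay {write_level(skew)} ps")
```

The trained delay overshoots the true skew by up to one step, which is why finer delay-line granularity buys real margin at high data rates.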
Vref Training: What DDR4 Added That DDR3 Did Not Have
MR6 is the mode register that defines DDR4’s most significant training addition. It controls the DRAM’s input reference voltage for the DQ bus, a capability that did not exist in DDR3 at all.
MR6 provides a 6-bit training value (MR6[5:0]), giving 51 Vref settings in each of two ranges selected by MR6[6]. Range 1 covers 60% to 92.5% of VDDQ in 0.65% increments. Range 2 covers 45% to 77.5% of VDDQ in the same step size. The controller enables Vref training mode by setting MR6[7], writes a candidate value, issues a write command with a known data pattern, reads back through the DRAM’s MPR (Multi-Purpose Register) to check correctness, and repeats across the candidate values. The contiguous range of passing values identifies the valid Vref window; the center of that window is programmed as the operating point.
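The sweep-and-center loop can be sketched as follows. The pass/fail oracle is a stand-in for the real write-and-readback comparison, and the eye boundaries here are invented:

```python
def mr6_code_to_pct(code: int, range2: bool = False) -> float:
    """Map an MR6[5:0] code to a Vref percentage of VDDQ.
    Range 1 starts at 60%, Range 2 at 45%; both step by 0.65%."""
    base = 45.0 if range2 else 60.0
    return base + code * 0.65

def train_vref(passes) -> int:
    """Sweep every code, track the widest contiguous passing run,
    and return the code at its center."""
    best_start = best_len = run_start = run_len = 0
    for code in range(51):  # 51 settings per range
        if passes(code):
            if run_len == 0:
                run_start = code
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    return best_start + best_len // 2

# Hypothetical eye: the pattern reads back correctly for codes 12 through 30.
center = train_vref(lambda code: 12 <= code <= 30)
print(center, f"{mr6_code_to_pct(center):.2f}% of VDDQ")
```

Centering rather than taking the first passing value is the important part: the edges of the window drift with temperature, so the operating point needs slack on both sides.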
On DIMMs that use x4-wide DRAM chips, multiple dice share a byte lane and each may need a different Vref. Per-DRAM Addressability (PDA) mode, enabled via MR3, lets the controller broadcast Vref update commands masked to individual dice. A 64-bit DDR4 bus served by sixteen x4 DRAM chips might end up with sixteen distinct Vref settings, each trained independently.
The 0.65% step resolution exists because it matters. At DDR4-3200, a 1% Vref error can push the DRAM’s input comparator to the edge of its valid range. Errors under those conditions tend to appear infrequently and worsen with heat, which makes them difficult to attribute to Vref misconfiguration without knowing the mechanism.
Read and Write DQ Training
With impedance calibrated, write skew corrected, and Vref set, the controller still needs to find the valid sampling window for every data bit on both the read and write paths.
For reads, the DRAM’s MPR0 register contains a fixed alternating pattern (0x55 repeated) that the controller can read back without having written arbitrary data first. The controller sweeps read DQ delay for each bit, checks the returned pattern against the known value, and centers the sampling point within the passing window. Per-bit deskew within each byte lane resolves timing differences as fine as 1/128 of a unit interval, compensating for PCB routing variation among the eight DQ bits that share a DQS strobe.
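Per-bit centering follows the same sweep-and-center idea, run once per DQ line against the delay-line taps. A sketch with the MPR pattern check simulated and the per-bit valid windows invented to mimic routing skew:

```python
UI_PS = 312.5          # DDR4-3200 unit interval
TAP_PS = UI_PS / 128   # per-bit deskew resolution, ~2.4 ps

def train_read_bit(window_start_ps: float, window_end_ps: float) -> int:
    """Sweep the read DQ delay line one tap at a time, checking the MPR
    pattern at each setting, and return the tap at the window center."""
    passing = [tap for tap in range(128)
               if window_start_ps <= tap * TAP_PS <= window_end_ps]
    return passing[len(passing) // 2]  # center of the contiguous passing region

# Eight DQ bits share one DQS strobe, each with its own (invented) valid
# window shifted by a few picoseconds of simulated trace-length mismatch.
windows = [(80 + 5 * bit, 230 + 5 * bit) for bit in range(8)]
taps = [train_read_bit(lo, hi) for lo, hi in windows]
print(taps)
```

Each bit lands on a slightly different tap, which is the whole point of per-bit deskew: a single delay per byte lane would leave some bits sampling off-center.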
For writes, the controller uses a similar sweep in reverse: write a known pattern, read it back through the validated read path, adjust write DQ delays until the data arrives reliably within the write DQS window. The full read and write training sequence, combined with per-bit deskew, is typically the longest phase of the training process.
What the Firmware Actually Runs
The OS sees none of this. By the time any kernel code runs, training has completed, all calibration values are programmed into hardware registers inside the memory controller’s uncore, and the UEFI memory map has handed the bootloader a usable address space. There is no OS API that exposes trained Vref values or write leveling delays. They live in the memory controller configuration space, accessible through PCI configuration space on Intel platforms (IMC registers in the uncore) or through AMD’s UMC (Unified Memory Controller), but not surfaced to user space.
The firmware that executes training is almost universally proprietary. Intel platforms use FSP (Firmware Support Package), a binary blob split into three phases: FSP-T sets up cache-as-RAM storage so the CPU has a stack before DRAM exists, FSP-M runs full DDR4 initialization and training, and FSP-S handles post-memory platform setup. Intel publishes FSP binaries on GitHub without source code. AMD platforms use AGESA, also distributed as binary blobs, though AMD has been incrementally open-sourcing components via openSIL since 2023.
Coreboot can replace proprietary BIOS/UEFI but depends on FSP-M for DDR4 memory initialization on Intel platforms. The coreboot source handles platform orchestration around the blob. The contrast with ARM SoC vendors is significant: Rockchip, Allwinner, and similar vendors ship open-source LPDDR4 training code in U-Boot and Trusted Firmware-A. The LPDDR4 initialization for a Rockchip RK3399 is fully inspectable; the DDR4 training code running on a current Intel desktop platform is not. This gap has real consequences for platform security auditing and long-term firmware maintainability.
XMP Profiles and Training Failure
XMP (Extreme Memory Profile) data for DDR4 (XMP 2.0) is stored in the end-user programmable block of the 512-byte SPD image, bytes 384 through 511, and specifies higher frequencies, tighter timings, and voltages above the JEDEC 1.2V nominal, typically 1.35V for DDR4-3600 and upward toward 1.45V for DDR4-4000 and above. When a BIOS applies an XMP profile, it runs the complete training sequence at those elevated settings.
Training at the margin of stability fails intermittently. Most motherboard firmware handles this by tracking consecutive boot failures and reverting to JEDEC defaults after three attempts. This is the source of the DDR4 boot loop that anyone who has tuned memory timings will recognize: the system posts, fails to complete training, reboots, tries again, fails, reboots, and then falls back to safe defaults. The failure is usually not in the DRAM device itself but in the memory controller’s DLL or in the Vref window collapsing to nothing at the target frequency and voltage.
Successful training parameters are saved to SPI flash as MRC cache. Subsequent cold boots skip re-training and load from cache, which saves 300 to 800 milliseconds depending on the platform and DIMM configuration. Clearing CMOS deletes this cache along with other settings, which is why a CMOS clear can resolve training failures that have become stuck in a bad cached state: the next boot forces a fresh training run that may succeed where the corrupted cached values did not.
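The firmware-side retry and caching behavior reduces to a small state machine. This sketch condenses the flow described above; the class and method names and the three-strike threshold are representative, not from any particular vendor's code:

```python
MAX_FAILURES = 3  # typical consecutive-failure count before fallback

class MemoryInit:
    """Toy model of the boot-time training / MRC-cache / fallback flow."""

    def __init__(self):
        self.mrc_cache = None  # stands in for the SPI-flash cache region
        self.fail_count = 0

    def boot(self, train_ok_at_xmp: bool) -> str:
        """One boot attempt: restore cached parameters if present, otherwise
        train; revert to JEDEC defaults after repeated failures."""
        if self.mrc_cache is not None:
            return f"fast boot from cache ({self.mrc_cache})"
        if self.fail_count >= MAX_FAILURES:
            self.mrc_cache = "jedec-defaults"
            return "fallback: trained at JEDEC defaults"
        if train_ok_at_xmp:
            self.mrc_cache = "xmp-parameters"
            return "trained at XMP, cached"
        self.fail_count += 1
        return "training failed, rebooting"  # the visible boot loop

    def clear_cmos(self):
        """Clearing CMOS wipes the cache and counter, forcing fresh training."""
        self.mrc_cache = None
        self.fail_count = 0
```

Run with an unstable XMP profile, the model reproduces the user-visible sequence: three failed boots, then a fallback boot at JEDEC defaults, then fast boots from cache until the next CMOS clear.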
The Direction of Travel
DDR5 adds further training phases: receiver equalization (decision feedback equalization in the DRAM’s DQ receivers), a wider VrefDQ adjustment range, and initialization coordinated through the new SPD hub chip (SPD5118), while on-DIMM voltage regulation moves to a dedicated PMIC on the module. Each generation removes one more thing that could previously be assumed fixed and replaces it with something that must be measured. The training time grows accordingly.
The trajectory is predictable in retrospect. Every new DDR generation has operated at a lower voltage and higher data rate than its predecessor. Lower voltage shrinks noise margins; higher data rate shrinks timing margins. The training protocol is the interface between those tightening constraints and functional hardware. As long as DRAM speeds keep increasing faster than PCB manufacturing precision and signaling noise floors improve, memory initialization will keep getting more elaborate. The firmware doing that initialization will remain a binary blob on most platforms, which is the part of this story that has no clean resolution yet.