Thirty Years of Hardware Miracles, One Language Ecosystem: Why HPC Refuses to Move
Source: lobsters
The Chapel team’s retrospective on 30 years of HPC programming opens with an observation that should be striking but has become almost mundane in HPC circles: the hardware has been completely transformed multiple times over, while the software toolchain has barely budged. Fortran, first released in 1957, first standardized in 1966, and modernized through the Fortran 90, 95, 2003, 2008, and 2018 standards, remains a first-class language on every major supercomputer. C++ with MPI, which became standard practice in the 1990s, dominates the rest. Every exascale machine on the TOP500 list runs production workloads written in one of those two languages.
This is not ignorance. HPC practitioners are not unaware that Julia, Chapel, Rust, or Python exist. The persistence of Fortran and C++ is a deliberate outcome of structural forces that are worth understanding precisely because they explain why every well-funded, technically serious attempt to break through has failed.
What Hardware Evolution Actually Looked Like
To appreciate the gap between hardware and software evolution, consider what changed on the hardware side. In June 1996, the top supercomputer on the TOP500 was the Hitachi SR2201 at the University of Tokyo, achieving 220 GFLOPS on LINPACK. Today, El Capitan at Lawrence Livermore sustains more than 1.7 exaFLOPS on the same benchmark, roughly eight million times faster.
The architectural path between those two points involved at least four complete paradigm shifts: the move from vector supercomputers to commodity cluster MPP (massively parallel processing) in the late 1990s, the introduction of multicore CPUs in the mid-2000s, the integration of GPU accelerators starting around 2009 with CUDA, and the current era of heterogeneous compute with AMD’s MI300A APUs on El Capitan blending CPU and GPU memory into a single pool. Each transition required HPC programmers to learn new mental models, new performance characteristics, and new debugging techniques.
And yet, through all of it, the answer was consistently: add a pragma, a library, or a language extension. OpenMP directives for shared-memory threading, MPI calls for distributed memory, CUDA or HIP kernels for GPUs. The programming language itself was treated as load-bearing infrastructure that could not be touched.
The PGAS Experiment
The most serious attempt to rethink HPC programming came through the PGAS paradigm, the Partitioned Global Address Space model. The idea is elegant: give the programmer a single global address space across all distributed nodes, but let the runtime and compiler handle the distinction between local and remote memory. This eliminates the explicit message-passing boilerplate of MPI while preserving the performance model that HPC programmers care about.
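As a rough illustration of that idea, the single-process Python sketch below models a partitioned global array. Everything in it is hypothetical and invented for this example: the class name PartitionedGlobalArray, the here argument that stands for "the place running this code", and the remote_accesses counter. A real PGAS runtime would issue a network transfer where this toy only increments a counter.

```python
class PartitionedGlobalArray:
    """Toy model of a PGAS array: one global index space, storage split by place.

    Hypothetical illustration only; no real PGAS runtime works this simply.
    """

    def __init__(self, size, num_places):
        self.size = size
        self.num_places = num_places
        # Block partitioning: each place owns one contiguous slice (ceiling division).
        self.block = -(-size // num_places)
        self.partitions = [
            [0.0] * max(0, min(size, (p + 1) * self.block) - p * self.block)
            for p in range(num_places)
        ]
        self.remote_accesses = 0  # stands in for network traffic

    def owner(self, i):
        """Return the place that stores global index i."""
        if not 0 <= i < self.size:
            raise IndexError(i)
        return i // self.block

    def _locate(self, i, here):
        # The caller uses a global index; locality is resolved here, not by the caller.
        place = self.owner(i)
        if place != here:
            self.remote_accesses += 1
        return place, i - place * self.block

    def read(self, i, here):
        place, offset = self._locate(i, here)
        return self.partitions[place][offset]

    def write(self, i, value, here):
        place, offset = self._locate(i, here)
        self.partitions[place][offset] = value


array = PartitionedGlobalArray(size=10, num_places=4)

# Each place initializes the elements it owns, so every write is local.
for i in range(array.size):
    array.write(i, float(i), here=array.owner(i))

# Place 0 then reads the whole array through the same global indices.
total = 0.0
for i in range(array.size):
    total += array.read(i, here=0)

print("sum:", total, "remote accesses:", array.remote_accesses)
```

The calling code never mentions which place holds an element, yet the cost of remote access is still visible and countable, which is the balance the PGAS model is trying to strike.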
DARPA’s HPCS (High Productivity Computing Systems) program, running from roughly 2002 to 2012, funded three serious attempts: IBM’s X10, Sun/Oracle’s Fortress, and Cray’s Chapel. All three were technically serious projects with well-resourced teams and genuine language design talent. X10 brought a Java-flavored PGAS model with “places” as the distributed unit of execution. Fortress aimed at a mathematically expressive functional style with implicit parallelism. Chapel attempted to bridge the gap between productivity and performance with explicit parallel constructs built into the language grammar.
X10 was abandoned when IBM de-prioritized HPC research. Fortress was killed after Oracle acquired Sun. Only Chapel survived, and it survives today as an open-source project under HPE, actively maintained but with a user base that remains small by any standard.
Fortran itself added coarrays in the Fortran 2008 standard, effectively adopting the PGAS model inside the very language the PGAS projects had hoped to replace. The syntax is clean: real :: x[*] declares a coarray with one copy of x on every image, and x[2] accesses the copy on image 2. Intel Fortran and GFortran both support coarrays, the latter typically through the OpenCoarrays library for multi-image execution. This matters because it shows the committee understood the problem; they just solved it by extending what already existed rather than migrating to something new.
Why Code Written in 1996 Still Runs in 2026
The durability of HPC codes is not an accident. It is a consequence of deliberate investments in backward compatibility and stable ABIs.
MPI, first standardized by the MPI Forum in 1994, has maintained near-complete source-level compatibility across major versions. Code written against MPI-1 in 1996 still compiles and runs against modern Open MPI or MPICH implementations, usually with no changes beyond replacing a handful of long-deprecated calls that MPI-3.0 removed. That is a thirty-year compatibility record that no newer communication library has matched. The cost of rewriting to use something else is not just the rewrite itself; it is also the risk that the new library does not honor that same thirty-year implicit contract.
Fortran compilers have accumulated forty-plus years of optimization work. The vectorization heuristics in the Intel Fortran Compiler or Cray’s CCE are not things that a new language compiler can replicate in a ten-year research project. HPC workloads are dominated by dense linear algebra, stencil computations, and FFTs, which are exactly the workloads those compilers have been tuned against for decades. When a Chapel or Julia program comes within 10-20% of C on a benchmark, that is genuinely impressive. When Fortran is 2% faster than C on DGEMM, that is decades of compiler engineering expressing itself.
Then there is the ecosystem of tools that speak only C and Fortran ABI natively. HDF5 and NetCDF handle the I/O. TAU, Vampir, and Intel VTune handle profiling. TotalView and DDT handle parallel debugging. BLAS, LAPACK, and ScaLAPACK handle the numerical kernels. Integrating any of these from a new language requires FFI bridges that add complexity and often introduce subtle bugs in exactly the kind of floating-point edge cases where scientific codes are most sensitive.
The Trust Problem
Underlying all of this is something that is rarely stated directly in technical discussions but that anyone who has worked in a national lab understands: HPC scientists do not trust compilers they cannot reason about.
A climate model, a fusion plasma simulation, or a nuclear weapons code produces numbers that are compared against physical reality and against previous runs. Floating-point reproducibility matters. When a new compiler optimization changes the order of operations and produces a result that differs in the seventh significant digit, that difference must be explainable. With Fortran and C++, the mental model connecting source code to machine instructions is well-established. Experienced practitioners know how to write code that will or will not be vectorized, what the compiler will or will not inline, where aliasing analysis will or will not apply.
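A short Python sketch makes the reordering problem concrete. The helper names sequential_sum and chunked_sum are hypothetical, and the chunked version only mimics how a parallel or vectorized reduction might group partial sums; it is not a model of any particular compiler. Explicit loops are used instead of the built-in sum() because recent Python versions apply compensated summation to floats, which would hide the effect.

```python
import math


def sequential_sum(values):
    """Add the values strictly left to right, as a scalar loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, num_chunks):
    """Sum each contiguous chunk separately, then combine the partial sums.

    This mimics the grouping a parallel reduction might use (an assumption
    made for illustration, not a description of a specific runtime).
    """
    if not values:
        return 0.0
    chunk = -(-len(values) // num_chunks)  # ceiling division
    partials = [
        sequential_sum(values[start:start + chunk])
        for start in range(0, len(values), chunk)
    ]
    return sequential_sum(partials)


# Exact mathematical sum is 2.0, but 1.0 is below the rounding granularity at 1e16.
values = [1e16, 1.0, -1e16, 1.0]

print("left to right:", sequential_sum(values))   # 1.0
print("two chunks:   ", chunked_sum(values, 2))   # 0.0
print("exact (fsum): ", math.fsum(values))        # 2.0
```

The same four numbers give three different answers depending only on grouping. Production codes rarely see errors this dramatic, but the mechanism is the one behind a changed seventh digit after a compiler upgrade.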
Chapel’s forall loop introduces implicit parallelism decisions made by the runtime. Julia’s JIT compilation produces code whose exact instruction sequence depends on the type-dispatch graph resolved at runtime. Both languages have legitimate answers to reproducibility concerns, but those answers require trusting a new layer of abstraction that has not been validated by thirty years of production use on safety-critical codes.
This is not irrational conservatism. The FLASH code for astrophysical simulations has been in development since 1997. GROMACS for molecular dynamics has been evolving since 1991. VASP for ab initio quantum mechanics dates to 1993. These are not projects where “we can rewrite in the new language” is a proposal that gets seriously entertained.
Where Julia Actually Succeeded
The one newcomer that has made genuine inroads in scientific computing is Julia, and it is instructive to look at where it succeeded and where it did not.
Julia did not attack the core HPC kernel programming market. It did not try to replace the inner loops of LAMMPS or the MPI communication in WRF. Instead, it targeted the interactive scientific computing workflow: the layer where scientists write analysis scripts, prototype algorithms, and generate figures. This is the space previously occupied by MATLAB, Python, and R, where performance matters but the comparison class is those languages, not C++.
By offering JIT-compiled code that could reach within a small factor of C for compute-bound workloads while providing a rich REPL-driven workflow and a package ecosystem (DifferentialEquations.jl, Flux.jl, Plots.jl) that rivals MATLAB, Julia addressed a real pain point without asking anyone to rewrite their Fortran. The two-language problem — prototype in Python, rewrite hot loops in C — is a genuine cost that Julia reduces.
Chapel’s pitch is harder because it is asking for more: replace not just the scripting layer but the actual compute kernels, the MPI communication, the distributed data structures. That is a much larger ask with a much higher bar for trust.
What Chapel Gets Right That Cannot Be Easily Grafted Onto Existing Languages
The most technically interesting part of Chapel is not its productivity features but its treatment of distributed memory as a first-class language concern rather than a library concern.
In MPI, distributing an array across nodes means manually deciding how to partition it, writing code to communicate boundary halos, and ensuring that every access to a remote element goes through an explicit message or a one-sided RMA operation. The programmer manages all of this. When the distribution strategy changes, most of the communication code must be rewritten.
In Chapel, a distributed array is declared using a domain map:
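use BlockDist;
config const N = 8;  // problem size, settable with --N=<value>
const D = blockDist.createDomain({1..N, 1..N});  // Chapel 2.x spelling; older releases wrote dmapped Block(...)
var A: [D] real;  // each element is stored on the locale that owns its index
forall (i, j) in D do A[i, j] = i + j;  // iterations run where their element lives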
The forall loop that follows does not care which elements are local and which are remote. The compiler and runtime handle locality. Switching from Block to Cyclic or Stencil distribution is a one-line change. This is the kind of abstraction that MPI fundamentally cannot provide because MPI is a library, not a language; it has no access to the compiler’s understanding of the program’s memory access patterns.
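To give a feel for why the choice of distribution matters even when the loop body does not change, here is a minimal Python sketch. It does not reflect how Chapel implements domain maps; block_owner, cyclic_owner, and stencil_remote_reads are hypothetical helpers that assign each index of a 1-D array to one of p locales and count how many neighbour reads in a 3-point stencil would cross a locale boundary.

```python
def block_owner(i, n, p):
    """Owner of index i when n indices are split into p contiguous blocks."""
    block = -(-n // p)  # ceiling division
    return i // block


def cyclic_owner(i, n, p):
    """Owner of index i when indices are dealt out round-robin to p locales."""
    return i % p


def stencil_remote_reads(n, p, owner):
    """Count neighbour reads (i-1 and i+1) whose owner differs from the owner of i.

    The loop is identical for every distribution; only `owner` changes.
    """
    remote = 0
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n and owner(j, n, p) != owner(i, n, p):
                remote += 1
    return remote


n, p = 16, 4
print("block owners: ", [block_owner(i, n, p) for i in range(n)])
print("cyclic owners:", [cyclic_owner(i, n, p) for i in range(n)])
print("remote reads, block: ", stencil_remote_reads(n, p, block_owner))
print("remote reads, cyclic:", stencil_remote_reads(n, p, cyclic_owner))
```

Swapping the distribution is a one-argument change here, yet for this stencil the block layout needs far fewer cross-locale reads than the cyclic one. In hand-written MPI, the same swap would mean rewriting the halo exchange.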
Fortran coarrays get partway there, but they lack domain maps and require more explicit management of distribution. The Arkouda project, originally developed at the U.S. Department of Defense using Chapel for distributed data analytics at petascale, demonstrates that this abstraction is not merely theoretical. It handles trillion-element arrays on thousands of nodes with a Python front-end calling Chapel back-ends, achieving performance that would require significant MPI engineering to replicate.
The Structural Barrier That Remains
The Chapel team’s thirty-year reflection is, at its core, an acknowledgment that technical correctness is necessary but not sufficient for adoption. X10 was technically serious. Fortress had interesting ideas. Chapel is a mature, production-capable language with competitive performance and a genuine conceptual improvement over MPI for many problem classes. None of that has translated into mainstream adoption.
The barrier is not primarily technical. It is the combination of existing investment, tooling ecosystem maturity, community training, and justified conservatism about numerical reliability. Every HPC center has invested decades into their current toolchain. Every HPC programmer has invested years into understanding how to write fast Fortran or C++. Every production code has validation suites tuned to the behavior of existing compilers.
The web ecosystem made its language transitions — from JavaScript to TypeScript, from Ruby/PHP to Node.js, from Python 2 to Python 3 — because the cost of staying put eventually exceeded the cost of migrating, and because the applications were not safety-critical in ways that demanded historical continuity. HPC is structurally different: the applications are safety-critical, the migrations are expensive, and the cost of staying put is hidden precisely because the existing tools continue to work on each new generation of hardware, even if they require increasingly painful low-level programming to extract performance.
That is the thirty-year story. Hardware evolves by necessity. Languages change only when the accumulated pain of not changing becomes undeniable. In HPC, that threshold has not yet been reached, and it is genuinely unclear whether it will be.