What C++ Is Finally Doing About CUDA Builds That Other Ecosystems Did Years Ago
Source: isocpp
The ISO C++ Developer Survey has named dependency management the top frustration in C++ development for more years than most developers care to count. A talk at using std::cpp 2026 targets a specific version of that frustration: building C++ AI and ML code that depends on CUDA, across Linux and Windows, with reproducibility that does not fall apart the moment a developer’s machine has a different driver version than the CI runner.
The proposed solution is Conan 2.x profiles combined with modern CMake’s native CUDA language support, encoding the CUDA compatibility matrix in version-controlled build files rather than shell scripts and README sections. The goal is one source checkout, one command, identical builds on every platform.
That goal is achievable. But before exploring how Conan and CMake get you there, it is worth looking at how other language ecosystems already solved this problem. They got there first, and the C++ approach shares more DNA with those solutions than most C++ developers realize.
The Matrix That Needs to Be Expressed
CUDA compatibility is three-dimensional. The first dimension is the toolkit version. Binaries compiled against CUDA 12.4 will not initialize on a host running driver 525.x. The CUDA Toolkit release notes document these minimum driver floors precisely: CUDA 12.4 requires driver 550.54.14 on Linux, and CUDA 12.6 raises that to 560.28.03. These are hard limits enforced when CUDA initializes (cuInit() in the driver API; the runtime API initializes implicitly on its first call), and the resulting error messages do not always say “wrong driver”. They can say things like CUDA_ERROR_NO_DEVICE, which wastes hours before anyone checks the driver version.
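A driver floor of this kind is easy to check mechanically. The following is a minimal sketch, assuming the two Linux floors quoted above; the table name and both helper functions are hypothetical and not part of any CUDA tooling.

```python
# Driver floors copied from the figures quoted in the text (Linux only).
# MIN_LINUX_DRIVER, parse_version and driver_satisfies are illustrative names.
MIN_LINUX_DRIVER = {
    "12.4": "550.54.14",
    "12.6": "560.28.03",
}


def parse_version(text):
    """Turn '550.54.14' into (550, 54, 14) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_satisfies(toolkit, installed_driver):
    """Return True when the installed driver meets the floor for the toolkit."""
    floor = MIN_LINUX_DRIVER[toolkit]
    return parse_version(installed_driver) >= parse_version(floor)


if __name__ == "__main__":
    # Report each candidate driver against the CUDA 12.4 floor.
    for driver in ("525.60.13", "550.54.14", "560.28.03"):
        print(driver, driver_satisfies("12.4", driver))
```

Comparing tuples of integers rather than raw strings matters here: a string comparison would order "550.9" after "550.54".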
The second dimension is compute capability. Code compiled for sm_80 (Ampere A100) does not execute on sm_75 (Turing T4). An H100 is sm_90, with an sm_90a variant that enables tensor core instructions unavailable on every prior architecture. Developer workstations (typically sm_86), CI runners (often sm_80), and production clusters (H100 or newer) may all require different SM targets, and covering them with a single binary means either a fat binary with all architectures or JIT recompilation from PTX at first launch.
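The fat-binary option can be sketched as a small flag generator. This is an illustration only: the helper name is hypothetical, and the choice to embed PTX for the newest target alone is an assumption about one common policy, not a rule from the talk. The -gencode arch=...,code=... form itself is standard nvcc syntax.

```python
def gencode_flags(sm_targets, embed_ptx=True):
    """Build nvcc arguments: one SASS image per SM target, plus optional PTX.

    sm_targets is a list of integers such as [80, 86, 90].
    """
    targets = sorted(set(sm_targets))
    flags = []
    for sm in targets:
        # Native machine code (SASS) for this exact architecture.
        flags += ["-gencode", f"arch=compute_{sm},code=sm_{sm}"]
    if embed_ptx and targets:
        newest = targets[-1]
        # PTX for the newest target, which the driver can JIT-compile
        # on architectures released after the build.
        flags += ["-gencode", f"arch=compute_{newest},code=compute_{newest}"]
    return flags


if __name__ == "__main__":
    print(" ".join(gencode_flags([80, 86, 90])))
```

Each extra SM target adds another compiled image to the binary, which is where the size cost of broad coverage comes from.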
The third dimension is host compiler compatibility. CUDA 12.4 supports GCC up to 13, MSVC 2019 and 2022, and Clang up to 17, but those windows shift with every toolkit release. On Windows, nvcc requires MSVC as its host compiler specifically; Clang cannot serve that role without significant patching. The host compiler choice is not a free variable once the CUDA version is fixed.
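The host-compiler window can be expressed as data in the same way. The sketch below assumes only the CUDA 12.4 limits stated above; the table layout and function name are hypothetical, and a real table would need one entry per toolkit release.

```python
# Host compiler limits for CUDA 12.4 as stated in the text.
# HOST_COMPILER_LIMITS and host_compiler_ok are illustrative names.
HOST_COMPILER_LIMITS = {
    "12.4": {"gcc": 13, "clang": 17, "msvc": {2019, 2022}},
}


def host_compiler_ok(cuda, os_name, compiler, version):
    """Return True when the compiler is inside the window for this toolkit."""
    limits = HOST_COMPILER_LIMITS[cuda]
    if os_name == "Windows":
        # On Windows nvcc needs MSVC as its host compiler.
        return compiler == "msvc" and version in limits["msvc"]
    if compiler in ("gcc", "clang"):
        # Major versions up to and including the limit are accepted.
        return version <= limits[compiler]
    return False


if __name__ == "__main__":
    print(host_compiler_ok("12.4", "Linux", "gcc", 14))
    print(host_compiler_ok("12.4", "Windows", "msvc", 2022))
```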
Most C++ teams encode none of this in their build system. It lives in Confluence pages, CI environment variables, and the institutional memory of whoever originally set up the GPU servers.
How conda Encodes It
Conda treats the CUDA toolkit as an installable package, and that single design decision solves most of the coordination problem. The pytorch channel publishes separate packages per CUDA version:
conda install pytorch -c pytorch cuda-version=12.4
The cuda-version metapackage carries the toolkit version as a dependency specification. Conda’s solver enforces it: if you request a package that requires CUDA 12.4 and your environment has CUDA 11.8, the solver refuses rather than silently mismatching. conda-lock generates platform-specific lockfiles that pin every resolved version, making the entire environment reproducible from a single committed file.
What conda gives you that most C++ build systems do not: the compatibility constraint lives in package metadata, not in the consumer’s build script. The producer of pytorch-cuda-12.4 declares the requirements; every downstream consumer inherits them through normal dependency resolution. No README, no tribal knowledge.
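The idea of producer-declared constraints can be shown with a toy model. This is not conda's solver: it does exact-match checks only, and the class, exception, and function names are hypothetical.

```python
from dataclasses import dataclass, field


class UnsatisfiableError(Exception):
    """Raised when an environment cannot meet a package's declared needs."""


@dataclass
class Package:
    name: str
    # Constraints declared by the package producer, e.g. {"cuda-version": "12.4"}.
    requires: dict = field(default_factory=dict)


def check_install(package, environment):
    """Refuse the install unless every declared constraint matches exactly."""
    for dep, wanted in package.requires.items():
        found = environment.get(dep)
        if found != wanted:
            raise UnsatisfiableError(
                f"{package.name} requires {dep}={wanted}, "
                f"environment has {dep}={found}"
            )


if __name__ == "__main__":
    pytorch = Package("pytorch", {"cuda-version": "12.4"})
    try:
        check_install(pytorch, {"cuda-version": "11.8"})
    except UnsatisfiableError as err:
        print("refused:", err)
```

The consumer never states the CUDA requirement; it is carried by the package object and enforced before anything is installed.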
How pip Wheels Handle It
Python’s wheel format encodes CUDA version in the artifact name itself via an informal but widely adopted convention. PyTorch publishes files like torch-2.x.0+cu124-cp311-cp311-linux_x86_64.whl where cu124 signals the CUDA 12.4 dependency. The index URL selection becomes the version selection mechanism:
pip install torch --index-url https://download.pytorch.org/whl/cu124
This encodes the version decision in the install command rather than in package metadata, which is less robust than conda’s solver approach. But it works at scale because the version string is visible in the artifact name itself. You cannot accidentally install a CUDA 12.6 wheel and link it against a CUDA 12.0 runtime without noticing the mismatch in the filename.
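Because the convention lives in the filename, a script can recover it too. A minimal sketch, assuming the +cuXYZ local-version tag shown above with a single-digit minor version; the function name and the concrete version in the example filename are illustrative.

```python
import re

# Matches the "+cu124" style local version tag that precedes the Python tag.
# Assumes the last digit is the minor version and the rest is the major.
_CUDA_TAG = re.compile(r"\+cu(\d+)(\d)-")


def cuda_version_from_wheel(filename):
    """Return the CUDA version encoded in a wheel filename, or None."""
    match = _CUDA_TAG.search(filename)
    if match is None:
        return None
    major, minor = match.groups()
    return f"{major}.{minor}"


if __name__ == "__main__":
    wheel = "torch-2.4.0+cu124-cp311-cp311-linux_x86_64.whl"
    print(cuda_version_from_wheel(wheel))
```

Since the tag is a convention rather than structured metadata, a parser like this is a convenience check, not a guarantee.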
The tradeoff is size. PyTorch wheels targeting sm_75 through sm_90a are approximately 2.2GB per CUDA version, because all SM targets are compiled into a single fat binary. The coverage-versus-download-size tradeoff is at least explicit and visible, which is more than most C++ builds offer.
How Spack Handles It
Spack, the HPC package manager, has the most explicit and composable encoding of CUDA constraints. Its spec language treats GPU architecture as a first-class variant:
spack install my-ml-library +cuda cuda_arch=80,86,90
The cuda_arch variant propagates transitively through the dependency graph. Every package that conditionally depends on CUDA receives the variant and can gate behavior on it. Like conda and unlike pip, Spack can also provision the toolkit itself, which matters in HPC environments where toolkit installation is not a given.
Spack recipes express constraints as Python:
depends_on("cuda@12:", when="+cuda")
conflicts("cuda_arch=none", when="+cuda")
The solver enforces these before any compilation begins. If your requested architecture combination is unsatisfiable given installed packages, you learn immediately at spec resolution time rather than three hours into a build.
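Transitive propagation with an early conflict check can be pictured as a graph walk. The sketch below is a toy model, not Spack's concretizer; the function, the package names, and the graph are all hypothetical.

```python
def propagate_cuda_arch(graph, cuda_aware, root, cuda_arch):
    """Assign cuda_arch to every CUDA-aware package reachable from root.

    graph maps a package name to the names it depends on.
    cuda_aware is the set of packages that declare a cuda variant.
    """
    if not cuda_arch:
        # Mirrors the conflicts() rule: +cuda with no architecture is rejected
        # before any build work starts.
        raise ValueError("+cuda requested with cuda_arch=none")
    resolved = {}
    seen = set()
    stack = [root]
    while stack:
        pkg = stack.pop()
        if pkg in seen:
            continue
        seen.add(pkg)
        if pkg in cuda_aware:
            resolved[pkg] = tuple(cuda_arch)
        stack.extend(graph.get(pkg, ()))
    return resolved


if __name__ == "__main__":
    graph = {
        "my-ml-library": ["tensor-core", "zlib"],
        "tensor-core": ["gpu-blas"],
    }
    aware = {"my-ml-library", "tensor-core", "gpu-blas"}
    print(propagate_cuda_arch(graph, aware, "my-ml-library", [80, 86, 90]))
```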
What Conan + CMake Learns from Each
The approach in the using std::cpp 2026 talk inherits ideas from all three ecosystems, but operates under different constraints. It targets native C++ libraries that must be redistributable as compiled binaries, on both Linux and Windows, without requiring a separate runtime manager.
From conda, it takes the settings-based binary hash. Conan’s settings.yml is user-extensible, and adding cuda_version makes it participate in the binary cache key:
# ~/.conan2/settings.yml (partial)
cuda_version:
- "None"
- "11.8"
- "12.0"
- "12.4"
- "12.6"
A CUDA 12.4 build and a 12.6 build now hash to different cache entries and cannot overwrite each other silently. This is the same guarantee conda’s solver provides for environment artifacts, but applied to a compiled binary cache.
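The effect on the cache key can be illustrated with a simplified hash. This sketch is not Conan's actual package ID computation; it only shows why adding a setting to the hashed input separates the cache entries. The function name and the settings values are illustrative.

```python
import hashlib
import json


def package_id(settings):
    """Derive a stable identifier from a settings dictionary.

    Keys are sorted so the same settings always give the same identifier.
    """
    canonical = json.dumps(settings, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    base = {"os": "Linux", "arch": "x86_64", "compiler": "gcc"}
    id_124 = package_id({**base, "cuda_version": "12.4"})
    id_126 = package_id({**base, "cuda_version": "12.6"})
    # The two builds land in different cache entries.
    print(id_124 != id_126)
```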
From Spack, it takes the compatibility plugin. Conan 2.x supports a compatibility.py extension at ~/.conan2/extensions/plugins/compatibility/compatibility.py that codifies CUDA compatibility within a major version: a binary built against CUDA 12.0 can satisfy a consumer requesting CUDA 12.0 or newer, as long as the major version matches. That rule is expressed once rather than duplicated across every recipe:
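def compatibility(conanfile):
    # Consumer's requested CUDA version; get_safe returns None if the setting is absent
    cuda = conanfile.settings.get_safe("cuda_version")
    if cuda in (None, "None"):
        return []
    major, minor = (int(part) for part in str(cuda).split("."))
    # A 12.4 consumer may fall back to binaries built against 12.3 down to 12.0,
    # newest first. Assumes each candidate value is also listed in settings.yml.
    return [
        {"settings": [("cuda_version", f"{major}.{older}")]}
        for older in range(minor - 1, -1, -1)
    ]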
From pip wheels, it takes the explicit architecture list. Where pip encodes architecture coverage in the fat binary itself, Conan profiles encode it in a per-target settings file. A CI profile might specify cuda_version=12.4 and target sm_80;sm_86;sm_90, while a developer profile specifies native to detect the installed GPU at configure time.
The CMake side uses native CUDA language support, which has existed since CMake 3.8; the CMAKE_CUDA_ARCHITECTURES variable used here arrived in 3.18. The old FindCUDA module was deprecated in 3.10 and removed in 3.27 under policy CMP0146. Modern builds use either FindCUDAToolkit for linking without compilation, or first-class CUDA language support:
cmake_minimum_required(VERSION 3.24)
project(ml_kernels LANGUAGES CXX CUDA)
find_package(CUDAToolkit REQUIRED)
# 'native' detects installed GPU at configure time (CMake 3.24+)
# 'all-major' covers one SM per major generation
set(CMAKE_CUDA_ARCHITECTURES "80;86;89;90")
add_library(kernels STATIC src/inference.cu)
target_link_libraries(kernels PUBLIC
CUDA::cudart_static
CUDA::cublas
)
CMAKE_CUDA_ARCHITECTURES accepts native for developer machines, all-major for distribution builds, or an explicit list when you know the target fleet. CUDA::cudart_static removes the runtime .so dependency, but the static runtime in turn needs dl, rt, and pthread on Linux. The imported target normally carries those dependencies itself; where a setup does not, generator expressions add them cleanly:
target_link_libraries(kernels PUBLIC
CUDA::cudart_static
$<$<PLATFORM_ID:Linux>:${CMAKE_DL_LIBS}>
$<$<PLATFORM_ID:Linux>:rt>
$<$<PLATFORM_ID:Linux>:pthread>
)
One timing constraint has no parallel in the other ecosystems and trips up many initial setups: conan_toolchain.cmake must be loaded before the project() call in CMakeLists.txt runs, which in practice means passing it as CMAKE_TOOLCHAIN_FILE on the command line or through the preset that Conan generates. CMake locks in compiler and language configuration at project() time. Setting CUDAToolkit_ROOT or CMAKE_CUDA_COMPILER afterward has no effect.
What This Approach Does Not Do
The Conan + CMake approach gives you a compatibility matrix that lives in version control, generates build errors rather than runtime failures when constraints are violated, and produces reproducible binary artifacts keyed by CUDA version.
What it does not do is provision the toolkit. Unlike Spack or conda, Conan assumes the CUDA toolkit is pre-installed. CI pipelines still need the toolkit on the runner, typically through containers from the nvidia/cuda image series such as nvidia/cuda:12.4.1-devel-ubuntu22.04; a GPU is needed only to run the resulting binaries or to use native architecture detection. On Windows, the MSVC pairing constraint further restricts the environment: CUDA 12.x requires MSVC 2019 or 2022, and the CUDA installer expects Visual Studio to already be present.
The “one command” promise holds given preconditions: the toolkit is installed, the Conan profile is written for the target platform, and the GPU architecture list is agreed upon for the project. Those preconditions are not free, but they are one-time setup costs rather than recurring tribal knowledge encoded nowhere. The compatibility rules move from documentation into build files, where a machine can check them instead of a developer.
That shift is what the other ecosystems accomplished years ago. Conda did it by making toolkit version a dependency. Spack did it by making architecture a first-class variant in the spec language. Pip did it by encoding the version in the artifact name. C++ is arriving at the same destination with different tools, shaped by the constraints of native compilation and cross-platform redistribution, but the underlying goal is identical: build systems that know what they require and tell you when those requirements are not met.