The ISO C++ annual developer survey has produced the same headline finding for several years running: dependency management is the single biggest pain point in C++ development. This is not surprising to anyone who has wrestled with a C++ project beyond trivial size. What is notable is how much worse things get when you introduce CUDA into the mix. A talk at using std::cpp 2026 addresses this directly, promising a workflow that delivers one source checkout, one command, and identical builds on every platform. Understanding why that is difficult to achieve requires unpacking what the CUDA compatibility matrix actually is.
Three Axes, Not One
When CUDA builds fail, the instinct is to blame a version mismatch. That framing is too vague to be useful. There are three independent version axes that all have to align, and they interact differently.
The first axis is the CUDA toolkit version: the compiler (nvcc), runtime headers, and utility libraries packaged together. Recent versions span 12.0 through 12.6, with 12.4 and 12.6 being common targets in current projects. The second axis is the GPU driver version installed on the host machine. Each toolkit version has a minimum required driver; CUDA 12.4 requires at least driver 550.54.14 on Linux and 551.61 on Windows. Drivers are backward-compatible: a CUDA 11.x application runs on a CUDA 12.x driver without recompilation, but the reverse is not true. A CI runner with driver 525 sits below the 12.4 minimum, so code built with CUDA 12.4 can fail there at runtime even though it compiled fine; at best it runs under CUDA's minor-version compatibility mode, which comes with restrictions.
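The minimum-driver rule is easy to turn into a pre-flight check. The sketch below is a minimal illustration, not an NVIDIA API: the table holds only the CUDA 12.4 minimums quoted above, and the names MIN_DRIVER, parse_version, and driver_meets_minimum are hypothetical helpers invented for this example.

```python
# Minimum driver per (toolkit, platform). Only the 12.4 entries quoted in the
# text are listed; extend the table from NVIDIA's release notes as needed.
MIN_DRIVER = {
    ("12.4", "linux"): (550, 54, 14),
    ("12.4", "windows"): (551, 61),
}


def parse_version(text):
    """Turn a dotted version such as '550.54.14' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def driver_meets_minimum(toolkit, platform, installed_driver):
    """Return True if the installed driver is at least the toolkit's minimum."""
    required = MIN_DRIVER[(toolkit, platform)]
    return parse_version(installed_driver) >= required


if __name__ == "__main__":
    for driver in ("525.60.13", "550.54.14"):
        ok = driver_meets_minimum("12.4", "linux", driver)
        print(f"driver {driver} vs CUDA 12.4: {'ok' if ok else 'too old'}")
```

A check like this belongs in the deployment or test-runner step rather than the build, since the driver only matters when kernels actually run.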
The third axis is GPU compute capability, the microarchitecture version identifier for the physical hardware. RTX 4090 is compute capability 8.9 (Ada Lovelace). A100 is 8.0 (Ampere). V100 is 7.0 (Volta). Code compiled targeting sm_80 will not execute on an sm_75 GPU. Code compiled with embedded PTX for compute_80 will JIT-compile on sm_90 hardware via the driver, but that first-run JIT latency is real and accumulates across a cold deployment.
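Those rules can be written down as a small decision function. This is a sketch under simplifying assumptions: it models only the general rules (native code runs within the same major capability on an equal or newer minor; PTX can be JIT-compiled for any equal or newer capability), it ignores architecture-specific variants such as sm_90a, and it assumes the driver is new enough. The function name can_run is hypothetical.

```python
def can_run(device_cc, sass_targets, ptx_targets):
    """Classify how a binary would load on a device.

    device_cc and every target are (major, minor) tuples, e.g. (8, 9) for sm_89.
    Returns "native", "jit", or None if the binary cannot run at all.
    """
    # Native machine code is compatible within one major capability,
    # on devices with an equal or newer minor revision.
    for major, minor in sass_targets:
        if major == device_cc[0] and minor <= device_cc[1]:
            return "native"
    # Embedded PTX can be JIT-compiled by the driver for any device
    # whose capability is equal to or newer than the PTX target.
    for target in ptx_targets:
        if target <= device_cc:
            return "jit"
    return None


if __name__ == "__main__":
    print(can_run((7, 5), [(8, 0)], []))        # sm_80 code on an sm_75 GPU
    print(can_run((9, 0), [(8, 0)], [(8, 0)]))  # compute_80 PTX on sm_90
```

The "jit" result is the case that costs first-run latency, which is the argument for shipping native code for every architecture you deploy on.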
The combinatorics become apparent quickly. Supporting the last three GPU generations (Turing, Ampere, Ada Lovelace) with CUDA 12.x means correctly targeting sm_75, sm_80, sm_86, sm_87, and sm_89, while handling architecture-specific features that vary between them. Hopper (sm_90, sm_90a) adds the "a" suffix variant for special tensor core operations that are invalid on prior architectures. Fat binaries that embed all of these get large fast; PyTorch pre-built wheels targeting multiple architectures routinely exceed 2 GB.
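To make the fat-binary cost concrete, here is roughly what the architecture list turns into on the nvcc command line. The helper gencode_flags is a hypothetical name; the flag format follows nvcc's -gencode option, with one native entry per architecture and an optional PTX entry for forward compatibility.

```python
def gencode_flags(sass_archs, ptx_arch=None):
    """Build nvcc -gencode flags: one native target per entry in sass_archs,
    plus an optional embedded-PTX target for newer, unlisted GPUs."""
    flags = [f"-gencode=arch=compute_{arch},code=sm_{arch}" for arch in sass_archs]
    if ptx_arch is not None:
        # code=compute_XX embeds PTX instead of machine code.
        flags.append(f"-gencode=arch=compute_{ptx_arch},code=compute_{ptx_arch}")
    return flags


if __name__ == "__main__":
    for flag in gencode_flags(["75", "80", "86", "87", "89"], ptx_arch="89"):
        print(flag)
```

Every extra entry is another full compilation of each kernel, which is why build time and binary size scale with the length of the list.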
CMake’s CUDA Model Since 3.18
The old approach using find_package(CUDA REQUIRED) and the cuda_add_executable macro has been deprecated since CMake 3.10, and as of CMake 3.27 policy CMP0146 removes the FindCUDA module unless the policy is explicitly set to OLD. The replacement has been stable since 3.18 and is worth understanding in detail.
CUDA is now a first-class language in CMake. You enable it with enable_language(CUDA) or by adding CUDA to the project() languages list. This gives you the full target-based machinery: add_library, set_target_properties, target_compile_options, and generator expressions all work on .cu files the same way they work for C++.
The central variable is CMAKE_CUDA_ARCHITECTURES, introduced in CMake 3.18, which initializes the CUDA_ARCHITECTURES property on each target. The all and all-major values arrived in 3.23, and native followed in 3.24:
cmake_minimum_required(VERSION 3.24)  # native needs 3.24; all-major needs 3.23
project(MyMLLib LANGUAGES CXX CUDA)
# Pick ONE of the following; each set() overrides the previous one.
# For developer builds: compile only for the installed GPU
set(CMAKE_CUDA_ARCHITECTURES native)
# For distribution: cover the major architecture families
set(CMAKE_CUDA_ARCHITECTURES 70 75 80 86 89 90)
# Shorthand that expands to one entry per major version
set(CMAKE_CUDA_ARCHITECTURES all-major)
The native value (CMake 3.24 and later) detects installed GPUs at configure time and generates code only for those architectures. This is the correct default for developer machines where waiting for a six-architecture fat binary is unnecessary. On a machine with no GPU there is nothing to detect, so CI builds should use an explicit list or all-major. The all-major value is appropriate for CI artifacts that need to run anywhere.
FindCUDAToolkit (CMake 3.17+) handles the library side without requiring any .cu files in your target. It exposes properly scoped imported targets for everything in the toolkit:
find_package(CUDAToolkit REQUIRED)
target_link_libraries(my_inference_engine PRIVATE
    CUDA::cudart
    CUDA::cublas
    CUDA::cublasLt
)
Contrast this with the old ${CUDA_LIBRARIES} variable approach, which was a flat list with no scoping information. The imported targets carry their own include paths, so there is no separate target_include_directories call for CUDA headers. CUDAToolkit_VERSION_MAJOR and CUDAToolkit_VERSION_MINOR are available as CMake variables after the find_package call, which lets you gate feature availability at compile time:
target_compile_definitions(my_lib PRIVATE
    CUDA_VERSION_MAJOR=${CUDAToolkit_VERSION_MAJOR}
    CUDA_VERSION_MINOR=${CUDAToolkit_VERSION_MINOR}
)
For libraries that call __device__ functions across translation units, separable compilation is also now a clean target property rather than a macro flag:
set_target_properties(my_kernels PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON
    CUDA_RESOLVE_DEVICE_SYMBOLS ON
    CUDA_STANDARD 17
)
What Conan 2.x Can and Cannot Own
Conan 2.x can own your C++ dependency graph with precision: version constraints, platform-specific settings, per-profile build options. What it cannot own is the CUDA toolkit installation itself. The toolkit is a system-level dependency that lives outside the Conan graph entirely, and no C++ package manager currently changes that.
The practical consequence is that CUDA fits into Conan 2.x as an option rather than a hard setting. A conanfile.py for a CUDA-dependent library looks like this:
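import os

from conan import ConanFile
from conan.tools.cmake import CMake, CMakeToolchain, cmake_layout


class InferenceLib(ConanFile):
    name = "inference-lib"
    settings = "os", "arch", "compiler", "build_type"
    options = {
        "cuda_version": ["11.8", "12.0", "12.1", "12.2", "12.4", "12.6"],
        "cuda_arch": [None, "75", "80", "86", "89", "90"],
    }
    default_options = {"cuda_version": "12.4", "cuda_arch": "80"}

    def layout(self):
        cmake_layout(self)

    def generate(self):
        tc = CMakeToolchain(self)
        # Leave CMake's own default in place when cuda_arch is None.
        arch = str(self.options.cuda_arch)
        if arch != "None":
            tc.variables["CMAKE_CUDA_ARCHITECTURES"] = arch
        # Profile [buildenv] values may not be visible in os.environ while
        # generate() runs, so ask the profile first, then the process environment.
        cuda_path = (self.buildenv.vars(self).get("CUDA_PATH")
                     or os.environ.get("CUDA_PATH", "/usr/local/cuda"))
        tc.variables["CUDAToolkit_ROOT"] = cuda_path
        tc.generate()

    def build(self):
        cmake = CMake(self)
        cmake.configure()
        cmake.build()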
The cuda_version option becomes part of the Conan package ID. Two builds with different CUDA versions produce different binary packages and do not silently overwrite each other in the cache. The toolkit path itself comes from an environment variable or a profile [buildenv] section, keeping system-level concerns out of the recipe.
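The effect on the cache is easier to see with a toy model. The sketch below is not Conan's actual package ID algorithm; it only illustrates the principle that the ID is derived from the full configuration, so changing one option value yields a different ID. The function toy_package_id and the sample values are invented for this illustration.

```python
import hashlib


def toy_package_id(settings, options):
    """Hash a configuration into an ID (illustrative only, not Conan's scheme)."""
    # Sort the keys so the ID does not depend on dictionary ordering.
    lines = [f"{key}={value}" for key, value in sorted(settings.items())]
    lines += [f"opt:{key}={value}" for key, value in sorted(options.items())]
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    settings = {"os": "Linux", "arch": "x86_64", "build_type": "Release"}
    id_124 = toy_package_id(settings, {"cuda_version": "12.4", "cuda_arch": "80"})
    id_118 = toy_package_id(settings, {"cuda_version": "11.8", "cuda_arch": "80"})
    print(id_124 != id_118)  # different toolkit option, different binary package
```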
The Conan profile separates cleanly by toolkit version:
[settings]
os=Linux
arch=x86_64
compiler=gcc
compiler.version=12
compiler.libcxx=libstdc++11
build_type=Release
[options]
inference-lib/*:cuda_version=12.4
[buildenv]
CUDA_PATH=/usr/local/cuda-12.4
A second profile for CUDA 11.8 points to a different toolkit path and specifies cuda_version=11.8 in the options. The same source tree, the same command, different Conan profiles, and you get correctly separated binary artifacts. This is the model the using std::cpp 2026 talk describes: the compatibility matrix is encoded in build files rather than living in CI scripts and tribal knowledge.
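Since the profiles differ only in the toolkit version, they can be generated from one template rather than maintained by hand. This is a minimal sketch: render_profile is a hypothetical helper, the /usr/local/cuda-X.Y install layout is an assumption about the host, and the option line uses Conan 2's package-pattern syntax for scoping an option to one recipe.

```python
PROFILE_TEMPLATE = """\
[settings]
os=Linux
arch=x86_64
compiler=gcc
compiler.version=12
build_type=Release
[options]
inference-lib/*:cuda_version={cuda_version}
[buildenv]
CUDA_PATH={toolkit_root}/cuda-{cuda_version}
"""


def render_profile(cuda_version, toolkit_root="/usr/local"):
    """Return profile text for one toolkit version (assumed install layout)."""
    return PROFILE_TEMPLATE.format(cuda_version=cuda_version,
                                   toolkit_root=toolkit_root)


if __name__ == "__main__":
    for version in ("11.8", "12.4"):
        print(f"--- cuda-{version} profile ---")
        print(render_profile(version))
```

Checking the generated profiles into the repository keeps the matrix reviewable, which is the point of moving it out of CI scripts.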
CI Without GPUs and the Windows Constraint
Most CI runners do not have NVIDIA GPUs. This is manageable because CUDA compilation and CUDA execution are separate concerns. Building a project against CUDA 12.4 on a CPU-only runner using the nvidia/cuda:12.4.1-devel-ubuntu22.04 Docker image works fine as long as you are not running kernels. The compile-only job verifies that the code builds against the correct toolkit version and links against the right libraries; GPU execution tests are reserved for hardware-equipped runners.
The driver version constraint only matters at runtime. The devel image provides the compiler and headers; the actual GPU driver requirement is irrelevant for a compile-only CI step.
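One way to express that split is to let the pipeline decide its stages from what the runner offers. The sketch below is an assumption-laden illustration: it treats the presence of nvidia-smi on PATH as a rough signal that a driver is installed, and the stage names and both function names are hypothetical.

```python
import shutil


def gpu_available(which=shutil.which):
    """Heuristic: an nvidia-smi executable on PATH suggests a driver is installed."""
    return which("nvidia-smi") is not None


def ci_stages(has_gpu):
    """Compile everywhere; run GPU tests only where hardware exists."""
    stages = ["configure", "build"]
    if has_gpu:
        stages.append("gpu-tests")
    return stages


if __name__ == "__main__":
    print(ci_stages(gpu_available()))
```

The lookup function is injectable so the decision logic can be unit-tested on a machine without a GPU.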
The Windows story is harder. On Windows, nvcc requires MSVC as the host compiler. CUDA 12.x supports VS2019 (MSVC 14.2x) and VS2022 (MSVC 14.3x). Clang cannot serve as the nvcc host compiler on Windows without significant patching. This means any code that mixes CUDA and C++ on Windows must use MSVC for host-side translation units, regardless of other toolchain preferences in the project. CMake handles the detection automatically when CUDA is enabled as a language, but it is a constraint you need to account for when building an otherwise Clang-first project. The debug runtime selection (/MD vs /MDd) must also be consistent: the CUDA runtime uses /MD, so mixing debug C++ code with CUDA objects requires care.
What “One Command” Actually Delivers
The goal stated in the talk is achievable with real preconditions. One command works once the toolkit is installed out-of-band, once the profile is written for the target platform, and once the GPU architecture list is agreed upon for the project. Those preconditions are not trivial for a team with heterogeneous hardware: developer laptops with RTX 3000-series, CI runners with A100s, and production inference clusters with H100s represent three different compute capabilities (8.6, 8.0, and 9.0) that the CMAKE_CUDA_ARCHITECTURES list needs to cover.
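Agreeing on the list can start from an inventory of the hardware the team actually runs. The sketch below derives a CMAKE_CUDA_ARCHITECTURES value from such an inventory; the FLEET mapping and the function cmake_arch_list are hypothetical, with the three entries taken from the example above.

```python
# Hypothetical inventory: where the code runs, and the architecture it needs.
FLEET = {
    "developer laptops (RTX 3000-series)": "86",
    "CI runners (A100)": "80",
    "production clusters (H100)": "90",
}


def cmake_arch_list(fleet):
    """Return a semicolon-separated, de-duplicated, numerically sorted list."""
    archs = sorted(set(fleet.values()), key=int)
    return ";".join(archs)


if __name__ == "__main__":
    print(f"-DCMAKE_CUDA_ARCHITECTURES={cmake_arch_list(FLEET)}")
```

Semicolons are CMake's list separator, so the result can be passed directly on the command line or written into a toolchain variable.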
What the Conan plus CMake combination buys you is that this decision is made once in version-controlled build files and stays consistent across every developer’s checkout. The alternative is ad-hoc -gencode flags scattered across CI scripts, Dockerfiles, and README instructions that accumulate drift as the team grows and hardware changes.
Encoding the matrix in conanfile.py options and CMakeLists.txt properties makes it auditable. When a new GPU architecture releases, you update one list in one file and the change propagates through every build configuration. That reproducibility is the concrete answer to the dependency management complaint that the ISO C++ survey keeps surfacing: not a generic solution to the problem, but a disciplined place to put the constraints that CUDA introduces, where they can be reviewed, versioned, and understood.