
Three Version Numbers, One Build: Conan and CMake Take On the CUDA Compatibility Matrix

Source: isocpp

The Problem With Three Moving Parts

Every CUDA project carries three distinct version numbers that need to stay in sync: the CUDA Toolkit version (the compiler and libraries you build against), the NVIDIA driver version (the kernel module on the target machine), and the compute capability of the physical GPU (sm_80 for A100, sm_90 for H100). These axes are related but not identical, and the relationships between them are asymmetric.

The toolkit and driver relationship works like this: an application compiled against CUDA 12.3 requires a minimum driver version in the 545.x range on Linux (exact floors are documented in NVIDIA’s release notes); an app built against CUDA 12.0 needs at least 525.60.13. Drivers are backward-compatible but not forward-compatible: a newer driver runs applications built with older toolkits, but by default an application built with CUDA 12.4 expects a driver at least as new as the one that shipped with 12.4. NVIDIA relaxes this with the “enhanced compatibility” mode introduced in CUDA 11.1, which lets applications built with a newer toolkit run on older drivers from the same major release, provided they link statically against libcudart and avoid APIs the older driver lacks. Container runtimes lean on this mechanism, but it covers only a subset of cases.
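The driver floor lends itself to a mechanical check. The following is a minimal sketch, not NVIDIA tooling: `MIN_LINUX_DRIVER`, `parse_version`, and `driver_satisfies` are hypothetical names, and the table holds only the CUDA 12.0 floor quoted above. Other entries would have to be filled in from the release notes.

```python
# Hypothetical helper: compare an installed Linux driver against the minimum
# driver a toolkit release expects. Only the 12.0 floor quoted in the text is
# listed; extend the table from NVIDIA's release notes.
MIN_LINUX_DRIVER = {
    "12.0": "525.60.13",
}


def parse_version(text):
    """Turn '525.60.13' into (525, 60, 13) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_satisfies(toolkit, driver, table=MIN_LINUX_DRIVER):
    """Return True if `driver` meets the floor recorded for `toolkit`."""
    if toolkit not in table:
        raise KeyError(f"no driver floor recorded for CUDA {toolkit}")
    return parse_version(driver) >= parse_version(table[toolkit])


if __name__ == "__main__":
    for installed in ("535.104.05", "520.61.05"):
        print(installed, driver_satisfies("12.0", installed))
```

A check like this could run as the first step of a CI job or container entrypoint, so that a driver mismatch fails early rather than at the first kernel launch.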

The compute capability axis is separate. CUDA compiles to either SASS (Shader Assembly, architecture-specific binary machine code) or PTX (Parallel Thread eXecution, a virtual ISA JIT-compiled by the driver at runtime). SASS is fast but not forward-compatible: code compiled for sm_80 will not run on an sm_90 GPU without recompilation. PTX is forward-compatible and will JIT on any newer architecture, but there is a JIT overhead on first launch and, more critically, you are limited to the features available in that PTX generation.
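The SASS and PTX rules can be written down as a small predicate. This is a simplified model under stated assumptions: the function names are hypothetical, compute capabilities are plain integers (80, 86, 90), and architecture-specific targets such as sm_90a are left out. It also encodes one rule beyond what the paragraph states, taken from NVIDIA’s compatibility documentation: SASS does run on a later minor revision within the same major version.

```python
# Simplified model: each fat-binary entry is a (kind, compute_capability)
# pair, e.g. ("sass", 80) or ("ptx", 90). Not a substitute for the driver's
# real image-selection logic; sm_90a-style targets are not modeled.

def entry_runs_on(kind, built_for, gpu):
    """Return True if one embedded image can execute on a GPU of capability `gpu`."""
    if kind == "sass":
        # SASS is binary-compatible only within one major version, and only
        # on the same or a later minor revision (sm_80 code runs on sm_86).
        return gpu // 10 == built_for // 10 and gpu >= built_for
    if kind == "ptx":
        # PTX is JIT-compiled by the driver for any equal or newer architecture.
        return gpu >= built_for
    raise ValueError(f"unknown image kind: {kind!r}")


def fatbin_runs_on(entries, gpu):
    """Return True if at least one embedded image is usable on `gpu`."""
    return any(entry_runs_on(kind, cc, gpu) for kind, cc in entries)


if __name__ == "__main__":
    sass_only = [("sass", 80)]
    with_ptx = [("sass", 80), ("ptx", 80)]
    print(fatbin_runs_on(sass_only, 90))  # Ampere SASS on Hopper
    print(fatbin_runs_on(with_ptx, 90))   # PTX fallback gets JIT-compiled
```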

That ceiling is a practical problem with Hopper hardware. The Tensor Memory Accelerator and warp specialization primitives are only available under sm_90a, the architecture-specific variant for H100. These features have no forward-compatible PTX representation: PTX built for compute_90a is tied to that architecture and will not JIT on later GPUs. Code using CUTLASS 3.x’s optimized Hopper GEMM kernels must explicitly compile for sm_90a (CMake 3.26+ recognizes this value in CMAKE_CUDA_ARCHITECTURES); shipping generic sm_90 PTX will not expose those instructions on an H100. The “just include PTX and let the driver handle it” strategy breaks at architecture-specific extensions.

The ISO C++ annual developer survey has consistently ranked dependency management as the number one pain point for C++ developers, ahead of compile times and template error messages. For CUDA projects, that pain compounds because you are not just managing C++ library dependencies but modeling a compatibility space across toolkit versions, driver requirements, and GPU generations. A talk at using std::cpp 2026 addresses this directly, showing how Conan and CMake can encode that compatibility matrix into the build system so developers and CI see identical builds from a single command.

How CMake Handles the Architecture Dimension

CMake has supported CUDA as a first-class language via enable_language(CUDA) since 3.8 and added the CMAKE_CUDA_ARCHITECTURES variable in 3.18; the examples here assume 3.24+. The variable accepts concrete architecture numbers, the keywords native (detect the installed GPU, CMake 3.24+) and all-major (every major architecture the installed toolkit supports, CMake 3.23+), and per-architecture suffixes that control whether SASS, PTX, or both get embedded in the fat binary.

cmake_minimum_required(VERSION 3.24)
project(MLProject LANGUAGES CXX CUDA)

set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)

# SASS for Ampere, Ada Lovelace, and Hopper; PTX for Hopper only (forward compat)
# Use 90a instead of 90 if any kernels need TMA or warp specialization
set(CMAKE_CUDA_ARCHITECTURES "80-real;86-real;89-real;90-real;90-virtual")

The -real suffix emits SASS only; -virtual emits PTX only; an unadorned number emits both. Adding 90-virtual means the fat binary includes Hopper PTX, letting the driver JIT on future Blackwell hardware (sm_100), while Hopper itself uses compiled SASS. Setting all-major is convenient for development but produces large binaries, which is wasteful when the deployment hardware is known.

CMake translates CUDA_ARCHITECTURES directly into nvcc’s --generate-code flags. Setting 80-real;90-real is equivalent to:

nvcc -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90

An unadorned entry such as 80 also embeds the PTX, adding compute_80 to the code= list. For architecture-specific Hopper features, 90a-real maps to -gencode arch=compute_90a,code=sm_90a.
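The suffix-to-flag mapping can be sketched in a few lines. This is an illustration, not CMake’s implementation: `gencode_flags` is a hypothetical name, the keywords native, all, and all-major are deliberately rejected, and the exact flag spelling CMake emits may differ between versions.

```python
import re


def gencode_flags(architectures):
    """Expand a CMAKE_CUDA_ARCHITECTURES-style string into nvcc flags.

    '-real' yields SASS only, '-virtual' yields PTX only, and an
    unadorned entry yields both.
    """
    flags = []
    for entry in architectures.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        arch, _, suffix = entry.partition("-")
        # Accept numbers with an optional 'a' (e.g. 90a); reject keywords.
        if not re.fullmatch(r"\d+a?", arch):
            raise ValueError(f"unsupported architecture entry: {entry!r}")
        if suffix == "real":
            codes = [f"sm_{arch}"]
        elif suffix == "virtual":
            codes = [f"compute_{arch}"]
        elif suffix == "":
            codes = [f"compute_{arch}", f"sm_{arch}"]
        else:
            raise ValueError(f"unknown suffix in entry: {entry!r}")
        flags.append(
            f"--generate-code=arch=compute_{arch},code=[{','.join(codes)}]"
        )
    return flags


if __name__ == "__main__":
    for flag in gencode_flags("80-real;90a;90-virtual"):
        print(flag)
```

Printing the expansion next to the actual nvcc command line (for example from a verbose build) is a quick way to confirm that a fat binary contains what the architecture list was meant to request.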

For CUDA Toolkit library dependencies, the FindCUDAToolkit module (available since CMake 3.17) provides proper imported targets, replacing the deprecated FindCUDA module:

find_package(CUDAToolkit REQUIRED)

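# my_kernels is assumed to be a library target defined elsewhere in the project.
# CUDA::nvToolsExt is deprecated as of CMake 3.25 in favor of CUDA::nvtx3.
target_link_libraries(my_kernels PUBLIC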
    CUDA::cudart
    CUDA::cublas
    CUDA::cublasLt
    CUDA::nvToolsExt
)

Where Conan Fits In

Conan handles the package dependency layer. The CUDA toolkit itself is a system installation; Conan does not download nvcc. Libraries that depend on it, including cuDNN, NCCL, CUTLASS, and CUB, can be pulled from ConanCenter where a recipe is published, or from a private remote otherwise. The mechanism that makes cross-platform builds coherent is the Conan profile, which captures both host settings and the toolchain configuration needed for nvcc.

A Linux profile for CUDA 12.3 development:

[settings]
os=Linux
arch=x86_64
compiler=gcc
compiler.version=12
compiler.libcxx=libstdc++11
build_type=Release

[conf]
tools.cmake.cmaketoolchain:generator=Ninja
tools.build:compiler_executables={"cuda": "/usr/local/cuda-12.3/bin/nvcc"}
tools.cmake.cmaketoolchain:extra_variables={"CMAKE_CUDA_ARCHITECTURES": "80;86;89;90"}

[buildenv]
CUDA_PATH=/usr/local/cuda-12.3
PATH=+(path)/usr/local/cuda-12.3/bin
LD_LIBRARY_PATH=+(path)/usr/local/cuda-12.3/lib64

The CMakeToolchain generator in Conan 2 writes a conan_toolchain.cmake file that CMake includes at configure time. The tools.cmake.cmaketoolchain conf entry in the profile injects CMAKE_CUDA_ARCHITECTURES into that file, meaning the architecture targeting lives in one version-controlled location rather than scattered across CI scripts and developer dotfiles. With cmake_layout in the conanfile, the generated CMakePresets.json carries conan-release and conan-debug presets, so the full workflow from source is a conanfile.py plus three commands:

# conanfile.py
from conan import ConanFile
from conan.tools.cmake import CMakeToolchain, CMakeDeps, CMake, cmake_layout

class AIProject(ConanFile):
    settings = "os", "compiler", "build_type", "arch"
    requires = ["cutlass/3.4.0", "spdlog/1.13.0"]

    def generate(self):
        tc = CMakeToolchain(self)
        # Recipe-level default; the profile is where targets change per environment
        tc.variables["CMAKE_CUDA_ARCHITECTURES"] = "80;86;89;90"
        tc.variables["CMAKE_CUDA_STANDARD"] = "17"
        tc.generate()
        CMakeDeps(self).generate()

    def layout(self):
        cmake_layout(self)

    def build(self):
        cmake = CMake(self)
        cmake.configure()
        cmake.build()

conan install . --profile ai-linux-cuda12 --build=missing
cmake --preset conan-release
cmake --build --preset conan-release --parallel $(nproc)

The same profile file committed to the repository serves both the developer workstation and the CI runner. One source checkout, one command, identical builds.

Package ID and CUDA Version Slots

Conan’s package_id determines binary compatibility. When the toolkit version is modeled as a setting (cuda_version below is a custom setting that has to be declared in settings_user.yml and listed in the recipe’s settings; it is not built into Conan), a package built against CUDA 12.3 gets a different ID from one built against CUDA 12.0. This is usually correct because different toolkit versions can introduce ABI differences. NVIDIA’s minor-version compatibility guarantee, however, means that application binaries are compatible across CUDA 12.x releases. You can express this in Conan to avoid redundant rebuilds of library packages that have no actual toolkit ABI dependency:

def package_id(self):
    v = str(self.info.settings.get_safe("cuda_version", ""))
    if v.startswith("12."):
        self.info.settings.cuda_version = "12.x"

This lets a binary built against CUDA 12.0 satisfy a requirement specifying 12.3. Packages that link against libcuda.so directly or use features added in specific toolkit releases should not use this relaxation.
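The slot logic is easy to test outside Conan. The function below is a standalone sketch of the same normalization, not part of any Conan API: `cuda_slot` and its `relaxed_majors` parameter are hypothetical names, and restricting the relaxation to major version 12 mirrors the snippet above.

```python
def cuda_slot(version, relaxed_majors=("12",)):
    """Map a toolkit version to the slot used for package identity.

    Versions whose major number is in `relaxed_majors` collapse to
    '<major>.x'; everything else keeps its exact version string.
    """
    major, dot, _minor = version.partition(".")
    if dot and major in relaxed_majors:
        return f"{major}.x"
    return version


if __name__ == "__main__":
    for v in ("12.0", "12.3", "11.8", ""):
        print(repr(v), "->", repr(cuda_slot(v)))
```

Keeping the rule in a plain function like this makes it possible to unit-test the compatibility policy without running a Conan build; package_id would then assign the returned slot to the setting.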

Comparing the Alternatives

vcpkg handles CUDA-dependent ports, including OpenCV with CUDA, FAISS, and ONNX Runtime, and also assumes a system-installed toolkit. Architecture targeting goes through custom triplet files: an x64-linux-cuda.cmake triplet can set VCPKG_CMAKE_CONFIGURE_OPTIONS to pass -DCMAKE_CUDA_ARCHITECTURES=80;86;89;90. This works, but lacks the profile-level toolchain variable injection that Conan’s CMakeToolchain generator provides. For Windows with MSVC, vcpkg’s tight Visual Studio integration is a genuine advantage.

Bazel with rules_cuda offers more hermetic builds and can declare the CUDA toolkit as a Bazel repository, useful for strict reproducibility. The cost is the learning curve and weak Windows support. TensorFlow and JAX build their CUDA code with Bazel; adoption outside Google-adjacent projects is limited.

Meson has CUDA language support but incomplete handling of CUDA separable compilation and weaker ecosystem coverage for AI workloads. It fits GPU driver or firmware projects more naturally than ML library development.

For teams working across Linux development machines, GPU CI runners, and container-based deployments, Conan 2 plus CMake 3.24+ has the most complete story available today: architecture targeting in profiles, CUDA library dependency management via ConanCenter, and CMakeToolchain bridging the two into a single reproducible workflow.

Fat Binary Size and the Coverage Trade-off

Architecture coverage carries a direct size cost. PyTorch’s 2+ GB wheels owe a large share of their size to fat binaries that cover sm_75 through sm_90 across thousands of kernels. For internal libraries distributed within a known infrastructure, targeting only the architectures you actually deploy is correct. For publicly distributed libraries, all-major, which already embeds PTX for the highest major architecture, is a defensible default despite the size.

The practical set for AI workloads in 2026: sm_80 covers A100, still the dominant data center GPU for many teams; sm_86 covers A10 and consumer Ampere; sm_89 covers L4 and L40, the common inference SKUs; sm_90 covers base H100. Include sm_90a separately when any kernel uses Hopper-specific TMA or warp specialization instructions. Adding 90-virtual provides forward compatibility with Blackwell without an immediate recompile, at the cost of a slightly larger binary.
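The selection rules in this section can be turned into a small generator for the architecture string. This is a sketch under stated assumptions: `GPU_TO_SM` and `architecture_list` are hypothetical names, and the table contains only the GPU-to-architecture pairs named above.

```python
# GPU models and architectures as listed in the text; extend as needed.
GPU_TO_SM = {"A100": "80", "A10": "86", "L4": "89", "L40": "89", "H100": "90"}


def architecture_list(gpus, hopper_specific=False, forward_ptx=True):
    """Build a CMAKE_CUDA_ARCHITECTURES value for the deployed GPU models.

    hopper_specific: also emit 90a SASS for kernels that need TMA or
                     warp specialization (only when an H100 is targeted).
    forward_ptx:     embed PTX for the highest targeted architecture so
                     newer GPUs can JIT-compile it.
    """
    sms = sorted({GPU_TO_SM[gpu] for gpu in gpus}, key=int)
    if not sms:
        raise ValueError("at least one GPU model is required")
    entries = [f"{sm}-real" for sm in sms]
    if hopper_specific and "90" in sms:
        entries.append("90a-real")
    if forward_ptx:
        entries.append(f"{sms[-1]}-virtual")
    return ";".join(entries)


if __name__ == "__main__":
    print(architecture_list(["A100", "L4", "H100"]))
    print(architecture_list(["H100"], hopper_specific=True))
```

The resulting string can be pasted into a profile or CMakeLists.txt; generating it from the fleet inventory keeps the architecture list tied to the hardware actually deployed.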

The contribution of encoding all of this in Conan profiles and CMake toolchain files is that the compatibility decisions become explicit and version-controlled rather than tribal knowledge spread across CI scripts and developer setups. The CUDA compatibility matrix is genuinely complex. The answer is not to simplify it but to put it somewhere authoritative.
