
Modeling the CUDA Matrix: What Conan and CMake Get Right About C++ AI Builds

Source: isocpp

Every year the ISO C++ developer survey returns with the same finding: dependency management is the single largest pain point in the C++ ecosystem. Not syntax, not ABI, not the preprocessor. Dependencies. The community has known this for over a decade, and the answer has been a slow convergence on Conan and CMake as a de facto standard pair. A talk at using std::cpp 2026 pushes this further by tackling cross-platform C++ AI development with Conan, CMake, and CUDA, arguing that the CUDA compatibility matrix can be modeled directly in your build rather than managed through tribal knowledge and Docker workarounds.

That claim is worth examining carefully, because the CUDA compatibility problem is genuinely harder than ordinary dependency management.

Why CUDA Makes the Dependency Problem Exponential

In most ecosystems, a dependency has a version. You pin it, lock it, and move on. CUDA has five dimensions that all interact:

  • CUDA Toolkit version (e.g., 12.2, 11.8): determines which APIs and libraries are available at compile time
  • NVIDIA driver version (e.g., 535.x, 560.x): must be at or above the minimum required by the toolkit
  • Compute capability (e.g., SM 8.0 for Ampere, SM 9.0 for Hopper): the GPU architecture target baked into your binaries
  • CUDA-adjacent libraries (cuDNN, NCCL, TensorRT, cuBLAS): each carries its own compatibility constraint against the toolkit version
  • Host compiler (GCC, MSVC, Clang): nvcc accepts a limited range of host compilers per toolkit version

Mismatch any of these and the failure usually surfaces at run time rather than at build time. A binary that contains only SM 7.5 machine code will not execute on an SM 7.0 device. A cuDNN 8.9 build against CUDA 12.2 will fail to load, typically at dlopen() or on its first call, if the CUDA runtime present is 11.8, with no build-time warning. Driver checks also happen at run time, for example through cudaDriverGetVersion(), which means a deployment environment with an out-of-date driver looks fine until the program runs.
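The compute-capability rule behind the SM 7.5 versus SM 7.0 example can be written down in a few lines. This is a minimal sketch of the rule as described in NVIDIA's programming guide, not an NVIDIA API: the function names are hypothetical, and architecture-specific targets such as sm_90a are deliberately ignored.

```python
def cubin_runs_on(built_sm: tuple[int, int], device_sm: tuple[int, int]) -> bool:
    """Native machine code (cubin) runs only on devices of the same major
    compute capability with an equal or higher minor revision."""
    built_major, built_minor = built_sm
    device_major, device_minor = device_sm
    return built_major == device_major and device_minor >= built_minor


def ptx_runs_on(built_cc: tuple[int, int], device_sm: tuple[int, int]) -> bool:
    """Embedded PTX can be JIT-compiled by the driver for any device whose
    compute capability is equal to or higher than the PTX target."""
    return device_sm >= built_cc


if __name__ == "__main__":
    print(cubin_runs_on((7, 5), (7, 0)))  # the example from the text
    print(cubin_runs_on((8, 0), (8, 6)))
    print(ptx_runs_on((7, 0), (9, 0)))
```

The practical consequence is that shipping PTX alongside machine code trades start-up JIT time for forward compatibility with GPUs that did not exist when the binary was built.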

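The NVIDIA CUDA Compatibility Guide documents several compatibility modes. In the default case, the installed driver must be at or above the minimum version the toolkit requires. Minor-version compatibility (introduced with CUDA 11 and earlier called enhanced compatibility) lets an application built against a newer toolkit run on an older driver from the same major release, with some limitations. A separate forward-compatibility package lets supported data-center GPUs run newer toolkit releases on an older driver. None of these modes eliminates the need to know, precisely, which toolkit version and compute capabilities you are targeting before you ship.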
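The driver side of the matrix can be expressed as a small lookup. The sketch below is illustrative only: the function names are hypothetical, and the minimum Linux driver versions in the table are my reading of NVIDIA's minor-version compatibility table, so check them against the current Compatibility Guide before relying on them.

```python
# Assumed minimum Linux driver per CUDA major release under minor-version
# compatibility. Verify these numbers against NVIDIA's documentation.
MIN_LINUX_DRIVER = {
    11: (450, 80, 2),
    12: (525, 60, 13),
}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '535.104.05' into (535, 104, 5) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(toolkit: str, driver: str) -> bool:
    """True if the driver meets the assumed minimum for the toolkit's major release."""
    major = parse_version(toolkit)[0]
    if major not in MIN_LINUX_DRIVER:
        raise ValueError(f"no driver data for CUDA {major}.x in this sketch")
    return parse_version(driver) >= MIN_LINUX_DRIVER[major]


if __name__ == "__main__":
    print(driver_supports("12.2", "535.104.05"))
    print(driver_supports("12.4", "470.82.01"))
```

A check like this belongs in deployment tooling rather than in the build, since the build machine's driver says nothing about the target's.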

This is the problem the talk addresses: not just getting CUDA to compile, but encoding the full compatibility surface into your build system so that CI and development machines produce the same result.

How Conan 2.x Models the Matrix

Conan 2.x moved aggressively toward a settings-first model for binary compatibility. Every combination of settings and options produces a distinct package ID, meaning a package built for CUDA 12.2 with SM 80 is a different binary artifact from one built for CUDA 11.8 with SM 75. The relevant values live in your Conan profile; cuda_version is a custom setting in this setup, so it must first be declared in settings.yml (or settings_user.yml) with the toolkit versions you support:

[settings]
os=Linux
arch=x86_64
compiler=gcc
compiler.version=12
compiler.libcxx=libstdc++11
build_type=Release
cuda_version=12.2

In a conanfile.py, you express CUDA-specific dependencies conditionally and pass the relevant architecture targets down to CMake through the toolchain generator:

from conan import ConanFile
from conan.tools.cmake import CMakeToolchain, CMake, cmake_layout

class GPUInferenceLib(ConanFile):
    name = "gpu_inference"
    version = "1.0"
    # cuda_version is a custom setting; declare it in settings_user.yml first
    settings = "os", "arch", "compiler", "build_type", "cuda_version"
    options = {
        "with_cuda": [True, False],
        "cuda_arch": ["75", "80", "86", "90"],
    }
    default_options = {
        "with_cuda": True,
        "cuda_arch": "80",
    }

    def requirements(self):
        if self.options.with_cuda:
            self.requires("cudnn/8.9.7")

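    def layout(self):
        # Use the standard CMake folder layout (what the cmake_layout import is for)
        cmake_layout(self)

    def generate(self):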
        tc = CMakeToolchain(self)
        if self.options.with_cuda:
            # Convert the option object to a plain string before handing it to CMake
            tc.variables["CMAKE_CUDA_ARCHITECTURES"] = str(self.options.cuda_arch)
        tc.generate()

    def build(self):
        cmake = CMake(self)
        cmake.configure()
        cmake.build()

The key move here is that cuda_arch becomes a first-class option, not a comment in a README. The package binary ID encodes it. Two packages with different cuda_arch values are different packages. This means Conan’s binary cache can store and retrieve pre-built artifacts per architecture without any additional tooling.
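To make the idea of identity derived from configuration concrete, here is a minimal sketch that hashes settings and options into an ID. It is not Conan's actual package ID algorithm, and sketch_package_id is a hypothetical name. It only illustrates why two values of cuda_arch can never collide in a cache keyed this way.

```python
import hashlib
import json


def sketch_package_id(settings: dict, options: dict) -> str:
    """Hash the full configuration into a stable identifier.

    Keys are sorted so the same configuration always yields the same ID,
    regardless of the order in which it was written down.
    """
    payload = json.dumps({"settings": settings, "options": options}, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    settings = {"os": "Linux", "arch": "x86_64", "cuda_version": "12.2"}
    ampere = sketch_package_id(settings, {"with_cuda": True, "cuda_arch": "80"})
    hopper = sketch_package_id(settings, {"with_cuda": True, "cuda_arch": "90"})
    print(ampere != hopper)  # different architecture, different artifact
```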

For package_id() refinement, you can collapse minor CUDA versions together if you want to express forward compatibility within a major version:

def package_id(self):
    # Treat CUDA 12.x as a single package family
    self.info.settings.cuda_version = \
        str(self.info.settings.cuda_version).split(".")[0]

This is a deliberate trade-off. Coarser package IDs mean more cache hits; finer IDs mean stricter isolation. The right choice depends on whether your library uses APIs introduced in a minor toolkit release.
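The size of that trade-off is easy to estimate. The sketch below counts distinct cache keys for a small, made-up matrix of toolkit versions and architectures, first with full versions and then with minor versions collapsed; cache_keys is a hypothetical helper, not part of Conan.

```python
def cache_keys(toolkits: list[str], archs: list[str], coarse: bool) -> set[tuple[str, str]]:
    """Enumerate the (toolkit, architecture) keys a binary cache would hold.

    With coarse=True, only the major toolkit version contributes to the key,
    mirroring the package_id() refinement discussed above.
    """
    keys = set()
    for toolkit in toolkits:
        version = toolkit.split(".")[0] if coarse else toolkit
        for arch in archs:
            keys.add((version, arch))
    return keys


if __name__ == "__main__":
    toolkits = ["12.2", "12.4", "12.6"]
    archs = ["80", "90"]
    print(len(cache_keys(toolkits, archs, coarse=False)))  # 6 distinct binaries
    print(len(cache_keys(toolkits, archs, coarse=True)))   # 2 distinct binaries
```

Fewer keys mean fewer rebuilds, at the cost of assuming every 12.x toolkit produces an interchangeable binary for your code.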

CMake’s Native CUDA Language Support

CMake has had native CUDA support since 3.8 and has been steadily improving it. The old FindCUDA module has been deprecated since CMake 3.10. The modern pattern uses enable_language(CUDA) or, more commonly, listing CUDA as a project language:

cmake_minimum_required(VERSION 3.24)
project(gpu_inference LANGUAGES CXX CUDA)

set(CMAKE_CUDA_STANDARD 17)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)

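# In CMake 3.24+ you can use "native" for local development.
# For distribution, enumerate explicit architectures.
# This set() overrides any value injected by a toolchain file, so drop it
# when the architecture list comes from Conan.
set(CMAKE_CUDA_ARCHITECTURES 80 86 90)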

add_library(kernels STATIC
    src/attention.cu
    src/layernorm.cu
)

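# One generator expression per flag: an unquoted space would split the expression
target_compile_options(kernels PRIVATE
    $<$<COMPILE_LANGUAGE:CUDA>:--expt-extended-lambda>
    $<$<COMPILE_LANGUAGE:CUDA>:--generate-line-info>
)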

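# The CUDA:: imported targets are defined by the FindCUDAToolkit module
find_package(CUDAToolkit REQUIRED)

target_link_libraries(kernels PUBLIC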
    CUDA::cudart
    CUDA::cublas
    CUDA::cusolver
)

The $<$<COMPILE_LANGUAGE:CUDA>:...> generator expression is how you pass nvcc-specific flags without contaminating your host compiler invocations. The CUDA::cudart and CUDA::cublas targets come from CMake’s built-in FindCUDAToolkit module (introduced in CMake 3.17), which is separate from the deprecated FindCUDA and is loaded with find_package(CUDAToolkit).
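It helps to see what an architecture list such as 80 86 90 means once it reaches nvcc. The sketch below follows the mapping described in CMake's CUDA_ARCHITECTURES documentation: a bare number requests both PTX and machine code, while the -real and -virtual suffixes request one or the other. gencode_flags is a hypothetical helper; the exact command line CMake emits can differ between versions and generators, and suffixed architectures such as 90a are not handled here.

```python
def gencode_flags(architectures: list[str]) -> list[str]:
    """Translate CUDA_ARCHITECTURES-style entries into nvcc --generate-code flags."""
    flags = []
    for entry in architectures:
        number, _, kind = entry.partition("-")
        if not number.isdigit():
            raise ValueError(f"unsupported architecture entry: {entry!r}")
        if kind == "real":
            code = f"sm_{number}"  # machine code only
        elif kind == "virtual":
            code = f"compute_{number}"  # PTX only
        elif kind == "":
            code = f"compute_{number},sm_{number}"  # both
        else:
            raise ValueError(f"unknown suffix in {entry!r}")
        flags.append(f"--generate-code=arch=compute_{number},code=[{code}]")
    return flags


if __name__ == "__main__":
    for flag in gencode_flags(["80", "86", "90"]):
        print(flag)
```

A common distribution choice is machine code for each supported GPU plus PTX for the newest one, for example ["80-real", "86-real", "90"].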

For separable compilation, where device code in one translation unit references device functions in another, you need:

set_target_properties(kernels PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON
    CUDA_RESOLVE_DEVICE_SYMBOLS ON
)

This enables -rdc=true in nvcc and adds a device-link step at the end. It increases binary size and link time but is necessary for any non-trivial GPU codebase split across multiple .cu files.

Conan’s CMakeToolchain generator handles injecting the CMAKE_CUDA_ARCHITECTURES variable (as set in the recipe’s generate() method) along with the rest of the toolchain configuration. When you run conan install . --profile cuda_profile, it writes a conan_toolchain.cmake and a CMake presets file, which is how your cmake --preset invocation picks the toolchain up automatically.

The One-Command Workflow in Practice

The promise of the talk, and the broader Conan + CMake approach, is that the same three-command sequence works on a Linux workstation with an A100, a Windows machine with an RTX 4090, and a CI runner with a T4, with only the profile changing:

conan install . --profile:host=profiles/cuda12-linux-gcc12 --build=missing
cmake --preset conan-release
cmake --build --preset conan-release

The profile file carries the entire platform and CUDA configuration. CI stores the profile alongside the code. The developer picks the profile that matches their hardware. There is no manual export CUDA_HOME step, no editing of CMakeLists.txt to hardcode an architecture, no mismatch between what CI compiled and what the developer tested.

This contrasts sharply with the Python ecosystem’s approach. The conda ecosystem handles CUDA through package naming conventions and pinned metapackages (pytorch-cuda=12.1, for example). It works reasonably well for end-user environments, but it is not designed around encoding a custom library’s CUDA requirements into the identity of a reproducible artifact. You get reproducibility through environment files, not through binary identity. Rust, which has no GPU build toolchain to speak of yet, sidesteps the problem entirely through bindgen and FFI.

C++ with Conan is doing something structurally different: it is encoding the full hardware compatibility surface into the artifact identity itself. That is what allows binary caching across heterogeneous CI pools to work correctly.

Where the Rough Edges Still Are

The approach is sound but not without friction. Conan Center Index’s CUDA package coverage lags behind NVIDIA’s release cadence. When a new toolkit release such as CUDA 12.4 or 12.6 ships, there is often a window before CCI packages catch up, which forces teams to write their own conanfile.py wrappers around a locally-installed toolkit. Conan’s system_requirements() method, together with the helpers in conan.tools.system.package_manager, is the escape hatch here:

from conan.tools.system.package_manager import Apt

def system_requirements(self):
    # Fall back to a system-installed CUDA toolkit if no CCI package exists.
    # The apt package name depends on the repositories configured.
    Apt(self).install(["cuda-toolkit"])

Compute capability selection also requires discipline. Setting CMAKE_CUDA_ARCHITECTURES to native (CMake 3.24+) is convenient for local development because it targets exactly the GPU in the build machine. It is wrong for any artifact that might run on different hardware, because it produces a binary whose kernels may fail to launch on a different architecture, an error that goes unnoticed if the code does not check CUDA return values. Separating the development profile (native) from the distribution profile (explicit SM list) requires explicit policy at the profile level.
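One way to enforce that policy is a small check in CI that rejects native for anything meant to be distributed. This is a minimal sketch with a hypothetical function name; how the architecture list is read out of a profile is left out.

```python
def validate_arch_list(archs: list[str], for_distribution: bool) -> list[str]:
    """Return the architecture list unchanged if it is acceptable for its purpose.

    Distribution builds must name every target explicitly, because "native"
    only describes the GPU that happens to be in the build machine.
    """
    if not archs:
        raise ValueError("at least one CUDA architecture is required")
    if for_distribution and "native" in archs:
        raise ValueError(
            "'native' is only valid for development builds; "
            "list explicit architectures such as 80, 86, 90 instead"
        )
    return list(archs)


if __name__ == "__main__":
    print(validate_arch_list(["native"], for_distribution=False))
    print(validate_arch_list(["80", "86", "90"], for_distribution=True))
```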

Driver-toolkit mismatches remain impossible to catch at build time. A deployment environment with an R470 driver cannot run binaries built with CUDA 12.x, which needs an R525 or newer driver even under minor-version compatibility, and features introduced in later 12.x releases can require a newer driver still. Conan can encode the toolkit version; it cannot verify the driver on the target machine. Runtime checks with cudaDriverGetVersion() and a clear failure message are still the developer’s responsibility.
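The logic of such a runtime check is small. cudaDriverGetVersion() and cudaRuntimeGetVersion() report versions as a single integer, 1000 × major + 10 × minor, and the driver call reports 0 when no driver is installed. The sketch below covers only the decoding and the failure message, with hypothetical function names; obtaining the two integers from the CUDA runtime is left to the host application. The comparison is deliberately conservative and does not account for minor-version compatibility.

```python
from typing import Optional


def decode_cuda_version(value: int) -> tuple[int, int]:
    """Split the integer CUDA reports (for example 12040) into (major, minor)."""
    return value // 1000, (value % 1000) // 10


def describe_mismatch(driver_value: int, runtime_value: int) -> Optional[str]:
    """Return None if the driver is new enough, otherwise a readable error."""
    if driver_value == 0:
        return "No NVIDIA driver was detected on this machine."
    if driver_value >= runtime_value:
        return None
    d_major, d_minor = decode_cuda_version(driver_value)
    r_major, r_minor = decode_cuda_version(runtime_value)
    return (
        f"The installed driver supports CUDA {d_major}.{d_minor}, but this "
        f"application was built against CUDA {r_major}.{r_minor}. "
        "Update the NVIDIA driver or use a build for an older toolkit."
    )


if __name__ == "__main__":
    print(decode_cuda_version(12040))
    print(describe_mismatch(11080, 12020))
```

Failing at start-up with a message like this is far easier to diagnose than an opaque error from the first kernel launch.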

None of these are fundamental flaws in the Conan approach. They are gaps that close as the ecosystem matures. The structural insight, that CUDA compatibility belongs in your dependency model rather than in tribal knowledge, is correct and overdue.

What This Means for C++ in the AI Toolchain

Most AI workloads today are driven from Python, with C++ appearing at the library boundary through pybind11 or ctypes. As inference performance demands tighten, more teams are writing inference runtimes, custom attention kernels, and quantization layers directly in CUDA C++, then exposing thin Python APIs. The build tooling problem is real: a team building a custom kernel library needs reproducible builds across developer machines, CI, and deployment containers.

Conan and CMake, composed the way this talk describes, give C++ AI development the same baseline reproducibility that cargo gives Rust or that conda plus a well-constructed environment.yml gives Python, with the additional requirement of encoding GPU architecture into the artifact. That is not a small thing. For a language ecosystem that has lived without a standard package manager for forty years, getting dependency management right for one of the most complex hardware compatibility surfaces in modern computing is meaningful progress.
