The Conan Toolchain Contract and the Two-Stage CUDA CI Pipeline

The Promise and the Actual Workflow

A talk at using std::cpp 2026 pitches a clean goal for cross-platform C++ AI development: one source checkout, one command, identical builds on every platform. The underlying infrastructure is Conan 2.x for dependency management and modern CMake for build orchestration, targeting projects that depend on CUDA. The pitch is accurate as a description of reproducibility. As a literal description of the command sequence, it compresses three distinct stages into one.

The actual workflow looks like this:

conan install . --profile=cuda-12.4-linux --build=missing -of build
cmake --preset conan-cuda-release
cmake --build --preset conan-cuda-release

These are three separate commands because they have different jobs and different outputs. conan install resolves dependencies, compiles missing binaries from source, and writes a set of CMake integration files into the build directory. cmake --preset reads those files and configures the build system. cmake --build compiles. The reason they cannot be collapsed into one is a timing constraint that runs through the heart of CMake’s CUDA support.
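The ordering can still be hidden behind a single entry point without pretending the stages are one. The following is a minimal sketch, not part of the talk: build_all is a hypothetical helper name, and its defaults are simply copied from the commands above.

```shell
#!/usr/bin/env bash
# Sketch of a wrapper around the three stages. build_all is a hypothetical
# helper name; the profile, preset and output folder defaults are copied from
# the commands shown above.
build_all() {
    local profile="${1:-cuda-12.4-linux}"
    local preset="${2:-conan-cuda-release}"
    local out_dir="${3:-build}"

    # Stage 1: resolve dependencies, build missing binaries, and write the
    # CMake integration files into the output folder.
    conan install . --profile="$profile" --build=missing -of "$out_dir" || return 1

    # Stage 2: configure. This reads the files stage 1 just generated.
    cmake --preset "$preset" || return 1

    # Stage 3: compile.
    cmake --build --preset "$preset"
}
```

Calling build_all with no arguments runs the sequence with the defaults. Because each stage returns early on failure, a broken dependency resolution never reaches the configure step.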

Why the Toolchain Must Load Before project()

CMake performs compiler detection during the project() call. Before project() runs, no language is enabled: there is no detected C++ compiler, no CUDA compiler path, no architecture list. Once project() has run, the detected compilers are cached for the lifetime of the build directory. Setting CMAKE_CUDA_COMPILER in a CMakeLists.txt after the project() call does not change the compiler that was detected. Setting CMAKE_CUDA_ARCHITECTURES there is too late to influence detection: it only initializes the CUDA_ARCHITECTURES property of targets created afterwards.

The mechanism that allows configuration ahead of compiler detection is CMAKE_TOOLCHAIN_FILE. CMake reads the toolchain file early in the run, before it detects any compiler, which is why the file can inject compiler paths, include directories, and build settings into the configuration. This is not an accident: toolchain files were designed for cross-compilation, where the default compiler detection would produce wrong results.

conan install generates conan_toolchain.cmake in the build output directory. Where exactly the file lands depends on the -of argument and on the layout declared in the recipe. The file sets CMAKE_PREFIX_PATH and, when the profile provides them, CMAKE_CUDA_ARCHITECTURES and CMAKE_CUDA_COMPILER, along with other variables that must be present before project(). The critical constraint is that cmake --preset must point at this generated file. If the file does not exist because conan install has not run, configuration fails; if CMake is invoked without the toolchain file at all, the build silently uses wrong defaults.
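A cheap preflight check can catch both failure modes before CMake runs. The sketch below is an illustration under assumptions: check_toolchain is a hypothetical helper, and it assumes the generated file mentions CMAKE_CUDA_ARCHITECTURES by name whenever the profile injects it.

```shell
#!/usr/bin/env bash
# check_toolchain is a hypothetical helper. It returns 0 when the file exists
# and mentions CMAKE_CUDA_ARCHITECTURES, 1 when the file is missing, and 2 when
# the file exists but never mentions the variable.
check_toolchain() {
    local toolchain="$1"
    if [ ! -f "$toolchain" ]; then
        echo "error: $toolchain not found; run 'conan install' first" >&2
        return 1
    fi
    if ! grep -q 'CMAKE_CUDA_ARCHITECTURES' "$toolchain"; then
        echo "error: $toolchain does not set CMAKE_CUDA_ARCHITECTURES" >&2
        return 2
    fi
}
```

A CI step could call check_toolchain with the path of the generated conan_toolchain.cmake before configuring, and stop with a readable message instead of building for a default architecture.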

CMakePresets.json makes this dependency explicit and machine-readable:

{
  "version": 5,
  "configurePresets": [
    {
      "name": "conan-cuda-release",
      "generator": "Ninja",
      "binaryDir": "${sourceDir}/build/Release",
      "toolchainFile": "${sourceDir}/build/Release/generators/conan_toolchain.cmake",
      "cacheVariables": {
        "CMAKE_BUILD_TYPE": "Release"
      }
    }
  ],
  "buildPresets": [
    {
      "name": "conan-cuda-release",
      "configurePreset": "conan-cuda-release"
    }
  ]
}

The toolchainFile entry is the contract: it tells CMake where to find the Conan-generated toolchain file. Any developer who runs cmake --preset conan-cuda-release without first running conan install gets an immediate, intelligible error rather than a silent misconfiguration. The preset encodes the dependency without requiring human knowledge of the ordering.

The CUDA Language Lock

The practical consequence for CUDA projects specifically is that CMAKE_CUDA_ARCHITECTURES must be set before project(LANGUAGES CUDA) runs. If it is not, CMake falls back to the CUDAARCHS environment variable or, failing that, to the default architecture of the detected compiler, which is typically wrong for both CI and distribution.

A CMakeLists.txt that correctly relies on the Conan toolchain looks like this:

cmake_minimum_required(VERSION 3.24)

# conan_toolchain.cmake has already set CMAKE_CUDA_ARCHITECTURES
# and CMAKE_CUDA_COMPILER from the profile. project() locks these in.
project(inference_engine LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CUDA_STANDARD 17)

find_package(CUDAToolkit REQUIRED)

message(STATUS "CUDA architectures: ${CMAKE_CUDA_ARCHITECTURES}")
message(STATUS "CUDA compiler: ${CMAKE_CUDA_COMPILER}")
message(STATUS "CUDA version: ${CUDAToolkit_VERSION}")

add_library(gpu_kernels STATIC src/attention.cu src/matmul.cu)

set_target_properties(gpu_kernels PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON
    CUDA_RESOLVE_DEVICE_SYMBOLS ON
)

target_link_libraries(gpu_kernels PUBLIC
    CUDA::cudart_static
    CUDA::cublas
    CUDA::cuda_driver
    $<$<PLATFORM_ID:Linux>:${CMAKE_DL_LIBS}>
    $<$<PLATFORM_ID:Linux>:rt>
    $<$<PLATFORM_ID:Linux>:pthread>
)

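The CUDA::cuda_driver entry in target_link_libraries is not obvious from reading documentation. It resolves to libcuda, the driver API library. On a machine with no driver installed, FindCUDAToolkit can pick up the stub libcuda that ships in the toolkit's stubs directory, which satisfies the linker without requiring a real CUDA driver to be present. On a machine with a GPU, the real driver library is loaded at runtime. On a CPU-only CI runner, the binary links successfully and can be installed, distributed, or tested for host-side logic without a GPU.

For code that calls the driver API, this is the mechanism that makes GPU-less linking in CI possible. Compilation itself never needs a GPU: nvcc only needs the toolkit.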
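Since the same binaries run on both kinds of machine, a test launcher has to decide which suites make sense where. The following is a rough sketch, not something the talk prescribes: both function names are hypothetical, and using /dev/nvidiactl as the sign of a loaded driver is an assumption about a typical Linux GPU host.

```shell
#!/usr/bin/env bash
# Hypothetical test launcher helpers. gpu_available treats the presence of the
# NVIDIA control device node as "a driver is loaded"; the default path is an
# assumption about a typical Linux GPU host.
gpu_available() {
    local device_node="${1:-/dev/nvidiactl}"
    [ -e "$device_node" ]
}

# Prints the test suites that make sense on this machine, one per line.
# Host-side tests always run; GPU tests run only when a device node exists.
select_test_suites() {
    local device_node="${1:-/dev/nvidiactl}"
    echo "host"
    if gpu_available "$device_node"; then
        echo "gpu"
    fi
}
```

On a CPU-only runner select_test_suites prints only host, so the kernel tests are skipped deliberately rather than failing with a driver error.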

The Two-Stage CI Pipeline

Standard CI runners (GitHub Actions hosted, GitLab shared runners, most Jenkins agents) have no GPU. Requiring a GPU for every build is expensive and slows the feedback loop. The CUDA::cuda_driver stub enables a split that most CUDA C++ projects do not have today but should.

Stage one runs on CPU-only infrastructure and produces compiled binaries:

jobs:
  build:
    runs-on: ubuntu-22.04
    container:
      image: nvidia/cuda:12.4.1-devel-ubuntu22.04
    steps:
      - uses: actions/checkout@v4
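      # The devel image ships the CUDA toolkit but not pip, CMake, or Ninja.
      - name: Install build tools
        run: |
          apt-get update
          apt-get install -y --no-install-recommends python3-pip
          pip install conan cmake ninja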
      - name: Install dependencies
        run: |
          conan profile detect --force
          conan install . \
            --profile=profiles/linux-cuda-12.4-ampere \
            --build=missing \
            -of build/Release
      - name: Configure
        run: cmake --preset conan-cuda-release
      - name: Build
        run: cmake --build --preset conan-cuda-release --parallel
      - uses: actions/upload-artifact@v4
        with:
          name: cuda-binaries
          path: build/Release/

The container image matters. nvidia/cuda:12.4.1-devel-ubuntu22.04 includes the CUDA toolkit headers, nvcc, and the libraries needed for compilation. The -runtime- variant includes only the shared libraries needed to execute CUDA code, such as the CUDA runtime and the math libraries. The -base- variant is smaller still and ships just the CUDA runtime (cudart). For compilation, -devel- is the correct choice. Using -runtime- makes the configure step fail because neither nvcc nor the header files are present.

Stage two runs on GPU-equipped hardware and executes the binaries:

  test:
    needs: build
    runs-on: self-hosted-gpu-runner
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: cuda-binaries
          path: build/Release/
      - name: Run GPU tests
        run: |
          # upload-artifact does not preserve the executable bit
          chmod +x build/Release/tests/kernel_tests
          ./build/Release/tests/kernel_tests

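The GPU stage does not need the full CUDA toolkit. It needs a compatible driver and the CUDA shared libraries the binaries load at runtime. With the link line shown earlier, the CUDA runtime is linked statically, so in practice that means cuBLAS, which must be installed on the runner or supplied by a -runtime- container image.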

The Container Trap

One non-obvious issue in this pipeline: nvidia-smi inside a container does not report the toolkit version installed in the container. It reports the maximum toolkit version that the host driver supports, injected at container runtime by nvidia-container-toolkit. A container running nvidia/cuda:12.4.1-devel-ubuntu22.04 on a host with a 560.x driver will show CUDA 12.6 capability in nvidia-smi, even though only the 12.4 toolkit is inside the container.

This matters because it is tempting to use nvidia-smi output to verify which CUDA version is active. The correct verification is:

nvcc --version
# or
python3 -c "import subprocess; print(subprocess.check_output(['nvcc', '--version']).decode())"

The Conan profile and the container image tag are the authoritative version sources, not nvidia-smi. The profile specifies cuda_version=12.4, the container tag pins 12.4.1-devel, and together they constitute the reproducibility contract. Encoding this pairing in version control, rather than in environment configuration, is how the “identical builds on every platform” goal actually gets enforced.
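That pairing can be checked mechanically at the start of a CI job. The sketch below assumes the usual nvcc --version output, which contains a line of the form "Cuda compilation tools, release 12.4, V12.4.131"; nvcc_release and check_cuda_version are hypothetical helper names.

```shell
#!/usr/bin/env bash
# nvcc_release and check_cuda_version are hypothetical helper names.

# Reads `nvcc --version` output on stdin and prints the release, e.g. "12.4".
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Succeeds only if the nvcc found on PATH reports the expected release.
check_cuda_version() {
    local expected="$1"
    local actual
    actual="$(nvcc --version | nvcc_release)"
    if [ "$actual" != "$expected" ]; then
        echo "error: expected CUDA $expected, nvcc reports '$actual'" >&2
        return 1
    fi
}
```

Running check_cuda_version 12.4 inside the build container fails the job as soon as the image tag and the profile drift apart, regardless of what nvidia-smi prints.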

The Conan Profile as Environment Specification

The Conan profile encodes everything the build needs to know about its environment. On the compile side, a Linux profile for Ampere-class hardware looks like this:

[settings]
os=Linux
arch=x86_64
compiler=gcc
compiler.version=12
compiler.libcxx=libstdc++11
build_type=Release
# cuda_version is not a built-in Conan setting; declare it in settings_user.yml
cuda_version=12.4

[buildenv]
CUDA_PATH=/usr/local/cuda-12.4
CUDA_HOME=/usr/local/cuda-12.4
PATH=+(path)/usr/local/cuda-12.4/bin

[conf]
tools.cmake.cmaketoolchain:generator=Ninja
tools.cmake.cmaketoolchain:extra_variables={"CMAKE_CUDA_ARCHITECTURES": "80;86"}
tools.build:jobs=16

The [conf] section injects CMAKE_CUDA_ARCHITECTURES into the generated conan_toolchain.cmake. This is where the architecture list lives in version control, not scattered across CI scripts. A developer workstation profile might set "native" for fast iteration; the CI profile sets the explicit list for the production target fleet; a distribution profile sets "all-major" to cover future hardware. The profile becomes the single point of change when you add Hopper support, add a new CI runner class, or onboard a new developer with different hardware.

The CUDA Toolkit release notes and CMake FindCUDAToolkit documentation specify the constraints; the Conan profile is where those constraints become executable, version-controlled configuration rather than documentation that may or may not be read.

Binary Cache Warming

One operational refinement: pre-populating the Conan remote cache with common profile combinations avoids recompiling CUDA code on every CI run. CUDA compilation is slow. A modest inference library with custom attention and matmul kernels can take fifteen minutes to compile from scratch.

A nightly job that runs conan install --build=missing against every defined profile and uploads the results to a Conan remote transforms the day-to-day CI run from a compilation job into a cache retrieval. The --build=missing flag means the nightly job rebuilds only the packages that have no binary matching the profile, which in practice happens after a dependency version or a profile setting changes. Subsequent runs in the day pull precompiled binaries.
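The nightly job itself can be a short loop. This is a minimal sketch under assumptions: warm_cache is a hypothetical helper, and the profiles directory and remote name are placeholders for whatever the team actually uses.

```shell
#!/usr/bin/env bash
# warm_cache is a hypothetical helper; the profiles directory and the remote
# name are placeholders for whatever the team actually uses.
warm_cache() {
    local profiles_dir="$1"
    local remote="$2"
    local profile

    for profile in "$profiles_dir"/*; do
        [ -f "$profile" ] || continue
        # Builds only the packages that have no binary for this profile.
        conan install . --profile:host="$profile" --build=missing || return 1
    done

    # Push everything in the local cache so later CI runs can download it.
    conan upload "*" --remote="$remote" --confirm
}
```

A scheduled workflow would call something like warm_cache profiles my-remote once a night. Uploading everything in the local cache is the bluntest option; a stricter job could upload only the packages it just built.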

This is the same strategy conda uses with its package index, and what pip’s PyTorch wheel server provides for the Python ecosystem. The C++ version requires more setup because the team maintains the cache server rather than consuming a public one, but the principle is identical: separate the “build the binary” operation from the “use the binary” operation, and cache the expensive one.

The using std::cpp 2026 talk frames this as an infrastructure problem with a build system solution. The Conan profile, CMakePresets.json, and the CUDA::cuda_driver stub are not independently clever. Together they produce a CI pipeline where GPU-less compilation is reliable, reproducible, and fast enough to run on every push.