For most of CUDA’s existence, writing kernel code in any modern editor meant accepting a compromised experience: spurious errors on __syncthreads, no completion for device intrinsics, no navigation across the host/device boundary. This was not a matter of low priority or insufficient demand; the problem was structural. CUDA’s split compilation model is genuinely hostile to language servers, and fixing it properly required changing which compiler the toolchain is built around.
Spectral Compute’s extended clangd addresses this for both standard CUDA and the company’s Scale toolchain, which compiles CUDA source for AMD GPU targets. Understanding why the fix matters requires understanding why the problem was hard.
The Dual-Compilation Problem
A .cu file is simultaneously two programs. The host code, everything not annotated __device__ or __global__, compiles as standard C++ for the CPU. The device code compiles for the GPU with different builtins, different type rules, and different address spaces. These two programs share the same syntax and the same headers, but they diverge in every semantic detail that matters to a language server.
The __CUDA_ARCH__ macro is defined only during the device compilation pass. Large portions of CUDA code use it to gate device-only logic via #ifdef __CUDA_ARCH__. Device code has access to __syncthreads(), threadIdx.x, atomicAdd(), __shfl_down_sync(), half-precision types (__half), and vector types (float4, int2). The builtins and intrinsics among these do not exist in the host compilation at all. Template instantiations valid for device types may not exist for host types.
clangd’s core architecture is one AST per file. A single-pass host parse reports every device-only builtin as an undeclared identifier. A single-pass device parse misses all host-only code and its constraints. Either way, you get a stream of incorrect diagnostics that trains developers to ignore the language server for .cu files entirely.
nvcc deepened the problem from the other direction. nvcc is a driver program: it preprocesses the source, then calls separate binaries (some proprietary, like cicc) for device compilation, then calls GCC or MSVC for the host pass. Because it is a driver rather than a compiler, nvcc does not produce compile_commands.json entries in a format clangd can consume. The flag syntax differs from standard Clang. Even if you solved the dual-AST problem in theory, you could not easily feed nvcc-built projects into clangd.
What Clang’s CUDA Support Actually Changes
Clang gained CUDA support around 2015, driven largely by Google’s interest in unified toolchain diagnostics and sanitizers. The architectural decision Clang made was to visit the same source twice in a single frontend invocation: once with -fcuda-is-device set (device pass), once without (host pass). This produces two LLVM IR modules from one frontend run, rather than splitting the source and handing off to separate programs.
For tooling, the consequences are significant. The compilation pipeline is a single process with standard Clang flags. The resulting compile_commands.json entries are in Clang’s flag format, which clangd already understands. The compiler is open source, which means extending clangd to perform both passes is an engineering problem with a tractable solution rather than a wall.
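For illustration, a compile_commands.json entry from a Clang-based CUDA build might look like the following. The directory, file name, and optimization level are hypothetical; -x cuda, --cuda-gpu-arch, and --cuda-path are standard Clang flags that clangd can consume directly:

```json
[
  {
    "directory": "/home/user/project",
    "command": "clang++ -x cuda --cuda-gpu-arch=sm_80 --cuda-path=/usr/local/cuda -O2 -c kernel.cu -o kernel.o",
    "file": "kernel.cu"
  }
]
```

There is no equivalent entry nvcc could emit, because the flags it accepts are not in a format clangd understands.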
The -fcuda-is-device flag is the dividing line. When it is set, __CUDA_ARCH__ is defined, device builtins are in scope, and calling a __host__-only function from a __device__ function is a compile error. A clangd extension that maintains two ASTs per file and routes cursor queries to the appropriate AST based on context can provide semantically correct completions and diagnostics for both the CPU and GPU portions of the same source file. Determining which side the cursor is on is answerable from the AST: __device__ and __global__ function bodies are in device context, everything outside them is in host context, and __host__ __device__ functions require merged behavior from both.
This dual-AST model is the architectural contribution. It is not an approximation or a workaround; it is running the actual Clang frontend twice with the correct flags for each compilation side and presenting a unified view to the editor.
Inline PTX and the Language-Within-a-Language Problem
The Spectral Compute extension specifically adds diagnostics for inline PTX, which is worth examining separately because it is a different class of problem from the host/device split.
CUDA supports GNU-style extended assembly syntax for embedding PTX directly in kernel code:
```cuda
uint32_t lane;
asm volatile(
    "mov.u32 %0, %%laneid;\n"   // %% escapes to a literal % in the template
    : "=r"(lane)
);
```
The constraint letters map to PTX register types: "r" binds to .u32, "f" to .f32, "l" to .u64, "h" to .u16. The assembly template contains PTX instructions referencing outputs via %0, %1, and so on, alongside PTX special registers like %laneid, %warpid, and %smid.
Standard clangd treats the contents of those string literals as opaque. It validates the GNU asm structure (correct syntax for the constraint block, matching types between C variables and constraint letters), but it cannot parse the PTX inside. Errors like using a .u32 register where a predicate register is required, or referencing a special register not available on the targeted architecture, only surface at ptxas invocation time during compilation.
PTX has a versioned ISA, from PTX ISA 1.0 through 8.x as of CUDA 12.x, and valid instructions depend on the targeted SM architecture. SM 8.0 (Ampere) introduced asynchronous memory copy instructions (cp.async) that do not exist on SM 7.0 (Volta). Catching these errors at edit time rather than during a compilation run is the difference between a clear diagnostic and a silent kernel that produces wrong results or fails to compile in the wrong environment.
The engineering required is self-contained but nontrivial. A PTX-aware validator inside clangd needs to understand the typed register file, the instruction set, the constraint letter mapping, and the architecture-specific instruction availability. This is not something you can get by configuring clangd with better flags; it requires adding PTX grammar knowledge to the language server itself.
CUDA on AMD GPUs: The Same Toolchain, the Same Problem
Spectral Compute builds the Scale toolchain, which compiles CUDA source code for AMD GPU targets. Rather than translating CUDA to HIP at the source level (which is what AMD’s HIPIFY tool does), Scale uses Clang’s CUDA frontend directly and substitutes the AMDGPU backend for the nvptx64 backend. A compatibility library maps CUDA runtime API calls to ROCm equivalents at link time. Device library implementations of CUDA builtins are compiled for AMDGPU targets.
The AMDGPU and nvptx64 targets differ architecturally in meaningful ways. NVIDIA uses a 32-thread warp; AMD uses a 64-thread wavefront (with wave32 mode available on RDNA2 and later). NVIDIA shared memory corresponds to AMD’s Local Data Share. Warp-level intrinsics like __syncwarp() and __shfl_down_sync() map to different underlying instructions. Scale’s compatibility layer handles all of this at the compiler and library level, preserving the CUDA source as written.
The clangd extension therefore serves both use cases from the same infrastructure. NVIDIA CUDA users working with the Scale/Clang toolchain get device-code IDE support for standard .cu files. Users targeting AMD hardware via Scale get equivalent support for code destined for amdgcn targets. The source is the same; the diagnostic and completion infrastructure is the same.
This is worth contrasting with the HIPIFY approach. Source-level translation produces HIP code that differs from the original CUDA. The editor tooling you use for that HIP code is editing a generated file, not the file you maintain. Scale’s approach keeps the original CUDA source as the thing being edited, which means the IDE is grounded in the code you actually version and modify.
What This Means in Practice
For CUDA developers using the Scale toolchain, the practical gain is an editor that works correctly: completions for the full range of device intrinsics, diagnostics that distinguish host errors from device errors, go-to-definition that crosses the host/device boundary, hover information that shows device types correctly rather than treating float4 as an undefined identifier, and inline PTX validation before you invoke the compiler.
These are baseline features C++ developers have taken for granted for years. GPU programmers have gone without them because nvcc’s architecture made a proper solution structurally difficult. The language server problem and the nvcc problem were the same problem: building around a closed-source driver that does not expose a compiler-accurate interface to external tools.
Scale’s position as a Clang-based toolchain gave Spectral Compute direct access to Clang internals. The result is a principled extension to clangd rather than a layer of heuristic workarounds sitting on top of a fundamentally mismatched build system. The setup documentation and editor configuration guides are at docs.scale-lang.com, covering both the clang dialect of CUDA and standard CUDA code written against NVIDIA’s headers.