How Spectral Compute Extended clangd to Understand Both Sides of a CUDA File
Source: lobsters
CUDA code occupies an unusual position in the C++ ecosystem. A single .cu file contains two programs: one that runs on the host CPU, compiled with ordinary C++ rules, and one that runs on the GPU, compiled under an entirely different set of constraints. The two coexist in the same syntax, separated not by file boundaries but by attributes and macros. This arrangement has always been problematic for language servers, and it is why CUDA device code has long had worse editor support than nearly any other C++ variant.
Spectral Compute, the company behind the Scale language, has extended clangd to address this directly. The extension adds diagnostics for both the host and device portions of CUDA code, including syntax errors within inline PTX assembly. It works against standard CUDA as written for nvcc, and also against the clang dialect. The same toolchain can compile CUDA code targeting either NVPTX or AMDGPU.
The Host-Device Split as an Architectural Problem
Understanding why this was hard requires understanding what actually happens when a CUDA file is compiled.
When nvcc processes a .cu file, it runs multiple compilation passes internally. One pass produces host machine code for the CPU. Another produces PTX or SASS for the GPU. During device compilation, the macro __CUDA_ARCH__ is defined to the compute capability as a numeric value: 800 for sm_80 (Ampere), 900 for sm_90 (Hopper). During host compilation, that macro is undefined.
This means the following code compiles to entirely different things depending on which pass is running:
```cuda
#if defined(__CUDA_ARCH__)
__device__ void warp_reduce(float* val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        *val += __shfl_down_sync(0xffffffff, *val, offset);
}
#else
void cpu_fallback(float* val) {
    // host-side reference implementation
}
#endif
```
A language server that processes this file once, with a single compilation command, sees only one branch. If it processes the host pass, device intrinsics like __shfl_down_sync are invisible. If it processes the device pass, host-only APIs disappear. Either way, the diagnostics are incomplete.
clangd’s compile_commands.json model assumes each source file maps to one set of compilation flags. CUDA breaks this assumption at the architectural level. The workarounds developers have used, such as pointing clangd at a preprocessed version of the file or using separate flag sets, have always been approximations. The device-side code path, gated behind __CUDA_ARCH__, has remained largely opaque to the language server.
Inline PTX: A Language Inside a String
The inline PTX problem is distinct from the host-device split, and in some ways harder.
CUDA allows device functions to embed PTX assembly directly using an extended asm syntax:
```cuda
__device__ float fast_rcp(float x) {
    float result;
    asm volatile("rcp.approx.f32 %0, %1;" : "=f"(result) : "f"(x));
    return result;
}
```
From the C++ parser’s perspective, the PTX instruction is an opaque string literal. The compiler does parse it eventually, but only in the PTX emission phase. Standard language servers stop at the C++ AST level and never look inside the string. An error in the PTX instruction (a misspelled opcode, a wrong register type) goes undetected until compile time. And compile time for GPU code is slow.
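As a concrete illustration (a hypothetical typo, not an example from the extension's documentation), the C++ front end accepts the string below unchanged; the misspelled opcode only fails once the assembly is actually parsed:

```cuda
__device__ float fast_rsqrt(float x) {
    float r;
    // "rsqrtt" is a typo for "rsqrt" (the real instruction is
    // rsqrt.approx.f32). The C++ parser treats the string as opaque data,
    // so the error surfaces only when the PTX is assembled, long after
    // the code was written.
    asm("rsqrtt.approx.f32 %0, %1;" : "=f"(r) : "f"(x));
    return r;
}
```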
PTX has its own ISA versioning, and compute capabilities are paired with specific PTX ISA versions: support for the sm_89 target (Ada Lovelace) arrived in PTX ISA 7.8, while PTX ISA 8.0 introduced the sm_90a target on Hopper. An instruction like wgmma.mma_async, used for tensor core operations on Hopper, simply does not exist on earlier architectures. A static analyzer needs to know not just PTX syntax, but also which target the code is being compiled for, and must match the instruction set against it.
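For example, the redux.sync instruction family (a sm_80 / PTX ISA 7.0 feature; the sketch below is illustrative, not from the article) must be guarded by compute capability, and a target-aware analyzer can flag it when the file is being compiled for an older architecture:

```cuda
__device__ int warp_sum(int v) {
#if __CUDA_ARCH__ >= 800
    int r;
    // redux.sync.add.s32 requires sm_80 and PTX ISA 7.0; on older
    // targets the assembler rejects the instruction outright.
    asm("redux.sync.add.s32 %0, %1, 0xffffffff;" : "=r"(r) : "r"(v));
    return r;
#else
    // Pre-Ampere fallback: tree reduction over warp shuffles.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
#endif
}
```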
The register constraint system adds another layer. Each PTX register type has a corresponding constraint letter in the inline assembly syntax: "=f" for single-precision float, "=r" for a 32-bit integer register, "=l" for a 64-bit register, "=d" for double. A mismatch between the declared constraint and the actual operand type is a real class of error, one that currently surfaces only at compile time. Catching it at edit time requires parsing both the constraint string and the surrounding C++ type information to verify that the two agree.
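A minimal sketch of this error class (hypothetical example): pairing a 32-bit "f" constraint with a double operand is accepted by the C++ front end, but is wrong at the PTX level.

```cuda
__device__ double recip(double x) {
    double r;
    // Correct: "=d" pairs a 64-bit float register with the double operand.
    asm("rcp.approx.ftz.f64 %0, %1;" : "=d"(r) : "d"(x));
    // Writing "=f"(r) here instead would pair a 32-bit register
    // constraint with a 64-bit operand, exactly the mismatch that today
    // only surfaces once the compiler reaches the PTX layer.
    return r;
}
```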
What the Extension Actually Provides
Spectral Compute’s clangd extension addresses both problems. For the host-device split, it processes CUDA files under both compilation contexts and surfaces diagnostics from both passes. This requires maintaining two ASTs for the same file and reconciling their source positions when reporting errors back to the editor through LSP. It is not a simple configuration change; it requires rethinking how clangd maps from files to compilation units.
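To see why both ASTs are needed, consider a sketch like the following (hypothetical example): each preprocessor branch is visible in only one compilation context, so a single-pass language server can diagnose at most one of them.

```cuda
#include <cstdio>

__host__ __device__ void report(int n) {
#ifdef __CUDA_ARCH__
    // Only the device pass parses this branch; a host-only language
    // server never type-checks the device-side printf call.
    printf("device: %d\n", n);
#else
    // Only the host pass parses this branch; fprintf and stderr do not
    // exist in device code at all.
    std::fprintf(stderr, "host: %d\n", n);
#endif
}
```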
For inline PTX, the extension parses the assembly string as PTX and validates it against the target architecture. Writing an instruction that does not exist for your compute capability, or declaring a constraint that does not match the operand type, produces an editor diagnostic rather than a compiler error discovered minutes later. This closes a feedback loop that has been unusually long for CUDA developers relative to ordinary C++.
The extension supports standard CUDA as written for nvcc, not just the clang dialect. This matters because clang-cuda and nvcc are not fully compatible. Clang has supported CUDA compilation for years via the -x cuda flag and the NVPTX LLVM backend, but there are meaningful differences in how they handle templates, constexpr evaluation in device code, and certain intrinsics. Supporting both dialects means the tooling is useful even for codebases that have no intention of migrating away from nvcc.
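One concrete instance of the dialect gap (a sketch; exact behavior depends on toolkit version and flags): clang's CUDA mode permits calling a constexpr host function from device code by default, while nvcc has traditionally required the --expt-relaxed-constexpr flag for the same code.

```cuda
constexpr float half_pi() { return 1.5707963f; }

__global__ void fill(float* out) {
    // Accepted by clang-cuda out of the box; nvcc historically rejects
    // this call unless compiled with --expt-relaxed-constexpr.
    out[threadIdx.x] = half_pi();
}
```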
The AMDGPU Story
Beyond diagnostics, Spectral Compute describes the same infrastructure as part of a complete CUDA toolchain targeting AMD hardware via LLVM’s AMDGPU backend. This connects to what Scale is building more broadly: a CUDA-compatible programming model that compiles to both NVPTX and AMDGPU without requiring developers to rewrite code in HIP.
HIP, AMD’s official CUDA-compatibility layer, translates CUDA API calls to ROCm equivalents and provides headers that map CUDA types to HIP types. The result compiles with clang and targets AMDGPU, but the approach requires either source translation or header-level shimming. Scale takes a different position: the compiler treats AMD hardware as a first-class target from the start, with the tooling designed around that goal rather than adapted to it.
Whether this approach gains traction depends on how far the CUDA semantic model maps cleanly onto AMD’s execution model. Warp-level intrinsics, memory consistency guarantees, and texture hardware differ enough between NVIDIA and AMD that any compatibility layer involves real trade-offs. Spectral Compute’s bet is that handling those trade-offs at the compiler level, with full tooling support, produces better results than asking developers to manage them manually through ifdef branches or HIP ports.
What This Means in Practice
For CUDA developers using clangd, the practical difference is meaningful. Device-side code that was previously invisible to the language server, or that produced spurious errors because the host-pass AST was incomplete, now gets real diagnostics. Inline PTX errors that previously required a full compilation cycle to surface can be caught at edit time.
Demand for GPU compute has driven substantial growth in the size and complexity of CUDA codebases over the last several years. Code that manages device memory explicitly, launches kernels across multiple streams, and mixes device intrinsics with inline assembly is increasingly written by engineers who expect the same editor feedback they get in CPU-side C++. The gap has been real, and the tooling has lagged.
That this work comes from the company behind Scale is worth keeping in mind. The extension serves their compiler story as much as the broader CUDA community, and upstream clangd adoption remains an open question. But the technical problem they solved is genuine, and the solution works on standard CUDA code, not just Scale’s dialect. The documentation describes setup for VSCode and other editors that support LSP.