Three Languages in One File: What It Took for clangd to Understand CUDA
Source: lobsters
A language server is built on a simple assumption: each file belongs to one compilation unit, with one set of include paths, one preprocessor state, and one type system. CUDA breaks this assumption at the file level, at the semantic level, and then once more inside string literals. That last layer is what makes Spectral Compute’s extension of clangd technically significant beyond its immediate practical value.
How clangd normally works
clangd is the language server maintained alongside the Clang compiler. It reads a compile_commands.json file, replays each compilation command as an in-process Clang invocation, and exposes the resulting AST over the Language Server Protocol. Because it uses the actual Clang frontend rather than a separate parsing layer, the diagnostics it produces are semantically equivalent to compiler errors: the same type checker, the same template instantiation logic, the same overload resolution.
The binding constraint is that each entry in compile_commands.json represents a single compiler invocation with a fixed flag set. One file, one set of preprocessor defines, one target architecture. This works well for the C, C++, and Objective-C codebases clangd was designed around.
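A minimal compile_commands.json entry (paths and flags here are illustrative) shows the one-file, one-invocation shape that clangd replays:

```json
[
  {
    "directory": "/home/dev/project/build",
    "command": "clang++ -Iinclude -DNDEBUG -std=c++17 -c ../src/main.cpp -o main.o",
    "file": "../src/main.cpp"
  }
]
```

Everything clangd knows about a file's preprocessor state and include paths comes from that single command string.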
The dual compilation problem
A .cu file is compiled twice. The host pass produces x86-64 object code using a standard C++ compiler. The device pass produces PTX or CUBIN for the GPU target using a separate compiler backend. These two passes see different preprocessor environments: __CUDA_ARCH__ is only defined during device compilation, which means the conditionally compiled sections each pass processes are different subsets of the same file.
Device code also has a set of built-in identifiers that are simply not declared anywhere in CUDA’s public headers. threadIdx, blockIdx, blockDim, and gridDim are intrinsics provided by the device compiler. __syncthreads() maps to a PTX bar.sync opcode. The __global__ and __device__ qualifiers alter calling conventions and symbol visibility in ways that have no host equivalent.
__global__ void vector_add(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}
A standard clangd invocation pointed at this file will process it in host mode, reporting blockIdx, blockDim, and threadIdx as undeclared identifiers. None of those references is wrong. The language server simply lacks the device compilation context, and the result is a diagnostic stream that is uniformly wrong in a way that trains developers to ignore it.
Why Clang is the necessary foundation
NVIDIA’s own compiler, NVCC, is a compilation driver that orchestrates a pipeline of separate tools. It preprocesses the source, splits it, runs the host compiler, runs its internal device compiler (originally based on a forked LLVM), and links the results into a fat binary. This architecture makes building an LSP on top of NVCC essentially impossible: there is no unified AST, no single type system to query, and the internal device compiler is not designed for repeated incremental invocation.
Clang’s CUDA support, which has been in the mainline LLVM tree since around 2016 (initially through work by Justin Lebar and others at Google), takes a structurally different approach. Clang performs both host and device compilation in a single frontend invocation, building a unified AST before forking at code generation. The --cuda-device-only and --cuda-host-only flags, documented in LLVM’s CUDA guide, allow restricting the output to one pass while still using the full Clang frontend.
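Assuming a CUDA installation Clang can locate, the two views of one file can be produced explicitly. These are the documented flags from LLVM's CUDA guide; the file names and the sm_70 architecture are illustrative:

```shell
# Host pass only: __CUDA_ARCH__ undefined, device builtins unavailable
clang++ -x cuda --cuda-host-only -c kernel.cu -o kernel.host.o

# Device pass only: __CUDA_ARCH__ defined for the chosen GPU architecture
clang++ -x cuda --cuda-device-only --cuda-gpu-arch=sm_70 -c kernel.cu -o kernel.device.o
```

Both invocations run the same unified frontend over the same source; only the pass being emitted differs.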
This is the foundation that makes extending clangd viable. Because Clang models both compilation contexts, it is at least possible to teach clangd to maintain two analysis views of the same file simultaneously, one for the host and one for the device, and merge their diagnostics.
The PTX problem
Device C++ is the more obvious challenge, but PTX is where the problem becomes genuinely unusual. CUDA allows, and in performance-critical code frequently requires, inline PTX assembly embedded in asm volatile() statements:
__device__ void warp_reduce(float* val) {
    unsigned mask = 0xffffffff;
    float tmp;
    asm volatile(
        "shfl.sync.down.b32 %0, %1, 16, 31, %2;"
        : "=f"(tmp) : "f"(*val), "r"(mask)
    );
    *val = tmp;
}
PTX (Parallel Thread Execution) is a complete virtual ISA with its own type system, register naming conventions (.b32, .f32, .u64), memory space qualifiers (.global, .shared, .local, .param), and a substantial instruction set. From C++'s perspective, the contents of an asm volatile string are an opaque payload. Standard Clang does not attempt to parse them.
If there is a type mismatch between a PTX instruction and its operand constraint, say using the integer constraint "=r" where a float constraint "=f" is needed, the error surfaces at ptxas invocation time, seconds into the build. If the PTX instruction itself is malformed for the target architecture, the error may surface even later during JIT compilation at runtime.
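A sketch of that failure mode, using a hypothetical device function in which the only difference between the two statements is the output constraint:

```cuda
__device__ float constraint_mismatch(float v) {
    float out;
    // Wrong: "=r" hands the assembler a 32-bit integer register for a
    // .f32 destination. The C++ frontend accepts the statement; the error
    // surfaces only when ptxas assembles the PTX, or later at JIT time.
    asm volatile("add.f32 %0, %1, 0f3F800000;" : "=r"(out) : "f"(v));
    // Right: "=f" matches the .f32 destination register class.
    asm volatile("add.f32 %0, %1, 0f3F800000;" : "=f"(out) : "f"(v));
    return out;
}
```

(0f3F800000 is PTX's hex-float literal syntax for 1.0f.)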
Making clangd report errors inside these string literals requires implementing a PTX parser, mapping parse errors back to source positions within the string, and threading those diagnostics through the LSP diagnostic pipeline. Source positions inside a string literal are not positions that the C++ frontend ordinarily needs to track. This is a different class of work than the dual-context problem: it requires enough understanding of PTX grammar to produce useful, positioned error messages rather than simply flagging the entire asm statement.
Spectral Compute’s implementation handles this for both standard CUDA and Clang’s CUDA dialect, covering the cases where PTX instruction semantics differ slightly between compiler frontends.
The AMDGPU angle
Spectral Compute’s offering extends beyond the LSP. They have built a complete CUDA-to-AMDGPU toolchain, meaning the same source file can target both nvptx64-nvidia-cuda and amdgcn-amd-amdhsa without the source rewrites that porting to HIP normally requires. The clangd extension operates at the AST level, before code generation, so it provides the same diagnostic feedback regardless of the target GPU vendor.
For teams writing performance code that needs to run on both NVIDIA and AMD hardware, this matters because the standard cross-vendor alternative, HIP, involves maintaining a parallel codebase. A language server that understands device code for both targets is a prerequisite for that toolchain to be practically usable day-to-day.
The long tail of GPU tooling debt
The mismatch between CUDA’s compilation model and what language servers assume has existed since CUDA was introduced in 2007. NVIDIA’s own investments in developer tooling have concentrated on runtime analysis: Nsight Systems for profiling, Nsight Compute for kernel analysis, compute-sanitizer for memory errors. These are valuable tools, but they operate after compilation. The edit-time experience for GPU code has been, for most of that period, whatever an IDE could produce by treating .cu files as ordinary C++ and tolerating the cascade of device-code false positives.
Getting this right requires working simultaneously in three domains: the C++ host, the C++ device dialect, and the PTX virtual ISA embedded in string literals. The host/device split could theoretically be addressed by someone adding better CUDA support to upstream clangd. The PTX problem requires a distinct effort that probably only makes sense for a compiler toolchain vendor with a stake in the CUDA ecosystem.
Spectral Compute has done both. The practical result is that a developer working in CUDA can now get the same class of edit-time feedback, including inline PTX errors, that C++ and Rust programmers have had access to for years.