The CUDA Compiler Built for AMD That Gave NVIDIA Code a Language Server
Source: lobsters
For years, writing CUDA kernels in any editor with clangd integration meant tolerating a broken experience: threadIdx flagged as undeclared, __syncthreads unknown, completions stopping at the boundary of device code. GPU programmers learn quickly to ignore the diagnostic panel in .cu files entirely, which means they also miss the legitimate errors buried in the noise.
Spectral Compute’s extension of clangd addresses this directly. It provides accurate IDE feedback for both host and device portions of a CUDA file, including error detection inside inline PTX assembly strings. The extension works against standard CUDA as written for nvcc, not just the Clang dialect, which makes it usable for existing codebases without any migration.
The engineering story behind this is worth tracing, because the clangd work did not originate as an IDE improvement. It came from a structural requirement of building a CUDA compiler that targets AMD hardware.
AMD’s Tooling Advantage Was Always Architectural
HIP, AMD’s CUDA compatibility layer, has had workable language server support for device code since its early days. The reason has nothing to do with investment in IDE tooling. It is a consequence of AMD’s choice to build ROCm on Clang from the start.
AMD’s GPU compiler, amdclang++, is Clang targeting amdgcn-amd-amdhsa. It is not a separate tool with Clang-like syntax; it is Clang with the AMDGPU backend. That means it generates compile_commands.json natively, accepts standard Clang flags, and is built on the same frontend that clangd uses. A HIP developer who sets up their compilation database with device-side flags gets working language server support for kernel code because clangd and amdclang++ share an implementation.
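For illustration (directory, file name, and target architecture are invented), a compile_commands.json entry carrying device-side flags for a HIP source file might look like:

```json
[
  {
    "directory": "/home/user/project",
    "command": "/opt/rocm/bin/amdclang++ -x hip --offload-arch=gfx90a -c kernel.hip -o kernel.o",
    "file": "kernel.hip"
  }
]
```

Because amdclang++ is Clang, clangd can consume this entry directly; there is no flag translation layer between the build system's view of the file and the language server's.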
This was never a deliberate AMD investment in editor tooling. It fell out of the architectural choice to avoid maintaining a proprietary compiler frontend.
CUDA’s tooling developed differently. NVCC is a compilation driver that preprocesses source files, splits them, and orchestrates separate tools for host and device compilation. The device compiler includes a proprietary frontend (cicc) that has no public API, does not expose an AST, and cannot be used as a library. There is no nvcc --language-server mode, and there is no path toward one from inside NVCC’s architecture. A language server for CUDA device code would require a different compiler entirely.
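The orchestration is observable directly: nvcc's --dryrun flag prints the sequence of sub-commands it would run (the cudafe++ splitter, the cicc device frontend, ptxas, the host compiler) without executing them:

```sh
# Print NVCC's internal compilation plan without running it.
nvcc --dryrun -arch=sm_80 -c vector_add.cu -o vector_add.o
```

Every tool in that printed pipeline is a black box from a language server's perspective; none of them can be invoked as a library against an in-editor buffer.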
Why Scale Was in Position to Do This
Spectral Compute is building Scale, a CUDA-compatible toolchain that compiles the same source to both NVIDIA’s NVPTX target and AMD’s AMDGPU. The AMD target is the architectural constraint that matters here.
To compile CUDA for AMD hardware, you cannot use NVCC. That compiler only targets NVIDIA. The only viable path is Clang, which provides both the NVPTX backend and the AMDGPU backend under a single frontend, as documented in LLVM’s CUDA guide and AMDGPU backend documentation. Scale’s requirement to support AMD hardware forced them entirely onto Clang-based compilation infrastructure.
Once on Clang, the prerequisites for a working language server are present. Clang compiles CUDA without splitting the source: each frontend pass parses the entire translation unit, host and device code together, distinguishing the two through attributes rather than preprocessing. The --cuda-device-only and --cuda-host-only flags restrict output to one pass while still running the full frontend. Device builtins like __syncthreads() are declared as LLVM intrinsic-backed functions in Clang’s CUDA wrapper headers; threadIdx and blockIdx are declared there as extern const builtin-variable objects whose member accesses lower to NVVM intrinsics, usable only in device code. These declarations exist precisely because Clang needs them to type-check device code.
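Assuming a CUDA toolkit installation that Clang can locate, the two passes can be exercised from the command line (the architecture and file name here are illustrative):

```sh
# Device pass only: __CUDA_ARCH__ is defined, device builtins are usable.
clang++ -x cuda --cuda-gpu-arch=sm_80 --cuda-device-only -fsyntax-only vector_add.cu

# Host pass only: __CUDA_ARCH__ is undefined, host semantics apply.
clang++ -x cuda --cuda-gpu-arch=sm_80 --cuda-host-only -fsyntax-only vector_add.cu
```

A language server needs the results of both invocations for the same buffer, which is exactly the gap the next section describes.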
The Host-Device Split in Practice
The specific problem that clangd faces with a .cu file is that the Language Server Protocol and compile_commands.json assume each source file maps to a single compilation context. CUDA requires two.
During device compilation, __CUDA_ARCH__ is defined (to 800 for Ampere/sm_80, 900 for Hopper/sm_90, etc.), and large sections of the CUDA runtime headers gate their content behind this macro. A file like this:
__global__ void vector_add(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}
processed in host mode reports blockIdx, blockDim, and threadIdx as undeclared identifiers. None of those diagnostics is correct. The language server is operating correctly given a single-context analysis; the problem is that a single context is insufficient.
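The two contexts can be simulated in plain C++ by defining __CUDA_ARCH__ by hand. This is purely illustrative, and active_context is a hypothetical helper: in real CUDA code the macro is predefined by the compiler during the device pass, never by user code.

```cpp
#include <cstring>

// Illustrative only: in real CUDA code you never define __CUDA_ARCH__
// yourself; the compiler predefines it (e.g. to 800) during device
// compilation. We define it here to simulate the device pass.
#define __CUDA_ARCH__ 800

// Reports which compilation context this translation unit "sees".
inline const char* active_context() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
    return "device sm_80+";  // code gated for the device pass
#else
    return "host";           // what a single-context host analysis sees
#endif
}
```

Remove the #define and the same function reports the host view; a single-context language server is permanently stuck on one side of that #if.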
Scale’s extension maintains two ASTs for the same file simultaneously, one per compilation context, and reconciles source positions when reporting diagnostics back to the editor through LSP. This requires rethinking how clangd maps files to compilation units, but the underlying infrastructure (two separate Clang frontend invocations over the same file with different flag sets) is available, because that is how Clang already processes CUDA internally.
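One way to picture the reconciliation step (a sketch under my own assumptions, not Scale's implementation; merge_diagnostics and the Diag type are invented for illustration): key each pass's diagnostics by source position and surface only those that both contexts agree on, so context-specific false positives such as "threadIdx undeclared" in the host pass are dropped.

```cpp
#include <set>
#include <string>
#include <utility>
#include <vector>

// A diagnostic keyed by (line, column) with its message.
using Diag = std::pair<std::pair<int, int>, std::string>;

// Merge host-pass and device-pass diagnostics: a report reaches the editor
// only if the other context also fails at that position. Real reconciliation
// is more nuanced (e.g. genuinely host-only errors must survive), but this
// captures the core idea of filtering single-context false positives.
inline std::vector<Diag> merge_diagnostics(const std::vector<Diag>& host,
                                           const std::vector<Diag>& device) {
    std::set<std::pair<int, int>> device_pos;
    for (const auto& d : device) device_pos.insert(d.first);
    std::vector<Diag> out;
    for (const auto& d : host)
        if (device_pos.count(d.first)) out.push_back(d);  // both passes agree
    return out;
}
```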
Inline PTX Is a Separate Problem
The dual-context analysis addresses device C++ correctly. Inline PTX requires different work.
CUDA allows embedding PTX assembly directly in device functions using GCC-style extended asm:
__device__ float fast_reciprocal(float x) {
    float result;
    asm volatile("rcp.approx.f32 %0, %1;" : "=f"(result) : "f"(x));
    return result;
}
From the C++ parser’s perspective, the string "rcp.approx.f32 %0, %1;" is an opaque payload. Standard Clang passes it to ptxas without examining its contents. An error inside that string (a misspelled opcode, a wrong constraint letter, an instruction that does not exist on the target architecture) surfaces in a ptxas error message minutes into the build, pointing at a generated .ptx file with no useful source location.
PTX is a versioned ISA with architecture-specific instruction availability. The cp.async non-blocking copy instructions require at least sm_80. wgmma.mma_async, used for Hopper tensor core operations, requires sm_90a. An instruction valid on your development machine may not exist on the hardware your code actually runs on:
// cp.async requires sm_80 (Ampere) or later.
// On sm_75 (Turing), ptxas fails with a misleading error.
asm volatile(
    "cp.async.ca.shared.global [%0], [%1], 16;"
    :: "r"(dst_shared), "l"(src_global)
);
The constraint letter system adds another verification layer. Each constraint maps a C++ variable to a PTX register type: "f" for 32-bit float, "r" for 32-bit integer, "l" for 64-bit integer, "d" for double (the "=" prefix marks an output operand). A mismatch between the declared constraint and the actual C++ type is an error that both nvcc and Clang often fail to diagnose until runtime.
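A sketch of the mismatch (illustrative device code; reciprocal_bug is an invented example, not from Scale's test suite):

```cuda
__device__ float reciprocal_bug(double x) {
    float result;
    // Bug: the "f" constraint promises a 32-bit float register, but x is a
    // double, which should use "d". rcp.approx.f32 will operate on a
    // misdeclared operand rather than fail cleanly.
    asm volatile("rcp.approx.f32 %0, %1;" : "=f"(result) : "f"(x));
    return result;
}
```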
Catching these at edit time requires parsing PTX as a language, understanding which instructions are available at which compute capabilities, checking constraint letter types against the surrounding C++ types, and mapping all parse errors back to positions inside string literals. String literal interior positions are not something the C++ frontend tracks ordinarily; this is a distinct implementation problem from the dual-context analysis.
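A toy version of the availability check, with a hypothetical table and function names (nothing here is from Scale's implementation): extract the leading opcode from the PTX statement and look up the minimum compute capability it requires.

```cpp
#include <map>
#include <string>

// Minimal opcode -> minimum sm_XX table. A hypothetical subset for
// illustration; real availability data comes from the PTX ISA manual.
inline const std::map<std::string, int>& min_sm_table() {
    static const std::map<std::string, int> t = {
        {"rcp.approx.f32", 20},
        {"cp.async.ca.shared.global", 80},
        {"wgmma.mma_async", 90},
    };
    return t;
}

// Extract the opcode: the token up to the first space in the statement.
inline std::string ptx_opcode(const std::string& stmt) {
    return stmt.substr(0, stmt.find(' '));
}

// True if the opcode is known and available at the target capability.
// A real linter would also report the column offset inside the string
// literal so the editor can underline the exact token.
inline bool opcode_available(const std::string& stmt, int target_sm) {
    const auto& table = min_sm_table();
    auto it = table.find(ptx_opcode(stmt));
    return it != table.end() && target_sm >= it->second;
}
```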
What Remains Beyond Static Analysis
Two important error classes are not addressed by edit-time linting. State space mismatches (writing ld.global.f32 when the pointer targets shared memory) are syntactically valid PTX. They compile without warnings and produce incorrect results at runtime by reading from the wrong memory region. Verifying state space correctness requires tracking which memory space each pointer was allocated in across function boundaries, a problem that current static approaches do not handle reliably.
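An illustrative instance (state_space_bug is an invented example; syntactically valid, semantically wrong):

```cuda
__global__ void state_space_bug(float* out) {
    __shared__ float tile[256];
    tile[threadIdx.x] = threadIdx.x;
    __syncthreads();
    float v;
    // Bug: ld.global.f32 applied to a pointer into __shared__ memory.
    // ptxas accepts this; the load simply reads the wrong address space
    // at runtime and returns garbage rather than the shared-memory value.
    asm volatile("ld.global.f32 %0, [%1];" : "=f"(v) : "l"(&tile[threadIdx.x]));
    out[threadIdx.x] = v;
}
```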
Barrier correctness falls in the same category. A __syncthreads() call reached conditionally by some threads in a block is semantically invalid but syntactically acceptable. These bugs require compute-sanitizer or similar runtime analysis to catch reliably.
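The canonical form of that bug looks like the following sketch (barrier_bug is an invented example): threads with idx >= n never reach the barrier, so any block straddling the boundary has undefined behavior.

```cuda
__global__ void barrier_bug(const float* in, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        __shared__ float tile[256];
        tile[threadIdx.x] = in[idx];
        // Bug: only threads passing the guard reach this barrier. If any
        // thread in the block fails it, the block can deadlock or produce
        // undefined results. Syntactically this is perfectly valid CUDA.
        __syncthreads();
        out[idx] = tile[threadIdx.x];
    }
}
```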
The Upstream Question
Scale’s extension is currently packaged as part of their toolchain. Whether dual-context analysis and inline PTX linting make it upstream to mainline clangd is not settled. The dual-context work requires changes to how clangd maps files to compilation units, which is central enough to require maintainer buy-in. For now, the setup documentation covers VSCode and other LSP-compatible editors, and the extension works on standard CUDA code without requiring Scale’s dialect or targeting AMD hardware.
Spectral Compute disclosed that their Lobsters submission came from someone working at the company, which is worth keeping in mind. The tooling serves their compiler story and their AMD portability pitch. But the technical problem is genuine, the solution operates on standard CUDA, and the structural argument holds regardless of commercial context: AMD’s decision to build on Clang gave their users working language servers for GPU code years before CUDA had it. Scale’s requirement to support AMD hardware put them on the same infrastructure, which is why they were in position to close the gap.