
CUDA Device Code Finally Gets a Real Language Server

Source: lobsters

If you have written CUDA kernels in any modern editor, you know the experience: squiggly red lines under threadIdx.x, go-to-definition failing on __syncthreads(), code completion giving up inside a __device__ function. The tooling that works reliably for C++ stops working the moment you cross into device code. This has been the state of things for over a decade, and it stems from a structural property of how CUDA compilation works, not from any lack of effort on the editor tooling side.

Spectral Compute recently published documentation for a clangd extension that addresses this as part of their Scale language toolchain. It provides IDE feedback for both host and device code simultaneously, including error detection within inline PTX assembly. Understanding why this is harder than it looks requires going back to the dual compilation model at the heart of CUDA.

The Dual Compilation Model and Its Consequences

When you compile a .cu file, the compiler processes it twice. The first pass targets the host CPU (x86-64 in most cases) and produces standard object code. The second pass targets the GPU, producing either PTX (a portable virtual ISA) or a native device object directly. The two passes share the same source file but operate in distinct semantic worlds.

In device compilation, certain identifiers exist that have no equivalent in C++. threadIdx, blockIdx, and blockDim are built-in variables. __syncthreads() is a built-in function with no declaration in any header. The macro __CUDA_ARCH__ is defined (e.g., __CUDA_ARCH__ == 900 for Hopper hardware), and large portions of the CUDA runtime headers gate their content behind this macro. In host compilation, none of this exists. The two passes see different preprocessed source, different symbol tables, and different sets of valid identifiers.
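A minimal sketch of the split (hypothetical file, illustrative only): the same translation unit means different things in each pass, because __CUDA_ARCH__ is only defined during device compilation.

```cuda
// kernel.cu -- one source file, two compilation passes (illustrative sketch)
#include <cstdio>

__host__ __device__ int lane_width() {
#ifdef __CUDA_ARCH__
    // Device pass: __CUDA_ARCH__ is defined (e.g. 900 on Hopper),
    // and built-ins like warpSize and threadIdx are in scope.
    return warpSize;
#else
    // Host pass: __CUDA_ARCH__ is undefined; warpSize and threadIdx
    // do not exist here, so this branch must be plain C++.
    return 1;
#endif
}

__global__ void kernel() {
    printf("device lane width: %d\n", lane_width());
}
```

A single-context analyzer can only ever see one of the two branches, which is the root of the tooling problem described below.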

The Language Server Protocol and tools built on it, including clangd, operate on the assumption that a source file maps to a single compilation context. The compile_commands.json format has one entry per file, with one set of compiler flags, one target architecture, one set of preprocessor definitions. This model works for C++ and works for most cross-compilation scenarios, but it breaks for CUDA because the semantically correct way to analyze a .cu file requires two separate analyses with incompatible flag sets.
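For reference, a conventional compile_commands.json entry (paths hypothetical) carries exactly one flag set per file:

```json
[
  {
    "directory": "/home/dev/project/build",
    "command": "clang++ -std=c++17 -I/opt/cuda/include -c ../src/kernel.cu -o kernel.o",
    "file": "../src/kernel.cu"
  }
]
```

There is no field in this schema for "analyze this file a second time with different flags," which is exactly what CUDA needs.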

What clangd Sees Without This Extension

When clangd opens a .cu file using a standard compilation database, it runs its analysis in host mode. From that perspective, threadIdx is undeclared, __syncthreads() is undeclared, and warp-level primitives like __shfl_sync() show up as unknown identifiers. Calls between __device__ functions can be flagged as errors too, because the host-side analysis does not see device functions as valid call targets in that context.

The result is a stream of false positives that makes the diagnostics panel useless. Developers learn to ignore everything clangd reports in .cu files, which means they also miss the real errors. The tooling degrades to a text editor with syntax highlighting.

This is not a clangd bug. The tool is doing exactly what you would expect given a single-context analysis. The gap is at the level of the compilation model.

Why Inline PTX Makes This Worse

CUDA allows embedding PTX assembly directly in device code using GCC-style extended asm syntax. A typical inline PTX block looks like this:

__device__ void load_shared(float* out, const float* src) {
    // NOTE: ld.shared takes a shared-state-space address. The cast
    // below truncates a generic 64-bit pointer to 32 bits, which is
    // only meaningful if src already holds a shared-space address
    // (e.g. one obtained via __cvta_generic_to_shared) -- exactly the
    // kind of assumption no tool currently checks.
    asm volatile(
        "ld.shared.f32 %0, [%1];"          // "=f": 32-bit float register
        : "=f"(*out)
        : "r"((unsigned)((uintptr_t)src))  // "r": 32-bit integer register
    );
}

The constraint letters (=f, r) map C++ variables to PTX register types: f for 32-bit float, r for 32-bit general-purpose register, l for 64-bit, h for 16-bit, d for double. Mismatches between the C++ operand type and the PTX instruction type are not always caught by the compiler frontend; some only surface when ptxas (NVIDIA’s PTX-to-SASS assembler) processes the output. Others compile silently and produce wrong results at runtime.
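As a sketch of how the letters bind (hypothetical functions, not from the article), each constraint selects a PTX register class for the C++ operand:

```cuda
// "f" binds a 32-bit float operand to a .f32 register;
// 0F3F800000 is PTX's hex-float literal for 1.0f.
__device__ float add_one(float x) {
    float y;
    asm("add.f32 %0, %1, 0F3F800000;" : "=f"(y) : "f"(x));
    return y;
}

// "l" binds a 64-bit operand, "r" a 32-bit one; swapping them is
// the kind of mismatch that may only surface in ptxas, or not at all.
__device__ unsigned lo32(unsigned long long v) {
    unsigned r;
    asm("cvt.u32.u64 %0, %1;" : "=r"(r) : "l"(v));
    return r;
}
```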

State space specifiers are another class of silent bug. PTX memory operations require explicit state space qualifiers: .global for GPU DRAM, .shared for on-chip shared memory, .local for per-thread stack. Writing ld.global.f32 when you meant ld.shared.f32 is syntactically valid PTX that compiles, links, and runs, but reads from the wrong memory region with undefined results. No current static analysis tool catches this reliably.
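A sketch of the hazard (hypothetical kernel fragment): both spellings of the load below are well-formed PTX, and only one of them is what the surrounding code intends.

```cuda
extern __shared__ float tile[];

__device__ float read_tile(int i) {
    float v;
    // Convert the generic pointer to a shared-space address first;
    // the "l" constraint passes it as a 64-bit operand.
    unsigned long long addr = __cvta_generic_to_shared(&tile[i]);
    asm volatile("ld.shared.f32 %0, [%1];" : "=f"(v) : "l"(addr));

    // Writing ld.global.f32 here instead would also assemble and
    // run -- it would simply read from the wrong memory space.
    return v;
}
```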

Validating inline PTX at the clangd level requires parsing the PTX string as PTX, not just as an opaque string constant. That means a PTX parser integrated into the analysis pipeline, architectural awareness (some instructions only exist at specific compute capabilities), and understanding of CUDA’s constraint letter system. Spectral Compute’s extension addresses this class of error specifically, which is a nontrivial piece of the implementation.

Why Clang Makes This Tractable

The traditional CUDA compiler, nvcc, was designed as a build driver. It orchestrates multiple internal tools, processes files sequentially, and does not expose an AST or semantic model that external tools can query. There is no nvcc --language-server mode and no path to adding one.

Clang’s CUDA support, which reached practical maturity around 2017 after initial work by Justin Lebar at Google (documented in LLVM’s CUDA compilation guide), takes a different approach. Clang performs both host and device compilation within a single compiler invocation, building a unified AST that spans both contexts before forking into two CodeGen passes. More importantly for tooling purposes, Clang accepts --cuda-device-only and --cuda-host-only flags that restrict compilation to one side of the split.

This means a compilation database can include two entries for the same .cu file: one for host analysis (standard flags, or --cuda-host-only) and one for device analysis (with --cuda-device-only and a --cuda-gpu-arch target, which causes Clang to define __CUDA_ARCH__ to a representative value). clangd, when presented with both entries, can run separate analyses for each context and merge the results. Device builtins like __syncthreads() are declared as intrinsic-backed functions in Clang's CUDA wrapper headers, so they have proper declarations visible to the device-side analysis. threadIdx and friends are declared as extern const variables in device mode.
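A sketch of such a database (paths and architecture hypothetical); the two entries differ only in which side of the split they analyze:

```json
[
  {
    "directory": "/home/dev/project/build",
    "command": "clang++ -std=c++17 --cuda-host-only -c ../src/kernel.cu",
    "file": "../src/kernel.cu"
  },
  {
    "directory": "/home/dev/project/build",
    "command": "clang++ -std=c++17 --cuda-device-only --cuda-gpu-arch=sm_90 -c ../src/kernel.cu",
    "file": "../src/kernel.cu"
  }
]
```

Stock clangd will pick one entry per file and ignore the other; running both analyses and merging their diagnostics is the part the extension supplies.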

The AMDGPU backend follows the same pattern. Clang targeting amdgcn-amd-amdhsa for AMD hardware, combined with the ROCm HIP layer, uses the same dual-analysis approach. Scale’s toolchain compiles to both nvptx64-nvidia-cuda and amdgcn-amd-amdhsa targets from the same source, so a clangd setup that understands device compilation for one can be extended to the other without fundamental architectural changes.

What the Extension Provides

The practical result, according to Spectral Compute’s documentation, is IDE feedback across the full surface of a CUDA kernel: syntax errors in device code, type mismatches on device function calls, errors within inline PTX strings, and hover documentation for CUDA built-ins. This works for standard CUDA as written with nvcc compatibility in mind, and also for Clang’s CUDA dialect where the two differ.

The IDE improvements are independent of Scale’s portability story. You do not need to switch your build system to Scale to benefit from better clangd integration; the compilation database setup that enables device-side analysis is a configuration change, not a rewrite. Whether Spectral Compute exposes this as a standalone tool or as part of the Scale package is worth watching.

The Broader Gap

NVIDIA’s developer tooling for CUDA has historically concentrated on runtime analysis: Nsight Systems for profiling kernel timelines, Nsight Compute for per-kernel hardware metrics, compute-sanitizer for memory errors, CUDA-GDB for device debugging. These are all post-compilation tools. The gap in pre-compilation static analysis and IDE integration has been treated as a second-order problem, partly because the CUDA ecosystem grew up around researchers and HPC programmers who tolerated raw compilation workflows, and partly because nvcc’s architecture made a language server impractical.

Clang’s CUDA support changed the underlying constraints, and Scale’s clangd work is a concrete demonstration of what becomes possible when you build IDE tooling on a compiler that was designed for it. The inline PTX validation is the part that requires genuinely new work; everything else is mostly a matter of correctly configuring the compilation pipeline that Clang already supports.

For anyone writing GPU kernels daily, this is worth tracking. Reliable IDE feedback in device code is not a luxury; it is the same table-stakes tooling that the rest of systems programming has had for years.
