If you have written CUDA kernels in a modern editor, you know the experience: squiggly lines under threadIdx.x, go-to-definition failing on __syncthreads(), completions giving up inside a __device__ function body. The tooling that works reliably for C++ stops at the device boundary. This has been the state of things for most of CUDA’s existence, and it is not accidental.
Spectral Compute recently published documentation for a clangd extension that provides IDE feedback across both host and device code in .cu files, including error detection within inline PTX assembly strings. Understanding why this is harder than it looks requires going back to the compilation model at the root of the problem.
The Dual Compilation Model and the compile_commands.json Gap
CUDA compilation is fundamentally a two-pass process. The same .cu file gets compiled twice: once targeting the host CPU, once targeting the GPU (producing PTX or a device object). In device compilation, threadIdx, blockIdx, and __syncthreads() exist as built-in identifiers. In host compilation, they do not. The macro __CUDA_ARCH__ is defined in device passes and undefined in host passes, and large portions of the CUDA runtime headers gate their content behind it. The two passes see different preprocessed source and different symbol tables.
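The effect of the two passes is visible in any code that branches on __CUDA_ARCH__. A minimal sketch (the function name is illustrative):

```cuda
// One source file, two preprocessed views: the device pass defines
// __CUDA_ARCH__ (e.g. 890 for sm_89); the host pass does not.
__host__ __device__ inline int lane_width() {
#ifdef __CUDA_ARCH__
    return warpSize;   // device pass: the built-in variable is in scope
#else
    return 32;         // host pass: warpSize does not exist here
#endif
}
```

A language server analyzing the file in a single context can only ever see one of the two branches.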
The Language Server Protocol assumes one compilation context per source file. The compile_commands.json format, which clangd uses to determine what flags were used to build each file, conventionally carries one entry per file: one set of flags, one target architecture, one set of preprocessor definitions. This works for C++ and for most cross-compilation setups. For CUDA, semantically correct analysis requires two separate, incompatible contexts for the same file.
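The schema makes the mismatch concrete. A typical entry (paths illustrative) binds exactly one flag set to the file, with nowhere to express a second context:

```json
[
  {
    "directory": "/home/user/project/build",
    "command": "clang++ -x cuda --cuda-gpu-arch=sm_89 -O2 -c ../src/kernels.cu -o kernels.o",
    "file": "../src/kernels.cu"
  }
]
```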
Compounding this is a more basic issue: NVCC is not a compiler. It is a driver program that preprocesses .cu files, splits them, and delegates host code to whatever system C++ compiler is available while running device code through NVIDIA’s internal cicc frontend. NVCC does not emit compile_commands.json. There is no supported path from an NVCC-based build to the compilation database that clangd needs, because NVCC’s internal flags are not a public interface. Tools like Bear or compiledb can intercept build commands and generate a database, but the resulting entries contain NVCC invocations that clangd cannot parse as Clang flags. Developers resorting to .clangd config files with hardcoded -x cuda --cuda-gpu-arch=sm_89 -I/usr/local/cuda/include flags get partial functionality but miss anything that depends on the actual project-specific compilation flags.
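The hardcoded-flags workaround typically looks something like this (a sketch; the architecture and include path are guesses that a real project would need to adjust):

```yaml
# .clangd -- manual fallback; flags are fixed regardless of how the
# project actually builds the file
CompileFlags:
  Add:
    - -x
    - cuda
    - --cuda-gpu-arch=sm_89
    - -I/usr/local/cuda/include
```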
What clangd Sees Without Device Awareness
When clangd opens a .cu file using a standard host-mode compilation database entry, it analyzes the file in host context. threadIdx is undeclared. __syncthreads() is undeclared. Every call to a __device__-only function generates a false error because the host-side analysis does not model device function visibility. Warp primitives like __shfl_sync() and cooperative_groups types are unresolved.
The diagnostic panel fills with false positives and developers learn to ignore it. That means real errors (actual type mismatches, missing headers, invalid device code) go unnoticed until compilation. The tooling degrades to syntax highlighting with a broken error reporter.
This is not a bug in clangd. The tool is behaving correctly given a single-context analysis of a file that needs dual-context analysis. The architectural mismatch is upstream.
Why Inline PTX Specifically Is Hard
CUDA lets you embed PTX assembly directly in device code using GCC-style extended asm:
__device__ float atomic_add_shared(float* addr, float val) {
    float old;
    asm volatile(
        "atom.shared.add.f32 %0, [%1], %2;"
        : "=f"(old)                              // atom returns the prior value
        : "l"(__cvta_generic_to_shared(addr)),   // 64-bit shared-window address
          "f"(val)
        : "memory"
    );
    return old;
}
The constraint letters (=f, r, l, h, d) map C++ operands to PTX register classes. The PTX string itself is opaque to the C++ frontend: both NVCC and standard Clang treat it as a string literal and pass it through to NVIDIA’s ptxas assembler. Errors in the PTX only surface when ptxas runs, late in the pipeline, with error messages pointing at the generated .ptx file rather than the original source line.
Some of the most insidious bugs here are not syntax errors but semantic mismatches. State space specifiers in PTX memory operations, .global versus .shared versus .local, must match the actual memory being addressed. Writing ld.global.f32 when targeting shared memory is syntactically valid PTX that compiles, links, and runs, but reads from GPU DRAM instead of on-chip shared memory, producing wrong results with no error message. Register constraint mismatches, using "r" (32-bit) for a 64-bit pointer instead of "l", can silently generate incorrect code on some architectures.
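A sketch of the state-space mismatch described above, assuming a shared-memory tile; every line is valid PTX and compiles cleanly:

```cuda
__shared__ float tile[256];

__device__ float load_from_tile(int i) {
    float v;
    // BUG: tile lives in .shared, but the instruction names .global.
    // ptxas accepts it; the load simply targets the wrong state space
    // and returns garbage from DRAM instead of the on-chip tile.
    asm volatile("ld.global.f32 %0, [%1];"
                 : "=f"(v)
                 : "l"(&tile[i]));
    return v;
}
```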
Catching these at the editor level requires a PTX parser integrated into the language server analysis pipeline, awareness of compute capability constraints (some instructions like tensor core mma operations only exist at sm_70+), and cross-referencing PTX operand types against the C++ types of the variables being passed. This is genuinely new work, not just a configuration problem.
Why Clang Makes This Tractable
Clang’s CUDA support, which reached practical maturity around 2017 after initial work documented in LLVM’s CUDA compilation guide, takes a fundamentally different approach from NVCC. Clang performs host and device compilation within a single compiler invocation, building a unified AST before forking into two separate CodeGen passes. The --cuda-device-only and --cuda-host-only flags restrict this to one side of the split.
For tooling purposes, this means a compilation database can include two entries for the same .cu file: one with standard host flags, one with --cuda-device-only and a --cuda-gpu-arch flag. clangd, presented with both entries, can run separate analyses and present device-side diagnostics for __device__ functions and host-side diagnostics for host code. Device built-ins like __syncthreads() are declared as LLVM intrinsic-backed functions in Clang’s CUDA headers, so the device-side analysis sees proper declarations rather than undefined symbols. The threadIdx and blockIdx built-in variables are declared as extern const struct variables in device mode.
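Under that model, a database could carry both contexts for one file, sketched here with illustrative paths (stock clangd picks a single command per file, so consuming both entries is part of what a CUDA-aware language server has to add):

```json
[
  {
    "directory": "/home/user/project/build",
    "command": "clang++ -x cuda --cuda-host-only --cuda-gpu-arch=sm_89 -c ../src/kernels.cu",
    "file": "../src/kernels.cu"
  },
  {
    "directory": "/home/user/project/build",
    "command": "clang++ -x cuda --cuda-device-only --cuda-gpu-arch=sm_89 -c ../src/kernels.cu",
    "file": "../src/kernels.cu"
  }
]
```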
Because Clang can emit compile_commands.json entries natively (the -MJ flag writes a compilation database fragment per invocation), and because clangd is built on the same Clang frontend, a compiler that uses Clang under the hood automatically gets a language server that understands its output. This is the core architectural insight behind Scale’s approach.
The Scale Implementation
Spectral Compute’s Scale toolchain uses a Clang-based compiler rather than NVCC. The Scale+clangd integration bundles a custom clangd build that understands Scale’s dialect and the CUDA device/host split, packaged with a VS Code extension that handles compilation database configuration automatically. The compiler and the language server share the same frontend, so clangd’s analysis is semantically equivalent to what the compiler would see.
The extension claims support for both standard CUDA and the Clang dialect (where the two differ, mostly in edge cases around template instantiation and some NVCC-specific relaxations), and for compilation targeting both NVIDIA’s nvptx64 backend and AMD’s AMDGPU backend. The AMDGPU connection is straightforward: the same dual-analysis setup that enables device-side feedback for NVIDIA targets works for AMD targets, since both go through Clang’s unified frontend. ROCm’s HIP stack already uses amdclang++ as its compiler, which is why HIP projects have historically had somewhat better language server support than pure NVCC CUDA projects.
What This Does Not Solve
Even with full device-side analysis, some GPU-specific correctness problems remain outside what a language server can practically catch.
Warp divergence is the canonical example. When threads in a warp take different branches, the warp serializes, executing both paths with inactive threads masked. This is not a C++ or PTX type error. Detecting it statically requires dataflow analysis tracking which values depend on threadIdx and whether those values control branches that can diverge within a warp. This is well outside the scope of clangd’s diagnostic model.
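A minimal sketch of the pattern (path_a and path_b are stand-ins for real work):

```cuda
__device__ float path_a(int i) { return i * 2.0f; }
__device__ float path_b(int i) { return i * 3.0f; }

__global__ void divergent(float* out) {
    // The condition depends on threadIdx.x, so lanes within one warp
    // disagree: the warp runs both paths serially, masking inactive
    // lanes. Nothing here is a type error; only dataflow analysis from
    // threadIdx.x to the branch condition would reveal the divergence.
    if (threadIdx.x % 32 < 16)
        out[threadIdx.x] = path_a(threadIdx.x);
    else
        out[threadIdx.x] = path_b(threadIdx.x);
}
```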
Barrier correctness is similar. __syncthreads() requires that all threads in a block reach the call, or the behavior is undefined. Conditionally-reached barriers are a common source of correctness bugs. No current static tool catches this reliably.
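A sketch of the conditionally-reached barrier, the kind of bug that looks entirely reasonable in the editor:

```cuda
__global__ void scale_bad(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 2.0f;
        __syncthreads();   // UB: threads with i >= n never reach the
                           // barrier, so the block can hang or worse
    }
}
```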
Inline PTX semantic errors, the state space and constraint mismatches described above, require additional infrastructure beyond PTX syntax checking. Spectral Compute’s documentation describes catching syntax errors within inline PTX; the semantic layer is harder and it is not clear from the published documentation how deep that analysis goes.
Why It Took This Long
NVIDIA’s developer tooling has historically focused on runtime analysis: Nsight Systems for kernel timeline profiling, Nsight Compute for hardware metrics, compute-sanitizer for memory errors, cuda-gdb for device debugging. These are all post-compilation tools. Pre-compilation static analysis and editor integration were treated as second-order concerns, partly because the CUDA ecosystem grew up in HPC and research contexts where raw compilation workflows were accepted, and partly because NVCC’s architecture made a conventional language server impractical without replacing the compiler.
Clang’s CUDA support changed the underlying constraints. Scale’s clangd work demonstrates what becomes possible when you build on a compiler that was designed to expose a semantic model. The inline PTX validation is the part that requires new work. Everything else, the device-side completions, go-to-definition for __device__ functions, accurate diagnostics in kernel bodies, follows from configuring the compilation pipeline that Clang already supports and that NVCC never made accessible.