The Language Inside the String: What It Takes to Lint Inline PTX in CUDA
Source: lobsters
CUDA developers who write performance-critical kernels often reach for inline PTX. It is the escape hatch from the abstraction: when the compiler’s register allocation is not what you need, when you want a memory fence variant that the CUDA runtime does not expose, when a new GPU instruction was added before NVIDIA wrapped it in a C++ API. You drop into asm volatile(...) and write raw PTX.
That string literal is opaque to every language server that has ever tried to support CUDA. Until Spectral Compute’s extension to clangd, a typo in an instruction mnemonic, a wrong register class, or a bad memory qualifier would go unnoticed until you compiled. The PTX assembler, ptxas, would then report an error with source information stripped, pointing to a line in the preprocessed output rather than in your .cu file.
This is not an accident of neglect. Implementing inline PTX diagnostics correctly requires building what is essentially a full PTX parser, mapping error spans from inside string literals back to editor cursor positions, and wiring the result into the LSP diagnostic pipeline. Spectral Compute’s extended clangd delivers PTX feedback as ordinary editor diagnostics, indistinguishable in the editor from any other warning or error.
PTX Is a Complete Language
PTX, the Parallel Thread Execution virtual ISA, is not a simple assembly language. It has been NVIDIA’s stable intermediate representation since CUDA’s introduction in 2006. The design goal was a virtual ISA that NVIDIA’s driver could JIT-compile to SASS (the actual GPU machine code) at load time, allowing compiled programs to run on future GPU generations without recompilation.
PTX has a type system. Registers are typed: %r for 32-bit integers, %rd for 64-bit addresses, %f for single-precision floats, %fd for double-precision, %p for predicates. Instructions carry type qualifiers: .u32, .f32, .b64, .pred. Memory accesses carry space qualifiers: .shared, .global, .local, .param. Synchronization instructions carry scope qualifiers: .cta, .gpu, .sys. Newer PTX versions add atomic ordering qualifiers: .acquire, .release, .acq_rel.
A load with acquire semantics on a 32-bit integer looks like this:
__device__ void load_acquire(const int* addr, int* result) {
    asm volatile(
        "ld.acquire.gpu.b32 %0, [%1];"
        : "=r"(*result)
        : "l"(addr)
        : "memory"
    );
}
There are at least six ways to write that instruction incorrectly: a wrong type qualifier, a wrong scope qualifier, a wrong memory space, a misspelled instruction name, a wrong operand count, or a mismatch between the inline assembly constraints ("=r", "l") and the operand’s actual PTX register class. None of these was diagnosable at edit time before Spectral Compute’s work. All of them produce either a silent miscompilation or a confusing compile-time error from ptxas.
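As an illustration of the last failure mode, consider a hypothetical snippet (not from the article) where the constraint and the operand type disagree. The "=f" constraint binds a single-precision float register, so pairing it with a 32-bit integer load is exactly the kind of mismatch that previously surfaced only as a ptxas error:

```cuda
__device__ int broken_load(const int* addr) {
    int v;
    // Wrong: "=f" binds a .f32 register (%f), but ld.global.u32 produces
    // a 32-bit integer and v is an int. The correct constraint is "=r".
    asm volatile("ld.global.u32 %0, [%1];" : "=f"(v) : "l"(addr));
    return v;
}
```

With the extended clangd, this mismatch is flagged in the editor rather than after a full compile.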
What String Literal Parsing Means for a Language Server
Most language server challenges are about scale, latency, or multi-file coordination. Inline PTX presents a structurally different problem: there is a complete second language embedded inside string literals in the first language.
This is not unique to PTX. SQL embedded in application code has the same structure. Regex patterns passed to std::regex constructors are a simpler version. GraphQL queries inside JavaScript template literals are another. The general problem is that a string opaque to the outer language’s parser is actually a program in an inner language with its own grammar and semantics.
The standard approaches to this problem are: ignore it entirely, detect patterns heuristically and run a separate linter, or build first-class support with accurate span tracking. Spectral Compute took the third path for PTX.
The challenge with accurate span tracking is that error positions reported by the PTX parser refer to offsets within the string literal content, not to positions in the source file. Converting those offsets to editor positions requires tracking where the string literal starts in the source, how the string is structured (raw string literals versus ordinary literals, concatenated literals), and mapping character offsets through any preprocessor transformations.
This is not deeply complex, but it is engineering work that has to be done carefully. A language server that reports a PTX error at the wrong line is arguably worse than no error at all, because it trains the developer not to trust the feedback.
Where Inline PTX Actually Appears in Production CUDA
Inline PTX is not a niche feature. In a well-tuned CUDA codebase it appears in several recurring patterns:
Custom atomics: Before NVIDIA expanded the typed atomic API, developers wrote atomics directly in PTX. Code from a few years ago still carries these patterns, and some atomic combinations still lack high-level equivalents.
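A sketch of that pattern, as a hypothetical helper using the standard atom.global.add form from the PTX ISA:

```cuda
// Fetch-and-add on a 32-bit unsigned in global memory, written directly
// in PTX. Returns the value at *addr before the addition.
__device__ unsigned atomic_add_u32(unsigned* addr, unsigned val) {
    unsigned old;
    asm volatile("atom.global.add.u32 %0, [%1], %2;"
                 : "=r"(old)
                 : "l"(addr), "r"(val)
                 : "memory");
    return old;
}
```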
Memory fence variants: The .acquire, .release, and .acq_rel qualifiers on loads and stores became available in PTX before the CUDA API exposed them through cuda::memory_order. High-performance lock-free data structures on GPU routinely contain inline PTX for precise memory ordering.
Hardware register access: %laneid, %warpid, %smid, and %nsmid are PTX special registers that the CUDA C++ API does not expose directly. Reading the current SM index, for example, requires inline PTX:
__device__ int get_sm_id() {
    unsigned sm_id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(sm_id));
    return sm_id;
}
Warp-level primitives: shfl, vote, and similar instructions appeared in PTX before they had C++ wrappers. Older codebases use inline PTX for these regardless of whether the modern API covers them.
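A warp ballot written in raw PTX illustrates why: the instruction takes a predicate register, which inline assembly constraints cannot bind directly, so the conversion happens inside the asm block. A hypothetical helper (the modern API equivalent is __ballot_sync):

```cuda
// Returns a bitmask with one bit per lane, set where `pred` is nonzero.
__device__ unsigned ballot(int pred) {
    unsigned mask;
    asm volatile("{ .reg .pred p;\n"
                 "  setp.ne.u32 p, %1, 0;\n"
                 "  vote.sync.ballot.b32 %0, p, 0xffffffff; }"
                 : "=r"(mask)
                 : "r"(pred));
    return mask;
}
```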
Timing and cache control: nanosleep, prefetch, and discard cache operations are sometimes accessed through PTX rather than through API equivalents, particularly when precise control over timing or cache behavior is required.
These patterns appear in production kernels at organizations doing serious GPU work: deep learning frameworks, scientific computing, real-time rendering. The developers writing this code are writing it precisely because they need control that the higher-level API does not provide. Having a language server that understands what they are writing is genuinely valuable, not merely convenient.
The Dual-Context Problem as Foundation
The inline PTX support sits on top of the more fundamental work Spectral Compute did on host and device code separation. Standard clangd parses a CUDA file from a single host perspective, which means device-only functions, __device__-qualified types, and CUDA built-ins like __syncthreads() or threadIdx.x either produce spurious errors or are silently ignored. The extended clangd understands that certain regions of the file are device code and applies device-compilation semantics there.
This dual-context awareness is what enables the PTX support to be meaningful. If the language server does not correctly understand device code, PTX errors in device functions appear on top of a background of false positives, and the signal is lost. The PTX parsing is only useful because it ships with correct handling of the broader device code context.
Both features share an implementation path: Clang’s CUDA mode, which handles the host/device split internally, across multiple passes within a single frontend invocation, rather than splitting into separate tools the way nvcc does. Because clangd replays compilation commands from compile_commands.json, and because Clang’s CUDA invocations are ordinary compilation commands, host and device analysis happens within the same language server infrastructure.
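Concretely, a compile_commands.json entry for a CUDA translation unit is just an ordinary Clang invocation. A hypothetical entry (paths, architecture, and flags will vary by project):

```json
{
  "directory": "/home/dev/project/build",
  "command": "clang++ --cuda-gpu-arch=sm_80 -std=c++17 -c ../src/kernels.cu -o kernels.o",
  "file": "../src/kernels.cu"
}
```

Because this is a valid Clang command line, clangd can replay it unchanged and get both host and device analysis from the single invocation.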
The Cross-Platform Dimension
Spectral Compute builds the Scale language, which targets both NVIDIA’s NVPTX backend and AMD’s AMDGPU backend. The extended clangd works for both targets. The language server operates at the source and AST level, so the feedback is consistent whether you are compiling for an NVIDIA or AMD GPU. The AMDGPU ISA differs from PTX at the machine level, but the tooling story converges at the language server layer because the source language is the same.
This matters beyond the immediate user base for Scale. It demonstrates that the approach, extending a general-purpose language server to handle split-compilation GPU programming models, generalizes across GPU vendors. The inline PTX parsing is specific to NVIDIA’s virtual ISA, but the architecture for detecting embedded languages inside string literals and surfacing accurate diagnostics applies to inline AMDGPU assembly as well.
What Gets Better
The practical improvement is a shorter feedback loop for a category of subtle bugs. Without this support, writing inline PTX means compiling to see your errors. With the extended clangd, they appear in the editor as you type, in the same pass that catches C++ type errors. A mismatch between an inline assembly constraint and the operand’s register class shows up immediately rather than as a confusing ptxas error after a full compile cycle.
The PTX feedback appears through the standard LSP diagnostic mechanism, with the same display and severity levels as any other clangd diagnostic. No separate PTX debugger or specialized tool UI is required. That uniformity matters because any friction in the error-feedback loop reduces how reliably developers use the feedback.
GPU programming tooling has long lagged behind CPU-side tooling, and the gap closes through work like this: specific, concrete extensions that address categories of bugs that were previously invisible, built into existing tooling infrastructure rather than requiring separate tools. Inline PTX has always been a real language. Until now, no one had given it a parser whose results developers could actually see.