The Glassworm campaign, documented by Aikido Security in early 2026, targets GitHub repositories, npm packages, and VSCode extensions using invisible Unicode characters that cause the bytes in a file to tell a different story than the rendered text. Three vectors, one campaign. The npm surface is the one that deserves the closest examination, not because npm is the largest target but because JavaScript, at the language specification level, made choices that remove the enforcement points other ecosystems have used to contain this class of attack.
What ECMAScript Chose and What It Did Not
ECMAScript’s specification governs what characters are valid in identifiers. ES5 opened identifiers to Unicode letters, and ES2015 formally adopted the ID_Start and ID_Continue properties from Unicode Standard Annex #31 (UAX #31, commonly cited as TR31), the standard that defines identifier-safe characters across scripts. The rationale was sound: enable multilingual programming so developers can write code using identifiers from Arabic, Chinese, Japanese, Cyrillic, Greek, Hangul, and every other Unicode script. That goal was achieved, and the intent is not the problem.
The security consequence is that Cyrillic а (U+0430) and Latin a (U+0061) are both valid identifier characters per TR31, and they are visually indistinguishable in almost every font stack at normal display sizes. сonfig using Cyrillic с (U+0441) and config using Latin c (U+0063) are two distinct, valid JavaScript identifiers. One impersonates the other in any code review that relies on visual inspection.
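The collision is easy to demonstrate in a few lines of Node.js (the identifier names here are illustrative):

```javascript
// "сonfig" below starts with CYRILLIC SMALL LETTER ES (U+0441);
// "config" starts with LATIN SMALL LETTER C (U+0063). Both are valid
// JavaScript identifiers, and most editors render them identically.
const config = { debug: false };
const сonfig = { debug: true }; // impostor binding, no warning anywhere

console.log(config === сonfig);                    // false: distinct bindings
console.log("сonfig".codePointAt(0).toString(16)); // "441" (Cyrillic)
console.log("config".codePointAt(0).toString(16)); // "63"  (Latin)
```

Any code path that reads from the impostor binding instead of the real one passes visual review while doing something else entirely.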
The Unicode Consortium anticipated this problem. Unicode Technical Standard #39 (UTS #39) defines security profiles for identifiers: mechanisms for detecting mixed-script identifiers that could be confusable substitutions, restrictions on combining characters from scripts with mutual homoglyphs, and a publicly maintained confusables dataset that maps dangerous character pairs. TR31 defines what is allowed; UTS #39 defines how to apply security constraints on top of that. ECMAScript adopted TR31 and did not adopt UTS #39’s security guidance.
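A minimal sketch of the mixed-script check that UTS #39 describes, restricted here to the three scripts most relevant to this attack (the full standard also handles Common/Inherited characters and permitted combinations such as Latin with Han, Hiragana, and Katakana):

```javascript
// Illustrative mixed-script detector using Unicode script property
// escapes, available in JavaScript regular expressions with the u flag.
// This is a simplification of UTS #39, not a conforming implementation.
function isMixedScript(identifier) {
  const scripts = new Set();
  for (const ch of identifier) {
    if (/\p{Script=Latin}/u.test(ch)) scripts.add("Latin");
    else if (/\p{Script=Cyrillic}/u.test(ch)) scripts.add("Cyrillic");
    else if (/\p{Script=Greek}/u.test(ch)) scripts.add("Greek");
  }
  return scripts.size > 1;
}

isMixedScript("config");  // false: all Latin
isMixedScript("сonfig");  // true: Cyrillic с followed by Latin onfig
isMixedScript("пример");  // false: all Cyrillic, unremarkable per UTS #39
```

The last case is the important one: a single-script non-Latin identifier is legitimate; only the cross-script collision is suspicious.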
The Unicode Security Considerations document (UTR #36), first published in 2003 and updated regularly since, explicitly documents the homoglyph attack surface and the zero-width character attack surface. The knowledge has existed for over two decades. The language specification never incorporated the countermeasures.
The Contrast with Rust and Python
Rust’s response to the Trojan Source disclosure in November 2021 illustrates what an aggressive toolchain-level response looks like. As of 1.56.1, the Rust compiler rejects bidirectional control characters in string literals and comments by default, via deny-by-default lints. More relevant to the homoglyph attack, Rust ships mixed-script detection for identifiers: the mixed_script_confusables lint, on by default, flags identifiers whose characters are confusable across scripts. You can write Rust with Cyrillic identifiers; you cannot create a Cyrillic lookalike for a Latin identifier in the same crate without the compiler flagging the collision. The Rust reference implementation surfaces UTS #39 semantics at compile time.
CVE-2021-42694 was assigned for the homoglyph attack surface as a companion to the primary Trojan Source CVE. Most toolchains’ responses to that companion CVE ranged from minimal to nonexistent. Rust’s was substantive.
Python sits in an intermediate position. PEP 3131, which enabled non-ASCII identifiers, predates UTS #39’s security profile documentation. After Trojan Source, Python 3.12 added a SyntaxWarning for bidirectional characters and introduced some mixed-script identifier detection. The detection has been softened in subsequent releases due to real conflicts with legitimate international codebases, but the intention is present in the language’s design. Python made the problem visible at parse time, which creates an enforcement surface even if that surface has edge cases.
Node.js and V8 produce no warning. A file containing Glassworm’s invisible characters runs silently. TypeScript’s compiler does not flag homoglyphs in identifiers. ESLint’s no-irregular-whitespace rule catches some zero-width characters in whitespace positions but misses them inside string literals and identifier names, which is exactly where they do damage in Glassworm’s payloads.
Why the Missing Compile Step Matters
The structural gap goes beyond identifier rules. JavaScript in the npm context has no mandatory parse-and-lint step between package publication and execution.
When a developer runs npm install, Node.js executes preinstall and postinstall scripts from the installed packages directly. A postinstall script containing a URL string with an embedded U+200B (ZERO WIDTH SPACE) looks syntactically clean to any reviewer, because the invisible character produces no visual artifact. The runtime resolves the actual byte string, which may differ from what any human-readable rendering would suggest. The script runs before the developer can examine it, because that is how the npm lifecycle is designed.
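The mechanism is trivial to reproduce. The hostname below is hypothetical, and the zero-width space is written as an escape here so it is visible on the page; in a real payload the raw character is embedded directly and renders as nothing:

```javascript
// U+200B (ZERO WIDTH SPACE) inside a string literal produces no visual
// artifact, but the runtime resolves a different byte sequence.
const clean    = "https://registry.example.com/pkg";
const poisoned = "https://registry\u200B.example.com/pkg";

console.log(clean === poisoned);             // false
console.log(poisoned.length - clean.length); // 1: one invisible code unit
```

Any allowlist, grep, or eyeball comparison against the clean string fails silently.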
npm audit addresses a different threat model: it queries a database of known-vulnerable package versions. It does not perform byte-level analysis of package contents. The integrity field in package-lock.json protects against post-publication tampering, but it does nothing if the invisible characters were present in the package at the moment of first publication. The hash over a file containing zero-width characters is just as valid as a hash over the same file without them.
Socket.dev, launched in 2022, performs the analysis that npm’s tooling does not: behavioral and static analysis of published packages, including detection of hidden Unicode characters. Their tooling flags packages containing U+200B through U+200D and U+FEFF in non-trivial contexts. It is not a default part of npm install; it is an opt-in third-party layer. That Socket is doing this work at all is evidence of how large the gap is.
What the Toolchain Could Actually Do
Concrete options exist. None of them require abandoning multilingual identifier support.
An ESLint rule covering the full dangerous Unicode range is buildable today. The rule would flag U+200B through U+200D (zero-width space, non-joiner, joiner), U+FEFF in mid-string positions, the bidirectional control range U+202A through U+202E, the isolate range U+2066 through U+2069, and the deprecated Tag block U+E0000 through U+E007F, a range entirely absent from most existing detection rulesets. For homoglyphs, integrating against the Unicode confusables dataset would identify Cyrillic/Latin collisions in identifiers. As an opt-in security/no-dangerous-unicode rule it would cover the gap that no-irregular-whitespace leaves open. The ESLint plugin API supports this without any changes to ESLint itself.
The npm registry could apply character-level scanning to packages before making them available for installation. PyPI added bidirectional character scanning in 2023, demonstrating that registry-level scanning is feasible at scale. The decision is one of prioritization.
Bundlers could normalize invisible characters during build. webpack, esbuild, and rollup transform JavaScript source into output artifacts but do not strip zero-width characters from string literals or identifier names. A build step that errors on invisible characters in non-comment contexts would catch malicious insertions before they reach production bundles. The esbuild plugin API would support this today.
V8 could emit a warning when parsing files containing bidirectional control characters outside string literals, matching what GCC provides with -Wbidi-chars, whose unpaired-character detection has been enabled by default since GCC 12. A Node.js --warn-invisible-unicode flag would give an opt-in enforcement point for CI environments.
The Tradeoff Is Genuine
The reason these things have not been done is not purely inertia. JavaScript’s Unicode identifier support enables legitimate code. A developer writing software for an Arabic-speaking market should be able to use Arabic identifiers; the friction of restricting non-ASCII characters falls disproportionately on developers and codebases in non-English-language contexts. Any rule that treats non-ASCII identifiers as suspicious by default would be a real cost to real people.
UTS #39’s approach offers a better framing: the problem is not non-ASCII identifiers, it is mixed-script identifiers that impersonate single-script identifiers. An identifier entirely in Cyrillic script is not the attack; an identifier that mixes Cyrillic and Latin to collide visually with a Latin-only identifier is. Implementing that distinction requires character-level script categorization, and it has edge cases where scripts share characters across what Unicode defines as distinct writing systems. Python’s implementation has run into those edge cases and softened accordingly.
Rust made a stricter tradeoff for a language where security posture is an explicit design value. The same tradeoff is harder to defend for a language designed around global accessibility. That context is real.
What Remains Open
The Hacker News discussion of the Aikido research surfaces the argument that dependency pinning with hash integrity closes the attack. Hash integrity protects against post-publication modification of a package that was clean on first publish. Glassworm’s payloads, if present at the time of first publication, are covered by the hash. The hash of a poisoned file is still a valid hash.
Until the JavaScript toolchain adds enforcement at one of the chokepoints where it has leverage, the practical defenses are pre-commit hooks scanning staged files for the relevant Unicode ranges, Socket.dev or an equivalent supply chain scanner in CI, and the VSCode settings editor.unicodeHighlight.invisibleCharacters: true and editor.unicodeHighlight.ambiguousCharacters: true when reviewing dependency code directly. None of these are enabled by default. Each requires an informed decision to turn on.
The Unicode Consortium documented this attack surface in UTR #36 and provided countermeasures in UTS #39. That documentation existed before Trojan Source and before Glassworm. The gap between what the specification makes possible and what the toolchain enforces is a choice that the JavaScript ecosystem has made repeatedly, for understandable reasons, and that adversaries are counting on remaining unchanged.