
JavaScript Has No Compiler to Defend Against Glassworm, and That's By Design

Source: hackernews

The Glassworm campaign documented by Aikido Security is getting coverage as a supply chain security story, which it is, but the framing tends to treat the JavaScript ecosystem’s exposure as a tooling gap. It is deeper than that. The reason npm packages and Node.js runtimes process invisible Unicode characters without complaint is not that nobody got around to adding a warning. It is that the specification explicitly allows them, the decision was intentional, and reversing it would break things people care about.

Understanding that structural difference is what separates “run a grep check in CI” from understanding why the JavaScript ecosystem is in a qualitatively different position than Rust or even Python when it comes to this class of attack.

What ECMAScript Actually Says About Identifiers

The ECMAScript specification delegates identifier character validation to Unicode Standard Annex #31 (UAX #31, "Identifiers and Pattern Syntax"), which defines which Unicode codepoints are valid in identifier names for programming languages. The relevant ECMAScript grammar rule is IdentifierPartChar, which includes UnicodeIDContinue, defined as any character with the Unicode property ID_Continue. That property covers the full Cyrillic script, Greek, Arabic, Devanagari, fullwidth Latin, and several ranges that include characters visually indistinguishable from their ASCII counterparts at common font sizes.
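The property the grammar references is queryable from JavaScript itself via the `\p` escape in Unicode-mode regular expressions, which makes for a quick probe (illustrative only, not part of any spec tooling):

```javascript
// /\p{ID_Continue}/u tests the same Unicode property the grammar uses.
const idContinue = /^\p{ID_Continue}$/u;

console.log(idContinue.test("c"));      // true: Latin c
console.log(idContinue.test("\u0441")); // true: Cyrillic с
console.log(idContinue.test("\uC124")); // true: Hangul 설
console.log(idContinue.test("\u200B")); // false: ZWSP is not an identifier character
```

Note that the Cyrillic and Hangul results are the internationalization goal working as intended; the security problem only appears when the allowed characters are visually confusable.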

This is not an oversight. The ECMAScript internationalization design rationale is explicit: developers who think in Korean, Arabic, or Russian should be able to write identifiers in those scripts. const 설정 = {}; should work, and it does. That goal is legitimate.

The security consequence is that сonfig using Cyrillic с (U+0441) and config using Latin c (U+0063) are two legal, distinct variables in any JavaScript runtime. A reviewer reading one sees the other. The homoglyph attack that CVE-2021-42694 formalized is not a bug in V8; it is a natural consequence of implementing the spec correctly.
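The two bindings coexist without complaint. In the sketch below the Cyrillic identifier is written with a `\u` escape (also legal in identifiers) purely so the difference is visible on the page:

```javascript
// Latin "config" and its Cyrillic lookalike, two independent bindings.
const config = "expected";
const \u0441onfig = "malicious"; // renders as сonfig: Cyrillic с, U+0441

console.log(config);                 // "expected"
console.log(config === \u0441onfig); // false
```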

Separately, zero-width characters (U+200B ZERO WIDTH SPACE, U+200C ZERO WIDTH NON-JOINER, U+200D ZERO WIDTH JOINER) and the deprecated Unicode Tag block (U+E0000 through U+E007F) have no restricted status in JavaScript string literals. A string "admin" containing U+200B between a and d is valid JavaScript. The runtime evaluates it without warning. Whether the character is a security problem depends entirely on what the string is compared against, and V8 has no opinion on that.
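The string-identity failure is mechanical and easy to reproduce:

```javascript
// "admin" with a ZERO WIDTH SPACE is a different string that prints identically.
const clean = "admin";
const poisoned = "a\u200Bdmin";

console.log(poisoned);           // looks like "admin" in most terminals
console.log(poisoned.length);    // 6, not 5
console.log(clean === poisoned); // false: any equality check silently fails
```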

What Rust Did Differently

Rust 1.56.1, released in November 2021 as a direct response to the Trojan Source disclosure (CVE-2021-42574), rejects bidirectional control characters in source files. Outside comments and string literals they were already invalid tokens; the release added the text_direction_codepoint_in_comment and text_direction_codepoint_in_literal lints, both deny-by-default, which turned their remaining hiding places into build errors unless a crate explicitly allows them. The compiler’s stance on Bidi characters in source became, essentially: not in my build.

The Rust compiler team made a deliberate tradeoff. Bidi characters have no legitimate use case in the token stream, as opposed to within string data that a program processes, and the default is rejection; the rare program that genuinely needs a Bidi control in a literal must say so with an explicit allow. Non-ASCII identifiers are legal in Rust (stable since 1.53, following UAX #31 much as ECMAScript does), but the compiler layers additional lints on top, such as uncommon_codepoints, confusable_idents, and mixed_script_confusables, which warn when an identifier uses rarely-seen codepoints or is visually confusable with another identifier in scope.

This is not the only possible choice, but it is a coherent one. You can name a Rust function in Arabic, but the compiler will complain if an identifier mixes scripts suspiciously or shadows a confusable twin. Some communities find that friction frustrating, and the tradeoff is real. The security benefit is that an entire class of source-level attacks becomes a build failure or a loud warning rather than a runtime surprise.

Python 3.10.1 added a SyntaxWarning for Bidi characters in source. Python subsequently softened that warning because it caused false positives in codebases with legitimate internationalized content embedded in string literals. That softening was not negligence; it reflects genuine tension between internationalization requirements and security enforcement. The Python core developers are navigating the same structural problem, and they have chosen a different point on the tradeoff curve than Rust.

V8 and Node.js have not added any warning for Bidi characters, zero-width characters, or Tag block codepoints in source. The ECMAScript specification does not require it, and the ecosystem precedent has been to treat these as the runtime’s problem, not the parser’s. There is no equivalent of GCC 12’s -Wbidi-chars (enabled by default there) for node --check.

TypeScript Is Not Filling the Gap

TypeScript sits on top of JavaScript and adds a type system, but it does not restrict the identifier and string character sets beyond what ECMAScript requires. The TypeScript compiler (tsc) will happily process source files containing homoglyph identifiers, zero-width characters in string literals, and Tag block codepoints in any position. This is consistent with TypeScript’s design goal of being a strict superset of valid JavaScript.

TypeScript could theoretically add a strict mode flag that rejects non-ASCII source tokens outside of string literals, similar to what Rust does. That would be a compiler option, not a language change. The implementation would be straightforward: the scanner already processes each token; adding a character-range check before emitting an Identifier token is not architecturally difficult. As of TypeScript 5.x, no such flag exists and there is no active TC39 proposal or TypeScript issue tracking it.
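A sketch of what such a per-token check could enforce, assuming a hypothetical predicate applied to each Identifier token the scanner emits (no such tsc flag or predicate exists; the regex is illustrative):

```javascript
// Hypothetical check a strict-mode flag could run on each scanned identifier.
const asciiIdent = /^[A-Za-z_$][A-Za-z0-9_$]*$/;

console.log(asciiIdent.test("config"));       // true
console.log(asciiIdent.test("\u0441onfig"));  // false: Cyrillic с
console.log(asciiIdent.test("\uC124\uC815")); // false: 설정 would need an opt-in
```

The last line is the tradeoff in miniature: a flag like this buys homoglyph immunity at the cost of the internationalization goal the spec was written to serve, which is presumably why nobody has championed it.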

ESLint’s no-irregular-whitespace rule catches some zero-width characters when they appear as unexpected whitespace in certain positions, but it does not cover ZWCs embedded inside string literals, which is where Glassworm’s string-identity attack lives. There is no widely maintained ESLint plugin that covers the full ZWC and Tag block character ranges in all string contexts as of early 2026.
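The scan itself is not hard; what is missing is a maintained rule that applies it in every string context. A minimal sketch covering the ZWC, Bidi, and Tag block ranges (findInvisibles is a hypothetical helper, not an ESLint rule):

```javascript
// Codepoints worth flagging: ZWCs, BOM/ZWNBSP, Word Joiner, Bidi controls, Tag block.
const SUSPECT =
  /[\u200B-\u200D\uFEFF\u2060\u202A-\u202E\u2066-\u2069\u{E0000}-\u{E007F}]/u;

function findInvisibles(source) {
  const hits = [];
  let i = 0;
  for (const ch of source) { // iterate by codepoint, not UTF-16 unit
    if (SUSPECT.test(ch)) {
      hits.push({ index: i, codepoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase() });
    }
    i += 1;
  }
  return hits;
}

console.log(findInvisibles('const s = "a\u200Bdmin";'));
// → [{ index: 12, codepoint: "U+200B" }]
```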

The Build Pipeline Does Not Help

A reasonable expectation would be that bundlers, minifiers, or the build pipeline generally would normalize source content and strip invisible characters before producing deployment artifacts. This expectation is wrong.

webpack, esbuild, and rollup all process JavaScript source and produce output bundles, but none of them strip ZWCs from string literals or identifier names as part of their default behavior. The transformation pipeline respects the byte content of identifiers and strings: a ZWC-bearing string that fetches from attacker infrastructure in a postinstall script will produce the same network request in the minified bundle. Terser, the most widely used JavaScript minifier, does not normalize invisible Unicode; it operates on the AST that the parser produces, and the parser faithfully represents the string content including invisible codepoints.
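The effect is demonstrable without any bundler at all: the parser hands the string’s exact codepoints to the runtime, and any transformation that preserves program behavior must preserve the invisible byte. A sketch using eval to stand in for the bundle’s parse-and-execute step:

```javascript
// The ZWSP in the source string survives parsing and evaluation intact.
const src = 'const target = "registry\u200B.example"; target';
const result = eval(src);

console.log(result === "registry.example"); // false: the hostname differs
console.log(result.length);                 // 17, one more than the visible 16
```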

The package-lock.json integrity field compounds this. Integrity hashes protect against post-publication modification of a package, but they are computed over whatever bytes the package contains at publish time. A package published with U+200B inside a critical string has its integrity hash computed over that content, including the invisible character. Any subsequent npm ci run will verify the hash, pass, and install the poisoned package. As noted in the Hacker News discussion of the Aikido research, some commenters argued that pinning protects against this; it does not, for exactly this reason.

The Realistic Defense Stack

Given that the enforcement gap exists at the specification level, the practical defense has to live elsewhere in the pipeline. None of these are perfect, but layered together they cover most of the attack surface.

A pre-commit hook that scans staged files for the relevant codepoint ranges covers the developer’s own commits before they enter shared history:

#!/usr/bin/env bash
# Abort the commit if any staged source file contains Bidi controls, ZWCs,
# Word Joiner (U+2060), or Tag block codepoints.
if git diff --cached --name-only -z -- '*.js' '*.ts' '*.mjs' '*.cjs' '*.py' \
  | xargs -0 -r grep -lP \
    '[\x{200B}-\x{200D}\x{FEFF}\x{2060}\x{202A}-\x{202E}\x{2066}-\x{2069}\x{E0000}-\x{E007F}]' \
    2>/dev/null; then
  echo "Suspicious Unicode detected in staged files." >&2
  exit 1
fi

The pathspecs scope the check to source file extensions (.js, .ts, .mjs, .cjs, .py) rather than all staged files, avoiding false positives from legitimate binary content or Unicode text files; -z with xargs -0 keeps unusual filenames intact, and xargs -r skips the grep entirely when nothing relevant is staged. The pattern covers Bidi controls, ZWCs, Word Joiner (U+2060), and the Tag block, the ranges the 2021 patches consistently missed.

The same pattern as a GitHub Actions step scans PR diffs without depending on GitHub’s built-in Bidi warning, which does not cover ZWCs:

- name: Unicode audit
  # Needs actions/checkout with fetch-depth: 0 so origin/main exists locally.
  run: |
    if git diff origin/main...HEAD -- '*.js' '*.ts' '*.mjs' '*.cjs' | \
      grep -P '[\x{200B}-\x{200D}\x{FEFF}\x{2060}\x{202A}-\x{202E}\x{2066}-\x{2069}\x{E0000}-\x{E007F}]'; then
      echo "Suspicious Unicode in diff"
      exit 1
    fi

For dependencies, Socket.dev currently provides the most comprehensive Unicode anomaly scanning for npm packages at the point of installation. It is opt-in via a GitHub app and does not operate by default on npm install. The gap between what the registry enforces and what Socket detects represents the window during which a poisoned package is available and installable without warning.

The Structural Problem

The Unicode Consortium’s security guidelines (UTR #36) and the confusables dataset have documented this attack surface for longer than Trojan Source has existed. The reason JavaScript is in a worse position than Rust is not that the Rust community read UTR #36 and the JavaScript community did not. It is that Rust was designed as a systems language with explicit security goals, while JavaScript was designed as a web scripting language with explicit internationalization goals, and those different origins produced different defaults.

Changing those defaults in JavaScript would require either a breaking change to the ECMAScript specification, which TC39 moves slowly to adopt, or a new TypeScript compiler flag that enough toolchain configurations enable to become a meaningful enforcement layer, which has not materialized. In the interim, Glassworm operates in the gap between what the spec allows and what individual teams choose to audit, and that gap is wide.

The attack does not require a novel zero-day. It requires that someone publishing an npm package understands that U+200B is a valid byte in a JavaScript string, that the registry will not reject it, that the bundler will not strip it, and that npm audit will not flag it. That is all publicly documented behavior. The Aikido research confirms that someone has internalized it and is deploying it at scale.
