
The Rendering Gap: Why Unicode Attacks on npm Keep Working

Source: hackernews

Source code is a sequence of bytes, and developer tools render those bytes as text. Most of the time those two things agree with each other, and the entire mental model of software development depends on that agreement. Unicode-based code attacks exist because that agreement is not guaranteed.

Aikido Security documented a resurgence of a campaign they call Glassworm, targeting npm packages, GitHub repositories, and VS Code by using invisible Unicode characters to hide malicious payloads in code that passes visual inspection. The campaign is not technically novel; it draws on techniques that have been documented for years. What makes it worth examining now is that the ecosystem has had years to respond and the defenses remain shallow.

What the Rendering Gap Looks Like in Practice

Three distinct Unicode attack families exploit this gap, and Glassworm uses elements of all of them.

The first is zero-width characters (ZWCs): codepoints with no visual representation. U+200B (ZERO WIDTH SPACE), U+200C (ZERO WIDTH NON-JOINER), U+200D (ZERO WIDTH JOINER), and U+FEFF (the byte-order mark, also used as ZERO WIDTH NO-BREAK SPACE) are all invisible in GitHub’s file viewer, VS Code’s editor, and npm’s web interface. They are also legal in JavaScript string literals, and two of them, U+200C and U+200D, are permitted inside identifier names.

The practical attack is straightforward. A postinstall script contains a require() call or a URL construction. Visually, the string appears to reference a legitimate package or endpoint. The actual bytes in the file include ZWCs that cause the resolved string to differ from what the reviewer sees. The script fetches from attacker infrastructure; the developer sees what looks like a clean installation routine.
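A minimal sketch of that mismatch, assuming a placeholder hostname (registry.example.com is illustrative, not taken from the campaign). The zero-width space is written as an escape so it is visible in the listing; in a real payload it would be the raw character. The reveal helper is a hypothetical name introduced for this example.

```javascript
// Two strings that render identically in most viewers but differ in content.
// The hostname is a placeholder, not real attacker infrastructure.
const visible = "https://registry.example.com/pkg";
const hidden = "https://registry.\u200Bexample.com/pkg"; // ZERO WIDTH SPACE after the dot

// Replace each invisible code point with a visible <U+XXXX> marker.
const INVISIBLE = /[\u200B-\u200D\u2060\uFEFF]/g;
function reveal(text) {
  return text.replace(INVISIBLE, (ch) =>
    `<U+${ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")}>`);
}

console.log(visible === hidden);            // false
console.log(visible.length, hidden.length); // 32 33
console.log(reveal(hidden));                // https://registry.<U+200B>example.com/pkg
```

Anything that compares, resolves, or fetches by the string value sees two different strings, while a reviewer reading the rendered text sees one.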

The second family is homoglyph substitution. Unicode contains characters that look identical or nearly identical to ASCII characters at common font sizes. Cyrillic а (U+0430) is visually indistinguishable from Latin a (U+0061) in most monospace fonts. Cyrillic е, о, and с each have Latin counterparts that render identically in the fonts most editors, VS Code included, ship with. ECMAScript, following Unicode Standard Annex #31 (UAX #31), allows these characters in identifiers. So сonfig spelled with Cyrillic с (U+0441) and config spelled with Latin c (U+0063) are two completely different variables, and a reviewer is unlikely to catch the difference without dedicated tooling.
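A short sketch of why the two spellings never collide. The Cyrillic letter is written as an escape so the difference is visible here; codePoints is a hypothetical helper introduced for illustration.

```javascript
// Build the look-alike with an escape so the substitution shows up in this listing.
const latin = "config";
const mixed = "\u0441onfig"; // first letter is CYRILLIC SMALL LETTER ES

// List each character's code point so the substitution becomes visible.
function codePoints(name) {
  return Array.from(name, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

// As property keys (or as variable names) the two spellings are distinct.
const settings = { [latin]: "trusted", [mixed]: "attacker-controlled" };

console.log(latin === mixed);              // false
console.log(Object.keys(settings).length); // 2
console.log(codePoints(latin)[0]);         // U+0063
console.log(codePoints(mixed)[0]);         // U+0441
```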

The third family is bidirectional override characters, which were the focus of the Trojan Source disclosure in November 2021. Researchers Nicholas Boucher and Ross Anderson at the University of Cambridge published a paper demonstrating that Unicode bidi control characters, legal in comments and string literals in nearly every programming language, cause visual reordering that diverges from what compilers and interpreters process. The technique can make a comment appear to contain code, make code appear to be commented out, or stretch a string literal to visually enclose what appears to be active logic outside it. CVE-2021-42574 (CVSS 8.3) was assigned against the Unicode bidirectional algorithm itself, and the paper demonstrated working examples in C, C++, C#, JavaScript, Java, Rust, Go, and Python.
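Finding these characters is mechanically simple, which is part of what makes the gap frustrating. The sketch below locates bidi controls in a source string. The function name and the sample input are hypothetical, and it only locates the characters; it does not judge whether a given use is malicious.

```javascript
// Bidi embedding/override controls (U+202A-U+202E) and isolates (U+2066-U+2069).
const BIDI_CONTROLS = /[\u202A-\u202E\u2066-\u2069]/g;

// Return the 1-based line and column (in UTF-16 units) of every bidi control.
function findBidiControls(source) {
  const findings = [];
  source.split("\n").forEach((line, index) => {
    for (const match of line.matchAll(BIDI_CONTROLS)) {
      findings.push({
        line: index + 1,
        column: match.index + 1,
        codePoint: "U+" + match[0].codePointAt(0).toString(16).toUpperCase(),
      });
    }
  });
  return findings;
}

// Hypothetical sample: a comment carrying a RIGHT-TO-LEFT OVERRIDE (U+202E).
const sample = "let isAdmin = false;\n/* \u202E begin admin block */ grantAccess();\n";
console.log(findBidiControls(sample)); // one finding: line 2, column 4, U+202E
```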

The npm Exposure Surface

JavaScript and the npm ecosystem are the highest-risk targets for these attacks, for reasons that compound each other.

ECMAScript fully supports Unicode identifiers per the specification. Unlike some languages that restrict identifiers to ASCII, JavaScript accepts any character with the ID_Start or ID_Continue property defined by Unicode Standard Annex #31 (UAX #31), which includes Cyrillic, Greek, and fullwidth Latin letters. This is not a bug; it reflects a deliberate internationalization choice. The security consequence is that the language specification itself provides the attack surface.

npm packages are installed with minimal review under normal conditions. A developer adding a dependency, or more critically, a CI pipeline running npm install, does not examine the byte sequences in every file of every transitive dependency. Unless lifecycle scripts are disabled (for example with npm’s --ignore-scripts option), the postinstall hook executes automatically at install time, before any human has a chance to notice that something is wrong. This combination means that a ZWC-bearing string in a postinstall script can exfiltrate data or establish persistence before any code review would catch it.

Bundlers do not consistently normalize invisible characters. webpack, esbuild, and rollup process and emit JavaScript but do not strip ZWCs from string literals or identifier names. The malicious payload survives the build pipeline intact.

The Tooling Response Has Been Incomplete

After the Trojan Source disclosure in 2021, the ecosystem moved. GCC 12 added -Wbidi-chars, which warns about unpaired bidi control characters by default. LLVM added clang-tidy checks for misleading bidirectional text and confusable identifiers. Python’s core developers published PEP 672, an informational guide to Unicode-related security considerations, rather than adding a compiler-level diagnostic. GitHub added warning banners on files containing bidi control characters.

Rust took the most aggressive stance: since 1.56.1, bidi control characters in comments and literals trip deny-by-default lints, so by default they are a compile error, not a warning. This is the correct security posture. When a class of input consistently enables supply chain attacks and has no legitimate use in source code, as opposed to string data, treating it as an error rather than a warning is defensible. Most other toolchains settled for warnings or separate lint checks; Rust made rejection the default.

None of these responses cover zero-width characters or homoglyphs, and that is where Glassworm operates. GitHub’s bidi warning does not trigger on U+200B. VS Code’s bidi indicator does not flag U+200D. npm audit does not examine character-level content of package files. Standard ESLint rules do not detect ZWCs in string literals.

Socket.dev, the supply chain security scanner launched in 2022, added detection for hidden Unicode characters in npm packages, flagging packages containing U+200B through U+200D and U+FEFF in non-trivial contexts. That is currently one of the few widely available commercial tools that covers this specific attack surface. Snyk has added some package behavioral analysis, but character-level Unicode scanning is not a standard feature across the field.

What Detection Actually Looks Like

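For repositories you control, grep-based CI checks are the most accessible first layer. The -P option requires GNU grep built with PCRE support, and the \x{...} escapes need a UTF-8 locale: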

grep -rP "[\x{200B}-\x{200D}\x{FEFF}\x{2060}\x{202A}-\x{202E}\x{2066}-\x{2069}]" \
  --include="*.js" --include="*.ts" --include="*.py" .

A pre-commit hook that rejects invisible Unicode in staged files:

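#!/bin/sh
# Reject staged files containing zero-width or bidi control characters.
# Requires GNU grep (-P) and a UTF-8 locale.
# Note: this scans the working-tree copy of each staged file.
pattern='[\x{200B}-\x{200D}\x{FEFF}\x{2060}\x{202A}-\x{202E}\x{2066}-\x{2069}]'
# --diff-filter=ACM skips deleted files; -z / -0 keeps unusual filenames intact.
if git diff --cached --name-only --diff-filter=ACM -z |
    xargs -0 -r grep -lP "$pattern" 2>/dev/null; then
  echo "Invisible Unicode detected in the staged files listed above" >&2
  exit 1
fi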

For homoglyph detection in Python, the confusable_homoglyphs library checks identifiers against Unicode’s confusables database, though integrating it into a standard lint pipeline requires custom work. There is no equivalent widely adopted ESLint plugin for JavaScript as of early 2026.
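In the absence of such a plugin, a crude mixed-script check can be sketched in plain JavaScript using Unicode property escapes. This is an approximation, not confusables-database detection: the function name is hypothetical, the regex is not a real tokenizer (it also matches words inside strings and comments), and it only flags tokens that mix Latin with Cyrillic or Greek letters.

```javascript
// Rough identifier matcher: an ID_Start character followed by ID_Continue characters.
// Not a real tokenizer; it will also match words inside strings and comments.
const IDENTIFIER = /[\p{ID_Start}$_][\p{ID_Continue}$\u200C\u200D]*/gu;
const LATIN = /\p{Script=Latin}/u;
const LOOKALIKE_SCRIPTS = /[\p{Script=Cyrillic}\p{Script=Greek}]/u;

// Report identifier-like tokens that mix Latin letters with Cyrillic or Greek ones.
function findMixedScriptIdentifiers(source) {
  const suspicious = new Set();
  for (const [token] of source.matchAll(IDENTIFIER)) {
    if (LATIN.test(token) && LOOKALIKE_SCRIPTS.test(token)) {
      suspicious.add(token);
    }
  }
  return [...suspicious];
}

// The second declaration starts with CYRILLIC SMALL LETTER ES, written as an escape.
const sample = "const config = load();\nconst \u0441onfig = fetchRemote();\n";
console.log(findMixedScriptIdentifiers(sample)); // one entry: the Cyrillic-initial name
```

An identifier spelled entirely in look-alike Cyrillic letters would pass this check, which is one reason a confusables-based comparison is the more complete approach.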

The fundamental limitation of these approaches is that they apply to code you write and review. Transitive npm dependencies contain thousands of files across hundreds of packages; running character-level scans on all of them at install time is not standard practice, and most teams do not do it.

The Underlying Problem Has Not Been Solved

Glassworm is a reminder that the security boundary between what developers see and what computers execute has never been formally closed. The Trojan Source paper established that the boundary exists and can be exploited; four years later, zero-width characters and homoglyphs remain outside the coverage of most standard toolchains.

The Unicode Consortium’s security guidelines (Unicode Technical Standard #39) define mechanisms for detecting confusable and mixed-script identifiers. Programming language implementations largely do not apply them. That reflects genuine difficulty: restricting identifier characters can break legitimate multilingual code, and the Unicode standard itself was not designed with source code security as a primary constraint. But it is still a choice, made repeatedly, by the maintainers of Python, JavaScript, and Go.

Rust made a different tradeoff for one part of the problem. The Trojan Source companion CVE for homoglyphs (CVE-2021-42694) has seen even less toolchain-level response than the bidi variant. Until more compilers and linters treat these character classes as errors in source files, or until supply chain scanners with Unicode-aware analysis become standard in CI pipelines, the attack surface stays open. Glassworm is not a new vulnerability; it is a persistent one that most of the ecosystem has chosen not to make expensive enough to exploit.
