The Invisible Characters That GitHub's Bidi Warning Doesn't See

The Glassworm campaign, documented by Aikido Security, targets GitHub repositories, npm packages, and VSCode extensions using Unicode characters that are invisible or visually indistinguishable from whitespace inside source files, package metadata, and extension identifiers. A reviewer looking at a diff sees nothing; a compiler or runtime sees something different, and that gap is the attack.

Trojan Source Set the Stage

In November 2021, Nicholas Boucher and Ross Anderson at the University of Cambridge published Trojan Source, a paper formalizing a specific class of this attack using bidirectional Unicode control characters. The core mechanism exploits the Unicode Bidirectional Algorithm (UAX #9), which allows text to switch rendering direction for right-to-left scripts like Arabic and Hebrew. By inserting characters like U+202E (RIGHT-TO-LEFT OVERRIDE) or U+2066 (LEFT-TO-RIGHT ISOLATE) inside string literals and comments, an attacker can make the rendered source look completely different from the token sequence the compiler actually parses.

The classic demonstration in Python:

access_level = "user"
# Check if admin⁦ ⁩ ⁦# Check if admin
if access_level != "user⁩ ⁦":
    print("You are not an admin")

What a reviewer sees in the editor: a comment and a conditional. What the parser sees: the string literal swallows the comment text, so the condition behaves opposite to its apparent meaning. The paper went through coordinated disclosure with compiler, editor, and repository vendors, producing CVE-2021-42574 and a wave of patches. GCC added -Wbidi-chars. Rust shipped the text_direction_codepoint_in_literal and text_direction_codepoint_in_comment lints in the 1.56.1 security release. Python documented the risks in the informational PEP 672 rather than changing the parser. GitHub rolled out a yellow warning banner in PR diffs when those characters appeared. VSCode 1.63 shipped Unicode Highlight, flagging ambiguous and invisible characters in the editor.
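To make the gap between the rendered line and the parsed line concrete, here is a minimal sketch that rewrites a line with every bidirectional control spelled out. The function name reveal_bidi and the BIDI_CONTROLS set are illustrative assumptions, not part of any existing tool; the set covers only the embedding, override, and isolate ranges named above.

```python
import unicodedata

# Embeddings/overrides (U+202A..U+202E) and isolates (U+2066..U+2069).
# This set is an assumption for the sketch, not an exhaustive list.
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))


def reveal_bidi(line: str) -> str:
    """Return the line with each bidi control replaced by a visible <U+XXXX NAME> tag."""
    out = []
    for ch in line:
        if ord(ch) in BIDI_CONTROLS:
            name = unicodedata.name(ch, "UNKNOWN")
            out.append(f"<U+{ord(ch):04X} {name}>")
        else:
            out.append(ch)
    return "".join(out)


# A string literal that hides an override and an isolate before the closing quote.
sample = 'if access_level != "user\u202e \u2066":'
print(reveal_bidi(sample))
```

Running a suspicious line through a helper like this shows the token sequence the parser receives rather than the layout the renderer chose.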

All of that was an appropriate response. The problem is what those mitigations covered.

The Characters That Were Left Out

GitHub’s Bidi warning covers a specific list of bidirectional control codepoints: U+202A through U+202E, U+2066 through U+2069, U+061C, U+200E, and U+200F. That list maps directly to what Trojan Source demonstrated. It does not cover zero-width characters or the Unicode Tag block.

Zero-width characters are a separate family entirely:

U+200B: ZERO WIDTH SPACE
U+200C: ZERO WIDTH NON-JOINER
U+200D: ZERO WIDTH JOINER
U+FEFF: ZERO WIDTH NO-BREAK SPACE (also used as the byte order mark, BOM)

These have different semantics from Bidi controls. They do not reverse rendering direction; they insert invisible structure that separates, joins, or marks content without affecting visual layout. Inserted into an identifier or string key, a zero-width character produces a name that looks identical to the original to a human but is a distinct string to the runtime. Inserted into an npm package name, it creates a package that renders as lodash in every terminal and web UI while being a completely different string at the registry level, where name-collision logic operates on raw bytes.
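A short sketch of why the two names are different strings even though they render the same. The variable names and the strip_zero_width helper are illustrative assumptions; the ZERO_WIDTH set mirrors the four codepoints listed above.

```python
import unicodedata

# The four zero-width codepoints listed above.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

legit = "lodash"
spoofed = "lo\u200bdash"  # ZERO WIDTH SPACE between "lo" and "dash"


def strip_zero_width(name: str) -> str:
    """Drop zero-width characters so the name can be compared with a known-good one."""
    return "".join(ch for ch in name if ch not in ZERO_WIDTH)


print(legit == spoofed)          # False: the strings differ
print(len(legit), len(spoofed))  # 6 7
print(spoofed.encode("utf-8"))   # b'lo\xe2\x80\x8bdash'

# NFKC normalization alone does not remove U+200B, so it is not a fix by itself.
print(unicodedata.normalize("NFKC", spoofed) == legit)  # False

# Explicitly stripping the zero-width set exposes the collision.
print(strip_zero_width(spoofed) == legit)  # True
```

The point of the last two lines is that a comparison has to remove these characters deliberately; generic Unicode normalization leaves them in place.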

The Unicode Tag block (U+E0000 through U+E007F) is worse. These codepoints, originally intended for language tagging and now deprecated for general use, have no visual representation at all. They are absent from most detection rule sets because they fell outside the scope of the Trojan Source disclosure tooling and were never part of the coordinated patch cycle.
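Because most of the Tag block shadows printable ASCII at a fixed offset (U+E0020 through U+E007E correspond to 0x20 through 0x7E), hidden content can be recovered mechanically. The sketch below is a detection aid under that assumption; has_tag_chars and extract_tag_payload are hypothetical helper names.

```python
TAG_START, TAG_END = 0xE0000, 0xE007F


def has_tag_chars(text: str) -> bool:
    """True if the text contains any codepoint from the Unicode Tag block."""
    return any(TAG_START <= ord(ch) <= TAG_END for ch in text)


def extract_tag_payload(text: str) -> str:
    """Map tag characters back to the ASCII they shadow (U+E0020..U+E007E -> 0x20..0x7E)."""
    return "".join(
        chr(ord(ch) - TAG_START)
        for ch in text
        if 0xE0020 <= ord(ch) <= 0xE007E
    )


# Build a harmless demo line: visible code followed by invisible tag characters.
hidden = "".join(chr(TAG_START + ord(c)) for c in "hello")
line = "x = 1" + hidden

print(len("x = 1"), len(line))    # 5 10: five extra codepoints that render as nothing
print(has_tag_chars(line))        # True
print(extract_tag_payload(line))  # hello
```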

The attack surface that Glassworm exploits sits precisely in the gap between what Trojan Source disclosed and what the subsequent tooling actually checks.

Three Attack Surfaces, Three Exploitation Patterns

npm package names. A package named react with U+200B inserted between any two characters is distinct from the legitimate react package at the registry level but visually identical in npm install output, package-lock.json viewers, and most dependency audit dashboards. An attacker publishes the malicious package; developers who see it referenced in a compromised package.json read the name as legitimate. npm’s name normalization strips hyphens and underscores but historically has not normalized zero-width characters.
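One way to surface such names locally is to flag any dependency key that is not plain printable ASCII. This is a minimal sketch, assuming a standard package.json layout; suspicious_dependency_names is a hypothetical helper, and the ASCII-only rule is a deliberately conservative assumption rather than a description of npm's own validation.

```python
import json

# Dependency sections commonly found in package.json.
SECTIONS = ("dependencies", "devDependencies", "optionalDependencies", "peerDependencies")


def suspicious_dependency_names(package_json_text: str) -> list[str]:
    """Return escaped forms of dependency names containing non-ASCII or non-printable characters."""
    manifest = json.loads(package_json_text)
    flagged = []
    for section in SECTIONS:
        for name in manifest.get(section, {}):
            if not name.isascii() or not name.isprintable():
                # ascii() escapes invisible characters so they show up in logs.
                flagged.append(ascii(name))
    return flagged


# Demo manifest: one legitimate name and one with U+200B hidden inside it.
demo = json.dumps({"dependencies": {"react": "^18.0.0", "re\u200bact": "^18.0.0"}})
print(suspicious_dependency_names(demo))  # ["'re\\u200bact'"]
```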

GitHub pull requests. When a contributor submits a PR modifying a security-critical code path, a reviewer reads the diff on GitHub. If the changed lines contain zero-width characters (ZWC) or Tag block characters, no warning appears. The actual byte sequence in the file differs from what the reviewer approved. A single invisible character inserted into a condition string can invert a permission check; the diff view makes the modification look like nothing happened.

VSCode extensions. Extension publisher IDs and display names support Unicode. A publisher named ms-рython using Cyrillic р (U+0440) instead of Latin p (U+0070) renders identically to ms-python in almost any proportional font. The VSCode marketplace UI and extension detail page show the same glyphs to a user searching for or installing the extension.
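Homoglyph substitutions like this one can often be caught by checking whether a single identifier mixes scripts. Python's standard library does not expose the Unicode Script property, so the sketch below approximates it with the first word of each character's Unicode name; that heuristic, and the helper names scripts_used and is_mixed_script, are assumptions for illustration.

```python
import unicodedata


def scripts_used(identifier: str) -> set[str]:
    """Approximate the scripts in an identifier from the first word of each letter's Unicode name."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER P" -> "LATIN", "CYRILLIC SMALL LETTER ER" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts


def is_mixed_script(identifier: str) -> bool:
    """True if the letters of the identifier come from more than one script."""
    return len(scripts_used(identifier)) > 1


genuine = "ms-python"
spoofed = "ms-\u0440ython"  # Cyrillic er (U+0440) in place of Latin p

print(is_mixed_script(genuine))  # False
print(is_mixed_script(spoofed))  # True
print(sorted(scripts_used(spoofed)))  # ['CYRILLIC', 'LATIN']
```

A name written entirely in one non-Latin script passes this check, which is intentional: the signal is the mixture, not the presence of non-ASCII letters.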

What Actually Catches This

VSCode’s Unicode Highlight feature, when fully enabled, covers the invisible character category. The three relevant settings:

{
  "editor.unicodeHighlight.ambiguousCharacters": true,
  "editor.unicodeHighlight.invisibleCharacters": true,
  "editor.unicodeHighlight.nonBasicASCII": true
}

These are not all enabled by default in every file context, and they operate at the editor view layer, not at the diff review layer where PRs are typically evaluated.

A pre-commit hook gives enforcement at commit time. A minimal Python check covers the main dangerous character classes:

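import re
import sys

# Bidi controls, zero-width characters, and the Unicode Tag block.
# The Tag block needs the 8-digit \U escape: \u reads only four hex digits,
# so \ue0000-\ue007f would silently become a huge unintended range.
dangerous = re.compile(
    r'[\u061c\u200b-\u200f\u202a-\u202e\u2066-\u2069\ufeff\U000e0000-\U000e007f]'
)

found = False
for line in sys.stdin:
    if dangerous.search(line):
        print("Dangerous Unicode found:", repr(line))
        found = True

# Report every offending line, then fail the hook if any were seen.
sys.exit(1 if found else 0)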

Pipe git diff --cached through that check before committing and you catch injections before they reach the remote. For repository maintainers, a GitHub Actions step running the same pattern on every PR diff covers the review surface that GitHub’s own warning misses. The regex above includes both the Bidi controls and the ZWC and Tag block ranges that GitHub’s banner ignores.
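For a CI step it helps to report which file and line introduced the character rather than just failing. The following is a sketch of that idea, assuming unified diff input as produced by git diff; scan_diff is a hypothetical function, and the parsing is simplified (it does not handle renames, quoted paths, or added lines whose content itself begins with "++ ").

```python
import re

# Same character classes as the hook: Bidi controls, zero-width characters, Tag block.
DANGEROUS = re.compile(
    "[\u061c\u200b-\u200f\u202a-\u202e\u2066-\u2069\ufeff\U000e0000-\U000e007f]"
)
# Hunk header, e.g. "@@ -10,4 +12,6 @@"; group 1 is the first new-file line number.
HUNK = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def scan_diff(diff_text: str) -> list[tuple[str, int, str]]:
    """Return (path, new_line_number, escaped_line) for each added line with a flagged character."""
    findings = []
    path, lineno = "", 0
    for line in diff_text.split("\n"):
        if line.startswith("+++ "):
            path = line[4:].removeprefix("b/")
        elif (m := HUNK.match(line)):
            lineno = int(m.group(1))
        elif line.startswith("+"):
            if DANGEROUS.search(line[1:]):
                findings.append((path, lineno, ascii(line[1:])))
            lineno += 1
        elif line.startswith(" "):
            lineno += 1  # context lines advance the new-file counter too
    return findings


demo_diff = "\n".join([
    "diff --git a/app.py b/app.py",
    "--- a/app.py",
    "+++ b/app.py",
    "@@ -1,2 +1,3 @@",
    " x = 1",
    "+role = 'ad\u200bmin'",
    " y = 2",
])
for path, lineno, text in scan_diff(demo_diff):
    print(f"{path}:{lineno}: {text}")
```

Only added lines are checked, so existing files with a legitimate BOM or joiner do not fail the build unless a change touches them.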

For dependency security, Socket.dev includes Unicode anomaly detection in its package scans. Snyk added ZWC pattern rules after 2022. PyPI added Bidi scanning in 2023. npm’s progress has been slower. Aikido’s scanner is what identified the current Glassworm campaign in the wild.

The Broader Pattern

The Unicode Security Considerations (UTR #36) document and the Unicode confusables dataset have mapped the full attack surface for years. The gap is not in documentation; it is in tooling that defaults to covering the narrowest possible set of characters, typically those disclosed in the most recent prominent paper, rather than the full range that the Unicode specification itself flags as security-relevant.

Trojan Source disclosed a real problem, generated appropriate CVEs, and motivated real improvements. Those improvements targeted the specific characters and contexts the paper described. Every detection system has a boundary, and attackers find what sits just outside it. The characters that Glassworm uses were documented as dangerous in UTR #36 long before Trojan Source was published; they just were not what the 2021 patches emphasized.

This is the standard lifecycle for this class of vulnerability. A high-visibility paper focuses attention on a specific mechanism. Tooling catches up to that mechanism. Subsequent campaigns shift to adjacent mechanisms in the same character space. Repeat until the tooling either covers the full dangerous range or the ecosystem adopts a different approach, such as allowlisting source files to only permit printable ASCII and explicit Unicode ranges rather than blocklisting known-bad codepoints.

The blocklist approach will keep losing. The Unicode standard adds characters in every release, and the definition of “dangerous in source code” expands with them. A pre-commit hook checking the full Tag block today will need to be updated when the next problematic block surfaces. Allowlisting printable ASCII for source files while requiring explicit opt-in for Unicode is the more durable posture, though it creates friction for internationalized codebases and is unlikely to be adopted broadly in the near term.
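The allowlist posture can be sketched in a few lines: accept printable ASCII and ordinary whitespace, and require every other range to be opted in explicitly. The function names and the example opt-in range below are assumptions chosen for illustration, not a recommended policy.

```python
# Whitespace controls that ordinary source files need.
ALLOWED_CONTROL = {"\n", "\r", "\t"}


def is_allowed(ch: str, opt_in: list[tuple[int, int]]) -> bool:
    """Allow printable ASCII, basic whitespace, and any explicitly opted-in codepoint range."""
    cp = ord(ch)
    if 0x20 <= cp <= 0x7E or ch in ALLOWED_CONTROL:
        return True
    return any(lo <= cp <= hi for lo, hi in opt_in)


def violations(text: str, opt_in: list[tuple[int, int]] = []) -> list[tuple[int, str]]:
    """Return (index, U+XXXX) for every character outside the allowlist."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_allowed(ch, opt_in)
    ]


# Hypothetical per-project opt-in: accented Latin letters (U+00C0..U+024F).
LATIN_EXTENDED = [(0x00C0, 0x024F)]

print(violations("x = 1\n"))                   # []
print(violations("a\u200bb"))                  # [(1, 'U+200B')]
print(violations("caf\u00e9"))                 # [(3, 'U+00E9')]
print(violations("caf\u00e9", LATIN_EXTENDED)) # []
```

Unlike the blocklist, nothing here needs updating when a new invisible block appears; anything not opted in is rejected by default.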

In the meantime, the VSCode settings above, a regex-based pre-commit hook, and a supply chain scanner that explicitly covers ZWC and Tag block characters represent the current practical defense, covering the gap that the 2021 disclosure left open.