Unicode Invisible Characters Are Now a Supply Chain Weapon

The conventional supply chain attack has a legible shape: a compromised maintainer account, a suspicious new dependency, a base64-encoded payload buried in a postinstall script. Security tooling has gotten reasonably good at this shape. The attack documented by Ars Technica breaks the shape entirely. The malicious code exists in the repository. It passes diff review. It is, in the most literal sense, invisible.

The mechanism is Unicode. Specifically, the category of Unicode code points that produce no visible glyph in most rendering environments, or that actively reorder the visual display of surrounding text while leaving the parsed representation unchanged. These characters have been in the standard for decades, serving real purposes: bidirectional text support for mixed Arabic and Latin content, zero-width joiners for emoji sequences, invisible mathematical operators for semantic markup. Their presence in source code repositories is where things go wrong.

The Two Attack Primitives

There are two distinct ways invisible Unicode characters get weaponized in source code, and understanding both matters for understanding why this keeps resurfacing.

The first is bidirectional text manipulation. Unicode includes a set of directional control characters: the embedding and override controls with their terminator (U+202A through U+202E), of which right-to-left override (U+202E) and left-to-right override (U+202D) are the most dangerous, and the isolate controls (U+2066 through U+2069). These were designed to force correct rendering of mixed bidirectional text. In source code, they can make a code reviewer’s terminal or browser display code in a fundamentally different order than the compiler or interpreter processes it.

This is the mechanism at the core of Trojan Source, documented in a 2021 paper by Nicholas Boucher and Ross Anderson at the University of Cambridge. The paper demonstrated working attacks against C, C++, C#, JavaScript, Python, Java, Rust, and Go. A comment delimiter that visually appears after a block of code might, once directional overrides are in play, actually enclose and nullify that code while leaving different logic exposed to the runtime. The reviewer and the compiler read different programs.
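To make the gap between display order and logical order concrete, the following minimal Python sketch prints a line in the order a parser consumes it, with each directional control replaced by a visible tag. The sample line and the helper name reveal_bidi are illustrative assumptions, not material from the paper.

```python
import unicodedata

# Bidirectional controls: embeddings, overrides, isolates, and their terminators.
BIDI_CONTROLS = {chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))}


def reveal_bidi(line: str) -> str:
    """Return the line in logical order with bidi controls shown as <NAME> tags."""
    return "".join(
        f"<{unicodedata.name(ch)}>" if ch in BIDI_CONTROLS else ch
        for ch in line
    )


# Hypothetical sample: the escapes stand in for characters a reviewer would not see.
sample = "/* check \u202e \u2066 if is_admin: \u2069 \u2066 */"
revealed = reveal_bidi(sample)
print(revealed)
```

The printed string is what the compiler receives, character by character. A renderer that honors the controls may show the same characters in a different visual order, which is the divergence the attack depends on.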

The second primitive is zero-width identifier collision. Languages with Unicode identifier support, including Python 3 (via PEP 3131), JavaScript (per the ECMAScript spec), and Rust, allow variable names to include Unicode characters beyond the ASCII range. Zero-width space (U+200B), zero-width non-joiner (U+200C), and zero-width joiner (U+200D) render as nothing in many code editors and review interfaces but are distinct code points. Which of them a parser accepts inside a name varies by language: the ECMAScript grammar explicitly permits U+200C and U+200D in identifier names, while U+200B is not a valid identifier character there or in Python, where it can only hide in string literals and comments. Where the character is accepted, a variable named config and a variable named config followed by a zero-width joiner are, to the runtime, different bindings. Validation logic reading one while assignment targets the other is a primitive but functional attack vector, and the two names render identically in any interface that does not highlight invisible characters.
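Because identifier rules differ between languages, the collision is easiest to show at the string level, where it behaves the same everywhere. The sketch below is a minimal illustration with made-up names (plain, shadow, settings): two keys that look the same on screen but are different sequences of code points.

```python
plain = "config"
shadow = "config\u200d"  # same letters followed by ZERO WIDTH JOINER

settings = {plain: "validated value"}
settings[shadow] = "attacker value"  # adds a second key instead of overwriting

print(plain == shadow)           # equality compares code points, not glyphs
print(len(plain), len(shadow))   # the hidden character still counts
print(len(settings))             # number of distinct keys
print(shadow.encode("utf-8"))    # the extra bytes are visible in the encoding
```

Printing the encoded bytes, or the length, is the quickest way to confirm that a name carries something the screen does not show.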

Why Git and Code Review Cannot Catch This

Git operates on bytes. A commit that inserts zero-width characters into a source file will appear in git diff as a change to that line, but the diff renderer, whether in a terminal or in GitHub’s pull request view, will display the old and new versions of the line identically. The invisible characters are present in the patch but produce no visible output. A reviewer who reads the diff carefully sees a line marked as changed with nothing in it to investigate.

GitHub introduced a warning for bidirectional Unicode control characters in code files in late October 2021, following the Trojan Source disclosure. Files containing those specific code points receive a warning banner, and the interface offers a way to reveal the hidden characters. This was a meaningful improvement. It does not cover the zero-width character set used in identifier collisions, and it applies to GitHub specifically, not to every platform that hosts repositories referenced by package managers.

Static analysis tools face a different version of the same problem. Tools like ESLint or Pylint operate on the abstract syntax tree produced after parsing, which means they see the code as the runtime sees it, not as the reviewer sees it. For the bidirectional attack, this is somewhat useful: the AST reflects the actual parsed structure, so a linter may catch semantic issues in the manipulated code. For the identifier collision attack, the AST correctly distinguishes the two identifiers, but a rule that only checks syntax and semantics has no reason to flag them, because both are valid Unicode identifiers.

grep searches for byte patterns. Searching for a suspicious function call will not find it if invisible characters surround or alter the name in ways not included in the search pattern. Running cat -A or hexdump on individual files exposes raw bytes, but doing this across every file in a dependency tree during review is not a realistic expectation.

The Supply Chain Dimension

When Boucher and Anderson published Trojan Source in 2021, it was primarily framed as a code injection vulnerability, a way to sneak malicious behavior past human reviewers in a single repository. The supply chain dimension is what this class of attack develops into over time.

Package registries in the npm, PyPI, and crates.io ecosystems largely accept whatever artifact a maintainer uploads. A package published with invisible-character payloads in its source passes through the standard submission pipeline. Malware scanners at the registry level are generally looking for known-malicious patterns in the bytes that will execute, not for divergence between the visual and semantic representations of the source text. An attacker with commit access to a transitive dependency, or with the ability to publish a new version of a small utility package, can inject invisible characters that change program behavior and ship them to every downstream consumer without triggering standard registry-level checks.

The breadth of affected platforms in the Ars Technica report points to either independent rediscovery across ecosystems or deliberate cross-platform targeting. Either way, it reflects the core problem: the fix for invisible Unicode in source code requires coordination across editors, review platforms, CI systems, package registries, and language toolchains. GitHub improved its rendering. Other platforms have inconsistent coverage. The attacker only needs one gap.

The AI Code Review Problem

There is an additional wrinkle that was not part of the threat model in 2021. AI-assisted code review tools are now common enough that some teams rely on them as a primary review pass. These tools are trained on diffs and source text, and they work from that text rather than from the syntax tree a compiler builds. A model whose pipeline does not flag or strip invisible code points before analysis is not guaranteed to notice them, or to reason about the program the parser will actually produce.

Whether current code review assistants apply Unicode normalization before analysis is not well documented by major vendors. The attack surface is real: a model that confidently reviews a diff and finds nothing suspicious because the suspicious content is invisible provides a false confidence layer on top of an already-broken control.

Detection That Works

The most reliable defense is a pre-commit hook or CI step that scans source files for unexpected Unicode code points before changes merge. With GNU grep built with PCRE support and running in a UTF-8 locale, a pattern covering the relevant ranges catches both attack primitives:

grep -rnIP '[\x{200B}-\x{200F}\x{202A}-\x{202E}\x{2060}-\x{2069}\x{206A}-\x{206F}\x{FEFF}]' .

This covers zero-width space through right-to-left mark, the directional embedding and override characters, word joiner through pop directional isolate, the deprecated formatting characters, and the zero-width no-break space used as a BOM. The -n flag prints line numbers and -I skips binary files. Any match in a source file is worth examining; legitimate use of these characters in code is rare enough to warrant review, the main exception being a leading U+FEFF in files saved with a byte order mark.
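Where GNU grep with PCRE is not available, the same check can be sketched in Python. The version below is a minimal illustration that assumes UTF-8 source files; the function names scan_text and scan_tree and the default suffix list are hypothetical choices, not part of any existing tool.

```python
import unicodedata
from pathlib import Path

# Same ranges as the grep pattern: U+200B-200F, U+202A-202E, U+2060-206F, U+FEFF.
SUSPICIOUS = {
    chr(cp)
    for start, end in ((0x200B, 0x200F), (0x202A, 0x202E), (0x2060, 0x206F))
    for cp in range(start, end + 1)
} | {"\ufeff"}


def scan_text(text: str):
    """Yield (line_number, column, label) for each suspicious character."""
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPICIOUS:
                # Unassigned code points in the range have no name, hence the default.
                name = unicodedata.name(ch, "UNNAMED")
                yield lineno, col, f"U+{ord(ch):04X} {name}"


def scan_tree(root: str, suffixes=(".py", ".js", ".ts", ".rs", ".go")):
    """Scan files under root whose suffix is in the (assumed) source-file list."""
    findings = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            # Undecodable bytes are replaced, not fatal; the scan targets code points.
            text = path.read_text(encoding="utf-8", errors="replace")
            findings.extend((str(path), *hit) for hit in scan_text(text))
    return findings
```

A CI wrapper could print each finding and exit non-zero when scan_tree returns a non-empty list. Files saved with a byte order mark will report U+FEFF at line 1, column 1, so a real hook would probably allow that one position.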

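For code that compares string values containing user-supplied input, stripping invisible format characters and then applying Unicode normalization before comparison eliminates the identifier-collision class of attack on the data side. Normalization alone is not enough: the zero-width characters have no canonical or compatibility decomposition, so NFC and NFKC both leave them in place. In Python:

import unicodedata

def _clean(s):
    # Drop format (Cf) code points (zero-width and bidi controls), then apply NFC.
    return unicodedata.normalize('NFC', ''.join(c for c in s if unicodedata.category(c) != 'Cf'))

def safe_compare(a, b):
    return _clean(a) == _clean(b)

In JavaScript:

const clean = (s) => s.replace(/\p{Cf}/gu, '').normalize('NFC');
const safeCompare = (a, b) => clean(a) === clean(b);

NFKC normalization is stricter and collapses more compatibility equivalents; NFC has fewer side effects on legitimate Unicode content. Stripping the Cf category also removes joiners that emoji sequences and some scripts rely on, so apply it to identifiers, keys, and usernames, not to display text.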

For dependency consumers, the standard advice about pinning versions and auditing upgrades applies with added weight here. A dependency upgrade that introduces Unicode anomalies in source files is a flag worth investigating, and running the grep pattern above across the changed files in any upgrade is low-cost compared to the alternative.
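For an upgrade audit, the useful signal is what a new version introduces, not what was already there. The sketch below compares the old and new text of one file and reports invisible code points that became more frequent. The function name introduced_invisibles and the character set are illustrative assumptions; reading the two versions from disk or from a package diff is left out.

```python
from collections import Counter

# Illustrative set matching the ranges discussed above.
INVISIBLE = {
    chr(cp)
    for cp in (*range(0x200B, 0x2010), *range(0x202A, 0x202F),
               *range(0x2060, 0x2070), 0xFEFF)
}


def introduced_invisibles(old_text: str, new_text: str) -> dict:
    """Count invisible code points that occur more often in new_text than in old_text."""
    old = Counter(ch for ch in old_text if ch in INVISIBLE)
    new = Counter(ch for ch in new_text if ch in INVISIBLE)
    # Counter subtraction keeps only positive differences.
    return {f"U+{ord(ch):04X}": n for ch, n in (new - old).items()}
```

An empty result means the upgrade added none of the tracked characters to that file. Any other result names the code points to look for in the changed lines.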

What Stays Broken

Four years after the Trojan Source paper identified the fundamental gap between the source code a developer reads and the source code a compiler processes, that gap is still being exploited. The underlying problem is that the source text is not the program; it is a representation of the program, and that representation passes through a rendering stack before human eyes. When that stack faithfully reproduces invisible characters as invisible, the reviewer’s model of the code is wrong.

The fix is coherent in principle: normalize source text before rendering diffs, flag unexpected Unicode in review interfaces, run Unicode anomaly checks in CI, require language toolchains to warn on invisible identifiers. The difficulty is that every node in the ecosystem has to do this independently, and the attacker’s job is to find the node that has not. GitHub improved its tooling. That left the other platforms. The platforms improve. That leaves the editors. The editors improve. That leaves the CI systems that check out code and run it without normalizing it first.

The invisible-character attack keeps returning because fixing it completely requires more coordination than the ecosystem has managed to produce. Until that changes, the grep pattern in your pre-commit config is doing more work than the review interface your team trusts.