What Code Review Cannot See: The Unicode Technique Behind the Latest Supply Chain Attacks

Source: lobsters

Source code is text. That sounds trivial, but it matters enormously when the definition of “text” is Unicode, a standard that encodes over 150,000 characters, many of which render as nothing at all.

A supply chain attack reported by Ars Technica has been exploiting exactly this: invisible Unicode characters embedded in source code that appears completely clean to a human reviewer, but contains hidden logic that compilers, interpreters, and runtime environments execute faithfully. The attack targeted repositories on GitHub and other platforms, smuggling malicious contributions through the pull request review process.

This is not a new class of vulnerability. But each time it resurfaces in a real supply chain compromise, it is worth going deeper than the news cycle does.

The Unicode Character Taxonomy That Makes This Possible

Unicode contains several categories of characters that are invisible or near-invisible in standard rendering:

Zero-width characters: U+200B (Zero-Width Space), U+200C (Zero-Width Non-Joiner), U+200D (Zero-Width Joiner), and U+FEFF (Zero-Width No-Break Space). These produce no visible glyph and no horizontal advance. A string that contains them looks identical to one that does not.

Bidirectional control characters: U+202A through U+202E and U+2066 through U+2069. These are the characters behind the Trojan Source attack, published in 2021 by Nicholas Boucher and Ross Anderson at the University of Cambridge and assigned CVE-2021-42574. They instruct text renderers to change the direction in which text is displayed, which means code that appears to flow left-to-right on screen can be stored in a different logical order, causing the rendered view and the compiled interpretation to diverge.

Soft hyphen: U+00AD, technically a formatting hint for line-breaking that renders as nothing in most contexts.

Tag characters: the Tags block, U+E0000 through U+E007F, originally intended for language tagging. Most renderers display nothing for them, while compilers typically pass them through untouched inside string literals and comments.

The key property all of these share is that most developer tools, including web-based diff viewers and code review interfaces, will render them as empty space or nothing, while compilers, linters, and language runtimes may treat them differently depending on where they appear.
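To make the taxonomy concrete, here is a minimal Python sketch that looks up each character of a string against the ranges listed above. The table name INVISIBLE_RANGES, the labels, and the helper names are this sketch's own, not part of any standard tool.

```python
import unicodedata

# Code point ranges named in the list above; the labels are this sketch's own.
INVISIBLE_RANGES = [
    (0x00AD, 0x00AD, "soft hyphen"),
    (0x200B, 0x200D, "zero-width"),
    (0xFEFF, 0xFEFF, "zero-width"),
    (0x202A, 0x202E, "bidi control"),
    (0x2066, 0x2069, "bidi control"),
    (0xE0000, 0xE007F, "tag"),
]


def classify(ch):
    """Return the label for a listed invisible character, or None."""
    cp = ord(ch)
    for lo, hi, label in INVISIBLE_RANGES:
        if lo <= cp <= hi:
            return label
    return None


def find_invisible(text):
    """Return (index, 'U+XXXX', label, Unicode name) for each listed character."""
    hits = []
    for i, ch in enumerate(text):
        label = classify(ch)
        if label is not None:
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((i, f"U+{ord(ch):04X}", label, name))
    return hits


sample = "ad\u200bmin"
print(len(sample))             # 6 code points, although only five letters are visible
print(find_invisible(sample))  # [(2, 'U+200B', 'zero-width', 'ZERO WIDTH SPACE')]
```

The length check is the quickest tell: a string that looks five characters long but reports six is carrying something the renderer did not show.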

The Trojan Source Precedent

Boucher and Anderson’s 2021 paper demonstrated that bidirectional control characters could be placed inside string literals and comments to make the rendered source code fundamentally misrepresent what the compiler sees. Their canonical example works roughly like this:

// Escapes stand in for the raw, invisible code points stored in the file
if (access_level != "user\u202e \u2066// Check if admin\u2069 \u2066") {
    grantAdminAccess();
}

In a code review tool that applies bidirectional rendering, the second line appears to read: if (access_level != "user") { // Check if admin. The string literal looks closed right after user, followed by an ordinary trailing comment. But the compiler, which ignores bidirectional formatting entirely, sees a string literal that runs to the final quote and contains the control characters and the apparent comment text. The comparison is therefore made against a string that no real access level equals, so the condition is always true and grantAdminAccess() runs for every caller, including plain users.
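The gap between the rendered view and the stored text can be reproduced in a few lines of Python. This is a sketch, assuming the same character sequence as the example above; the variable names are illustrative.

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# The literal in logical (stored) order, which is what a compiler reads.
stored_literal = "user" + RLO + " " + LRI + "// Check if admin" + PDI + " " + LRI

access_level = "user"

# The reviewer believes this compares against "user"; it does not.
print(access_level != stored_literal)  # True

# ascii() escapes every non-ASCII code point, exposing the hidden controls.
print(ascii(stored_literal))
```

Printing with ascii() or repr() is a cheap habit when a string comparison behaves in a way the visible source cannot explain.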

GitHub responded to Trojan Source in late 2021 by adding a warning banner to file views containing bidirectional Unicode control characters. That was a meaningful step, but it addressed one specific character category and one specific platform surface. The broader class of invisible character abuse remained.

How This Translates to Supply Chain Attacks

The supply chain variant is more operationally sophisticated than a simple code trick. The attack chain typically looks like this:

  1. The attacker identifies a widely-used open source library with an active contributor base.
  2. They submit a pull request containing a seemingly legitimate fix or feature, with invisible characters embedded in a strategic location.
  3. Reviewers read the diff, see nothing unusual, and approve the change.
  4. The change merges, is tagged in a release, and lands in downstream package registries.
  5. Every project that depends on that library now runs code that differs from what the maintainers reviewed.

The exact placement of the invisible characters determines the attack payload. Common targets include:

String comparisons: A zero-width character inside a string literal used for authentication or feature-flag checks means the string will never match user-provided input, because user input will not contain those characters.

# Looks like: if role == "admin":
# The string contains U+200B after 'a'
if role == "a\u200bdmin":
    grant_admin_access()
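A short sketch of why that branch is dead code: the two strings differ at the byte level even though they render identically. The names TAMPERED, INTENDED, and is_admin are hypothetical.

```python
TAMPERED = "a\u200bdmin"  # what the file actually contains
INTENDED = "admin"        # what the reviewer believes it contains


def is_admin(role, expected):
    """Plain equality check, as in the snippet above."""
    return role == expected


print(is_admin("admin", INTENDED))  # True
print(is_admin("admin", TAMPERED))  # False: real input never contains U+200B

# The UTF-8 encodings make the difference visible.
print(INTENDED.encode("utf-8"))  # b'admin'
print(TAMPERED.encode("utf-8"))  # b'a\xe2\x80\x8bdmin'
```

Depending on how the surrounding logic is written, a comparison that can never succeed either locks out a legitimate role or silently routes every caller into the fallback branch.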

Identifier shadowing in Unicode-aware languages: Python 3, JavaScript, and Rust allow Unicode in identifiers. None of them accepts U+200B (Zero-Width Space) there, but JavaScript does accept U+200C and U+200D as identifier characters. Two variable names that appear identical on screen can therefore be different code point sequences, creating a silent shadowing bug.

const isAdmin = false;  // safe default
const isAdmin\u200c = true;  // zero-width non-joiner makes this a different identifier

// A later reference binds to whichever declaration its own code points spell,
// and the two spellings are indistinguishable on screen
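Which invisible or look-alike characters a language accepts in identifiers is easy to test directly. The sketch below uses Python's str.isidentifier(); the Cyrillic look-alike is this sketch's own illustration of the same shadowing effect, not something taken from the reported attack.

```python
candidates = {
    "plain ASCII": "isAdmin",
    "zero-width space appended": "isAdmin\u200b",
    "Cyrillic capital A (U+0410)": "is\u0410dmin",
}

for label, name in candidates.items():
    print(f"{label}: valid={name.isidentifier()} equals_ascii={name == 'isAdmin'}")

# Two bindings that render alike but are distinct names to the interpreter.
namespace = {}
exec("isAdmin = False\nis\u0410dmin = True", namespace)
print(namespace["isAdmin"])        # False
print(namespace["is\u0410dmin"])   # True
```

Python rejects the zero-width space outright but accepts the Cyrillic letter, so the second assignment creates a new variable rather than overwriting the first.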

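Comment-encoded logic: The Trojan Source bidirectional technique can make executable code render as if it were part of a comment, or make comment text render as if it were live code. Either way, the reviewer and the compiler disagree about which statements actually run.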

Why Code Review Fails Here

Code review is a social and cognitive process optimized for reading. It assumes that what renders in the diff is what will run. That assumption is false when Unicode invisible characters are in play, and there is no natural corrective mechanism in the review loop.

Code reviewers are not typically looking at raw bytes. They are reading a web-rendered diff. Even if they check out the branch locally, their terminal and editor may silently normalize or hide the problematic characters. The attack exploits the entire toolchain between the attacker’s editor and the reviewer’s eyes.

This is also why automated CI is not a reliable defense on its own. Unit and integration tests exercise behavior; they do not inspect source files for invisible characters. A suite that passed before the malicious change will usually keep passing afterwards, because the tampered code sits in a path the tests either do not reach or reach only with benign input, while the security invariant it breaks only matters in production contexts.

Detection and Defense

Several tools can catch invisible character abuse if deployed deliberately:

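ESLint has a no-irregular-whitespace rule that catches a subset of these characters in JavaScript source. It does not cover all Unicode invisibles, and by default it skips string literals, which is exactly where several of the payloads above live.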

Grep-based scanning in CI can catch many of these before merge:

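# Scan for common invisible characters and BIDI controls.
# Needs GNU grep in a UTF-8 locale; exit status 0 means a match was found.
grep -rnIP --exclude-dir=.git '[\x{00AD}\x{200B}-\x{200D}\x{FEFF}\x{202A}-\x{202E}\x{2066}-\x{2069}\x{E0000}-\x{E007F}]' .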
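Where GNU grep's -P flag is not available, the same scan can be sketched in Python. This is an illustrative version rather than a hardened tool: the function names are hypothetical, files that are not valid UTF-8 are simply skipped, and a leading byte order mark (U+FEFF) is reported like any other hit.

```python
import os
import re

# Invisible and BIDI code points discussed in this article.
INVISIBLE = re.compile(
    "[\u00ad\u200b-\u200d\ufeff\u202a-\u202e\u2066-\u2069\U000e0000-\U000e007f]"
)


def scan_text(text):
    """Yield (line_number, column, 'U+XXXX') for each invisible character."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in INVISIBLE.finditer(line):
            yield line_no, match.start() + 1, f"U+{ord(match.group()):04X}"


def scan_tree(root, skip_dirs=(".git",)):
    """Return {path: [hits]} for UTF-8 files under root that contain hits."""
    findings = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as handle:
                    text = handle.read()
            except (UnicodeDecodeError, OSError):
                continue  # binary or unreadable file: out of scope for this sketch
            hits = list(scan_text(text))
            if hits:
                findings[path] = hits
    return findings


def main(root="."):
    """Print findings in path:line:column form; return 1 if any were found."""
    findings = scan_tree(root)
    for path, hits in sorted(findings.items()):
        for line_no, column, code_point in hits:
            print(f"{path}:{line_no}:{column}: {code_point}")
    return 1 if findings else 0
```

In a CI job, ending the script with raise SystemExit(main()) turns any finding into a failed step.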

Git hooks can run this check at commit time and reject any file containing these characters outside of explicitly whitelisted paths (such as test fixtures that specifically test Unicode handling).
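A pre-commit hook along these lines might look like the following sketch. It reads the staged version of each file from the index rather than the working tree, and skips an allowlist of fixture paths. The ALLOWLIST pattern and all function names are hypothetical; only the git diff --cached and git show invocations are standard Git commands.

```python
import fnmatch
import re
import subprocess

INVISIBLE = re.compile(
    "[\u00ad\u200b-\u200d\ufeff\u202a-\u202e\u2066-\u2069\U000e0000-\U000e007f]"
)

# Hypothetical allowlist: fixtures that intentionally contain these characters.
ALLOWLIST = ["tests/fixtures/unicode/*"]


def is_allowlisted(path, patterns=ALLOWLIST):
    return any(fnmatch.fnmatch(path, pattern) for pattern in patterns)


def staged_paths():
    """Paths added, copied, or modified in the index."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM", "-z"],
        check=True, capture_output=True,
    ).stdout
    return [p.decode("utf-8", "surrogateescape") for p in out.split(b"\0") if p]


def staged_text(path):
    """Return the staged (index) content of a file, or None if it is not UTF-8."""
    blob = subprocess.run(
        ["git", "show", f":{path}"], check=True, capture_output=True
    ).stdout
    try:
        return blob.decode("utf-8")
    except UnicodeDecodeError:
        return None


def offending(paths, read=staged_text):
    """Return non-allowlisted paths whose content has an invisible character."""
    bad = []
    for path in paths:
        if is_allowlisted(path):
            continue
        text = read(path)
        if text is not None and INVISIBLE.search(text):
            bad.append(path)
    return bad


def main():
    bad = offending(staged_paths())
    for path in bad:
        print(f"invisible Unicode character in staged file: {path}")
    return 1 if bad else 0
```

Saved as an executable .git/hooks/pre-commit with a final line of raise SystemExit(main()), a non-zero return aborts the commit. Because hooks are local and not cloned with the repository, the same check still belongs in CI.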

Editor configuration: VS Code and JetBrains IDEs can be configured to render invisible characters as visible markers. This does not help reviewers using GitHub’s web interface, but it helps contributors catch introduced characters before committing.

GitHub’s code scanning via CodeQL or third-party actions can include regex patterns for invisible characters. This requires explicit configuration; it is not on by default.

For package registries, the most reliable approach is server-side scanning at publish time. npm, PyPI, and RubyGems all have security scanning pipelines, but their existing rules are primarily focused on known malware patterns, dependency confusion, and typosquatting, not embedded Unicode abuse.

The Ecosystem Problem

The deeper issue is that supply chain security is a coordination problem at scale. A single maintainer running a pre-commit hook covers their own commits. It does not cover the contributor who submits a PR from a fork with no hooks configured. Code review catches logic bugs, but it cannot catch invisible-character tampering unless the review tooling surfaces those characters, and that support is not yet standard.

Each layer of the supply chain, from contributor tooling to review interfaces to CI pipelines to package registries, needs to independently enforce character hygiene, and most do not. The attack reported by Ars Technica succeeded not because any single tool failed catastrophically, but because none of them treated invisible Unicode as the threat surface it is.

Language communities can standardize here. Compiler warnings for non-ASCII characters in identifiers would raise visibility. Linting rules covering the full set of invisible code points, not just irregular whitespace, would help. Package registry policies requiring source archives to be free of invisible characters in non-string, non-comment positions would add a backstop.
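One way a linting rule could approximate "the full set" without maintaining a hand-written list is to key on the Unicode general category instead. The sketch below flags every character in category Cf (format). This is an assumption about how such a rule might be built, and only an approximation: Cf includes a few characters that do render, and some invisible characters fall in other categories.

```python
import unicodedata


def is_format_char(ch):
    """True if the character's Unicode general category is Cf (format)."""
    return unicodedata.category(ch) == "Cf"


def format_chars(text):
    """Return (index, 'U+XXXX') for every Cf character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_format_char(ch)
    ]


# Soft hyphen, zero-width, BIDI, and tag characters all report category Cf.
for ch in "\u00ad\u200b\u200d\ufeff\u202e\u2066\U000e0041":
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))

print(format_chars("a\u00adb\u202ec"))  # [(1, 'U+00AD'), (3, 'U+202E')]
```

A category-based rule picks up newly assigned format characters whenever the runtime's Unicode tables are updated, at the cost of needing an allowlist for scripts that use format characters legitimately.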

None of this is technically complex. Unicode character categories are well-specified. Regex patterns that match invisible characters are not hard to write. The gap is adoption, not capability.

In the meantime, if you maintain or depend on open source packages, adding a CI step that scans for invisible Unicode characters in source files is a one-line grep. The cost is negligible. The attack surface it closes is real.
