
Code Review Has a Unicode Blind Spot, and Supply Chain Attackers Found It

Source: lobsters

Source code exists in two simultaneous contexts. There is the representation a text renderer displays for human eyes, and there is the byte sequence a compiler or interpreter actually processes. For ASCII source code, these two views are always identical. Unicode, and specifically its bidirectional text controls and zero-width formatting characters, creates a seam between them. Attackers are now using that seam at supply chain scale.

A recent campaign reported by Ars Technica demonstrates this against GitHub and other code repositories. The technique is not new; the operational deployment of it against the software supply chain is what has changed. Understanding why this works requires understanding the Unicode mechanism underneath it.

The Rendering Gap

Unicode’s bidirectional text support exists for legitimate reasons. Documents that mix English with Arabic or Hebrew need a way to specify text direction, and Unicode provides a suite of control characters for this purpose. The relevant code points include:

  • U+202A through U+202E: LEFT-TO-RIGHT EMBEDDING, RIGHT-TO-LEFT EMBEDDING, POP DIRECTIONAL FORMATTING, LEFT-TO-RIGHT OVERRIDE, RIGHT-TO-LEFT OVERRIDE
  • U+2066 through U+2069: LEFT-TO-RIGHT ISOLATE, RIGHT-TO-LEFT ISOLATE, FIRST STRONG ISOLATE, POP DIRECTIONAL ISOLATE
  • U+200E, U+200F: LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK

Separately, there are zero-width formatting characters with no visible glyph at all: U+200B (ZERO WIDTH SPACE), U+200C (ZERO WIDTH NON-JOINER), U+200D (ZERO WIDTH JOINER), U+2060 (WORD JOINER), U+FEFF (ZERO WIDTH NO-BREAK SPACE, also used as a byte order mark).

The critical property these characters share is that text rendering engines act on them, while most compilers and interpreters accept them inside comments, string literals, and sometimes identifiers without giving them any lexical meaning. The human reviewer sees one thing. The runtime processes another.
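
To make the gap concrete, here is a minimal Python sketch using only the standard unicodedata module. The two sample strings and their variable names are illustrative assumptions, not taken from any real attack. It shows that a string with a hidden zero-width character is different data even when it renders the same, and that the code points listed above are all classified as format characters.

```python
import unicodedata

# Illustrative strings: the second hides a ZERO WIDTH SPACE inside the word.
plain = "is_admin"
hidden = "is_\u200badmin"

# The strings usually render identically, but they are different data.
print(plain == hidden)          # False
print(len(plain), len(hidden))  # 8 9

# A sample of the code points listed above. Each is in general category
# Cf ("format"): no glyph of its own, but a renderer may act on it.
SAMPLE_CODE_POINTS = (0x200B, 0x200D, 0x2060, 0xFEFF, 0x200E, 0x202E, 0x2066)

for cp in SAMPLE_CODE_POINTS:
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.category(ch)} "
          f"{unicodedata.bidirectional(ch):3} {unicodedata.name(ch)}")
```

The third column is the character's bidirectional class, which is the property a renderer consults when it reorders text for display.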

What Trojan Source Established

The systematic treatment of this attack class came from Nicholas Boucher and Ross Anderson at the University of Cambridge. Their 2021 paper, Trojan Source: Invisible Vulnerabilities, demonstrated exploitation across C, C++, C#, JavaScript, Java, Rust, Go, and Python. The bidirectional-control attack it describes was assigned CVE-2021-42574.

The paper’s most striking attack pattern is called “Commenting-Out.” In C and C++, bidirectional control characters placed inside a block comment can make the comment’s closing delimiter, */, appear visually displaced from where the parser actually sees it. To the code reviewer, a block of potentially dangerous code appears to be commented out. To the compiler, the comment ends earlier, and the code executes.

A simplified illustration of the principle:

/* Closing delimiter appears displaced to viewer due to bidi controls
   ⁦ if (access_level < ADMIN) { return AUTH_DENY; } ⁦
*/
return AUTH_ALLOW;

The bidirectional control characters (which most viewers do not display, so the listing above looks unremarkable) cause a text renderer to show the comment’s closing */ after the conditional block. The C compiler processes the raw bytes in sequence and sees the */ where it actually sits, before the conditional. The conditional is therefore live code, even though the reviewer reads it as part of the comment.
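
One way to review such a line safely is to print it in logical order, the order the compiler reads, with every control spelled out. The following is a rough sketch of that idea: the reveal helper, the abbreviation table, and the sample line are all hypothetical names introduced for illustration, and the sample is not a working exploit.

```python
# Bidirectional controls named earlier in this article, mapped to the
# conventional abbreviations of their Unicode names.
BIDI_CONTROLS = {
    "\u200e": "LRM", "\u200f": "RLM",
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}

def reveal(line: str) -> str:
    """Return the line in logical order with bidi controls made visible."""
    return "".join(
        f"<{BIDI_CONTROLS[ch]}>" if ch in BIDI_CONTROLS else ch
        for ch in line
    )

# Hypothetical line, written with escapes so the controls show in this listing.
sample = "/* check disabled \u202e \u2066*/ if (x) return;"
print(reveal(sample))
# /* check disabled <RLO> <LRI>*/ if (x) return;
```

Because the output never passes the raw controls to the terminal or browser, the position of the closing delimiter is shown where the parser will find it.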

Python is vulnerable through a different vector. Python 3 allows Unicode identifiers following the Unicode Standard’s ID_Start and ID_Continue properties. The zero-width joiner (U+200D) falls under Other_ID_Continue, making it a valid character within an identifier. Homoglyphs are an even simpler route to the same result: two variable names that look identical in a review interface can be distinct objects at runtime:

# The homoglyph attack: Cyrillic 'а' (U+0430) vs Latin 'a' (U+0061)
# Both render identically in almost every font used in code review
access_level = "admin"   # 'а' is Cyrillic here
access_level = "user"    # 'a' is Latin here

print(access_level)  # Prints 'user'
# The assignment with Cyrillic 'а' created a separate variable
# Any privilege check reading the Latin 'access_level' sees 'user'
# The Cyrillic 'аccess_level' retains 'admin'

This homoglyph variant does not require bidirectional controls at all. It relies on characters that are visually indistinguishable in common fonts, thousands of which are cataloged in Unicode’s confusables dataset: Cyrillic, Greek, and various other scripts overlap extensively with the Latin characters that dominate most source code.
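
Because a listing like the one above can lose its Cyrillic letter when copied, the same behavior can be reproduced with explicit escapes. This is a small sketch under that assumption: the names latin_name, cyrillic_name, and non_ascii_chars are hypothetical, and exec is used only to run the two assignments in a scratch namespace.

```python
import unicodedata

latin_name = "access_level"           # all ASCII
cyrillic_name = "\u0430ccess_level"   # first letter is CYRILLIC SMALL LETTER A

# Run both assignments in a scratch namespace and inspect what it holds.
namespace = {}
exec(f'{cyrillic_name} = "admin"\n{latin_name} = "user"', namespace)

print(namespace[latin_name])      # user
print(namespace[cyrillic_name])   # admin

def non_ascii_chars(identifier):
    """List (position, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"))
        for i, ch in enumerate(identifier)
        if ord(ch) > 0x7F
    ]

print(non_ascii_chars(cyrillic_name))
# [(0, 'U+0430', 'CYRILLIC SMALL LETTER A')]
```

The interpreter keeps two separate bindings, and only a check on the underlying code points tells them apart.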

Why Supply Chain Is the Right Delivery Mechanism

These techniques are most effective when the attacker controls source code that passes human review before reaching downstream consumers. A supply chain insertion multiplies impact: one merged commit or one published package release propagates to every project that takes the dependency.

The code review process carries an implicit assumption: the reviewer can read the code under review. Invisible and visually deceptive characters break that assumption silently. There is no diff warning. The malicious logic is syntactically valid code in the language’s own terms. Automated CI passes because the code compiles and the tests test what the tests test. The attack bypasses every check that does not specifically scan for Unicode anomalies.

Package registries compound the problem. Once malicious code reaches npm, PyPI, crates.io, or a Maven repository, it is cached and redistributed. Projects that track version ranges rather than pinned versions pick it up automatically on the next dependency resolution. An organization’s SBOM may accurately list the package version and hash while that version contains code that none of the organization’s audit tools render faithfully.

The attack also survives post-install inspection. If a developer opens an installed package’s source in their editor, they see the same misleading rendering the original reviewer saw. The payload is not hidden in a binary; it is in plain text that happens to exploit how text is displayed.

GitHub’s Response and Its Limits

Following the Trojan Source disclosure, GitHub implemented a warning for files containing Unicode bidirectional control characters. When viewing a file on GitHub that contains characters like U+202E or the embedding controls, a banner appears noting the presence of hidden Unicode characters with an option to reveal them.

This is a meaningful improvement for the bidirectional control subset of the attack surface. It does not cover zero-width characters that do not affect directionality, which includes U+200B, U+200C, U+200D, U+2060, and U+FEFF when not used as a BOM. It does not cover homoglyph attacks using visually similar characters from different Unicode blocks. The warning is also dismissible and does not block merging. Automated pipeline tooling that processes repository events does not necessarily surface it the same way a human reviewer browsing GitHub’s interface would.

Other major code hosts have implemented similar measures with similar coverage gaps. The partial mitigations reflect a genuine difficulty: legitimate multilingual source code and documentation in the same repository may contain bidirectional controls for valid reasons, and blanket rejection creates false positives for international projects.

Detection and Defense

The most direct mitigation is a linting step that runs before merge, covering the Unicode ranges relevant to this attack class:

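# Detect bidirectional controls and zero-width characters in source files
# (GNU grep built with PCRE support, run under a UTF-8 locale)
grep -rnP '[\x{200B}-\x{200F}\x{202A}-\x{202E}\x{2060}-\x{2069}\x{FEFF}]' src/

# With ripgrep's PCRE2 engine and Unicode property matching
# (matches bidirectional controls only, not zero-width characters)
rg --pcre2 '\p{Bidi_Control}' src/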
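
Where grep with PCRE support is not available, the same check can be approximated in a few lines of Python. This is a sketch, not a hardened tool: the function names scan_text and scan_tree are hypothetical, the code point set is just the one listed in this article, and files that do not decode as UTF-8 are skipped rather than reported.

```python
from pathlib import Path

# Code points from the lists earlier in this article.
SUSPICIOUS = set(
    [0x200B, 0x200C, 0x200D, 0x200E, 0x200F, 0x2060, 0xFEFF]
    + list(range(0x202A, 0x202F))   # U+202A..U+202E
    + list(range(0x2066, 0x206A))   # U+2066..U+2069
)

def scan_text(text):
    """Yield (line_no, column, code_point) for each suspicious character."""
    if text.startswith("\ufeff"):
        text = text[1:]  # a single leading U+FEFF is a byte order mark
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) in SUSPICIOUS:
                yield line_no, col, ord(ch)

def scan_tree(root):
    """Scan every file under root that decodes as UTF-8; return the hit count."""
    hits = 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file: out of scope for this sketch
        for line_no, col, cp in scan_text(text):
            print(f"{path}:{line_no}:{col}: U+{cp:04X}")
            hits += 1
    return hits

# In-memory demonstration; the escapes stand in for hidden characters.
demo = "ok = 1\nflag\u200b = 2\n/* \u202e */\n"
print(list(scan_text(demo)))
# [(2, 5, 8203), (3, 4, 8238)]
```

A CI job could call scan_tree("src") and fail the build when the returned count is nonzero.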

Homoglyph detection requires a different approach: checking identifiers against the Unicode confusables dataset. The Python confusable_homoglyphs library and similar tools for other languages can flag identifiers that are confusable with ASCII equivalents. This is more involved than a grep but addresses a larger portion of the attack surface.
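
As a rough first pass before reaching for the confusables data, identifiers can be flagged when they contain any non-ASCII letter, along with the scripts involved. The sketch below uses only Python's standard tokenize and unicodedata modules. The helpers script_of and suspicious_identifiers are hypothetical, and guessing the script from the first word of a character's Unicode name is a heuristic assumption, not the algorithm the Unicode security specification defines.

```python
import io
import tokenize
import unicodedata

def script_of(ch):
    """Rough script guess from the first word of the Unicode character name."""
    if ch.isascii():
        return "LATIN" if ch.isalpha() else "COMMON"
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def suspicious_identifiers(source):
    """Yield (line, identifier, scripts) for identifiers with non-ASCII characters."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type == tokenize.NAME and not tok.string.isascii():
            scripts = sorted({script_of(c) for c in tok.string} - {"COMMON"})
            yield tok.start[0], tok.string, scripts

# Sample source built with an escape so the Cyrillic letter is explicit.
sample = '\u0430ccess_level = "admin"\naccess_level = "user"\n'
for line_no, name, scripts in suspicious_identifiers(sample):
    print(line_no, ascii(name), scripts)
# 1 '\u0430ccess_level' ['CYRILLIC', 'LATIN']
```

An identifier that mixes two scripts, as this one does, is a strong signal. Legitimate non-English identifiers usually stay within one script, so they would be reported with a single entry and can be allowed by policy.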

For dependency security, tools like Socket perform static analysis on published package tarballs before they reach the developer, scanning source files for supply chain anomalies including Unicode-based obfuscation. This position, between the registry and the consumer, is more robust than post-install scanning because it intercepts packages before execution.

At the repository level, a pre-commit hook or CI step that fails on unexpected Unicode ranges in source files eliminates the attack surface entirely for projects that can constrain their character set. The policy is aggressive but tractable: fail on any source file containing characters outside well-understood blocks, with intentional exceptions documented and reviewed individually.
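
The allowlist policy described above might look something like the following sketch. Everything here is an assumption made for illustration: the allowed set (printable ASCII plus common whitespace), the EXCEPTIONS table and its example path, and the function names. A real hook would read the staged files from the version control system rather than take a dictionary.

```python
# Hypothetical policy: printable ASCII plus tab, newline, and carriage return.
ALLOWED = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}

# Hypothetical, individually reviewed exceptions: path -> extra code points.
EXCEPTIONS = {
    "docs/authors.txt": {0x00E9},  # e with acute accent in a contributor's name
}

def violations(path, text):
    """Return the sorted code points in text that the policy forbids for path."""
    allowed = ALLOWED | EXCEPTIONS.get(path, set())
    return sorted({ord(ch) for ch in text} - allowed)

def check(files):
    """files maps path -> decoded text. Return 0 if clean, 1 otherwise."""
    failed = False
    for path, text in sorted(files.items()):
        bad = violations(path, text)
        if bad:
            failed = True
            print(path + ": " + ", ".join(f"U+{cp:04X}" for cp in bad))
    return 1 if failed else 0

clean = {"src/auth.c": "return 0;\n", "docs/authors.txt": "Ren\u00e9\n"}
dirty = {"src/auth.c": "/* \u202e */ return 0;\n"}
print(check(clean))   # 0
print(check(dirty))   # prints "src/auth.c: U+202E", then 1
```

Keeping the exceptions in a reviewed file means that every new non-ASCII character enters the codebase through an explicit, visible change.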

The Structural Issue

The underlying condition that makes this attack class persistent is not a bug in any specific tool or platform. Unicode’s bidirectional controls and zero-width characters exist because the problems they solve are real. Mixed-direction text is real. Complex script rendering requirements are real. The character properties that enable these attacks are load-bearing parts of the standard.

The security model of code review was built on the assumption that source files are plain text in the typographic sense: what you see is what is there. Unicode’s design for full language support means that assumption does not always hold. Supply chain attackers have identified that mismatch and found that the tooling in the review pipeline mostly ignores it.

The gap is closable. The detection tooling exists. The fix is integrating Unicode anomaly checks into the same pipeline that already runs linters, formatters, and static analyzers, and treating invisible characters in source code with the same skepticism applied to other obfuscation techniques. The attack works today partly because those checks are available but not yet standard practice in most projects’ CI configurations.
