The Rendering Gap: Why Unicode Supply Chain Attacks Keep Working After the Patches
Source: lobsters
The attack surface was documented publicly in October 2021. CVE-2021-42574, nicknamed Trojan Source, showed that Unicode bidirectional control characters could make source code look completely different to a human reviewer than what a compiler actually processes. GitHub issued a fix. Major compilers added warning flags. Security advisories went out to every platform that mattered. And now, a supply chain campaign is hitting repositories on GitHub and other hosting platforms using the same foundational class of techniques. The mitigations were genuine but incomplete, and the attack surface persists because the root cause was never fully addressed.
What Trojan Source Actually Demonstrated
Nicholas Boucher and Ross Anderson at the University of Cambridge published their Trojan Source paper with a central observation: Unicode’s bidirectional text support, designed to allow Arabic and Hebrew text to mix with left-to-right Latin text in the same document, also applies to source code. Characters like U+202E (RIGHT-TO-LEFT OVERRIDE) and the newer isolate characters (U+2066 through U+2069) instruct renderers to visually reorder displayed characters. The bytes remain in their original sequence; only the visual representation changes.
Source code editors, web-based diff views, and code review interfaces all render Unicode per spec. Compilers and interpreters read bytes in sequence, ignoring bidirectional hints entirely. The two models diverge, and the gap between them is the exploitable surface.
A simplified illustration in C shows how the comment-out attack variant works:
/* Authenticated users only. BEGIN_SECURE_ZONE \u202e } if (access_granted) { \u202e */
do_privileged_operation();
The U+202E characters cause a renderer to display if (access_granted) as if it wraps do_privileged_operation(). In the actual byte sequence, the conditional sits inside the comment block and the privileged call executes unconditionally. A code reviewer examining a GitHub diff sees one program; the binary that ships is another.
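To see what the compiler sees, it helps to list the control characters in logical order instead of trusting the rendering. The following is a minimal sketch, not part of the original paper: `describe_controls` is a hypothetical helper, and the string is a shortened version of the comment above, with `\u202e` escapes standing in for raw override characters.

```python
import unicodedata

# Logical (byte) order of a shortened version of the comment above.
line = "/* Authenticated users only. \u202e } if (access_granted) { \u202e */"

# Embedding/override controls (U+202A..U+202E) and isolates (U+2066..U+2069).
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))

def describe_controls(text):
    """Return (index, code point, Unicode name) for each BiDi control in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ord(ch) in BIDI_CONTROLS
    ]

for entry in describe_controls(line):
    print(entry)
```

Both controls sit between the `/*` and the `*/`, which is all a C compiler needs to treat the conditional as comment text.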
GitHub deployed a fix in November 2021: warning banners for files containing BiDi control characters, with visual highlighting in diffs. GCC 12 added -Wbidi-chars. Rust 1.56.1 added deny-by-default lints that reject BiDi control characters in comments and string literals, which in practice turns them into compile errors unless a project explicitly opts out. These were appropriate responses to the specific characters named in the CVE.
The Characters the Initial Fixes Missed
The original CVE focused on the embedding and override characters: U+202A through U+202E. GitHub’s initial patch covered these and the related directional marks. What it did not fully cover were the isolate characters added in Unicode 6.3 (2013): U+2066 (LEFT-TO-RIGHT ISOLATE), U+2067 (RIGHT-TO-LEFT ISOLATE), U+2068 (FIRST STRONG ISOLATE), and U+2069 (POP DIRECTIONAL ISOLATE). These are the characters recommended in modern Unicode implementations to replace the older embedding characters for most purposes. They produce equivalent visual confusion and were absent from the original CVE character list.
This gap illustrates a persistent problem with advisory-driven security patches: they fix the instances in the report, not the class. When the attack class is “characters that cause visual rendering to diverge from byte sequence,” the fix needs to cover all such characters, not just the ones demonstrated in the proof of concept.
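One way to target the class instead of a list of instances is to ask the Unicode database about a character's properties. The sketch below assumes Python's bundled `unicodedata` tables are recent enough to know the isolate characters; `is_rendering_hazard` and `EXPLICIT_BIDI_CLASSES` are illustrative names, not an established API.

```python
import unicodedata

# Bidi_Class values used by explicit directional formatting characters:
# embeddings, overrides, isolates, and their terminators.
EXPLICIT_BIDI_CLASSES = {
    "LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI",
}

def is_rendering_hazard(ch):
    """True if ch is a format (Cf) character or an explicit BiDi control."""
    return (
        unicodedata.category(ch) == "Cf"
        or unicodedata.bidirectional(ch) in EXPLICIT_BIDI_CLASSES
    )

for cp in (0x202E, 0x2066, 0x200B, 0xFEFF, ord("a")):
    print(f"U+{cp:04X}", is_rendering_hazard(chr(cp)))
```

A property-based check picks up code points that a hand-written list forgets, at the cost of also flagging format characters that are legitimate in natural-language text (the soft hyphen, for example), so a real deployment would likely need an allowlist for documentation and localization files.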
Zero-Width Characters: A Separate Attack, Same Outcome
Bidirectional characters rearrange rendered text. Zero-width characters disappear from it entirely. These are distinct groups of characters with distinct exploits, and the Trojan Source CVE did not cover the second group.
The relevant code points include U+200B (ZERO WIDTH SPACE), U+200C (ZERO WIDTH NON-JOINER), U+200D (ZERO WIDTH JOINER), U+2060 (WORD JOINER), and U+FEFF (ZERO WIDTH NO-BREAK SPACE, also used as the byte order mark). None of these has a visual representation in a renderer, editor, or web interface that does not explicitly highlight invisible characters. In such a view, a diff containing them is visually indistinguishable from a diff without them.
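The supply chain vector here is identifier confusion. Python, JavaScript, Rust, and Go all permit Unicode in identifiers, but they differ on zero-width characters. JavaScript accepts U+200C and U+200D inside an identifier; Python, Rust, and Go reject zero-width characters there, although the characters remain legal in string literals and comments. In JavaScript, a function named verifySignature and a function named verifySignature with U+200D embedded after verify are two distinct identifiers that can coexist in the same module:
const crypto = require("node:crypto");
function verifySignature(data, sig) {
  // legitimate signature check; computeHmac is assumed to be defined elsewhere
  return crypto.timingSafeEqual(computeHmac(data), sig);
}
// The following function name contains U+200D between 'verify' and 'Signature'.
// The escape is spelled out here for readability. With the raw character, the
// name looks identical in any diff view or editor that does not highlight it.
function verify\u200DSignature(data, sig) { // <- byte-distinct identifier
  return true;
}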
A malicious contribution that replaces calls to the real function with calls to the lookalike passes visual inspection at every stage. The diff view shows verifySignature in both the old and new code. The CI test suite, unless it specifically exercises the signature verification path against a bad signature, passes. The reviewer approves the pull request. The package ships with verification bypassed.
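A reviewer cannot see this kind of collision, but a script can. The sketch below is a language-agnostic heuristic rather than a real parser: it pulls word-like tokens out of source text, strips format characters to build a "skeleton", and reports skeletons that have more than one byte-distinct spelling. `TOKEN`, `skeleton`, and `find_collisions` are hypothetical names, and the sample uses a zero-width joiner, though any code point the pattern admits behaves the same way.

```python
import re
import unicodedata
from collections import defaultdict

# Word-like tokens, extended to admit the invisible code points listed above.
TOKEN = re.compile(r"[\w\u200b-\u200d\u2060\ufeff]+")

def skeleton(token):
    """Drop format (Cf) characters so visually identical tokens compare equal."""
    return "".join(ch for ch in token if unicodedata.category(ch) != "Cf")

def find_collisions(source):
    """Map each skeleton to its spellings, keeping only those with more than one."""
    spellings = defaultdict(set)
    for token in TOKEN.findall(source):
        spellings[skeleton(token)].add(token)
    return {key: found for key, found in spellings.items() if len(found) > 1}

sample = "verifySignature(data, sig)\nverify\u200dSignature(data, sig)\n"
for name, variants in find_collisions(sample).items():
    print(name, sorted(len(v) for v in variants))
```

Because it works on raw text, the check also catches lookalikes hidden in string literals and comments, which a language's own identifier rules do not police.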
How This Feeds a Supply Chain Campaign
Individual code tricks become supply chain attacks through the organizational dynamics of large open-source repositories. Maintainers of widely depended-upon packages face a recurring pressure: fast review means approving changes with limited scrutiny; slow review creates contributor friction and stale queues. Code review tooling, GitHub’s pull request interface in particular, is optimized for readability. The interface shows developers rendered text, not raw bytes. A contributor submitting a plausible feature or bug fix, with an invisible character tucked into a security-relevant function call, presents exactly the kind of change that moves through a compressed review queue.
Package registries add another dimension. Research by Socket.dev has documented cases where npm’s registry applied inconsistent Unicode normalization across different API endpoints and the install path. A package name containing an embedded zero-width character resolved differently than its visually identical counterpart in some contexts, enabling confusion attacks at the dependency resolution layer rather than in source code directly. Phylum and Checkmarx have both flagged packages on PyPI using zero-width characters to evade source-level scanners that compare text rather than bytes.
Why Compiler Warnings Did Not Solve the Problem
Rust’s decision to reject BiDi characters by default was the most technically sound response. GCC’s -Wbidi-chars flag and similar changes from other toolchains addressed the developer-facing side of the problem. Python already refuses invisible characters such as U+200B anywhere outside string literals and comments, and documents the remaining risks in the informational PEP 672.
Supply chain attacks bypass this because they target consumers, not authors. When a developer installs a malicious npm package or imports a compromised Python library, they are not compiling that dependency with warning flags set. They are running transpiled JavaScript or importing pre-compiled bytecode. The invisible characters in the source may never interact with a compiler warning at all.
Pre-commit hooks and CI static analysis are the practical mitigation layer. Scanning for the Unicode character ranges that enable these attacks requires a single grep pattern (GNU grep with PCRE support, run in a UTF-8 locale):
grep -rnP "[\x{200B}-\x{200F}\x{202A}-\x{202E}\x{2060}\x{2066}-\x{2069}\x{FEFF}]" ./src
Semgrep supports custom rules targeting Unicode character classes and can be integrated into CI. Supply chain scanners from Socket.dev and Phylum perform this analysis at the registry level, before a dependency enters your project’s graph. Running both the publishing pipeline check and the consuming pipeline check covers both halves of the attack surface.
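Where GNU grep with PCRE is not available, or where findings need to be reported with line and column numbers, the same check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a vetted tool: `scan_bytes` and `main` are hypothetical names, every file is decoded as UTF-8, and every format (Cf) character is treated as a finding.

```python
import unicodedata
from pathlib import Path

def scan_bytes(data):
    """Yield (line, column, code point) for each format character in UTF-8 data."""
    text = data.decode("utf-8", errors="replace")
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.category(ch) == "Cf":
                yield line_no, col, f"U+{ord(ch):04X}"

def main(root):
    """Scan every file under root; return 1 if anything was found, else 0."""
    findings = 0
    for path in Path(root).rglob("*"):
        if path.is_file():
            for line_no, col, code_point in scan_bytes(path.read_bytes()):
                print(f"{path}:{line_no}:{col}: {code_point}")
                findings += 1
    return 1 if findings else 0
```

In a CI job, passing the return value of `main("./src")` to `sys.exit` would fail the build on any finding. Binary files can trigger false positives, so restricting the walk to known source extensions is probably worthwhile.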
What Code Review Interfaces Should Be Doing
The longer-term fix requires changes to how code review interfaces present source code. GitHub’s current approach places a warning banner on files containing BiDi characters. The banner is advisory and does not block a merge, and it applies to files as a whole without flagging specific locations inline within a diff.
A more defensible design would render invisible characters visibly. Terminal emulators have done this for control characters for decades: cat -v shows ^[[31m instead of silently applying the color code. The same principle applied to source code diffs, rendering U+200B as a visible placeholder and BiDi characters as explicit direction markers, would close the gap between what a reviewer sees and what the runtime processes. Code signing systems and CI verification steps that operate on the rendered representation rather than the raw bytes inherit this same blindness and need to be updated to operate on byte content.
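The cat -v idea carries over to source text directly. As a rough sketch, assuming every format (Cf) character should be surfaced, a hypothetical `reveal` helper can swap each one for a visible placeholder before the text reaches a reviewer:

```python
import unicodedata

def reveal(text):
    """Replace every format (Cf) character with a visible <U+XXXX> placeholder."""
    return "".join(
        f"<U+{ord(ch):04X}>" if unicodedata.category(ch) == "Cf" else ch
        for ch in text
    )

print(reveal("verify\u200bSignature"))  # verify<U+200B>Signature
print(reveal("/* note \u202e } if (ok) { \u202e */"))
```

A review interface would want something less noisy than angle-bracket placeholders, but the principle is the same: the substitution happens on the byte content, so nothing is left for the renderer to hide or reorder.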
The registry side needs consistent handling enforced at publication time. NFC normalization alone does not help here, because zero-width characters have no decomposition and pass through it unchanged; the check has to pair normalization with outright rejection of invisible format characters. If npm and PyPI apply both to all package names and published source identifiers at upload, confusion attacks require defeating that check instead of exploiting gaps in it. Both registries have moved in this direction since 2023, but the normalization is not uniformly applied across all code paths in either system.
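How normalization interacts with zero-width characters is easy to check with Python's standard library. The sketch below uses a made-up package name and a hypothetical `canonical_name` helper; it is not how npm or PyPI implement their checks, only an illustration of why normalizing and rejecting are separate steps.

```python
import unicodedata

clean = "verify-signature"
spoofed = "verify\u200b-signature"  # hypothetical name with a ZERO WIDTH SPACE

# Neither normalization form removes the zero-width character.
for form in ("NFC", "NFKC"):
    normalized = unicodedata.normalize(form, spoofed)
    print(form, "matches clean name:", normalized == clean)

def canonical_name(name):
    """NFC-normalize, then refuse names that still contain format characters."""
    name = unicodedata.normalize("NFC", name)
    if any(unicodedata.category(ch) == "Cf" for ch in name):
        raise ValueError("name contains an invisible format character")
    return name
```

With this shape, composed and decomposed spellings of the same accented name collapse to one form, while the spoofed name is refused outright.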
The campaign reported this month is using techniques that were publicly described in full technical detail four and a half years ago. The disclosures produced genuine improvements across compilers, editors, and hosting platforms. The persistence of these attacks in 2026 reflects a gap between mitigating specific instances and fixing the underlying model. Code review tooling still fundamentally trusts the rendered representation as a proxy for actual program content. Patching specific Unicode code points does not close that gap; it just raises the cost of the next workaround.