Invisible Characters in the Supply Chain: What the Latest GitHub Attack Reveals

A supply chain attack recently identified across GitHub and other major code repositories exploited a class of Unicode characters that are invisible to the human eye but fully parsed by compilers and interpreters. The attack is the latest manifestation of a technique that has been theoretically understood since at least 2021, when researchers at the University of Cambridge published the Trojan Source paper, but one that continues to slip past conventional security tooling.

The mechanism is worth understanding in detail, because the defense requires knowing exactly what you are looking for.

Unicode Was Not Designed for Source Code Security

Unicode’s bidirectional text algorithm exists for legitimate reasons. Documents mixing left-to-right languages like English with right-to-left languages like Arabic or Hebrew need a way to correctly order and render mixed sequences. The Unicode Bidi algorithm accomplishes this through a set of control characters that can be embedded in text to influence rendering direction.

The characters that matter for this attack class are the directional override, isolate, and mark characters, including:

U+202E RIGHT-TO-LEFT OVERRIDE
U+202D LEFT-TO-RIGHT OVERRIDE
U+2066 LEFT-TO-RIGHT ISOLATE
U+2067 RIGHT-TO-LEFT ISOLATE
U+2068 FIRST STRONG ISOLATE
U+200F RIGHT-TO-LEFT MARK

These characters have no visible glyph. They occupy space in the byte stream but render as nothing. In a document viewer, a GitHub diff, or most terminal output, they are completely invisible. A code reviewer reading a pull request sees clean, normal-looking code. The runtime sees something entirely different.
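One way to confirm that these characters carry no glyph yet still occupy bytes is to ask Python's standard unicodedata module about them. The following is a minimal sketch; the names BIDI_CONTROLS and describe, and the sample string, are illustrative choices made here rather than part of any existing tool.

```python
import unicodedata

# The six code points listed above (the tuple name is illustrative).
BIDI_CONTROLS = ("\u202e", "\u202d", "\u2066", "\u2067", "\u2068", "\u200f")


def describe(ch: str) -> str:
    """Return code point, general category, UTF-8 size, and official name."""
    return (
        f"U+{ord(ch):04X} "
        f"category={unicodedata.category(ch)} "
        f"utf8_bytes={len(ch.encode('utf-8'))} "
        f"{unicodedata.name(ch)}"
    )


for ch in BIDI_CONTROLS:
    print(describe(ch))

# Six visible letters plus one override: 7 code points, 9 UTF-8 bytes.
sample = "abc\u202edef"
print(len(sample), len(sample.encode("utf-8")))
```

Every entry reports the general category Cf (format), which is the property a scanner can key on instead of relying on what a font draws.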

The specific attack that Nicholas Boucher and Ross Anderson documented in their 2021 Trojan Source paper demonstrated how embedding a RIGHT-TO-LEFT OVERRIDE character inside a string literal or comment could cause the visual representation of code to differ from its logical content. A line that appears to contain an innocuous comment could actually contain executable code that the comment syntax terminates early, because the bidirectional rendering engine reorders the visible characters while the interpreter processes bytes in their original sequence.

Here is a simplified illustration. Consider this Python pseudocode where [RLO] represents U+202E at that byte position:

# Verify access level[RLO] "# )level_nimda(kcehc = detnarG_ssecca
access_granted = check_user_role(user)

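To a reviewer, the first line looks like a comment followed by garbage text, because the bidirectional rendering engine reverses the display order of everything after the RLO character. The layout here is schematic: in real Python a # comment always runs to the end of the line, so working exploits place the control characters so that code which appears to sit inside a comment or string literal actually sits outside it, or the reverse. The interpreter processes the characters in their stored order and executes what the reviewer never meaningfully saw.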
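The [RLO] notation used in the illustration can be produced mechanically. The sketch below assumes that tagging every character in the Unicode format category (Cf) is an acceptable way to expose hidden controls; the function name reveal is hypothetical, and the sample line is the schematic one from the illustration.

```python
import unicodedata


def reveal(text: str) -> str:
    """Replace every format-category (Cf) character with a visible tag."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            out.append(f"[U+{ord(ch):04X}]")
        else:
            out.append(ch)
    return "".join(out)


# The schematic line from the illustration, with a real U+202E embedded.
line = '# Verify access level\u202e "# )level_nimda(kcehc = detnarG_ssecca'
print(reveal(line))
```

Printing the revealed form shows the characters in their stored order, which is the order the interpreter works from, regardless of how a bidi-aware display would arrange them.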

This was assigned CVE-2021-42574 and affected virtually every major programming language and editor combination tested: C, C++, C#, JavaScript, Java, Rust, Go, Python, and others.

Zero-Width Characters Are a Separate Problem

Bidirectional attacks are one branch of this technique. Zero-width characters are another, and they enable a different class of exploit.

The Unicode standard includes several characters that render as zero-width glyphs:

U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+FEFF ZERO WIDTH NO-BREAK SPACE (also used as the byte order mark, or BOM)

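Some languages that permit Unicode in identifiers accept certain of these characters in variable and function names. JavaScript, for example, allows U+200C and U+200D in an identifier after its first character. Support varies by language: Python rejects zero-width characters in identifiers, although they remain legal inside strings and comments. Where they are accepted, config and config could be two entirely different variables if one has a zero-width character embedded at a position no font can display. The attack surface here extends to package names and dependency identifiers, not just inline code logic.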
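The collision is easy to demonstrate when the names are treated as plain strings, for example after a tokenizer has extracted them from source files. The following sketch uses two hypothetical helpers, strip_format_chars and find_collisions, and assumes that removing every Cf character is a reasonable way to approximate what a reader sees.

```python
import unicodedata


def strip_format_chars(name: str) -> str:
    """Drop Unicode format (Cf) characters, including the zero-width ones."""
    return "".join(ch for ch in name if unicodedata.category(ch) != "Cf")


def find_collisions(names):
    """Group distinct names that look identical once Cf characters are removed."""
    groups = {}
    for name in names:
        groups.setdefault(strip_format_chars(name), set()).add(name)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


plain = "config"
hidden = "con\u200cfig"  # ZERO WIDTH NON-JOINER after "con"

print(plain == hidden)          # False
print(len(plain), len(hidden))  # 6 7
print(find_collisions([plain, hidden, "other"]))
```

Any non-empty result means two names in the same scope render the same way but differ in their stored code points, which is worth a manual look whatever a given language's parser accepts.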

In package ecosystems such as npm, the same trick targets names rather than code. A package name with a zero-width non-joiner embedded mid-string would display identically to the legitimate name in terminal output and in package.json files while resolving to a different registry entry entirely. How far this works depends on the registry’s naming rules: npm, for instance, restricts new package names to URL-safe characters, so the exposure is greatest wherever names are not validated at the byte level.
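A consumer-side check on a manifest can flag any dependency name that falls outside a conservative ASCII allow-list. This is a sketch under stated assumptions: the SAFE_NAME pattern is a deliberately strict approximation written for this example, not npm's official validator, and the package name example-lib is hypothetical.

```python
import json
import re

# Assumption: lowercase ASCII letters, digits, and . _ ~ - with an optional
# @scope/ prefix. Stricter than some registries, by design.
SAFE_NAME = re.compile(r"(?:@[a-z0-9._~-]+/)?[a-z0-9._~-]+")

DEPENDENCY_KEYS = (
    "dependencies",
    "devDependencies",
    "peerDependencies",
    "optionalDependencies",
)


def suspicious_dependencies(manifest_text: str):
    """Return (section, escaped_name) for names outside the allow-list."""
    manifest = json.loads(manifest_text)
    flagged = []
    for key in DEPENDENCY_KEYS:
        for name in manifest.get(key, {}):
            if not SAFE_NAME.fullmatch(name):
                # Escape the name so the hidden character is visible in logs.
                escaped = name.encode("unicode_escape").decode("ascii")
                flagged.append((key, escaped))
    return flagged


# Hypothetical manifest: the second name hides a ZERO WIDTH NON-JOINER.
sample = json.dumps(
    {"dependencies": {"example-lib": "1.0.0", "example\u200c-lib": "1.0.0"}}
)
print(suspicious_dependencies(sample))
```

Reporting the escaped form matters here, because printing the raw name would reproduce the same invisibility the check is meant to defeat.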

Why Code Review Consistently Fails This Test

The fundamental problem is that code review tooling was built on the assumption that what you see is what gets executed. GitHub’s diff view, VS Code’s built-in diff, and most peer review workflows render Unicode according to standard display rules. The Bidi algorithm runs, zero-width characters are invisible, and the displayed output matches what a legitimate author would have written.

This is not a failure of reviewer attention or diligence. It is a structural limitation of the rendering layer. The displayed view actively hides the attack. Even a meticulous reviewer reading every line carefully will not catch a U+202E character inside a string because there is nothing visible to catch.

GitHub added a warning banner for files containing bidirectional Unicode characters after the original Trojan Source disclosure in late 2021. The banner appears in file views and diffs when control characters are present. But it is a passive notification, not a block, and it depends on the reviewer noticing and acting on an easy-to-miss UI element. Automated enforcement requires additional explicit configuration that most repositories have not applied.

The Supply Chain Dimension

What makes invisible character attacks particularly dangerous for supply chains is the combination of scale and trust amplification. When a malicious commit reaches a package depended upon by thousands of projects, the blast radius is determined by the dependency graph, not by the sophistication of the attack itself. An exploit that bypasses one reviewer in one repository propagates automatically to everyone downstream on the next version bump.

The pattern in attacks like the one reported by Ars Technica follows a consistent template: target a widely used dependency, embed invisible characters that conceal actual behavior from human review, and let the normal update cycle distribute the payload. The attacker does not need to compromise the build pipeline or the package registry. They only need to get one commit past one reviewer.

The 2024 XZ Utils backdoor operated through a different mechanism, relying on years of social engineering to build maintainer trust before inserting a backdoor into the compression library’s build scripts. But it demonstrated that supply chain attackers are patient and specifically target the trust relationships open source depends on. Invisible character attacks require far less preparation and are correspondingly easier to attempt at scale across many repositories simultaneously.

Detection Is Mechanically Straightforward

The good news is that detection is reliable once you know what to scan for. The characters involved have specific Unicode codepoints, and grep can find them in a full repository in seconds.

A scan for bidirectional control characters:

# Requires GNU grep with PCRE support (-P) and a UTF-8 locale
grep -rnP '[\x{200F}\x{202A}-\x{202E}\x{2066}-\x{2069}]' \
  --include="*.py" --include="*.js" --include="*.ts" .

A scan for zero-width characters:

# U+FEFF also matches a legitimate byte order mark at the start of a file
grep -rnP '[\x{200B}-\x{200D}\x{FEFF}]' \
  --include="*.py" --include="*.js" --include="*.ts" .

These patterns integrate cleanly into pre-commit hooks or CI pipelines. For JavaScript projects, the anti-trojan-source project provides a command-line scanner and a companion ESLint plugin (eslint-plugin-anti-trojan-source) that applies the same check as a linting rule. The trojan-source PyPI package provides a command-line scanner for Python codebases. For GitHub repositories using Advanced Security, custom code scanning patterns can be written to flag these character ranges as required status checks, making them a hard gate rather than an advisory warning.
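Where grep with PCRE support is unavailable, the same check can be written as a small script for a pre-commit hook or CI job. The version below is a minimal sketch, not a replacement for the dedicated tools: it covers the same code point ranges as the grep patterns, assumes files are UTF-8, and tolerates a single leading byte order mark. The function names are illustrative.

```python
import sys
import unicodedata
from pathlib import Path

# Same ranges as the grep patterns above.
SUSPICIOUS = (
    {0x200F}
    | set(range(0x202A, 0x202F))  # U+202A..U+202E
    | set(range(0x2066, 0x206A))  # U+2066..U+2069
    | set(range(0x200B, 0x200E))  # U+200B..U+200D
    | {0xFEFF}
)


def scan_text(text: str):
    """Yield (line, column, code point) for each suspicious character."""
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) in SUSPICIOUS:
                yield line_no, col, ord(ch)


def scan_file(path: Path):
    # utf-8-sig drops one leading BOM, so a legitimate one is not reported.
    text = path.read_bytes().decode("utf-8-sig", errors="replace")
    return list(scan_text(text))


def main(paths) -> int:
    """Print one finding per line; return 1 if anything was found."""
    status = 0
    for arg in paths:
        for line_no, col, cp in scan_file(Path(arg)):
            name = unicodedata.name(chr(cp), "UNKNOWN")
            print(f"{arg}:{line_no}:{col}: U+{cp:04X} {name}")
            status = 1
    return status


if __name__ == "__main__":
    exit_code = main(sys.argv[1:])
    if exit_code:
        sys.exit(exit_code)
```

Pre-commit frameworks typically pass the staged file paths as arguments, so the non-zero exit status is what turns a finding into a blocked commit or a failed CI job. The column numbers count code points rather than bytes.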

For package consumers, the standard npm audit toolchain does not currently detect invisible character patterns in published packages. Supply chain scanning tools like Socket.dev have added detection for some of these patterns, and the coverage has been expanding. For critical dependencies, pinning to specific commit hashes rather than version ranges limits the blast radius of a compromised release.

Editor configuration provides a secondary layer. VS Code renders bidirectional control characters as visible annotated boxes when editor.renderControlCharacters is set to true in settings, which recent versions do by default, and its editor.unicodeHighlight.invisibleCharacters setting highlights zero-width characters. Committing these values to your repository’s .vscode/settings.json makes them the workspace default, which reduces the visual attack surface for everyone who opens the project in VS Code without requiring individual action.

The Underlying Model Problem

The reason these attacks recur, and why they keep succeeding despite documented public disclosure, is that most security controls in open source are designed around the assumption that source code is text and text is what you see. That assumption holds for the vast majority of contributions. It fails specifically in adversarial conditions, which is precisely when the controls need to work.

Closing the gap requires treating the byte stream as the ground truth rather than the rendered view. The check has to happen at the byte level during CI, before merge, as an automated requirement rather than a reviewer responsibility. Human eyes reading a diff are not a reliable mechanism for detecting characters the rendering layer is designed to suppress.

The broader pattern across recent supply chain incidents, from XZ Utils to the invisible character attacks hitting GitHub now, is that attackers are not primarily exploiting vulnerabilities in code. They are exploiting the social and procedural infrastructure that open source depends on to function. Tooling that addresses the byte-versus-display gap closes one specific exploit path. The deeper work is building review processes and dependency hygiene practices that do not require every reviewer to be infallible under adversarial conditions.