Zero-Width Characters and the Supply Chain Gap That Trojan Source Left Open
Source: lobsters
A supply-chain attack using invisible Unicode characters has been found in repositories hosted on GitHub and other platforms, according to Ars Technica. The mechanism is not new. The fact that it is still working in 2026 is what deserves attention.
The basic idea is straightforward: Unicode contains a large number of characters that render as nothing visually. Insert one of these into an identifier, a package name, or a script field, and the resulting text looks identical to the legitimate version in virtually every interface that humans use to review code. The computer, however, sees something different.
The Codepoints in Question
The most commonly exploited invisible characters fall into two groups.
The first group is zero-width characters: U+200B (ZERO WIDTH SPACE), U+200C (ZERO WIDTH NON-JOINER), U+200D (ZERO WIDTH JOINER), U+FEFF (the byte-order mark, also called ZERO WIDTH NO-BREAK SPACE when it appears mid-document), and U+2060 through U+2064 (WORD JOINER, FUNCTION APPLICATION, INVISIBLE TIMES, INVISIBLE SEPARATOR, INVISIBLE PLUS). These exist in Unicode primarily for typographic and linguistic purposes. They have legitimate uses in certain writing systems. In source code, they are almost always either accidental or malicious.
The second group is bidirectional control characters: U+202A through U+202E (LEFT-TO-RIGHT EMBEDDING, RIGHT-TO-LEFT EMBEDDING, POP DIRECTIONAL FORMATTING, LEFT-TO-RIGHT OVERRIDE, RIGHT-TO-LEFT OVERRIDE) and the newer isolate characters U+2066 through U+2069. These tell text renderers to display subsequent characters in a different reading direction. Arabic and Hebrew text legitimately needs this. In source code, it lets an attacker make what a code reviewer reads differ from what a compiler processes.
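To check the codepoints listed above for yourself, the following sketch looks each one up in Python's standard unicodedata module. The two list names and the describe helper are illustrative, not part of any existing tool.

```python
import unicodedata

# Codepoints named in this section; range() upper bounds are exclusive.
ZERO_WIDTH = [*range(0x200B, 0x200E), 0xFEFF, *range(0x2060, 0x2065)]
BIDI_CONTROLS = [*range(0x202A, 0x202F), *range(0x2066, 0x206A)]


def describe(cp: int) -> str:
    """Format a codepoint as 'U+XXXX <general category> <Unicode name>'."""
    ch = chr(cp)
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{cp:04X} {unicodedata.category(ch)} {name}"


for cp in ZERO_WIDTH + BIDI_CONTROLS:
    print(describe(cp))
```

Every entry reports the general category Cf (format), which is the property that makes these characters invisible in most renderers.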
The bidirectional override attack was formalized in 2021 by Nicholas Boucher and Ross Anderson at Cambridge, in research tracked as CVE-2021-42574. The Trojan Source paper demonstrated that you could embed a RIGHT-TO-LEFT OVERRIDE inside a comment or string literal, causing the rendered text to reverse the apparent order of tokens so that a safety check appears commented out while the compiler sees it as live code. The disclosure was coordinated with many compiler, interpreter, and editor maintainers. Most patched by warning when bidi characters appeared in string literals or comments, which addressed the demonstrated exploit while leaving adjacent attack surfaces open.
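One narrower check than banning these characters outright is to flag any line that opens a directional embedding, override, or isolate and never closes it. The sketch below assumes a per-line check is sufficient. The function name is hypothetical, and the pairing logic is a simplification of the Unicode Bidirectional Algorithm, not a full implementation of it.

```python
EMBEDDING_OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}  # LRE, RLE, LRO, RLO
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING closes an embedding or override
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE closes an isolate


def unclosed_bidi_controls(line: str) -> int:
    """Count directional controls opened on this line and never closed."""
    stack = []  # holds the closer each open control is waiting for
    for ch in line:
        if ch in EMBEDDING_OPENERS:
            stack.append(PDF)
        elif ch in ISOLATE_OPENERS:
            stack.append(PDI)
        elif ch == PDF and stack and stack[-1] == PDF:
            stack.pop()
        elif ch == PDI and PDI in stack:
            # A PDI also ends any embeddings opened inside the isolate.
            while stack.pop() != PDI:
                pass
    return len(stack)


print(unclosed_bidi_controls("/* \u202E } if (is_admin) { */"))
print(unclosed_bidi_controls("/* \u202E reversed \u202C */"))
```

The first call reports one unclosed control and the second reports none. A stricter policy would reject both lines regardless of balance.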
Why Identifiers Are the Harder Problem
The bidi-in-comments attack is elegant but visible to anyone who copies the affected line and pastes it somewhere that renders Unicode differently. The zero-width-in-identifiers attack is harder to catch because there is no rendering inconsistency to notice.
Consider this Python import:
import requests
Now consider the same line with a U+200D (ZERO WIDTH JOINER) appended to the module name:
import requests
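Your terminal, your editor, your browser-based code review interface, and almost certainly your CI log output render these identically, yet the two lines are different strings. CPython's tokenizer happens to reject U+200D inside an identifier with a SyntaxError, but that protection stops at the language grammar: anywhere the name travels as plain string data, such as a requirements file, a manifest, or a dynamic import by name, nothing objects. If a registry accepted a package named requests followed by U+200D, a dependency line carrying that string would resolve to the attacker's code.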
You can verify this yourself:
>>> s1 = "requests"
>>> s2 = "requests\u200d"
>>> s1 == s2
False
>>> len(s1), len(s2)
(8, 9)
>>> print(s1); print(s2)
requests
requests
Both print() calls produce the same visible output. The strings are not equal.
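To make the difference visible, escape everything that is not printable ASCII before displaying it. The built-in ascii() already does this for a single string. The reveal helper below is a hypothetical variant that tags each hidden character with its codepoint instead.

```python
def reveal(text: str) -> str:
    """Replace every character that is not printable ASCII with a <U+XXXX> tag."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isprintable():
            out.append(ch)
        else:
            out.append(f"<U+{ord(ch):04X}>")
    return "".join(out)


print(reveal("requests"))        # requests
print(reveal("requests\u200d"))  # requests<U+200D>
```

Tabs and newlines are tagged as well, because str.isprintable() treats them as non-printable. That is usually acceptable when inspecting a single name.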
The Supply Chain Threat Model
The package registry layer is the most dangerous application of this. Any registry, mirror, or private index that accepts package names as arbitrary Unicode strings is exposed. Name validation has historically focused on preventing obvious typosquatting: replacing the letter l with the numeral 1, or swapping o for 0. These substitutions change the visible string and can be caught by similarity checks. Invisible characters do not change the visible string at all, so a check based on how a name looks has nothing to flag.
An attacker registers lodash with a zero-width space embedded at some position, then finds or creates contexts where that exact string gets written into a package.json. The dependency resolves to the malicious package. Every tool in the pipeline, from the diff that introduced the dependency to the lock file that records it, shows what appears to be the legitimate package name.
The attack surface extends beyond registry names:
- Dependency version strings inside package.json or requirements.txt, where an invisible character in the version specifier can influence resolution
- Script hooks in package.json’s preinstall and postinstall fields, where a command that looks like node ./setup.js might call a different binary
- GitHub Actions workflow files, where YAML string values receive almost no scrutiny at the codepoint level
- README install commands, where a project’s own documentation becomes the delivery mechanism
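For the manifest cases in this list, a minimal sketch of an audit is to parse the file and flag any dependency name, version specifier, or script command that contains something other than printable ASCII. The function name and the choice of sections are assumptions for illustration. A real project may need to cover more fields.

```python
import json


def non_ascii_fields(manifest_text: str) -> list[str]:
    """List package.json entries whose key or value is not printable ASCII."""
    manifest = json.loads(manifest_text)
    findings = []
    for section in ("dependencies", "devDependencies", "scripts"):
        for key, value in manifest.get(section, {}).items():
            for label, text in (("key", key), ("value", value)):
                if isinstance(text, str) and not (text.isascii() and text.isprintable()):
                    # !a escapes hidden characters so the report itself is readable
                    findings.append(f"{section}.{key!a} ({label})")
    return findings


example = json.dumps({
    "dependencies": {"lodash\u200b": "^4.17.21"},
    "scripts": {"postinstall": "node ./setup.js"},
})
for finding in non_ascii_fields(example):
    print(finding)
```

The report uses ascii-style escaping, so the hidden character shows up as \u200b instead of disappearing again in the log output.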
GitHub is a particularly wide surface because it renders everything in a browser, displays diffs as styled HTML, and has no built-in indicator when a file contains codepoints outside the printable ASCII range. A pull request that inserts a zero-width space into a workflow file produces a diff with a changed line that looks like a whitespace-only or no-change edit.
What Existing Tooling Misses
Git stores files as byte sequences and has no Unicode opinion. git diff will show a changed line when invisible characters are added, but the change is not visually apparent in the default output. You would need to run git diff --word-diff=color and then peer at the highlighted region to notice anything suspicious, and even then some invisible characters produce no visible highlight.
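Because the diff text does contain the characters even when the terminal does not show them, one option is to scan the added lines of a unified diff programmatically. The sketch below is an assumption-laden illustration, not a Git feature: the function name is hypothetical, it reports line numbers within the diff output and not within the file, and it treats any line starting with +++ as a file header.

```python
import re

# Zero-width and bidi control characters discussed in this article.
SUSPICIOUS_RE = re.compile(
    "[\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u2069\uFEFF]"
)


def scan_added_lines(diff_text: str) -> list[tuple[int, str]]:
    """Return (diff line number, codepoint) for hidden characters in added lines."""
    hits = []
    for number, line in enumerate(diff_text.split("\n"), start=1):
        if line.startswith("+") and not line.startswith("+++"):
            for match in SUSPICIOUS_RE.finditer(line):
                hits.append((number, f"U+{ord(match.group()):04X}"))
    return hits


sample = "--- a/ci.yml\n+++ b/ci.yml\n@@ -1 +1 @@\n-run: make\n+run: make\u200b\n"
print(scan_added_lines(sample))
```

Feeding it the output of git diff for a pull request's changes would surface the edit that looks like a no-op in the rendered view.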
ESLint’s no-irregular-whitespace rule targets whitespace in specific syntactic positions and does not comprehensively cover identifier names or import strings. The unicode-bom rule only checks for the BOM at the start of a file. Neither rule covers embedded bidi characters in the general case.
Bandit, the Python security linter, does not scan for invisible Unicode in identifiers. Semgrep, depending on rule configuration, also does not flag this by default. GitHub’s CodeQL does not include a built-in query for suspicious codepoints, though custom queries can be written.
This gap has been documented since at least 2021, and the tooling response has been inconsistent.
Detection
The most reliable defense is a scanner that runs in CI or as a pre-commit hook. Here is a Python implementation that covers the highest-risk codepoints:
import pathlib
import sys

# Codepoints that render as nothing or that reorder the text around them.
SUSPICIOUS = (
    set(range(0x200B, 0x2010))    # ZERO WIDTH SPACE through RIGHT-TO-LEFT MARK
    | set(range(0x2060, 0x2065))  # WORD JOINER through INVISIBLE PLUS
    | set(range(0x202A, 0x202F))  # bidi embedding/override characters
    | set(range(0x2066, 0x206A))  # bidi isolate characters
    | {0xFEFF}                    # BOM / ZERO WIDTH NO-BREAK SPACE
)


def scan_file(path):
    """Return (line, column, codepoint) for each suspicious character in path."""
    # Undecodable bytes are replaced, not fatal, so binary files do not abort the scan.
    # A leading UTF-8 BOM is reported too, since it decodes to U+FEFF.
    text = pathlib.Path(path).read_text(encoding="utf-8", errors="replace")
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) in SUSPICIOUS:
                hits.append((lineno, col, ord(ch)))
    return hits


found = False
for arg in sys.argv[1:]:
    for lineno, col, cp in scan_file(arg):
        print(f"{arg}:{lineno}:{col}: suspicious codepoint U+{cp:04X}")
        found = True

# A non-zero exit status fails the CI job or pre-commit hook.
if found:
    sys.exit(1)
For a faster scan using ripgrep across a whole repository:
rg --pcre2 '[\x{200B}-\x{200F}\x{2060}-\x{2069}\x{202A}-\x{202E}\x{FEFF}]' .
Running this over a fresh clone of a dependency directory takes seconds and catches the most commonly used invisible codepoints. It does not cover every possible invisible character in Unicode, but it covers the attack patterns that appear in documented incidents.
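To cast a wider net than a fixed list, one possible approach is to flag every character whose Unicode general category is Cf (format), which includes all of the codepoints above plus others such as the soft hyphen. This is a sketch under that assumption, with a hypothetical function name. It will also flag legitimate uses, for example joiners in emoji sequences or in scripts that need them, so it suits source files better than user-facing text.

```python
import unicodedata


def find_format_chars(text: str) -> list[tuple[int, str]]:
    """Return (index, codepoint) for every character in general category Cf."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


print(find_format_chars("pay\u00ADload\u200b"))
```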
What Platforms Should Do
Package registries need to validate and normalize names before accepting them. PyPI already bans many confusable homoglyphs from package names following earlier typosquatting work. The invisible-character problem is a harder version of the same problem, and the fix is the same: reject any package name containing codepoints outside a strictly defined allowed set. For most package ecosystems, that set is a small subset of ASCII: letters, digits, hyphens, underscores, and periods. There is no legitimate reason a package name needs a zero-width joiner.
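An allowlist check of this kind is short to write. The pattern below is an assumed example of such a rule (ASCII letters and digits, with periods, underscores, and hyphens permitted in the middle). Each registry defines its own exact rules, so treat the pattern and the function name as illustrative.

```python
import re

# Assumed allowlist: starts and ends with an ASCII letter or digit,
# with '.', '_' and '-' allowed in between.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?")


def is_acceptable_name(name: str) -> bool:
    """Accept a package name only if every character is in the allowlist."""
    return ALLOWED_NAME.fullmatch(name) is not None


for candidate in ("requests", "requests\u200d", "lod\u200bash", "my-package_1.0"):
    print(ascii(candidate), is_acceptable_name(candidate))
```

Using fullmatch with explicit ASCII ranges matters here: shorthand classes such as \w match non-ASCII letters in Python's default Unicode mode, which would reopen the homoglyph problem.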
GitHub could surface a warning in the pull request diff view when a changed file contains codepoints outside the printable ASCII range. The information is available at render time. This is not a hard feature to add, and several users have requested exactly this since the Trojan Source paper got traction in 2021. The absence of it is a choice.
Editors have improved somewhat. Visual Studio Code added rendering of bidirectional control characters in late 2021 under the editor.renderControlCharacters setting, and followed it with the editor.unicodeHighlight settings for invisible and ambiguous characters. But zero-width spaces and joiners embedded in identifiers remain invisible by default in many other editors and review tools, because legitimate internationalized code occasionally uses them, and because they have not historically been treated as a security surface.
The Structural Problem
Security tooling for source code was built under the assumption that code is mostly ASCII, and that non-ASCII characters appear in string literals and comments rather than in identifier names or import paths. That assumption no longer holds, and the attack surface it leaves is not hypothetical. Supply chain attacks through dependency confusion and typosquatting have matured into a recognized threat category with established defenses. Invisible-character attacks sit outside those defenses entirely: they are not caught by name-similarity checks, not caught by integrity hashes (since the hash correctly identifies the malicious package you actually installed), and not caught by standard static analysis.
Adding a codepoint scanner to a CI pipeline is a small amount of work. The Trojan Source paper made this a known and named problem in 2021. Five years later, campaigns are still succeeding because that small amount of work has not been done at either the platform level or the project level. The current report is a data point, not an outlier.