The URL as an Exfiltration Channel: What Gemini's Mitigation Reveals About LLM Security Boundaries
Source: lobsters
When Google’s security team published their post on mitigating URL-based exfiltration in Gemini, they were documenting a class of attack that has been theorized and demonstrated against nearly every major LLM-integrated product since 2023. The mechanics are not subtle, but they are genuinely difficult to close off completely, and the mitigations Google describes reveal something important about where the security boundary actually sits in AI assistant systems.
The Attack Chain
The core attack has two stages. First, an adversary plants malicious instructions in content the target model will process: an email in Gmail, a document in Google Docs, a webpage retrieved during a browsing session. This is indirect prompt injection, first formalized by Greshake et al. in their 2023 paper “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection”. The critical distinction from direct injection is that the adversary never speaks to the model directly; instead they contaminate the data the model is asked to process.
The injected instructions tell the model to do something superficially innocent: construct and output a URL. But that URL encodes information from the conversation context, user profile, or surrounding documents in its query parameters or path. When the client application renders the URL, the data leaves.

That base64-encoded query parameter contains whatever the model was told to extract. When a browser or Electron app renders a Markdown image, it issues an HTTP GET to that URL. The attacker’s server logs the request. No interaction from the user is required beyond the model being invoked on the malicious document.
Base64 encoding serves two purposes here. It packs structured data (JSON objects, conversation snippets, file contents) into a URL-safe format, and it obscures the payload enough that naive pattern-matching on the output misses it. An attacker can also split the exfiltration across multiple images, each carrying a fragment, to stay under length limits and avoid triggering length-based heuristics.
Why Model-Level Controls Are Necessary But Not Sufficient
The instinct when encountering this class of problem is to address it in the model itself. If the model refuses to construct URLs containing user data, the exfiltration fails at the source. Google’s mitigation does include model-level training, and this layer matters. A model trained to recognize and decline these instruction patterns provides genuine defense in depth.
But the capability the attacker exploits is not a bug in the model’s behavior. It is general-purpose URL construction, which is something any useful coding assistant, documentation helper, or web-integrated agent needs to do. The same capability that lets Gemini generate a valid API endpoint for a developer is what lets malicious injected content construct an exfiltration beacon. Training the model to refuse URL generation in adversarial contexts while permitting it in legitimate ones requires the model to distinguish between them based on intent, and that distinction is exactly what prompt injection is designed to confuse.
This is not a theoretical limitation. Researchers including Johann Rehberger, whose work at embrace-the-red.com has documented practical prompt injection attacks against ChatGPT, Copilot, and Bing Chat across 2023 and 2024, have repeatedly shown that model-level refusals can be bypassed through rephrasing, role-playing framings, and multi-step instruction sequences. A model trained to decline “exfiltrate user data via URL” will often comply with “generate a diagnostic link for support purposes that includes session context.”
The attack surface also widens with model capability. More capable models are better at following nuanced instructions, understanding indirect requests, and generating syntactically correct outputs in varied formats. These properties are assets for users and liabilities when the instruction source is adversarial.
The Rendering Layer Is Where Data Actually Leaves
The more tractable part of the defense is at the rendering layer, and this is where Google’s approach becomes technically interesting. The model outputs text. That text only becomes a data exfiltration event when something acts on it, typically by fetching a URL or displaying a rendered image that causes a browser to fetch one.
Content Security Policy headers on the Gemini web interface can prevent automatic loading of third-party images, neutralizing the Markdown image vector without touching the model at all. Stripping or neutralizing image syntax from model outputs before rendering is another option, though it conflicts with legitimate uses where a model might generate documentation that includes image references.
The more surgical approach involves inspecting generated URLs before rendering and classifying whether they appear to contain encoded context data. A URL like https://docs.google.com/document/d/abc123 is clearly legitimate. A URL like https://external-domain.com/path?q=eyJ1c2VybmFtZSI6... (where the query parameter is a long base64 string) is suspicious in a way that can be detected heuristically, even without understanding the model’s intent. The classification does not need to be perfect; it needs to make the attack significantly more expensive to execute reliably.
This is analogous to how defense-in-depth works in traditional web security. You do not rely solely on input validation, the equivalent of training the model not to generate bad outputs, but also on output encoding and sanitization at the point where output is rendered or executed. The two layers protect against different failure modes.
How Other Systems Handle the Same Problem
The problem predates Gemini’s specific mitigation work. Microsoft’s Copilot integrations in Microsoft 365 were demonstrated to be vulnerable to similar attack patterns, where injected content in emails caused the model to construct and output URLs containing meeting notes and contact information. OpenAI’s ChatGPT with plugins and browsing faced early versions of this attack when it rendered Markdown images from plugin output without sanitization. The OWASP LLM Top 10 lists prompt injection as LLM01 and sensitive information disclosure as LLM02, with URL-based exfiltration explicitly discussed as an intersection of both.
Vendor responses have converged on a similar layered approach, but with different emphasis and tradeoffs. Anthropic’s Claude does not automatically fetch URLs or render images in its base API deployment, which removes the auto-beacon vector entirely at the cost of some capability in certain interfaces. OpenAI added warnings and sandboxing around external URL handling in plugin contexts. Microsoft added link sanitization and user confirmation steps before following external URLs in Copilot responses.
What Google’s post adds to the public record is specificity about the output classification and filtering approach applied before rendering, which is the layer that has received the least detailed public documentation across the ecosystem. Most published defenses focus on model training or interface-level controls; the middle layer of output classification is harder to describe without revealing bypass opportunities, which is likely why vendors have been cautious about it.
The Structural Tension
There is a deeper issue here that no combination of mitigations fully resolves. When an AI assistant is integrated into a system that handles sensitive information, processes external content, and produces output that gets rendered in a context that makes HTTP requests, you have assembled the components needed for an exfiltration channel. The model sits in the middle: it has access to sensitive data on one side and can produce output that reaches the network on the other.
The specific instruction that bridges those two things, embed this data in this URL, is a single sentence that any capable language model can follow. The defenses are real and they raise the cost of exploitation significantly. URL classifiers catch obvious beacons. Model training catches explicit injection patterns. CSP headers prevent auto-fetch. But indirect injection can be subtle and varied in phrasing, and URL generation is a core model capability, so the attack surface does not go to zero.
What Google’s approach represents is a mature response to that reality: address the problem in training, address it in output classification, and address it in the rendering pipeline. No single layer is sufficient. The interesting question going forward is not whether these defenses can be made perfect, but how to drive attacker effort high enough that consistent, reliable exploitation becomes impractical.
The published work on this class of attack is still sparse relative to the complexity of the problem. More transparency from vendors about their specific detection thresholds and classification approaches would raise the baseline for everyone building LLM-integrated systems, and the security community has generally made more progress on the attack side than the defense side. Google’s post is a step in the right direction, and it is worth reading carefully if you are building anything that puts a language model between sensitive user data and rendered output.