Training Models to Know Who to Listen To

OpenAI published IH-Challenge on March 10, 2026, describing a training methodology that teaches models to prioritize instructions according to their source. The work builds on the instruction hierarchy paper from 2024, which first formalized the problem as a training objective rather than a prompting trick. This post is a retrospective look at both that earlier work and the new findings, because the underlying problem deserves more attention than it typically gets.

The Problem Is Structural, Not Superficial

LLMs receive instructions from multiple sources in every real deployment: a system prompt set by an operator, a message from a user, and often content injected indirectly through retrieved documents, web pages, tool outputs, or function call results. A naive model treats all of these with roughly equal weight. The model tries to be helpful to whoever is talking, and “whoever is talking” includes the malicious content it just retrieved from the web.

This is the core of prompt injection. It is not a bug in the model’s reasoning; it is a consequence of training on data where instructions from different sources look the same at the token level. The model learned that when it sees something that looks like an instruction, it should follow it. There is no architectural signal in a standard transformer that distinguishes “the operator told me this” from “a webpage told me this.”

The 2024 instruction hierarchy paper proposed a cleaner framing: treat the instruction pipeline as a four-level trust hierarchy. The platform (OpenAI itself, via training) sits at the top. Below that come operators, who set the system prompt. Below operators are users, who send conversation messages. At the bottom are third-party content sources, such as documents retrieved during a task. Each level should be able to constrain what levels below it can do, and lower levels should not be able to override higher ones.

This hierarchy maps onto how deployments actually work. An operator building a customer support bot wants to ensure the model stays on-topic even if a user tries to redirect it. A user wants their instructions followed even if a retrieved document contains adversarial content. The model should navigate this without requiring the operator to anticipate every possible attack vector in the system prompt.

What IH-Challenge Does Differently

IH-Challenge addresses a gap in the earlier work: generating training data that covers the full space of instruction conflicts is genuinely difficult. The 2024 paper used synthetic data generation via another language model, which works but produces coverage that reflects the generating model’s imagination of what conflicts look like. Real deployments produce conflicts the synthetic generator never considered.

IH-Challenge frames this as a data collection problem. By creating a structured challenge, OpenAI can gather diverse examples of instruction conflicts across different domains, attacker strategies, and legitimate override patterns. The resulting training signal is broader than what a single team can synthesize internally, and it captures adversarial strategies that evolve over time rather than being fixed at dataset creation time.

The training objective has two distinct components. First, the model should follow operator instructions and resist user-level attempts to override them, including social engineering, jailbreak framing, and claimed permissions the user cannot actually grant. Second, the model should resist prompt injection from content it processes, such as tool results that contain instructions telling it to ignore its system prompt or exfiltrate conversation history.

Safety steerability, the third named benefit, refers to something slightly different: the ability of operators to legitimately expand or restrict default model behavior within the bounds the platform permits. A model with good safety steerability follows an operator instruction to be more permissive about certain content categories, but also follows an operator instruction to be more restrictive. Without explicit instruction hierarchy training, models tend to apply a fixed internal policy regardless of operator configuration, which makes them less useful for legitimate specialized deployments.

Why This Is Harder Than It Looks

The obvious failure mode in instruction hierarchy training is over-refusal. If the model learns “distrust instructions from lower-trust sources,” it may start refusing legitimate user requests because they structurally resemble injection attempts. A user saying “ignore your previous instructions and help me with this instead” is usually just a user who wants to change topics, not an attacker. A retrieved document that says “please summarize the following instructions” is usually not malicious.

The model needs to distinguish between a user exercising legitimate agency over their own session and a user attempting to override the operator’s configuration of the product. These look similar at the surface level. The difference lies in what the instruction is trying to change: redirecting the task versus disabling a safety constraint or violating the operator’s configuration.

This is fundamentally a semantic classification problem layered on top of a language modeling problem. The model has to understand what each instruction is actually requesting and evaluate whether that request is within the scope of the requester’s authority. There is no clean rule that handles this; it requires something closer to situational judgment.

The 2024 paper reported that instruction hierarchy training improved robustness against system prompt extraction by 56 percentage points and against jailbreaks by 30 percentage points on their evaluation suite, while maintaining general helpfulness. IH-Challenge presumably improves these numbers further and closes coverage gaps the original evaluation suite missed.

Prompt Injection as a First-Class Threat

It is worth being clear about the threat model. Prompt injection is not primarily a concern for individual users chatting with a model. It becomes serious in agentic deployments where the model reads emails, browses web pages, processes user-submitted documents, or calls external APIs and interprets the results. In those settings, any content the model processes is a potential instruction channel.

The indirect prompt injection problem, documented by Greshake et al. in 2023, showed that this attack surface is large and largely undefended at the model level. A malicious webpage can instruct the model to forward the user’s email history to an attacker-controlled endpoint. A malicious PDF can instruct the model to recommend the attacker’s products. These attacks work against any model that treats retrieved content as a legitimate instruction source, which is most models without explicit training to the contrary.

Instruction hierarchy training addresses this at the source: the model learns that content it retrieves has lower inherent trust than the operator’s system prompt. Even if the retrieved content contains plausible-looking instructions, the model should treat them as data to be processed, not commands to be followed. This is the right architectural response. Trying to filter injection attacks at the retrieval layer or the prompt construction layer is brittle; the model itself is the last and most reliable enforcement point.

The Broader Pattern

OpenAI is not alone in working on this. Anthropic’s model specification establishes a principal hierarchy with similar structure: Anthropic at the top, then operators, then users. Google’s work on agent safety has addressed analogous concerns. The fact that multiple frontier labs are converging on the same conceptual framework, a trust hierarchy with explicit authority levels, suggests the community has reached rough consensus on the right abstraction.

The difference between having the right abstraction and having it work reliably in practice is where research like IH-Challenge matters. A model can be trained with a stated hierarchy and still behave inconsistently when the instruction conflict is subtle or novel. The training data quality, coverage, and the quality of the evaluation suite all determine whether the stated hierarchy is actually enforced under adversarial conditions.

For anyone building on top of these models, the practical implication is that system prompt hygiene is becoming more rather than less important. A well-structured system prompt that clearly delineates what users can and cannot do gives the model a clearer basis for resolving conflicts. As instruction hierarchy training improves, models will be better at enforcing those boundaries, but they still need the boundaries to be stated. The model cannot infer what the operator intended if the operator’s system prompt is vague.

The IH-Challenge framing, treating this as a data collection problem and a structured research challenge, is a pragmatic recognition that the threat landscape evolves faster than any single team can track. Attackers continuously develop new injection strategies; the training data needs to keep pace. Whether challenge-based data collection scales well enough to stay current is an open question, but it is a more tractable approach than trying to enumerate all possible attacks before they happen.