Prompt Injection as Role Confusion
Pith reviewed 2026-05-15 20:16 UTC · model grok-4.3
The pith
Language models fall for prompt injection because they judge text by its sound rather than its actual source.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that prompt injection attacks succeed because models infer the source of text from how it sounds rather than from its true origin. Role probes reveal that attacker-controllable features such as syntactic patterns and lexical choice control the model's internal assignment of speaker identity. Text that merely resembles a trusted source occupies the same representational space as text that actually comes from one. This mechanism is demonstrated first with CoT Forgery, a zero-shot attack that injects fabricated reasoning and achieves 60 percent success on StrongREJECT across models with near-zero baselines, and then generalized to standard agent prompt injections. The paper
What carries the argument
Role confusion: the process by which models assign speaker identity to text based on stylistic features rather than provenance, measured by role probes that read out internal perceptions of 'who is speaking'.
If this is right
- Mitigations that target only surface content will leave models vulnerable if stylistic cues still control role assignment.
- Role probes can serve as an early diagnostic to flag models likely to suffer high attack rates before deployment.
- The same confusion mechanism explains both fabricated-reasoning attacks and standard agent prompt injections, so fixes must address role representation broadly.
- Attack success can be predicted in advance from probe measurements rather than discovered only through trial-and-error red-teaming.
Where Pith is reading between the lines
- If role confusion stems from training data patterns, then deliberately mixing stylistic signals across sources during pre-training could reduce the effect.
- The same probes might be applied to detect role blurring in multi-turn agent interactions where instructions accumulate from multiple origins.
- Extending the framework to vision-language models could test whether role confusion generalizes beyond pure text to multimodal inputs.
Load-bearing premise
The probes accurately measure the model's internal sense of speaker identity and this internal sense directly causes prompt-injection success instead of merely correlating with it.
What would settle it
An experiment that manipulates syntactic or lexical signals to produce high role-confusion scores while attack success remains near baseline levels, or vice versa.
Figures
read the original abstract
LLMs see the world as a single stream of text, partitioned into roles like <user> or <tool>. We trace prompt injection to role confusion: models perceive the source of text from how it sounds, not its labeled role. A command hidden in a webpage hijacks an agent simply because it sounds like <user> text, despite its <tool> label. We design role probes to measure how LLMs internally perceive "who is speaking," and find that injected text occupies the same representational space as the trusted role it imitates. We demonstrate this with CoT Forgery, a zero-shot attack that injects fabricated reasoning into user prompts and tool outputs. Models mistake the forgery for their own thoughts, yielding 60% attack success against frontier models with near-zero baselines. Strikingly, the degree of role confusion predicts attack success before a single token is generated. This mechanism generalizes beyond CoT Forgery to standard agent prompt injections, revealing prompt injection as a measurable consequence of role perception. To the model, sounding like a role is indistinguishable from being one.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that prompt injection attacks arise from role confusion in language models, where models infer text source from stylistic and syntactic cues rather than actual provenance. It introduces role probes to measure internal role perception, shows that attacker-controllable signals influence these perceptions, demonstrates a CoT Forgery attack achieving ~60% success (near-0% baselines) on StrongREJECT, reports that role confusion scores strongly predict attack success, and proposes a unifying framework reframing prompt injection as a measurable consequence of role representation rather than an ad-hoc exploit.
Significance. If the causal mediation holds, the work offers a mechanistic account of a persistent vulnerability class, moving beyond empirical attack catalogs toward interventions on internal role representations. The correlation between independent role probes and attack success, plus the zero-shot CoT Forgery result, would be a useful empirical anchor for safety research on agentic systems and retrieval-augmented models.
major comments (2)
- [Experiments / Results] The central causal claim—that role confusion (as measured by probes) drives prompt injection success rather than merely correlating with it—lacks supporting evidence. The abstract and described experiments report predictive correlation and attacker control of role signals, but no mediation analysis, activation patching, or ablation that holds surface features fixed while altering role representations is described. This leaves open the possibility that both role scores and attack rates are downstream of the same lexical/syntactic cues.
- [Role Probes] The role probes are presented as measuring internal perception of 'who is speaking,' yet the manuscript provides no validation that these probes capture causally relevant representations rather than surface statistics. Without probe validation against known role manipulations or comparison to alternative mechanistic interpretability methods, the claim that probes reveal the mechanism remains under-supported.
minor comments (2)
- [Abstract] The abstract states concrete attack success rates and predictive correlations but does not summarize the number of models, prompt templates, or statistical tests used; adding these details would improve reproducibility assessment.
- [CoT Forgery Experiments] Baseline comparisons are described as 'near-0%' without specifying the exact control conditions or whether they include standard prompt-injection defenses; clarifying this would strengthen the attack novelty claim.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review, which identifies key opportunities to strengthen the causal interpretation of our results. We respond to each major comment below and commit to revisions that address the concerns while preserving the core contributions of the work.
read point-by-point responses
-
Referee: [Experiments / Results] The central causal claim—that role confusion (as measured by probes) drives prompt injection success rather than merely correlating with it—lacks supporting evidence. The abstract and described experiments report predictive correlation and attacker control of role signals, but no mediation analysis, activation patching, or ablation that holds surface features fixed while altering role representations is described. This leaves open the possibility that both role scores and attack rates are downstream of the same lexical/syntactic cues.
Authors: We agree that the current manuscript primarily establishes a strong predictive correlation between role confusion scores and attack success, together with evidence that attacker-controllable signals influence role perception. No mediation analysis, activation patching, or surface-feature-controlled ablation is reported. In the revised version we will add activation patching experiments that intervene on role-related directions while holding lexical and syntactic features fixed, and we will explicitly qualify the causal language in the abstract and discussion to reflect the correlational nature of the existing results. revision: yes
-
Referee: [Role Probes] The role probes are presented as measuring internal perception of 'who is speaking,' yet the manuscript provides no validation that these probes capture causally relevant representations rather than surface statistics. Without probe validation against known role manipulations or comparison to alternative mechanistic interpretability methods, the claim that probes reveal the mechanism remains under-supported.
Authors: We acknowledge that the manuscript does not include explicit validation of the role probes against ground-truth role manipulations (e.g., direct system-prompt role assignments) or comparisons with other interpretability techniques such as linear probes on known features. In the revision we will add a dedicated validation subsection that (i) measures probe agreement with model behavior under explicit role instructions and (ii) benchmarks the probes against alternative methods, thereby clarifying what the probes capture beyond surface statistics. revision: yes
Circularity Check
Empirical probes and new attacks ground the framework without reduction to inputs
full rationale
The derivation relies on newly designed role probes that measure internal representations independently of attack success rates, plus a novel CoT Forgery attack with reported 60% success and near-0% baselines. Role confusion is shown to correlate with and predict success via these measurements, but the probes and attacks are not defined in terms of each other or fitted parameters renamed as predictions. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing steps. The unifying framework is presented as a reframing based on these fresh empirical results rather than a self-referential or by-construction equivalence. This qualifies as minor (score 2) rather than significant circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Models' internal representations encode speaker role or source information primarily from controllable stylistic and lexical signals rather than provenance metadata.
invented entities (1)
-
Role confusion
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
We design role probes which measure how models internally perceive 'who is speaking', showing that attacker-controllable signals (e.g. syntactic patterns, lexical choice) control role perception.
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanembed_injective unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
the degree of role confusion strongly predicts attack success even before generation
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.