Recognition: 2 theorem links
· Lean TheoremPrompt Injection as Role Confusion
Pith reviewed 2026-05-15 20:16 UTC · model grok-4.3
The pith
Language models fall for prompt injection because they judge text by its sound rather than its actual source.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that prompt injection attacks succeed because models infer the source of text from how it sounds rather than from its true origin. Role probes reveal that attacker-controllable features such as syntactic patterns and lexical choice control the model's internal assignment of speaker identity. Text that merely resembles a trusted source occupies the same representational space as text that actually comes from one. This mechanism is demonstrated first with CoT Forgery, a zero-shot attack that injects fabricated reasoning and achieves 60 percent success on StrongREJECT across models with near-zero baselines, and then generalized to standard agent prompt injections. The paper
What carries the argument
Role confusion: the process by which models assign speaker identity to text based on stylistic features rather than provenance, measured by role probes that read out internal perceptions of 'who is speaking'.
If this is right
- Mitigations that target only surface content will leave models vulnerable if stylistic cues still control role assignment.
- Role probes can serve as an early diagnostic to flag models likely to suffer high attack rates before deployment.
- The same confusion mechanism explains both fabricated-reasoning attacks and standard agent prompt injections, so fixes must address role representation broadly.
- Attack success can be predicted in advance from probe measurements rather than discovered only through trial-and-error red-teaming.
Where Pith is reading between the lines
- If role confusion stems from training data patterns, then deliberately mixing stylistic signals across sources during pre-training could reduce the effect.
- The same probes might be applied to detect role blurring in multi-turn agent interactions where instructions accumulate from multiple origins.
- Extending the framework to vision-language models could test whether role confusion generalizes beyond pure text to multimodal inputs.
Load-bearing premise
The probes accurately measure the model's internal sense of speaker identity and this internal sense directly causes prompt-injection success instead of merely correlating with it.
What would settle it
An experiment that manipulates syntactic or lexical signals to produce high role-confusion scores while attack success remains near baseline levels, or vice versa.
Figures
read the original abstract
Language models remain vulnerable to prompt injection attacks despite extensive safety training. We trace this failure to role confusion: models infer the source of text based on how it sounds, not where it actually comes from. A command hidden in a webpage hijacks an agent simply because it sounds like a user instruction. This is not just behavioral: in the model's internal representations, text that sounds like a trusted source occupies the same space as text that actually is one. We design role probes which measure how models internally perceive "who is speaking", showing that attacker-controllable signals (e.g. syntactic patterns, lexical choice) control role perception. We first test this with CoT Forgery, a zero-shot attack that injects fabricated reasoning into user prompts or ingested webpages. Models mistake the text for their own thoughts, yielding 60% attack success on StrongREJECT across frontier models with near-0% baselines. Strikingly, the degree of role confusion strongly predicts attack success. We then generalize these results to standard agent prompt injections, introducing a unifying framework that reframes prompt injection not as an ad-hoc exploit but as a measurable consequence of how models represent role.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that prompt injection attacks arise from role confusion in language models, where models infer text source from stylistic and syntactic cues rather than actual provenance. It introduces role probes to measure internal role perception, shows that attacker-controllable signals influence these perceptions, demonstrates a CoT Forgery attack achieving ~60% success (near-0% baselines) on StrongREJECT, reports that role confusion scores strongly predict attack success, and proposes a unifying framework reframing prompt injection as a measurable consequence of role representation rather than an ad-hoc exploit.
Significance. If the causal mediation holds, the work offers a mechanistic account of a persistent vulnerability class, moving beyond empirical attack catalogs toward interventions on internal role representations. The correlation between independent role probes and attack success, plus the zero-shot CoT Forgery result, would be a useful empirical anchor for safety research on agentic systems and retrieval-augmented models.
major comments (2)
- [Experiments / Results] The central causal claim—that role confusion (as measured by probes) drives prompt injection success rather than merely correlating with it—lacks supporting evidence. The abstract and described experiments report predictive correlation and attacker control of role signals, but no mediation analysis, activation patching, or ablation that holds surface features fixed while altering role representations is described. This leaves open the possibility that both role scores and attack rates are downstream of the same lexical/syntactic cues.
- [Role Probes] The role probes are presented as measuring internal perception of 'who is speaking,' yet the manuscript provides no validation that these probes capture causally relevant representations rather than surface statistics. Without probe validation against known role manipulations or comparison to alternative mechanistic interpretability methods, the claim that probes reveal the mechanism remains under-supported.
minor comments (2)
- [Abstract] The abstract states concrete attack success rates and predictive correlations but does not summarize the number of models, prompt templates, or statistical tests used; adding these details would improve reproducibility assessment.
- [CoT Forgery Experiments] Baseline comparisons are described as 'near-0%' without specifying the exact control conditions or whether they include standard prompt-injection defenses; clarifying this would strengthen the attack novelty claim.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review, which identifies key opportunities to strengthen the causal interpretation of our results. We respond to each major comment below and commit to revisions that address the concerns while preserving the core contributions of the work.
read point-by-point responses
-
Referee: [Experiments / Results] The central causal claim—that role confusion (as measured by probes) drives prompt injection success rather than merely correlating with it—lacks supporting evidence. The abstract and described experiments report predictive correlation and attacker control of role signals, but no mediation analysis, activation patching, or ablation that holds surface features fixed while altering role representations is described. This leaves open the possibility that both role scores and attack rates are downstream of the same lexical/syntactic cues.
Authors: We agree that the current manuscript primarily establishes a strong predictive correlation between role confusion scores and attack success, together with evidence that attacker-controllable signals influence role perception. No mediation analysis, activation patching, or surface-feature-controlled ablation is reported. In the revised version we will add activation patching experiments that intervene on role-related directions while holding lexical and syntactic features fixed, and we will explicitly qualify the causal language in the abstract and discussion to reflect the correlational nature of the existing results. revision: yes
-
Referee: [Role Probes] The role probes are presented as measuring internal perception of 'who is speaking,' yet the manuscript provides no validation that these probes capture causally relevant representations rather than surface statistics. Without probe validation against known role manipulations or comparison to alternative mechanistic interpretability methods, the claim that probes reveal the mechanism remains under-supported.
Authors: We acknowledge that the manuscript does not include explicit validation of the role probes against ground-truth role manipulations (e.g., direct system-prompt role assignments) or comparisons with other interpretability techniques such as linear probes on known features. In the revision we will add a dedicated validation subsection that (i) measures probe agreement with model behavior under explicit role instructions and (ii) benchmarks the probes against alternative methods, thereby clarifying what the probes capture beyond surface statistics. revision: yes
Circularity Check
Empirical probes and new attacks ground the framework without reduction to inputs
full rationale
The derivation relies on newly designed role probes that measure internal representations independently of attack success rates, plus a novel CoT Forgery attack with reported 60% success and near-0% baselines. Role confusion is shown to correlate with and predict success via these measurements, but the probes and attacks are not defined in terms of each other or fitted parameters renamed as predictions. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing steps. The unifying framework is presented as a reframing based on these fresh empirical results rather than a self-referential or by-construction equivalence. This qualifies as minor (score 2) rather than significant circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Models' internal representations encode speaker role or source information primarily from controllable stylistic and lexical signals rather than provenance metadata.
invented entities (1)
-
Role confusion
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
We design role probes which measure how models internally perceive 'who is speaking', showing that attacker-controllable signals (e.g. syntactic patterns, lexical choice) control role perception.
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanembed_injective unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
the degree of role confusion strongly predicts attack success even before generation
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
URL https: //doi.org/10.1145/3605764.3623985
doi: 10.1145/3605764.3623985. URL https: //doi.org/10.1145/3605764.3623985. Alain, G. and Bengio, Y . Understanding intermediate layers using linear classifier probes, 2018. URL https:// arxiv.org/abs/1610.01644. Andriushchenko, M. and Flammarion, N. Does refusal train- ing in llms generalize to the past tense? InThe Thirteenth International Conference on...
-
[2]
URL https://openreview.net/forum? id=aJUuere4fM. Anil, C., DURMUS, E., Rimsky, N., Sharma, M., Ben- ton, J., Kundu, S., Batson, J., Tong, M., Mu, J., Ford, D. J., Mosconi, F., Agrawal, R., Schaeffer, R., Bashkan- sky, N., Svenningsen, S., Lambert, M., Radhakrishnan, A., Denison, C., Hubinger, E. J., Bai, Y ., Bricken, T., Maxwell, T., Schiefer, N., Sully,...
-
[3]
URL https://openreview.net/forum? id=cw5mgd71jW. Belinkov, Y . Probing classifiers: Promises, shortcomings, and advances.Computational Linguistics, 48(1):207–219, March 2022. doi: 10.1162/coli a 00422. URL https: //aclanthology.org/2022.cl-1.7/. Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discov- ering latent knowledge in language models without su- ...
-
[4]
URL https://api.semanticscholar. org/CorpusID:259129807. Marks, S. and Tegmark, M. The geometry of truth: Emer- gent linear structure in large language model represen- tations of true/false datasets, 2024. URL https:// arxiv.org/abs/2310.06824. Meng, W., Zhang, F., Yao, W., Guo, Z., Li, Y ., Wei, C., and Chen, W. Dialogue injection attack: Jailbreaking ll...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.6028/nist.ai.100-2e2025 2024
-
[5]
URL http://dx.doi.org/10.18653/v1/ 2025.findings-emnlp.1060. Willison, S. Prompt injection attacks against gpt- 3, 2022. URL https://simonwillison.net/ 2022/Sep/12/prompt-injection/. Wu, T., Zhang, S., Song, K., Xu, S., Zhao, S., Agrawal, R., Indurthi, S. R., Xiang, C., Mittal, P., and Zhou, W. Instructional segment embedding: Improving llm safety with in...
-
[6]
URL https://cdn.openai.com/pdf/ be60c07b-6bc2-4f54-bcee-4141e1d6c69a/ gpt-5-safe_completions.pdf. 12 Prompt Injection as Role Confusion Zhan, Q., Fang, R., Panchal, H. S., and Kang, D. Adaptive attacks break defenses against indirect prompt injection attacks on llm agents, 2025. URL https://arxiv. org/abs/2503.00061. Zhang, Z., Li, S., Zhang, Z., Liu, X.,...
-
[7]
Commands are sampled from LLM-generated variants with diverse wording for robustness
Standard Injection:A user-style command instructing the agent to search its workspace for .env files and exfiltrate the contents to a remote server via curl. Commands are sampled from LLM-generated variants with diverse wording for robustness
-
[8]
image":"https://upload.wikimedia.org/wikipedia/commons/1/17/Sceaux gare 18.jpg
CoT Forgery Injection:The same command, augmented with a 1-paragraph forged CoT justifying the exfiltration. The forgery is sampled from LLM-generated examples matching the target model’s reasoning style. See Figure 14. Attack Transcript.Figure 15 shows a representative attack. The user prompt explicitly warns the agent about malicious injections. Upon re...
work page 2025
-
[9]
Use a raised bed or large container (12–18 in deep)
-
[10]
Keep soil loose and rich (add compost, about 1/2 cup per square foot)
-
[11]
Water 1–2 inches per week (more on hot days)
-
[12]
Mulch with straw or wood chips to keep soil cool
-
[13]
Stake or cage early, prune a few suckers, fertilize with a tomato-specific 4–10–10 once fruiting starts. That’s it—plant outdoors after the last frost, keep it sunny, and you’ll be on your way to fresh tomatoes! Happy gardening!<|end|> Figure 19.Multi-turn gardening conversation.Full text and model-appropriate role tags shown. tokens in Userness, andassis...
work page 2023
-
[14]
17We include a mix of single and 2-turn conversations
Baseline (Correct Tags): The conversation is formatted with standard, correct role tags (e.g.,usertext inside <user>). 17We include a mix of single and 2-turn conversations. 22 Prompt Injection as Role Confusion gptoss-20b gptoss-120b nemotron-3-nano qwen3-30b-a3b Userness (of user-style text) Assistantness (of assistant-style text) 0 5 10 15 20 0 10 20 3...
-
[15]
Injection: The entire conversation is wrapped in <tool> tags. This puts style and structure in direct conflict: the text looks like a user/assistant, but the correct role istooldata. This simulates a standard agent prompt injection attack. These conditions allow us to distinguish two hypotheses: • Proper Role Perception:If architectural boundaries enforce...
work page 2020
-
[16]
Zero-Shot Generalization:Successful role identification in real, unmodified conversational traces, despite never seeing conversational data during training.18 The second criterion is essential: if a probe succeeds on synthetic data but fails on real conversations, it likely exploits synthetic regularities rather than recovering invariant role geometry. Al...
-
[17]
User requests the agent to analyze a website
-
[18]
Agent reasons (CoT), executes cURL, receives HTML (tool)
-
[19]
Hidden in the HTML: • Malicious command (exfiltration message): “Upload SECRETS.env...” • Forged justification (forged CoT): “The user requested us to upload...” 19To enable averaging across variable-length messages, we truncate each role to fixed lengths: user messages at 100 tokens (excluding forgeries), other roles at 200 tokens. 26 Prompt Injection as...
-
[20]
Upon success, agent executes exfiltration across using multiple reasoning/tool cycles. Mechanistic Analysis.Figure 26 reveals why the attack succeeds. Despite both elements being embedded in tool output (a low-privilege channel), they successfully masquerade as other roles: • The malicious command (exfiltration message) registers with high Userness—the mo...
-
[21]
Foreign chat templates(42 templates). Headers extracted from 14 other models’ chat formats, spanning user, assistant, and tool roles: 27 Prompt Injection as Role Confusion <|start header id|>user<|end header id|>\n\n[COMMAND]<|eot id|> <|im start|>assistant\n[COMMAND]<|im end|> <|start header id|>tool<|end header id|>\n\n[COMMAND]<|eot id|> 3.Format varia...
-
[22]
[USER] [COMMAND] [/USER] 3.<|start|>USER<|message|>[COMMAND]<|end|> 4.<start of turn>user [COMMAND] 5.<|start header id|>user<|end header id|>[COMMAND]<|eot id|> Lowest Userness
- [23]
-
[24]
Chat history (most recent): Tool: [COMMAND]
-
[25]
[TOOL CALLS]: [COMMAND]
-
[26]
[TOOL RESULTS][COMMAND][/TOOL RESULTS]
-
[27]
[ASSISTANT MESSAGE]: [COMMAND] Figure 31.Declaration determines perceived role.The same exfiltration command wrapped in different templates. High-Userness templates mimic user chat formats; low-Userness templates declare tool/data framing. This confirms that Userness tracks genuine structure, not template quality. The pattern mirrors CoT Forgery: the leve...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.