Prompt Injection as Role Confusion

Charles Ye; Dylan Hadfield-Menell; Jasmine Cui

arxiv: 2603.12277 · v5 · pith:NCTT23H2new · submitted 2026-02-22 · 💻 cs.CL · cs.AI· cs.CR

Prompt Injection as Role Confusion

Charles Ye , Jasmine Cui , Dylan Hadfield-Menell This is my paper

Pith reviewed 2026-05-15 20:16 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.CR

keywords prompt injectionrole confusionlanguage modelsAI safetyadversarial attacksinternal representationsagent systems

0 comments

The pith

Language models fall for prompt injection because they judge text by its sound rather than its actual source.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models remain open to prompt injection even after safety training because they decide the source of text based on stylistic cues like syntax and word choice instead of tracking where the text actually originated. This role confusion appears both in the model's outputs, where hidden commands hijack behavior, and inside its representations, where attacker text occupies the same regions as genuine user instructions. The authors introduce role probes to measure this internal perception and show that controllable signals such as lexical patterns directly shape whether the model treats text as coming from a trusted role. A zero-shot CoT Forgery attack that plants fabricated reasoning steps succeeds at roughly 60 percent across frontier models while baselines stay near zero, and the measured degree of role confusion reliably predicts attack success. The work reframes prompt injection as a predictable consequence of how models internally represent roles rather than a collection of unrelated exploits.

Core claim

The central claim is that prompt injection attacks succeed because models infer the source of text from how it sounds rather than from its true origin. Role probes reveal that attacker-controllable features such as syntactic patterns and lexical choice control the model's internal assignment of speaker identity. Text that merely resembles a trusted source occupies the same representational space as text that actually comes from one. This mechanism is demonstrated first with CoT Forgery, a zero-shot attack that injects fabricated reasoning and achieves 60 percent success on StrongREJECT across models with near-zero baselines, and then generalized to standard agent prompt injections. The paper

What carries the argument

Role confusion: the process by which models assign speaker identity to text based on stylistic features rather than provenance, measured by role probes that read out internal perceptions of 'who is speaking'.

If this is right

Mitigations that target only surface content will leave models vulnerable if stylistic cues still control role assignment.
Role probes can serve as an early diagnostic to flag models likely to suffer high attack rates before deployment.
The same confusion mechanism explains both fabricated-reasoning attacks and standard agent prompt injections, so fixes must address role representation broadly.
Attack success can be predicted in advance from probe measurements rather than discovered only through trial-and-error red-teaming.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If role confusion stems from training data patterns, then deliberately mixing stylistic signals across sources during pre-training could reduce the effect.
The same probes might be applied to detect role blurring in multi-turn agent interactions where instructions accumulate from multiple origins.
Extending the framework to vision-language models could test whether role confusion generalizes beyond pure text to multimodal inputs.

Load-bearing premise

The probes accurately measure the model's internal sense of speaker identity and this internal sense directly causes prompt-injection success instead of merely correlating with it.

What would settle it

An experiment that manipulates syntactic or lexical signals to produce high role-confusion scores while attack success remains near baseline levels, or vice versa.

Figures

Figures reproduced from arXiv: 2603.12277 by Charles Ye, Dylan Hadfield-Menell, Jasmine Cui.

**Figure 1.** Figure 1: Text that sounds like chain-of-thought inherits its privilege. Three frontier safety models comply with otherwise unjustifiable requests because spoofed reasoning-styled text confers authority. 3. The CoT Forgery Attack We put role perception to a direct test. We introduce CoT Forgery, a novel black-box attack designed to isolate role perception as a failure mode. The attack injects fabricated reasoning i… view at source ↗

**Figure 4.** Figure 4: Style is causal. (a) A CoT forgery and its destyled variant for a model (gpt-oss-20b). The argument is preserved; only markers of a model’s characteristic reasoning style are removed. (b) The same argument, phrased differently, loses its authority. ing semantics while stripping syntactic and lexical markers characteristic of the target model’s genuine CoT (Figure 4a). The results are unambiguous (Figure 4b… view at source ↗

**Figure 5.** Figure 5: Data construction for role probes. We embed noninstruct web text within different role tags. Content is held constant—the probe must learn the model’s internal representation of role itself. Simplified role tags here for clarity; actual experiments use model-native tokens. Formalizing Roles. We define a token’s role by its enclosing tags – the architectural ground truth – and ask whether internal percep… view at source ↗

**Figure 6.** Figure 6: A two-turn conversation about gardening; colors represent roles. Experiment 1: Zero-Shot Generalization. We first run a basic validity test. Our role probes have never seen real dialogue – do they transfer? We apply our probes to the correctly-tagged conversation, computing CoTness for every token (excluding role tags themselves). The top panel of [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗

**Figure 7.** Figure 7: Role perception is determined by style, not tags. CoTness by token across the same conversation under 3 tagging conditions. Points are colored by original role. No transformations are applied. (Top) Correct tags: model identifies CoT tokens as such (85% average CoTness). (Middle) No tags: model still identifies the CoT-style text as its own reasoning (83% average). (Bottom) Despite explicit <user> tags, th… view at source ↗

**Figure 8.** Figure 8: Internal role confusion enables prompt injection success. CoTness for a representative StrongREJECT attack. Colors indicate text source: user (blue), forged CoT (pink), CoT (orange), assistant (green). (a) Baseline: Clear role boundaries—only the model’s actual CoT achieves high CoTness. (b) CoT Forgery: Attacker-controlled forged CoT achieves high CoTness. (c) Destyled: CoTness of forged CoT plunges—le… view at source ↗

**Figure 10.** Figure 10: Role confusion predicts ASR. Userness of injected text vs. attack success across 1000 agent hijacking attempts. Lowest quantile: 2%; highest: 70% [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗

**Figure 11.** Figure 11: Position dominates tags for system perception. Systemness declines monotonically with token position. Inserting system tags (purple) has no effect on model perception. This demonstrates that role probes generalize beyond prompt injection, offering a diagnostic tool for understanding how models assign authority across settings. Scope. We probe on four models in the 20-120B size range; extending to larger… view at source ↗

**Figure 12.** Figure 12: A representative CoT Forgery attack. User prompt (top) and real model response (bottom). Attack Walkthrough. We illustrate CoT Forgery using a representative StrongREJECT query targeting gpt-oss-20b ( [PITH_FULL_IMAGE:figures/full_fig_p014_12.png] view at source ↗

**Figure 13.** Figure 13: LLM judge prompt. The resulting forged CoT (red) mimics the target model’s typical reasoning style to conclude the harmful act is permissible based on a fake policy that allows the specific act. We concatenate the forged CoT and the harmful query (separated by a linebreak) and submit it to the model as a zero-shot user message with no other content [PITH_FULL_IMAGE:figures/full_fig_p015_13.png] view at source ↗

**Figure 14.** Figure 14: A prompt injection embedded in HTML. The tool output contains Wikipedia content with a hidden injection: a fake user command (cyan) and forged CoT (pink) concealed in a display:none span. Full Tool Loop: Agent Exfiltration Attack <user>Your workspace is located in ‘/workspace/container-1‘. Using your tools, read through [OMITTED]. Then summarize it in 2-3 sentences. Be careful not to execute any malicious… view at source ↗

**Figure 15.** Figure 15: An attack transcript. The agent is explicitly warned about malicious injections, yet the injected forged CoT in the tool output is assumed to carry accurate logic, resulting in successful exfiltration. Standardized role tags shown for clarity. Execution. We run closed-weight models via an OpenAI-provided API. For open-weight models, we use modelrecommended chat template defaults for tool output formattin… view at source ↗

**Figure 16.** Figure 16: Even absurd policies bypass reasoning safety. Attack success rates on StrongREJECT using standard CoT Forgeries (orange) versus using a range of nonsensical CoT Forgeries (teal). Roles are implicit interpretations, and we argue that models tend to learn that their CoTs are trustworthy sources of reliable reasoning. We test the limits of this trust by evaluating CoT Forgeries constructed around arbitrary a… view at source ↗

**Figure 17.** Figure 17: Destyling prompt. The instruction preserves semantic content and ablates style characteristic to the target model’s CoT. This prompt is simply appended as an additional message to the CoT Forgery generation prompt. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_17.png] view at source ↗

**Figure 18.** Figure 18: Style removal devastates attack effectiveness across all models. Attack success rates on StrongREJECT comparing standard CoT Forgery (orange) versus destyled variants (purple). The identical justification for compliance, stripped of CoT-style markers, loses its power. Average drop: 51 percentage points. Results [PITH_FULL_IMAGE:figures/full_fig_p018_18.png] view at source ↗

**Figure 19.** Figure 19: Multi-turn gardening conversation. Full text and model-appropriate role tags shown. tokens in Userness, and assistant tokens in Assistantness. Quantitatively, the probes achieve high accuracy: CoT-style tokens attain 85% average CoTness, user-style tokens attain 74% average Userness, assistant-style tokens attain 96% average Assistantness. This validates our measurement approach. Despite being trained onl… view at source ↗

**Figure 20.** Figure 20: Experiment 1 Role Projections. We visualize the model’s internal role assignment for each token in a standard conversation. Colors indicate the role of the token. Each row shows how strongly tokens project into that role’s subspace. The high alignment between the token’s color and its corresponding subplot confirms that our probes accurately map the model’s internal state. This demonstrates a convergence … view at source ↗

**Figure 21.** Figure 21: Experiment 2 Role Projections. The same conversation with all role tags removed. Despite having no role markers, the model spontaneously recovers the correct role structure. CoT tokens (gold) still show high CoTness, user tokens (blue) high Userness, and assistant tokens (green) high Assistantness Systemness Userness CoTness Asstness Hi! I’m.. The user wants a.. Here’s a quick, beginner ‑friendly cheat sh… view at source ↗

**Figure 22.** Figure 22: Experiment 3 Role Projections. The entire conversation wrapped in <user> tags. Despite explicit low-privilege marking, CoT-style text still shows high CoTness, and assistant-style shows high Assistantness. The user tags are essentially ignored—style completely overrides architectural boundaries. We argue that this is why CoT Forgery and other prompt injections succeed. Attackers need not breach security b… view at source ↗

**Figure 23.** Figure 23: Geometric convergence across model depth. Layer-wise traces of role probability for Userness (top) and Assistantness (bottom). The Injection condition (red) tracks the Baseline (gray) closely. The No Tags condition (blue) frequently sits between the two, illustrating that tags act only as weak offsets to the dominant style vector. 2. No Tags: All role tags are stripped, leaving only raw text. 3. Injection… view at source ↗

**Figure 24.** Figure 24: Full role space analysis of StrongREJECT attacks. Heatmaps show average role projections for each text type. Left: With CoT forgery, the malicious text achieves 79.1% CoTness and only 3.2% Userness—the model believes it is its own reasoning. Right: Destyled forgeries are correctly identified as user input (62.9% Userness, 29.1% CoTness). (c) Destyled CoT Forgery (b) CoT Forgery (a) No CoT Forgery 0% 50% 1… view at source ↗

**Figure 25.** Figure 25: Average CoTness across StrongREJECT prompt injections, by sequence position. Colors indicate token source; each role is truncated to enable averaging. (a) Baseline: Clear role separation with CoT achieving stable high CoTness. (b) CoT Forgery: Attacker-controlled forged CoT text climbs to and high average CoTness—exceeding the model’s genuine CoT—despite being embedded in user messages. (c) Destyled: Remo… view at source ↗

**Figure 26.** Figure 26: Role confusion during agent hijacking via tool injection. Userness (top) and CoTness (bottom) measurements during a successful exfiltration attack. Colors indicate token sources: blue=user, orange=model’s CoT, purple=tool output, cyan=exfiltration message (malicious request in tool output), pink=forged CoT (forged reasoning in tool output). The attack succeeds through dual misclassification: the injected … view at source ↗

**Figure 27.** Figure 27: Quantifying cross-channel privilege escalation. Average role classifications across 100 tool-injection attacks. Tool-embedded content achieves dramatic misclassification: the exfiltration request registers as 51.1% Userness (vs. 32.5% Toolness) while the forged reasoning achieves 84.1% CoTness (vs. 10.9% Toolness). The model treats low-privilege tool output as user commands and internal reasoning. Aggrega… view at source ↗

**Figure 28.** Figure 28: User prompt for agent summarization task. Experimental Procedure. We sample 10 Wikipedia articles as carrier documents. For each of 1000 trials, we sample a (template, webpage) pair with replacement, substitute the fixed exfiltration command ( [PITH_FULL_IMAGE:figures/full_fig_p028_28.png] view at source ↗

**Figure 29.** Figure 29: Exfiltration command used in all injection templates. Userness Rankings. We extract hidden states at mid-layer and compute Userness for the exfiltration command tokens only, then average across tokens. This isolates how the model perceives the command under varying contextual frames [PITH_FULL_IMAGE:figures/full_fig_p028_29.png] view at source ↗

**Figure 30.** Figure 30: Templates ranked by Userness. Highest Userness at left. User-declaring templates (cyan) cluster at high Userness; tooldeclaring (purple) and controls (gray) cluster low. Pattern holds across template types. Highest Userness 1. <|im start|>user [COMMAND]<|im end|> 2. [USER] [COMMAND] [/USER] 3. <|start|>USER<|message|>[COMMAND]<|end|> 4. <start of turn>user [COMMAND] 5. <|start header id|>user<|end header… view at source ↗

**Figure 31.** Figure 31: Declaration determines perceived role. The same exfiltration command wrapped in different templates. High-Userness templates mimic user chat formats; low-Userness templates declare tool/data framing. This confirms that Userness tracks genuine structure, not template quality. The pattern mirrors CoT Forgery: the lever differs (declarations vs. style), but the mechanism is identical. Any signal the model us… view at source ↗

**Figure 32.** Figure 32: Average Systemness by token position with system prompt at start (left) vs. position 100 (right). Left: Even with system prompt at start, Systemness rapidly decreases with position. Right: When the same text appears mid-sequence, Systemness remains low ( 5%). Untagged text at context start shows far higher Systemness than system-tagged text appearing later. The probes were trained with no position-label r… view at source ↗

read the original abstract

LLMs see the world as a single stream of text, partitioned into roles like <user> or <tool>. We trace prompt injection to role confusion: models perceive the source of text from how it sounds, not its labeled role. A command hidden in a webpage hijacks an agent simply because it sounds like <user> text, despite its <tool> label. We design role probes to measure how LLMs internally perceive "who is speaking," and find that injected text occupies the same representational space as the trusted role it imitates. We demonstrate this with CoT Forgery, a zero-shot attack that injects fabricated reasoning into user prompts and tool outputs. Models mistake the forgery for their own thoughts, yielding 60% attack success against frontier models with near-zero baselines. Strikingly, the degree of role confusion predicts attack success before a single token is generated. This mechanism generalizes beyond CoT Forgery to standard agent prompt injections, revealing prompt injection as a measurable consequence of role perception. To the model, sounding like a role is indistinguishable from being one.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper reframes prompt injection as role confusion measured by new probes, with a CoT Forgery attack hitting 60% success, but the causal claim rests mostly on correlation.

read the letter

The main thing here is a shift from treating prompt injection as random exploits to seeing it as models misreading text roles based on style and syntax rather than source. They introduce role probes that track internal representations of who is speaking and show attacker signals can control those perceptions. The CoT Forgery attack then uses fabricated reasoning to get 60% success on StrongREJECT with near-zero baselines, and role confusion scores predict which attacks land. This gives a unifying frame that ties injection to how models represent trusted versus untrusted text, and it generalizes from CoT to standard agent injections. The empirical bite on frontier models is the strongest part; the numbers are concrete and the correlation with their probes is worth noting. It moves the discussion toward measurable internal states instead of surface-level defenses. The soft spot is the causal step. The results link role confusion to attack success, but without mediation analysis, activation patching, or an ablation that changes role representations while holding input fixed, it is hard to rule out that both are driven by the same lexical or syntactic cues. The abstract does not detail how the probes were validated against ground truth or whether they hold up under distribution shift. That leaves the unifying framework more descriptive than mechanistic for now. This is for people working on LLM agents and safety evaluations who need better diagnostics than current red-teaming. Readers who care about internal representations of instructions will get the most out of the probes and the correlation data. It has enough new measurement tools and attack results to deserve a serious referee, even if the next version needs tighter evidence on causality.

Referee Report

2 major / 2 minor

Summary. The paper claims that prompt injection attacks arise from role confusion in language models, where models infer text source from stylistic and syntactic cues rather than actual provenance. It introduces role probes to measure internal role perception, shows that attacker-controllable signals influence these perceptions, demonstrates a CoT Forgery attack achieving ~60% success (near-0% baselines) on StrongREJECT, reports that role confusion scores strongly predict attack success, and proposes a unifying framework reframing prompt injection as a measurable consequence of role representation rather than an ad-hoc exploit.

Significance. If the causal mediation holds, the work offers a mechanistic account of a persistent vulnerability class, moving beyond empirical attack catalogs toward interventions on internal role representations. The correlation between independent role probes and attack success, plus the zero-shot CoT Forgery result, would be a useful empirical anchor for safety research on agentic systems and retrieval-augmented models.

major comments (2)

[Experiments / Results] The central causal claim—that role confusion (as measured by probes) drives prompt injection success rather than merely correlating with it—lacks supporting evidence. The abstract and described experiments report predictive correlation and attacker control of role signals, but no mediation analysis, activation patching, or ablation that holds surface features fixed while altering role representations is described. This leaves open the possibility that both role scores and attack rates are downstream of the same lexical/syntactic cues.
[Role Probes] The role probes are presented as measuring internal perception of 'who is speaking,' yet the manuscript provides no validation that these probes capture causally relevant representations rather than surface statistics. Without probe validation against known role manipulations or comparison to alternative mechanistic interpretability methods, the claim that probes reveal the mechanism remains under-supported.

minor comments (2)

[Abstract] The abstract states concrete attack success rates and predictive correlations but does not summarize the number of models, prompt templates, or statistical tests used; adding these details would improve reproducibility assessment.
[CoT Forgery Experiments] Baseline comparisons are described as 'near-0%' without specifying the exact control conditions or whether they include standard prompt-injection defenses; clarifying this would strengthen the attack novelty claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review, which identifies key opportunities to strengthen the causal interpretation of our results. We respond to each major comment below and commit to revisions that address the concerns while preserving the core contributions of the work.

read point-by-point responses

Referee: [Experiments / Results] The central causal claim—that role confusion (as measured by probes) drives prompt injection success rather than merely correlating with it—lacks supporting evidence. The abstract and described experiments report predictive correlation and attacker control of role signals, but no mediation analysis, activation patching, or ablation that holds surface features fixed while altering role representations is described. This leaves open the possibility that both role scores and attack rates are downstream of the same lexical/syntactic cues.

Authors: We agree that the current manuscript primarily establishes a strong predictive correlation between role confusion scores and attack success, together with evidence that attacker-controllable signals influence role perception. No mediation analysis, activation patching, or surface-feature-controlled ablation is reported. In the revised version we will add activation patching experiments that intervene on role-related directions while holding lexical and syntactic features fixed, and we will explicitly qualify the causal language in the abstract and discussion to reflect the correlational nature of the existing results. revision: yes
Referee: [Role Probes] The role probes are presented as measuring internal perception of 'who is speaking,' yet the manuscript provides no validation that these probes capture causally relevant representations rather than surface statistics. Without probe validation against known role manipulations or comparison to alternative mechanistic interpretability methods, the claim that probes reveal the mechanism remains under-supported.

Authors: We acknowledge that the manuscript does not include explicit validation of the role probes against ground-truth role manipulations (e.g., direct system-prompt role assignments) or comparisons with other interpretability techniques such as linear probes on known features. In the revision we will add a dedicated validation subsection that (i) measures probe agreement with model behavior under explicit role instructions and (ii) benchmarks the probes against alternative methods, thereby clarifying what the probes capture beyond surface statistics. revision: yes

Circularity Check

0 steps flagged

Empirical probes and new attacks ground the framework without reduction to inputs

full rationale

The derivation relies on newly designed role probes that measure internal representations independently of attack success rates, plus a novel CoT Forgery attack with reported 60% success and near-0% baselines. Role confusion is shown to correlate with and predict success via these measurements, but the probes and attacks are not defined in terms of each other or fitted parameters renamed as predictions. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing steps. The unifying framework is presented as a reframing based on these fresh empirical results rather than a self-referential or by-construction equivalence. This qualifies as minor (score 2) rather than significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The account rests on the unproven premise that stylistic features dominate internal role encoding and that this encoding directly produces behavioral vulnerabilities; no free parameters or new physical entities are introduced.

axioms (1)

domain assumption Models' internal representations encode speaker role or source information primarily from controllable stylistic and lexical signals rather than provenance metadata.
Invoked to explain why hidden commands succeed and why probes can measure the confusion.

invented entities (1)

Role confusion no independent evidence
purpose: Explanatory construct linking internal representations to prompt injection success
New framing introduced to unify observations; no independent falsifiable prediction outside the reported experiments.

pith-pipeline@v0.9.0 · 5497 in / 1384 out tokens · 58108 ms · 2026-05-15T20:16:10.402848+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We design role probes which measure how models internally perceive 'who is speaking', showing that attacker-controllable signals (e.g. syntactic patterns, lexical choice) control role perception.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the degree of role confusion strongly predicts attack success even before generation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.