Can LLMs Reliably Self-Report Adversarial Prefills, and How?

Quang Minh Nguyen; Taegyoon Kim; Uzair Ahmed

arxiv: 2606.23671 · v2 · pith:63IVIGZ3new · submitted 2026-06-22 · 💻 cs.CL

Can LLMs Reliably Self-Report Adversarial Prefills, and How?

Quang Minh Nguyen , Uzair Ahmed , Taegyoon Kim This is my paper

Pith reviewed 2026-06-26 08:15 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM introspectionadversarial prefillsafety alignmentrefusal directionself-reportingLoRA finetuningattack success rateprobe dependence

0 comments

The pith

No LLM reliably recognizes when its own responses were elicited by adversarial prefill attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs can detect that a prior output resulted from an adversarial prefill rather than their own natural generation. Across ten open-weight models from 3B to 70B parameters and four safety benchmarks, models claim intent on prefilled responses at an average rate of only 27.3 percent. The signal for this limited self-recognition comes primarily from safety and refusal-related reasoning in the model weights. Removing the refusal direction by orthogonalization collapses the difference in claiming rates between prefilled and natural outputs to near zero. Three different LoRA finetuning approaches widen the gap for intention probes on larger models but fail to transfer to tampering probes and increase attack success rates under adversarial prefill on most models.

Core claim

No model reliably recognizes its own compromised outputs, with models claiming intent on prefilled responses at an average rate of 27.3%. Introspective signal stems largely from safety- and refusal-related reasoning. Orthogonalizing models' weights against the refusal direction collapses the gap between claiming rates on prefilled and natural outputs to near zero, though the direction is not its unique mediator. The signal is also probe-dependent: framing the question as internal intention versus external tampering elicits qualitatively different responses on the same models. Finetuning with SFT, GRPO, or DPO widens the intention-probe gap on models from 8B to 27B but does not transfer to th

What carries the argument

The introspective signal arising from safety- and refusal-related reasoning in model weights, measured by the gap in self-reported intent between prefilled and natural responses.

If this is right

Introspective capability on safety failures depends on the presence of refusal-related reasoning.
The gap in self-reports is not robust to changes in how the probe question is framed.
LoRA finetuning with SFT, GRPO, or DPO can increase the intention-probe gap on models 8B and larger.
Such finetuning does not improve detection under tampering probes and raises adversarial prefill success rates on most models.
No tested model achieves reliable self-reporting of compromise.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The findings imply that current safety training may create only surface-level refusal patterns rather than a stable internal representation of generation history.
Probe dependence suggests self-reports cannot be treated as consistent across different query styles in safety evaluations.
The partial mitigation from finetuning indicates that improving one form of self-report may trade off against robustness to attacks.
Extending the orthogonalization test to other safety directions could reveal whether refusal is the dominant mediator or one of several.

Load-bearing premise

The measured difference in claiming rates between prefilled and natural outputs reflects genuine introspective capability rather than sensitivity to surface features of the probe phrasing or refusal-related tokens.

What would settle it

Run the same prefill and natural response pairs through probes that avoid all refusal-related tokens and check whether the claiming-rate gap between prefilled and natural cases disappears.

Figures

Figures reproduced from arXiv: 2606.23671 by Quang Minh Nguyen, Taegyoon Kim, Uzair Ahmed.

**Figure 1.** Figure 1: Introspective probing on Qwen3-14B with a SocialHarmBench prompt. The model continues the adversarial prefill with a harmful design and then claims that response as its intended reply. The average claim rate on prefilled responses across the ten models we evaluate is 27.3% on the behavior-change subset. work suggests that LLMs exhibit some of this introspective capability on benign tasks (Kadavath et al.,… view at source ↗

**Figure 2.** Figure 2: Claim rate (%) at control (filled) and prefill (hollow) for all ten models on the behavior-change subset. Values to the left/right of the dumbbell are the prefill and control claim rates. Rows are sorted by recognition gap. ten models, together with the recognition gap between them, pooled over the four safety benchmarks on the behavior-change subset. The mean prefill claim rate is 27.3%, which means ev… view at source ↗

**Figure 3.** Figure 3: Per-prompt 2 × 2 contingency tables of the prefilled response’s safety against the model’s intention-probe answer on that same response, for the ten base models on the canonical 1,085-prompt set. Rows: the Llama Guard 3 (1B) label on the prefilled initial response (unsafe = the attack succeeded, safe = the response was kept safe). Columns: the model’s claim/reject answer. The (unsafe, claim) cell counts ou… view at source ↗

**Figure 4.** Figure 4: Recognition gap ∆ = P(claim | control) − P(claim | prefill) on the intention probe (%), before and after refusal-direction ablation, for the five ablated models. Each model’s baseline gap (dark dot) collapses to near zero (orange dot) under ablation. On Llama3.1-8B, Qwen3-8B, Qwen3-14B, and Gemma-3-27B the closure is an upward convergence of the prefill rate toward control; on Gemma-3-12B both rates fall … view at source ↗

**Figure 5.** Figure 5: Distribution of rejection reasons across the eight taxonomy categories, per model. Each row is labeled [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗

**Figure 6.** Figure 6: Attack success rate before and after refusal [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗

**Figure 7.** Figure 7: Per-prompt 2 × 2 contingency tables of labels from the intention and tampering probes under the prefill condition, for the ten base models. Rows: intention probe (claim/reject). Columns: tampering probe (claim/reject). Each cell reports the count and the within-model percentage; n is the number of prompts on which both probes were run. swer on the prefilled response, paralleling [PITH_FULL_IMAGE:figures/f… view at source ↗

**Figure 8.** Figure 8: Per-prompt 2 × 2 contingency tables of unprefilled (control) refusal versus the model’s answer on the intention probe, for the ten base models. Rows: the Llama Guard 3 (1B) label on the control initial response (refuse = safe, comply = unsafe). Columns: the model’s claim/reject answer on the prefilled response. Each cell reports the count and the within-model percentage; n is the number of prompts on which… view at source ↗

**Figure 9.** Figure 9: Per-prompt 2 × 2 contingency tables of unprefilled refusal versus the model’s answer on the tampering probe, for the ten base models. Rows and column conventions match [PITH_FULL_IMAGE:figures/full_fig_p028_9.png] view at source ↗

read the original abstract

Prior work shows that large language models (LLMs) exhibit introspective capability on benign tasks. We extend the question to safety contexts and examine how reliably a model can recognize that its own prior response was elicited by an adversarial prefill attack. Across ten open-weight instruction-tuned LLMs (3B to 70B) and four safety benchmarks, no model reliably recognizes its own compromised outputs, with models claiming intent on prefilled responses at an average rate of $27.3\%$. Introspective signal stems largely from safety- and refusal-related reasoning. Orthogonalizing models' weights against the refusal direction collapses the gap between claiming rates on prefilled and natural outputs to near zero, though the direction is not its unique mediator. The signal is also probe-dependent: framing the question as internal intention versus external tampering elicits qualitatively different responses on the same models. Training models to mimic correct introspective answers or pursue an introspective objective can improve the accuracy of introspection, but such training does not transfer to the tampering probe and counterintuitively raises attack success rate under adversarial prefill on most models, amounting to a partial mitigation. These findings outline mechanisms underpinning the observed introspective signals in safety contexts and highlight risks in the reliability of LLM self-reports.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Models claim intent on prefilled adversarial outputs at only 27% on average, but the results tie the signal to refusal circuits and probe wording more than to reliable introspection.

read the letter

The main thing to know is that across ten models the paper measures low rates of self-reporting on adversarially prefilled responses, averaging 27.3 percent claiming intent, and shows that this signal largely disappears when weights are orthogonalized against the refusal direction.

What the work actually adds is a set of controlled comparisons: the gap between prefilled and natural outputs, its collapse under refusal orthogonalization, the qualitative difference between intention and tampering probe framings, and the non-transfer plus counterintuitive attack-success increase after three kinds of LoRA finetuning. Running this on models from 3B to 70B and four benchmarks gives a reasonable empirical footprint, and the finetuning section is concrete enough to replicate.

The soft spot is the leap from these measurements to claims about introspective reliability. The abstract already states that the signal comes mostly from safety and refusal reasoning, that orthogonalization removes the gap, and that probe framing changes the answers. Those three facts together make it plausible that the 27 percent figure reflects sensitivity to refusal tokens or the exact question template rather than a direct readout of whether the model knows its output was compromised. The reader's stress-test concern lands.

This is for people working on safety monitoring and self-report mechanisms. Anyone testing whether models can flag their own manipulation will find the patterns useful. The experiments are straightforward and the interventions are falsifiable, so it deserves a serious referee even if the interpretation needs tightening in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript empirically investigates the reliability of LLMs' self-reports on whether their outputs were produced under adversarial prefill attacks. Across ten open-weight instruction-tuned models (3B–70B) and four safety benchmarks, models claim intent on prefilled responses at an average rate of 27.3%. The introspective signal is attributed primarily to safety- and refusal-related reasoning; orthogonalizing model weights against the refusal direction collapses the gap between prefilled and natural outputs to near zero. Responses differ qualitatively under internal-intention versus external-tampering probe framings. Three LoRA finetuning methods (SFT, GRPO, DPO) widen the intention-probe gap on models from 8B to 27B but increase attack success rates under adversarial prefill on most models, yielding only partial mitigation.

Significance. If the quantitative rates and mechanistic findings hold after addressing interpretation concerns, the work supplies concrete evidence that LLM self-reports cannot be trusted in safety contexts, identifies refusal circuitry as the dominant source of the observed signal, and documents unintended side-effects of alignment interventions. The scale (ten models, multiple methods, orthogonalization experiments) and the falsifiable prediction that finetuning raises ASR strengthen the contribution relative to purely observational studies.

major comments (2)

[Abstract and main results] Abstract and results on claiming rates: the central claim that 'no model reliably recognizes its own compromised outputs' (27.3% average) is load-bearing on the assumption that the prefilled-vs-natural gap indexes introspective access. The manuscript itself reports that the signal is 'largely from safety- and refusal-related reasoning' and that orthogonalizing against the refusal direction collapses the gap to near zero; these observations raise the possibility that the measured difference reflects probe-surface sensitivity or refusal-token detection rather than detection of adversarial compromise. Additional controls (e.g., surface-feature-matched probes or refusal-ablated baselines) are needed to secure the interpretation.
[Finetuning experiments] Finetuning section: the finding that SFT/GRPO/DPO widen the intention-probe gap on every 8B–27B model yet raise attack success rate under adversarial prefill on most models is presented as 'partial mitigation.' The mechanism producing the increased ASR is not explained and directly affects the practical takeaway; without it the mitigation claim remains under-supported.

minor comments (2)

[Methods] Methods: state the precise statistical tests, multiple-comparison corrections, and data-exclusion rules used for the 27.3% aggregate and per-model comparisons.
[Abstract] Abstract: list the four safety benchmarks and the exact model sizes tested.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below, indicating planned revisions where appropriate.

read point-by-point responses

Referee: [Abstract and main results] Abstract and results on claiming rates: the central claim that 'no model reliably recognizes its own compromised outputs' (27.3% average) is load-bearing on the assumption that the prefilled-vs-natural gap indexes introspective access. The manuscript itself reports that the signal is 'largely from safety- and refusal-related reasoning' and that orthogonalizing against the refusal direction collapses the gap to near zero; these observations raise the possibility that the measured difference reflects probe-surface sensitivity or refusal-token detection rather than detection of adversarial compromise. Additional controls (e.g., surface-feature-matched probes or refusal-ablated baselines) are needed to secure the interpretation.

Authors: We agree that careful interpretation is required. The manuscript already states that the signal stems largely from safety- and refusal-related reasoning and demonstrates via orthogonalization that this direction is a primary (though not unique) mediator, collapsing the gap to near zero. Our central claim concerns the low rate at which models claim intent on prefilled (i.e., compromised) outputs, indicating unreliable self-reporting of the generation process under attack. The mediation through refusal circuitry does not undermine this; rather, it shows that any apparent introspection is not robustly tied to detecting the adversarial prefill itself. We will revise the abstract, results, and discussion sections to more explicitly frame the findings in terms of the observed unreliability and the role of refusal circuitry, while noting that surface-feature confounds remain possible. Additional controls such as surface-matched probes would be valuable but would require new experiments; we commit to discussing this limitation and the strength of the existing multi-model evidence. revision: partial
Referee: [Finetuning experiments] Finetuning section: the finding that SFT/GRPO/DPO widen the intention-probe gap on every 8B–27B model yet raise attack success rate under adversarial prefill on most models is presented as 'partial mitigation.' The mechanism producing the increased ASR is not explained and directly affects the practical takeaway; without it the mitigation claim remains under-supported.

Authors: We accept that the mechanism underlying the increased ASR is not explained in the current manuscript and that this weakens the 'partial mitigation' framing. Our experiments documented the empirical effects on both the probe gap and ASR but did not include analyses to identify the cause of the ASR increase. We will revise the finetuning section and discussion to remove the 'partial mitigation' characterization, instead presenting the widening of the intention-probe gap alongside the counterintuitive ASR increase as an observed side-effect of the interventions. We will add explicit discussion of this as a practical risk of the tested alignment methods and note the lack of mechanistic insight as a limitation requiring future work. revision: yes

Circularity Check

0 steps flagged

Purely empirical measurements; no derivations or self-referential reductions

full rationale

The paper consists entirely of experimental measurements across models and benchmarks: claiming rates on prefilled vs. natural outputs (avg. 27.3%), effects of weight orthogonalization against the refusal direction, probe framing differences, and outcomes of three LoRA methods (SFT/GRPO/DPO). No equations, fitted parameters renamed as predictions, ansatzes, or uniqueness theorems appear. Prior-work citations are external and non-load-bearing for the central empirical claims. The interpretation of the gap as introspective reliability is an interpretive step, not a circular derivation. This matches the default non-circular case for measurement-only papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work is empirical and rests on standard assumptions about benchmark validity and probe interpretability rather than new axioms or invented entities.

axioms (2)

domain assumption Safety benchmarks used are representative of real-world adversarial prefill attacks.
Invoked when generalizing from the four benchmarks to the broader claim about reliability.
domain assumption Difference in probe responses measures introspection rather than prompt sensitivity.
Central to interpreting the 27.3% rate and the orthogonalization result as introspective signal.

pith-pipeline@v0.9.1-grok · 5789 in / 1233 out tokens · 21568 ms · 2026-06-26T08:15:08.653381+00:00 · methodology

Can LLMs Reliably Self-Report Adversarial Prefills, and How?

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)