Closing the Activation-Cone Blind Spot: Response-Time Probing and Unified Defense

Subhadip Mitra

arXiv: 2606.29441 · v1 · pith:H3N5WVVLnew · submitted 2026-06-28 · cs.CR · cs.AI · cs.CL · cs.ET · cs.LG

Pith reviewed 2026-06-30 07:19 UTC · model grok-4.3

classification: cs.CR · cs.AI · cs.CL · cs.ET · cs.LG

keywords: inference-time defenses · prefilling attacks · activation alignment · linear probe · LLM safety · response-time probing · null-space steering

The pith

Prompt-time defenses that gate on activation alignment with a benign reference are blind to prefilling attacks that place malicious activations inside the reference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates five inference-time defense paradigms across seven models and five attack types and finds that prompt-time activation checks fail against prefilling attacks. It proves that any defense relying on alignment of a single layer's activations with a benign cone, subspace, or null-space cannot detect attacks crafted to lie inside that reference. The authors introduce response-time probing, a linear probe on hidden states at the first generated tokens, which reaches AUROC 0.97-1.00 and, when paired with a halt, reduces prefilling attack success to zero with zero false positives. When combined with null-space steering the two methods cover orthogonal attack classes and reach defense success rates above 0.98 on the tested models.

Core claim

Any defense that gates intervention on a single layer's activation alignment with a benign reference (cone, subspace, or null-space) is blind to attacks that craft activations to lie inside that reference, whether checked at prompt time or per token. Response-time probing applies a linear probe to the hidden state at the first generated tokens and, combined with a halt, cuts prefilling attack success to zero on every evaluated model while preserving zero benign false positives.
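
The blind spot can be illustrated with a toy gate. The sketch below is not the paper's proof or AlphaSteer's implementation; it assumes a hypothetical 4-dimensional activation space, a benign reference given as an orthonormal basis, and a gate that fires when the activation's component outside that reference exceeds a threshold `tau`. All names and values are invented for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual_norm(h, basis):
    """Norm of the part of h outside span(basis); basis is assumed orthonormal."""
    proj = [0.0] * len(h)
    for b in basis:
        c = dot(h, b)
        proj = [p + c * bi for p, bi in zip(proj, b)]
    return math.sqrt(sum((hi - pi) ** 2 for hi, pi in zip(h, proj)))

def gate_fires(h, basis, tau=0.1):
    """Hypothetical alignment gate: intervene only if h leaves the benign reference."""
    return residual_norm(h, basis) > tau

# Assumed benign reference: the plane spanned by the first two axes.
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

benign = [0.8, 0.3, 0.0, 0.0]               # ordinary benign activation
crude_attack = [0.5, 0.2, 0.9, 0.0]         # leaves the reference, so the gate fires
in_reference_attack = [-2.0, 1.5, 0.0, 0.0] # crafted to stay inside the reference

for name, h in [("benign", benign),
                ("crude attack", crude_attack),
                ("in-reference attack", in_reference_attack)]:
    print(name, gate_fires(h, basis))
```

Under these assumptions the gate sees only the residual, so any activation placed inside the reference is indistinguishable from a benign one, whatever it encodes along the reference directions.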

What carries the argument

Response-time probing: a linear probe on the model's hidden state at the first generated tokens that detects attacks whose activations fit inside prompt-time benign references.
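
A minimal sketch of a probe-plus-halt loop, assuming a logistic-regression probe and synthetic 4-dimensional vectors standing in for hidden states at the first generated tokens. The training procedure, layer choice, pooling, and threshold used in the paper are not given on this page, so `train_probe`, `should_halt`, the cluster centers, and the 0.5 threshold are all hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(X, y, lr=0.5, epochs=200):
    """Fit logistic-regression weights (w, b) by full-batch gradient descent."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def should_halt(h, w, b, threshold=0.5):
    """Halt generation when the probe score on hidden state h exceeds threshold."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) > threshold

# Synthetic stand-ins for response-time hidden states (not real model data).
rng = random.Random(0)

def sample(center, n):
    return [[c + rng.gauss(0.0, 0.3) for c in center] for _ in range(n)]

attack_states = sample([2.0, 0.0, 0.0, 0.0], 100)
benign_states = sample([-2.0, 0.0, 0.0, 0.0], 100)
w, b = train_probe(attack_states + benign_states, [1] * 100 + [0] * 100)

halted = sum(should_halt(h, w, b) for h in attack_states)
false_pos = sum(should_halt(h, w, b) for h in benign_states)
print(f"halted {halted}/100 attack states, {false_pos}/100 benign false positives")
```

The point of the sketch is the placement of the check: the probe reads the state after generation has begun, so it does not depend on whether the prompt-time activation sat inside a benign reference.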

If this is right

AlphaSteer reaches 0% attack success on GCG, AutoDAN, and intent laundering but only 50% on prefilling.
Composing the response-time halt with AlphaSteer null-space steering yields defense success of 0.983 on Mistral and 0.994 on Llama.
Diverse negative training sets reduce probe false positives from 80-100% to near zero.
MMLU does not capture steering's true utility cost, which appears as behavioral hedging rather than factual loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The blind-spot proof may apply to any defense whose decision depends on a fixed reference alignment at generation time.
Training the probe on a wider range of prefilling templates could improve cross-template robustness beyond the scoped claim.
Similar reference-alignment blind spots could appear in defenses that average or ensemble multiple layers rather than using a single layer.

Load-bearing premise

The linear probe trained on diverse negative sets will maintain low false positives and high AUROC when the input distribution shifts beyond the canonical prefilling-template family used in evaluation.

What would settle it

Running the trained probe on a new family of prefilling templates outside the canonical evaluation set and checking whether AUROC remains above 0.97 would test the claimed generalization.
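
The check described above could be scripted roughly as follows. This is a sketch under stated assumptions: `auroc` uses the standard pairwise-ranking definition, the 0.97 floor is taken from the reported range, and the score lists are invented placeholders rather than results from the paper.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def generalizes(pos_scores, neg_scores, floor=0.97):
    """True if AUROC on the held-out family stays at or above the floor."""
    return auroc(pos_scores, neg_scores) >= floor

# Hypothetical probe scores on a new prefilling-template family (illustrative only).
held_out_attack = [0.9, 0.8, 0.4]
held_out_benign = [0.1, 0.3, 0.5]

print(auroc(held_out_attack, held_out_benign))
print(generalizes(held_out_attack, held_out_benign))
```

With real probe scores in place of the placeholders, a value below the floor on a new template family would count against the generalization premise.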

Figures

Figures reproduced from arXiv: 2606.29441 by Subhadip Mitra.

**Figure 2.** Dual-probe gated system. Stage 1: prompt-time probe-gated steering handles …

**Figure 3.** PCA of response activations (mean-pooled over …

**Figure 4.** Response-probe AUROC per layer on Mistral and Llama (5-fold CV, error bars are 1 …

**Figure 5.** Safety–utility Pareto frontier on Mistral-7B and Llama-3.1-8B. Utility is …

**Figure 6.** MMLU accuracy is insensitive to activation steering: all five paradigms achieve 51.6% on …

**Figure 7.** Diverse negatives eliminate catastrophic OOD FPR on Gemma and Qwen.

**Figure 8.** Response probe AUROC across all seven models (7–31B). All models achieve AUROC …

**Figure 9.** Absolute cosine similarity between response probe weights and refusal steering vector …

**Figure 10.** DSR across paradigms on Llama-3.1-8B at n = 200. Dual-probe achieves 0.990, dominating AlphaSteer (0.698), which amplifies prefilling ASR from 64.5% to 75%.

**Figure 11.** Template robustness of the augmented response probe on 10 held-out novel templates.

**Figure 12.** ∥P̂h∥ at layer 14 with vs. without prefill, n = 1000 pairs/model. Llama (right): 89.5% of pairs below y = x, i.e. prefilling drives activations toward AlphaSteer's null-space, supporting mechanism (i). Mistral (left): 92.1% above y = x, i.e. prefilling drives activations outside the null-space, falsifying (i); residual is mechanism (ii). The Corollary's no-go applies in both cases: each model's gate misreads the pr…

Original abstract

Inference-time safety methods for large language models have proliferated, yet no systematic comparison exists. We evaluate five defense paradigms (no defense, static steering, CAST, AlphaSteer, probe-gated) across seven instruction-tuned models (7-31B) and five attack types (GCG, AutoDAN, DeepInception, prefilling, intent laundering). Our central finding: prompt-time activation defenses are structurally blind to prefilling attacks. AlphaSteer achieves 0% attack success on GCG, AutoDAN, and intent laundering but 50% on prefilling. We prove a corollary: any defense that gates intervention on a single layer's activation alignment with a benign reference (cone, subspace, or null-space) is blind to attacks that craft activations to lie inside that reference, whether checked at prompt time or per token. As its constructive contrapositive we introduce response-time probing: a linear probe on the model's hidden state at the first generated tokens, with AUROC 0.97-1.00 across all seven models. Combined with a halt, it cuts prefilling attack success to 0/40 on every model with 0% benign false positives, outperforming Llama Guard 3. Cross-template generalisation depends on probe depth, so we scope the claim to the canonical prefilling-template family. Composing the response-halt with AlphaSteer's null-space steering gives an orthogonal split (the halt catches prefilling, AlphaSteer catches semantic attacks), reaching defense success 0.983 on Mistral and 0.994 on Llama and dominating both components. We further show MMLU fails to capture steering's true utility cost, which appears as behavioral hedging rather than factual loss, and that diverse negative training sets cut probe false positives from 80-100% to near zero. Code, attacks, per-sample results, and the judge prompt are released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper identifies a structural blind spot in single-layer activation defenses for prefilling attacks and shows response-time probing closes it on the tested templates and models.

The core contribution is the corollary that any defense gating on alignment with a single benign reference at one layer will miss attacks that keep activations inside that reference. The authors treat this as a structural fact rather than an empirical accident, then introduce response-time probing on the first generated tokens as the direct countermeasure. On their seven models and five attacks the probe reaches AUROC 0.97-1.00, the halt drives prefilling attack success to 0/40 with zero benign false positives, and the combination with AlphaSteer reaches 0.983-0.994 defense success.

The work is useful because it releases code, attacks, per-sample results, and the judge prompt, and because it notes that MMLU misses the actual cost of steering (behavioral hedging). The negative result on single-reference gating stands on its own and does not depend on the probe numbers.

The main limitation is that cross-template generalization is explicitly scoped to the canonical prefilling-template family used in training and evaluation; the authors do not claim it holds under broader distribution shift. The proof of the corollary is stated but not reproduced in the abstract, so a referee would need to check the full argument. No evidence of circularity or free-parameter fitting appears.

This is for researchers working on inference-time LLM safety who already know the activation-steering literature. It is worth sending to peer review because the structural claim is checkable and the empirical setup is concrete enough to replicate or extend.

Referee Report

3 major / 2 minor

Summary. The paper evaluates five inference-time defense paradigms (no defense, static steering, CAST, AlphaSteer, probe-gated) on seven models (7-31B) and five attacks (GCG, AutoDAN, DeepInception, prefilling, intent laundering). It claims prompt-time activation defenses are structurally blind to prefilling attacks, proves a corollary that any single-layer alignment defense (cone, subspace, null-space) is blind to in-reference attacks, introduces a response-time linear probe on the first generated tokens that achieves AUROC 0.97-1.00, and shows that a halt based on this probe reduces prefilling ASR to 0/40 with 0% benign false positives. Composing the halt with AlphaSteer splits coverage orthogonally (the halt catches prefilling, AlphaSteer catches semantic attacks) and yields defense success rates of 0.983 (Mistral) and 0.994 (Llama). The paper also reports that MMLU misses steering's hedging costs and that diverse negative sets reduce probe false positives to near zero. Code and per-sample results are released.

Significance. If the corollary is rigorously proven and the empirical results (including probe generalization within the scoped template family) hold under full methods disclosure, the work would be significant for LLM safety: it identifies a structural limitation of prompt-time activation methods, supplies a practical response-time complement that outperforms Llama Guard 3 in the reported setting, and demonstrates an orthogonal composition strategy. The public release of code, attacks, results, and judge prompt is a clear strength supporting reproducibility.

major comments (3)

[Abstract and main text] The manuscript states 'We prove a corollary' that any single-layer alignment defense is blind to in-reference attacks, but the proof itself is not presented; this structural claim is load-bearing for the central negative result and must be supplied with explicit reasoning.
[Experimental sections] Concrete performance figures (AUROC 0.97-1.00, 0/40 ASR, 0% FPs, defense success 0.983/0.994) are reported across seven models and five attacks, yet full methods, data splits, judge prompt details, and exact probe training procedure are absent, rendering the central empirical claims unverifiable.
[Response-time probing and unified defense sections] The manuscript explicitly scopes cross-template generalization to the canonical prefilling-template family and notes dependence on probe depth; because the reported unified defense success rates rely on the probe maintaining low FPs and high AUROC, the scope and any distribution-shift tests must be stated precisely.

minor comments (2)

[Utility cost discussion] The discussion that MMLU fails to capture behavioral hedging is valuable but would be strengthened by additional quantitative examples or alternative metrics.
[Throughout] Ensure consistent numbering and cross-referencing for all sections, tables, and figures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity, completeness, and reproducibility.

Referee: [Abstract and main text] The manuscript states 'We prove a corollary' that any single-layer alignment defense is blind to in-reference attacks, but the proof itself is not presented; this structural claim is load-bearing for the central negative result and must be supplied with explicit reasoning.

Authors: We agree that the explicit proof is required for the central claim. The revised manuscript will add a dedicated section (or appendix) presenting the full reasoning and formal proof of the corollary.
Revision: yes

Referee: [Experimental sections] Concrete performance figures (AUROC 0.97-1.00, 0/40 ASR, 0% FPs, defense success 0.983/0.994) are reported across seven models and five attacks, yet full methods, data splits, judge prompt details, and exact probe training procedure are absent, rendering the central empirical claims unverifiable.

Authors: We acknowledge that full methodological details are necessary for verifiability. The revision will expand the experimental sections to include complete methods, data splits, the judge prompt, and the exact probe training procedure.
Revision: yes

Referee: [Response-time probing and unified defense sections] The manuscript explicitly scopes cross-template generalization to the canonical prefilling-template family and notes dependence on probe depth; because the reported unified defense success rates rely on the probe maintaining low FPs and high AUROC, the scope and any distribution-shift tests must be stated precisely.

Authors: The manuscript already states the scoping to the canonical prefilling-template family and notes dependence on probe depth. We will revise these sections to articulate the scope and any distribution-shift tests with additional precision.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical results and logical corollary are self-contained

The paper's derivation consists of a structural logical corollary (any single-layer alignment defense is blind to in-reference activations) that follows directly from the definition of such defenses without data dependence or self-reference, plus empirical measurements of probe AUROC, attack success rates, and defense success on specific models/attacks. These metrics are reported from evaluation rather than derived by construction from fitted inputs or prior self-citations. The response-time probe is introduced as a contrapositive and its performance scoped explicitly to the evaluated template family, with no renaming of known results or ansatz smuggling. The unified defense composition is presented as an observed orthogonal split based on separate empirical behaviors. No load-bearing step reduces to its own inputs; the chain rests on external measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all claims appear to rest on empirical evaluation rather than formal derivation.

pith-pipeline@v0.9.1-grok · 5888 in / 1113 out tokens · 35024 ms · 2026-06-30T07:19:49.333915+00:00 · methodology

Reference graph

Works this paper leans on

15 extracted references · 12 canonical work pages · 6 internal anchors

[1] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. Llama Guard 3 (8B) introduced in this report (Section 5.4).

[2] Shahriar Golchin and Marc Wetter. Intent laundering: AI safety datasets are not what they seem. arXiv preprint arXiv:2602.16729.

[3] Xuanli He, Bilgehan Sel, Faizan Ali, Jenny Bao, Hoagy Cunningham, and Jerry Wei. Segment-level coherence for robust harmful intent probing in LLMs. arXiv preprint arXiv:2604.14865.

[4] Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Y. Rogov, Ivan Oseledets, and Elena Tutubalina. The rogue scalpel: Activation steering compromises LLM safety. arXiv preprint arXiv:2509.22067.

[5] János Kramár, Joshua Engels, Zheng Wang, Bilal Chughtai, Rohin Shah, Neel Nanda, and Arthur Conmy. Building production-ready probes for Gemini. arXiv preprint arXiv:2601.11516.

[6] Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. DeepInception: Hypnotize large language model to be jailbreaker. arXiv preprint arXiv:2311.03191.

[7] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451.

[8] Weikai Lu, Ziqian Zeng, Kehua Zhang, Haoran Li, Huiping Zhuang, Ruidong Wang, Cen Chen, and Hao Peng. ARGUS: Defending against multimodal indirect prompt injection via steering instruction-following behavior. arXiv preprint arXiv:2512.05745.

[9] Weiming Song, Xuan Xie, and Ruiping Yin. AISA: Awakening intrinsic safety awareness in large language models against jailbreak attacks. arXiv preprint arXiv:2602.13547.

[10] Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248.

[11] Yinzhi Zhao, Ming Wang, Shi Feng, Xiaocui Yang, Daling Wang, and Yifei Zhang. SafeProbing: Defending large language models against jailbreak attacks via in-decoding safety-awareness probing. arXiv preprint arXiv:2601.10543.

[12] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

[13] A GCG Optimization Details. GCG suffixes were optimized using nanogcg with 512 steps, search width 512, topk 256, target string “Sure, here is”. At n = 40: Mistral-7B 40/40 valid, average loss 0.053; Llama-3.1-8B 40/40 valid, average loss 1.926. At n = 200 (Appendix E): Mistral-7B 200/200 valid, average loss 0.053; Llama-3.1-8B 200/200 valid, average los...

[14] and E9 judge-truncation sensitivity. Table 8: Template robustness: augmented linear probe detection rate on 10 held-out novel prefill templates, trained with 600 harmful (40 AdvBench behaviors × 15 templates) + 45 benign (15 benign-prefilled + 30 benign-plain) = 645 training samples. Qwen3-8B (2025) achieves the best result. Judge uses a 5-token response w...

[15] improves only training-distribution detection, not canonical-register novel-token generalisation; see the probe-transfer note below. paper L @ 30-tok judge new L @ 30-tok judge Model L detect FPR L detect FPR Mistral-7B 15 66.7% 12.5% 24 88.9% 9.4% Llama-3.1-8B 15 100.0% 7.0% 15 100.0% 7.0% Gemma-2-9B 27 50.0% 4.5% 42 83.3% 2.3% Qwen-2.5-7B 4 14.3% 4.7% 19 71...
