Before the Last Token: Diagnosing Final-Token Safety Probe Failures
Pith reviewed 2026-05-14 21:33 UTC · model grok-4.3
The pith
Final-token safety probes miss jailbreak evidence spread across earlier tokens, but a PCA-HMM model on prefill trajectories recovers many such cases without high false positives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that probe-visible unsafe evidence in jailbreak prompts frequently surfaces in intermediate user-token hidden states but remains invisible at the final-token readout used by standard safety probes. Subspace analyses suggest that missed jailbreaks diverge from clean benign prompts along directions outside the probe's representational subspace, and increasing bottleneck width does not reliably close the mismatch. Token-level inspection shows the evidence is present earlier in the sequence, and a PCA-HMM trajectory model trained on the same clean split recovers many of these misses from user-content prefill trajectories without the catastrophic false-positive rate of max-pooling.
What carries the argument
The PCA-HMM trajectory model, which decomposes prefill hidden-state sequences into principal components and hidden Markov states to track distributional shifts across token positions.
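As a concrete illustration, the following minimal sketch shows how such a model can score prefill trajectories and why distribution-shifted ones stand out. Everything here is assumed for illustration, not taken from the paper: synthetic trajectories, 3 principal components, 2 HMM states, a crude k-means emission fit in place of Baum-Welch, and uniform start/transition probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_traj(n_tok, shift=0.0, d=20):
    """Synthetic prefill trajectory: each token's hidden state comes from
    one of two latent states; `shift` mimics a distribution-shifted prompt."""
    means = np.zeros((2, d))
    means[1, 0] = 3.0                       # state 1 offset along dimension 0
    states = rng.integers(0, 2, n_tok)
    x = means[states] + rng.normal(size=(n_tok, d))
    x[:, 0] += shift
    return x

clean = [make_traj(int(rng.integers(10, 30))) for _ in range(50)]

# PCA on pooled clean hidden states (top 3 components, an assumed choice)
X = np.vstack(clean)
mu = X.mean(0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
P = Vt[:3].T                                # d x 3 projection matrix
proj = lambda x: (x - mu) @ P

# Crude 2-state Gaussian emission fit: k-means on projected clean tokens
Z = proj(X)
C = np.stack([Z[Z[:, 0].argmin()], Z[Z[:, 0].argmax()]])  # far-apart init
for _ in range(20):
    lab = np.argmin(((Z[:, None] - C[None]) ** 2).sum(-1), axis=1)
    C = np.stack([Z[lab == k].mean(0) for k in range(2)])
var = np.stack([Z[lab == k].var(0) + 1e-3 for k in range(2)])

def forward_loglik(z, K=2):
    """HMM forward algorithm with diagonal-Gaussian emissions and uniform
    start/transition probabilities (a deliberate simplification)."""
    logB = -0.5 * (((z[:, None] - C) ** 2 / var).sum(-1)
                   + np.log(2 * np.pi * var).sum(-1))     # T x K emission log-probs
    A = np.full((K, K), 1.0 / K)
    alpha = np.log(1.0 / K) + logB[0]
    for t in range(1, len(z)):
        m = alpha.max()
        alpha = np.log(np.exp(alpha - m) @ A) + m + logB[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def per_token_score(x):
    return forward_loglik(proj(x)) / len(x)  # length-normalized log-likelihood

clean_score = np.mean([per_token_score(make_traj(20)) for _ in range(20)])
shifted_score = np.mean([per_token_score(make_traj(20, shift=8.0)) for _ in range(20)])
print(f"clean {clean_score:.1f}  shifted {shifted_score:.1f}")
```

Flagging trajectories whose length-normalized log-likelihood falls below a threshold calibrated on the clean split is one way such a model could surface prompts the final-token readout misses.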
If this is right
- Safety probes must incorporate information from earlier token positions rather than relying solely on the final hidden state.
- Models trained only on clean harmful and benign examples can still surface distributed unsafe patterns in prefill sequences.
- Naive pooling across token positions produces unacceptable false positives on benign but safety-adjacent prompts.
- Trajectory-aware methods serve as practical diagnostic complements to existing final-token probes.
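The readout trade-off in the points above can be made concrete with a toy sketch. The probe direction, trajectories, and threshold are all invented for illustration: a final-token readout overlooks unsafe evidence confined to early tokens, while max-pooling catches it but also fires on a benign prompt with a single safety-adjacent token.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
w = rng.normal(size=d)
w /= np.linalg.norm(w)                      # toy linear probe direction

def probe_scores(traj):
    return traj @ w                         # one probe score per token position

benign = rng.normal(scale=0.3, size=(12, d))
# jailbreak: unsafe evidence at tokens 3-5, washed out by the final token
jailbreak = rng.normal(scale=0.3, size=(12, d))
jailbreak[3:6] += 2.0 * w
# safety-adjacent benign: one spike (e.g. innocently quoting a harmful phrase)
adjacent = rng.normal(scale=0.3, size=(12, d))
adjacent[7] += 2.0 * w

thr = 1.0
for name, traj in [("benign", benign), ("jailbreak", jailbreak), ("adjacent", adjacent)]:
    s = probe_scores(traj)
    print(f"{name:9s} final-token fires={s[-1] > thr}  max-pool fires={s.max() > thr}")
```

The jailbreak escapes the final-token readout, max-pooling recovers it, and the same max-pooling overfires on the safety-adjacent prompt, which is exactly the failure mode the trajectory model is meant to avoid.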
Where Pith is reading between the lines
- Safety evaluation protocols may need to log and analyze full hidden-state sequences during prefill rather than discarding intermediate states.
- The same trajectory approach could be tested on other detection tasks where evidence accumulates gradually across a prompt.
- If the PCA-HMM generalizes, it offers a low-cost way to retrofit existing probes without retraining the underlying LLM.
Load-bearing premise
That subspace directions identified on clean data correctly flag the mismatches in jailbreak trajectories and that the PCA-HMM generalizes to recover those cases without creating new failure modes on safe prompts.
What would settle it
Running the PCA-HMM on a fresh collection of jailbreak prompts and safety-adjacent benign prompts would directly test the recovery claim: measure whether it recovers at least half the final-token misses while keeping its false-positive rate below that of max-pooling.
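That test reduces to two numbers per detector. A minimal sketch of the bookkeeping, where the flag vectors are made up and `recovery_and_fpr` is a hypothetical helper, not anything from the paper:

```python
def recovery_and_fpr(final_flags, traj_flags, labels):
    """final_flags/traj_flags: 0/1 detector outputs per prompt;
    labels: 1 = jailbreak, 0 = benign."""
    misses = [i for i, y in enumerate(labels) if y == 1 and not final_flags[i]]
    recovered = sum(traj_flags[i] for i in misses)
    recovery_rate = recovered / len(misses) if misses else 0.0
    benign = [i for i, y in enumerate(labels) if y == 0]
    fpr = sum(traj_flags[i] for i in benign) / len(benign) if benign else 0.0
    return recovery_rate, fpr

# toy check with made-up flags
labels      = [1, 1, 1, 1, 0, 0, 0, 0]
final_flags = [1, 0, 0, 0, 0, 0, 0, 0]  # final-token probe misses 3 of 4 jailbreaks
traj_flags  = [1, 1, 1, 0, 0, 1, 0, 0]  # trajectory model recovers 2, one benign FP
rate, fpr = recovery_and_fpr(final_flags, traj_flags, labels)
print(rate, fpr)  # recovery 2/3, FPR 1/4
```

The "at least half" criterion is then just `rate >= 0.5` together with a comparison of `fpr` against the max-pooling baseline on the same benign set.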
read the original abstract
Final-token safety probes monitor a single hidden state after prompt prefill, but jailbreak prompts can contain probe-visible unsafe evidence distributed across earlier user-token representations that is missed by this readout. We study this prefill-time failure mode using SafeSwitch-style probes trained only on clean harmful and benign prompts across three instruction-tuned LLMs. The probes achieve high recall on clean harmful prompts, but miss many jailbreaks and can produce false positives on safety-adjacent benign prompts. Subspace analyses suggest that missed jailbreaks differ from clean benign prompts along directions that are poorly captured by the probe's representational subspace, and increasing probe bottleneck width does not reliably resolve this mismatch. Token-level prefill analyses reveal that probe-visible unsafe evidence often appears earlier in the sequence but is not exposed at the final-token readout, while naive max-pooling over token positions overfires on safe prompts. A simple PCA-HMM trajectory model, trained only on the same clean split, recovers many final-token misses from user-content prefill trajectories without the catastrophic false-positive behavior of naive token pooling, motivating trajectory-aware hidden-state analyses as diagnostic complements to final-token probes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that final-token safety probes on three instruction-tuned LLMs achieve high recall on clean harmful prompts but miss many jailbreaks because probe-visible unsafe evidence appears earlier in user-token prefill trajectories; subspace analyses show these misses lie outside the probe's representational subspace, and a PCA-HMM trajectory model trained exclusively on the clean split recovers many such misses without the false-positive overfiring of naive token pooling.
Significance. If the PCA-HMM recovery is shown to align with missed unsafe directions rather than length or distributional artifacts, the work would usefully motivate trajectory-aware diagnostics as complements to final-token probes, addressing a concrete failure mode in current safety monitoring.
major comments (2)
- [Abstract] The central claim that the PCA-HMM 'recovers many final-token misses' is stated without quantitative metrics (recovery rates, false-positive rates, baseline comparisons, or error bars), so the empirical support for the claim cannot be assessed.
- [Abstract; § on PCA-HMM] The model is trained only on the clean split yet applied to jailbreak trajectories, but no validation is reported (e.g., cosine similarity between HMM emission vectors and the probe-missed subspace, or controls for prompt-length artifacts) to rule out incidental recovery due to distributional shift.
minor comments (2)
- The abstract and methods should report the exact three LLMs, probe bottleneck widths, PCA dimensionality, HMM state count, and training splits with full experimental details.
- Token-level prefill analyses would benefit from explicit figures showing per-position probe activations for representative jailbreak vs. clean examples.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract and the PCA-HMM validation. We have revised the manuscript to incorporate quantitative metrics and additional controls, strengthening the empirical presentation of our claims.
read point-by-point responses
-
Referee: [Abstract] The central claim that the PCA-HMM 'recovers many final-token misses' is stated without quantitative metrics (recovery rates, false-positive rates, baseline comparisons, or error bars), so the empirical support for the claim cannot be assessed.
Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised version we have updated the abstract to report recovery rates (the fraction of final-token misses recovered by the PCA-HMM), false-positive rates on benign prompts, direct comparisons against the naive max-pooling baseline, and error bars obtained from repeated training runs. These additions make the central empirical claim directly assessable while preserving the original narrative. revision: yes
-
Referee: [Abstract; § on PCA-HMM] The model is trained only on the clean split yet applied to jailbreak trajectories, but no validation is reported (e.g., cosine similarity between HMM emission vectors and the probe-missed subspace, or controls for prompt-length artifacts) to rule out incidental recovery due to distributional shift.
Authors: We acknowledge that explicit validation against distributional artifacts strengthens the interpretation. The revised manuscript now includes (i) cosine-similarity measurements between the learned HMM emission vectors and the linear directions separating probe-missed jailbreaks from clean benign prompts, and (ii) length-matched subset experiments that confirm the PCA-HMM continues to recover misses without elevated false positives. These controls indicate that recovery tracks the missed unsafe subspace rather than prompt-length or split-shift artifacts. revision: yes
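The cosine-similarity control in (i) is a simple geometric check. In the sketch below the vectors are synthetic stand-ins for the learned HMM emission means and the miss-vs-benign separating direction; nothing here comes from the paper's data:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(2)
d = 32
# stand-in "probe-missed" direction: the mean difference between missed
# jailbreaks and clean benign prompts in hidden-state space
sep = rng.normal(size=d)
sep /= np.linalg.norm(sep)
# stand-in HMM emission-mean offsets: one unrelated, one mostly aligned
emissions = [rng.normal(size=d), 0.9 * sep + 0.1 * rng.normal(size=d)]
sims = [abs(cosine(e, sep)) for e in emissions]
print(sims)
```

A markedly higher |cosine| for some emission component would support the claim that recovery tracks the missed unsafe subspace rather than length or split-shift artifacts.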
Circularity Check
No significant circularity in PCA-HMM generalization from clean training to jailbreak trajectories
full rationale
The paper trains the PCA-HMM exclusively on the clean harmful/benign split and applies it to separate jailbreak cases to recover final-token misses. No equations or steps reduce the reported recoveries to the training inputs by construction, nor does any load-bearing claim rely on self-citation chains, uniqueness theorems, or ansatzes imported from prior work. Subspace analyses are used only to motivate the mismatch; the central result is a standard held-out evaluation that remains externally falsifiable. This is a normal non-circular finding.
Axiom & Free-Parameter Ledger
free parameters (2)
- PCA dimensionality
- HMM state count
axioms (1)
- Domain assumption: probes trained only on clean harmful and benign prompts produce subspaces that generalize to jailbreak detection
Reference graph
Works this paper leans on
- [1] Refusal in Language Models Is Mediated by a Single Direction. URL https://arxiv.org/abs/2406.11717
- [2] Damirchi, H., la Jara, I. M. D., Abbasnejad, E., Shamsi, A., Zhang, Z., and Shi, J. Truth as a trajectory: What internal representations reveal about large language model reasoning. URL https://arxiv.org/abs/2502.01042
- [3] Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7B. URL https://arxiv.org/abs/2310.06825
- [4] Lin, Z., Yang, J., Qiu, Y., Guo, H., Bao, Y., and Guan, Y. N-glare: A non-generative latent representation-efficient LLM safety evaluator. URL https://arxiv.org/abs/2511.14195
- [5] Liu, C., Liu, X., Li, X., Xin, B., and Ding, K. TrajGuard: Streaming hidden-state trajectory detection for decoding-time jailbreak defense. URL https://arxiv.org/abs/2604.07727
- [6] Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., and Hendrycks, D. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. URL https://arxiv.org/abs/2402.04249
- [7] Meta. meta-llama/llama-3.1-8b-instruct. URL https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. Accessed: 2025-02-21.
- [8] Olmo, T., Ettinger, A., Bertsch, A., Kuehl, B., Graham, D., Heineman, D., Groeneveld, D., Brahman, F., Timbers, F., Ivison, H., Morrison, J., Poznanski, J., Lo, K., Soldaini, L., Jordan, M., Chen, M., Noukhovitch, M., Lambert, N., Walsh, P., Dasigi, P., Berry, R., Malik, S... URL https://arxiv.org/abs/2512.13961
- [9] Pan, W., Liu, Z., Chen, Q., Zhou, X., Yu, H., and Jia, X. The hidden dimensions of LLM alignment: A multi-dimensional analysis of orthogonal safety directions. URL https://arxiv.org/abs/2502.09674
- [10] Röttger, P., Kirk, H. R., Vidgen, B., Attanasio, G., Bianchi, F., and Hovy, D. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. URL https://arxiv.org/abs/2308.01263
- [11] Shah, M., Angeline, S., Kumar, A. R., Chheda, N., Zhu, K., Sharma, V., O'Brien, S., and Cai, W. The geometry of harmfulness in LLMs through subconcept probing. URL https://arxiv.org/abs/2507.21141
- [12] Wollschläger, T., Elstner, J., Geisler, S., Cohen-Addad, V., Günnemann, S., and Gasteiger, J. The geometry of refusal in large language models: Concept cones and representational independence. URL https://arxiv.org/abs/2502.17420
- [13] Xie, T., Qi, X., Zeng, Y., Huang, Y., Sehwag, U. M., Huang, K., He, L., Wei, B., Li, D., Sheng, Y., Jia, R., Li, B., Li, K., Chen, D., Henderson, P., and Mittal, P. Sorry-Bench: Systematically evaluating large language model safety refusal. URL https://arxiv.org/abs/2406.14598
- [14] Zhao, J., Huang, J., Wu, Z., Bau, D., and Shi, W. LLMs encode harmfulness and refusal separately. URL https://arxiv.org/abs/2507.11878
- [15] Zhou, Z., Yu, H., Zhang, X., Xu, R., Huang, F., and Li, Y. How alignment and jailbreak work: Explain LLM safety through intermediate hidden states. URL https://arxiv.org/abs/2406.05644
- [16] Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. URL https://arxiv.org/abs/2307.15043

Note (from the paper's Appendix A, Bottleneck Width Sweep): Table 4 reports final-token probe jailbreak detection rate (%) as a function of bottleneck width. Wider readouts do not reliably improve detection: Llama gains modestly, Mistral is roughly flat, and OLMo3 degrades.
discussion (0)