Pith · machine review for the scientific record

arxiv: 2605.12726 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: no theorem link

Before the Last Token: Diagnosing Final-Token Safety Probe Failures


Pith reviewed 2026-05-14 21:33 UTC · model grok-4.3

classification 💻 cs.LG
keywords safety probes · jailbreak detection · LLM hidden states · prefill trajectories · final-token readout · PCA-HMM · token-level analysis

The pith

Final-token safety probes miss jailbreak evidence spread across earlier tokens, but a PCA-HMM model on prefill trajectories recovers many such cases without high false positives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates why safety probes that read only the final hidden state after a prompt is processed fail to detect many jailbreak attempts. These probes detect clean harmful content reliably yet overlook jailbreaks where unsafe signals appear in earlier user tokens and do not reach the end state. Subspace comparisons show the missed cases lie along directions poorly aligned with the probe's learned representation, and simply widening the probe does not fix the gap. A lightweight PCA-HMM model trained exclusively on clean prompts tracks token-level trajectories and retrieves many missed unsafe signals while avoiding the false-positive spikes seen with naive pooling over all positions.
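The failure mode under study is easy to sketch. Below, a final-token probe scores a prompt from only the last prefill hidden state; the toy dimensions, the probe direction `w`, and the planted early-token signal are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_token_probe(hidden_states: np.ndarray, w: np.ndarray, b: float = 0.0) -> float:
    """Score a prompt from the last prefill hidden state only.

    hidden_states: (seq_len, d) per-token hidden states after prefill.
    Returns a logit; positive values flag the prompt as unsafe.
    """
    return float(hidden_states[-1] @ w + b)

# Toy illustration: unsafe evidence planted at an early user token
# never reaches the final-token readout.
d = 8
w = rng.normal(size=d)                    # hypothetical trained probe direction
states = 0.1 * rng.normal(size=(10, d))   # mostly benign-looking trajectory
states[2] += 3.0 * w / np.linalg.norm(w)  # strong unsafe signal at token 2

final = final_token_probe(states, w)      # near zero: the probe misses the prompt
per_token = states @ w                    # token-level scores expose the evidence
```

In this toy setting the token-level scores peak sharply at the planted position while the final-token readout stays near the noise floor, which is the shape of the miss the paper diagnoses.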

Core claim

The central claim is that probe-visible unsafe evidence in jailbreak prompts frequently surfaces in intermediate user-token hidden states but remains invisible at the final-token readout used by standard safety probes. Subspace analyses confirm that missed jailbreaks diverge from clean benign prompts along directions outside the probe's representational subspace, and increasing bottleneck width fails to close the mismatch. Token-level inspection reveals the evidence is present earlier in the sequence, while a PCA-HMM trajectory model trained on the same clean split recovers many of these misses from user-content prefill paths without the catastrophic false-positive rate of max-pooling.

What carries the argument

The PCA-HMM trajectory model, which decomposes prefill hidden-state sequences into principal components and hidden Markov states to track distributional shifts across token positions.
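As a minimal sketch of what such a model might look like (the state count, dimensions, hand-set parameters, and diagonal-Gaussian emissions here are assumptions, not the paper's configuration): project each prefill trajectory onto principal components fit on clean hidden states, then score it under an HMM fitted on clean trajectories, flagging trajectories with unusually low log-likelihood.

```python
import numpy as np

def fit_pca(clean_states: np.ndarray, k: int):
    """Top-k principal directions of stacked clean hidden states (n, d)."""
    mu = clean_states.mean(axis=0)
    _, _, vt = np.linalg.svd(clean_states - mu, full_matrices=False)
    return mu, vt[:k]

def project(traj: np.ndarray, mu: np.ndarray, comps: np.ndarray) -> np.ndarray:
    """Map a (seq_len, d) hidden-state trajectory to its (seq_len, k) PCA path."""
    return (traj - mu) @ comps.T

def hmm_loglik(obs, start, trans, means, variances):
    """Forward-algorithm log-likelihood of a PCA trajectory under a
    diagonal-Gaussian HMM whose parameters are assumed already fitted
    on clean prefill trajectories."""
    log_em = -0.5 * (((obs[:, None, :] - means) ** 2) / variances
                     + np.log(2 * np.pi * variances)).sum(axis=-1)  # (T, n_states)
    log_alpha = np.log(start) + log_em[0]
    for t in range(1, len(obs)):
        m = log_alpha.max()
        log_alpha = m + np.log(np.exp(log_alpha - m) @ trans) + log_em[t]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())

# Tiny 2-state example with hand-set (illustrative) parameters:
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
means = np.array([[0.0, 0.0], [1.0, 1.0]])
variances = np.ones((2, 2))

clean_path = np.zeros((6, 2))        # stays near a clean HMM state
shifted_path = np.full((6, 2), 5.0)  # drifts far from every clean state
```

A drifted trajectory scores far lower than one that stays near the clean states, which is the anomaly signal; in practice the HMM would be fitted with EM (e.g. via hmmlearn) and the flagging threshold calibrated on held-out clean prompts.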

If this is right

  • Safety probes must incorporate information from earlier token positions rather than relying solely on the final hidden state.
  • Models trained only on clean harmful and benign examples can still surface distributed unsafe patterns in prefill sequences.
  • Naive pooling across token positions produces unacceptable false positives on benign but safety-adjacent prompts.
  • Trajectory-aware methods serve as practical diagnostic complements to existing final-token probes.
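The overfiring in the third bullet is visible in a toy simulation: if per-token probe scores on benign prompts are modeled as unit-Gaussian noise (an assumption for illustration, with a hypothetical decision threshold), the maximum over positions drifts upward with prompt length while the final-token score does not.

```python
import numpy as np

rng = np.random.default_rng(1)
threshold = 2.5  # hypothetical probe decision threshold

# Per-token probe scores on benign prompts, modeled as N(0, 1) noise.
short_prompts = rng.normal(size=(2000, 10))   # 2000 prompts, 10 tokens each
long_prompts = rng.normal(size=(2000, 200))   # 2000 prompts, 200 tokens each

fp_final = (long_prompts[:, -1] > threshold).mean()           # final-token readout
fp_max_short = (short_prompts.max(axis=1) > threshold).mean() # max-pool, short
fp_max_long = (long_prompts.max(axis=1) > threshold).mean()   # max-pool, long

# Max-pooling fires more often as prompts get longer; the final-token
# readout's false-positive rate is length-independent.
```

This is the statistical reason naive pooling cannot simply replace the final-token readout, and why a trajectory model that conditions on position-to-position dynamics is a more plausible fix.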

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety evaluation protocols may need to log and analyze full hidden-state sequences during prefill rather than discarding intermediate states.
  • The same trajectory approach could be tested on other detection tasks where evidence accumulates gradually across a prompt.
  • If the PCA-HMM generalizes, it offers a low-cost way to retrofit existing probes without retraining the underlying LLM.

Load-bearing premise

That subspace directions identified on clean data correctly flag the mismatches in jailbreak trajectories, and that the PCA-HMM generalizes to recover those cases without creating new failure modes on safe prompts.

What would settle it

Run the PCA-HMM on a fresh collection of jailbreak prompts and safety-adjacent benign prompts, and measure whether it recovers at least half of the final-token misses while keeping false-positive rates below those of max-pooling; that would directly test the recovery claim.
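The two quantities in that test are straightforward to compute once per-prompt decisions are in hand; the boolean arrays below are hypothetical stand-ins for the detectors' outputs, not numbers from the paper.

```python
import numpy as np

def recovery_and_fpr(probe_hit, hmm_hit, benign_hmm_hit):
    """Recovery rate over final-token misses, plus the trajectory
    model's false-positive rate on benign safety-adjacent prompts."""
    misses = ~probe_hit
    recovery = float((hmm_hit & misses).sum() / max(misses.sum(), 1))
    fpr = float(benign_hmm_hit.mean())
    return recovery, fpr

# Hypothetical decisions on 5 jailbreak and 4 benign prompts:
probe_hit = np.array([True, False, False, False, True])  # final-token probe
hmm_hit = np.array([True, True, False, True, False])     # PCA-HMM diagnostic
benign_hmm_hit = np.array([False, False, True, False])   # PCA-HMM on benign set

recovery, fpr = recovery_and_fpr(probe_hit, hmm_hit, benign_hmm_hit)
# recovery covers 2 of the probe's 3 misses; fpr counts 1 of 4 benign prompts
```

The recovery claim passes under the proposed criterion when `recovery >= 0.5` and `fpr` comes in below the max-pooling baseline measured on the same benign set.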

Figures

Figures reproduced from arXiv: 2605.12726 by Shravan Doda.

Figure 1
Figure 1: Jailbreak–XSTest operating-point shift from the final-token probe to the user-window PCA-HMM trajectory diagnostic. Across all three models, the trajectory diagnostic recovers many final-token misses while reducing XSTest false positives at this operating point.
Figure 2
Figure 2: Complementarity on jailbreak prompts. Stacked bars partition 900 jailbreak prompts by whether they are caught by the final-token probe, the PCA-HMM trajectory diagnostic, both, or neither. Percentages inside bars are normalized by the 900-prompt jailbreak set. PCA-HMM catches many prompts missed by the final-token probe: 236 for Llama, 106 for Mistral, and 310 for OLMo3.
original abstract

Final-token safety probes monitor a single hidden state after prompt prefill, but jailbreak prompts can contain probe-visible unsafe evidence distributed across earlier user-token representations that is missed by this readout. We study this prefill-time failure mode using SafeSwitch-style probes trained only on clean harmful and benign prompts across three instruction-tuned LLMs. The probes achieve high recall on clean harmful prompts, but miss many jailbreaks and can produce false positives on safety-adjacent benign prompts. Subspace analyses suggest that missed jailbreaks differ from clean benign prompts along directions that are poorly captured by the probe's representational subspace, and increasing probe bottleneck width does not reliably resolve this mismatch. Token-level prefill analyses reveal that probe-visible unsafe evidence often appears earlier in the sequence but is not exposed at the final-token readout, while naive max-pooling over token positions overfires on safe prompts. A simple PCA-HMM trajectory model, trained only on the same clean split, recovers many final-token misses from user-content prefill trajectories without the catastrophic false-positive behavior of naive token pooling, motivating trajectory-aware hidden-state analyses as diagnostic complements to final-token probes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that final-token safety probes on three instruction-tuned LLMs achieve high recall on clean harmful prompts but miss many jailbreaks because probe-visible unsafe evidence appears earlier in user-token prefill trajectories; subspace analyses show these misses lie outside the probe's representational subspace, and a PCA-HMM trajectory model trained exclusively on the clean split recovers many such misses without the false-positive overfiring of naive token pooling.

Significance. If the PCA-HMM recovery is shown to align with missed unsafe directions rather than length or distributional artifacts, the work would usefully motivate trajectory-aware diagnostics as complements to final-token probes, addressing a concrete failure mode in current safety monitoring.

major comments (2)
  1. [Abstract] The central claim that the PCA-HMM 'recovers many final-token misses' is stated without any quantitative metrics (recovery rates, false-positive rates, comparisons to baselines, or error bars), so the empirical support for the claim cannot be assessed.
  2. [Abstract; PCA-HMM section] The model is trained only on the clean split yet applied to jailbreak trajectories, but no validation is reported (e.g., cosine similarity between HMM emission vectors and the probe-missed subspace, or controls for prompt-length artifacts) to rule out incidental recovery due to distributional shift.
minor comments (2)
  1. The abstract and methods should report the exact three LLMs, probe bottleneck widths, PCA dimensionality, HMM state count, and training splits with full experimental details.
  2. Token-level prefill analyses would benefit from explicit figures showing per-position probe activations for representative jailbreak vs. clean examples.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract and the PCA-HMM validation. We have revised the manuscript to incorporate quantitative metrics and additional controls, strengthening the empirical presentation of our claims.

point-by-point responses
  1. Referee: [Abstract] The central claim that the PCA-HMM 'recovers many final-token misses' is stated without any quantitative metrics (recovery rates, false-positive rates, comparisons to baselines, or error bars), so the empirical support for the claim cannot be assessed.

    Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised version we have updated the abstract to report recovery rates (the fraction of final-token misses recovered by the PCA-HMM), false-positive rates on benign prompts, direct comparisons against the naive max-pooling baseline, and error bars obtained from repeated training runs. These additions make the central empirical claim directly assessable while preserving the original narrative. revision: yes

  2. Referee: [Abstract; PCA-HMM section] The model is trained only on the clean split yet applied to jailbreak trajectories, but no validation is reported (e.g., cosine similarity between HMM emission vectors and the probe-missed subspace, or controls for prompt-length artifacts) to rule out incidental recovery due to distributional shift.

    Authors: We acknowledge that explicit validation against distributional artifacts strengthens the interpretation. The revised manuscript now includes (i) cosine-similarity measurements between the learned HMM emission vectors and the linear directions separating probe-missed jailbreaks from clean benign prompts, and (ii) length-matched subset experiments that confirm the PCA-HMM continues to recover misses without elevated false positives. These controls indicate that recovery tracks the missed unsafe subspace rather than prompt-length or split-shift artifacts. revision: yes

Circularity Check

0 steps flagged

No significant circularity in PCA-HMM generalization from clean training to jailbreak trajectories

full rationale

The paper trains the PCA-HMM exclusively on the clean harmful/benign split and applies it to separate jailbreak cases to recover final-token misses. No equations or steps reduce the reported recoveries to the training inputs by construction, nor does any load-bearing claim rely on self-citation chains, uniqueness theorems, or ansatzes imported from prior work. Subspace analyses are used only to motivate the mismatch; the central result is a standard held-out evaluation that remains externally falsifiable. This is a normal non-circular finding.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that hidden-state trajectories during prefill contain recoverable unsafe signals distinct from clean prompts, with the PCA-HMM serving as a lightweight extractor.

free parameters (2)
  • PCA dimensionality
    Number of principal components retained for trajectory representation; chosen to capture relevant variance.
  • HMM state count
    Number of hidden states in the Markov model; fitted to model token-position dynamics.
axioms (1)
  • domain assumption: probes trained only on clean harmful and benign prompts produce subspaces that generalize to jailbreak detection.
    Invoked in the subspace analyses and the probe-training description.

pith-pipeline@v0.9.0 · 5491 in / 1200 out tokens · 58579 ms · 2026-05-14T21:33:49.301876+00:00 · methodology

discussion (0)

