Recognition: 1 theorem link
· Lean TheoremTrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense
Pith reviewed 2026-05-10 18:20 UTC · model grok-4.3
The pith
Hidden states during LLM decoding drift toward risk regions on jailbreaks, enabling real-time detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hidden states in critical layers during the decoding phase carry stronger and more stable risk signals than input jailbreak prompts because the hidden representations of tokens generated during jailbreak attempts progressively approach high-risk regions in the latent space.
What carries the argument
Sliding-window aggregation of hidden-state trajectories to quantify persistent risk and trigger intervention when a local threshold is exceeded.
If this is right
- Decoding can be interrupted immediately when risk persists in the sliding window.
- Defense works across multiple open-source models and 12 attack types without any training.
- Detection runs at low per-token latency while holding false positives under 1.5 percent.
- The method avoids modifying the underlying model weights or architecture.
Where Pith is reading between the lines
- The same trajectory-monitoring idea could be tested on other unsafe generation patterns such as factually wrong or biased content.
- If hidden states become available through APIs, the approach might apply to some closed models without internal access.
- Thresholds could be made conversation-aware rather than fixed, potentially lowering false positives on edge cases.
Load-bearing premise
The risk signals observed in hidden-state trajectories remain consistent enough across different models, attack variants, and normal usage that a fixed threshold and window rule can separate jailbreaks from benign generations without excessive errors.
What would settle it
A jailbreak attack that produces harmful output while keeping hidden-state risk scores below the detection threshold for the entire generation, or a benign query that repeatedly triggers high-risk trajectories.
Figures
read the original abstract
Existing jailbreak defense paradigms primarily rely on static detection of prompts, outputs, or internal states, often neglecting the dynamic evolution of risk during decoding. This oversight leaves risk signals embedded in decoding trajectories underutilized, constituting a critical blind spot in current defense systems. In this work, we empirically demonstrate that hidden states in critical layers during the decoding phase carry stronger and more stable risk signals than input jailbreak prompts. Specifically, the hidden representations of tokens generated during jailbreak attempts progressively approach high-risk regions in the latent space. Based on this observation, we propose TrajGuard, a training-free, decoding-time defense framework. TrajGuard aggregates hidden-state trajectories via a sliding window to quantify risk in real time, triggering a lightweight semantic adjudication only when risk within a local window persistently exceeds a threshold. This mechanism enables the immediate interruption or constraint of subsequent decoding. Extensive experiments across 12 jailbreak attacks and various open-source LLMs show that TrajGuard achieves an average defense rate of 95%. Furthermore, it reduces detection latency to 5.2 ms/token while maintaining a false positive rate below 1.5%. These results confirm that hidden-state trajectories during decoding can effectively support real-time jailbreak detection, highlighting a promising direction for defenses without model modification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that hidden-state trajectories during LLM decoding carry stronger, more stable risk signals for jailbreak detection than static input prompts. It proposes TrajGuard, a training-free framework that aggregates these states over a sliding window, computes a real-time risk score, and triggers lightweight semantic adjudication (or interruption) only when the score persistently exceeds a fixed threshold. Experiments across 12 jailbreak attacks on multiple open-source LLMs report an average 95% defense rate, <1.5% false-positive rate, and 5.2 ms/token latency.
Significance. If the empirical separation holds under proper controls, the work provides a practical, model-modification-free approach to real-time jailbreak defense that exploits the dynamic evolution of internal representations rather than static checks. The observation that jailbreak-generated tokens progressively approach high-risk latent regions is a useful empirical finding that could inform future dynamic monitoring techniques.
major comments (2)
- [§3.2] §3.2 (Risk Quantification): The sliding-window risk metric and fixed threshold lack any derivation showing invariance across architectures or fine-tunings; the method therefore depends on two free parameters whose values are almost certainly selected on the same distributions used to report the 95% defense rate and <1.5% FPR, directly undermining the claimed generality to unseen models and attacks.
- [§4.1–4.3] §4.1–4.3 (Experimental Setup): No details are provided on how the 12 attacks and models were chosen, whether threshold/window size were tuned on held-out data, or whether statistical significance and multiple-comparison corrections were applied; without these controls the reported performance cannot be assessed as robust rather than post-hoc.
minor comments (2)
- [§3.1] The description of how 'critical layers' are identified is empirical but would benefit from an explicit ablation or selection criterion in §3.1.
- [§3.2] Notation for the risk aggregation function inside the window should be formalized with an equation rather than prose.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our work. We provide point-by-point responses to the major concerns and indicate the revisions we will make to address them.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Risk Quantification): The sliding-window risk metric and fixed threshold lack any derivation showing invariance across architectures or fine-tunings; the method therefore depends on two free parameters whose values are almost certainly selected on the same distributions used to report the 95% defense rate and <1.5% FPR, directly undermining the claimed generality to unseen models and attacks.
Authors: We agree that there is no theoretical derivation provided for the invariance of the sliding-window risk metric and threshold across different architectures or fine-tunings. The parameters are selected empirically based on the observed separation in hidden-state trajectories. To strengthen the presentation, we will revise Section 3.2 to include a more detailed justification for the parameter choices, supported by additional experiments showing consistent performance across a range of window sizes and thresholds. We will also discuss the empirical generality observed in our multi-model experiments. revision: yes
-
Referee: [§4.1–4.3] §4.1–4.3 (Experimental Setup): No details are provided on how the 12 attacks and models were chosen, whether threshold/window size were tuned on held-out data, or whether statistical significance and multiple-comparison corrections were applied; without these controls the reported performance cannot be assessed as robust rather than post-hoc.
Authors: We will update Sections 4.1 to 4.3 with additional details on the selection criteria for the 12 attacks and the LLMs used, drawing from established benchmarks in the jailbreak literature. We will clarify the process for determining the threshold and window size, noting that they were chosen to generalize across the tested scenarios. Furthermore, we will incorporate statistical significance testing and address multiple comparisons appropriately in the revised manuscript to better substantiate the robustness of the results. revision: yes
Circularity Check
No circularity: empirical observation and training-free heuristic with no self-referential derivation
full rationale
The paper's core contribution is an empirical demonstration that hidden-state trajectories during decoding exhibit stronger risk signals than static prompts, followed by a practical sliding-window aggregation rule with a fixed threshold to trigger adjudication. No equations, uniqueness theorems, or first-principles derivations are presented that reduce to their own inputs by construction. The method is explicitly training-free and relies on experimental results across multiple models and attacks rather than any fitted parameter or self-citation chain that would force the claimed separation. This is a standard empirical engineering paper whose claims are externally falsifiable via replication on held-out data.
Axiom & Free-Parameter Ledger
free parameters (2)
- risk threshold
- sliding window size
axioms (1)
- domain assumption Hidden states in critical layers encode progressively stronger risk signals during jailbreak decoding that are more stable than those in input prompts.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanLogicNat recovery and 8-tick period (nowhere referenced) unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
sliding window Wt of size w=8 ... Aggτ∈Wt(rl,τ) ... EWMA ... Trigger(t) ⇐⇒ sum I(pτ ≥ γ) = k for k=3 consecutive steps
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Before the Last Token: Diagnosing Final-Token Safety Probe Failures
Final-token probes miss distributed unsafe evidence in jailbreaks, but a PCA-HMM model on prefill trajectories recovers many misses without naive pooling's false positives.
Reference graph
Works this paper leans on
-
[1]
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, and 12 others. 2022. https://arxiv.org/abs/2204.05862 Training a helpful an...
work page internal anchor Pith review Pith/arXiv arXiv 2022
- [2]
-
[3]
Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, and 1 others. 2023. https://arxiv.org/abs/2310.08419 Jailbreaking black box large language models in twenty queries . arXiv preprint arXiv:2310.08419
work page internal anchor Pith review arXiv 2023
- [4]
-
[5]
Pei Ding and 1 others. 2024. https://aclanthology.org/2024.naacl-long.118 Generalized nested jailbreak prompts can fool large language models . In Proceedings of NAACL
2024
-
[6]
Ming Dong, Jinkui Zhang, Bolong Zheng, Xinhui Tu, Po Hu, and Tingting He. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.197 DSCD : Large language model detoxification with self-constrained decoding . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3969--3984, Suzhou, China. Association for Computational...
-
[7]
Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Tianrui Guan, Mengdi Wang, Alvaro Velasquez, Ahmad Beirami, Furong Huang, Dinesh Manocha, and Amrit Singh Bedi. 2025. https://arxiv.org/abs/2411.18688 Immune: Improving safety against jailbreaks in multi-modal llms via inference-time alignment . Preprint, arXiv:2411.18688
-
[8]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3...
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [9]
-
[10]
James Y. Huang, Sailik Sengupta, Daniele Bonadiman, Yi-An Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, and Dan Roth. 2025. https://doi.org/10.18653/v1/2025.acl-long.1274 D e AL : Decoding-time alignment for large language models . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: L...
-
[11]
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. 2023. https://arxiv.org/abs/2312.06674 Llama guard: Llm-based input-output safeguard for human-ai conversations . Preprint, arXiv:2312.06674
work page internal anchor Pith review arXiv 2023
-
[12]
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. https://arxiv.org/abs/2310.0...
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [13]
- [14]
-
[15]
Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2023. https://arxiv.org/abs/2310.04451 Autodan: Generating stealthy jailbreak prompts on aligned large language models . arXiv preprint arXiv:2310.04451
work page internal anchor Pith review arXiv 2023
- [16]
-
[17]
Antonio Mastropaolo, Luca Pascarella, Emanuela Guglielmi, Matteo Ciniselli, Simone Scalabrino, Rocco Oliveto, and Gabriele Bavota. 2023. https://arxiv.org/abs/2302.00438 On the robustness of code generation techniques: An empirical study on github copilot . Preprint, arXiv:2302.00438
-
[18]
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. 2024. https://arxiv.org/abs/2402.04249 Harmbench: A standardized evaluation framework for automated red teaming and robust refusal . Preprint, arXiv:2402.04249
work page internal anchor Pith review arXiv 2024
- [19]
-
[20]
Meta AI . 2024. Llama guard 3: Model card and prompt formats. https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-3/
2024
-
[21]
Thu-Trang Nguyen, Thanh Trong Vu, Hieu Dinh Vo, and Son Nguyen. 2025. https://doi.org/10.1016/j.infsof.2025.107780 An empirical study on capability of large language models in understanding code semantics . Information and Software Technology, 185:107780
-
[22]
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. https://arxiv.org/abs/2203.02155 Training language models to f...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[23]
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277
work page internal anchor Pith review arXiv 2023
-
[24]
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2023. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693
work page internal anchor Pith review arXiv 2023
- [25]
- [26]
-
[27]
Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. 2025. https://doi.org/10.18653/v1/2025.acl-long.1207 LLM s know their vulnerabilities: Uncover safety gaps through natural distribution shifts . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume...
- [28]
-
[29]
Mark Russinovich, Ahmed Salem, and Ronen Eldan. 2025 b . Great, now write an article about that: The crescendo \ Multi-Turn \ \ LLM \ jailbreak attack. In 34th USENIX Security Symposium (USENIX Security 25), pages 2421--2440
2025
-
[30]
Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. 2024. https://arxiv.org/abs/2308.01263 Xstest: A test suite for identifying exaggerated safety behaviours in large language models . Preprint, arXiv:2308.01263
work page internal anchor Pith review arXiv 2024
-
[31]
Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, and 24 others. 2025. https://arxiv.org/abs/2501.18837 Constitutional classifiers: Defending aga...
- [32]
-
[33]
Rohan Taori, Ishaan Gulrajani, and 1 others. 2023. https://crfm.stanford.edu/2023/03/13/alpaca.html Stanford alpaca: An instruction-following llm . GitHub
2023
-
[34]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, and 49 others. 2023. https://arxiv.org/abs/2307.09288 Llama 2: Open fo...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
Xuekang Wang, Shengyu Zhu, and Xueqi Cheng. 2025 a . https://doi.org/10.18653/v1/2025.emnlp-main.648 Speculative safety-aware decoding . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12838--12852, Suzhou, China. Association for Computational Linguistics
- [36]
-
[37]
Zezhong Wang, Fangkai Yang, Lu Wang, Pu Zhao, Hongru Wang, Liang Chen, Qingwei Lin, and Kam-Fai Wong. 2024. https://aclanthology.org/2024.naacl-long.92/ Self-guard: Empower the llm to safeguard itself . In NAACL 2024
2024
-
[38]
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023 a . https://arxiv.org/abs/2307.02483 Jailbroken: How does llm safety training fail? In Advances in Neural Information Processing Systems (NeurIPS)
work page internal anchor Pith review arXiv 2023
- [39]
- [40]
-
[41]
Zitao Xuan, Xiaofeng Mao, Da Chen, Xin Zhang, Yuhan Dong, and Jun Zhou. 2025. https://doi.org/10.18653/v1/2025.findings-acl.932 S hield H ead: Decoding-time safeguard for large language models . In Findings of the Association for Computational Linguistics: ACL 2025, pages 18129--18143, Vienna, Austria. Association for Computational Linguistics
-
[42]
Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. 2023. https://arxiv.org/abs/2309.10253 Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts . arXiv preprint arXiv:2309.10253
work page internal anchor Pith review arXiv 2023
- [43]
- [44]
-
[45]
Zhang and 1 others
Z. Zhang and 1 others. 2024. https://aclanthology.org/2024.acl-long.481.pdf Defending large language models against jailbreaking attacks through goal prioritization . In ACL 2024
2024
-
[46]
Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, Baosong Yang, Chen Cheng, Jialong Tang, Jiandong Jiang, Jianwei Zhang, Jijie Xu, Ming Yan, Minmin Sun, Pei Zhang, and 24 others. 2025 a . https://arxiv.org/abs/2510.14276 Qwen3guard technical report . Preprint, arXiv:2510.14276
work page internal anchor Pith review arXiv 2025
-
[47]
Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, Xingyu Sui, Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, and Ting Liu. 2025 b . https://doi.org/10.18653/v1/2025.emnlp-main.1248 A da S teer: Your aligned LLM is inherently an adaptive jailbreak defender . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processin...
- [48]
-
[50]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. 2023 b . https://arxiv.org/abs/2307.15043 Universal and transferable adversarial attacks on aligned language models . arXiv preprint arXiv:2307.15043
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[51]
online" 'onlinestring :=
ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
-
[52]
write newline
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.