arxiv: 2604.07727 · v1 · submitted 2026-04-09 · 💻 cs.CR · cs.AI

Recognition: 1 theorem link

· Lean Theorem

TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

Cheng Liu , Xiaolei Liu , Xingyu Li , Bangzhou Xin , Kangyi Ding

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:20 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords jailbreak defensehidden state trajectoriesdecoding-time detectionLLM safetystreaming detectiontraining-free defenserisk signal monitoring

0 comments

The pith

Hidden states during LLM decoding drift toward risk regions on jailbreaks, enabling real-time detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that hidden representations of tokens generated during jailbreak attempts move progressively closer to high-risk areas in the model's latent space, providing a stronger and steadier signal than the original input prompt. This dynamic evolution during decoding is tracked by aggregating states over a sliding window to measure persistent risk in real time. When the local risk exceeds a threshold, the system triggers a lightweight check that can interrupt or constrain further generation. The approach requires no model retraining and operates at low latency while keeping false positives low on ordinary queries. A sympathetic reader would see this as shifting defense from upfront or post-hoc checks to monitoring the generation process itself.

Core claim

Hidden states in critical layers during the decoding phase carry stronger and more stable risk signals than input jailbreak prompts because the hidden representations of tokens generated during jailbreak attempts progressively approach high-risk regions in the latent space.

What carries the argument

Sliding-window aggregation of hidden-state trajectories to quantify persistent risk and trigger intervention when a local threshold is exceeded.

If this is right

Decoding can be interrupted immediately when risk persists in the sliding window.
Defense works across multiple open-source models and 12 attack types without any training.
Detection runs at low per-token latency while holding false positives under 1.5 percent.
The method avoids modifying the underlying model weights or architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trajectory-monitoring idea could be tested on other unsafe generation patterns such as factually wrong or biased content.
If hidden states become available through APIs, the approach might apply to some closed models without internal access.
Thresholds could be made conversation-aware rather than fixed, potentially lowering false positives on edge cases.

Load-bearing premise

The risk signals observed in hidden-state trajectories remain consistent enough across different models, attack variants, and normal usage that a fixed threshold and window rule can separate jailbreaks from benign generations without excessive errors.

What would settle it

A jailbreak attack that produces harmful output while keeping hidden-state risk scores below the detection threshold for the entire generation, or a benign query that repeatedly triggers high-risk trajectories.

Figures

Figures reproduced from arXiv: 2604.07727 by Bangzhou Xin, Cheng Liu, Kangyi Ding, Xiaolei Liu, Xingyu Li.

**Figure 2.** Figure 2: Comparative jailbreak dynamics across mod [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Overview. TrajGuard monitors hidden-state trajectories in real time via the SGS module and triggers the [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of average detection steps (blue [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: ASR versus average Trigger Frequency (TF) [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Hidden-state Mahalanobis-distance patterns for Llama-2-7B-chat and Llama-3.1-8B-Instruct. The upper [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Hidden-state Mahalanobis-distance patterns for Mistral-7B and Vicuna-7B. The upper subfigure corre [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Case study of real-time interception against a Jailbreak attack. (a) The stream of decoded content, illustrating how the attack masks malicious intent within a benign framing. (b) The corresponding trajectory of the Safety (Pt). The trajectory clearly demonstrates the revealing effect, reaching a threshold (trigger) after step 16, at which point SGS triggers PAIR-Judge to stop generating. intermediate to l… view at source ↗

**Figure 9.** Figure 9: Case Study 2: Benign Interaction. (a) The user asks a standard historical question, and the model generates a detailed response. (b) The streaming safety score (Pt, orange line) fluctuates but remains consistently below the trigger threshold (dotted lines). TrajGuard correctly identifies this as low-risk, allowing the generation to proceed uninterrupted, demonstrating the system’s low false-positive rate… view at source ↗

**Figure 10.** Figure 10: Visualization of activation boundaries on Llama-2-7B-chat. We plot the Mahalanobis distances to benign (dB) and harmful (dH) centers. Colors denote Benign (Green), Malicious (Red), and Attacks (Blue). Inset histograms of the margin (m = dB − dH) demonstrate clear separability between safe and unsafe representations [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

read the original abstract

Existing jailbreak defense paradigms primarily rely on static detection of prompts, outputs, or internal states, often neglecting the dynamic evolution of risk during decoding. This oversight leaves risk signals embedded in decoding trajectories underutilized, constituting a critical blind spot in current defense systems. In this work, we empirically demonstrate that hidden states in critical layers during the decoding phase carry stronger and more stable risk signals than input jailbreak prompts. Specifically, the hidden representations of tokens generated during jailbreak attempts progressively approach high-risk regions in the latent space. Based on this observation, we propose TrajGuard, a training-free, decoding-time defense framework. TrajGuard aggregates hidden-state trajectories via a sliding window to quantify risk in real time, triggering a lightweight semantic adjudication only when risk within a local window persistently exceeds a threshold. This mechanism enables the immediate interruption or constraint of subsequent decoding. Extensive experiments across 12 jailbreak attacks and various open-source LLMs show that TrajGuard achieves an average defense rate of 95%. Furthermore, it reduces detection latency to 5.2 ms/token while maintaining a false positive rate below 1.5%. These results confirm that hidden-state trajectories during decoding can effectively support real-time jailbreak detection, highlighting a promising direction for defenses without model modification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TrajGuard's sliding-window check on hidden-state trajectories during decoding is a straightforward practical idea, but the fixed threshold and lack of model-independent grounding limit how far the results can be trusted.

read the letter

The paper's core contribution is showing that hidden states evolve differently when a model is generating under a jailbreak, and that a simple sliding-window aggregator on those states can flag trouble early enough to interrupt decoding. They avoid any training step and report 95% defense across 12 attacks on several open-source models, with detection at 5 ms per token and false positives under 1.5%. That combination of low overhead and no retraining is the part worth paying attention to if you care about deployed systems rather than just prompt-level filters.

Referee Report

2 major / 2 minor

Summary. The paper claims that hidden-state trajectories during LLM decoding carry stronger, more stable risk signals for jailbreak detection than static input prompts. It proposes TrajGuard, a training-free framework that aggregates these states over a sliding window, computes a real-time risk score, and triggers lightweight semantic adjudication (or interruption) only when the score persistently exceeds a fixed threshold. Experiments across 12 jailbreak attacks on multiple open-source LLMs report an average 95% defense rate, <1.5% false-positive rate, and 5.2 ms/token latency.

Significance. If the empirical separation holds under proper controls, the work provides a practical, model-modification-free approach to real-time jailbreak defense that exploits the dynamic evolution of internal representations rather than static checks. The observation that jailbreak-generated tokens progressively approach high-risk latent regions is a useful empirical finding that could inform future dynamic monitoring techniques.

major comments (2)

[§3.2] §3.2 (Risk Quantification): The sliding-window risk metric and fixed threshold lack any derivation showing invariance across architectures or fine-tunings; the method therefore depends on two free parameters whose values are almost certainly selected on the same distributions used to report the 95% defense rate and <1.5% FPR, directly undermining the claimed generality to unseen models and attacks.
[§4.1–4.3] §4.1–4.3 (Experimental Setup): No details are provided on how the 12 attacks and models were chosen, whether threshold/window size were tuned on held-out data, or whether statistical significance and multiple-comparison corrections were applied; without these controls the reported performance cannot be assessed as robust rather than post-hoc.

minor comments (2)

[§3.1] The description of how 'critical layers' are identified is empirical but would benefit from an explicit ablation or selection criterion in §3.1.
[§3.2] Notation for the risk aggregation function inside the window should be formalized with an equation rather than prose.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We provide point-by-point responses to the major concerns and indicate the revisions we will make to address them.

read point-by-point responses

Referee: [§3.2] §3.2 (Risk Quantification): The sliding-window risk metric and fixed threshold lack any derivation showing invariance across architectures or fine-tunings; the method therefore depends on two free parameters whose values are almost certainly selected on the same distributions used to report the 95% defense rate and <1.5% FPR, directly undermining the claimed generality to unseen models and attacks.

Authors: We agree that there is no theoretical derivation provided for the invariance of the sliding-window risk metric and threshold across different architectures or fine-tunings. The parameters are selected empirically based on the observed separation in hidden-state trajectories. To strengthen the presentation, we will revise Section 3.2 to include a more detailed justification for the parameter choices, supported by additional experiments showing consistent performance across a range of window sizes and thresholds. We will also discuss the empirical generality observed in our multi-model experiments. revision: yes
Referee: [§4.1–4.3] §4.1–4.3 (Experimental Setup): No details are provided on how the 12 attacks and models were chosen, whether threshold/window size were tuned on held-out data, or whether statistical significance and multiple-comparison corrections were applied; without these controls the reported performance cannot be assessed as robust rather than post-hoc.

Authors: We will update Sections 4.1 to 4.3 with additional details on the selection criteria for the 12 attacks and the LLMs used, drawing from established benchmarks in the jailbreak literature. We will clarify the process for determining the threshold and window size, noting that they were chosen to generalize across the tested scenarios. Furthermore, we will incorporate statistical significance testing and address multiple comparisons appropriately in the revised manuscript to better substantiate the robustness of the results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observation and training-free heuristic with no self-referential derivation

full rationale

The paper's core contribution is an empirical demonstration that hidden-state trajectories during decoding exhibit stronger risk signals than static prompts, followed by a practical sliding-window aggregation rule with a fixed threshold to trigger adjudication. No equations, uniqueness theorems, or first-principles derivations are presented that reduce to their own inputs by construction. The method is explicitly training-free and relies on experimental results across multiple models and attacks rather than any fitted parameter or self-citation chain that would force the claimed separation. This is a standard empirical engineering paper whose claims are externally falsifiable via replication on held-out data.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The framework rests on an empirical observation about hidden-state evolution that is treated as given rather than derived. Threshold and window size are implicit free parameters whose selection process is not detailed.

free parameters (2)

risk threshold
Used to decide when to trigger semantic adjudication; value and selection method not specified in abstract.
sliding window size
Determines how many recent tokens are aggregated for risk quantification; not numerically specified.

axioms (1)

domain assumption Hidden states in critical layers encode progressively stronger risk signals during jailbreak decoding that are more stable than those in input prompts.
This is the key empirical premise stated in the abstract but not proven or bounded within the provided text.

pith-pipeline@v0.9.0 · 5534 in / 1420 out tokens · 53790 ms · 2026-05-10T18:20:26.346731+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery and 8-tick period (nowhere referenced) unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

sliding window Wt of size w=8 ... Aggτ∈Wt(rl,τ) ... EWMA ... Trigger(t) ⇐⇒ sum I(pτ ≥ γ) = k for k=3 consecutive steps

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Before the Last Token: Diagnosing Final-Token Safety Probe Failures
cs.LG 2026-05 unverdicted novelty 6.0

Final-token probes miss distributed unsafe evidence in jailbreaks, but a PCA-HMM model on prefill trajectories recovers many misses without naive pooling's false positives.

Reference graph

Works this paper leans on

51 extracted references · 43 canonical work pages · cited by 1 Pith paper · 16 internal anchors

[1]

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, and 12 others. 2022. https://arxiv.org/abs/2204.05862 Training a helpful an...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

Somnath Banerjee, Sayan Layek, Soham Tripathy, Shanu Kumar, Animesh Mukherjee, and Rima Hazra. 2024. https://arxiv.org/abs/2406.12274 Safeinfer: Context adaptive decoding time safety alignment for large language models . Preprint, arXiv:2406.12274

work page arXiv 2024
[3]

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, and 1 others. 2023. https://arxiv.org/abs/2310.08419 Jailbreaking black box large language models in twenty queries . arXiv preprint arXiv:2310.08419

work page internal anchor Pith review arXiv 2023
[4]

Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. 2023. https://arxiv.org/abs/2310.06474 Multilingual jailbreak challenges in large language models . arXiv preprint arXiv:2310.06474

work page arXiv 2023
[5]

Pei Ding and 1 others. 2024. https://aclanthology.org/2024.naacl-long.118 Generalized nested jailbreak prompts can fool large language models . In Proceedings of NAACL

2024
[6]

Ming Dong, Jinkui Zhang, Bolong Zheng, Xinhui Tu, Po Hu, and Tingting He. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.197 DSCD : Large language model detoxification with self-constrained decoding . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3969--3984, Suzhou, China. Association for Computational...

work page doi:10.18653/v1/2025.emnlp-main.197 2025
[7]

Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Tianrui Guan, Mengdi Wang, Alvaro Velasquez, Ahmad Beirami, Furong Huang, Dinesh Manocha, and Amrit Singh Bedi. 2025. https://arxiv.org/abs/2411.18688 Immune: Improving safety against jailbreaks in multi-modal llms via inference-time alignment . Preprint, arXiv:2411.18688

work page arXiv 2025
[8]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Xiaomeng Hu, Fei Huang, Chenhan Yuan, Junyang Lin, and Tsung-Yi Ho. 2025. https://arxiv.org/abs/2509.06982 Care: Decoding time safety alignment via rollback and introspection intervention . Preprint, arXiv:2509.06982

work page arXiv 2025
[10]

Huang, Sailik Sengupta, Daniele Bonadiman, Yi-An Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, and Dan Roth

James Y. Huang, Sailik Sengupta, Daniele Bonadiman, Yi-An Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, and Dan Roth. 2025. https://doi.org/10.18653/v1/2025.acl-long.1274 D e AL : Decoding-time alignment for large language models . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: L...

work page doi:10.18653/v1/2025.acl-long.1274 2025
[11]

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. 2023. https://arxiv.org/abs/2312.06674 Llama guard: Llm-based input-output safeguard for human-ai conversations . Preprint, arXiv:2312.06674

work page internal anchor Pith review arXiv 2023
[12]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. https://arxiv.org/abs/2310.0...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, and Summer Yue. 2024. https://arxiv.org/abs/2408.15221 Llm defenses are not robust to multi-turn human jailbreaks yet . Preprint, arXiv:2408.15221

work page arXiv 2024
[14]

Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. 2023. https://arxiv.org/abs/2311.03191 Deepinception: Hypnotize large language model to be jailbreaker . arXiv preprint arXiv:2311.03191

work page arXiv 2023
[15]

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2023. https://arxiv.org/abs/2310.04451 Autodan: Generating stealthy jailbreak prompts on aligned large language models . arXiv preprint arXiv:2310.04451

work page internal anchor Pith review arXiv 2023
[16]

Weikai Lu, Ziqian Zeng, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, and Cen Chen. 2024. https://arxiv.org/abs/2404.05880 Eraser: Jailbreaking defense in large language models via unlearning harmful knowledge . Preprint, arXiv:2404.05880

work page arXiv 2024
[17]

Antonio Mastropaolo, Luca Pascarella, Emanuela Guglielmi, Matteo Ciniselli, Simone Scalabrino, Rocco Oliveto, and Gabriele Bavota. 2023. https://arxiv.org/abs/2302.00438 On the robustness of code generation techniques: An empirical study on github copilot . Preprint, arXiv:2302.00438

work page arXiv 2023
[18]

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. 2024. https://arxiv.org/abs/2402.04249 Harmbench: A standardized evaluation framework for automated red teaming and robust refusal . Preprint, arXiv:2402.04249

work page internal anchor Pith review arXiv 2024
[19]

Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. 2024. https://arxiv.org/abs/2312.02119 Tree of attacks: Jailbreaking black-box llms automatically . Preprint, arXiv:2312.02119

work page arXiv 2024
[20]

Meta AI . 2024. Llama guard 3: Model card and prompt formats. https://www.llama.com/docs/model-cards-and-prompt-formats/llama-guard-3/

2024
[21]

Thu-Trang Nguyen, Thanh Trong Vu, Hieu Dinh Vo, and Son Nguyen. 2025. https://doi.org/10.1016/j.infsof.2025.107780 An empirical study on capability of large language models in understanding code semantics . Information and Software Technology, 185:107780

work page doi:10.1016/j.infsof.2025.107780 2025
[22]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. https://arxiv.org/abs/2203.02155 Training language models to f...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277

work page internal anchor Pith review arXiv 2023
[24]

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2023. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693

work page internal anchor Pith review arXiv 2023
[25]

Cheng Qian, Hainan Zhang, Lei Sha, and Zhiming Zheng. 2025. https://arxiv.org/abs/2409.03788 Hsf: Defending against jailbreak attacks with hidden state filtering . Preprint, arXiv:2409.03788

work page arXiv 2025
[26]

Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. 2024. https://arxiv.org/abs/2410.10700 Derail yourself: Multi-turn llm jailbreak attack through self-discovered clues . arXiv preprint arXiv:2410.10700

work page arXiv 2024
[27]

Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. 2025. https://doi.org/10.18653/v1/2025.acl-long.1207 LLM s know their vulnerabilities: Uncover safety gaps through natural distribution shifts . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume...

work page doi:10.18653/v1/2025.acl-long.1207 2025
[28]

Mark Russinovich, Ahmed Salem, and Ronen Eldan. 2025 a . https://arxiv.org/abs/2404.01833 Great, now write an article about that: The crescendo multi-turn llm jailbreak attack . Preprint, arXiv:2404.01833

work page arXiv 2025
[29]

Mark Russinovich, Ahmed Salem, and Ronen Eldan. 2025 b . Great, now write an article about that: The crescendo \ Multi-Turn \ \ LLM \ jailbreak attack. In 34th USENIX Security Symposium (USENIX Security 25), pages 2421--2440

2025
[30]

Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. 2024. https://arxiv.org/abs/2308.01263 Xstest: A test suite for identifying exaggerated safety behaviours in large language models . Preprint, arXiv:2308.01263

work page internal anchor Pith review arXiv 2024
[31]

Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming

Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, and 24 others. 2025. https://arxiv.org/abs/2501.18837 Constitutional classifiers: Defending aga...

work page arXiv 2025
[32]

Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, and Tat-Seng Chua. 2025. Alphasteer: Learning refusal steering with principled null-space constraint. arXiv preprint arXiv:2506.07022

work page arXiv 2025
[33]

Rohan Taori, Ishaan Gulrajani, and 1 others. 2023. https://crfm.stanford.edu/2023/03/13/alpaca.html Stanford alpaca: An instruction-following llm . GitHub

2023
[34]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, and 49 others. 2023. https://arxiv.org/abs/2307.09288 Llama 2: Open fo...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Xuekang Wang, Shengyu Zhu, and Xueqi Cheng. 2025 a . https://doi.org/10.18653/v1/2025.emnlp-main.648 Speculative safety-aware decoding . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12838--12852, Suzhou, China. Association for Computational Linguistics

work page doi:10.18653/v1/2025.emnlp-main.648 2025
[36]

Xunguang Wang, Zhenlan Ji, Wenxuan Wang, Zongjie Li, Daoyuan Wu, and Shuai Wang. 2025 b . https://arxiv.org/abs/2506.10597 Sok: Evaluating jailbreak guardrails for large language models . Preprint, arXiv:2506.10597

work page arXiv 2025
[37]

Zezhong Wang, Fangkai Yang, Lu Wang, Pu Zhao, Hongru Wang, Liang Chen, Qingwei Lin, and Kam-Fai Wong. 2024. https://aclanthology.org/2024.naacl-long.92/ Self-guard: Empower the llm to safeguard itself . In NAACL 2024

2024
[38]

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023 a . https://arxiv.org/abs/2307.02483 Jailbroken: How does llm safety training fail? In Advances in Neural Information Processing Systems (NeurIPS)

work page internal anchor Pith review arXiv 2023
[39]

Zeming Wei and 1 others. 2023 b . https://arxiv.org/abs/2310.06387 Jailbreak and guard aligned language models with only few in-context demonstrations . arXiv preprint arXiv:2310.06387

work page arXiv 2023
[40]

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, and Radha Poovendran. 2024. https://arxiv.org/abs/2402.08983 Safedecoding: Defending against jailbreak attacks via safety-aware decoding . Preprint, arXiv:2402.08983

work page arXiv 2024
[41]

Zitao Xuan, Xiaofeng Mao, Da Chen, Xin Zhang, Yuhan Dong, and Jun Zhou. 2025. https://doi.org/10.18653/v1/2025.findings-acl.932 S hield H ead: Decoding-time safeguard for large language models . In Findings of the Association for Computational Linguistics: ACL 2025, pages 18129--18143, Vienna, Austria. Association for Computational Linguistics

work page doi:10.18653/v1/2025.findings-acl.932 2025
[42]

Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. 2023. https://arxiv.org/abs/2309.10253 Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts . arXiv preprint arXiv:2309.10253

work page internal anchor Pith review arXiv 2023
[43]

Zhiyuan Yu, Xiaogeng Liu, Shunning Liang, Zach Cameron, Chaowei Xiao, and Ning Zhang. 2024. https://arxiv.org/abs/2403.17336 Don't listen to me: Understanding and exploring jailbreak prompts of large language models . In Proceedings of the 33rd USENIX Security Symposium

work page arXiv 2024
[44]

Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. 2023. https://arxiv.org/abs/2308.06463 Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher . arXiv preprint arXiv:2308.06463

work page arXiv 2023
[45]

Zhang and 1 others

Z. Zhang and 1 others. 2024. https://aclanthology.org/2024.acl-long.481.pdf Defending large language models against jailbreaking attacks through goal prioritization . In ACL 2024

2024
[46]

Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, Baosong Yang, Chen Cheng, Jialong Tang, Jiandong Jiang, Jianwei Zhang, Jijie Xu, Ming Yan, Minmin Sun, Pei Zhang, and 24 others. 2025 a . https://arxiv.org/abs/2510.14276 Qwen3guard technical report . Preprint, arXiv:2510.14276

work page internal anchor Pith review arXiv 2025
[47]

Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, Xingyu Sui, Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, and Ting Liu. 2025 b . https://doi.org/10.18653/v1/2025.emnlp-main.1248 A da S teer: Your aligned LLM is inherently an adaptive jailbreak defender . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processin...

work page doi:10.18653/v1/2025.emnlp-main.1248 2025
[48]

Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. 2023. https://arxiv.org/abs/2310.15140 Autodan: Interpretable gradient-based adversarial attacks on large language models . Preprint, arXiv:2310.15140

work page arXiv 2023
[50]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. 2023 b . https://arxiv.org/abs/2307.15043 Universal and transferable adversarial attacks on aligned language models . arXiv preprint arXiv:2307.15043

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
[52]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...