pith. machine review for the scientific record.

arxiv: 2605.03095 · v1 · submitted 2026-05-04 · 💻 cs.CR

Recognition: unknown

Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 02:13 UTC · model grok-4.3

classification 💻 cs.CR
keywords jailbreak defense · representation-level detection · refusal direction · adaptive attack · GCG · Mahalanobis detection · LLM security · cosine similarity

The pith

Jailbreak detectors relying on single-layer toxic or refusal signals are structurally vulnerable to adaptive attacks that suppress refusal directions, but multi-layer trajectory verification restores high detection accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that JBShield's concept-based detection can be bypassed by an optimized attack, JB-GCG, which simultaneously suppresses refusal directions and regularizes against the defender's own toxic score. Across multiple Llama-3-8B configurations and recalibrations, this attack raises success rates from the claimed zero to an average of 46 percent. Analysis of hidden states reveals that successful jailbreaks occupy a distinct region in refusal-direction fingerprint space that neither ordinary harmful nor harmless prompts reach. The authors therefore introduce Representation Trajectory Verification, which applies Mahalanobis outlier detection to the sequence of refusal-direction projections across layers. RTV achieves an AUROC of 0.99 on the new attack and limits even a fully adaptive white-box attacker to 7 percent success at thirteen times the original compute cost.

Core claim

Jailbreak representations occupy a distinctive region in refusal-direction fingerprint space that neither harmless nor harmful prompts inhabit. JB-GCG exploits this by combining refusal-direction suppression (minimizing the cosine similarity between hidden-state representations and the refusal direction) with a regularization term on JBShield's own toxic-concept score, producing average attack success rates of 46.2 percent (up to 53.4 percent) against the original defense and up to 30.7 percent against its recalibrated variant. Representation Trajectory Verification counters this structural gap by performing Mahalanobis outlier detection on the multi-layer trajectory of refusal-direction projections, attaining an AUROC of 0.99 while holding the best adaptive attack to a 7 percent success rate.
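To make the attack objective concrete, here is a minimal, hypothetical sketch of the per-candidate loss that GCG's token-swap search would minimize under this objective. The probed layer, the weighting coefficient `lam`, and the exact handling of the toxic-concept term (including whether GCG's original target-likelihood term is retained) are assumptions, not details stated in the paper.

```python
import torch
import torch.nn.functional as F

def jb_gcg_loss(hidden_state: torch.Tensor,
                refusal_dir: torch.Tensor,
                toxic_concept: torch.Tensor,
                lam: float = 0.5) -> torch.Tensor:
    """Hypothetical JB-GCG scoring of one prompt+suffix candidate (a sketch,
    not the paper's exact formulation).

    hidden_state:  hidden state at the layer the defense probes, shape (d,)
    refusal_dir:   refusal-direction vector, shape (d,)
    toxic_concept: JBShield's toxic-concept vector, shape (d,)
    lam:           weighting coefficient between the two terms (a free
                   parameter; see the ledger below)
    """
    # Term 1: refusal-direction suppression -- push the hidden state's cosine
    # similarity with the refusal direction down so the model does not refuse.
    refusal_term = F.cosine_similarity(hidden_state, refusal_dir, dim=-1)

    # Term 2: toxic-concept regularization -- steer JBShield's own toxic-concept
    # score away from its detection region (the sign and weighting of this term
    # are assumptions; the abstract only says the score is used as a regularizer).
    toxic_term = F.cosine_similarity(hidden_state, toxic_concept, dim=-1)

    # GCG's greedy token-swap machinery around this scoring is unchanged and omitted.
    return refusal_term + lam * toxic_term
```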

What carries the argument

Representation Trajectory Verification (RTV): Mahalanobis outlier detection applied to the sequence of cosine similarities between hidden-state representations and a fixed refusal direction across multiple transformer layers.
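A minimal sketch of that mechanism, under stated assumptions: which layers and which token position feed the fingerprint are not specified here, and the Ledoit-Wolf shrinkage covariance estimator is an inference from the paper's reference list rather than a confirmed implementation detail.

```python
import numpy as np
from sklearn.covariance import LedoitWolf  # shrinkage estimator; Ledoit & Wolf (2004) is cited below

def refusal_fingerprint(layer_hidden_states: np.ndarray, refusal_dir: np.ndarray) -> np.ndarray:
    """Per-layer cosine similarities between hidden states and a fixed refusal direction.

    layer_hidden_states: (num_layers, d) hidden states at the probed layers,
                         e.g. at the final prompt token (the token position is an assumption).
    refusal_dir:         (d,) refusal-direction vector.
    Returns a fingerprint vector of shape (num_layers,).
    """
    h = layer_hidden_states / np.linalg.norm(layer_hidden_states, axis=-1, keepdims=True)
    r = refusal_dir / np.linalg.norm(refusal_dir)
    return h @ r

class RTVDetector:
    """Mahalanobis outlier detection over multi-layer refusal-direction fingerprints."""

    def fit(self, calibration_fingerprints: np.ndarray) -> "RTVDetector":
        # calibration_fingerprints: (n_prompts, num_layers) fingerprints from
        # benign and ordinary harmful prompts (the in-distribution classes).
        cov = LedoitWolf().fit(calibration_fingerprints)
        self.mean_ = cov.location_
        self.prec_ = cov.precision_
        return self

    def mahalanobis(self, fingerprint: np.ndarray) -> float:
        delta = fingerprint - self.mean_
        return float(np.sqrt(delta @ self.prec_ @ delta))

    def is_jailbreak(self, fingerprint: np.ndarray, threshold: float) -> bool:
        # threshold is calibrated on held-out in-distribution fingerprints
        # (see the free-parameter ledger below).
        return self.mahalanobis(fingerprint) > threshold
```

Fitting the covariance on benign and ordinary harmful fingerprints is what makes optimized jailbreaks, which the paper argues land outside both clusters, register as Mahalanobis outliers.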

If this is right

  • Single-layer concept detectors are insufficient under adaptive threat models that directly optimize against their internal signals.
  • Jailbreak success correlates with a consistent, detectable deviation in the multi-layer refusal-direction trajectory rather than with any single-layer toxicity score.
  • Strong non-adaptive detection performance does not guarantee robustness once the attacker has white-box access to the detector.
  • Multi-layer consistency checks can be implemented with modest overhead while raising the computational cost of adaptive attacks by more than an order of magnitude.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same refusal-direction fingerprinting approach could be applied to other representation-level defenses that currently rely on single-layer probes.
  • If the distinctive trajectory pattern persists across model families, RTV-style detectors could serve as a lightweight, training-free complement to output filtering or safety fine-tuning.
  • Future adaptive attacks may need to optimize against the full covariance structure of refusal trajectories rather than against a single direction or concept score.

Load-bearing premise

The observed separation of jailbreak trajectories in refusal-direction space will remain stable for models and attack variants beyond the five Llama-3-8B configurations tested.

What would settle it

A new jailbreak method or different base model that produces refusal-direction trajectories statistically indistinguishable from those of ordinary harmful prompts, driving RTV's AUROC below 0.85.

Figures

Figures reproduced from arXiv: 2605.03095 by Berk Sunar, Kemal Derya.

Figure 1: Mahalanobis distance distributions on the test set. The JB-GCG distribution (median 7.1) is well-separated from …
Figure 2: Overview of the attack-defense cycle. Left: JBShield …
Figure 3: Refusal-direction cosine similarities across token …
Figure 4: Average fingerprint matrix per category (3 layers …)
Figure 5: Layer-wise refusal-direction ablation on Llama-3 8B, evaluated using Llama-Guard-3 8B; lower safety scores indicate stronger harmfulness after ablation.
Figure 6: Joint-GCG cosine-similarity trajectories under …
Figure 7: PCA and t-SNE projections of the 15-dimensional …
Original abstract

Defending large language models (LLMs) against jailbreak attacks, such as Greedy Coordinate Gradient (GCG), remains a challenge, particularly under adaptive threat models where an attacker directly targets the defense mechanism. JBShield, a recent jailbreak defense with a 0% attack success rate in some settings, detects malicious prompts via two concept signals, a toxic concept and a jailbreak concept. We design JB-GCG, which modifies GCG's objective to combine two terms: refusal-direction suppression via cosine similarity between the refusal direction and hidden-state representations, and toxic-concept regularization via JBShield's own toxic concept score. Across five configurations on Llama-3-8B, JB-GCG achieves an average ASR of 46.2%, reaching up to 53.4% in the strongest setting. We further show that our attack remains effective against JBShield-M, achieving ASR up to 30.7% across evaluated settings. The attack persists across multiple JBShield recalibrations, confirming that the vulnerability is structural rather than calibration-specific. We analyze the cosine-similarity signatures of jailbreak representations and find that they occupy a distinctive region in refusal-direction fingerprint space that neither harmless nor harmful prompts inhabit. We introduce Representation Trajectory Verification (RTV), a new defense based on Mahalanobis outlier detection over multi-layer refusal-direction fingerprints. RTV attains an AUROC of 0.99 against our attack. Finally, we design and evaluate an additional adaptive attack against RTV with full white-box knowledge of the defense; the best attack achieves only 7% ASR at 13x the computational cost. Our results show that strong non-adaptive detection does not imply robustness under adaptive threat models, and that multi-layer representation consistency is a more reliable foundation for jailbreak detection than single-layer concept similarity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces JB-GCG, a modified GCG attack that targets both refusal-direction cosine similarity and JBShield's toxic-concept score, achieving 46.2% average ASR (up to 53.4%) across five Llama-3-8B configurations of JBShield and remaining effective (up to 30.7% ASR) against recalibrated JBShield-M. It demonstrates that jailbreak representations occupy a distinct region in multi-layer refusal-direction fingerprint space, then proposes Representation Trajectory Verification (RTV), a Mahalanobis outlier detector over these fingerprints that attains AUROC 0.99. An adaptive white-box attack on RTV reaches only 7% ASR at 13× cost. The central claim is that single-layer concept detectors are structurally vulnerable under adaptive threat models while multi-layer trajectory consistency provides a more robust foundation.

Significance. If the empirical separation and AUROC results hold, the work supplies concrete evidence that adaptive attacks can expose structural gaps in representation-level defenses and that multi-layer Mahalanobis checks restore high detection performance with limited residual ASR. The reproducible attack success rates, fingerprint visualizations, and cost measurements on held-out prompts constitute falsifiable, parameter-light findings that directly inform the design of future jailbreak detectors.

minor comments (2)
  1. §4.2 and Figure 3: the precise definition of the refusal-direction vector (how it is extracted from the model and whether it is layer-specific) should be stated explicitly so that the fingerprint-space separation can be reproduced without ambiguity.
  2. Table 2: report the standard deviation of ASR across the five configurations rather than only the mean and maximum; this would strengthen the claim that the vulnerability is consistent rather than configuration-dependent.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript, accurate summary of our contributions, and recommendation to accept. We are pleased that the empirical results on adaptive attack success rates, fingerprint separation, and RTV detection performance were viewed as falsifiable and directly informative for future detector design.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper contains no derivations, equations, or fitted parameters that reduce to their own inputs. All central claims (ASR values, AUROC 0.99, 7% residual ASR) are direct empirical measurements on held-out prompts under explicitly described attack and detection procedures. No self-citation is load-bearing for the reported results, and no ansatz or uniqueness theorem is invoked. The results therefore rest on external benchmarks rather than on the paper's own constructs.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The work is almost entirely empirical. The only free parameters are the implicit thresholds and covariance estimates inside the Mahalanobis detector and the weighting coefficients inside the JB-GCG objective; these are chosen on validation data rather than derived. No new axioms or invented entities are introduced.

free parameters (2)
  • JB-GCG weighting coefficients
    Balance between refusal-direction cosine term and toxic-concept regularization term; fitted to achieve high ASR on JBShield.
  • RTV Mahalanobis covariance and threshold
    Estimated from benign and harmful prompt fingerprints; determines outlier detection boundary.
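
A short illustration of how the detector's decision threshold might be fixed on validation fingerprints; the quantile rule and the 1 percent target false-positive rate below are illustrative choices, not the paper's procedure.

```python
import numpy as np

def calibrate_threshold(mahalanobis_distances: np.ndarray,
                        target_fpr: float = 0.01) -> float:
    """Choose the RTV threshold as a high quantile of Mahalanobis distances
    computed on held-out in-distribution (benign + ordinary harmful)
    fingerprints, trading false positives against detection rate."""
    return float(np.quantile(mahalanobis_distances, 1.0 - target_fpr))
```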

pith-pipeline@v0.9.0 · 5638 in / 1243 out tokens · 32062 ms · 2026-05-07T02:13:54.863692+00:00 · methodology


Reference graph

Works this paper leans on

61 extracted references · 19 canonical work pages · 9 internal anchors

  1. [1] Andrew Adiletta, Kathryn Adiletta, Kemal Derya, and Berk Sunar. 2025. Super Suffixes: Bypassing Text Generation Alignment and Guard Models Simultaneously. arXiv preprint arXiv:2512.11783 (2025).
  2. [2] AI@Meta. 2024. Llama 3 Model Card. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
  3. [3] Gabriel Alon and Michael Kamfonas. 2023. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132 (2023).
  4. [4] Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. 2024. Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. arXiv preprint arXiv:2404.02151 (2024).
  5. [5] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems 37 (2024), 136037–136083.
  6. [6] Luke Bailey, Alex Serrano, Abhay Sheshadri, Mikhail Seleznyov, Jordan Taylor, Erik Jenner, Jacob Hilton, Stephen Casper, Carlos Guestrin, and Scott Emmons. 2026. Obfuscated Activations Bypass LLM Latent-Space Defenses. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=ktGmDGoWnB
  7. [7] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
  8. [8] Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Pang Wei Koh, Daphne Ippolito, Florian Tramèr, and Ludwig Schmidt.
  9. [9] Are aligned neural networks adversarially aligned? In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=OQQoD8Vc3B
  10. [10] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2025. Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 23–42.
  11. [11] Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep Reinforcement Learning from Human Preferences. In Advances in Neural Information Processing Systems, Vol. 30.
  12. [12] Dylan Feng, Pragya Srivastava, Anca Dragan, and Cassidy Laidlaw. 2026. Benchmarking Anomaly Detection for Large Language Model Alignment. https://openreview.net/forum?id=Tr74UMzBdw
  13. [13] Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. 2024. WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. In Advances in Neural Information Processing Systems, Vol. 37.
  14. [14] Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song. 2020. Pretrained transformers improve out-of-distribution robustness. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2744–2751.
  15. [15] Xiaomeng Hu, Pin-Yu Chen, and Tsung-Yi Ho. 2024. Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=vI1WqFn15v
  16. [16] Peichun Hua, Hao Li, Shanghao Shi, Zhiyuan Yu, and Ning Zhang. 2025. Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring. arXiv preprint arXiv:2512.12069 (2025).
  17. [17] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. 2023. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. arXiv preprint arXiv:2312.06674 (2023).
  18. [18] Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. 2023. Baseline Defenses for Adversarial Attacks Against Aligned Language Models. arXiv preprint arXiv:2309.00614 (2023).
  19. [19] Yilei Jiang, Xinyan Gao, Tianshuo Peng, Yingshui Tan, Xiaoyong Zhu, Bo Zheng, and Xiangyu Yue. 2025. HiddenDetect: Detecting jailbreak attacks against large vision-language models via monitoring hidden states. arXiv preprint arXiv:2502.14744 3, 5 (2025).
  20. [20] Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, and Himabindu Lakkaraju. 2024. Certifying LLM Safety against Adversarial Prompting. In Conference on Language Modeling (COLM).
  21. [21] Olivier Ledoit and Michael Wolf. 2004. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88, 2 (2004), 365–411. doi:10.1016/S0047-259X(03)00096-4
  22. [22] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. 2018. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in Neural Information Processing Systems 31 (2018).
  23. [23] Wenyun Li, Zheng Zhang, Dongmei Jiang, and Xiangyuan Lan. 2026. Bolster Hallucination Detection via Prompt-Guided Data Augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 37618–37626.
  24. [24] Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, and Hongyang Zhang. 2024. RAIN: Your Language Models Can Align Themselves without Finetuning. International Conference on Learning Representations (ICLR) (2024).
  25. [25] Xiao Lin, Philip Li, Zhichen Zeng, Tingwei Li, Tianxin Wei, Xuying Ning, Gaotang Li, Yuzhong Chen, and Hanghang Tong. 2026. ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification. arXiv preprint arXiv:2601.03600 (2026).
  26. [26] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2024. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. In International Conference on Learning Representations (ICLR).
  27. [27] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. 2024. HarmBench: a standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning (Vienna, Austria) (ICML'24). JML...
  28. [28] Yi Nian, Shenzhe Zhu, Yuehan Qin, Li Li, Ziyi Wang, Chaowei Xiao, and Yue Zhao.
  29. [29] JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model. In Second Conference on Language Modeling. https://openreview.net/forum?id=8Pxdzsqvx9
  30. [30] OpenAI. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023). doi:10.48550/arXiv.2303.08774
  31. [31] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training Language Models to Follow Instructions with Human...
  32. [32] Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. 2023. Steering Llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681 (2023).
  33. [33] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2024. Gorilla: Large Language Model Connected with Massive APIs. In Advances in Neural Information Processing Systems, Vol. 37.
  34. [34] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red Teaming Language Models with Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). 3419–3448.
  35. [35] Mansi Phute, Alec Helbling, Matthew Hull, ShengYun Peng, Sebastian Szyller, Cory Cornelius, and Duen Horng Chau. 2023. LLM Self Defense: By self examination, LLMs know they are being tricked. arXiv preprint arXiv:2308.07308 (2023).
  36. [36] Alexander Podolskiy, Dmitry Lipin, Andrey Bout, Ekaterina Artemova, and Irina Piontkovskaya. 2021. Revisiting Mahalanobis distance for transformer-based out-of-domain detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 13675–13682.
  37. [37] Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas. 2025. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. Transactions on Machine Learning Research (2025). arXiv preprint arXiv:2310.03684, 2023.
  38. [38] Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matt...
  39. [39] Multitask Prompted Training Enables Zero-Shot Task Generalization. In International Conference on Learning Representations (ICLR).
  40. [40] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. In Advances in Neural Information Processing Systems, Vol. 36.
  41. [41] Leo Schwinn and Simon Geisler. 2024. Revisiting the robust alignment of circuit breakers. arXiv preprint arXiv:2407.15902 (2024).
  42. [42] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang.
  43. [43] "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS).
  44. [44] Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. 2020. Learning to Summarize from Human Feedback. In Advances in Neural Information Processing Systems, Vol. 33.
  45. [45] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca
  46. [46] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971 (2023). doi:10.48550/arXiv.2302.13971
  47. [47] Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. 2020. On adaptive attacks to adversarial example defenses. Advances in Neural Information Processing Systems 33 (2020), 1633–1645.
  48. [48] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How Does LLM Safety Training Fail? In Advances in Neural Information Processing Systems, Vol. 36.
  49. [49] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022. Finetuned Language Models Are Zero-Shot Learners. In International Conference on Learning Representations (ICLR).
  50. [50] Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, et al. 2024. Sorry-Bench: Systematically evaluating large language model safety refusal. arXiv preprint arXiv:2406.14598 (2024).
  51. [51] Yueqi Xie, Minghong Fang, Renjie Pi, and Neil Gong. 2024. GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Volume 1: Long Papers. 507–518.
  52. [52] Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. 2023. Defending ChatGPT Against Jailbreak Attack via Self-Reminders. Nature Machine Intelligence 5, 12 (2023), 1486–1496.
  53. [53] Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, and Radha Poovendran. 2024. SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Volume 1: Long Papers. 5587–5605.
  54. [54] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR).
  55. [55] Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, Olivia Sturman, and Oscar Wahltinez. 2024. ShieldGemma: Generative AI Content Moderation Based on Gemma. arXiv preprint arXiv:2407.21772 (2024).
  56. [56] Shenyi Zhang, Yuchen Zhai, Keyan Guo, Hongxin Hu, Shengnan Guo, Zheng Fang, Lingchen Zhao, Chao Shen, Cong Wang, and Qian Wang. 2025. JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation. In 34th USENIX Security Symposium (USENIX Security 25). 8215–8234.
  57. [57] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al.
  58. [58] Representation Engineering: A Top-Down Approach to AI Transparency. arXiv preprint arXiv:2310.01405 (2023).
  59. [59] Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, J. Zico Kolter, Matt Fredrikson, and Dan Hendrycks.
  60. [60] Improving Alignment and Robustness with Circuit Breakers. In Advances in Neural Information Processing Systems.
  61. [61] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023).