Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
Pith reviewed 2026-05-07 02:13 UTC · model grok-4.3
The pith
Jailbreak detectors relying on single-layer toxic or refusal signals are structurally vulnerable to adaptive attacks that suppress refusal directions, but multi-layer trajectory verification restores high detection accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Jailbreak representations occupy a distinctive region in refusal-direction fingerprint space that neither harmless nor harmful prompts inhabit. JB-GCG exploits the single-layer nature of JBShield's signals by jointly minimizing cosine similarity to the refusal direction and regularizing JBShield's toxic-concept score, producing average attack success rates of 46.2% (up to 53.4%) against the original defense and up to 30.7% against its recalibrated variant. Representation Trajectory Verification closes the structural gap by performing Mahalanobis outlier detection on the multi-layer trajectory of refusal-direction projections, attaining an AUROC of 0.99 while holding the best adaptive attack to a 7% success rate.
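The combined attack objective can be sketched as a single scalar loss minimized over candidate suffix tokens. The weighting `lam` and the sign convention on the toxic-concept term are assumptions for illustration; the paper's exact loss is not reproduced here.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def jb_gcg_objective(hidden: np.ndarray,
                     refusal_dir: np.ndarray,
                     toxic_score: float,
                     lam: float = 1.0) -> float:
    """Hypothetical JB-GCG-style loss: a refusal-direction suppression
    term plus a toxic-concept regularization term weighted by `lam`.
    Lower values correspond to suffixes that align weakly with the
    refusal direction while keeping the detector's toxic score small."""
    return cosine_sim(hidden, refusal_dir) + lam * toxic_score
```

In a GCG-style loop, this scalar would be differentiated through the model's embeddings to rank candidate token substitutions at each position.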
What carries the argument
Representation Trajectory Verification (RTV): Mahalanobis outlier detection applied to the sequence of cosine similarities between hidden-state representations and a fixed refusal direction across multiple transformer layers.
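A minimal sketch of the RTV scoring step, assuming each prompt's trajectory is a vector of per-layer cosine similarities against the refusal direction. The diagonal regularizer `eps` is an illustrative addition for numerical stability, not taken from the paper.

```python
import numpy as np

def fit_trajectory_model(benign_trajs: np.ndarray, eps: float = 1e-6):
    """Fit the mean and inverse covariance of per-layer refusal-direction
    cosine similarities over a reference set of benign prompts.
    benign_trajs has shape (n_prompts, n_layers)."""
    mu = benign_trajs.mean(axis=0)
    cov = np.cov(benign_trajs, rowvar=False)
    cov += eps * np.eye(benign_trajs.shape[1])  # guard against singularity
    return mu, np.linalg.inv(cov)

def mahalanobis(traj: np.ndarray, mu: np.ndarray,
                cov_inv: np.ndarray) -> float:
    """Mahalanobis distance of one multi-layer trajectory from the
    reference distribution; large distances flag outlier prompts."""
    d = traj - mu
    return float(np.sqrt(d @ cov_inv @ d))
```

A prompt is flagged when its distance exceeds a threshold calibrated on held-out benign traffic.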
If this is right
- Single-layer concept detectors are insufficient under adaptive threat models that directly optimize against their internal signals.
- Jailbreak success correlates with a consistent, detectable deviation in the multi-layer refusal-direction trajectory rather than with any single-layer toxicity score.
- Strong non-adaptive detection performance does not guarantee robustness once the attacker has white-box access to the detector.
- Multi-layer consistency checks can be implemented with modest overhead while raising the computational cost of adaptive attacks by more than an order of magnitude.
Where Pith is reading between the lines
- The same refusal-direction fingerprinting approach could be applied to other representation-level defenses that currently rely on single-layer probes.
- If the distinctive trajectory pattern persists across model families, RTV-style detectors could serve as a lightweight, training-free complement to output filtering or safety fine-tuning.
- Future adaptive attacks may need to optimize against the full covariance structure of refusal trajectories rather than against a single direction or concept score.
Load-bearing premise
The observed separation of jailbreak trajectories in refusal-direction space will remain stable for models and attack variants beyond the five Llama-3-8B configurations tested.
What would settle it
A new jailbreak method or different base model that produces refusal-direction trajectories statistically indistinguishable from those of ordinary harmful prompts, driving RTV's AUROC below 0.85.
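AUROC here is the standard rank statistic: the probability that a randomly drawn jailbreak trajectory scores above a randomly drawn benign one, with ties counted as one half. A self-contained computation makes the 0.85 falsification threshold concrete.

```python
def auroc(pos_scores, neg_scores):
    """Area under the ROC curve via the Mann-Whitney rank formulation:
    fraction of (positive, negative) pairs where the positive scores
    higher, counting ties as 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```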
Original abstract
Defending large language models (LLMs) against jailbreak attacks, such as Greedy Coordinate Gradient (GCG), remains a challenge, particularly under adaptive threat models where an attacker directly targets the defense mechanism. JBShield, a recent jailbreak defense with a 0% attack success rate in some settings, detects malicious prompts via two concept signals, a toxic concept and a jailbreak concept. We design JB-GCG, which modifies GCG's objective to combine two terms: refusal-direction suppression via cosine similarity between the refusal direction and hidden-state representations, and toxic-concept regularization via JBShield's own toxic concept score. Across five configurations on Llama-3-8B, JB-GCG achieves an average ASR of 46.2%, reaching up to 53.4% in the strongest setting. We further show that our attack remains effective against JBShield-M, achieving ASR up to 30.7% across evaluated settings. The attack persists across multiple JBShield recalibrations, confirming that the vulnerability is structural rather than calibration-specific. We analyze the cosine-similarity signatures of jailbreak representations and find that they occupy a distinctive region in refusal-direction fingerprint space that neither harmless nor harmful prompts inhabit. We introduce Representation Trajectory Verification (RTV), a new defense based on Mahalanobis outlier detection over multi-layer refusal-direction fingerprints. RTV attains an AUROC of 0.99 against our attack. Finally, we design and evaluate an additional adaptive attack against RTV with full white-box knowledge of the defense; the best attack achieves only 7% ASR at 13x the computational cost. Our results show that strong non-adaptive detection does not imply robustness under adaptive threat models, and that multi-layer representation consistency is a more reliable foundation for jailbreak detection than single-layer concept similarity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces JB-GCG, a modified GCG attack that targets both refusal-direction cosine similarity and JBShield's toxic-concept score, achieving 46.2% average ASR (up to 53.4%) across five Llama-3-8B configurations of JBShield and remaining effective (up to 30.7% ASR) against recalibrated JBShield-M. It demonstrates that jailbreak representations occupy a distinct region in multi-layer refusal-direction fingerprint space, then proposes Representation Trajectory Verification (RTV), a Mahalanobis outlier detector over these fingerprints that attains AUROC 0.99. An adaptive white-box attack on RTV reaches only 7% ASR at 13× cost. The central claim is that single-layer concept detectors are structurally vulnerable under adaptive threat models while multi-layer trajectory consistency provides a more robust foundation.
Significance. If the empirical separation and AUROC results hold, the work supplies concrete evidence that adaptive attacks can expose structural gaps in representation-level defenses and that multi-layer Mahalanobis checks restore high detection performance with limited residual ASR. The reproducible attack success rates, fingerprint visualizations, and cost measurements on held-out prompts constitute falsifiable, parameter-light findings that directly inform the design of future jailbreak detectors.
Minor comments (2)
- §4.2 and Figure 3: the precise definition of the refusal-direction vector (how it is extracted from the model and whether it is layer-specific) should be stated explicitly so that the fingerprint-space separation can be reproduced without ambiguity.
- Table 2: report the standard deviation of ASR across the five configurations rather than only the mean and maximum; this would strengthen the claim that the vulnerability is consistent rather than configuration-dependent.
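The first comment can be made concrete with one common construction from the refusal-direction literature: the normalized difference of mean activations between harmful and harmless prompts at a chosen layer. This is a plausible reading offered as a sketch, not necessarily the paper's definition.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray,
                      harmless_acts: np.ndarray) -> np.ndarray:
    """One common extraction of a refusal direction: the unit-norm
    difference between mean hidden-state activations on harmful and
    harmless prompts at a fixed layer.
    Both inputs have shape (n_prompts, d_model)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)
```

Whether the paper extracts one direction per layer or a single shared direction is exactly the ambiguity the comment asks the authors to resolve.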
Simulated Author's Rebuttal
We thank the referee for their careful reading of the manuscript, accurate summary of our contributions, and recommendation to accept. We are pleased that the empirical results on adaptive attack success rates, fingerprint separation, and RTV detection performance were viewed as falsifiable and directly informative for future detector design.
Circularity Check
No significant circularity
Full rationale
The paper contains no derivations, equations, or fitted parameters that reduce to their own inputs. All central claims (ASR values, AUROC 0.99, 7% residual ASR) are direct empirical measurements on held-out prompts under explicitly described attack and detection procedures. No self-citation is load-bearing for the reported results, and no ansatz or uniqueness theorem is invoked. The claims therefore rest on external benchmarks rather than on internally circular reasoning.
Axiom & Free-Parameter Ledger
Free parameters (2)
- JB-GCG weighting coefficients
- RTV Mahalanobis covariance and threshold
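One standard way to fix the threshold free parameter is quantile calibration on benign traffic: choose the cutoff so that a target fraction of held-out benign prompts is flagged. The target false-positive rate `fpr` below is an assumption, not a value from the paper.

```python
import numpy as np

def calibrate_threshold(benign_distances: np.ndarray,
                        fpr: float = 0.01) -> float:
    """Set the detection threshold at the (1 - fpr) quantile of
    Mahalanobis distances measured on held-out benign prompts, so
    roughly `fpr` of benign traffic gets flagged."""
    return float(np.quantile(benign_distances, 1.0 - fpr))
```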