pith. machine review for the scientific record.

arxiv: 2605.03095 · v1 · submitted 2026-05-04 · 💻 cs.CR

Recognition: unknown

Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 02:13 UTC · model grok-4.3

classification 💻 cs.CR
keywords jailbreak defense · representation-level detection · refusal direction · adaptive attack · GCG · Mahalanobis detection · LLM security · cosine similarity

The pith

Jailbreak detectors relying on single-layer toxic or refusal signals are structurally vulnerable to adaptive attacks that suppress refusal directions, but multi-layer trajectory verification restores high detection accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that JBShield's concept-based detection can be bypassed by an optimized attack, JB-GCG, which simultaneously suppresses refusal directions and regularizes against the defender's own toxic score. Across multiple Llama-3-8B configurations and recalibrations, this attack raises success rates from the claimed zero to an average of 46 percent. Analysis of hidden states reveals that successful jailbreaks occupy a distinct region in refusal-direction fingerprint space that neither ordinary harmful nor harmless prompts reach. The authors therefore introduce Representation Trajectory Verification, which applies Mahalanobis outlier detection to the sequence of refusal-direction projections across layers. RTV achieves an AUROC of 0.99 on the new attack and limits even a fully adaptive white-box attacker to 7 percent success at thirteen times the original compute cost.

Core claim

Jailbreak representations occupy a distinctive region in refusal-direction fingerprint space that neither harmless nor harmful prompts inhabit. JB-GCG exploits this by combining refusal-direction suppression (minimizing the cosine similarity between hidden-state representations and the refusal direction) with a regularization term on JBShield's own toxic-concept score, producing average attack success rates of 46.2 percent (up to 53.4 percent) against the original defense and up to 30.7 percent against its recalibrated variant. Representation Trajectory Verification counters this structural gap by performing Mahalanobis outlier detection on the multi-layer trajectory of refusal-direction projections, attaining an AUROC of 0.99 while holding the best adaptive attack to a 7 percent success rate.
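To make the attack objective concrete, here is a minimal, hypothetical sketch of the per-candidate loss that GCG's token-swap search would minimize under this objective. The probed layer, the weighting coefficient `lam`, and the exact handling of the toxic-concept term (including whether GCG's original target-likelihood term is retained) are assumptions, not details stated in the paper.

```python
import torch
import torch.nn.functional as F

def jb_gcg_loss(hidden_state: torch.Tensor,
                refusal_dir: torch.Tensor,
                toxic_concept: torch.Tensor,
                lam: float = 0.5) -> torch.Tensor:
    """Hypothetical JB-GCG scoring of one prompt+suffix candidate (a sketch,
    not the paper's exact formulation).

    hidden_state:  hidden state at the layer the defense probes, shape (d,)
    refusal_dir:   refusal-direction vector, shape (d,)
    toxic_concept: JBShield's toxic-concept vector, shape (d,)
    lam:           weighting coefficient between the two terms (a free
                   parameter; see the ledger below)
    """
    # Term 1: refusal-direction suppression -- push the hidden state's cosine
    # similarity with the refusal direction down so the model does not refuse.
    refusal_term = F.cosine_similarity(hidden_state, refusal_dir, dim=-1)

    # Term 2: toxic-concept regularization -- steer JBShield's own toxic-concept
    # score away from its detection region (the sign and weighting of this term
    # are assumptions; the abstract only says the score is used as a regularizer).
    toxic_term = F.cosine_similarity(hidden_state, toxic_concept, dim=-1)

    # GCG's greedy token-swap machinery around this scoring is unchanged and omitted.
    return refusal_term + lam * toxic_term
```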

What carries the argument

Representation Trajectory Verification (RTV): Mahalanobis outlier detection applied to the sequence of cosine similarities between hidden-state representations and a fixed refusal direction across multiple transformer layers.
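A minimal sketch of that mechanism, under stated assumptions: which layers and which token position feed the fingerprint are not specified here, and the Ledoit-Wolf shrinkage covariance estimator is an inference from the paper's reference list rather than a confirmed implementation detail.

```python
import numpy as np
from sklearn.covariance import LedoitWolf  # shrinkage estimator; Ledoit & Wolf (2004) is cited below

def refusal_fingerprint(layer_hidden_states: np.ndarray, refusal_dir: np.ndarray) -> np.ndarray:
    """Per-layer cosine similarities between hidden states and a fixed refusal direction.

    layer_hidden_states: (num_layers, d) hidden states at the probed layers,
                         e.g. at the final prompt token (the token position is an assumption).
    refusal_dir:         (d,) refusal-direction vector.
    Returns a fingerprint vector of shape (num_layers,).
    """
    h = layer_hidden_states / np.linalg.norm(layer_hidden_states, axis=-1, keepdims=True)
    r = refusal_dir / np.linalg.norm(refusal_dir)
    return h @ r

class RTVDetector:
    """Mahalanobis outlier detection over multi-layer refusal-direction fingerprints."""

    def fit(self, calibration_fingerprints: np.ndarray) -> "RTVDetector":
        # calibration_fingerprints: (n_prompts, num_layers) fingerprints from
        # benign and ordinary harmful prompts (the in-distribution classes).
        cov = LedoitWolf().fit(calibration_fingerprints)
        self.mean_ = cov.location_
        self.prec_ = cov.precision_
        return self

    def mahalanobis(self, fingerprint: np.ndarray) -> float:
        delta = fingerprint - self.mean_
        return float(np.sqrt(delta @ self.prec_ @ delta))

    def is_jailbreak(self, fingerprint: np.ndarray, threshold: float) -> bool:
        # threshold is calibrated on held-out in-distribution fingerprints
        # (see the free-parameter ledger below).
        return self.mahalanobis(fingerprint) > threshold
```

Fitting the covariance on benign and ordinary harmful fingerprints is what makes optimized jailbreaks, which the paper argues land outside both clusters, register as Mahalanobis outliers.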

If this is right

  • Single-layer concept detectors are insufficient under adaptive threat models that directly optimize against their internal signals.
  • Jailbreak success correlates with a consistent, detectable deviation in the multi-layer refusal-direction trajectory rather than with any single-layer toxicity score.
  • Strong non-adaptive detection performance does not guarantee robustness once the attacker has white-box access to the detector.
  • Multi-layer consistency checks can be implemented with modest overhead while raising the computational cost of adaptive attacks by more than an order of magnitude.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same refusal-direction fingerprinting approach could be applied to other representation-level defenses that currently rely on single-layer probes.
  • If the distinctive trajectory pattern persists across model families, RTV-style detectors could serve as a lightweight, training-free complement to output filtering or safety fine-tuning.
  • Future adaptive attacks may need to optimize against the full covariance structure of refusal trajectories rather than against a single direction or concept score.

Load-bearing premise

The observed separation of jailbreak trajectories in refusal-direction space will remain stable for models and attack variants beyond the five Llama-3-8B configurations tested.

What would settle it

A new jailbreak method or different base model that produces refusal-direction trajectories statistically indistinguishable from those of ordinary harmful prompts, driving RTV's AUROC below 0.85.

Figures

Figures reproduced from arXiv: 2605.03095 by Berk Sunar, Kemal Derya.

Figure 1: Mahalanobis distance distributions on the test set. The JB-GCG distribution (median 7.1) is well-separated from …
Figure 2: Overview of the attack-defense cycle. Left: JBShield …
Figure 3: Refusal-direction cosine similarities across token …
Figure 4: Average fingerprint matrix per category (3 layers …)
Figure 5: Layer-wise refusal-direction ablation on Llama-3 8B, evaluated using Llama-Guard-3 8B; lower safety scores indicate stronger harmfulness after ablation.
Figure 6: Joint-GCG cosine-similarity trajectories under …
Figure 7: PCA and t-SNE projections of the 15-dimensional …
Original abstract

Defending large language models (LLMs) against jailbreak attacks, such as Greedy Coordinate Gradient (GCG), remains a challenge, particularly under adaptive threat models where an attacker directly targets the defense mechanism. JBShield, a recent jailbreak defense with a 0% attack success rate in some settings, detects malicious prompts via two concept signals, a toxic concept and a jailbreak concept. We design JB-GCG, which modifies GCG's objective to combine two terms: refusal-direction suppression via cosine similarity between the refusal direction and hidden-state representations, and toxic-concept regularization via JBShield's own toxic concept score. Across five configurations on Llama-3-8B, JB-GCG achieves an average ASR of 46.2%, reaching up to 53.4% in the strongest setting. We further show that our attack remains effective against JBShield-M, achieving ASR up to 30.7% across evaluated settings. The attack persists across multiple JBShield recalibrations, confirming that the vulnerability is structural rather than calibration-specific. We analyze the cosine-similarity signatures of jailbreak representations and find that they occupy a distinctive region in refusal-direction fingerprint space that neither harmless nor harmful prompts inhabit. We introduce Representation Trajectory Verification (RTV), a new defense based on Mahalanobis outlier detection over multi-layer refusal-direction fingerprints. RTV attains an AUROC of 0.99 against our attack. Finally, we design and evaluate an additional adaptive attack against RTV with full white-box knowledge of the defense; the best attack achieves only 7% ASR at 13x the computational cost. Our results show that strong non-adaptive detection does not imply robustness under adaptive threat models, and that multi-layer representation consistency is a more reliable foundation for jailbreak detection than single-layer concept similarity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces JB-GCG, a modified GCG attack that targets both refusal-direction cosine similarity and JBShield's toxic-concept score, achieving 46.2% average ASR (up to 53.4%) across five Llama-3-8B configurations of JBShield and remaining effective (up to 30.7% ASR) against recalibrated JBShield-M. It demonstrates that jailbreak representations occupy a distinct region in multi-layer refusal-direction fingerprint space, then proposes Representation Trajectory Verification (RTV), a Mahalanobis outlier detector over these fingerprints that attains AUROC 0.99. An adaptive white-box attack on RTV reaches only 7% ASR at 13× cost. The central claim is that single-layer concept detectors are structurally vulnerable under adaptive threat models while multi-layer trajectory consistency provides a more robust foundation.

Significance. If the empirical separation and AUROC results hold, the work supplies concrete evidence that adaptive attacks can expose structural gaps in representation-level defenses and that multi-layer Mahalanobis checks restore high detection performance with limited residual ASR. The reproducible attack success rates, fingerprint visualizations, and cost measurements on held-out prompts constitute falsifiable, parameter-light findings that directly inform the design of future jailbreak detectors.

minor comments (2)
  1. §4.2 and Figure 3: the precise definition of the refusal-direction vector (how it is extracted from the model and whether it is layer-specific) should be stated explicitly so that the fingerprint-space separation can be reproduced without ambiguity.
  2. Table 2: report the standard deviation of ASR across the five configurations rather than only the mean and maximum; this would strengthen the claim that the vulnerability is consistent rather than configuration-dependent.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript, accurate summary of our contributions, and recommendation to accept. We are pleased that the empirical results on adaptive attack success rates, fingerprint separation, and RTV detection performance were viewed as falsifiable and directly informative for future detector design.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper contains no derivations, equations, or fitted parameters that reduce to their own inputs. All central claims (ASR values, AUROC 0.99, 7% residual ASR) are direct empirical measurements on held-out prompts under explicitly described attack and detection procedures. No self-citation is load-bearing for the reported results, and no ansatz or uniqueness theorem is invoked. The results therefore rest on external benchmarks rather than on the paper's own constructs.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The work is almost entirely empirical. The only free parameters are the implicit thresholds and covariance estimates inside the Mahalanobis detector and the weighting coefficients inside the JB-GCG objective; these are chosen on validation data rather than derived. No new axioms or invented entities are introduced.

free parameters (2)
  • JB-GCG weighting coefficients
    Balance between refusal-direction cosine term and toxic-concept regularization term; fitted to achieve high ASR on JBShield.
  • RTV Mahalanobis covariance and threshold
    Estimated from benign and harmful prompt fingerprints; determines outlier detection boundary.
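
A short illustration of how the detector's decision threshold might be fixed on validation fingerprints; the quantile rule and the 1 percent target false-positive rate below are illustrative choices, not the paper's procedure.

```python
import numpy as np

def calibrate_threshold(mahalanobis_distances: np.ndarray,
                        target_fpr: float = 0.01) -> float:
    """Choose the RTV threshold as a high quantile of Mahalanobis distances
    computed on held-out in-distribution (benign + ordinary harmful)
    fingerprints, trading false positives against detection rate."""
    return float(np.quantile(mahalanobis_distances, 1.0 - target_fpr))
```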

pith-pipeline@v0.9.0 · 5638 in / 1243 out tokens · 32062 ms · 2026-05-07T02:13:54.863692+00:00 · methodology


Reference graph

Works this paper leans on

61 extracted references · 19 canonical work pages · 9 internal anchors

  1. [1] Andrew Adiletta, Kathryn Adiletta, Kemal Derya, and Berk Sunar. 2025. Super Suffixes: Bypassing Text Generation Alignment and Guard Models Simultaneously. arXiv preprint arXiv:2512.11783 (2025).
  2. [2] AI@Meta. 2024. Llama 3 Model Card. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
  3. [3] Gabriel Alon and Michael Kamfonas. 2023. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132 (2023).
  4. [4] Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. 2024. Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. arXiv preprint arXiv:2404.02151 (2024).
  5. [5] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems 37 (2024), 136037–136083.
  6. [6] Luke Bailey, Alex Serrano, Abhay Sheshadri, Mikhail Seleznyov, Jordan Taylor, Erik Jenner, Jacob Hilton, Stephen Casper, Carlos Guestrin, and Scott Emmons. 2026. Obfuscated Activations Bypass LLM Latent-Space Defenses. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=ktGmDGoWnB
  7. [7] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
  8. [8] Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Pang Wei Koh, Daphne Ippolito, Florian Tramèr, and Ludwig Schmidt.
  9. [9] Are aligned neural networks adversarially aligned? In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=OQQoD8Vc3B
  10. [10] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2025. Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 23–42.
  11. [11] Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep Reinforcement Learning from Human Preferences. In Advances in Neural Information Processing Systems, Vol. 30.
  12. [12] Dylan Feng, Pragya Srivastava, Anca Dragan, and Cassidy Laidlaw. 2026. Benchmarking Anomaly Detection for Large Language Model Alignment. https://openreview.net/forum?id=Tr74UMzBdw
  13. [13] Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. 2024. WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. In Advances in Neural Information Processing Systems, Vol. 37.
  14. [14] Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam Dziedzic, Rishabh Krishnan, and Dawn Song. 2020. Pretrained transformers improve out-of-distribution robustness. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2744–2751.
  15. [15] Xiaomeng Hu, Pin-Yu Chen, and Tsung-Yi Ho. 2024. Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=vI1WqFn15v
  16. [16] Peichun Hua, Hao Li, Shanghao Shi, Zhiyuan Yu, and Ning Zhang. 2025. Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring. arXiv preprint arXiv:2512.12069 (2025).
  17. [17] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. 2023. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. arXiv preprint arXiv:2312.06674 (2023).
  18. [18] Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. 2023. Baseline Defenses for Adversarial Attacks Against Aligned Language Models. arXiv preprint arXiv:2309.00614 (2023).
  19. [19] Yilei Jiang, Xinyan Gao, Tianshuo Peng, Yingshui Tan, Xiaoyong Zhu, Bo Zheng, and Xiangyu Yue. 2025. HiddenDetect: Detecting jailbreak attacks against large vision-language models via monitoring hidden states. arXiv preprint arXiv:2502.14744 3, 5 (2025).
  20. [20] Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, and Himabindu Lakkaraju. 2024. Certifying LLM Safety against Adversarial Prompting. In Conference on Language Modeling (COLM).
  21. [21] Olivier Ledoit and Michael Wolf. 2004. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88, 2 (2004), 365–411. doi:10.1016/S0047-259X(03)00096-4
  22. [22] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. 2018. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in Neural Information Processing Systems 31 (2018).
  23. [23] Wenyun Li, Zheng Zhang, Dongmei Jiang, and Xiangyuan Lan. 2026. Bolster Hallucination Detection via Prompt-Guided Data Augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 37618–37626.
  24. [24] Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, and Hongyang Zhang. 2024. RAIN: Your Language Models Can Align Themselves without Finetuning. International Conference on Learning Representations (ICLR) (2024).
  25. [25] Xiao Lin, Philip Li, Zhichen Zeng, Tingwei Li, Tianxin Wei, Xuying Ning, Gaotang Li, Yuzhong Chen, and Hanghang Tong. 2026. ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification. arXiv preprint arXiv:2601.03600 (2026).
  26. [26] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2024. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. In International Conference on Learning Representations (ICLR).
  27. [27] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. 2024. HarmBench: a standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning (Vienna, Austria) (ICML'24). JML...
  28. [28] Yi Nian, Shenzhe Zhu, Yuehan Qin, Li Li, Ziyi Wang, Chaowei Xiao, and Yue Zhao.
  29. [29] JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model. In Second Conference on Language Modeling. https://openreview.net/forum?id=8Pxdzsqvx9
  30. [30] OpenAI. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023). doi:10.48550/arXiv.2303.08774
  31. [31] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training Language Models to Follow Instructions with Human...
  32. [32] Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. 2023. Steering Llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681 (2023).
  33. [33] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2024. Gorilla: Large Language Model Connected with Massive APIs. In Advances in Neural Information Processing Systems, Vol. 37.
  34. [34] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red Teaming Language Models with Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). 3419–3448.
  35. [35] Mansi Phute, Alec Helbling, Matthew Hull, ShengYun Peng, Sebastian Szyller, Cory Cornelius, and Duen Horng Chau. 2023. LLM Self Defense: By self examination, LLMs know they are being tricked. arXiv preprint arXiv:2308.07308 (2023).
  36. [36] Alexander Podolskiy, Dmitry Lipin, Andrey Bout, Ekaterina Artemova, and Irina Piontkovskaya. 2021. Revisiting Mahalanobis distance for transformer-based out-of-domain detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 13675–13682.
  37. [37] Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas. 2025. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. Transactions on Machine Learning Research (2025). arXiv preprint arXiv:2310.03684, 2023.
  38. [38] Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matt...
  39. [39] Multitask Prompted Training Enables Zero-Shot Task Generalization. In International Conference on Learning Representations (ICLR).
  40. [40] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. In Advances in Neural Information Processing Systems, Vol. 36.
  41. [41] Leo Schwinn and Simon Geisler. 2024. Revisiting the robust alignment of circuit breakers. arXiv preprint arXiv:2407.15902 (2024).
  42. [42] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang.
  43. [43] "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS).
  44. [44] Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. 2020. Learning to Summarize from Human Feedback. In Advances in Neural Information Processing Systems, Vol. 33.
  45. [45] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca
  46. [46] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971 (2023). doi:10.48550/arXiv.2302.13971
  47. [47] Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. 2020. On adaptive attacks to adversarial example defenses. Advances in Neural Information Processing Systems 33 (2020), 1633–1645.
  48. [48] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How Does LLM Safety Training Fail? In Advances in Neural Information Processing Systems, Vol. 36.
  49. [49] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022. Finetuned Language Models Are Zero-Shot Learners. In International Conference on Learning Representations (ICLR).
  50. [50] Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, et al. 2024. Sorry-Bench: Systematically evaluating large language model safety refusal. arXiv preprint arXiv:2406.14598 (2024).
  51. [51] Yueqi Xie, Minghong Fang, Renjie Pi, and Neil Gong. 2024. GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Volume 1: Long Papers. 507–518.
  52. [52] Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. 2023. Defending ChatGPT Against Jailbreak Attack via Self-Reminders. Nature Machine Intelligence 5, 12 (2023), 1486–1496.
  53. [53] Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, and Radha Poovendran. 2024. SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Volume 1: Long Papers. 5587–5605.
  54. [54] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR).
  55. [55] Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, Olivia Sturman, and Oscar Wahltinez. 2024. ShieldGemma: Generative AI Content Moderation Based on Gemma. arXiv preprint arXiv:2407.21772 (2024).
  56. [56] Shenyi Zhang, Yuchen Zhai, Keyan Guo, Hongxin Hu, Shengnan Guo, Zheng Fang, Lingchen Zhao, Chao Shen, Cong Wang, and Qian Wang. 2025. JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation. In 34th USENIX Security Symposium (USENIX Security 25). 8215–8234.
  57. [57] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al.
  58. [58] Representation Engineering: A Top-Down Approach to AI Transparency. arXiv preprint arXiv:2310.01405 (2023).
  59. [59] Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, J. Zico Kolter, Matt Fredrikson, and Dan Hendrycks.
  60. [60] Improving Alignment and Robustness with Circuit Breakers. In Advances in Neural Information Processing Systems.
  61. [61] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023).