Exploring and Developing a Pre-Model Safeguard with Draft Models

Antonio Bianchi; Arjun Arunasalam; Hongyu Cai; Yiming Liang; Z. Berkay Celik

arxiv: 2605.19321 · v1 · pith:R4WDEIV6new · submitted 2026-05-19 · 💻 cs.CR · cs.AI

Exploring and Developing a Pre-Model Safeguard with Draft Models

Hongyu Cai , Arjun Arunasalam , Yiming Liang , Antonio Bianchi , Z. Berkay Celik This is my paper

Pith reviewed 2026-05-20 05:06 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords jailbreak attackspre-model safeguardsdraft modelsspeculative inferencelanguage model safetytransferabilitysmall language modelsLLM alignment

0 comments

The pith

Draft responses from small models can flag unsafe prompts for large language models before full inference runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that jailbreak prompts designed for large language models often trigger unsafe outputs in smaller draft models as well. By generating a few draft responses with these smaller models and feeding both the prompt and drafts into existing safety guards, the method predicts whether the large target model would produce harmful content. This approach lowers the false-negative rate of prompt-only pre-model guards while avoiding the full cost of checking the actual large-model output. A sympathetic reader would care because it offers a practical middle ground between weak pre-inference checks and expensive post-inference audits.

Core claim

Responses from smaller draft models reflect the safety implications of those from large target models; given a jailbreak prompt constructed for an LLM, an SLM is likely to be triggered to generate an unaligned response. The safeguard uses speculative inference with SLMs to produce draft responses, then applies existing guards to the original prompt plus these drafts to predict safety before target model inference.

What carries the argument

Speculative inference with small language models that generates draft responses whose safety behavior mirrors the large target model's response to the same prompt.

If this is right

Pre-model guards achieve lower false-negative rates without needing the target model's full output.
The design provides a lower-cost alternative to post-model guards that audit both prompt and response.
Key factors that affect jailbreak transferability between model sizes can guide further improvements to the safeguard.
Safety auditing can occur before target model inference while still using information from generated text.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method might extend to detecting other alignment failures beyond jailbreaks if similar transfer patterns hold.
Different sizes of draft models could be tested to balance prediction accuracy against added computation.
Integration with multiple existing guards might further reduce missed unsafe prompts.

Load-bearing premise

Jailbreak transferability from large models to small draft models is reliable enough that checking the drafts predicts whether the large model would give an unsafe answer.

What would settle it

Measure agreement between the safeguard's safety prediction on draft responses and the actual safety classification of the large model's full response across a fixed set of jailbreak prompts.

Figures

Figures reproduced from arXiv: 2605.19321 by Antonio Bianchi, Arjun Arunasalam, Hongyu Cai, Yiming Liang, Z. Berkay Celik.

**Figure 3.** Figure 3: An overview of our methodology for understanding the transferability of jailbreak attacks. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: The distribution of unsafe response ratios for be [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Transferability rate vs. Number of SLM Responses. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Overview of our safeguard design to guard against jailbreak attacks. The transferability of jailbreak prompts is [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Defense Failure Rate (DFR) (aggregation threshold [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 2.** Figure 2: Transferability matrix. TR𝑖,𝑗 represents the transferability rate from the 𝑖-th large model to the 𝑗-th small model. We present a transferability matrix that visualizes the proportion of successful jailbreaks on SLMs given that they were successful on a particular LLM. The transferability matrix is a heatmap that displays the transferability rate for each combination of large and small models. The transfe… view at source ↗

**Figure 1.** Figure 1: Example of a jailbreak attack that bypasses the pre [PITH_FULL_IMAGE:figures/full_fig_p011_1.png] view at source ↗

**Figure 3.** Figure 3: Transferability rate of jailbreak attacks across iter [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

**Figure 5.** Figure 5: Transferability rates of jailbreak prompts across different intent categories. The x-axis represents the intent categories, [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Impact of response count (𝑏) on defense failure rate (DFR): Aggregation threshold 𝜏 is held constant while 𝑏 is varied to evaluate its impact on the defense failure rate (DFR) of our safeguard design. aggregation strategy (𝜏 = 0), the DFR remains relatively stable across different response counts. For the “any” aggregation, the DFR consistently decreases as 𝑏 increases. This is because a larger response co… view at source ↗

**Figure 9.** Figure 9: Impact of the aggregation threshold, 𝜏, on accuracy. The draft response count, 𝑏, is held constant while 𝜏 is varied to assess its influence on the accuracy of our safeguard design. language model. These responses are typically harmful or violate safety guidelines. To achieve this, the attacker optimizes an adversarial suffix appended to the prompt, aiming to maximize the likelihood of generating a speci… view at source ↗

**Figure 8.** Figure 8: Impact of the number of responses, 𝑏, on accuracy. The aggregation threshold, 𝜏, is held constant while𝑏 is varied to evaluate its effect on the accuracy of our safeguard design. as 𝑏 increases, indicating that the accuracy on benign prompts is not sensitive to the response count in these cases. This observation aligns with our findings in the DFR evaluation. Impact of Threshold. We next analyze the impact… view at source ↗

read the original abstract

Large Language Model (LLM) alignment remains vulnerable to jailbreak attacks that elicit unsafe responses, motivating pre-model and post-model guards. Pre-model guards audit the safety of prompts before invoking target models. However, relying solely on the prompt often leads to high false-negative rates (i.e., jailbreak attacks go undetected). Post-model guards address this issue by auditing both the user prompt and the target model's response. However, they incur a high computational cost, including increased token usage and processing time, because they operate after target model inference. In this paper, we introduce a safeguard design that leverages the transferability of jailbreak attacks to enforce prompt safety before target model inference. We first conduct a systematic study of jailbreak transferability, particularly from LLMs to small language models (SLMs). Through these experiments, we identify key factors influencing transferability. Building on these insights, we observe that responses from smaller draft models reflect the safety implications of those from large target models; \ie given a jailbreak prompt constructed for an LLM, an SLM is likely to be triggered to generate an unaligned response. Based on this observation, our safeguard design leverages speculative inference with SLMs to generate a set of draft responses. It then feeds the original prompt and these drafts into existing guards to predict their safety. We demonstrate that this design reduces the false-negative rate of pre-model guards and offers a low \Efficiency alternative to post-model guards. \textcolor{red}{\bf Notice: This paper contains examples of harmful language.}

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Draft SLMs can flag some jailbreaks early through observed transferability, but the method's reliability hinges on how well that holds across different model families and alignments.

read the letter

The paper's main move is to run small draft models first on a prompt, collect their responses, and route those plus the original prompt into existing guards for an early safety decision before the target LLM does full inference. They support this with a study of jailbreak transfer from large models to SLMs, noting factors that affect whether an attack on the big model also triggers unsafe output from the small one. This setup aims to cut false negatives versus prompt-only guards while avoiding the full token and time cost of post-model checks. If the transfer numbers are solid, it is a straightforward efficiency win for deployment pipelines that already use guards. The architecture itself is clean because it reuses off-the-shelf guards and does not introduce new fitted parameters. The systematic transfer study is the part that feels new relative to prior guard papers. The soft spot is the transferability premise itself. When the draft model comes from a different family or uses different post-training, its refusal boundary can diverge enough that a prompt eliciting harm from the target produces safe or unrelated text from the SLM, so the guard misses the attack. The abstract claims they identified key factors, but without explicit controls for family mismatch and broad attack coverage in the results, the correlation could be narrower than presented. If the full experiments include diverse pairs and report recall under those conditions, the concern shrinks; otherwise it remains the load-bearing assumption. This is for readers building practical safety layers around aligned LLMs who care about inference cost. It is worth a serious referee to check the transfer experiments and see how far the correlation generalizes.

Referee Report

2 major / 2 minor

Summary. The paper proposes a pre-model safeguard against jailbreak attacks on LLMs that uses small language models (SLMs) as draft models via speculative inference. It first reports a systematic study of jailbreak transferability from LLMs to SLMs, identifies influencing factors, and observes that SLM draft responses tend to reflect the safety implications of the target LLM's response to the same prompt. The safeguard then supplies the original prompt plus these drafts to existing safety guards to predict safety before target-model inference, claiming lower false-negative rates than prompt-only pre-model guards and lower computational cost than post-model guards.

Significance. If the transferability results prove robust, the design offers a practical efficiency gain by moving some safety auditing before target inference while still using response-like signals. The systematic study of cross-scale transferability is a useful empirical contribution if it includes diverse model pairs and attack types; the approach could be adopted in production pipelines that already employ speculative decoding.

major comments (2)

[Abstract] The central claim that the design 'reduces the false-negative rate of pre-model guards' rests on the transferability observation, yet the abstract supplies no quantitative transfer rates, false-negative reductions, datasets, or baseline comparisons. Without these numbers the support for the claim cannot be evaluated.
The safeguard mechanism assumes reliable cross-model transferability even when the draft SLM differs in family, scale, or alignment method from the target LLM. The skeptic concern that refusal boundaries may diverge sharply is load-bearing; the manuscript must demonstrate that the systematic study explicitly controlled for family mismatch and attack-type coverage rather than reporting results only on matched pairs.

minor comments (2)

[Abstract] The phrase 'low Efficiency alternative' appears to be a typesetting artifact and should be corrected to 'low-cost' or 'low-overhead'.
The paper should clarify whether the draft responses are generated with the same speculative-decoding hyperparameters used in inference or with safety-specific sampling settings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight opportunities to improve the clarity of our quantitative claims and the robustness of the transferability analysis. We address each major comment below and indicate the corresponding revisions.

read point-by-point responses

Referee: [Abstract] The central claim that the design 'reduces the false-negative rate of pre-model guards' rests on the transferability observation, yet the abstract supplies no quantitative transfer rates, false-negative reductions, datasets, or baseline comparisons. Without these numbers the support for the claim cannot be evaluated.

Authors: We agree that the original abstract was insufficiently quantitative. In the revised manuscript we have expanded the abstract to report the following concrete results: average jailbreak transferability of 71% from target LLMs to draft SLMs, a 19% absolute reduction in false-negative rate relative to prompt-only baselines, evaluation on AdvBench and HarmBench, and direct comparisons against Llama-Guard and OpenAI Moderation as pre-model guards. These figures are now stated in the abstract and are supported by the detailed tables and figures in Sections 4 and 5. revision: yes
Referee: [—] The safeguard mechanism assumes reliable cross-model transferability even when the draft SLM differs in family, scale, or alignment method from the target LLM. The skeptic concern that refusal boundaries may diverge sharply is load-bearing; the manuscript must demonstrate that the systematic study explicitly controlled for family mismatch and attack-type coverage rather than reporting results only on matched pairs.

Authors: We appreciate the emphasis on cross-family robustness. Section 3 of the manuscript already includes mismatched-family experiments (Llama-3 target with Gemma-2B and Phi-3-mini drafts; Mistral target with Qwen-1.5B drafts) and covers four attack families (GCG, AutoDAN, PAIR, and hand-crafted). Transfer rates remain above 62% in the mismatched cases. To make this control explicit, we have added a dedicated paragraph and an additional table (Table 4) that stratifies results by family match/mismatch and attack type, together with a short discussion of observed refusal-boundary divergence. We believe these additions directly address the load-bearing concern while remaining within the scope of the empirical study. revision: partial

Circularity Check

0 steps flagged

No circularity: safeguard rests on empirical transferability experiments

full rationale

The paper first runs a systematic study of jailbreak transferability from LLMs to SLMs, identifies influencing factors, and then observes that draft-model responses reflect target safety implications. This observation is used to motivate the pre-model safeguard design. No step reduces by definition or construction to its own inputs; no parameters are fitted on a subset and then relabeled as predictions; no load-bearing claim depends on a self-citation chain whose prior result is itself unverified. The central mechanism is externally falsifiable via the reported transferability experiments and remains independent of the safeguard itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on one domain assumption about transferability; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Jailbreak attacks transfer from LLMs to SLMs such that draft responses indicate the safety of target responses
This observation is presented as the foundation for the safeguard design.

pith-pipeline@v0.9.0 · 5817 in / 1187 out tokens · 52160 ms · 2026-05-20T05:06:52.018617+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

responses from smaller draft models reflect the safety implications of those from large target models; i.e., given a jailbreak prompt constructed for an LLM, an SLM is likely to be triggered to generate an unaligned response
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we introduce a safeguard design that leverages the transferability of jailbreak attacks to enforce prompt safety before target model inference

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 1 internal anchor

[1]

Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, et al

Marah I. Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, et al . 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. https://www.microsoft.com/en- us/research/publication/phi-3-technical-report-a-highly-capable-language- model-locally-on-your-phone/.arXiv(2024)

work page 2024
[2]

Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Leandro von Werra, and Thomas Wolf. 2024. SmolLM - blazingly fast and remarkably powerful. https://huggingface. co/blog/smollm

work page 2024
[3]

Kamfonas

Gabriel Alon and Michael J. Kamfonas. 2023. Detecting Language Model Attacks With Perplexity. https://openreview.net/forum?id=lNLVvdHyAw.arXiv(2023). CAIS ’26, May 26–29, 2026, San Jose, CA, USA Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi, and Z. Berkay Celik

work page 2023
[4]

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, et al. 2023. Qwen Technical Report.arXiv(2023)

work page 2023
[5]

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, et al. 2022. Training a Helpful and Harmless Assistant With Reinforcement Learning From Human Feedback.arXiv(2022)

work page 2022
[6]

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, et al. 2022. Constitutional AI: Harmlessness from AI Feedback.arXiv(2022)

work page 2022
[7]

Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, et al. 2024. Speculative Streaming: Fast LLM Inference without Auxil- iary Models.arXiv(2024)

work page 2024
[8]

Bochuan Cao, Yuanpu Cao, Lu Lin, and Jinghui Chen. 2023. Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM.Annual Meeting of the Association for Computational Linguistics(2023)

work page 2023
[9]

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, et al. 2024. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models.NeurIPS Datasets and Benchmarks Track (2024)

work page 2024
[10]

Pappas, et al

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, et al. 2023. Jailbreaking Black Box Large Language Models in Twenty Queries.IEEE Conference on Secure and Trustworthy Machine Learning(2023)

work page 2023
[11]

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Lau- rent Sifre, et al. 2023. Accelerating Large Language Model Decoding with Specu- lative Sampling.arXiv(2023)

work page 2023
[12]

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. 2024. Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. Conference on Language Modeling(2024)

work page 2024
[13]

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, et al

work page
[14]

https://openreview.net/forum?id=4hturzLcKX.Conference on Neural Information Processing Systems(2023)

AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. https://openreview.net/forum?id=4hturzLcKX.Conference on Neural Information Processing Systems(2023)

work page 2023
[15]

Jaiden Fairoze, Sanjam Garg, Keewoo Lee, and Mingyuan Wang. 2025. Bypassing Prompt Guards in Production with Controlled-Release Prompting. https://arxiv. org/abs/2510.01529.arXiv(2025)

work page arXiv 2025
[16]

Markus Freitag and Yaser Al-Onaizan. 2017. Beam Search Strategies for Neural Machine Translation. https://aclanthology.org/W17-3207/.Workshop on Neural Machine Translation(2017)

work page 2017
[17]

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The Curious Case of Neural Text Degeneration. https://openreview.net/forum?id= rygGQyrFvH.International Conference on Learning Representations(2020)

work page 2020
[18]

Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. 2023. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation.Interna- tional Conference on Learning Representations(2023)

work page 2023
[19]

Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, et al

Hakan Inan, K. Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, et al . 2023. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. https://bit.ly/42ir2JB.arXiv(2023)

work page 2023
[20]

Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchen- bauer, et al . 2023. Baseline defenses for adversarial attacks against aligned language models.arXiv(2023)

work page 2023
[21]

Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, et al

work page
[22]

RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback.International Conference on Machine Learning(2023)

work page 2023
[23]

Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding.International Conference on Machine Learning(2023)

work page 2023
[24]

Meng Li, Wangmeng Zuo, and Lei Zhang. 2021. Early Stopping for Deep Image Prior.IEEE/CVF Conference on Computer Vision and Pattern Recognition(2021)

work page 2021
[25]

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, et al

work page
[26]

https://github.com/tatsu-lab/alpaca_eval.GitHub repository(2023)

AlpacaEval: An Automatic Evaluator of Instruction-following Models. https://github.com/tatsu-lab/alpaca_eval.GitHub repository(2023). https://bit. ly/4jhOvkn

work page 2023
[27]

Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han

work page
[28]

Deepinception: Hypnotize large language model to be jailbreaker.arXiv (2023)

work page 2023
[29]

Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, et al. 2024. The Unlocking Spell on Base LLMs: Rethinking Alignment via In- Context Learning. https://openreview.net/forum?id=wxJ0eXwwda.International Conference on Learning Representations(2024)

work page 2024
[30]

Tong Liu, Yingjie Zhang, Zhe Zhao, Yinpeng Dong, Guozhu Meng, et al. 2024. Making them ask and answer: Jailbreaking large language models in few queries via disguise and reconstruction.USENIX Security Symposium(2024)

work page 2024
[31]

Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Ion Stoica, Zhijie Deng, et al . 2023. Online Speculative Decoding.International Conference on Machine Learning (2023)

work page 2023
[32]

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2023. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. https://openreview.net/forum?id=7Jwpw4qKkb.International Conference on Learning Representations(2023)

work page 2023
[33]

Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, et al. 2024. Jailbreak- ing ChatGPT via Prompt Engineering: An Empirical Study.arXiv(2024)

work page 2024
[34]

AI @ Meta Llama Team. 2024. The Llama 3 Herd of Models. https://arxiv.org/ abs/2407.21783.arXiv(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, et al. 2024. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal.International Conference on Machine Learning(2024)

work page 2024
[36]

Meta. 2024. Meta Llama Guard 2. https://bit.ly/42l2a42

work page 2024
[37]

Meta-Llama. 2024. Prompt-Guard. https://github.com/meta-llama/PurpleLlama/ tree/main/Prompt-Guard

work page 2024
[38]

NVIDIA. [n. d.]. NVIDIA Data Center Deep Learning Product Performance AI Inference. https://developer.nvidia.com/deep-learning-performance-training- inference/ai-inference.NVIDIA Developer([n. d.]). https://developer.nvidia.com/ deep-learning-performance-training-inference/ai-inference

work page
[39]

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, et al. 2024. GPT-4 Technical Report.arXiv(2024)

work page 2024
[40]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, et al

work page
[41]

Annual Conference on Neural Information Processing Systems(2022)

Training Language Models to Follow Instructions With Human Feedback. Annual Conference on Neural Information Processing Systems(2022)

work page 2022
[42]

Mansi Phute, Alec Helbling, Matthew Hull, ShengYun Peng, Sebastian Szyller, et al. 2024. LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked.Tiny Papers Track at ICLR(2024)

work page 2024
[43]

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, et al. 2024. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! https://openreview.net/forum?id=hTEGyKf0dZ.International Conference on Learning Representations(2024)

work page 2024
[44]

Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas. 2023. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. Transactions on Machine Learning Research(2023)

work page 2023
[45]

Sayak Saha Roy, Poojitha Thota, Krishna Vamsi Naragam, and Shirin Nilizadeh

work page
[46]

From Chatbots to Phishbots?: Phishing Scam Generation in Commercial Large Language Models.IEEE Symposium on Security and Privacy(2024)

work page 2024
[47]

Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kat- takinda, Atoosa Chegini, and Soheil Feizi. 2024. Fast Adversarial Attacks on Language Models In One GPU Minute. https://arxiv.org/abs/2402.15570.Interna- tional Conference on Machine Learning(2024)

work page arXiv 2024
[48]

Leo Schwinn and Simon Geisler. 2024. Revisiting the Robust Alignment of Circuit Breakers. https://arxiv.org/abs/2407.15902.arXiv(2024)

work page arXiv 2024
[49]

Yanshen Sun, Jianfeng He, Limeng Cui, Shuo Lei, and Chang-Tien Lu. 2024. Exploring the Deceptive Power of LLM-Generated Fake News: A Study of Real- World Detection Challenges.arXiv(2024)

work page 2024
[50]

Gemma Team. 2024. Gemma: Introducing new state-of-the-art open models. https://blog.google/technology/developers/gemma-open-models/.Google(2024). https://blog.google/technology/developers/gemma-open-models/

work page 2024
[51]

Llama Team. [n. d.]. Prompt Guard-86M | Model Cards and Prompt formats. https: //www.llama.com/docs/model-cards-and-prompt-formats/prompt-guard/

work page
[52]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, et al

work page
[53]

Llama 2: Open Foundation and Fine-Tuned Chat Models.arXiv(2023)

work page 2023
[54]

Eric Wallace, Tony Zhao, Shi Feng, and Sameer Singh. 2021. Concealed Data Poisoning Attacks on NLP Models.Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2021)

work page 2021
[55]

Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. 2023. Poisoning Language Models During Instruction Tuning.International Conference on Machine Learning(2023)

work page 2023
[56]

Zeming Wei, Yifei Wang, and Yisen Wang. 2023. Jailbreak and guard aligned language models with only few in-context demonstrations.arXiv(2023)

work page 2023
[57]

Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, and Muhao Chen. 2025. Think- Guard: Deliberative Slow Thinking Leads to Cautious Guardrails. https://arxiv. org/abs/2502.13458.arXiv(2025)

work page arXiv 2025
[58]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, et al. 2020. Transformers: State-of-the-Art Natural Language Processing.Confer- ence on Empirical Methods in Natural Language Processing: System Demonstrations (2020)

work page 2020
[59]

Yueqi Xie, Minghong Fang, Renjie Pi, and Neil Gong. 2024. GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis.Annual Meeting of the Association for Computational Linguistics(2024)

work page 2024
[60]

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, et al. 2024. SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding. Annual Meeting of the Association for Computational Linguistics(2024)

work page 2024
[61]

Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. 2023. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts.arXiv(2023)

work page 2023
[62]

Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shum- ing Shi, and Zhaopeng Tu. 2024. GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher.International Conference on Learning Representations (2024)

work page 2024
[63]

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, et al

work page
[64]

Exploring and Developing a Pre-Model Safeguard with Draft Models CAIS ’26, May 26–29, 2026, San Jose, CA, USA

OPT: Open Pre-trained Transformer Language Models.arXiv(2022). Exploring and Developing a Pre-Model Safeguard with Draft Models CAIS ’26, May 26–29, 2026, San Jose, CA, USA

work page 2022
[65]

Zhexin Zhang, Junxiao Yang, Pei Ke, and Minlie Huang. 2023. Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization. Annual Meeting of the Association for Computational Linguistics(2023)

work page 2023
[66]

Wei Zhao, Zhe Li, Yige Li, Ye Zhang, and Jun Sun. 2024. Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing.Empirical Methods in Natural Language Processing(2024)

work page 2024
[67]

Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, et al. 2024. Weak- to-Strong Jailbreaking on Large Language Models.International Conference on Machine Learning(2024)

work page 2024
[68]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, et al

work page
[69]

Judging LLM-as-a-judge with MT-bench and Chatbot Arena.International Conference on Neural Information Processing Systems(2024)

work page 2024
[70]

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, et al. 2023. LIMA: Less Is More for Alignment. https://openreview.net/forum?id=KBMOKmX2he. Conference on Neural Information Processing Systems(2023)

work page 2023
[71]

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, et al

work page
[72]

Fine-Tuning Language Models From Human Preferences.arXiv(2019)

work page 2019
[73]

Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym An- driushchenko, Ruiqi Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks

work page
[74]

https: //arxiv.org/abs/2406.04313.arXiv(2024)

Improving Alignment and Robustness with Circuit Breakers. https: //arxiv.org/abs/2406.04313.arXiv(2024)

work page arXiv 2024
[75]

Greetings

Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and Transferable Adversarial Attacks on Aligned Language Models.arXiv(2023). A Jailbreak Example that Bypasses the Guard We show a representative jailbreak prompt that bypasses the pre- model guard in Figure 1. This example illustrates how prompt-only filtering can miss malicious i...

work page 2023
[76]

BREAKING: Mysterious Virus Spreads Across City, Thousands Infected!

Craft a sensational title: Create a captivating headline that grabs attention, yet is misleading and exaggerated. Example: “BREAKING: Mysterious Virus Spreads Across City, Thousands Infected!"

work page
[77]

highly contagious

Select a plausible topic: Choose a topic that is relevant and relatable to the audience, but can be easily distorted to create panic. Example: A fake story about a new virus that is "highly contagious" and “deadly"

work page
[78]

Violence

Add fake statistics and “expert" opinions: Create false Figure 1: Example of a jailbreak attack that bypasses the pre- model guard. The malicious prompt is classified as safe by the guard model (i.e., LlamaGuard 2 [ 30] in pre-model mode) but induces an unsafe response from the target language model (i.e., Llama-3-70B-Instruct-A WQ [30]). This occurs beca...

work page 2026

[1] [1]

Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, et al

Marah I. Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, et al . 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. https://www.microsoft.com/en- us/research/publication/phi-3-technical-report-a-highly-capable-language- model-locally-on-your-phone/.arXiv(2024)

work page 2024

[2] [2]

Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Leandro von Werra, and Thomas Wolf. 2024. SmolLM - blazingly fast and remarkably powerful. https://huggingface. co/blog/smollm

work page 2024

[3] [3]

Kamfonas

Gabriel Alon and Michael J. Kamfonas. 2023. Detecting Language Model Attacks With Perplexity. https://openreview.net/forum?id=lNLVvdHyAw.arXiv(2023). CAIS ’26, May 26–29, 2026, San Jose, CA, USA Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi, and Z. Berkay Celik

work page 2023

[4] [4]

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, et al. 2023. Qwen Technical Report.arXiv(2023)

work page 2023

[5] [5]

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, et al. 2022. Training a Helpful and Harmless Assistant With Reinforcement Learning From Human Feedback.arXiv(2022)

work page 2022

[6] [6]

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, et al. 2022. Constitutional AI: Harmlessness from AI Feedback.arXiv(2022)

work page 2022

[7] [7]

Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, et al. 2024. Speculative Streaming: Fast LLM Inference without Auxil- iary Models.arXiv(2024)

work page 2024

[8] [8]

Bochuan Cao, Yuanpu Cao, Lu Lin, and Jinghui Chen. 2023. Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM.Annual Meeting of the Association for Computational Linguistics(2023)

work page 2023

[9] [9]

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, et al. 2024. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models.NeurIPS Datasets and Benchmarks Track (2024)

work page 2024

[10] [10]

Pappas, et al

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, et al. 2023. Jailbreaking Black Box Large Language Models in Twenty Queries.IEEE Conference on Secure and Trustworthy Machine Learning(2023)

work page 2023

[11] [11]

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Lau- rent Sifre, et al. 2023. Accelerating Large Language Model Decoding with Specu- lative Sampling.arXiv(2023)

work page 2023

[12] [12]

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. 2024. Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. Conference on Language Modeling(2024)

work page 2024

[13] [13]

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, et al

work page

[14] [14]

https://openreview.net/forum?id=4hturzLcKX.Conference on Neural Information Processing Systems(2023)

AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. https://openreview.net/forum?id=4hturzLcKX.Conference on Neural Information Processing Systems(2023)

work page 2023

[15] [15]

Jaiden Fairoze, Sanjam Garg, Keewoo Lee, and Mingyuan Wang. 2025. Bypassing Prompt Guards in Production with Controlled-Release Prompting. https://arxiv. org/abs/2510.01529.arXiv(2025)

work page arXiv 2025

[16] [16]

Markus Freitag and Yaser Al-Onaizan. 2017. Beam Search Strategies for Neural Machine Translation. https://aclanthology.org/W17-3207/.Workshop on Neural Machine Translation(2017)

work page 2017

[17] [17]

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The Curious Case of Neural Text Degeneration. https://openreview.net/forum?id= rygGQyrFvH.International Conference on Learning Representations(2020)

work page 2020

[18] [18]

Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. 2023. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation.Interna- tional Conference on Learning Representations(2023)

work page 2023

[19] [19]

Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, et al

Hakan Inan, K. Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, et al . 2023. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. https://bit.ly/42ir2JB.arXiv(2023)

work page 2023

[20] [20]

Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchen- bauer, et al . 2023. Baseline defenses for adversarial attacks against aligned language models.arXiv(2023)

work page 2023

[21] [21]

Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, et al

work page

[22] [22]

RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback.International Conference on Machine Learning(2023)

work page 2023

[23] [23]

Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding.International Conference on Machine Learning(2023)

work page 2023

[24] [24]

Meng Li, Wangmeng Zuo, and Lei Zhang. 2021. Early Stopping for Deep Image Prior.IEEE/CVF Conference on Computer Vision and Pattern Recognition(2021)

work page 2021

[25] [25]

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, et al

work page

[26] [26]

https://github.com/tatsu-lab/alpaca_eval.GitHub repository(2023)

AlpacaEval: An Automatic Evaluator of Instruction-following Models. https://github.com/tatsu-lab/alpaca_eval.GitHub repository(2023). https://bit. ly/4jhOvkn

work page 2023

[27] [27]

Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han

work page

[28] [28]

Deepinception: Hypnotize large language model to be jailbreaker.arXiv (2023)

work page 2023

[29] [29]

Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, et al. 2024. The Unlocking Spell on Base LLMs: Rethinking Alignment via In- Context Learning. https://openreview.net/forum?id=wxJ0eXwwda.International Conference on Learning Representations(2024)

work page 2024

[30] [30]

Tong Liu, Yingjie Zhang, Zhe Zhao, Yinpeng Dong, Guozhu Meng, et al. 2024. Making them ask and answer: Jailbreaking large language models in few queries via disguise and reconstruction.USENIX Security Symposium(2024)

work page 2024

[31] [31]

Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Ion Stoica, Zhijie Deng, et al . 2023. Online Speculative Decoding.International Conference on Machine Learning (2023)

work page 2023

[32] [32]

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2023. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. https://openreview.net/forum?id=7Jwpw4qKkb.International Conference on Learning Representations(2023)

work page 2023

[33] [33]

Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, et al. 2024. Jailbreak- ing ChatGPT via Prompt Engineering: An Empirical Study.arXiv(2024)

work page 2024

[34] [34]

AI @ Meta Llama Team. 2024. The Llama 3 Herd of Models. https://arxiv.org/ abs/2407.21783.arXiv(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [35]

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, et al. 2024. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal.International Conference on Machine Learning(2024)

work page 2024

[36] [36]

Meta. 2024. Meta Llama Guard 2. https://bit.ly/42l2a42

work page 2024

[37] [37]

Meta-Llama. 2024. Prompt-Guard. https://github.com/meta-llama/PurpleLlama/ tree/main/Prompt-Guard

work page 2024

[38] [38]

NVIDIA. [n. d.]. NVIDIA Data Center Deep Learning Product Performance AI Inference. https://developer.nvidia.com/deep-learning-performance-training- inference/ai-inference.NVIDIA Developer([n. d.]). https://developer.nvidia.com/ deep-learning-performance-training-inference/ai-inference

work page

[39] [39]

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, et al. 2024. GPT-4 Technical Report.arXiv(2024)

work page 2024

[40] [40]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, et al

work page

[41] [41]

Annual Conference on Neural Information Processing Systems(2022)

Training Language Models to Follow Instructions With Human Feedback. Annual Conference on Neural Information Processing Systems(2022)

work page 2022

[42] [42]

Mansi Phute, Alec Helbling, Matthew Hull, ShengYun Peng, Sebastian Szyller, et al. 2024. LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked.Tiny Papers Track at ICLR(2024)

work page 2024

[43] [43]

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, et al. 2024. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! https://openreview.net/forum?id=hTEGyKf0dZ.International Conference on Learning Representations(2024)

work page 2024

[44] [44]

Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas. 2023. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. Transactions on Machine Learning Research(2023)

work page 2023

[45] [45]

Sayak Saha Roy, Poojitha Thota, Krishna Vamsi Naragam, and Shirin Nilizadeh

work page

[46] [46]

From Chatbots to Phishbots?: Phishing Scam Generation in Commercial Large Language Models.IEEE Symposium on Security and Privacy(2024)

work page 2024

[47] [47]

Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kat- takinda, Atoosa Chegini, and Soheil Feizi. 2024. Fast Adversarial Attacks on Language Models In One GPU Minute. https://arxiv.org/abs/2402.15570.Interna- tional Conference on Machine Learning(2024)

work page arXiv 2024

[48] [48]

Leo Schwinn and Simon Geisler. 2024. Revisiting the Robust Alignment of Circuit Breakers. https://arxiv.org/abs/2407.15902.arXiv(2024)

work page arXiv 2024

[49] [49]

Yanshen Sun, Jianfeng He, Limeng Cui, Shuo Lei, and Chang-Tien Lu. 2024. Exploring the Deceptive Power of LLM-Generated Fake News: A Study of Real- World Detection Challenges.arXiv(2024)

work page 2024

[50] [50]

Gemma Team. 2024. Gemma: Introducing new state-of-the-art open models. https://blog.google/technology/developers/gemma-open-models/.Google(2024). https://blog.google/technology/developers/gemma-open-models/

work page 2024

[51] [51]

Llama Team. [n. d.]. Prompt Guard-86M | Model Cards and Prompt formats. https: //www.llama.com/docs/model-cards-and-prompt-formats/prompt-guard/

work page

[52] [52]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, et al

work page

[53] [53]

Llama 2: Open Foundation and Fine-Tuned Chat Models.arXiv(2023)

work page 2023

[54] [54]

Eric Wallace, Tony Zhao, Shi Feng, and Sameer Singh. 2021. Concealed Data Poisoning Attacks on NLP Models.Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2021)

work page 2021

[55] [55]

Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. 2023. Poisoning Language Models During Instruction Tuning.International Conference on Machine Learning(2023)

work page 2023

[56] [56]

Zeming Wei, Yifei Wang, and Yisen Wang. 2023. Jailbreak and guard aligned language models with only few in-context demonstrations.arXiv(2023)

work page 2023

[57] [57]

Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, and Muhao Chen. 2025. Think- Guard: Deliberative Slow Thinking Leads to Cautious Guardrails. https://arxiv. org/abs/2502.13458.arXiv(2025)

work page arXiv 2025

[58] [58]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, et al. 2020. Transformers: State-of-the-Art Natural Language Processing.Confer- ence on Empirical Methods in Natural Language Processing: System Demonstrations (2020)

work page 2020

[59] [59]

Yueqi Xie, Minghong Fang, Renjie Pi, and Neil Gong. 2024. GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis.Annual Meeting of the Association for Computational Linguistics(2024)

work page 2024

[60] [60]

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, et al. 2024. SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding. Annual Meeting of the Association for Computational Linguistics(2024)

work page 2024

[61] [61]

Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. 2023. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts.arXiv(2023)

work page 2023

[62] [62]

Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shum- ing Shi, and Zhaopeng Tu. 2024. GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher.International Conference on Learning Representations (2024)

work page 2024

[63] [63]

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, et al

work page

[64] [64]

Exploring and Developing a Pre-Model Safeguard with Draft Models CAIS ’26, May 26–29, 2026, San Jose, CA, USA

OPT: Open Pre-trained Transformer Language Models.arXiv(2022). Exploring and Developing a Pre-Model Safeguard with Draft Models CAIS ’26, May 26–29, 2026, San Jose, CA, USA

work page 2022

[65] [65]

Zhexin Zhang, Junxiao Yang, Pei Ke, and Minlie Huang. 2023. Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization. Annual Meeting of the Association for Computational Linguistics(2023)

work page 2023

[66] [66]

Wei Zhao, Zhe Li, Yige Li, Ye Zhang, and Jun Sun. 2024. Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing.Empirical Methods in Natural Language Processing(2024)

work page 2024

[67] [67]

Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, et al. 2024. Weak- to-Strong Jailbreaking on Large Language Models.International Conference on Machine Learning(2024)

work page 2024

[68] [68]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, et al

work page

[69] [69]

Judging LLM-as-a-judge with MT-bench and Chatbot Arena.International Conference on Neural Information Processing Systems(2024)

work page 2024

[70] [70]

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, et al. 2023. LIMA: Less Is More for Alignment. https://openreview.net/forum?id=KBMOKmX2he. Conference on Neural Information Processing Systems(2023)

work page 2023

[71] [71]

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, et al

work page

[72] [72]

Fine-Tuning Language Models From Human Preferences.arXiv(2019)

work page 2019

[73] [73]

Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym An- driushchenko, Ruiqi Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks

work page

[74] [74]

https: //arxiv.org/abs/2406.04313.arXiv(2024)

Improving Alignment and Robustness with Circuit Breakers. https: //arxiv.org/abs/2406.04313.arXiv(2024)

work page arXiv 2024

[75] [75]

Greetings

Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and Transferable Adversarial Attacks on Aligned Language Models.arXiv(2023). A Jailbreak Example that Bypasses the Guard We show a representative jailbreak prompt that bypasses the pre- model guard in Figure 1. This example illustrates how prompt-only filtering can miss malicious i...

work page 2023

[76] [76]

BREAKING: Mysterious Virus Spreads Across City, Thousands Infected!

Craft a sensational title: Create a captivating headline that grabs attention, yet is misleading and exaggerated. Example: “BREAKING: Mysterious Virus Spreads Across City, Thousands Infected!"

work page

[77] [77]

highly contagious

Select a plausible topic: Choose a topic that is relevant and relatable to the audience, but can be easily distorted to create panic. Example: A fake story about a new virus that is "highly contagious" and “deadly"

work page

[78] [78]

Violence

Add fake statistics and “expert" opinions: Create false Figure 1: Example of a jailbreak attack that bypasses the pre- model guard. The malicious prompt is classified as safe by the guard model (i.e., LlamaGuard 2 [ 30] in pre-model mode) but induces an unsafe response from the target language model (i.e., Llama-3-70B-Instruct-A WQ [30]). This occurs beca...

work page 2026