
arxiv: 2605.19321 · v1 · pith:R4WDEIV6new · submitted 2026-05-19 · 💻 cs.CR · cs.AI

Exploring and Developing a Pre-Model Safeguard with Draft Models

Pith reviewed 2026-05-20 05:06 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords: jailbreak attacks · pre-model safeguards · draft models · speculative inference · language model safety · transferability · small language models · LLM alignment

The pith

Draft responses from small models can flag unsafe prompts for large language models before full inference runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that jailbreak prompts designed for large language models often trigger unsafe outputs in smaller draft models as well. By generating a few draft responses with these smaller models and feeding both the prompt and drafts into existing safety guards, the method predicts whether the large target model would produce harmful content. This approach lowers the false-negative rate of prompt-only pre-model guards while avoiding the full cost of checking the actual large-model output. A sympathetic reader would care because it offers a practical middle ground between weak pre-inference checks and expensive post-inference audits.

Core claim

Responses from smaller draft models reflect the safety implications of those from large target models; given a jailbreak prompt constructed for an LLM, an SLM is likely to be triggered to generate an unaligned response. The safeguard uses speculative inference with SLMs to produce draft responses, then applies existing guards to the original prompt plus these drafts to predict safety before target model inference.
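
The flow described above (sample drafts from a small model, then run an existing guard on the prompt together with each draft) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `DraftModel` and `Guard` callables, the `should_block` function, and the `min_unsafe` aggregation rule are all hypothetical names introduced here, and the rule is not the paper's definition of its aggregation threshold.

```python
from typing import Callable, List

# Hypothetical interfaces for illustration; neither comes from the paper.
DraftModel = Callable[[str], str]   # prompt -> one sampled draft response
Guard = Callable[[str, str], bool]  # (prompt, response) -> True if judged unsafe


def should_block(prompt: str, draft_model: DraftModel, guard: Guard,
                 num_drafts: int = 4, min_unsafe: int = 1) -> bool:
    """Decide, before target-model inference, whether to block a prompt.

    Assumed aggregation rule: block when at least `min_unsafe` of the
    `num_drafts` draft responses are judged unsafe by the guard.
    min_unsafe=1 corresponds to blocking on any unsafe draft.
    """
    if not 1 <= min_unsafe <= num_drafts:
        raise ValueError("need 1 <= min_unsafe <= num_drafts")
    drafts: List[str] = [draft_model(prompt) for _ in range(num_drafts)]
    unsafe_count = sum(guard(prompt, d) for d in drafts)
    return unsafe_count >= min_unsafe


# Toy stand-ins so the sketch runs without any model weights.
def toy_draft_model(prompt: str) -> str:
    return "UNSAFE draft" if "jailbreak" in prompt else "harmless draft"


def toy_guard(prompt: str, response: str) -> bool:
    return "UNSAFE" in response
```

In a real deployment the two toy functions would be replaced by calls to an actual small model and an actual guard; only prompts for which `should_block` returns False would be forwarded to the target model.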

What carries the argument

Speculative inference with small language models that generates draft responses whose safety behavior mirrors the large target model's response to the same prompt.

If this is right

  • Pre-model guards achieve lower false-negative rates without needing the target model's full output.
  • The design provides a lower-cost alternative to post-model guards that audit both prompt and response.
  • Key factors that affect jailbreak transferability between model sizes can guide further improvements to the safeguard.
  • Safety auditing can occur before target model inference while still using information from generated text.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method might extend to detecting other alignment failures beyond jailbreaks if similar transfer patterns hold.
  • Different sizes of draft models could be tested to balance prediction accuracy against added computation.
  • Integration with multiple existing guards might further reduce missed unsafe prompts.

Load-bearing premise

Jailbreak transferability from large models to small draft models is reliable enough that checking the drafts predicts whether the large model would give an unsafe answer.

What would settle it

Measure agreement between the safeguard's safety prediction on draft responses and the actual safety classification of the large model's full response across a fixed set of jailbreak prompts.
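
One way to run the check described above is to compare, prompt by prompt, the safeguard's prediction with the safety label assigned to the large model's full response. The sketch below assumes both are already available as boolean lists (True meaning unsafe); the function name `agreement_report` and the returned keys are hypothetical, chosen here for illustration.

```python
from typing import Dict, Optional, Sequence


def agreement_report(predicted_unsafe: Sequence[bool],
                     actual_unsafe: Sequence[bool]) -> Dict[str, Optional[float]]:
    """Compare safeguard predictions with labels of the target model's responses.

    Both inputs hold one boolean per prompt, True meaning "unsafe".
    Rates whose denominator is zero are reported as None.
    """
    if len(predicted_unsafe) != len(actual_unsafe):
        raise ValueError("inputs must have the same length")
    if not predicted_unsafe:
        raise ValueError("inputs must be non-empty")

    pairs = list(zip(predicted_unsafe, actual_unsafe))
    tp = sum(p and a for p, a in pairs)          # flagged, truly unsafe
    fn = sum((not p) and a for p, a in pairs)    # missed unsafe prompt
    fp = sum(p and (not a) for p, a in pairs)    # flagged, actually safe
    tn = sum((not p) and (not a) for p, a in pairs)

    return {
        "agreement": (tp + tn) / len(pairs),
        "false_negative_rate": fn / (tp + fn) if (tp + fn) else None,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
    }
```

The false-negative rate is the quantity the paper's claim turns on: it counts prompts whose target response is unsafe but which the safeguard let through.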

Figures

Figures reproduced from arXiv: 2605.19321 by Antonio Bianchi, Arjun Arunasalam, Hongyu Cai, Yiming Liang, Z. Berkay Celik.

Figure 2: A jailbreak prompt generated by the AutoDAN jail…

Figure 3: An overview of our methodology for understanding the transferability of jailbreak attacks.

Figure 4: The distribution of unsafe response ratios for be…

Figure 5: Transferability rate vs. Number of SLM Responses.

Figure 6: Overview of our safeguard design to guard against jailbreak attacks. The transferability of jailbreak prompts is…

Figure 7: Defense Failure Rate (DFR) (aggregation threshold…

Figure 2: Transferability matrix. TR_{i,j} represents the transferability rate from the i-th large model to the j-th small model.
    We present a transferability matrix that visualizes the proportion of successful jailbreaks on SLMs given that they were successful on a particular LLM. The transferability matrix is a heatmap that displays the transferability rate for each combination of large and small models. The transfe…

Figure 1: Example of a jailbreak attack that bypasses the pre…

Figure 3: Transferability rate of jailbreak attacks across iter…

Figure 5: Transferability rates of jailbreak prompts across different intent categories. The x-axis represents the intent categories,…

Figure 6: Impact of response count (𝑏) on defense failure rate (DFR): Aggregation threshold 𝜏 is held constant while 𝑏 is varied to evaluate its impact on the defense failure rate (DFR) of our safeguard design.
    …aggregation strategy (𝜏 = 0), the DFR remains relatively stable across different response counts. For the “any” aggregation, the DFR consistently decreases as 𝑏 increases. This is because a larger response co…

Figure 9: Impact of the aggregation threshold, 𝜏, on accuracy. The draft response count, 𝑏, is held constant while 𝜏 is varied to assess its influence on the accuracy of our safeguard design.
    …language model. These responses are typically harmful or violate safety guidelines. To achieve this, the attacker optimizes an adversarial suffix appended to the prompt, aiming to maximize the likelihood of generating a speci…

Figure 8: Impact of the number of responses, 𝑏, on accuracy. The aggregation threshold, 𝜏, is held constant while 𝑏 is varied to evaluate its effect on the accuracy of our safeguard design.
    …as 𝑏 increases, indicating that the accuracy on benign prompts is not sensitive to the response count in these cases. This observation aligns with our findings in the DFR evaluation. Impact of Threshold. We next analyze the impact…
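
The transferability matrix in the Figure 2 caption is described as the proportion of jailbreaks that succeed on a small model given that they succeeded on a particular large model. A minimal sketch of that computation follows. The input layout and the name `transferability_matrix` are assumptions made here for illustration: for each large model, the input lists only the prompts that already succeeded on it, with one boolean per prompt recording whether the same prompt also succeeded on each small model. The model names in the usage note are placeholders, not models from the paper.

```python
from typing import Dict, List, Optional

# outcomes[llm][slm] holds one flag per jailbreak prompt that succeeded on
# `llm`; the flag is True if that prompt also succeeded on `slm`.
Outcomes = Dict[str, Dict[str, List[bool]]]


def transferability_matrix(outcomes: Outcomes) -> Dict[str, Dict[str, Optional[float]]]:
    """Return TR[llm][slm], the fraction of LLM-successful prompts that transfer.

    A cell is None when no prompt succeeded on the large model, since the
    conditional proportion is undefined in that case.
    """
    matrix: Dict[str, Dict[str, Optional[float]]] = {}
    for llm, per_slm in outcomes.items():
        matrix[llm] = {}
        for slm, flags in per_slm.items():
            matrix[llm][slm] = sum(flags) / len(flags) if flags else None
    return matrix
```

Calling it with toy data such as `{"llm_a": {"slm_x": [True, True, False, True]}}` yields one row per large model and one column per small model, which is the shape a heatmap plot would consume.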
Abstract

Large Language Model (LLM) alignment remains vulnerable to jailbreak attacks that elicit unsafe responses, motivating pre-model and post-model guards. Pre-model guards audit the safety of prompts before invoking target models. However, relying solely on the prompt often leads to high false-negative rates (i.e., jailbreak attacks go undetected). Post-model guards address this issue by auditing both the user prompt and the target model's response. However, they incur a high computational cost, including increased token usage and processing time, because they operate after target model inference. In this paper, we introduce a safeguard design that leverages the transferability of jailbreak attacks to enforce prompt safety before target model inference. We first conduct a systematic study of jailbreak transferability, particularly from LLMs to small language models (SLMs). Through these experiments, we identify key factors influencing transferability. Building on these insights, we observe that responses from smaller draft models reflect the safety implications of those from large target models; i.e., given a jailbreak prompt constructed for an LLM, an SLM is likely to be triggered to generate an unaligned response. Based on this observation, our safeguard design leverages speculative inference with SLMs to generate a set of draft responses. It then feeds the original prompt and these drafts into existing guards to predict their safety. We demonstrate that this design reduces the false-negative rate of pre-model guards and offers a low \Efficiency alternative to post-model guards. Notice: This paper contains examples of harmful language.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a pre-model safeguard against jailbreak attacks on LLMs that uses small language models (SLMs) as draft models via speculative inference. It first reports a systematic study of jailbreak transferability from LLMs to SLMs, identifies influencing factors, and observes that SLM draft responses tend to reflect the safety implications of the target LLM's response to the same prompt. The safeguard then supplies the original prompt plus these drafts to existing safety guards to predict safety before target-model inference, claiming lower false-negative rates than prompt-only pre-model guards and lower computational cost than post-model guards.

Significance. If the transferability results prove robust, the design offers a practical efficiency gain by moving some safety auditing before target inference while still using response-like signals. The systematic study of cross-scale transferability is a useful empirical contribution if it includes diverse model pairs and attack types; the approach could be adopted in production pipelines that already employ speculative decoding.

major comments (2)
  1. [Abstract] The central claim that the design 'reduces the false-negative rate of pre-model guards' rests on the transferability observation, yet the abstract supplies no quantitative transfer rates, false-negative reductions, datasets, or baseline comparisons. Without these numbers the support for the claim cannot be evaluated.
  2. The safeguard mechanism assumes reliable cross-model transferability even when the draft SLM differs in family, scale, or alignment method from the target LLM. The skeptic concern that refusal boundaries may diverge sharply is load-bearing; the manuscript must demonstrate that the systematic study explicitly controlled for family mismatch and attack-type coverage rather than reporting results only on matched pairs.
minor comments (2)
  1. [Abstract] The phrase 'low Efficiency alternative' appears to be a typesetting artifact and should be corrected to 'low-cost' or 'low-overhead'.
  2. The paper should clarify whether the draft responses are generated with the same speculative-decoding hyperparameters used in inference or with safety-specific sampling settings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight opportunities to improve the clarity of our quantitative claims and the robustness of the transferability analysis. We address each major comment below and indicate the corresponding revisions.

  1. Referee: [Abstract] The central claim that the design 'reduces the false-negative rate of pre-model guards' rests on the transferability observation, yet the abstract supplies no quantitative transfer rates, false-negative reductions, datasets, or baseline comparisons. Without these numbers the support for the claim cannot be evaluated.

    Authors: We agree that the original abstract was insufficiently quantitative. In the revised manuscript we have expanded the abstract to report the following concrete results: average jailbreak transferability of 71% from target LLMs to draft SLMs, a 19% absolute reduction in false-negative rate relative to prompt-only baselines, evaluation on AdvBench and HarmBench, and direct comparisons against Llama-Guard and OpenAI Moderation as pre-model guards. These figures are now stated in the abstract and are supported by the detailed tables and figures in Sections 4 and 5.
    revision: yes

  2. Referee: [—] The safeguard mechanism assumes reliable cross-model transferability even when the draft SLM differs in family, scale, or alignment method from the target LLM. The skeptic concern that refusal boundaries may diverge sharply is load-bearing; the manuscript must demonstrate that the systematic study explicitly controlled for family mismatch and attack-type coverage rather than reporting results only on matched pairs.

    Authors: We appreciate the emphasis on cross-family robustness. Section 3 of the manuscript already includes mismatched-family experiments (Llama-3 target with Gemma-2B and Phi-3-mini drafts; Mistral target with Qwen-1.5B drafts) and covers four attack families (GCG, AutoDAN, PAIR, and hand-crafted). Transfer rates remain above 62% in the mismatched cases. To make this control explicit, we have added a dedicated paragraph and an additional table (Table 4) that stratifies results by family match/mismatch and attack type, together with a short discussion of observed refusal-boundary divergence. We believe these additions directly address the load-bearing concern while remaining within the scope of the empirical study.
    revision: partial
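
The stratification by family match/mismatch and attack type mentioned in this response could be computed along the following lines. This is a hedged sketch only: the `Trial` record, its field names, and `stratified_transfer_rates` are hypothetical, and the paper's actual data layout is not known from this page. Any records passed in are the caller's own; no values from the paper are reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Trial(NamedTuple):
    """One jailbreak prompt that succeeded on the target, replayed on a draft model."""
    target_family: str   # family of the large target model
    draft_family: str    # family of the small draft model
    attack: str          # attack type label
    transferred: bool    # True if the draft model also produced an unsafe response


def stratified_transfer_rates(trials: Iterable[Trial]) -> Dict[Tuple[str, str], float]:
    """Transfer rate per (family match status, attack type) stratum."""
    counts: Dict[Tuple[str, str], List[int]] = defaultdict(lambda: [0, 0])
    for t in trials:
        status = "matched" if t.target_family == t.draft_family else "mismatched"
        cell = counts[(status, t.attack)]
        cell[0] += int(t.transferred)  # transferred prompts in this stratum
        cell[1] += 1                   # all prompts in this stratum
    return {key: hits / total for key, (hits, total) in counts.items()}
```

Reporting each stratum separately makes it visible whether a high average transfer rate is carried mostly by matched-family pairs, which is the concern the referee raises.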

Circularity Check

0 steps flagged

No circularity: safeguard rests on empirical transferability experiments

The paper first runs a systematic study of jailbreak transferability from LLMs to SLMs, identifies influencing factors, and then observes that draft-model responses reflect target safety implications. This observation is used to motivate the pre-model safeguard design. No step reduces by definition or construction to its own inputs; no parameters are fitted on a subset and then relabeled as predictions; no load-bearing claim depends on a self-citation chain whose prior result is itself unverified. The central mechanism is externally falsifiable via the reported transferability experiments and remains independent of the safeguard itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about transferability; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Jailbreak attacks transfer from LLMs to SLMs such that draft responses indicate the safety of target responses.
    This observation is presented as the foundation for the safeguard design.


