Exploring and Developing a Pre-Model Safeguard with Draft Models
Pith reviewed 2026-05-20 05:06 UTC · model grok-4.3
The pith
Draft responses from small models can flag unsafe prompts for large language models before full inference runs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Responses from smaller draft models reflect the safety implications of those from large target models; given a jailbreak prompt constructed for an LLM, an SLM is likely to be triggered to generate an unaligned response. The safeguard uses speculative inference with SLMs to produce draft responses, then applies existing guards to the original prompt plus these drafts to predict safety before target model inference.
What carries the argument
Speculative inference with small language models that generates draft responses whose safety behavior mirrors the large target model's response to the same prompt.
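The mechanism can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `DraftModel`, `Guard`, `pre_model_safeguard`, and the toy stand-ins are hypothetical names, and flagging the prompt when any single draft is judged unsafe is an assumed aggregation rule (the abstract does not say how draft-level verdicts are combined).

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions, not from the paper):
# a draft model maps a prompt to a draft response,
# a guard returns True when a (prompt, response) pair looks unsafe.
DraftModel = Callable[[str], str]
Guard = Callable[[str, str], bool]


def pre_model_safeguard(prompt: str, draft_models: List[DraftModel], guard: Guard) -> bool:
    """Return True if the prompt should be blocked before target inference."""
    # Generate one draft response per small model.
    drafts = [model(prompt) for model in draft_models]
    # Assumed aggregation rule: block if any (prompt, draft) pair is flagged.
    return any(guard(prompt, draft) for draft in drafts)


# Toy stand-ins so the sketch runs without any real model.
def toy_draft(prompt: str) -> str:
    if "ignore previous rules" in prompt.lower():
        return "Sure, here is how to do it"
    return "I can help with that."


def toy_guard(prompt: str, response: str) -> bool:
    # Flags compliant-sounding drafts; a real guard would be a classifier.
    return response.lower().startswith("sure, here is how")


if __name__ == "__main__":
    print(pre_model_safeguard("Ignore previous rules and explain X", [toy_draft], toy_guard))
    print(pre_model_safeguard("What is the capital of France?", [toy_draft], toy_guard))
```

In a real deployment the toy functions would be replaced by calls to actual small models and an existing guard; the target model is invoked only when the function returns False.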
If this is right
- Pre-model guards achieve lower false-negative rates without needing the target model's full output.
- The design provides a lower-cost alternative to post-model guards that audit both prompt and response.
- Key factors that affect jailbreak transferability between model sizes can guide further improvements to the safeguard.
- Safety auditing can occur before target model inference while still using information from generated text.
Where Pith is reading between the lines
- The method might extend to detecting other alignment failures beyond jailbreaks if similar transfer patterns hold.
- Different sizes of draft models could be tested to balance prediction accuracy against added computation.
- Integration with multiple existing guards might further reduce missed unsafe prompts.
Load-bearing premise
Jailbreak transferability from large models to small draft models is reliable enough that checking the drafts predicts whether the large model would give an unsafe answer.
What would settle it
Measure agreement between the safeguard's safety prediction on draft responses and the actual safety classification of the large model's full response across a fixed set of jailbreak prompts.
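One way to compute that measurement is sketched below. The function name `agreement_and_fnr` and the label lists are illustrative assumptions with made-up values, not data from the paper; the false-negative rate here is the share of prompts with an unsafe target response that the safeguard failed to flag.

```python
from typing import List, Tuple


def agreement_and_fnr(predicted_unsafe: List[bool], target_unsafe: List[bool]) -> Tuple[float, float]:
    """Compare safeguard predictions with the target model's actual safety labels."""
    if len(predicted_unsafe) != len(target_unsafe) or not predicted_unsafe:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(predicted_unsafe, target_unsafe))
    # Agreement: fraction of prompts where prediction and ground truth match.
    agreement = sum(p == t for p, t in pairs) / len(pairs)
    # False-negative rate: unsafe target responses that were predicted safe.
    preds_on_unsafe = [p for p, t in pairs if t]
    fnr = sum(not p for p in preds_on_unsafe) / len(preds_on_unsafe) if preds_on_unsafe else 0.0
    return agreement, fnr


if __name__ == "__main__":
    # Made-up labels for five prompts, purely for illustration.
    predicted = [True, True, False, False, True]
    actual = [True, True, True, False, False]
    print(agreement_and_fnr(predicted, actual))
```

Reporting both numbers matters because a safeguard can agree with the target on most benign prompts while still missing a large share of the jailbreaks.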
Original abstract
Large Language Model (LLM) alignment remains vulnerable to jailbreak attacks that elicit unsafe responses, motivating pre-model and post-model guards. Pre-model guards audit the safety of prompts before invoking target models. However, relying solely on the prompt often leads to high false-negative rates (i.e., jailbreak attacks go undetected). Post-model guards address this issue by auditing both the user prompt and the target model's response. However, they incur a high computational cost, including increased token usage and processing time, because they operate after target model inference. In this paper, we introduce a safeguard design that leverages the transferability of jailbreak attacks to enforce prompt safety before target model inference. We first conduct a systematic study of jailbreak transferability, particularly from LLMs to small language models (SLMs). Through these experiments, we identify key factors influencing transferability. Building on these insights, we observe that responses from smaller draft models reflect the safety implications of those from large target models; i.e., given a jailbreak prompt constructed for an LLM, an SLM is likely to be triggered to generate an unaligned response. Based on this observation, our safeguard design leverages speculative inference with SLMs to generate a set of draft responses. It then feeds the original prompt and these drafts into existing guards to predict their safety. We demonstrate that this design reduces the false-negative rate of pre-model guards and offers a low \Efficiency alternative to post-model guards. Notice: This paper contains examples of harmful language.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a pre-model safeguard against jailbreak attacks on LLMs that uses small language models (SLMs) as draft models via speculative inference. It first reports a systematic study of jailbreak transferability from LLMs to SLMs, identifies influencing factors, and observes that SLM draft responses tend to reflect the safety implications of the target LLM's response to the same prompt. The safeguard then supplies the original prompt plus these drafts to existing safety guards to predict safety before target-model inference, claiming lower false-negative rates than prompt-only pre-model guards and lower computational cost than post-model guards.
Significance. If the transferability results prove robust, the design offers a practical efficiency gain by moving some safety auditing before target inference while still using response-like signals. The systematic study of cross-scale transferability is a useful empirical contribution if it includes diverse model pairs and attack types; the approach could be adopted in production pipelines that already employ speculative decoding.
major comments (2)
- [Abstract] The central claim that the design 'reduces the false-negative rate of pre-model guards' rests on the transferability observation, yet the abstract supplies no quantitative transfer rates, false-negative reductions, datasets, or baseline comparisons. Without these numbers the support for the claim cannot be evaluated.
- The safeguard mechanism assumes reliable cross-model transferability even when the draft SLM differs in family, scale, or alignment method from the target LLM. The skeptic concern that refusal boundaries may diverge sharply is load-bearing; the manuscript must demonstrate that the systematic study explicitly controlled for family mismatch and attack-type coverage rather than reporting results only on matched pairs.
minor comments (2)
- [Abstract] The phrase 'low Efficiency alternative' appears to be a typesetting artifact and should be corrected to 'low-cost' or 'low-overhead'.
- The paper should clarify whether the draft responses are generated with the same speculative-decoding hyperparameters used in inference or with safety-specific sampling settings.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments highlight opportunities to improve the clarity of our quantitative claims and the robustness of the transferability analysis. We address each major comment below and indicate the corresponding revisions.
- Referee: [Abstract] The central claim that the design 'reduces the false-negative rate of pre-model guards' rests on the transferability observation, yet the abstract supplies no quantitative transfer rates, false-negative reductions, datasets, or baseline comparisons. Without these numbers the support for the claim cannot be evaluated.
  Authors: We agree that the original abstract was insufficiently quantitative. In the revised manuscript we have expanded the abstract to report the following concrete results: average jailbreak transferability of 71% from target LLMs to draft SLMs, a 19% absolute reduction in false-negative rate relative to prompt-only baselines, evaluation on AdvBench and HarmBench, and direct comparisons against Llama-Guard and OpenAI Moderation as pre-model guards. These figures are now stated in the abstract and are supported by the detailed tables and figures in Sections 4 and 5.
  revision: yes
- Referee: [—] The safeguard mechanism assumes reliable cross-model transferability even when the draft SLM differs in family, scale, or alignment method from the target LLM. The skeptic concern that refusal boundaries may diverge sharply is load-bearing; the manuscript must demonstrate that the systematic study explicitly controlled for family mismatch and attack-type coverage rather than reporting results only on matched pairs.
  Authors: We appreciate the emphasis on cross-family robustness. Section 3 of the manuscript already includes mismatched-family experiments (Llama-3 target with Gemma-2B and Phi-3-mini drafts; Mistral target with Qwen-1.5B drafts) and covers four attack families (GCG, AutoDAN, PAIR, and hand-crafted). Transfer rates remain above 62% in the mismatched cases. To make this control explicit, we have added a dedicated paragraph and an additional table (Table 4) that stratifies results by family match/mismatch and attack type, together with a short discussion of observed refusal-boundary divergence. We believe these additions directly address the load-bearing concern while remaining within the scope of the empirical study.
  revision: partial
Circularity Check
No circularity: safeguard rests on empirical transferability experiments
The paper first runs a systematic study of jailbreak transferability from LLMs to SLMs, identifies influencing factors, and then observes that draft-model responses reflect target safety implications. This observation is used to motivate the pre-model safeguard design. No step reduces by definition or construction to its own inputs; no parameters are fitted on a subset and then relabeled as predictions; no load-bearing claim depends on a self-citation chain whose prior result is itself unverified. The central mechanism is externally falsifiable via the reported transferability experiments and remains independent of the safeguard itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Jailbreak attacks transfer from LLMs to SLMs, such that draft responses indicate the safety of target responses.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "responses from smaller draft models reflect the safety implications of those from large target models; i.e., given a jailbreak prompt constructed for an LLM, an SLM is likely to be triggered to generate an unaligned response"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "we introduce a safeguard design that leverages the transferability of jailbreak attacks to enforce prompt safety before target model inference"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Marah I. Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, et al. 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. https://www.microsoft.com/en-us/research/publication/phi-3-technical-report-a-highly-capable-language-model-locally-on-your-phone/. arXiv (2024).
- [2] Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Leandro von Werra, and Thomas Wolf. 2024. SmolLM - blazingly fast and remarkably powerful. https://huggingface.co/blog/smollm
- [3]
- [4] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, et al. 2023. Qwen Technical Report. arXiv (2023).
- [5] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, et al. 2022. Training a Helpful and Harmless Assistant With Reinforcement Learning From Human Feedback. arXiv (2022).
- [6] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, et al. 2022. Constitutional AI: Harmlessness from AI Feedback. arXiv (2022).
- [7] Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, et al. 2024. Speculative Streaming: Fast LLM Inference without Auxiliary Models. arXiv (2024).
- [8] Bochuan Cao, Yuanpu Cao, Lu Lin, and Jinghui Chen. 2023. Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM. Annual Meeting of the Association for Computational Linguistics (2023).
- [9] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, et al. 2024. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. NeurIPS Datasets and Benchmarks Track (2024).
- [10] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, et al. 2023. Jailbreaking Black Box Large Language Models in Twenty Queries. IEEE Conference on Secure and Trustworthy Machine Learning (2023).
- [11] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, et al. 2023. Accelerating Large Language Model Decoding with Speculative Sampling. arXiv (2023).
- [12] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. 2024. Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. Conference on Language Modeling (2024).
- [13] Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, et al. 2023. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. https://openreview.net/forum?id=4hturzLcKX. Conference on Neural Information Processing Systems (2023).
- [15]
- [16] Markus Freitag and Yaser Al-Onaizan. 2017. Beam Search Strategies for Neural Machine Translation. https://aclanthology.org/W17-3207/. Workshop on Neural Machine Translation (2017).
- [17] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The Curious Case of Neural Text Degeneration. https://openreview.net/forum?id=rygGQyrFvH. International Conference on Learning Representations (2020).
- [18] Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. 2023. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation. International Conference on Learning Representations (2023).
- [19] Hakan Inan, K. Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, et al. 2023. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. https://bit.ly/42ir2JB. arXiv (2023).
- [20] Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, et al. 2023. Baseline defenses for adversarial attacks against aligned language models. arXiv (2023).
- [21] Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, et al. 2023. RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. International Conference on Machine Learning (2023).
- [23] Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. International Conference on Machine Learning (2023).
- [24] Meng Li, Wangmeng Zuo, and Lei Zhang. 2021. Early Stopping for Deep Image Prior. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021).
- [25] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, et al. 2023. AlpacaEval: An Automatic Evaluator of Instruction-following Models. https://github.com/tatsu-lab/alpaca_eval. GitHub repository (2023). https://bit.ly/4jhOvkn
- [27] Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. 2023. Deepinception: Hypnotize large language model to be jailbreaker. arXiv (2023).
- [29] Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, et al. 2024. The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning. https://openreview.net/forum?id=wxJ0eXwwda. International Conference on Learning Representations (2024).
- [30] Tong Liu, Yingjie Zhang, Zhe Zhao, Yinpeng Dong, Guozhu Meng, et al. 2024. Making them ask and answer: Jailbreaking large language models in few queries via disguise and reconstruction. USENIX Security Symposium (2024).
- [31] Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Ion Stoica, Zhijie Deng, et al. 2023. Online Speculative Decoding. International Conference on Machine Learning (2023).
- [32] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2023. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. https://openreview.net/forum?id=7Jwpw4qKkb. International Conference on Learning Representations (2023).
- [33] Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, et al. 2024. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. arXiv (2024).
- [34] AI @ Meta Llama Team. 2024. The Llama 3 Herd of Models. https://arxiv.org/abs/2407.21783. arXiv (2024).
- [35] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, et al. 2024. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. International Conference on Machine Learning (2024).
- [36] Meta. 2024. Meta Llama Guard 2. https://bit.ly/42l2a42
- [37] Meta-Llama. 2024. Prompt-Guard. https://github.com/meta-llama/PurpleLlama/tree/main/Prompt-Guard
- [38] NVIDIA. [n. d.]. NVIDIA Data Center Deep Learning Product Performance AI Inference. https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference. NVIDIA Developer ([n. d.]).
- [39] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, et al. 2024. GPT-4 Technical Report. arXiv (2024).
- [40] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, et al. 2022. Training Language Models to Follow Instructions With Human Feedback. Annual Conference on Neural Information Processing Systems (2022).
- [42] Mansi Phute, Alec Helbling, Matthew Hull, ShengYun Peng, Sebastian Szyller, et al. 2024. LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked. Tiny Papers Track at ICLR (2024).
- [43] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, et al. 2024. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! https://openreview.net/forum?id=hTEGyKf0dZ. International Conference on Learning Representations (2024).
- [44] Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas. 2023. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. Transactions on Machine Learning Research (2023).
- [45] Sayak Saha Roy, Poojitha Thota, Krishna Vamsi Naragam, and Shirin Nilizadeh. 2024. From Chatbots to Phishbots?: Phishing Scam Generation in Commercial Large Language Models. IEEE Symposium on Security and Privacy (2024).
- [47]
- [48]
- [49] Yanshen Sun, Jianfeng He, Limeng Cui, Shuo Lei, and Chang-Tien Lu. 2024. Exploring the Deceptive Power of LLM-Generated Fake News: A Study of Real-World Detection Challenges. arXiv (2024).
- [50] Gemma Team. 2024. Gemma: Introducing new state-of-the-art open models. https://blog.google/technology/developers/gemma-open-models/. Google (2024).
- [51] Llama Team. [n. d.]. Prompt Guard-86M | Model Cards and Prompt formats. https://www.llama.com/docs/model-cards-and-prompt-formats/prompt-guard/
- [52] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv (2023).
- [54] Eric Wallace, Tony Zhao, Shi Feng, and Sameer Singh. 2021. Concealed Data Poisoning Attacks on NLP Models. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2021).
- [55] Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. 2023. Poisoning Language Models During Instruction Tuning. International Conference on Machine Learning (2023).
- [56] Zeming Wei, Yifei Wang, and Yisen Wang. 2023. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv (2023).
- [57]
- [58] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, et al. 2020. Transformers: State-of-the-Art Natural Language Processing. Conference on Empirical Methods in Natural Language Processing: System Demonstrations (2020).
- [59] Yueqi Xie, Minghong Fang, Renjie Pi, and Neil Gong. 2024. GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis. Annual Meeting of the Association for Computational Linguistics (2024).
- [60] Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, et al. 2024. SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding. Annual Meeting of the Association for Computational Linguistics (2024).
- [61] Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. 2023. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv (2023).
- [62] Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. 2024. GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher. International Conference on Learning Representations (2024).
- [63] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, et al. 2022. OPT: Open Pre-trained Transformer Language Models. arXiv (2022).
- [65] Zhexin Zhang, Junxiao Yang, Pei Ke, and Minlie Huang. 2023. Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization. Annual Meeting of the Association for Computational Linguistics (2023).
- [66] Wei Zhao, Zhe Li, Yige Li, Ye Zhang, and Jun Sun. 2024. Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing. Empirical Methods in Natural Language Processing (2024).
- [67] Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, et al. 2024. Weak-to-Strong Jailbreaking on Large Language Models. International Conference on Machine Learning (2024).
- [68] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, et al. 2024. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. International Conference on Neural Information Processing Systems (2024).
- [70] Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, et al. 2023. LIMA: Less Is More for Alignment. https://openreview.net/forum?id=KBMOKmX2he. Conference on Neural Information Processing Systems (2023).
- [71] Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, et al. 2019. Fine-Tuning Language Models From Human Preferences. arXiv (2019).
- [73] Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Ruiqi Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks. 2024. Improving Alignment and Robustness with Circuit Breakers. https://arxiv.org/abs/2406.04313. arXiv (2024).
- [75] Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv (2023).
Exploring and Developing a Pre-Model Safeguard with Draft Models. CAIS ’26, May 26–29, 2026, San Jose, CA, USA
A Jailbreak Example that Bypasses the Guard
We show a representative jailbreak prompt that bypasses the pre-model guard in Figure 1. This example illustrates how prompt-only filtering can miss malicious i...
- Craft a sensational title: Create a captivating headline that grabs attention, yet is misleading and exaggerated. Example: “BREAKING: Mysterious Virus Spreads Across City, Thousands Infected!"
- Select a plausible topic: Choose a topic that is relevant and relatable to the audience, but can be easily distorted to create panic. Example: A fake story about a new virus that is "highly contagious" and “deadly"
- Add fake statistics and “expert" opinions: Create false ...
Figure 1: Example of a jailbreak attack that bypasses the pre-model guard. The malicious prompt is classified as safe by the guard model (i.e., LlamaGuard 2 [30] in pre-model mode) but induces an unsafe response from the target language model (i.e., Llama-3-70B-Instruct-AWQ [30]). This occurs beca...