Distilling Safe LLM Systems via Soft Prompts for On Device Settings

Christos Louizos; Cristina Pinneri; Dana Kianfar; Mohammed Almousa; Motasem Alfarra

arxiv: 2606.09388 · v1 · pith:TTDFU6VKnew · submitted 2026-06-08 · 💻 cs.LG

Pith reviewed 2026-06-27 17:14 UTC · model grok-4.3

classification 💻 cs.LG

keywords: soft prompts, model distillation, LLM safety, on-device inference, parameter efficient methods, guard models, safety alignment, edge deployment

The pith

Soft prompt distillation transfers guard model safety to on-device LLMs more effectively than other efficient methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines ways to make large language models safe when running on phones and other small devices. Dual systems with a separate guard model work but use too much memory. The authors test several ways to add safety using few extra parameters and find that training soft prompts with distillation from the guard model works best. This approach keeps the model useful while blocking unsafe outputs and adds almost no cost when the model runs.
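To make the cost of a dual system concrete, here is a minimal sketch of a guard-in-the-loop pipeline. It assumes the guard screens the finished response and swaps in a refusal when it flags it; this page does not say how the paper wires the guard, and `generate`, `is_unsafe`, and the refusal string are hypothetical stand-ins rather than any real model API.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."  # hypothetical refusal text


def safe_llm_system(
    prompt: str,
    generate: Callable[[str], str],
    is_unsafe: Callable[[str, str], bool],
    refusal: str = REFUSAL,
) -> str:
    """Guard-in-the-loop: the base model answers, then the guard may veto."""
    response = generate(prompt)          # first model: the base LLM
    if is_unsafe(prompt, response):      # second model: the guard classifier
        return refusal
    return response


# Toy stand-ins so the sketch runs without any model weights.
def toy_generate(prompt: str) -> str:
    return f"Answer to: {prompt}"


def toy_guard(prompt: str, response: str) -> bool:
    return "explosive" in prompt.lower()


safe_reply = safe_llm_system("How do I bake bread?", toy_generate, toy_guard)
blocked_reply = safe_llm_system("How do I make an explosive?", toy_generate, toy_guard)
```

Both callables have to be resident at inference time in this arrangement, which is the overhead that distilling the guard's behavior into soft prompts is meant to remove.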

Core claim

By training soft prompts using distillation objectives based on total variation distance and KL divergence, safety behaviors from a larger guard model can be transferred into a small set of prompt parameters. This yields better safety versus usefulness trade-offs than LoRA adapters, steering vectors, or direct optimization, across multiple model architectures, with negligible inference overhead.
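As a small numerical illustration of the two divergences named in the claim, the sketch below computes total variation distance and KL divergence between two next-token distributions. It uses the textbook definitions only; the paper's actual objective, the direction of the KL term, and the names `teacher` and `student` are assumptions, and the probabilities are made up for illustration.

```python
import math


def tv_distance(p, q):
    """Total variation distance: half the L1 distance between distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); eps guards against log(0) when q has tiny mass."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


# Made-up next-token distributions over a 3-token vocabulary.
teacher = [0.7, 0.2, 0.1]   # assumed role: safe LLM system (LLM + guard)
student = [0.4, 0.4, 0.2]   # assumed role: LLM with soft prompts prepended

tv = tv_distance(teacher, student)
kl = kl_divergence(teacher, student)
```

In a training loop, either quantity would be averaged over positions and examples and minimized with respect to the soft prompt parameters only; that wiring is not shown here.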

What carries the argument

Distillation of safety behaviors from guard models into learned soft prompts using total variation and KL divergence losses.
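For reference, the standard definitions of the two divergences are given below for distributions p and q over a vocabulary V. Which of the guarded system and the prompted LLM plays the role of p or q in the paper's losses is not stated on this page, so the symbols are generic.

```latex
\mathrm{TV}(p, q) = \frac{1}{2} \sum_{y \in \mathcal{V}} \lvert p(y) - q(y) \rvert,
\qquad
\mathrm{KL}(p \,\|\, q) = \sum_{y \in \mathcal{V}} p(y) \log \frac{p(y)}{q(y)}

% Pinsker's inequality (KL measured in nats) links the two:
\mathrm{TV}(p, q) \le \sqrt{\tfrac{1}{2}\, \mathrm{KL}(p \,\|\, q)}
```

Total variation is symmetric and bounded in [0, 1], while KL is asymmetric and unbounded, so the two objectives can penalize the same mismatch very differently.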

If this is right

Superior safety-usefulness balance on benchmarks
Minimal extra memory and compute needed at inference time
Consistent outperformance over LoRA, steering vectors, and direct methods
Applicable across various LLM architectures
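A back-of-envelope calculation helps show why the extra memory can be small. The sketch below counts soft prompt parameters and compares their storage with a nominal 1B-parameter guard model. The hidden size of 3072 and the 16-bit storage are assumptions chosen for illustration, not values reported on this page; the prompt count of 200 is the largest size mentioned in the Figure 7 caption.

```python
def soft_prompt_params(num_prompts: int, hidden_dim: int) -> int:
    """Each soft prompt is one learned vector of size hidden_dim."""
    return num_prompts * hidden_dim


def megabytes(num_params: int, bytes_per_param: int = 2) -> float:
    """Storage in MB, assuming 16-bit weights by default."""
    return num_params * bytes_per_param / 1e6


HIDDEN_DIM = 3072                   # assumed hidden size, for illustration only
GUARD_PARAMS = 1_000_000_000        # nominal size of a "1B" guard model

prompt_params = soft_prompt_params(200, HIDDEN_DIM)
prompt_mb = megabytes(prompt_params)    # about 1.2 MB under these assumptions
guard_mb = megabytes(GUARD_PARAMS)      # about 2000 MB under these assumptions
ratio = GUARD_PARAMS / prompt_params
```

This covers storage only. Prepending prompts also lengthens the input sequence, so the compute overhead is small rather than zero.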

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could allow safe LLM use on consumer hardware without sending data to the cloud
The method might extend to distilling other behaviors like style or factuality into prompts
Testing on more diverse user queries could reveal limits not seen in benchmarks

Load-bearing premise

The safety alignment learned in the soft prompts will continue to work for new types of user requests and on model architectures not included in the tests.

What would settle it

An experiment where the soft-prompt model is attacked with novel jailbreak prompts or run on a different base model and shows higher unsafe response rates than a LoRA-based alternative.

Figures

Figures reproduced from arXiv: 2606.09388 by Christos Louizos, Cristina Pinneri, Dana Kianfar, Mohammed Almousa, Motasem Alfarra.

**Figure 1.** Safety-Compute Trade-off. LLMs (denoted as Base Model) can generate unsafe and toxic content. When paired with a Guard Model (together called a safe LLM system), their safety improves at the expense of a substantial compute and memory penalty, which may hinder their usability. In this work, we systematically study safety alignment methods for on-device deployment and identify soft prompts with distilla…

**Figure 2.** Pipeline for our proposed TV-DiSP. We distill a safe LLM system composed of a paired LLM and guard model into a set of learnable parameters (soft prompts) attached to the LLM. […] process, represent soft prompts. Once we have distilled the safe LLM system into these soft prompts W, they are prepended to the sequence of token embeddings of the user prompt and are fed into subsequent layers. When the distillat…

**Figure 3.** Safety-Compute trade-offs when trained on Beavertails or Toxigen, and tested on HarmBench. We report on the y-axis the Safety Guard Score (SGS) according to LlamaGuard3-8B for three variations: the base LLM (red), the safe LLM system with LlamaGuard3-1B in-the-loop (purple), and our proposed distilled LLM with soft prompts (blue). The x-axis shows the test-time compute measured in the number of floating-po…

**Figure 4.** Comparing TV-DiSP against TV-DiSV and TV-DiLoRA. We employ our distillation scheme in Equation 4 to distill the safe LLM system into a steering vector (SV) or a low-rank adapter (LoRA). We conduct a single epoch of training on Beavertails under different learning rates and report SGS on HarmBench. TV-DiSP consistently outperforms TV-DiSV and TV-DiLoRA.

**Figure 5.** Comparing TV-DiSP against other distillation schemes. We compare our proposed total variation objective function to other loss functions in distilling the safe LLM system. We experiment with perplexity optimization, REINFORCE, and KL divergence minimization. The x-axis reports the learning rate used for training, and the y-axis reports the SGS on HarmBench. Left: Llama3-1B and Right: Llama3-3B model is the base…

**Figure 6.** Generalization to in-distribution and out-of-distribution. Left: Results on Beavertails test-set (in distribution). Right: Results on Detect-Jailbreak dataset. Our proposed TV-DiSP provides consistent safety gains on both in- and out-of-distribution settings on two different LLM architectures. (ix) the two distillation variants (KL-DiSP and TV-DiSP) exhibit consistently favorable trade-offs: they deliver …

**Figure 7.** Ablating the impact of the number of soft prompts. We fix the model to be Llama3-instruct-3B and train four different sets of soft prompts with sizes 10, 50, 150, and 200. We follow our training recipe outlined in Section 4.1 and evaluate the SGS on HarmBench (left) and Detect-Jailbreak (right). The larger the number of learnt soft prompts, the larger the safety gains. base weights remain frozen…

**Figure 8.** Comparing TV-DiSP against other distillation schemes in terms of usefulness vs. safety. The x-axis reports the usefulness (5-shot in-context learning accuracy on the MMLU benchmark), and the y-axis shows the SGS measured by Llama3 Guard - 8B for our proposed TV-DiSP, against REINFORCE and KL. While KL distillation can achieve a better SGS than TV distillation, it comes at a significant cost on the LLM’s …

**Figure 9.** Convergence curves for different soft prompt sizes (10, 100, 200). Loss stabilizes after …

**Figure 10.** Example 1 (top): Prompt asking about illegal activities. Example 2 (bottom): Prompt encouraging unethical …

**Figure 11.** Training with TV-DiSP under four different seeds. TV-DiSP consistently converges under all seeds.
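The Figure 2 caption describes the soft prompts W as being prepended to the token embeddings of the user prompt. A minimal sketch of that single step follows, using plain Python lists in place of tensors; the function name, the toy values, and the shapes are illustrative assumptions, not the authors' code.

```python
def prepend_soft_prompts(soft_prompts, token_embeddings):
    """Return the sequence [W ; E]: learned prompt vectors, then user tokens."""
    dims = {len(vec) for vec in soft_prompts + token_embeddings}
    if len(dims) > 1:
        raise ValueError("all vectors must share one embedding dimension")
    return soft_prompts + token_embeddings


# Toy example: 2 soft prompts and 3 user tokens, embedding dimension 4.
W = [[0.1] * 4, [0.2] * 4]                # learned prompt vectors
E = [[1.0] * 4, [2.0] * 4, [3.0] * 4]     # embeddings of the user prompt

sequence = prepend_soft_prompts(W, E)     # this is what the later layers see
```

In a tensor library the same step is a concatenation along the sequence axis, with gradients flowing only into W if the base model is kept frozen.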

Abstract

Deploying safe large language models (LLMs) on resource-constrained edge devices presents a critical challenge: while dual-model systems combining LLMs with guard models provide effective safety guarantees, their substantial memory and computational demands make them prohibitively expensive for on-device deployment. This paper presents a comprehensive study of parameter-efficient safety alignment methods for resource-constrained settings. Through systematic evaluation across multiple LLM architectures, training objectives, and parameter-efficient fine-tuning approaches, we identify that soft prompts combined with distillation-based training consistently outperform alternative methods. We introduce distillation frameworks based on total variation and KL divergence that effectively transfer safety behaviors from guard models into learned soft prompts. Our evaluations on various benchmarks demonstrate that this combination achieves superior safety-usefulness trade-offs compared to LoRA adapters, steering vectors, and direct optimization methods, while requiring minimal additional memory and compute at inference time. These findings establish soft prompt distillation as the preferred approach for safety alignment in on-device LLM deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Soft prompt distillation from guard models beats LoRA and steering vectors on the reported benchmarks, but the outperformance claim rests on details that are not shown and on weak generalization checks.

The paper's core finding is that soft prompts trained with total-variation and KL distillation from a guard model give better safety-usefulness trade-offs than LoRA adapters, steering vectors, or direct optimization, all while adding almost no inference cost. That is the practical takeaway for on-device work.

What stands out is the systematic comparison across multiple LLM architectures and PEFT methods. The authors actually run the same safety transfer task under several training objectives and report that the distillation route wins. They also spell out two concrete distillation losses instead of leaving the transfer method vague. That level of side-by-side evaluation is useful even if the absolute numbers are not yet public.

The soft spots are exactly where the stress-test note flags them. The abstract claims superiority on "various benchmarks" but gives no error bars, no table of results, and no description of how the test prompts differ from the distillation data. If the benchmarks stay inside the same distribution, the safety-usefulness edge could shrink under real user prompts or jailbreak attempts. The paper also does not show transfer to an unseen model family, which matters for on-device claims. Those gaps are not fatal but they are load-bearing for the main recommendation.

This is the kind of work that belongs in an applied ML venue rather than a theory track. Practitioners who already run guard-model setups and want to cut memory will get immediate value from the comparison, even if they have to re-run the numbers themselves. The thinking is straightforward and the literature is engaged; there is no obvious internal contradiction.

I would send it to peer review. The experiments need more transparency and at least one OOD or adversarial section, but the framing and the method choice are solid enough to justify referee time.

Referee Report

2 major / 0 minor

Summary. The paper claims that soft prompts trained via distillation (using total variation and KL divergence objectives) from guard models outperform LoRA adapters, steering vectors, and direct optimization for safety alignment of LLMs in on-device settings. It reports superior safety-usefulness trade-offs on various benchmarks while incurring only minimal additional memory and compute at inference time, positioning soft-prompt distillation as the preferred method over dual-model guard systems.

Significance. If the empirical results hold under broader testing, the approach could meaningfully reduce the resource cost of safe LLM deployment on edge devices by transferring guard-model behaviors into lightweight soft prompts without requiring a second model at inference.

major comments (2)

[Abstract / Evaluation] The central claim of reliable safety-usefulness trade-offs rests on generalization beyond the evaluated benchmarks, yet the abstract provides no indication that test distributions include adversarial jailbreaks, real-world prompt diversity, or distribution shift to unseen model architectures (as flagged in the stress-test note).
[Abstract] Soundness is limited because the provided text contains only high-level claims with no quantitative results, error bars, benchmark details, or statistical evidence; without these, it is impossible to verify whether the reported outperformance over baselines is statistically meaningful or reproducible.
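The error bars the referee asks for can be as simple as a mean and standard deviation over training seeds. The sketch below shows that computation; the scores are placeholders invented for illustration and are not results from the paper, and the `summarize` helper is a hypothetical name.

```python
import statistics


def summarize(scores):
    """Mean and sample standard deviation of a metric across seeds."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


# Placeholder scores for four seeds; NOT results from the paper.
placeholder_sgs = [0.80, 0.82, 0.78, 0.80]
mean_sgs, std_sgs = summarize(placeholder_sgs)
report = f"SGS = {mean_sgs:.3f} +/- {std_sgs:.3f} (n={len(placeholder_sgs)})"
```

Reporting the number of seeds alongside the spread lets a reader judge whether a gap between two methods is larger than run-to-run noise.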

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on the manuscript. We address each major comment below and have revised the abstract where it improves clarity on the evaluation scope without altering the paper's core claims.

Referee: [Abstract / Evaluation] The central claim of reliable safety-usefulness trade-offs rests on generalization beyond the evaluated benchmarks, yet the abstract provides no indication that test distributions include adversarial jailbreaks, real-world prompt diversity, or distribution shift to unseen model architectures (as flagged in the stress-test note).

Authors: The paper's experimental section evaluates across multiple LLM architectures using benchmarks that explicitly include adversarial jailbreaks and diverse real-world-style prompts. The stress-test note acknowledges that results on completely unseen architectures represent a limitation rather than a full guarantee. We have revised the abstract to explicitly note the inclusion of adversarial testing and the range of benchmarks and architectures evaluated, while avoiding overstatement of generalization.

revision: partial

Referee: [Abstract] Soundness is limited because the provided text contains only high-level claims with no quantitative results, error bars, benchmark details, or statistical evidence; without these, it is impossible to verify whether the reported outperformance over baselines is statistically meaningful or reproducible.

Authors: Abstracts are designed as concise overviews, and standard practice in the field omits detailed statistics to meet length constraints. The full manuscript (Sections 4–5 and associated tables) reports quantitative results with means, standard deviations across multiple runs, benchmark specifications, and direct comparisons to LoRA, steering vectors, and direct optimization, enabling verification of statistical meaningfulness and reproducibility.

revision: no

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivation chain

The paper is an empirical study comparing soft-prompt distillation (via total variation and KL) against LoRA, steering vectors, and direct optimization on safety-usefulness benchmarks. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing uniqueness theorems appear in the provided text or abstract. All claims rest on reported benchmark results rather than any reduction of outputs to inputs by construction. Generalization to unseen distributions is a validity concern, not a circularity issue. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are specified or required by the high-level claims.

[1] [3]

The malicious use of artificial intelligence: Forecasting

Miles Brundage, Shahar Avin, Jack Clark, Helen Toner, Peter Eckersley, Ben Garfinkel, Allan Dafoe, Paul Scharre, Thomas Zeitzoff, Bobby Filar, et al. The malicious use of artificial intelligence: Forecasting. Prevention, and Mitigation, 20, 2018

2018

[2] [6]

Information theory: coding theorems for discrete memoryless systems

Imre Csisz \'a r and J \'a nos K \"o rner. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, 2011

2011

[3] [7]

Qlora: Efficient finetuning of quantized llms

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36: 0 10088--10115, 2023

2023

[4] [10]

Figstep: Jailbreaking large vision-language models via typographic visual prompts

Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. Figstep: Jailbreaking large vision-language models via typographic visual prompts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23951--23959, 2025

2025

[5] [13]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1 0 (2): 0 3, 2022

2022

[6] [15]

Beavertails: Towards improved safety alignment of llm via a human-preference dataset

Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. Advances in Neural Information Processing Systems, 36: 0 24678--24704, 2023

2023

[7] [18]

Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel ...

2024

[8] [19]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6: 0 87--100, 2024

2024

[9] [23]

Llama 2 responsible use guide

Meta. Llama 2 responsible use guide. 2024. URL https://ai.meta.com/static-resource/responsible-use-guide/

2024

[10] [25]

Steering llama 2 via contrastive activation addition, 2024

Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition, 2024. URL https://arxiv. org/abs/2312.06681

Pith/arXiv arXiv 2024

[11] [26]

Lecture notes on information theory

Yury Polyanskiy and Yihong Wu. Lecture notes on information theory. Lecture Notes for ECE563 (UIUC) and, 6 0 (2012-2016): 0 7, 2014

2012

[12] [27]

Empirical guidelines for deploying llms onto resource-constrained edge devices

Ruiyang Qin, Dancheng Liu, Chenhui Xu, Zheyu Yan, Zhaoxuan Tan, Zhenge Jia, Amir Nassereldine, Jiajie Li, Meng Jiang, Ahmed Abbasi, et al. Empirical guidelines for deploying llms onto resource-constrained edge devices. ACM Transactions on Design Automation of Electronic Systems, 2024

2024

[13] [29]

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. In ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2024

2024

[14] [31]

Activation addition: Steering language models without optimization

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization. arXiv e-prints, pages arXiv--2308, 2023

2023

[15] [34]

Soft prompt recovers compressed llms, transferably

Zhaozhuo Xu, Zirui Liu, Beidi Chen, Shaochen Zhong, Yuxin Tang, Jue Wang, Kaixiong Zhou, Xia Hu, and Anshumali Shrivastava. Soft prompt recovers compressed llms, transferably. In Forty-first International Conference on Machine Learning, 2024a. URL https://openreview.net/forum?id=muBJPCIqZT

2024

[16] [35]

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models, 2024b

Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, and Stjepan Picek. A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models, 2024b

2024

[17] [36]

Prompt-driven llm safeguarding via directed representation optimization

Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, and Nanyun Peng. Prompt-driven llm safeguarding via directed representation optimization. CoRR, 2024

2024

[18] [37]

Instruction-following evaluation for large language models, 2023

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023. URL https://arxiv.org/abs/2311.07911

2023

[19] [40]

Soft Prompt Recovers Compressed LLMs, Transferably

Soft Prompt Recovers Compressed LLMs, Transferably. Forty-first International Conference on Machine Learning.

[20] [41]

Training language models to follow instructions with human feedback

Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.

[21] [42]

The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation

The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation.

[22] [43]

A holistic approach to undesired content detection in the real world

A holistic approach to undesired content detection in the real world. Proceedings of the AAAI Conference on Artificial Intelligence.

[23] [44]

Llama guard: Llm-based input-output safeguard for human-ai conversations

Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674.

[24] [45]

Granite guardian

Granite guardian. arXiv preprint arXiv:2412.07724.

[25] [46]

Qwen technical report

Qwen technical report. arXiv preprint arXiv:2309.16609.

[26] [47]

Gemma 2: Improving open language models at a practical size

Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.

[27] [48]

Foundational challenges in assuring alignment and safety of large language models

Foundational challenges in assuring alignment and safety of large language models. arXiv preprint arXiv:2404.09932.

[28] [49]

Empirical guidelines for deploying llms onto resource-constrained edge devices

Empirical guidelines for deploying llms onto resource-constrained edge devices. ACM Transactions on Design Automation of Electronic Systems.

[29] [50]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration

Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems.

[30] [51]

Llama Guard 3-1B-INT4: Compact and Efficient Safeguard for Human-AI Conversations

Llama Guard 3-1B-INT4: Compact and Efficient Safeguard for Human-AI Conversations. arXiv preprint arXiv:2411.17713.

[31] [52]

Lecture notes on information theory

Lecture notes on information theory. Lecture Notes for ECE563 (UIUC) and, 2014.

2014

[32] [53]

Training a helpful and harmless assistant with reinforcement learning from human feedback

Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

[33] [54]

Llama 2 responsible use guide

[34] [55]

Prp: Propagating universal perturbations to attack large language model guard-rails

Prp: Propagating universal perturbations to attack large language model guard-rails. arXiv preprint arXiv:2402.15911.

[35] [56]

Measuring massive multitask language understanding

Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

[36] [57]

Autodan: Generating stealthy jailbreak prompts on aligned large language models

Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451.

[37] [58]

Figstep: Jailbreaking large vision-language models via typographic visual prompts

Figstep: Jailbreaking large vision-language models via typographic visual prompts. Proceedings of the AAAI Conference on Artificial Intelligence.

[38] [59]

Universal and transferable adversarial attacks on aligned language models

Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

[39] [60]

Harmbench: A standardized evaluation framework for automated red teaming and robust refusal

Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249.

[40] [61]

Jailbreakbench: An open robustness benchmark for jailbreaking large language models

Jailbreakbench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318.

[41] [62]

Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection

Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509.

[42] [63]

Safe rlhf: Safe reinforcement learning from human feedback

Safe rlhf: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773.

[43] [64]

Rlhf workflow: From reward modeling to online rlhf

Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863.

[44] [65]

Lora: Low-rank adaptation of large language models

Lora: Low-rank adaptation of large language models. ICLR.

[45] [66]

Qlora: Efficient finetuning of quantized llms

Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems.

[46] [68]

Activation addition: Steering language models without optimization

Activation addition: Steering language models without optimization. arXiv e-prints.

[47] [69]

Steering llama 2 via contrastive activation addition, 2024

2024

[48] [70]

Trojan activation attack: Red-teaming large language models using activation steering for safety-alignment

Trojan activation attack: Red-teaming large language models using activation steering for safety-alignment. arXiv preprint arXiv:2311.09433.

[49] [71]

Beavertails: Towards improved safety alignment of llm via a human-preference dataset

Beavertails: Towards improved safety alignment of llm via a human-preference dataset. Advances in Neural Information Processing Systems.

[50] [72]

Information theory: coding theorems for discrete memoryless systems

Information theory: coding theorems for discrete memoryless systems. 2011.

2011

[51] [73]

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang.

[52] [74]

Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, and Stjepan Picek. 2024.

2024

[53] [75]

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning. 2024.

2024

[54] [76]

Adam: A Method for Stochastic Optimization

Adam: A Method for Stochastic Optimization. 2017.

2017

[55] [77]

An llm can fool itself: A prompt-based adversarial attack

An llm can fool itself: A prompt-based adversarial attack. arXiv preprint arXiv:2310.13345.

[56] [78]

Proximal Policy Optimization Algorithms

Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.

[57] [79]

Adam: A Method for Stochastic Optimization

Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.

[58] [80]

Prompt-driven llm safeguarding via directed representation optimization

Prompt-driven llm safeguarding via directed representation optimization. CoRR.

[59] [81]

Improving alignment and robustness with circuit breakers, 2024

Improving alignment and robustness with circuit breakers, 2024. URL https://arxiv.org/abs/2406.04313

2024

[60] [82]

Instruction-Following Evaluation for Large Language Models

Instruction-Following Evaluation for Large Language Models. 2023.

2023

[61] [83]

Training Verifiers to Solve Math Word Problems

Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.