SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

Hao Li; Hao Wang; Jin-Ge Yao; Jingkun An; Lei Sha; Lijun Li; Pengyu Zhu; Rui Li; Wendi Feng; Yesheng Liu

arxiv: 2606.02530 · v1 · pith:ZFDOCI4Cnew · submitted 2026-06-01 · 💻 cs.AI · cs.CL

Hao Li, Jingkun An, Zijun Song, Pengyu Zhu, Rui Li, Hao Wang, Wendi Feng, Yesheng Liu, Lijun Li, Jin-Ge Yao, Lei Sha

Pith reviewed 2026-06-28 14:36 UTC · model grok-4.3

keywords: safety alignment, large language models, on-policy distillation, activation steering, token selection, alignment tax, reverse KL divergence, localized alignment

The pith

SafeSteer aligns LLMs to safety by distilling only on sparse safety tokens, avoiding broad capability trade-offs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that safety alignment can be achieved through localized modifications to the output distribution rather than global adjustments that degrade general capabilities. This approach matters because existing methods often require large amounts of general-purpose data or auxiliary models, incurring high costs and an alignment tax. SafeSteer constructs a safety teacher using activation steering, selects safety tokens, and applies on-policy distillation with reverse KL penalty restricted to those tokens. It demonstrates this using only 100 harmful samples without any general data, achieving strong safety on benchmarks while preserving capabilities.
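The first step, building a safety teacher via activation steering, can be illustrated with a toy sketch. The abstract does not say how the steering vector is obtained or where it is applied, so the difference-of-means direction, the `alpha` scale, and all function names below are assumptions for illustration, not the paper's procedure.

```python
from typing import List

Vector = List[float]

def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_direction(harmful: List[Vector], harmless: List[Vector]) -> Vector:
    """ASSUMPTION: difference-of-means direction (harmful minus harmless)."""
    mu_harmful = mean_vector(harmful)
    mu_harmless = mean_vector(harmless)
    return [a - b for a, b in zip(mu_harmful, mu_harmless)]

def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the direction: h' = h + alpha * v."""
    return [h + alpha * v for h, v in zip(hidden, direction)]

# Toy 2-dimensional activations (hypothetical values).
harmful_acts = [[1.0, 0.0], [3.0, 2.0]]
harmless_acts = [[0.0, 0.0], [0.0, 2.0]]

v = steering_direction(harmful_acts, harmless_acts)
steered = steer([0.5, 0.5], v, alpha=2.0)
```

In a real model the shift would be applied to a chosen layer's hidden states at inference time, and the steered model would then serve as the teacher; which layer and which scale to use are not stated in the text above.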

Core claim

SafeSteer performs on-policy distillation confined to safety tokens. It first builds a safety teacher via activation steering and develops a safety token selection algorithm. The reverse KL penalty is then restricted to these tokens during training, which allows strong safety performance on seven benchmarks with minimal degradation on five general capability benchmarks using only 100 harmful samples and no general-purpose data.
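A minimal sketch of what restricting the reverse KL penalty to selected tokens could look like, using plain Python lists in place of model logits. The mask layout, the averaging over selected positions, and the function names are assumptions; in on-policy distillation the sequences would be sampled from the student, which this toy example does not model.

```python
import math
from typing import List, Sequence

def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reverse_kl(student_logits: Sequence[float], teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) at a single position."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def localized_reverse_kl(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
    mask: Sequence[int],
) -> float:
    """Average reverse KL over positions where mask is 1; other positions are ignored."""
    selected = [reverse_kl(s, t) for s, t, m in zip(student, teacher, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0

# Three positions, vocabulary of size three (hypothetical logits).
student = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
teacher = [[0.0, 2.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
mask = [1, 0, 1]  # positions 0 and 2 are treated as safety tokens

loss = localized_reverse_kl(student, teacher, mask)
```

Position 1 differs between student and teacher but is masked out, so it contributes nothing to the loss; that is the sense in which the penalty is localized.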

What carries the argument

The mechanism of restricting the reverse KL penalty to a small set of selected safety tokens during on-policy distillation.
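One way to write this mechanism down, using the notation of the Figure 2 caption (student πs, teacher πt, safety-token set S). The abstract gives no formula, so the index set, the normalization, and the prompt distribution symbol are assumptions.

```latex
\mathcal{L}_{\text{local}}
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_s(\cdot \mid x)}
    \left[
      \frac{1}{|\mathcal{I}(y)|}
      \sum_{i \in \mathcal{I}(y)}
      D_{\mathrm{KL}}\!\left(
        \pi_s(\cdot \mid x, y_{<i}) \,\middle\|\, \pi_t(\cdot \mid x, y_{<i})
      \right)
    \right],
\qquad
\mathcal{I}(y) = \{\, i : y_i \in S \,\}
```

Here the expectation over y drawn from πs reflects the on-policy sampling, and positions outside the index set receive no penalty.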

Load-bearing premise

Safety features are inherently sparse within the output distribution so that changes limited to a small set of safety tokens suffice for alignment without global capability trade-offs.
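The Figure 2 caption says safety tokens are selected "by contrastive log probability from πt responses". A toy sketch of such a selection follows, scoring each token of a teacher response by the gap between teacher and base-model log-probabilities and keeping those above a threshold. The threshold `tau`, the function name, and the numbers are hypothetical; the paper's actual selection algorithm is not described in the text above.

```python
from typing import List, Tuple

def select_safety_tokens(
    tokens: List[str],
    teacher_logprobs: List[float],
    base_logprobs: List[float],
    tau: float,
) -> List[Tuple[int, str, float]]:
    """Return (position, token, score) for tokens whose teacher-minus-base
    log-probability gap exceeds tau."""
    picked = []
    for i, (tok, lt, lb) in enumerate(zip(tokens, teacher_logprobs, base_logprobs)):
        score = lt - lb
        if score > tau:
            picked.append((i, tok, score))
    return picked

# One teacher response with made-up per-token log-probabilities.
tokens = ["I", "cannot", "help", "with", "that", "."]
teacher_lp = [-0.5, -0.25, -1.0, -0.5, -0.5, -0.25]
base_lp = [-0.75, -4.25, -1.25, -0.5, -0.75, -0.25]

picked = select_safety_tokens(tokens, teacher_lp, base_lp, tau=1.0)
selected_fraction = len(picked) / len(tokens)
```

If the sparsity premise holds, `selected_fraction` should stay small across many responses; a fraction close to 1 would mean the "localized" penalty is effectively global.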

What would settle it

A test where applying the reverse KL penalty only to the selected safety tokens fails to improve safety benchmark scores or where general capability benchmarks show significant degradation despite the localization.

Figures

Figures reproduced from arXiv: 2606.02530 by Hao Li, Hao Wang, Jin-Ge Yao, Jingkun An, Lei Sha, Lijun Li, Pengyu Zhu, Rui Li, Wendi Feng, Yesheng Liu, Zijun Song.

**Figure 1.** Safety–capability trade-off on Qwen2.5-7B-Instruct. Each point is a method, with the gray point marking the base model. Our SafeSteer achieves the highest safety score while preserving general capability.

**Figure 2.** SafeSteer pipeline: (1) construct a safety teacher πt via activation steering, (2) select safety tokens S by contrastive log probability from πt responses, and (3) distill πt into πs with a token-level localized reverse KL on S.

**Figure 3.** PCA projection of hidden states for π0, πt, and πs on Llama family. The activation-steered πt is overrefusing on these harmless prompts (see Appendix D). SafeSteer acquires safety behaviors from πt without inducing a representation shift on general capabilities. Results for other models are shown in Appendix E.

Table text extracted with this caption (temperature 0; truncated in the source):

| Method | Q3-4B | Q2.5-7B | L3.2-3B | L3-8B |
| --- | --- | --- | --- | --- |
| SafeSteer (full) | 59.03 | 53.76 | 44.92 | 45.49 |
| w/ fo… | | | | |

**Figure 4.** Safety token distribution of Qwen2.5-7B-Instruct under different response lengths. Token size reflects its …

**Figure 5.** PCA projection of hidden states for π0, πt, and πs on Qwen family. SafeSteer acquires safety behaviors from πt without inducing a representation shift on general capabilities.

Table text extracted with this caption (metric not given in the extracted text):

| Model | π0 | πs | πt (regex) | πt (LLM-judge) |
| --- | --- | --- | --- | --- |
| Qwen3-4B-Instruct | 2.24 | 3.98 | 96.27 | 98.39 |
| Qwen2.5-7B-Instruct | 1.37 | 2.73 | 94.91 | 96.52 |
| Llama-3.2-3B-Instruct | 2.11 | 2.24 | 100.00 | 100.00 |
| Llama-3-8B-Instruct | 1.61 | 1.24 | 99.88 | 100.00 |

**Figure 6.** Safety token distribution of Qwen3-4B-Instruct under different response lengths.

**Figure 7.** Safety token distribution of Llama-3-8B-Instruct under different response lengths.

**Figure 8.** Safety token distribution of Llama-3.2-3B-Instruct under different response lengths.

Original abstract

Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing methods mitigate this by balancing dual objectives, which heavily rely on massive general-purpose data or auxiliary reward models. In this paper, we argue that, because safety features are inherently sparse within the output distribution, alignment requires localized modifications rather than global trade-offs. To this end, we propose SafeSteer, which performs on-policy distillation confined to safety tokens. First, we construct a safety teacher via activation steering. Based on this teacher, we develop a safety token selection algorithm. Consequently, SafeSteer restricts the reverse KL penalty to these tokens during training to preserve general capabilities. Experimental results across diverse models show that our SafeSteer achieves a superior trade-off between safety and general capability compared with existing methods, attaining strong safety performance on seven safety benchmarks with only minimal degradation on five general capability benchmarks. Notably, SafeSteer requires only 100 harmful samples without using any general-purpose data, less than 1% of what previous baselines used, considerably reducing alignment cost. More details are on our project page at https://anjingkun.github.io/SafeSteer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SafeSteer claims localized distillation on safety tokens via activation steering cuts alignment data to 100 samples with little capability loss, but the sparsity premise still needs coverage checks.

The new piece here is restricting reverse KL during on-policy distillation to a small set of tokens picked by an activation-steered teacher. That is a concrete alternative to the usual global balancing with large general-purpose datasets.

The experiments look like the strongest part: they run the method on several models, report results on seven safety benchmarks and five capability ones, and show the safety gains come with only small drops while using under 1% of the data previous baselines needed. If those numbers hold up with proper controls, the data-efficiency angle is practically useful.

The soft spot is the sparsity assumption. The paper treats safety features as localized enough that penalizing only the selected tokens is sufficient, but there is no reported measurement of what fraction of harmful continuations on the benchmarks actually pass through those tokens. Without that coverage check or an ablation on token selection quality, it is possible the safety results are incomplete rather than truly localized. The abstract gives no quantitative details on baselines or verification steps, so the full paper needs to supply those.

This is for researchers working on cheap safety tuning for deployed models. A reader who cares about practical alignment cost would get value from the token-selection idea and the low-data results, provided the coverage evidence is there.

It deserves a serious referee because the data claim is sharp enough to test and the method is reproducible enough to check.

Referee Report

2 major / 0 minor

Summary. The paper proposes SafeSteer for LLM safety alignment via on-policy distillation restricted to a small set of safety tokens. A safety teacher is built using activation steering, followed by a token selection algorithm; the reverse KL penalty is then applied only to the selected tokens during training. The central claim is that this localized approach exploits inherent sparsity of safety features to achieve strong performance on seven safety benchmarks while incurring only minimal degradation on five general capability benchmarks, using just 100 harmful samples and no general-purpose data.

Significance. If the sparsity premise is validated and the selected tokens demonstrably cover the relevant harmful probability mass, the method would represent a meaningful reduction in alignment data requirements and capability trade-offs compared to global balancing approaches, potentially lowering the cost of safety alignment substantially.

major comments (2)

[Abstract] The claim that restricting the reverse KL penalty to the selected safety tokens suffices for alignment rests on the unverified premise that safety features are sparse enough that non-selected tokens carry negligible safety-relevant mass. No coverage metric is reported (e.g., the fraction of harmful continuations on the seven safety benchmarks that actually contain the chosen tokens), leaving open the possibility that the observed safety gains are an artifact of incomplete enforcement rather than true localization.

[Abstract] The abstract asserts a 'superior trade-off' and 'strong safety performance' with 'minimal degradation' but supplies no quantitative numbers, baseline comparisons, or verification steps. Without these details in the experimental sections, it is impossible to determine whether the reported results actually support the central claim of negligible capability loss.
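The coverage metric requested in the first comment (the fraction of continuations that contain at least one selected token) could be computed along these lines. This is a sketch only: exact string matching on pre-tokenized text, the function name, and the toy data are all assumptions, and a real check would use the model's tokenizer and the benchmark outputs.

```python
from typing import Iterable, List, Set

def coverage(continuations: Iterable[List[str]], safety_tokens: Set[str]) -> float:
    """Fraction of continuations containing at least one selected safety token."""
    items = list(continuations)
    if not items:
        return 0.0
    hits = sum(1 for toks in items if any(t in safety_tokens for t in toks))
    return hits / len(items)

# Hypothetical tokenized continuations and a hypothetical selected-token set.
continuations = [
    ["Sure", ",", "here", "is"],
    ["I", "cannot", "assist"],
    ["Step", "1", ":"],
    ["Sorry", ",", "but"],
]
selected = {"cannot", "Sorry"}

cov = coverage(continuations, selected)
```

A low value on the safety benchmarks would support the referee's concern that the localized penalty never touches most of the relevant continuations.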

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below.

Referee: [Abstract] The claim that restricting the reverse KL penalty to the selected safety tokens suffices for alignment rests on the unverified premise that safety features are sparse enough that non-selected tokens carry negligible safety-relevant mass. No coverage metric is reported (e.g., the fraction of harmful continuations on the seven safety benchmarks that actually contain the chosen tokens), leaving open the possibility that the observed safety gains are an artifact of incomplete enforcement rather than true localization.

Authors: We agree that a coverage metric would provide stronger direct evidence for the sparsity premise and rule out incomplete enforcement. While our results demonstrate effective safety gains using only 100 harmful samples, we will add an explicit coverage analysis (fraction of harmful continuations containing selected tokens) to the experiments in the revised manuscript.

Revision: yes

Referee: [Abstract] The abstract asserts a 'superior trade-off' and 'strong safety performance' with 'minimal degradation' but supplies no quantitative numbers, baseline comparisons, or verification steps. Without these details in the experimental sections, it is impossible to determine whether the reported results actually support the central claim of negligible capability loss.

Authors: The experimental sections already report quantitative results, baseline comparisons against prior methods, and verification across the seven safety and five capability benchmarks. To make the abstract claims more self-contained, we will revise it to include key quantitative figures (e.g., safety scores and capability retention rates) while preserving conciseness.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with external benchmarks

The paper presents an empirical method (activation-steered teacher + token selection + localized reverse KL) whose performance claims rest on held-out safety and capability benchmarks, not on any fitted parameter being renamed as a prediction or on self-referential definitions. No equations, self-citations, or uniqueness theorems are invoked in the provided text that reduce the central result to its own inputs by construction. The sparsity premise is an explicit modeling assumption whose validity is tested via downstream benchmarks rather than assumed tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; sparsity of safety features is treated as a domain assumption rather than a derived quantity.

pith-pipeline@v0.9.1-grok · 5770 in / 1006 out tokens · 18087 ms · 2026-06-28T14:36:58.918538+00:00 · methodology


2025

[25] [25]

Jailbroken: How Does

Alexander Wei and Nika Haghtalab and Jacob Steinhardt , booktitle=. Jailbroken: How Does. 2023 , url=

2023

[26] [26]

arXiv preprint arXiv:2303.17564 , year=

Bloomberggpt: A large language model for finance , author=. arXiv preprint arXiv:2303.17564 , year=

Pith/arXiv arXiv

[27] [27]

Edward J Hu and yelong shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , booktitle=. Lo. 2022 , url=

2022

[28] [28]

2024 , cdate=

Xiangyu Qi and Yi Zeng and Tinghao Xie and Pin-Yu Chen and Ruoxi Jia and Prateek Mittal and Peter Henderson , title=. 2024 , cdate=

2024

[29] [29]

Proceedings of the 40th International Conference on Machine Learning , pages=

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning ,author =. Proceedings of the 40th International Conference on Machine Learning , pages=. 2023 , URL=

2023

[30] [30]

arXiv e-prints , pages=

Increased llm vulnerabilities from fine-tuning and quantization , author=. arXiv e-prints , pages=

[31] [31]

arXiv preprint arXiv:2409.18169 , year=

Harmful fine-tuning attacks and defenses for large language models: A survey , author=. arXiv preprint arXiv:2409.18169 , year=

Pith/arXiv arXiv

[32] [32]

arXiv preprint arXiv:2410.04524 , year=

Towards secure tuning: Mitigating security risks arising from benign instruction fine-tuning , author=. arXiv preprint arXiv:2410.04524 , year=

arXiv

[33] [33]

The Thirteenth International Conference on Learning Representations , year=

Spurious Forgetting in Continual Learning of Language Models , author=. The Thirteenth International Conference on Learning Representations , year=

[34] [34]

arXiv preprint arXiv:2412.19512 , year=

Safeguard Fine-Tuned LLMs Through Pre-and Post-Tuning Model Merging , author=. arXiv preprint arXiv:2412.19512 , year=

arXiv

[35] [35]

Advances in Neural Information Processing Systems , volume=

Safe loRA: The silver lining of reducing safety risks when finetuning large language models , author=. Advances in Neural Information Processing Systems , volume=

[36] [36]

Aladin Djuhera and Swanand Kadhe and Farhan Ahmed and Syed Zawad and Holger Boche , booktitle=. Safe. 2025 , url=

2025

[37] [37]

Mingjie Li and Wai Man Si and Michael Backes and Yang Zhang and Yisen Wang , booktitle=. SaLo. 2025 , url=

2025

[38] [38]

2024 , booktitle=

Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack , author=. 2024 , booktitle=

2024

[39] [39]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Representation Noising: A Defence Mechanism Against Harmful Finetuning , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

[40] [40]

Understanding and Enhancing Safety Mechanisms of

Yiran Zhao and Wenxuan Zhang and Yuxi Xie and Anirudh Goyal and Kenji Kawaguchi and Michael Shieh , booktitle=. Understanding and Enhancing Safety Mechanisms of. 2025 , url=

2025

[41] [41]

First Conference on Language Modeling , year=

What is in Your Safe Data? Identifying Benign Data that Breaks Safety , author=. First Conference on Language Modeling , year=

[42] [42]

2025 , url=

Han Shen and Pin-Yu Chen and Payel Das and Tianyi Chen , booktitle=. 2025 , url=

2025

[43] [43]

Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining , pages=

A new generation of perspective api: Efficient multilingual character-level transformers , author=. Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining , pages=

[44] [44]

2024 , eprint =

The Llama 3 Herd of Models , author =. 2024 , eprint =

2024

[45] [45]

2024 , eprint=

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs , author=. 2024 , eprint=

2024

[46] [46]

arXiv preprint arXiv:2407.21772 , year=

Shieldgemma: Generative ai content moderation based on gemma , author=. arXiv preprint arXiv:2407.21772 , year=

Pith/arXiv arXiv

[47] [47]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[48] [48]

arXiv preprint arXiv:2502.11411 , year=

Detecting and Filtering Unsafe Training Data via Data Attribution , author=. arXiv preprint arXiv:2502.11411 , year=

arXiv

[49] [49]

Safety Layers in Aligned Large Language Models: The Key to

Shen Li and Liuyi Yao and Lan Zhang and Yaliang Li , booktitle=. Safety Layers in Aligned Large Language Models: The Key to. 2025 , url=

2025

[50] [50]

arXiv preprint arXiv:2502.09674 , year=

The hidden dimensions of llm alignment: A multi-dimensional safety analysis , author=. arXiv preprint arXiv:2502.09674 , year=

arXiv

[51] [51]

2024 , eprint=

Improving Alignment and Robustness with Circuit Breakers , author=. 2024 , eprint=

2024

[52] [52]

arXiv preprint arXiv:2410.10700 , year=

Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues , author=. arXiv preprint arXiv:2410.10700 , year=

arXiv

[53] [53]

arXiv preprint arXiv:2307.15043 , year=

Universal and transferable adversarial attacks on aligned language models , author=. arXiv preprint arXiv:2307.15043 , year=

Pith/arXiv arXiv

[54] [54]

Kaifeng Lyu and Haoyu Zhao and Xinran Gu and Dingli Yu and Anirudh Goyal and Sanjeev Arora , booktitle=. Keeping. 2024 , url=

2024

[55] [55]

2024 , eprint=

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors , author=. 2024 , eprint=

2024

[56] [56]

Advances in Neural Information Processing Systems , volume=

Jailbreakbench: An open robustness benchmark for jailbreaking large language models , author=. Advances in Neural Information Processing Systems , volume=

[57] [57]

arXiv preprint arXiv:2308.09662 , year=

Red-teaming large language models using chain of utterances for safety-alignment , author=. arXiv preprint arXiv:2308.09662 , year=

arXiv

[58] [58]

arXiv preprint arXiv:2404.08676 , year=

ALERT: A comprehensive benchmark for assessing large language models' safety through red teaming , author=. arXiv preprint arXiv:2404.08676 , year=

arXiv

[59] [59]

arXiv preprint arXiv:2009.03300 , year=

Measuring massive multitask language understanding , author=. arXiv preprint arXiv:2009.03300 , year=

Pith/arXiv arXiv 2009

[60] [60]

Advances in Neural Information Processing Systems , volume=

Large language diffusion models , author=. Advances in Neural Information Processing Systems , volume=

[61] [61]

arXiv preprint arXiv:2110.14168 , year=

Training verifiers to solve math word problems , author=. arXiv preprint arXiv:2110.14168 , year=

Pith/arXiv arXiv

[62] [62]

arXiv preprint arXiv:2305.20050 , year=

Let's Verify Step by Step , author=. arXiv preprint arXiv:2305.20050 , year=

Pith/arXiv arXiv

[63] [63]

arXiv preprint arXiv:2107.03374 , year=

Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=

Pith/arXiv arXiv

[64] [64]

Hashimoto , title =

Xuechen Li and Tianyi Zhang and Yann Dubois and Rohan Taori and Ishaan Gulrajani and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , month =

2023

[65] [65]

arXiv preprint arXiv:2502.09990 , year=

X-boundary: Establishing exact safety boundary to shield llms from multi-turn jailbreaks without compromising usability , author=. arXiv preprint arXiv:2502.09990 , year=

arXiv

[66] [66]

arXiv preprint arXiv:2403.12171 , year=

Easyjailbreak: A unified framework for jailbreaking large language models , author=. arXiv preprint arXiv:2403.12171 , year=

arXiv

[67] [67]

arXiv preprint arXiv:2604.13016 , year=

Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe , author=. arXiv preprint arXiv:2604.13016 , year=

Pith/arXiv arXiv

[68] [68]

arXiv preprint arXiv:2507.02844 , year=

Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection , author=. arXiv preprint arXiv:2507.02844 , year=

arXiv

[69] [69]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

T2isafety: Benchmark for assessing fairness, toxicity, and privacy in image generation , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

[70] [70]

arXiv preprint arXiv:2507.05248 , year=

Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models , author=. arXiv preprint arXiv:2507.05248 , year=

arXiv

[71] [71]

Lindsey, Jack and Gurnee, Wes and Ameisen, Emmanuel and Chen, Brian and Pearce, Adam and Turner, Nicholas L. and Citro, Craig and Abrahams, David and Carter, Shan and Hosmer, Basil and Marcus, Jonathan and Sklar, Michael and Templeton, Adly and Bricken, Trenton and McDougall, Callum and Cunningham, Hoagy and Henighan, Thomas and Jermyn, Adam and Jones, An...

[72] [72]

2025 , eprint=

Qwen2.5 Technical Report , author=. 2025 , eprint=

2025

[73] [73]

arXiv preprint arXiv:2408.00118 , year=

Gemma 2: Improving open language models at a practical size , author=. arXiv preprint arXiv:2408.00118 , year=

Pith/arXiv arXiv

[74] [74]

arXiv preprint arXiv:2402.05044 , year=

Salad-bench: A hierarchical and comprehensive safety benchmark for large language models , author=. arXiv preprint arXiv:2402.05044 , year=

arXiv

[75] [75]

2024 , eprint=

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal , author=. 2024 , eprint=

2024

[76] [76]

Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

[77] [77]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) , address=

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) , address=. 2024 , url=

2024

[78] [78]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Adversarial Representation Engineering: A General Model Editing Framework for Large Language Models , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

[79] [79]

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=

[80] [80]

arXiv preprint arXiv:1707.06347 , year=

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

Pith/arXiv arXiv