pith. sign in

arxiv: 2606.00686 · v1 · pith:CIKF44YFnew · submitted 2026-05-30 · 💻 cs.LG

Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing

Pith reviewed 2026-06-28 19:21 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM alignmentmixture of expertsLoRA expertssafety routingunsafe datadynamic gating
0
0 comments X

The pith

SafeMoE isolates unsafe domain knowledge in LoRA experts and routes it through a safety-trained gate to raise both safety and informativeness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard LLM alignment erases unsafe data and thereby limits the model's ability to give nuanced answers. It proposes instead to train separate LoRA experts only on harmful corpora so that domain knowledge remains available. A small gating network, trained on a minimal set of safe responses, then selects which experts to activate at inference time. This routing is claimed to produce higher rates of safe yet informative outputs and to generalize without further supervision. The central evidence is the reported improvement on safety benchmarks together with the observed zero-shot transfer.

Core claim

Training domain-specific LoRA experts exclusively on harmful corpora and orchestrating them with a lightweight gating network trained on a minimal set of safe-informative responses allows the model to harness unsafe knowledge for generation while enforcing safety constraints, yielding over 20 percent relative improvement in safe response rate and strong zero-shot generalization to unseen domains.

What carries the argument

SafeMoE Mixture-of-Experts architecture in which LoRA experts store unsafe domain knowledge and a gating network performs dynamic safety routing.

If this is right

  • Safe response rate rises by more than 15 percent absolute on stringent benchmarks while answers remain more informative.
  • The same routing mechanism transfers to new safety tasks without domain-specific retraining.
  • Unsafe corpora can be retained as a source of domain expertise rather than discarded.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the separation of knowledge storage and routing holds, alignment pipelines could treat safety as a modular control layer rather than a global filter.
  • The method invites direct tests on whether over-refusal rates drop on benign but sensitive queries compared with standard refusal training.
  • Extending the same split to non-text modalities would test whether the unsafe-knowledge-plus-router pattern generalizes beyond language.

Load-bearing premise

A router trained on only a minimal set of safe responses can reliably steer away from harmful outputs produced by experts trained solely on unsafe data.

What would settle it

A held-out harmful prompt from a domain absent from both expert and gate training that nevertheless elicits an unsafe or uninformative response from the routed model.

Figures

Figures reproduced from arXiv: 2606.00686 by Jerry Huang, Marc-Alexandre C\^ot\'e, Maryam Hashemzadeh, Minseon Kim, Sarath Chandar.

Figure 1
Figure 1. Figure 1: Example of unsafe, safe but uninforma￾tive, and safe-informative responses. Model safety often refers to the ability to avoid generating potentially harmful content, whether to oneself or others. However, safe responses can sometimes be vague or overly cautious, lack￾ing the detail needed to satisfy user intent. One such case is refusal, where the model declines to answer out of concern that it could lead … view at source ↗
Figure 2
Figure 2. Figure 2: Stages of SafeMoE. (1) Unsafe experts are trained using a large set of unsafe response data that can be split into domains. (2) Experts are used to train a router with a smaller set of safe response data. In the second stage, the experts are frozen, while only the router is trainable. We introduce SafeMoE, a frame￾work that adapts a base LLM to generate safe yet informative responses by leveraging unsafe d… view at source ↗
Figure 3
Figure 3. Figure 3: Safe samples per domain. To assess the efficiency of router tuning, we ablate on the volume of safe response data (|Ds |). The extreme class im￾balance, e.g. (|Dus| > 104 per category versus |Ds | ≤ 200 total), motivates us to investigate the marginal utility of safe demonstrations. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The performance of SafeMoE models on over-refusal is shown in the plots. As illustrated, SafeMoE models not only maintain high performance on hard categories but also improve safety on toxic categories. In both cases, the informativeness scores remain high. 4.2.7 Meta-Evaluation and Human Validation of the Judge [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Expert Activation Heatmap across safety domains. The non-uniform vertical clustering [PITH_FULL_IMAGE:figures/full_fig_p033_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Mean expert activation percentage across all categories under First-Layer Routing. The [PITH_FULL_IMAGE:figures/full_fig_p034_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Example of unsafe prompts per categories and safe-informative responses. 35 [PITH_FULL_IMAGE:figures/full_fig_p035_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Example of unsafe prompts per categories and safe-informative responses. [PITH_FULL_IMAGE:figures/full_fig_p036_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Example of unsafe prompts per categories and safe-informative responses. [PITH_FULL_IMAGE:figures/full_fig_p037_9.png] view at source ↗
read the original abstract

The prevailing paradigm in large language model (LLM) alignment operates via erasure, filtering unsafe data or training models to strictly refuse harmful prompts. While effective at reducing immediate toxicity, this approach fundamentally constricts the model's epistemological scope, resulting in over-cautious systems that output uninformative blanket refusals to sensitive yet benign queries. In this work, we challenge the orthodoxy that unsafe data must be discarded. We propose a dialectical approach to alignment, positing that unsafe data encodes rich, domain specific knowledge critical for nuanced, safe, and informative generation. To operationalize this, we introduce SafeMoE, a Mixture-of-Experts (MoE) framework that isolates unsafe knowledge into domain-specific Low-Rank Adapters (LoRA experts) trained exclusively on harmful corpora. To synthesize safety from these unsafe primitives, we train a lightweight gating network using a minimal, highly curated set of safe-informative responses. During inference, this router dynamically orchestrates the unsafe experts, effectively steering the generation trajectory to harness their deep domain knowledge while strictly enforcing safety constraints. Extensive empirical evaluations across stringent safety benchmarks demonstrate that SafeMoE is not only safer, achieving over a 20% relative improvement in safe response rate (more than a 15% absolute gain), but also produces more informative responses when safety and harmfulness are of paramount concern. Furthermore, the routing mechanism exhibits strong zero-shot generalization to unseen domains and broader safety tasks without domain-specific supervision. Our findings suggest a paradigm shift in alignment: true safety requires not the masking of unsafe knowledge, but its controlled integration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes SafeMoE, a Mixture-of-Experts framework that trains domain-specific LoRA experts exclusively on harmful corpora to encode unsafe domain knowledge, then trains a lightweight gating network on a minimal set of safe-informative responses to dynamically route these experts at inference time. The central claims are that this yields over 20% relative (15% absolute) gains in safe response rate on safety benchmarks while producing more informative outputs, plus strong zero-shot generalization to unseen domains without domain-specific supervision, challenging erasure-based alignment paradigms.

Significance. If the performance and generalization results hold under proper controls, the work would be significant as an empirical demonstration that unsafe data can be harnessed for both safety and utility rather than discarded, potentially opening a new direction in alignment research focused on controlled knowledge integration rather than refusal training.

major comments (2)
  1. [Abstract] Abstract: the reported >20% relative and >15% absolute gains in safe response rate are presented without any information on baselines, statistical significance, number of runs, data exclusion rules, or experimental controls; these details are load-bearing for the central empirical claim and must be supplied before the gains can be evaluated.
  2. [Abstract] Abstract (and method description): no mechanism, loss term, or metric is described to guarantee that the gate trained on minimal safe responses prevents leakage of unsafe tokens or patterns from the harmful LoRA experts; this separation assumption is load-bearing for both the safety improvement and zero-shot routing claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while noting where clarifications or additions are warranted.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the reported >20% relative and >15% absolute gains in safe response rate are presented without any information on baselines, statistical significance, number of runs, data exclusion rules, or experimental controls; these details are load-bearing for the central empirical claim and must be supplied before the gains can be evaluated.

    Authors: Section 4 of the manuscript provides the full experimental protocol, including the specific baselines (refusal-trained Llama-2-7B, standard RLHF, and domain-specific fine-tuning), five independent runs with mean and standard deviation, paired t-tests for significance (p < 0.01), and explicit data exclusion rules (removal of prompts with >50% overlap to training). The abstract summarizes the outcome rather than the protocol due to length constraints. To make the central claim more evaluable on first reading, we have added one sentence to the abstract referencing the controlled evaluation setup and statistical reporting. revision: yes

  2. Referee: [Abstract] Abstract (and method description): no mechanism, loss term, or metric is described to guarantee that the gate trained on minimal safe responses prevents leakage of unsafe tokens or patterns from the harmful LoRA experts; this separation assumption is load-bearing for both the safety improvement and zero-shot routing claims.

    Authors: The gating network is trained exclusively on safe-informative response pairs; its objective is to select expert combinations that maximize the likelihood of those safe outputs, which by construction discourages routes that would surface unsafe patterns. We agree, however, that no auxiliary loss term (e.g., an explicit safety-classifier penalty) or post-generation leakage metric is stated in the current text. We have therefore expanded the method section with a precise formulation of the routing loss and added a short ablation quantifying token-level leakage (via keyword and classifier checks) on held-out harmful prompts, confirming the separation holds in practice. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The provided abstract and description contain no equations, derivations, or mathematical claims that reduce to fitted parameters or self-citations. The central proposal (SafeMoE framework) and reported gains are presented as empirical outcomes from experiments rather than definitional or self-referential constructs. No load-bearing steps match any enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; evaluation is limited to the high-level description given.

pith-pipeline@v0.9.1-grok · 5833 in / 1179 out tokens · 24345 ms · 2026-06-28T19:21:31.741628+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

85 extracted references · 39 canonical work pages · 10 internal anchors

  1. [1]

    Akiba, M

    T. Akiba, M. Shing, Y . Tang, Q. Sun, and D. Ha. Evolutionary optimization of model merging recipes.Nat. Mac. Intell., 7(2):195–204, 2025. doi: 10.1038/S42256-024-00975-8. URL https://doi.org/10.1038/s42256-024-00975-8

  2. [2]

    Arditi, O

    A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda. Refusal in language models is mediated by a single direction. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors,Ad- vances in Neural Information Processing Systems 38: Annual Conference on Neural In- formation Processing System...

  3. [3]

    G. Bai, J. Liu, X. Bu, Y . He, J. Liu, Z. Zhou, Z. Lin, W. Su, T. Ge, B. Zheng, and W. Ouyang. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In L. Ku, A. Martins, and V . Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ...

  4. [4]

    Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. E. Showk, N. Elhage, Z. Hatfield- Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. B. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kapl...

  5. [5]

    Bavaresco, R

    A. Bavaresco, R. Bernardi, L. Bertolazzi, D. Elliott, R. Fernández, A. Gatt, E. Ghaleb, M. Giu- lianelli, M. Hanna, A. Koller, A. F. T. Martins, P. Mondorf, V . Neplenbroek, S. Pezzelle, B. Plank, D. Schlangen, A. Suglia, A. K. Surikuchi, E. Takmaz, and A. Testoni. Llms instead of human judges? A large scale empirical study across 20 NLP evaluation tasks....

  6. [6]

    Bhardwaj and S

    R. Bhardwaj and S. Poria. Red-teaming large language models using chain of utterances for safety-alignment, 2023. URLhttps://doi.org/10.48550/arXiv.2308.09662

  7. [7]

    Bianchi, M

    F. Bianchi, M. Suzgun, G. Attanasio, P. Röttger, D. Jurafsky, T. Hashimoto, and J. Zou. Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/ f...

  8. [8]

    Bommasani, K

    R. Bommasani, K. Klyman, S. Kapoor, S. Longpre, B. Xiong, N. Maslej, and P. Liang. The 2024 foundation model transparency index.Trans. Mach. Learn. Res., 2025, 2025. URL https://openreview.net/forum?id=38cwP8xVxD. 10

  9. [9]

    L. Cao. Learn to refuse: Making large language models more controllable and reliable through knowledge scope limitation and refusal mechanism. In Y . Al-Onaizan, M. Bansal, and Y . Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Pro- cessing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 3628–3646. Assoc...

  10. [10]

    Casper, K

    S. Casper, K. O’Brien, S. Longpre, E. Seger, K. Klyman, R. Bommasani, A. Nrusimha, I. Shu- mailov, S. Mindermann, S. Basart, F. Rudzicz, K. Pelrine, A. Ghosh, A. Strait, R. Kirk, D. Hendrycks, P. Henderson, J. Z. Kolter, G. Irving, Y . Gal, Y . Bengio, and D. Hadfield-Menell. Open technical problems in open-weight AI model risk management.Trans. Mach. Lea...

  11. [11]

    P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong. Jailbreaking black box large language models in twenty queries. InIEEE Conference on Secure and Trustworthy Machine Learning, SaTML 2025, Copenhagen, Denmark, April 9-11, 2025, pages 23–42. IEEE, 2025. doi: 10.1109/SATML64287.2025.00010. URL https://doi.org/10.1109/ SaTML64287.2025.00010

  12. [12]

    Y . Chen, H. Gao, G. Cui, F. Qi, L. Huang, Z. Liu, and M. Sun. Why should adversarial perturbations be imperceptible? rethink the research paradigm in adversarial NLP. In Y . Gold- berg, Z. Kozareva, and Y . Zhang, editors,Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, D...

  13. [13]

    P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei. Deep re- inforcement learning from human preferences. In I. Guyon, U. von Luxburg, S. Ben- gio, H. M. Wallach, R. Fergus, S. V . N. Vishwanathan, and R. Garnett, editors,Ad- vances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Syst...

  14. [14]

    J. Chu, Y . Liu, Z. Yang, X. Shen, M. Backes, and Y . Zhang. Jailbreakradar: Comprehensive assessment of jailbreak attacks against llms. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August ...

  15. [15]

    J. Cui, W. Chiang, I. Stoica, and C. Hsieh. Or-bench: An over-refusal benchmark for large language models, 2024. URLhttps://doi.org/10.48550/arXiv.2405.20947

  16. [16]

    D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y . Wu, Z. Xie, Y . K. Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. In L. Ku, A. Martins, and V . Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Comp...

  17. [17]

    E. L. Deci, R. Koestner, and R. M. Ryan. A meta-analytic review of experiments examining the effects of extrinsic rewards on intrinsic motivation.Psychological Bulletin, 125(6):627–668, 1999

  18. [18]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024. URLhttps://doi.org/10.48550/arXiv.2405.04434

  19. [19]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. Deepseek-v3 technical report, 2024. URLhttps://doi.org/10.48550/arXiv. 2412.19437. 11

  20. [20]

    DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y . Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. W...

  21. [21]

    Desmond, Z

    M. Desmond, Z. Ashktorab, W. Geyer, E. M. Daly, M. S. Cooper, Q. Pan, R. Nair, N. Wagner, and T. Pedapati. Evalassist: Llm-as-a-judge simplified. In T. Walsh, J. Shah, and Z. Kolter, editors,AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA, pages 29637–29639. AAAI Pres...

  22. [22]

    R. Duan, J. Liu, X. Jia, S. Zhao, R. Cheng, F. Wang, C. Wei, Y . Xie, C. Liu, D. Li, Y . Dong, Y . Zhang, Y . Chen, C. Wang, X. Ma, X. Wei, Y . Liu, H. Su, J. Zhu, X. Li, Y . Sun, J. Zhang, J. Hu, S. Xu, W. Yang, Y . Yang, X. Zhang, Y . Tan, J. Tao, and H. Xue. Oyster-i: Beyond refusal - constructive safety alignment for responsible language models, 2025....

  23. [23]

    S. Duan, X. Yi, P. Zhang, T. Lu, X. Xie, and N. Gu. Denevil: towards deciphering and navigating the ethical values of large language models via instruction learning. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

  24. [24]

    URLhttps://openreview.net/forum?id=m3RRWWFaVe

    OpenReview.net, 2024. URLhttps://openreview.net/forum?id=m3RRWWFaVe

  25. [25]

    Dubois, C

    Y . Dubois, C. X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin, P. Liang, and T. B. Hashimoto. Alpacafarm: A simulation framework for methods that learn from hu- man feedback. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neu- ral Infor...

  26. [26]

    Elhage, T

    N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, R. Grosse, S. McCandlish, J. Kaplan, D. Amodei, M. Wattenberg, and C. Olah. Toy models of superposition, 2022. URL https://transformer-circuits. pub/2022/toy_model/index.html

  27. [27]

    Fedus, B

    W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.J. Mach. Learn. Res., 23:120:1–120:39, 2022. URL https: //jmlr.org/papers/v23/21-0998.html

  28. [28]

    W. Feng, C. Hao, Y . Zhang, Y . Han, and H. Wang. Mixture-of-loras: An efficient multitask tuning method for large language models. In N. Calzolari, M. Kan, V . Hoste, A. Lenci, S. Sakti, and N. Xue, editors,Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20- 25 May,...

  29. [29]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y . Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Conerly, N. DasSarma, D. Drain, N. Elhage, S. E. Showk, S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernandez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei, T. Br...

  30. [30]

    J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y . Shen, S. Ma, H. Liu, Y . Wang, and J. Guo. A survey on llm-as-a-judge, 2024. URL https://doi.org/10.48550/arXiv.2411. 15594

  31. [31]

    S. Han, G. T. Junior, T. Balough, and W. Zhou. Judge’s verdict: A comprehensive analysis of LLM judge capability through human agreement, 2025. URL https://doi.org/10.48550/ arXiv.2510.09738

  32. [32]

    T. Han, A. Kumar, C. Agarwal, and H. Lakkaraju. Medsafetybench: Evaluating and improving the medical safety of large language models. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems 38: Annual Conference on Neural Informa- tion Processing Systems 2024, Neur...

  33. [33]

    C. Hsu, Y . Tsai, C. Lin, P. Chen, C. Yu, and C. Huang. Safe lora: The silver lin- ing of reducing safety risks when finetuning large language models. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors,Ad- vances in Neural Information Processing Systems 38: Annual Conference on Neural In- formation Processing...

  34. [34]

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URLhttps://openreview.net/forum?id=nZeVKeeFYf9

  35. [35]

    Ilharco, M

    G. Ilharco, M. T. Ribeiro, M. Wortsman, L. Schmidt, H. Hajishirzi, and A. Farhadi. Editing models with task arithmetic. InThe Eleventh International Conference on Learning Rep- resentations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=6t0Kwf8-jrj

  36. [36]

    R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts.Neural Comput., 3(1):79–87, 1991. doi: 10.1162/NECO.1991.3.1.79. URL https: //doi.org/10.1162/neco.1991.3.1.79

  37. [37]

    E. Jan, N. AlDahoul, M. Ali, F. Ahmad, F. Zaffar, and Y . Zaki. Multitask mayhem: Unveiling and mitigating safety gaps in llms fine-tuning, 2024. URL https://doi.org/10.48550/ arXiv.2409.15361

  38. [38]

    J. Ji, M. Liu, J. Dai, X. Pan, C. Zhang, C. Bian, B. Chen, R. Sun, Y . Wang, and Y . Yang. Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Ad- vances in Neural Information Processing Systems 36: Annual Conference on Neural In- formation Pr...

  39. [39]

    J. Ji, D. Hong, B. Zhang, B. Chen, J. Dai, B. Zheng, T. A. Qiu, J. Zhou, K. Wang, B. Li, S. Han, Y . Guo, and Y . Yang. Pku-saferlhf: Towards multi-level safety alignment for llms with human preference. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Vo...

  40. [40]

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de Las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b, 2023. URL https://doi.org/10.48550/arXiv.2310.06825. 13

  41. [41]

    A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de Las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mixtral of experts, 2024. URL https...

  42. [42]

    Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu. Pubmedqa: A dataset for biomedical research question answering. In K. Inui, J. Jiang, V . Ng, and X. Wan, editors,Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Interna- tional Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kon...

  43. [43]

    URLhttps://doi.org/10.18653/v1/D19-1259

    doi: 10.18653/V1/D19-1259. URLhttps://doi.org/10.18653/v1/D19-1259

  44. [44]

    X. Jin, X. Ren, D. Preotiuc-Pietro, and P. Cheng. Dataless knowledge fusion by merging weights of language models. InThe Eleventh International Conference on Learning Rep- resentations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=FCnohuR6AnM

  45. [45]

    H. R. Kirk, B. Vidgen, P. Röttger, and S. A. Hale. The benefits, risks and bounds of personalizing the alignment of large language models to individuals.Nat. Mac. Intell., 6(4):383–392, 2024. doi: 10.1038/S42256-024-00820-Y. URLhttps://doi.org/10.1038/s42256-024-00820-y

  46. [46]

    Kumarage, N

    T. Kumarage, N. Mehrabi, A. Ramakrishna, X. Zhao, R. S. Zemel, K. Chang, A. Galstyan, R. Gupta, and C. Peris. Towards safety reasoning in llms: Ai-agentic deliberation for policy- embedded cot data creation. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors, Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria,...

  47. [47]

    Y . Liu, P. Liu, and A. Cohan. On evaluating LLM alignment by evaluating llms as judges, 2025. URLhttps://doi.org/10.48550/arXiv.2511.20604

  48. [48]

    G. F. Loewenstein, E. U. Weber, C. K. Hsee, and N. Welch. Risk as feelings.Psychological bulletin, 127(2):267, 2001

  49. [49]

    Longpre, S

    S. Longpre, S. Biderman, A. Albalak, H. Schoelkopf, D. McDuff, S. Kapoor, K. Klyman, K. Lo, G. Ilharco, N. San, M. Rauh, A. Skowron, B. Vidgen, L. Weidinger, A. Narayanan, V . Sanh, D. I. Adelani, P. Liang, R. Bommasani, P. Henderson, S. Luccioni, Y . Jernite, and L. Soldaini. The responsible foundation model development cheatsheet: A review of tools & re...

  50. [50]

    Loshchilov and F

    I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URLhttps://openreview.net/forum?id=Bkg6RiCqY7

  51. [51]

    N. Lu, S. Liu, J. Wu, W. Chen, Z. Zhang, Y . Ong, Q. Wang, and K. Tang. Safe delta: Consistently preserving safety when fine-tuning llms on diverse datasets. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors,Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Cana...

  52. [52]

    Matena and C

    M. Matena and C. Raffel. Merging models with fisher-weighted averaging. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neu- ral Information Processing Systems 35: Annual Conference on Neural Information Pro- cessing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - Decem- ber 9, 2022, 2022. URL http...

  53. [53]

    Mather and N

    M. Mather and N. R. Lighthall. Risk and reward are processed differently in decisions made under stress.Current directions in psychological science, 21(1):36–41, 2012. 14

  54. [54]

    Mazeika, L

    M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, D. A. Forsyth, and D. Hendrycks. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. InForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URLhttps: //o...

  55. [55]

    GPT-4 technical report, 2023

    OpenAI. GPT-4 technical report, 2023. URL https://doi.org/10.48550/arXiv.2303. 08774

  56. [56]

    Ouyang, J

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Ch...

  57. [57]

    Phute, A

    M. Phute, A. Helbling, M. Hull, S. Peng, S. Szyller, C. Cornelius, and D. H. Chau. LLM self defense: By self examination, llms know they are being tricked. InThe Second Tiny Papers Track at ICLR 2024, Tiny Papers @ ICLR 2024, Vienna, Austria, May 11, 2024. OpenReview.net,

  58. [58]

    URLhttps://openreview.net/forum?id=YoqgcIA19o

  59. [59]

    Rafailov, A

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Di- rect preference optimization: Your language model is secretly a reward model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Ad- vances in Neural Information Processing Systems 36: Annual Conference on Neural In- formation Processing Systems 202...

  60. [60]

    Q. Ren, C. Gao, J. Shao, J. Yan, X. Tan, W. Lam, and L. Ma. Codeattack: Revealing safety generalization challenges of large language models via code completion. In L. Ku, A. Martins, and V . Srikumar, editors,Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 11437–11452. A...

  61. [61]

    Reuel, B

    A. Reuel, B. Bucknall, S. Casper, T. Fist, L. Soder, O. Aarne, L. Hammond, L. Ibrahim, A. Chan, P. Wills, M. Anderljung, B. Garfinkel, L. Heim, A. Trask, G. Mukobi, R. Schaeffer, M. Baker, S. Hooker, I. Solaiman, S. Luccioni, N. Rajkumar, N. Moës, J. Ladish, D. Bau, P. Bricman, N. Guha, J. Newman, Y . Bengio, T. South, A. Pentland, S. Koyejo, M. J. Kochen...

  62. [62]

    Steering Llama 2 via Contrastive Activation Addition

    N. Rimsky, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. M. Turner. Steering llama 2 via contrastive activation addition. In L. Ku, A. Martins, and V . Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 15504–15522....

  63. [63]

    Röttger, H

    P. Röttger, H. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. In K. Duh, H. Gómez- Adorno, and S. Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologi...

  64. [64]

    M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith, and Y . Choi. Social bias frames: Reasoning about social and power implications of language. In D. Jurafsky, J. Chai, N. Schluter, 15 and J. R. Tetreault, editors,Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 5477–5490....

  65. [65]

    Shoemake

    K. Shoemake. Animating rotation with quaternion curves. In P. Cole, R. Heilman, and B. A. Barsky, editors,Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1985, San Francisco, California, USA, July 22-26, 1985, pages 245–254. ACM, 1985. doi: 10.1145/325334.325242. URL https://doi.org/10.1145/ 325334.325242

  66. [66]

    Y . Sung, J. Cho, and M. Bansal. VL-ADAPTER: parameter-efficient transfer learning for vision- and-language tasks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 5217–5227. IEEE, 2022. doi: 10. 1109/CVPR52688.2022.00516. URLhttps://doi.org/10.1109/CVPR52688.2022.00516

  67. [67]

    Q. Team. Qwen1.5-moe: Matching 7b model performance with 1/3 activated parameters, February 2024. URLhttps://qwenlm.github.io/blog/qwen-moe/

  68. [68]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Ba- tra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V . Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V . Kerkez, M. Khabsa, I. Kloumann, A....

  69. [69]

    Zephyr: Direct Distillation of LM Alignment

    L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y . Belkada, S. Huang, L. von Werra, C. Fourrier, N. Habib, N. Sarrazin, O. Sanseviero, A. M. Rush, and T. Wolf. Zephyr: Direct distillation of LM alignment, 2023. URLhttps://doi.org/10.48550/arXiv.2310.16944

  70. [70]

    Y . Wang, Y . Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self- instruct: Aligning language models with self-generated instructions. In A. Rogers, J. L. Boyd- Graber, and N. Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada,...

  71. [71]

    A. Wei, N. Haghtalab, and J. Steinhardt. Jailbroken: How does LLM safety training fail? In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, edi- tors,Advances in Neural Information Processing Systems 36: Annual Conference on Neu- ral Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, Decem- ber 10 - 16, 2023, 2023...

  72. [72]

    Weidinger, J

    L. Weidinger, J. Uesato, M. Rauh, C. Griffin, P. Huang, J. Mellor, A. Glaese, M. Cheng, B. Balle, A. Kasirzadeh, C. Biles, S. Brown, Z. Kenton, W. Hawkins, T. Stepleton, A. Birhane, L. A. Hen- dricks, L. Rimell, W. Isaac, J. Haas, S. Legassick, G. Irving, and I. Gabriel. Taxonomy of risks posed by language models. InFAccT ’22: 2022 ACM Conference on Fairn...

  73. [73]

    Wollschläger, J

    T. Wollschläger, J. Elstner, S. Geisler, V . Cohen-Addad, S. Günnemann, and J. Gasteiger. The geometry of refusal in large language models: Concept cones and representational independence. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors,Forty-second International Conference on Machine Learning,...

  74. [74]

    X. Wu, S. Huang, and F. Wei. Mixture of lora experts. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net,

  75. [75]

    URLhttps://openreview.net/forum?id=uWvKBCYh4S

  76. [76]

    A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, and Z. Qiu. Qwen2....

  77. [77]

    Y . Yuan, W. Jiao, W. Wang, J. Huang, P. He, S. Shi, and Z. Tu. GPT-4 is too smart to be safe: Stealthy chat with llms via cipher. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=MbfAK4s61A

  78. [78]

    Y . Yuan, W. Jiao, W. Wang, J. Huang, J. Xu, T. Liang, P. He, and Z. Tu. Refuse whenever you feel unsafe: Improving safety in llms via decoupled refusal training. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Associa- tion for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna...

  79. [79]

    Zhang, S

    J. Zhang, S. Chen, J. Liu, and J. He. Composing parameter-efficient modules with arith- metic operation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neu- ral Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, Decem- ber 10 -...

  80. [80]

    Zhang, P

    W. Zhang, P. Torr, M. Elhoseiny, and A. Bibi. Bi-factorial preference optimization: Balancing safety-helpfulness in language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=GjM61KRiTG

Showing first 80 references.