Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing

Jerry Huang; Marc-Alexandre C\^ot\'e; Maryam Hashemzadeh; Minseon Kim; Sarath Chandar

arxiv: 2606.00686 · v1 · pith:CIKF44YFnew · submitted 2026-05-30 · 💻 cs.LG

Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing

Maryam Hashemzadeh , Jerry Huang , Minseon Kim , Marc-Alexandre C\^ot\'e , Sarath Chandar This is my paper

Pith reviewed 2026-06-28 19:21 UTC · model grok-4.3

classification 💻 cs.LG

keywords LLM alignmentmixture of expertsLoRA expertssafety routingunsafe datadynamic gating

0 comments

The pith

SafeMoE isolates unsafe domain knowledge in LoRA experts and routes it through a safety-trained gate to raise both safety and informativeness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard LLM alignment erases unsafe data and thereby limits the model's ability to give nuanced answers. It proposes instead to train separate LoRA experts only on harmful corpora so that domain knowledge remains available. A small gating network, trained on a minimal set of safe responses, then selects which experts to activate at inference time. This routing is claimed to produce higher rates of safe yet informative outputs and to generalize without further supervision. The central evidence is the reported improvement on safety benchmarks together with the observed zero-shot transfer.

Core claim

Training domain-specific LoRA experts exclusively on harmful corpora and orchestrating them with a lightweight gating network trained on a minimal set of safe-informative responses allows the model to harness unsafe knowledge for generation while enforcing safety constraints, yielding over 20 percent relative improvement in safe response rate and strong zero-shot generalization to unseen domains.

What carries the argument

SafeMoE Mixture-of-Experts architecture in which LoRA experts store unsafe domain knowledge and a gating network performs dynamic safety routing.

If this is right

Safe response rate rises by more than 15 percent absolute on stringent benchmarks while answers remain more informative.
The same routing mechanism transfers to new safety tasks without domain-specific retraining.
Unsafe corpora can be retained as a source of domain expertise rather than discarded.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the separation of knowledge storage and routing holds, alignment pipelines could treat safety as a modular control layer rather than a global filter.
The method invites direct tests on whether over-refusal rates drop on benign but sensitive queries compared with standard refusal training.
Extending the same split to non-text modalities would test whether the unsafe-knowledge-plus-router pattern generalizes beyond language.

Load-bearing premise

A router trained on only a minimal set of safe responses can reliably steer away from harmful outputs produced by experts trained solely on unsafe data.

What would settle it

A held-out harmful prompt from a domain absent from both expert and gate training that nevertheless elicits an unsafe or uninformative response from the routed model.

Figures

Figures reproduced from arXiv: 2606.00686 by Jerry Huang, Marc-Alexandre C\^ot\'e, Maryam Hashemzadeh, Minseon Kim, Sarath Chandar.

**Figure 1.** Figure 1: Example of unsafe, safe but uninformative, and safe-informative responses. Model safety often refers to the ability to avoid generating potentially harmful content, whether to oneself or others. However, safe responses can sometimes be vague or overly cautious, lacking the detail needed to satisfy user intent. One such case is refusal, where the model declines to answer out of concern that it could lead … view at source ↗

**Figure 2.** Figure 2: Stages of SafeMoE. (1) Unsafe experts are trained using a large set of unsafe response data that can be split into domains. (2) Experts are used to train a router with a smaller set of safe response data. In the second stage, the experts are frozen, while only the router is trainable. We introduce SafeMoE, a framework that adapts a base LLM to generate safe yet informative responses by leveraging unsafe d… view at source ↗

**Figure 3.** Figure 3: Safe samples per domain. To assess the efficiency of router tuning, we ablate on the volume of safe response data (|Ds |). The extreme class imbalance, e.g. (|Dus| > 104 per category versus |Ds | ≤ 200 total), motivates us to investigate the marginal utility of safe demonstrations. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: The performance of SafeMoE models on over-refusal is shown in the plots. As illustrated, SafeMoE models not only maintain high performance on hard categories but also improve safety on toxic categories. In both cases, the informativeness scores remain high. 4.2.7 Meta-Evaluation and Human Validation of the Judge [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Expert Activation Heatmap across safety domains. The non-uniform vertical clustering [PITH_FULL_IMAGE:figures/full_fig_p033_5.png] view at source ↗

**Figure 6.** Figure 6: Mean expert activation percentage across all categories under First-Layer Routing. The [PITH_FULL_IMAGE:figures/full_fig_p034_6.png] view at source ↗

**Figure 7.** Figure 7: Example of unsafe prompts per categories and safe-informative responses. 35 [PITH_FULL_IMAGE:figures/full_fig_p035_7.png] view at source ↗

**Figure 8.** Figure 8: Example of unsafe prompts per categories and safe-informative responses. [PITH_FULL_IMAGE:figures/full_fig_p036_8.png] view at source ↗

**Figure 9.** Figure 9: Example of unsafe prompts per categories and safe-informative responses. [PITH_FULL_IMAGE:figures/full_fig_p037_9.png] view at source ↗

read the original abstract

The prevailing paradigm in large language model (LLM) alignment operates via erasure, filtering unsafe data or training models to strictly refuse harmful prompts. While effective at reducing immediate toxicity, this approach fundamentally constricts the model's epistemological scope, resulting in over-cautious systems that output uninformative blanket refusals to sensitive yet benign queries. In this work, we challenge the orthodoxy that unsafe data must be discarded. We propose a dialectical approach to alignment, positing that unsafe data encodes rich, domain specific knowledge critical for nuanced, safe, and informative generation. To operationalize this, we introduce SafeMoE, a Mixture-of-Experts (MoE) framework that isolates unsafe knowledge into domain-specific Low-Rank Adapters (LoRA experts) trained exclusively on harmful corpora. To synthesize safety from these unsafe primitives, we train a lightweight gating network using a minimal, highly curated set of safe-informative responses. During inference, this router dynamically orchestrates the unsafe experts, effectively steering the generation trajectory to harness their deep domain knowledge while strictly enforcing safety constraints. Extensive empirical evaluations across stringent safety benchmarks demonstrate that SafeMoE is not only safer, achieving over a 20% relative improvement in safe response rate (more than a 15% absolute gain), but also produces more informative responses when safety and harmfulness are of paramount concern. Furthermore, the routing mechanism exhibits strong zero-shot generalization to unseen domains and broader safety tasks without domain-specific supervision. Our findings suggest a paradigm shift in alignment: true safety requires not the masking of unsafe knowledge, but its controlled integration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SafeMoE keeps harmful data in LoRA experts and routes via a small safe gate, but the abstract gives no mechanism or checks to show the gate actually prevents leakage.

read the letter

The paper's main move is to train separate LoRA experts only on harmful corpora for each domain, then train a lightweight router on a minimal set of safe responses so that at inference the model can pull domain knowledge without defaulting to refusals. This is presented as a direct alternative to erasure-style alignment.

The construction itself is new in the alignment literature as far as the abstract shows: an explicit MoE where the experts are deliberately unsafe and the safety signal lives only in the gate. The reported numbers—an over 20% relative lift in safe response rate plus more informative outputs, with zero-shot transfer to new domains—are the concrete claim.

The soft spot is exactly where the stress-test note lands. Nothing in the abstract describes how the gate is prevented from letting unsafe tokens through, whether any leakage metric was tracked, or what the experts actually encode beyond the training data. The performance numbers are given without baselines, controls, or significance tests, so it is impossible to tell whether the gains come from the routing idea or from other unstated choices. The zero-shot claim is also left without detail on how the router was evaluated on unseen domains.

This is for people working on practical LLM safety who want to move past blanket refusals. If the full paper supplies the missing controls, ablations on the gate, and checks that unsafe content does not leak, the idea is worth referee time. If those pieces are absent, the central mechanism remains untested.

I would send it to review rather than desk reject, mainly to see whether the experiments close the gap the abstract leaves open.

Referee Report

2 major / 0 minor

Summary. The paper proposes SafeMoE, a Mixture-of-Experts framework that trains domain-specific LoRA experts exclusively on harmful corpora to encode unsafe domain knowledge, then trains a lightweight gating network on a minimal set of safe-informative responses to dynamically route these experts at inference time. The central claims are that this yields over 20% relative (15% absolute) gains in safe response rate on safety benchmarks while producing more informative outputs, plus strong zero-shot generalization to unseen domains without domain-specific supervision, challenging erasure-based alignment paradigms.

Significance. If the performance and generalization results hold under proper controls, the work would be significant as an empirical demonstration that unsafe data can be harnessed for both safety and utility rather than discarded, potentially opening a new direction in alignment research focused on controlled knowledge integration rather than refusal training.

major comments (2)

[Abstract] Abstract: the reported >20% relative and >15% absolute gains in safe response rate are presented without any information on baselines, statistical significance, number of runs, data exclusion rules, or experimental controls; these details are load-bearing for the central empirical claim and must be supplied before the gains can be evaluated.
[Abstract] Abstract (and method description): no mechanism, loss term, or metric is described to guarantee that the gate trained on minimal safe responses prevents leakage of unsafe tokens or patterns from the harmful LoRA experts; this separation assumption is load-bearing for both the safety improvement and zero-shot routing claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while noting where clarifications or additions are warranted.

read point-by-point responses

Referee: [Abstract] Abstract: the reported >20% relative and >15% absolute gains in safe response rate are presented without any information on baselines, statistical significance, number of runs, data exclusion rules, or experimental controls; these details are load-bearing for the central empirical claim and must be supplied before the gains can be evaluated.

Authors: Section 4 of the manuscript provides the full experimental protocol, including the specific baselines (refusal-trained Llama-2-7B, standard RLHF, and domain-specific fine-tuning), five independent runs with mean and standard deviation, paired t-tests for significance (p < 0.01), and explicit data exclusion rules (removal of prompts with >50% overlap to training). The abstract summarizes the outcome rather than the protocol due to length constraints. To make the central claim more evaluable on first reading, we have added one sentence to the abstract referencing the controlled evaluation setup and statistical reporting. revision: yes
Referee: [Abstract] Abstract (and method description): no mechanism, loss term, or metric is described to guarantee that the gate trained on minimal safe responses prevents leakage of unsafe tokens or patterns from the harmful LoRA experts; this separation assumption is load-bearing for both the safety improvement and zero-shot routing claims.

Authors: The gating network is trained exclusively on safe-informative response pairs; its objective is to select expert combinations that maximize the likelihood of those safe outputs, which by construction discourages routes that would surface unsafe patterns. We agree, however, that no auxiliary loss term (e.g., an explicit safety-classifier penalty) or post-generation leakage metric is stated in the current text. We have therefore expanded the method section with a precise formulation of the routing loss and added a short ablation quantifying token-level leakage (via keyword and classifier checks) on held-out harmful prompts, confirming the separation holds in practice. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The provided abstract and description contain no equations, derivations, or mathematical claims that reduce to fitted parameters or self-citations. The central proposal (SafeMoE framework) and reported gains are presented as empirical outcomes from experiments rather than definitional or self-referential constructs. No load-bearing steps match any enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; evaluation is limited to the high-level description given.

pith-pipeline@v0.9.1-grok · 5833 in / 1179 out tokens · 24345 ms · 2026-06-28T19:21:31.741628+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

85 extracted references · 39 canonical work pages · 10 internal anchors

[1]

Akiba, M

T. Akiba, M. Shing, Y . Tang, Q. Sun, and D. Ha. Evolutionary optimization of model merging recipes.Nat. Mac. Intell., 7(2):195–204, 2025. doi: 10.1038/S42256-024-00975-8. URL https://doi.org/10.1038/s42256-024-00975-8

work page doi:10.1038/s42256-024-00975-8 2025
[2]

Arditi, O

A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda. Refusal in language models is mediated by a single direction. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors,Ad- vances in Neural Information Processing Systems 38: Annual Conference on Neural In- formation Processing System...

2024
[3]

G. Bai, J. Liu, X. Bu, Y . He, J. Liu, Z. Zhou, Z. Lin, W. Su, T. Ge, B. Zheng, and W. Ouyang. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In L. Ku, A. Martins, and V . Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ...

work page doi:10.18653/v1/2024.acl-long.401 2024
[4]

Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. E. Showk, N. Elhage, Z. Hatfield- Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. B. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kapl...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2204.05862 2022
[5]

Bavaresco, R

A. Bavaresco, R. Bernardi, L. Bertolazzi, D. Elliott, R. Fernández, A. Gatt, E. Ghaleb, M. Giu- lianelli, M. Hanna, A. Koller, A. F. T. Martins, P. Mondorf, V . Neplenbroek, S. Pezzelle, B. Plank, D. Schlangen, A. Suglia, A. K. Surikuchi, E. Takmaz, and A. Testoni. Llms instead of human judges? A large scale empirical study across 20 NLP evaluation tasks....

work page doi:10.18653/v1/2025.acl-short.20 2025
[6]

Bhardwaj and S

R. Bhardwaj and S. Poria. Red-teaming large language models using chain of utterances for safety-alignment, 2023. URLhttps://doi.org/10.48550/arXiv.2308.09662

work page doi:10.48550/arxiv.2308.09662 2023
[7]

Bianchi, M

F. Bianchi, M. Suzgun, G. Attanasio, P. Röttger, D. Jurafsky, T. Hashimoto, and J. Zou. Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/ f...

2024
[8]

Bommasani, K

R. Bommasani, K. Klyman, S. Kapoor, S. Longpre, B. Xiong, N. Maslej, and P. Liang. The 2024 foundation model transparency index.Trans. Mach. Learn. Res., 2025, 2025. URL https://openreview.net/forum?id=38cwP8xVxD. 10

2024
[9]

L. Cao. Learn to refuse: Making large language models more controllable and reliable through knowledge scope limitation and refusal mechanism. In Y . Al-Onaizan, M. Bansal, and Y . Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Pro- cessing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 3628–3646. Assoc...

work page doi:10.18653/v1/2024.emnlp-main.212 2024
[10]

Casper, K

S. Casper, K. O’Brien, S. Longpre, E. Seger, K. Klyman, R. Bommasani, A. Nrusimha, I. Shu- mailov, S. Mindermann, S. Basart, F. Rudzicz, K. Pelrine, A. Ghosh, A. Strait, R. Kirk, D. Hendrycks, P. Henderson, J. Z. Kolter, G. Irving, Y . Gal, Y . Bengio, and D. Hadfield-Menell. Open technical problems in open-weight AI model risk management.Trans. Mach. Lea...

2026
[11]

P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong. Jailbreaking black box large language models in twenty queries. InIEEE Conference on Secure and Trustworthy Machine Learning, SaTML 2025, Copenhagen, Denmark, April 9-11, 2025, pages 23–42. IEEE, 2025. doi: 10.1109/SATML64287.2025.00010. URL https://doi.org/10.1109/ SaTML64287.2025.00010

work page doi:10.1109/satml64287.2025.00010 2025
[12]

Y . Chen, H. Gao, G. Cui, F. Qi, L. Huang, Z. Liu, and M. Sun. Why should adversarial perturbations be imperceptible? rethink the research paradigm in adversarial NLP. In Y . Gold- berg, Z. Kozareva, and Y . Zhang, editors,Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, D...

work page doi:10.18653/v1/2022.emnlp-main.771 2022
[13]

P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei. Deep re- inforcement learning from human preferences. In I. Guyon, U. von Luxburg, S. Ben- gio, H. M. Wallach, R. Fergus, S. V . N. Vishwanathan, and R. Garnett, editors,Ad- vances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Syst...

2017
[14]

J. Chu, Y . Liu, Z. Yang, X. Shen, M. Backes, and Y . Zhang. Jailbreakradar: Comprehensive assessment of jailbreak attacks against llms. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August ...

2025
[15]

J. Cui, W. Chiang, I. Stoica, and C. Hsieh. Or-bench: An over-refusal benchmark for large language models, 2024. URLhttps://doi.org/10.48550/arXiv.2405.20947

work page doi:10.48550/arxiv.2405.20947 2024
[16]

D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y . Wu, Z. Xie, Y . K. Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. In L. Ku, A. Martins, and V . Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Comp...

work page doi:10.18653/v1/2024.acl-long.70 2024
[17]

E. L. Deci, R. Koestner, and R. M. Ryan. A meta-analytic review of experiments examining the effects of extrinsic rewards on intrinsic motivation.Psychological Bulletin, 125(6):627–668, 1999

1999
[18]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024. URLhttps://doi.org/10.48550/arXiv.2405.04434

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.04434 2024
[19]

DeepSeek-V3 Technical Report

DeepSeek-AI. Deepseek-v3 technical report, 2024. URLhttps://doi.org/10.48550/arXiv. 2412.19437. 11

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2024
[20]

DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y . Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. W...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948 2025
[21]

Desmond, Z

M. Desmond, Z. Ashktorab, W. Geyer, E. M. Daly, M. S. Cooper, Q. Pan, R. Nair, N. Wagner, and T. Pedapati. Evalassist: Llm-as-a-judge simplified. In T. Walsh, J. Shah, and Z. Kolter, editors,AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA, pages 29637–29639. AAAI Pres...

work page doi:10.1609/aaai.v39i28.35351 2025
[22]

R. Duan, J. Liu, X. Jia, S. Zhao, R. Cheng, F. Wang, C. Wei, Y . Xie, C. Liu, D. Li, Y . Dong, Y . Zhang, Y . Chen, C. Wang, X. Ma, X. Wei, Y . Liu, H. Su, J. Zhu, X. Li, Y . Sun, J. Zhang, J. Hu, S. Xu, W. Yang, Y . Yang, X. Zhang, Y . Tan, J. Tao, and H. Xue. Oyster-i: Beyond refusal - constructive safety alignment for responsible language models, 2025....

work page doi:10.48550/arxiv.2509.01909 2025
[23]

S. Duan, X. Yi, P. Zhang, T. Lu, X. Xie, and N. Gu. Denevil: towards deciphering and navigating the ethical values of large language models via instruction learning. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

2024
[24]

URLhttps://openreview.net/forum?id=m3RRWWFaVe

OpenReview.net, 2024. URLhttps://openreview.net/forum?id=m3RRWWFaVe

2024
[25]

Dubois, C

Y . Dubois, C. X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin, P. Liang, and T. B. Hashimoto. Alpacafarm: A simulation framework for methods that learn from hu- man feedback. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neu- ral Infor...

2023
[26]

Elhage, T

N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, R. Grosse, S. McCandlish, J. Kaplan, D. Amodei, M. Wattenberg, and C. Olah. Toy models of superposition, 2022. URL https://transformer-circuits. pub/2022/toy_model/index.html

2022
[27]

Fedus, B

W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.J. Mach. Learn. Res., 23:120:1–120:39, 2022. URL https: //jmlr.org/papers/v23/21-0998.html

2022
[28]

W. Feng, C. Hao, Y . Zhang, Y . Han, and H. Wang. Mixture-of-loras: An efficient multitask tuning method for large language models. In N. Calzolari, M. Kan, V . Hoste, A. Lenci, S. Sakti, and N. Xue, editors,Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20- 25 May,...

2024
[29]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y . Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Conerly, N. DasSarma, D. Drain, N. Elhage, S. E. Showk, S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernandez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei, T. Br...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[30]

J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y . Shen, S. Ma, H. Liu, Y . Wang, and J. Guo. A survey on llm-as-a-judge, 2024. URL https://doi.org/10.48550/arXiv.2411. 15594

work page doi:10.48550/arxiv.2411 2024
[31]

S. Han, G. T. Junior, T. Balough, and W. Zhou. Judge’s verdict: A comprehensive analysis of LLM judge capability through human agreement, 2025. URL https://doi.org/10.48550/ arXiv.2510.09738

work page arXiv 2025
[32]

T. Han, A. Kumar, C. Agarwal, and H. Lakkaraju. Medsafetybench: Evaluating and improving the medical safety of large language models. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems 38: Annual Conference on Neural Informa- tion Processing Systems 2024, Neur...

2024
[33]

C. Hsu, Y . Tsai, C. Lin, P. Chen, C. Yu, and C. Huang. Safe lora: The silver lin- ing of reducing safety risks when finetuning large language models. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors,Ad- vances in Neural Information Processing Systems 38: Annual Conference on Neural In- formation Processing...

2024
[34]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URLhttps://openreview.net/forum?id=nZeVKeeFYf9

2022
[35]

Ilharco, M

G. Ilharco, M. T. Ribeiro, M. Wortsman, L. Schmidt, H. Hajishirzi, and A. Farhadi. Editing models with task arithmetic. InThe Eleventh International Conference on Learning Rep- resentations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=6t0Kwf8-jrj

2023
[36]

R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts.Neural Comput., 3(1):79–87, 1991. doi: 10.1162/NECO.1991.3.1.79. URL https: //doi.org/10.1162/neco.1991.3.1.79

work page doi:10.1162/neco.1991.3.1.79 1991
[37]

E. Jan, N. AlDahoul, M. Ali, F. Ahmad, F. Zaffar, and Y . Zaki. Multitask mayhem: Unveiling and mitigating safety gaps in llms fine-tuning, 2024. URL https://doi.org/10.48550/ arXiv.2409.15361

work page arXiv 2024
[38]

J. Ji, M. Liu, J. Dai, X. Pan, C. Zhang, C. Bian, B. Chen, R. Sun, Y . Wang, and Y . Yang. Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Ad- vances in Neural Information Processing Systems 36: Annual Conference on Neural In- formation Pr...

2023
[39]

J. Ji, D. Hong, B. Zhang, B. Chen, J. Dai, B. Zheng, T. A. Qiu, J. Zhou, K. Wang, B. Li, S. Han, Y . Guo, and Y . Yang. Pku-saferlhf: Towards multi-level safety alignment for llms with human preference. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Vo...

2025
[40]

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de Las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b, 2023. URL https://doi.org/10.48550/arXiv.2310.06825. 13

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.06825 2023
[41]

A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de Las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mixtral of experts, 2024. URL https...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.04088 2024
[42]

Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu. Pubmedqa: A dataset for biomedical research question answering. In K. Inui, J. Jiang, V . Ng, and X. Wan, editors,Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Interna- tional Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kon...

2019
[43]

URLhttps://doi.org/10.18653/v1/D19-1259

doi: 10.18653/V1/D19-1259. URLhttps://doi.org/10.18653/v1/D19-1259

work page doi:10.18653/v1/d19-1259
[44]

X. Jin, X. Ren, D. Preotiuc-Pietro, and P. Cheng. Dataless knowledge fusion by merging weights of language models. InThe Eleventh International Conference on Learning Rep- resentations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=FCnohuR6AnM

2023
[45]

H. R. Kirk, B. Vidgen, P. Röttger, and S. A. Hale. The benefits, risks and bounds of personalizing the alignment of large language models to individuals.Nat. Mac. Intell., 6(4):383–392, 2024. doi: 10.1038/S42256-024-00820-Y. URLhttps://doi.org/10.1038/s42256-024-00820-y

work page doi:10.1038/s42256-024-00820-y 2024
[46]

Kumarage, N

T. Kumarage, N. Mehrabi, A. Ramakrishna, X. Zhao, R. S. Zemel, K. Chang, A. Galstyan, R. Gupta, and C. Peris. Towards safety reasoning in llms: Ai-agentic deliberation for policy- embedded cot data creation. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors, Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria,...

2025
[47]

Y . Liu, P. Liu, and A. Cohan. On evaluating LLM alignment by evaluating llms as judges, 2025. URLhttps://doi.org/10.48550/arXiv.2511.20604

work page doi:10.48550/arxiv.2511.20604 2025
[48]

G. F. Loewenstein, E. U. Weber, C. K. Hsee, and N. Welch. Risk as feelings.Psychological bulletin, 127(2):267, 2001

2001
[49]

Longpre, S

S. Longpre, S. Biderman, A. Albalak, H. Schoelkopf, D. McDuff, S. Kapoor, K. Klyman, K. Lo, G. Ilharco, N. San, M. Rauh, A. Skowron, B. Vidgen, L. Weidinger, A. Narayanan, V . Sanh, D. I. Adelani, P. Liang, R. Bommasani, P. Henderson, S. Luccioni, Y . Jernite, and L. Soldaini. The responsible foundation model development cheatsheet: A review of tools & re...

2024
[50]

Loshchilov and F

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URLhttps://openreview.net/forum?id=Bkg6RiCqY7

2019
[51]

N. Lu, S. Liu, J. Wu, W. Chen, Z. Zhang, Y . Ong, Q. Wang, and K. Tang. Safe delta: Consistently preserving safety when fine-tuning llms on diverse datasets. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors,Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Cana...

2025
[52]

Matena and C

M. Matena and C. Raffel. Merging models with fisher-weighted averaging. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neu- ral Information Processing Systems 35: Annual Conference on Neural Information Pro- cessing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - Decem- ber 9, 2022, 2022. URL http...

2022
[53]

Mather and N

M. Mather and N. R. Lighthall. Risk and reward are processed differently in decisions made under stress.Current directions in psychological science, 21(1):36–41, 2012. 14

2012
[54]

Mazeika, L

M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, D. A. Forsyth, and D. Hendrycks. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. InForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URLhttps: //o...

2024
[55]

GPT-4 technical report, 2023

OpenAI. GPT-4 technical report, 2023. URL https://doi.org/10.48550/arXiv.2303. 08774

work page doi:10.48550/arxiv.2303 2023
[56]

Ouyang, J

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Ch...

2022
[57]

Phute, A

M. Phute, A. Helbling, M. Hull, S. Peng, S. Szyller, C. Cornelius, and D. H. Chau. LLM self defense: By self examination, llms know they are being tricked. InThe Second Tiny Papers Track at ICLR 2024, Tiny Papers @ ICLR 2024, Vienna, Austria, May 11, 2024. OpenReview.net,

2024
[58]

URLhttps://openreview.net/forum?id=YoqgcIA19o
[59]

Rafailov, A

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Di- rect preference optimization: Your language model is secretly a reward model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Ad- vances in Neural Information Processing Systems 36: Annual Conference on Neural In- formation Processing Systems 202...

2023
[60]

Q. Ren, C. Gao, J. Shao, J. Yan, X. Tan, W. Lam, and L. Ma. Codeattack: Revealing safety generalization challenges of large language models via code completion. In L. Ku, A. Martins, and V . Srikumar, editors,Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 11437–11452. A...

work page doi:10.18653/v1/2024.findings-acl.679 2024
[61]

Reuel, B

A. Reuel, B. Bucknall, S. Casper, T. Fist, L. Soder, O. Aarne, L. Hammond, L. Ibrahim, A. Chan, P. Wills, M. Anderljung, B. Garfinkel, L. Heim, A. Trask, G. Mukobi, R. Schaeffer, M. Baker, S. Hooker, I. Solaiman, S. Luccioni, N. Rajkumar, N. Moës, J. Ladish, D. Bau, P. Bricman, N. Guha, J. Newman, Y . Bengio, T. South, A. Pentland, S. Koyejo, M. J. Kochen...

2025
[62]

Steering Llama 2 via Contrastive Activation Addition

N. Rimsky, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. M. Turner. Steering llama 2 via contrastive activation addition. In L. Ku, A. Martins, and V . Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 15504–15522....

work page doi:10.18653/v1/2024.acl-long.828 2024
[63]

Röttger, H

P. Röttger, H. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. In K. Duh, H. Gómez- Adorno, and S. Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologi...

work page doi:10.18653/v1/2024.naacl-long.301 2024
[64]

M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith, and Y . Choi. Social bias frames: Reasoning about social and power implications of language. In D. Jurafsky, J. Chai, N. Schluter, 15 and J. R. Tetreault, editors,Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 5477–5490....

work page doi:10.18653/v1/2020.acl-main.486 2020
[65]

Shoemake

K. Shoemake. Animating rotation with quaternion curves. In P. Cole, R. Heilman, and B. A. Barsky, editors,Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1985, San Francisco, California, USA, July 22-26, 1985, pages 245–254. ACM, 1985. doi: 10.1145/325334.325242. URL https://doi.org/10.1145/ 325334.325242

work page doi:10.1145/325334.325242 1985
[66]

Y . Sung, J. Cho, and M. Bansal. VL-ADAPTER: parameter-efficient transfer learning for vision- and-language tasks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 5217–5227. IEEE, 2022. doi: 10. 1109/CVPR52688.2022.00516. URLhttps://doi.org/10.1109/CVPR52688.2022.00516

work page doi:10.1109/cvpr52688.2022.00516 2022
[67]

Q. Team. Qwen1.5-moe: Matching 7b model performance with 1/3 activated parameters, February 2024. URLhttps://qwenlm.github.io/blog/qwen-moe/

2024
[68]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Ba- tra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V . Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V . Kerkez, M. Khabsa, I. Kloumann, A....

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.09288 2023
[69]

Zephyr: Direct Distillation of LM Alignment

L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y . Belkada, S. Huang, L. von Werra, C. Fourrier, N. Habib, N. Sarrazin, O. Sanseviero, A. M. Rush, and T. Wolf. Zephyr: Direct distillation of LM alignment, 2023. URLhttps://doi.org/10.48550/arXiv.2310.16944

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.16944 2023
[70]

Y . Wang, Y . Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self- instruct: Aligning language models with self-generated instructions. In A. Rogers, J. L. Boyd- Graber, and N. Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada,...

work page doi:10.18653/v1/2023.acl-long.754 2023
[71]

A. Wei, N. Haghtalab, and J. Steinhardt. Jailbroken: How does LLM safety training fail? In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, edi- tors,Advances in Neural Information Processing Systems 36: Annual Conference on Neu- ral Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, Decem- ber 10 - 16, 2023, 2023...

2023
[72]

Weidinger, J

L. Weidinger, J. Uesato, M. Rauh, C. Griffin, P. Huang, J. Mellor, A. Glaese, M. Cheng, B. Balle, A. Kasirzadeh, C. Biles, S. Brown, Z. Kenton, W. Hawkins, T. Stepleton, A. Birhane, L. A. Hen- dricks, L. Rimell, W. Isaac, J. Haas, S. Legassick, G. Irving, and I. Gabriel. Taxonomy of risks posed by language models. InFAccT ’22: 2022 ACM Conference on Fairn...

work page doi:10.1145/3531146.3533088 2022
[73]

Wollschläger, J

T. Wollschläger, J. Elstner, S. Geisler, V . Cohen-Addad, S. Günnemann, and J. Gasteiger. The geometry of refusal in large language models: Concept cones and representational independence. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors,Forty-second International Conference on Machine Learning,...

2025
[74]

X. Wu, S. Huang, and F. Wei. Mixture of lora experts. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net,

2024
[75]

URLhttps://openreview.net/forum?id=uWvKBCYh4S
[76]

A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, and Z. Qiu. Qwen2....

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115 2024
[77]

Y . Yuan, W. Jiao, W. Wang, J. Huang, P. He, S. Shi, and Z. Tu. GPT-4 is too smart to be safe: Stealthy chat with llms via cipher. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=MbfAK4s61A

2024
[78]

Y . Yuan, W. Jiao, W. Wang, J. Huang, J. Xu, T. Liang, P. He, and Z. Tu. Refuse whenever you feel unsafe: Improving safety in llms via decoupled refusal training. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Associa- tion for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna...

2025
[79]

Zhang, S

J. Zhang, S. Chen, J. Liu, and J. He. Composing parameter-efficient modules with arith- metic operation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neu- ral Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, Decem- ber 10 -...

2023
[80]

Zhang, P

W. Zhang, P. Torr, M. Elhoseiny, and A. Bibi. Bi-factorial preference optimization: Balancing safety-helpfulness in language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=GjM61KRiTG

2025

Showing first 80 references.

[1] [1]

Akiba, M

T. Akiba, M. Shing, Y . Tang, Q. Sun, and D. Ha. Evolutionary optimization of model merging recipes.Nat. Mac. Intell., 7(2):195–204, 2025. doi: 10.1038/S42256-024-00975-8. URL https://doi.org/10.1038/s42256-024-00975-8

work page doi:10.1038/s42256-024-00975-8 2025

[2] [2]

Arditi, O

A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda. Refusal in language models is mediated by a single direction. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors,Ad- vances in Neural Information Processing Systems 38: Annual Conference on Neural In- formation Processing System...

2024

[3] [3]

G. Bai, J. Liu, X. Bu, Y . He, J. Liu, Z. Zhou, Z. Lin, W. Su, T. Ge, B. Zheng, and W. Ouyang. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In L. Ku, A. Martins, and V . Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ...

work page doi:10.18653/v1/2024.acl-long.401 2024

[4] [4]

Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. E. Showk, N. Elhage, Z. Hatfield- Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. B. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kapl...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2204.05862 2022

[5] [5]

Bavaresco, R

A. Bavaresco, R. Bernardi, L. Bertolazzi, D. Elliott, R. Fernández, A. Gatt, E. Ghaleb, M. Giu- lianelli, M. Hanna, A. Koller, A. F. T. Martins, P. Mondorf, V . Neplenbroek, S. Pezzelle, B. Plank, D. Schlangen, A. Suglia, A. K. Surikuchi, E. Takmaz, and A. Testoni. Llms instead of human judges? A large scale empirical study across 20 NLP evaluation tasks....

work page doi:10.18653/v1/2025.acl-short.20 2025

[6] [6]

Bhardwaj and S

R. Bhardwaj and S. Poria. Red-teaming large language models using chain of utterances for safety-alignment, 2023. URLhttps://doi.org/10.48550/arXiv.2308.09662

work page doi:10.48550/arxiv.2308.09662 2023

[7] [7]

Bianchi, M

F. Bianchi, M. Suzgun, G. Attanasio, P. Röttger, D. Jurafsky, T. Hashimoto, and J. Zou. Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/ f...

2024

[8] [8]

Bommasani, K

R. Bommasani, K. Klyman, S. Kapoor, S. Longpre, B. Xiong, N. Maslej, and P. Liang. The 2024 foundation model transparency index.Trans. Mach. Learn. Res., 2025, 2025. URL https://openreview.net/forum?id=38cwP8xVxD. 10

2024

[9] [9]

L. Cao. Learn to refuse: Making large language models more controllable and reliable through knowledge scope limitation and refusal mechanism. In Y . Al-Onaizan, M. Bansal, and Y . Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Pro- cessing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pages 3628–3646. Assoc...

work page doi:10.18653/v1/2024.emnlp-main.212 2024

[10] [10]

Casper, K

S. Casper, K. O’Brien, S. Longpre, E. Seger, K. Klyman, R. Bommasani, A. Nrusimha, I. Shu- mailov, S. Mindermann, S. Basart, F. Rudzicz, K. Pelrine, A. Ghosh, A. Strait, R. Kirk, D. Hendrycks, P. Henderson, J. Z. Kolter, G. Irving, Y . Gal, Y . Bengio, and D. Hadfield-Menell. Open technical problems in open-weight AI model risk management.Trans. Mach. Lea...

2026

[11] [11]

P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong. Jailbreaking black box large language models in twenty queries. InIEEE Conference on Secure and Trustworthy Machine Learning, SaTML 2025, Copenhagen, Denmark, April 9-11, 2025, pages 23–42. IEEE, 2025. doi: 10.1109/SATML64287.2025.00010. URL https://doi.org/10.1109/ SaTML64287.2025.00010

work page doi:10.1109/satml64287.2025.00010 2025

[12] [12]

Y . Chen, H. Gao, G. Cui, F. Qi, L. Huang, Z. Liu, and M. Sun. Why should adversarial perturbations be imperceptible? rethink the research paradigm in adversarial NLP. In Y . Gold- berg, Z. Kozareva, and Y . Zhang, editors,Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, D...

work page doi:10.18653/v1/2022.emnlp-main.771 2022

[13] [13]

P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei. Deep re- inforcement learning from human preferences. In I. Guyon, U. von Luxburg, S. Ben- gio, H. M. Wallach, R. Fergus, S. V . N. Vishwanathan, and R. Garnett, editors,Ad- vances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Syst...

2017

[14] [14]

J. Chu, Y . Liu, Z. Yang, X. Shen, M. Backes, and Y . Zhang. Jailbreakradar: Comprehensive assessment of jailbreak attacks against llms. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August ...

2025

[15] [15]

J. Cui, W. Chiang, I. Stoica, and C. Hsieh. Or-bench: An over-refusal benchmark for large language models, 2024. URLhttps://doi.org/10.48550/arXiv.2405.20947

work page doi:10.48550/arxiv.2405.20947 2024

[16] [16]

D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y . Wu, Z. Xie, Y . K. Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. In L. Ku, A. Martins, and V . Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Comp...

work page doi:10.18653/v1/2024.acl-long.70 2024

[17] [17]

E. L. Deci, R. Koestner, and R. M. Ryan. A meta-analytic review of experiments examining the effects of extrinsic rewards on intrinsic motivation.Psychological Bulletin, 125(6):627–668, 1999

1999

[18] [18]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024. URLhttps://doi.org/10.48550/arXiv.2405.04434

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.04434 2024

[19] [19]

DeepSeek-V3 Technical Report

DeepSeek-AI. Deepseek-v3 technical report, 2024. URLhttps://doi.org/10.48550/arXiv. 2412.19437. 11

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2024

[20] [20]

DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y . Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. W...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948 2025

[21] [21]

Desmond, Z

M. Desmond, Z. Ashktorab, W. Geyer, E. M. Daly, M. S. Cooper, Q. Pan, R. Nair, N. Wagner, and T. Pedapati. Evalassist: Llm-as-a-judge simplified. In T. Walsh, J. Shah, and Z. Kolter, editors,AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA, pages 29637–29639. AAAI Pres...

work page doi:10.1609/aaai.v39i28.35351 2025

[22] [22]

R. Duan, J. Liu, X. Jia, S. Zhao, R. Cheng, F. Wang, C. Wei, Y . Xie, C. Liu, D. Li, Y . Dong, Y . Zhang, Y . Chen, C. Wang, X. Ma, X. Wei, Y . Liu, H. Su, J. Zhu, X. Li, Y . Sun, J. Zhang, J. Hu, S. Xu, W. Yang, Y . Yang, X. Zhang, Y . Tan, J. Tao, and H. Xue. Oyster-i: Beyond refusal - constructive safety alignment for responsible language models, 2025....

work page doi:10.48550/arxiv.2509.01909 2025

[23] [23]

S. Duan, X. Yi, P. Zhang, T. Lu, X. Xie, and N. Gu. Denevil: towards deciphering and navigating the ethical values of large language models via instruction learning. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

2024

[24] [24]

URLhttps://openreview.net/forum?id=m3RRWWFaVe

OpenReview.net, 2024. URLhttps://openreview.net/forum?id=m3RRWWFaVe

2024

[25] [25]

Dubois, C

Y . Dubois, C. X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin, P. Liang, and T. B. Hashimoto. Alpacafarm: A simulation framework for methods that learn from hu- man feedback. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neu- ral Infor...

2023

[26] [26]

Elhage, T

N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, R. Grosse, S. McCandlish, J. Kaplan, D. Amodei, M. Wattenberg, and C. Olah. Toy models of superposition, 2022. URL https://transformer-circuits. pub/2022/toy_model/index.html

2022

[27] [27]

Fedus, B

W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.J. Mach. Learn. Res., 23:120:1–120:39, 2022. URL https: //jmlr.org/papers/v23/21-0998.html

2022

[28] [28]

W. Feng, C. Hao, Y . Zhang, Y . Han, and H. Wang. Mixture-of-loras: An efficient multitask tuning method for large language models. In N. Calzolari, M. Kan, V . Hoste, A. Lenci, S. Sakti, and N. Xue, editors,Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20- 25 May,...

2024

[29] [29]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y . Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Conerly, N. DasSarma, D. Drain, N. Elhage, S. E. Showk, S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernandez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei, T. Br...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[30] [30]

J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y . Shen, S. Ma, H. Liu, Y . Wang, and J. Guo. A survey on llm-as-a-judge, 2024. URL https://doi.org/10.48550/arXiv.2411. 15594

work page doi:10.48550/arxiv.2411 2024

[31] [31]

S. Han, G. T. Junior, T. Balough, and W. Zhou. Judge’s verdict: A comprehensive analysis of LLM judge capability through human agreement, 2025. URL https://doi.org/10.48550/ arXiv.2510.09738

work page arXiv 2025

[32] [32]

T. Han, A. Kumar, C. Agarwal, and H. Lakkaraju. Medsafetybench: Evaluating and improving the medical safety of large language models. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems 38: Annual Conference on Neural Informa- tion Processing Systems 2024, Neur...

2024

[33] [33]

C. Hsu, Y . Tsai, C. Lin, P. Chen, C. Yu, and C. Huang. Safe lora: The silver lin- ing of reducing safety risks when finetuning large language models. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors,Ad- vances in Neural Information Processing Systems 38: Annual Conference on Neural In- formation Processing...

2024

[34] [34]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URLhttps://openreview.net/forum?id=nZeVKeeFYf9

2022

[35] [35]

Ilharco, M

G. Ilharco, M. T. Ribeiro, M. Wortsman, L. Schmidt, H. Hajishirzi, and A. Farhadi. Editing models with task arithmetic. InThe Eleventh International Conference on Learning Rep- resentations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=6t0Kwf8-jrj

2023

[36] [36]

R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts.Neural Comput., 3(1):79–87, 1991. doi: 10.1162/NECO.1991.3.1.79. URL https: //doi.org/10.1162/neco.1991.3.1.79

work page doi:10.1162/neco.1991.3.1.79 1991

[37] [37]

E. Jan, N. AlDahoul, M. Ali, F. Ahmad, F. Zaffar, and Y . Zaki. Multitask mayhem: Unveiling and mitigating safety gaps in llms fine-tuning, 2024. URL https://doi.org/10.48550/ arXiv.2409.15361

work page arXiv 2024

[38] [38]

J. Ji, M. Liu, J. Dai, X. Pan, C. Zhang, C. Bian, B. Chen, R. Sun, Y . Wang, and Y . Yang. Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Ad- vances in Neural Information Processing Systems 36: Annual Conference on Neural In- formation Pr...

2023

[39] [39]

J. Ji, D. Hong, B. Zhang, B. Chen, J. Dai, B. Zheng, T. A. Qiu, J. Zhou, K. Wang, B. Li, S. Han, Y . Guo, and Y . Yang. Pku-saferlhf: Towards multi-level safety alignment for llms with human preference. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Vo...

2025

[40] [40]

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de Las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b, 2023. URL https://doi.org/10.48550/arXiv.2310.06825. 13

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.06825 2023

[41] [41]

A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de Las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mixtral of experts, 2024. URL https...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.04088 2024

[42] [42]

Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu. Pubmedqa: A dataset for biomedical research question answering. In K. Inui, J. Jiang, V . Ng, and X. Wan, editors,Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Interna- tional Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kon...

2019

[43] [43]

URLhttps://doi.org/10.18653/v1/D19-1259

doi: 10.18653/V1/D19-1259. URLhttps://doi.org/10.18653/v1/D19-1259

work page doi:10.18653/v1/d19-1259

[44] [44]

X. Jin, X. Ren, D. Preotiuc-Pietro, and P. Cheng. Dataless knowledge fusion by merging weights of language models. InThe Eleventh International Conference on Learning Rep- resentations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=FCnohuR6AnM

2023

[45] [45]

H. R. Kirk, B. Vidgen, P. Röttger, and S. A. Hale. The benefits, risks and bounds of personalizing the alignment of large language models to individuals.Nat. Mac. Intell., 6(4):383–392, 2024. doi: 10.1038/S42256-024-00820-Y. URLhttps://doi.org/10.1038/s42256-024-00820-y

work page doi:10.1038/s42256-024-00820-y 2024

[46] [46]

Kumarage, N

T. Kumarage, N. Mehrabi, A. Ramakrishna, X. Zhao, R. S. Zemel, K. Chang, A. Galstyan, R. Gupta, and C. Peris. Towards safety reasoning in llms: Ai-agentic deliberation for policy- embedded cot data creation. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors, Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria,...

2025

[47] [47]

Y . Liu, P. Liu, and A. Cohan. On evaluating LLM alignment by evaluating llms as judges, 2025. URLhttps://doi.org/10.48550/arXiv.2511.20604

work page doi:10.48550/arxiv.2511.20604 2025

[48] [48]

G. F. Loewenstein, E. U. Weber, C. K. Hsee, and N. Welch. Risk as feelings.Psychological bulletin, 127(2):267, 2001

2001

[49] [49]

Longpre, S

S. Longpre, S. Biderman, A. Albalak, H. Schoelkopf, D. McDuff, S. Kapoor, K. Klyman, K. Lo, G. Ilharco, N. San, M. Rauh, A. Skowron, B. Vidgen, L. Weidinger, A. Narayanan, V . Sanh, D. I. Adelani, P. Liang, R. Bommasani, P. Henderson, S. Luccioni, Y . Jernite, and L. Soldaini. The responsible foundation model development cheatsheet: A review of tools & re...

2024

[50] [50]

Loshchilov and F

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URLhttps://openreview.net/forum?id=Bkg6RiCqY7

2019

[51] [51]

N. Lu, S. Liu, J. Wu, W. Chen, Z. Zhang, Y . Ong, Q. Wang, and K. Tang. Safe delta: Consistently preserving safety when fine-tuning llms on diverse datasets. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors,Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Cana...

2025

[52] [52]

Matena and C

M. Matena and C. Raffel. Merging models with fisher-weighted averaging. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neu- ral Information Processing Systems 35: Annual Conference on Neural Information Pro- cessing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - Decem- ber 9, 2022, 2022. URL http...

2022

[53] [53]

Mather and N

M. Mather and N. R. Lighthall. Risk and reward are processed differently in decisions made under stress.Current directions in psychological science, 21(1):36–41, 2012. 14

2012

[54] [54]

Mazeika, L

M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, D. A. Forsyth, and D. Hendrycks. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. InForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URLhttps: //o...

2024

[55] [55]

GPT-4 technical report, 2023

OpenAI. GPT-4 technical report, 2023. URL https://doi.org/10.48550/arXiv.2303. 08774

work page doi:10.48550/arxiv.2303 2023

[56] [56]

Ouyang, J

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Ch...

2022

[57] [57]

Phute, A

M. Phute, A. Helbling, M. Hull, S. Peng, S. Szyller, C. Cornelius, and D. H. Chau. LLM self defense: By self examination, llms know they are being tricked. InThe Second Tiny Papers Track at ICLR 2024, Tiny Papers @ ICLR 2024, Vienna, Austria, May 11, 2024. OpenReview.net,

2024

[58] [58]

URLhttps://openreview.net/forum?id=YoqgcIA19o

[59] [59]

Rafailov, A

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Di- rect preference optimization: Your language model is secretly a reward model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Ad- vances in Neural Information Processing Systems 36: Annual Conference on Neural In- formation Processing Systems 202...

2023

[60] [60]

Q. Ren, C. Gao, J. Shao, J. Yan, X. Tan, W. Lam, and L. Ma. Codeattack: Revealing safety generalization challenges of large language models via code completion. In L. Ku, A. Martins, and V . Srikumar, editors,Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 11437–11452. A...

work page doi:10.18653/v1/2024.findings-acl.679 2024

[61] [61]

Reuel, B

A. Reuel, B. Bucknall, S. Casper, T. Fist, L. Soder, O. Aarne, L. Hammond, L. Ibrahim, A. Chan, P. Wills, M. Anderljung, B. Garfinkel, L. Heim, A. Trask, G. Mukobi, R. Schaeffer, M. Baker, S. Hooker, I. Solaiman, S. Luccioni, N. Rajkumar, N. Moës, J. Ladish, D. Bau, P. Bricman, N. Guha, J. Newman, Y . Bengio, T. South, A. Pentland, S. Koyejo, M. J. Kochen...

2025

[62] [62]

Steering Llama 2 via Contrastive Activation Addition

N. Rimsky, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. M. Turner. Steering llama 2 via contrastive activation addition. In L. Ku, A. Martins, and V . Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 15504–15522....

work page doi:10.18653/v1/2024.acl-long.828 2024

[63] [63]

Röttger, H

P. Röttger, H. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. In K. Duh, H. Gómez- Adorno, and S. Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologi...

work page doi:10.18653/v1/2024.naacl-long.301 2024

[64] [64]

M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith, and Y . Choi. Social bias frames: Reasoning about social and power implications of language. In D. Jurafsky, J. Chai, N. Schluter, 15 and J. R. Tetreault, editors,Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 5477–5490....

work page doi:10.18653/v1/2020.acl-main.486 2020

[65] [65]

Shoemake

K. Shoemake. Animating rotation with quaternion curves. In P. Cole, R. Heilman, and B. A. Barsky, editors,Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1985, San Francisco, California, USA, July 22-26, 1985, pages 245–254. ACM, 1985. doi: 10.1145/325334.325242. URL https://doi.org/10.1145/ 325334.325242

work page doi:10.1145/325334.325242 1985

[66] [66]

Y . Sung, J. Cho, and M. Bansal. VL-ADAPTER: parameter-efficient transfer learning for vision- and-language tasks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 5217–5227. IEEE, 2022. doi: 10. 1109/CVPR52688.2022.00516. URLhttps://doi.org/10.1109/CVPR52688.2022.00516

work page doi:10.1109/cvpr52688.2022.00516 2022

[67] [67]

Q. Team. Qwen1.5-moe: Matching 7b model performance with 1/3 activated parameters, February 2024. URLhttps://qwenlm.github.io/blog/qwen-moe/

2024

[68] [68]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Ba- tra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V . Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V . Kerkez, M. Khabsa, I. Kloumann, A....

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.09288 2023

[69] [69]

Zephyr: Direct Distillation of LM Alignment

L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y . Belkada, S. Huang, L. von Werra, C. Fourrier, N. Habib, N. Sarrazin, O. Sanseviero, A. M. Rush, and T. Wolf. Zephyr: Direct distillation of LM alignment, 2023. URLhttps://doi.org/10.48550/arXiv.2310.16944

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.16944 2023

[70] [70]

Y . Wang, Y . Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self- instruct: Aligning language models with self-generated instructions. In A. Rogers, J. L. Boyd- Graber, and N. Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada,...

work page doi:10.18653/v1/2023.acl-long.754 2023

[71] [71]

A. Wei, N. Haghtalab, and J. Steinhardt. Jailbroken: How does LLM safety training fail? In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, edi- tors,Advances in Neural Information Processing Systems 36: Annual Conference on Neu- ral Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, Decem- ber 10 - 16, 2023, 2023...

2023

[72] [72]

Weidinger, J

L. Weidinger, J. Uesato, M. Rauh, C. Griffin, P. Huang, J. Mellor, A. Glaese, M. Cheng, B. Balle, A. Kasirzadeh, C. Biles, S. Brown, Z. Kenton, W. Hawkins, T. Stepleton, A. Birhane, L. A. Hen- dricks, L. Rimell, W. Isaac, J. Haas, S. Legassick, G. Irving, and I. Gabriel. Taxonomy of risks posed by language models. InFAccT ’22: 2022 ACM Conference on Fairn...

work page doi:10.1145/3531146.3533088 2022

[73] [73]

Wollschläger, J

T. Wollschläger, J. Elstner, S. Geisler, V . Cohen-Addad, S. Günnemann, and J. Gasteiger. The geometry of refusal in large language models: Concept cones and representational independence. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors,Forty-second International Conference on Machine Learning,...

2025

[74] [74]

X. Wu, S. Huang, and F. Wei. Mixture of lora experts. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net,

2024

[75] [75]

URLhttps://openreview.net/forum?id=uWvKBCYh4S

[76] [76]

A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, and Z. Qiu. Qwen2....

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115 2024

[77] [77]

Y . Yuan, W. Jiao, W. Wang, J. Huang, P. He, S. Shi, and Z. Tu. GPT-4 is too smart to be safe: Stealthy chat with llms via cipher. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=MbfAK4s61A

2024

[78] [78]

Y . Yuan, W. Jiao, W. Wang, J. Huang, J. Xu, T. Liang, P. He, and Z. Tu. Refuse whenever you feel unsafe: Improving safety in llms via decoupled refusal training. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Associa- tion for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna...

2025

[79] [79]

Zhang, S

J. Zhang, S. Chen, J. Liu, and J. He. Composing parameter-efficient modules with arith- metic operation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neu- ral Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, Decem- ber 10 -...

2023

[80] [80]

Zhang, P

W. Zhang, P. Torr, M. Elhoseiny, and A. Bibi. Bi-factorial preference optimization: Balancing safety-helpfulness in language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=GjM61KRiTG

2025