pith. machine review for the scientific record.

arxiv: 2605.11679 · v2 · submitted 2026-05-12 · 💻 cs.AI

Recognition: unknown

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:08 UTC · model grok-4.3

classification 💻 cs.AI
keywords multi-objective alignment · LLM safety · helpfulness trade-off · prompt rewriting · reward dimensions · preference expansion · MORA

The pith

The safety-helpfulness conflict in LLMs stems from prompts that inherently limit achievable multi-dimensional rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that tensions between objectives such as helpfulness and harmlessness are not fundamental but arise because a given prompt restricts the range of rewards the model can attain across dimensions. By scaling rollouts and inspecting outputs, the authors conclude that rewriting prompts to embed multiple intents expands the attainable reward space. This matters to a sympathetic reader because it reframes alignment away from forced training compromises and toward input redesign that can raise performance on several metrics simultaneously. The proposed method, MORA, isolates single-reward prompts via pre-sampling, rewrites them to carry multiple intents, and shows concrete gains in both sequential and simultaneous alignment settings.

Core claim

The conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi-dimensional rewards. MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents, yielding single-preference gains of 5% to 12.4% in sequential alignment and a 4.6% average overall reward improvement in simultaneous alignment.

What carries the argument

MORA (Multi-Objective Reward Assimilation), the process of pre-sampling to identify single-reward prompts and then rewriting those prompts to embed multiple preference dimensions so the model can reach higher combined rewards.
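
As described, the mechanism has two stages: pre-sampling to flag prompts whose rollouts earn reward along only one dimension, then rewriting those prompts to fold in additional intents. A minimal Python sketch of that loop, under an assumed variance-based notion of "single-reward"; the helper callables, the threshold, and the dimension names are placeholders, not identifiers from the paper or its released code.

    from statistics import pvariance

    DIMENSIONS = ("helpful", "harmless", "truthful")

    def is_single_reward(prompt, sample_fn, score_fn, n_rollouts=16, var_threshold=0.05):
        """Pre-sampling: roll out the policy and flag prompts whose reward spread
        is concentrated in a single dimension (all other dimensions near-constant)."""
        responses = sample_fn(prompt, n_rollouts)
        per_dim = {d: [score_fn(d, prompt, r) for r in responses] for d in DIMENSIONS}
        spreads = sorted(pvariance(scores) for scores in per_dim.values())
        # heuristic: every dimension except the widest one is essentially flat
        return all(s < var_threshold for s in spreads[:-1])

    def mora_expand(prompts, sample_fn, score_fn, rewrite_fn):
        """Rewrite single-reward prompts so they carry multi-dimensional intents."""
        expanded = []
        for p in prompts:
            if is_single_reward(p, sample_fn, score_fn):
                p = rewrite_fn(p, DIMENSIONS)  # e.g. an LLM call with a fusion template
            expanded.append(p)
        return expanded

The callables would wrap the policy model, the per-dimension reward models, and the rewriting model; the sketch only fixes the control flow the review describes.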

If this is right

  • Sequential multi-preference alignment produces single-preference gains between 5% and 12.4%, with the largest lifts in harmlessness.
  • Simultaneous alignment across helpful, harmless, and truthful dimensions raises average overall reward by 4.6%.
  • The Pareto frontier for competing preferences is not fixed but can be moved outward by prompt-level expansion of reward diversity (see the dominance-check sketch after this list).
  • Aggressive optimization on one objective no longer forces large penalties on others once prompts allow simultaneous access to multiple reward dimensions.
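
One way to make the frontier claim above testable is to treat each aligned model as a point in reward space and check Pareto dominance directly. A self-contained sketch follows; the score tuples are invented placeholders, not numbers from the paper.

    def dominates(a, b):
        """True if a is at least as good as b everywhere and strictly better somewhere."""
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    def pareto_front(points):
        """Non-dominated subset of a list of reward tuples (higher is better)."""
        return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

    def moved_outward(baseline, candidate):
        """Outward shift: some candidate frontier point dominates a baseline frontier
        point, and no baseline frontier point dominates a candidate frontier point."""
        base_f, cand_f = pareto_front(baseline), pareto_front(candidate)
        gains = any(dominates(c, b) for c in cand_f for b in base_f)
        losses = any(dominates(b, c) for b in base_f for c in cand_f)
        return gains and not losses

    # placeholder (helpfulness, harmlessness) scores, not values from the paper
    baseline_runs = [(0.62, 0.71), (0.70, 0.60), (0.55, 0.78)]
    mora_runs = [(0.66, 0.74), (0.73, 0.65)]
    print(moved_outward(baseline_runs, mora_runs))  # True for these made-up points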

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Prompt rewriting may serve as a lightweight, training-free complement to existing data-selection or merging techniques for multi-objective alignment.
  • The same expansion logic could be tested on other conflicting pairs, such as creativity versus factual accuracy or brevity versus completeness.
  • If prompt content is the binding constraint, automated intent-augmented rewriting systems could become a standard preprocessing layer before any alignment training run.

Load-bearing premise

Rewriting original questions to incorporate multi-dimensional intents will reliably expand reward diversity without introducing new biases, reducing coherence, or creating unintended side effects in model outputs.

What would settle it

Measure whether outputs from the rewritten prompts achieve strictly higher combined scores across reward dimensions than outputs from the original prompts; if the gains vanish when the rewriting step is ablated, the central claim is falsified.
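
A concrete form of that test, assuming paired scoring of the same underlying questions under original and MORA-rewritten prompts and an equal-weight aggregation; the aggregation rule and the numbers below are assumptions, not the paper's protocol.

    def combined_reward(scores):
        # equal-weight aggregation over dimensions; an assumption, not the paper's rule
        return sum(scores.values()) / len(scores)

    def rewrite_gain(paired_evals):
        """paired_evals: list of (original_scores, rewritten_scores) dicts for the same
        underlying question. Returns the fraction of pairs where the rewritten prompt
        scores strictly higher on the combined reward, and the mean gain."""
        wins, gains = 0, []
        for original, rewritten in paired_evals:
            g = combined_reward(rewritten) - combined_reward(original)
            gains.append(g)
            wins += g > 0
        return wins / len(paired_evals), sum(gains) / len(gains)

    # placeholder scores over three dimensions, not values from the paper
    pairs = [
        ({"helpful": 0.8, "harmless": 0.3, "truthful": 0.6},
         {"helpful": 0.8, "harmless": 0.7, "truthful": 0.6}),
        ({"helpful": 0.5, "harmless": 0.9, "truthful": 0.5},
         {"helpful": 0.6, "harmless": 0.9, "truthful": 0.6}),
    ]
    win_rate, mean_gain = rewrite_gain(pairs)
    print(win_rate, round(mean_gain, 3))  # 1.0 0.1 on these made-up pairs

If the win rate and mean gain collapse toward zero when the rewriting step is ablated, the central claim does not survive.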

Figures

Figures reproduced from arXiv: 2605.11679 by An Zhang, Junhao Dong, Kaiwen Luo, Kun Wang, Liang Lin, ShiYing Huang, Yuer Li, Zhenhong Zhou, Zhigang Zeng.

Figure 1
Figure 1: Performance and conceptual overview of MORA.
Figure 2
Figure 2: Model Performance Profiles on Helpful vs. Safety Prompts. We show helpfulness score distributions (bars) and safety pass rates (lines) at Pass@N (N ∈ {16, 32, 64}). (a) On pure helpful prompts, the model establishes a static, high-performing profile. (b) On pure safety prompts, a severe alignment tax is observed: the model maintains safety by sacrificing helpfulness (over 30% are Score 1). Crucially, in…
Figure 3
Figure 3: Max-Margin Synthesis Pipeline via Self-Play. We address the safety-helpfulness dilemma through five steps: target mining, intent fusion, self-play & dual-feedback, max-margin selection, and DPO pairing. This constructs optimal preference data, achieving a new SOTA in both dimensions.
Figure 4
Figure 4: Two-objective sequential alignment results for helpfulness and harmlessness.
Figure 5
Figure 5: Data Scaling Effect on Safety-Helpfulness Trade-off. The performance trajectory shows that as the volume of additional MORA-synthesized data increases: (a) it triggers a rapid surge and stabilization in safety, and (b) simultaneously drives a steady breakthrough in helpfulness.
Figure 6
Figure 6: Reward score distributions across varying helpfulness levels for (a) original data on Pure …
Figure 7
Figure 7: The evaluation prompt for helpfulness.
Figure 8
Figure 8: The evaluation prompt for helpfulness.
Figure 9
Figure 9: The prompt template used for synthesizing multi-intent fusion data.
read the original abstract

In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier, failing to fundamentally resolve the inherent trade-off. In this work, we approach this problem from a novel perspective of multi-dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi-dimensional rewards. Based on this core observation, we propose MORA: Multi-Objective Reward Assimilation. Specifically, MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents. Extensive experiments demonstrate that: (1) in sequential alignment, MORA achieves single-preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multiple-preference alignment across helpful, harmless, and truthful dimensions. (2) In simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our codes are available at https://github.com/Shiying-Huang/MORA-MPA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript claims that the safety-helpfulness trade-off in LLM alignment arises because the original prompt inherently restricts the diversity of achievable multi-dimensional rewards. It proposes MORA (Multi-Objective Reward Assimilation), which identifies single-reward prompts via pre-sampling and rewrites them to embed multiple intents (helpfulness, harmlessness, truthfulness). This yields reported gains of 5–12.4% in sequential multi-preference alignment and 4.6% average overall reward improvement in simultaneous alignment.

Significance. If the attribution to dimensional expansion holds, the work offers a training-free, prompt-based intervention that could expand the Pareto frontier for multi-objective alignment in a practical way. The release of code is a positive for reproducibility, though the absence of a mechanistic derivation or quantitative characterization of the claimed prompt restriction limits deeper theoretical contribution.

major comments (3)
  1. [Abstract] Abstract and experimental results: the reported 5–12.4% sequential and 4.6% simultaneous gains are presented without baselines, statistical tests, data splits, or variance measures. This prevents evaluation of whether the improvements exceed what would be expected from generic prompt elaboration.
  2. [Method] Method and experiments: the central claim that gains stem specifically from multi-dimensional intent expansion rests on comparisons only to unmodified original prompts. No ablation is described that holds prompt length and elaboration constant while varying only the number of reward dimensions (e.g., single-objective rewrites that add detail while preserving focus on helpfulness alone).
  3. [Introduction] Core observation: the assertion that 'the prompt itself inherently restricts the achievable multi-dimensional rewards' is supported only by qualitative rollout analysis. No quantitative metric (e.g., reward variance or coverage before/after rewriting) or equation formalizing the restriction is provided, weakening the justification for the MORA intervention.
minor comments (1)
  1. [Method] The manuscript would benefit from explicit notation distinguishing the original prompt P from the MORA-rewritten prompt P' in the method description and any pseudocode.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important opportunities to strengthen the empirical rigor and theoretical grounding of our claims. We respond to each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental results: the reported 5–12.4% sequential and 4.6% simultaneous gains are presented without baselines, statistical tests, data splits, or variance measures. This prevents evaluation of whether the improvements exceed what would be expected from generic prompt elaboration.

    Authors: We agree that additional statistical detail and baselines are needed to substantiate the reported gains. In the revised manuscript we will add (i) explicit comparisons against generic prompt-elaboration baselines that increase length without introducing multi-dimensional intents, (ii) statistical significance tests (paired t-tests with p-values) across the reported metrics, (iii) a clear description of the data splits and evaluation protocol, and (iv) standard-deviation bars computed over five independent runs. These additions will allow readers to assess whether the observed improvements exceed those attributable to elaboration alone. revision: yes

  2. Referee: [Method] Method and experiments: the central claim that gains stem specifically from multi-dimensional intent expansion rests on comparisons only to unmodified original prompts. No ablation is described that holds prompt length and elaboration constant while varying only the number of reward dimensions (e.g., single-objective rewrites that add detail while preserving focus on helpfulness alone).

    Authors: We acknowledge that the current experimental design does not isolate the contribution of dimensional expansion from simple elaboration. We will add a controlled ablation in which single-objective rewrites are generated that preserve the same approximate token length and level of detail but focus exclusively on one dimension (e.g., helpfulness). Performance of these single-objective elaborations will be compared directly against the multi-dimensional MORA rewrites on the same downstream alignment tasks. This ablation will be reported in the revised experiments section. revision: yes

  3. Referee: [Introduction] Core observation: the assertion that 'the prompt itself inherently restricts the achievable multi-dimensional rewards' is supported only by qualitative rollout analysis. No quantitative metric (e.g., reward variance or coverage before/after rewriting) or equation formalizing the restriction is provided, weakening the justification for the MORA intervention.

    Authors: The core observation is currently supported by qualitative rollout examples. To address this limitation we will introduce two quantitative metrics, reward-dimension variance and Pareto-coverage ratio, computed before and after rewriting, and we will include a concise formal statement (an equation) characterizing the prompt-induced restriction on the attainable reward manifold. These additions will appear in the revised introduction and method sections, providing a more rigorous justification for the MORA intervention (a sketch of both metrics follows this rebuttal). revision: yes
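
The two metrics promised in the response above are straightforward to compute from per-dimension rollout scores. A sketch follows; the grid-based coverage definition is one plausible reading of "Pareto-coverage ratio", not the authors' formalization.

    from itertools import product
    from statistics import pvariance

    def reward_dimension_variance(rollout_scores):
        """rollout_scores: list of dicts mapping dimension -> score for one prompt.
        Near-zero variance in all but one dimension is the 'single-reward' signature."""
        dims = rollout_scores[0].keys()
        return {d: pvariance([s[d] for s in rollout_scores]) for d in dims}

    def pareto_coverage_ratio(rollout_scores, grid_steps=10):
        """Fraction of a uniform grid over [0, 1]^k that is weakly dominated by at
        least one rollout; larger values mean a wider attainable reward region."""
        dims = list(rollout_scores[0].keys())
        points = [tuple(s[d] for d in dims) for s in rollout_scores]
        axis = [i / (grid_steps - 1) for i in range(grid_steps)]
        cells = list(product(axis, repeat=len(dims)))
        covered = sum(any(all(p >= c for p, c in zip(pt, cell)) for pt in points)
                      for cell in cells)
        return covered / len(cells)

Comparing both quantities on the same prompt set before and after MORA rewriting would give the promised quantitative footing for the core observation.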

Circularity Check

0 steps flagged

No circularity: empirical observation drives proposal without self-referential reduction

full rationale

The paper's core claim rests on rollout analysis yielding the observation that prompts restrict multi-dimensional rewards, followed by the empirical intervention MORA (rewriting prompts to embed multiple intents). No equations, fitted parameters, or derivations are present that reduce claimed gains to inputs by construction. No self-citations are load-bearing for uniqueness or ansatz; the work is self-contained as an experimental method without renaming known results or smuggling assumptions via prior author work. This matches the default non-circular case for empirical alignment papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on standard assumptions from RLHF and multi-objective optimization literature plus the new MORA procedure; no free parameters or invented physical entities are introduced in the abstract.

axioms (1)
  • domain assumption LLM outputs can be meaningfully scored along independent reward dimensions such as helpfulness, harmlessness, and truthfulness
    Invoked when analyzing rollouts across different reward dimensions
invented entities (1)
  • MORA (Multi-Objective Reward Assimilation) no independent evidence
    purpose: Technique that isolates single-reward prompts and rewrites them to expand achievable multi-dimensional rewards
    New algorithmic procedure proposed in the paper

pith-pipeline@v0.9.0 · 5596 in / 1140 out tokens · 38702 ms · 2026-05-14T21:08:40.273167+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

102 extracted references · 55 canonical work pages · 15 internal anchors

  1. [1]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models.arXiv preprint arXiv:2303.18223, 1(2), 2023

  2. [2]

    A survey on evaluation of large language models.ACM transactions on intelligent systems and technology, 15(3):1–45, 2024

    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models.ACM transactions on intelligent systems and technology, 15(3):1–45, 2024

  3. [3]

    The llama 4 herd: Architecture, training, evaluation, and deployment notes.arXiv preprint arXiv:2601.11659, 2026

    Aaron Adcock, Aayushi Srivastava, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pande, Abhinav Pandey, Abhinav Sharma, Abhishek Kadian, Abhishek Kumawat, Adam Kelsey, et al. The llama 4 herd: Architecture, training, evaluation, and deployment notes. arXiv preprint arXiv:2601.11659, 2026

  4. [4]

    Qwen Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026

  5. [6]

    A survey on trustworthy llm agents: Threats and countermeasures

    Miao Yu, Fanci Meng, Xinyun Zhou, Shilong Wang, Junyuan Mao, Linsey Pan, Tianlong Chen, Kun Wang, Xinfeng Li, Yongfeng Zhang, et al. A survey on trustworthy llm agents: Threats and countermeasures. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 6216–6226, 2025

  6. [7]

    A comprehensive survey in llm (-agent) full stack safety: Data, training and deployment.arXiv preprint arXiv:2504.15585, 2025

    Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, et al. A comprehensive survey in llm (-agent) full stack safety: Data, training and deployment.arXiv preprint arXiv:2504.15585, 2025

  7. [8]

    From System 1 to System 2: A Survey of Reasoning Large Language Models

    Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning large language models.arXiv preprint arXiv:2502.17419, 2025

  8. [9]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

  9. [10]

    OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

    Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, et al. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework.arXiv preprint arXiv:2405.11143, 2024

  10. [11]

    Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719, 2024

    Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719, 2024

  11. [12]

    Towards friendly ai: A comprehensive review and new perspectives on human-ai alignment.arXiv preprint arXiv:2412.15114, 2024

    Qiyang Sun, Yupei Li, Emran Alturki, Sunil Munthumoduku Krishna Murthy, and Björn W Schuller. Towards friendly ai: A comprehensive review and new perspectives on human-ai alignment. arXiv preprint arXiv:2412.15114, 2024

  12. [13]

    Beyond Compromise: Pareto-Lenient Consensus for Efficient Multi-Preference LLM Alignment

    Renxuan Tan, Rongpeng Li, Zhifeng Zhao, and Honggang Zhang. Beyond compromise: Pareto-lenient consensus for efficient multi-preference llm alignment. arXiv preprint arXiv:2604.05965, 2026

  13. [14]

    Towards acyclic preference evaluation of language models via multiple evaluators

    Zhengyu Hu, Jieyu Zhang, Zhihan Xiong, Alexander Ratner, Kaize Ding, and Ranjay Krishna. Towards acyclic preference evaluation of language models via multiple evaluators. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 21903–21911, 2026

  14. [15]

    Adaptive helpfulness–harmlessness alignment with preference vectors

    Ren-Wei Liang, Chin Ting Hsu, Chan-Hung Yu, Saransh Agrawal, Shih-Cheng Huang, Chieh-Yen Lin, Shang-Tse Chen, Kuan-Hao Huang, and Shao-Hua Sun. Adaptive helpfulness–harmlessness alignment with preference vectors. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1...

  15. [16]

    Multi-preference lambda-weighted listwise dpo for dynamic preference alignment.arXiv preprint arXiv:2506.19780, 2025

    Yuhui Sun, Xiyao Wang, Zixi Li, and Jinman Zhao. Multi-preference lambda-weighted listwise dpo for dynamic preference alignment.arXiv preprint arXiv:2506.19780, 2025

  16. [17]

    Reward consistency: Improving multi-objective alignment from a data-centric perspective. arXiv preprint arXiv:2504.11337, 2025

    Zhihao Xu, Yongqi Tong, Xin Zhang, Jun Zhou, and Xiting Wang. Reward consistency: Improving multi-objective alignment from a data-centric perspective. arXiv preprint arXiv:2504.11337, 2025

  17. [19]

    arXiv preprint arXiv:2411.15124 (doi: 10.48550/arXiv.2411.15124)

  18. [20]

    Interpretable preferences via multi-objective reward modeling and mixture-of-experts.arXiv preprint arXiv:2406.12845, 2024

    Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts.arXiv preprint arXiv:2406.12845, 2024

  19. [21]

    Hummer: Towards limited competitive preference dataset.arXiv preprint arXiv:2405.11647, 2024

    Li Jiang, Yusen Wu, Junwu Xiong, Jingqing Ruan, Yichuan Ding, Qingpei Guo, Zujie Wen, Jun Zhou, and Xiaotie Deng. Hummer: Towards limited competitive preference dataset.arXiv preprint arXiv:2405.11647, 2024

  20. [22]

    PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference

    Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Juntao Dai, Boren Zheng, Tianyi Qiu, Jiayi Zhou, Kaile Wang, Boxuan Li, et al. Pku-saferlhf: Towards multi-level safety alignment for llms with human preference.arXiv preprint arXiv:2406.15513, 2024

  21. [23]

    Adam Dahlgren Lindström, Leila Methnani, Lea Krause, Petter Ericson, Íñigo Martínez de Rituerto de Troya, Dimitri Coelho Mollo, and Roel Dobbe. Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through reinforcement learning from human feedback. Ethics and Information Technology, 27(2):28, 2025

  22. [24]

    2d-curri-dpo: Two-dimensional curriculum learning for direct preference optimization.arXiv preprint arXiv:2504.07856, 2025

    Mengyang Li and Zhong Zhang. 2d-curri-dpo: Two-dimensional curriculum learning for direct preference optimization.arXiv preprint arXiv:2504.07856, 2025

  23. [25]

    Orthalign: Orthogonal subspace decomposition for non-interfering multi-objective alignment.arXiv preprint arXiv:2509.24610, 2025

    Liang Lin, Zhihao Xu, Junhao Dong, Jian Zhao, Yuchen Yuan, Guibin Zhang, Miao Yu, Yiming Zhang, Zhengtao Yao, Huahui Yi, et al. Orthalign: Orthogonal subspace decomposition for non-interfering multi-objective alignment.arXiv preprint arXiv:2509.24610, 2025

  24. [26]

    Mracl: Multi-reward space guided adaptive curriculum reinforcement learning for llms

    Wenxuan Liu, Liangyu Huo, Yi Jing, Xiyuan Zhang, and Jian Xie. Mracl: Multi-reward space guided adaptive curriculum reinforcement learning for llms. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 37663–37672, 2026

  25. [27]

    Personalized soups: Personalized large language model alignment via post-hoc parameter merging. arXiv preprint arXiv:2310.11564, 2023

    Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized soups: Personalized large language model alignment via post-hoc parameter merging. arXiv preprint arXiv:2310.11564, 2023

  26. [28]

    Mitigating the alignment tax of rlhf, 2024

    Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, et al. Mitigating the alignment tax of rlhf. arXiv preprint arXiv:2309.06256, 2023

  27. [29]

    Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023

    Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023

  28. [30]

    Mix data or merge models? balancing the helpfulness, honesty, and harmlessness of large language model via model merging. arXiv preprint arXiv:2502.06876, 2025

    Jinluan Yang, Dingnan Jin, Anke Tang, Li Shen, Didi Zhu, Zhengyu Chen, Ziyu Zhao, Daixin Wang, Qing Cui, Zhiqiang Zhang, et al. Mix data or merge models? balancing the helpfulness, honesty, and harmlessness of large language model via model merging. arXiv preprint arXiv:2502.06876, 2025

  29. [31]

    Stay unique, stay efficient: Preserving model personality in multi-task merging.arXiv preprint arXiv:2512.01461, 2025

    Kuangpu Guo, Yuhe Ding, Jian Liang, Zilei Wang, and Ran He. Stay unique, stay efficient: Preserving model personality in multi-task merging.arXiv preprint arXiv:2512.01461, 2025

  30. [32]

    Combining domain and alignment vectors provides better knowledge-safety trade-offs in llms

    Megh Thakkar, Quentin Fournier, Matthew Riemer, Pin-Yu Chen, Amal Zouaq, Payel Das, and Sarath Chandar. Combining domain and alignment vectors provides better knowledge-safety trade-offs in llms. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 268–277, 2025

  31. [33]

    The hidden dimensions of llm alignment: A multi-dimensional analysis of orthogonal safety directions. arXiv preprint arXiv:2502.09674, 2025

    Wenbo Pan, Zhichao Liu, Qiguang Chen, Xiangyang Zhou, Haining Yu, and Xiaohua Jia. The hidden dimensions of llm alignment: A multi-dimensional analysis of orthogonal safety directions.arXiv preprint arXiv:2502.09674, 2025

  32. [34]

    Lssf: Safety alignment for large language models through low-rank safety subspace fusion

    Guanghao Zhou, Panjia Qiu, Cen Chen, Hongyu Li, Jason Chu, Xin Zhang, and Jun Zhou. Lssf: Safety alignment for large language models through low-rank safety subspace fusion. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30621–30638, 2025

  33. [35]

    Advancing llm safe alignment with safety representation ranking.arXiv preprint arXiv:2505.15710, 2025

    Tianqi Du, Zeming Wei, Quan Chen, Chenheng Zhang, and Yisen Wang. Advancing llm safe alignment with safety representation ranking.arXiv preprint arXiv:2505.15710, 2025

  34. [36]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

  35. [37]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  36. [38]

    Panacea: Pareto alignment via preference adaptation for llms

    Yifan Zhong, Chengdong Ma, Xiaoyuan Zhang, Ziran Yang, Haojun Chen, Qingfu Zhang, Siyuan Qi, and Yaodong Yang. Panacea: Pareto alignment via preference adaptation for llms. Advances in Neural Information Processing Systems, 37:75522–75558, 2024

  37. [39]

    A comprehensive survey of direct preference optimization: Datasets, theories, variants, and applications

    Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, et al. A comprehensive survey of direct preference optimization: Datasets, theories, variants, and applications.arXiv preprint arXiv:2410.15595, 2024

  38. [40]

    Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization, 2024

    Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, and Yu Qiao. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization, 2024. URL https://arxiv.org/abs/2310.03708

  39. [41]

    Robust multi-objective preference alignment with online dpo

    Raghav Gupta, Ryan Sullivan, Yunxuan Li, Samrat Phatale, and Abhinav Rastogi. Robust multi-objective preference alignment with online dpo. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27321–27329, 2025

  40. [42]

    Sequential preference optimization: Multi-dimensional preference alignment with implicit reward modeling

    Xingzhou Lou, Junge Zhang, Jian Xie, Lifeng Liu, Dong Yan, and Kaiqi Huang. Sequential preference optimization: Multi-dimensional preference alignment with implicit reward modeling. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27509– 27517, 2025

  41. [43]

    Rs-dpo: A hybrid rejection sampling and direct preference optimization method for alignment of large language models.arXiv preprint arXiv:2402.10038, 2024

    Saeed Khaki, JinJin Li, Lan Ma, Liu Yang, and Prathap Ramachandra. Rs-dpo: A hybrid rejection sampling and direct preference optimization method for alignment of large language models. arXiv preprint arXiv:2402.10038, 2024

  42. [44]

    Controllable preference optimization: Toward controllable multi-objective alignment.arXiv preprint arXiv:2402.19085, 2024

    Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Zexu Sun, Bowen Sun, Huimin Chen, Ruobing Xie, Jie Zhou, Yankai Lin, et al. Controllable preference optimization: Toward controllable multi-objective alignment.arXiv preprint arXiv:2402.19085, 2024

  43. [45]

    Ada-rs: Adaptive rejection sampling for selective thinking.arXiv preprint arXiv:2602.19519, 2026

    Yirou Ge, Yixi Li, Alec Chiu, Shivani Shekhar, Zijie Pan, Avinash Thangali, Yun-Shiuan Chuang, Chaitanya Kulkarni, Uma Kona, Linsey Pang, et al. Ada-rs: Adaptive rejection sampling for selective thinking.arXiv preprint arXiv:2602.19519, 2026

  44. [46]

    Safe RLHF: Safe Reinforcement Learning from Human Feedback

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback.arXiv preprint arXiv:2310.12773, 2023

  45. [47]

    Helpsteer: Multi-attribute helpfulness dataset for steerlm.arXiv preprint arXiv:2311.09528, 2023

    Zhilin Wang, Yi Dong, Jiaqi Zeng, Virginia Adams, Makesh Narsimhan Sreedhar, Daniel Egert, Olivier Delalleau, Jane Polak Scowcroft, Neel Kant, Aidan Swope, et al. Helpsteer: Multi-attribute helpfulness dataset for steerlm.arXiv preprint arXiv:2311.09528, 2023

  46. [48]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  47. [49]

    The llama 3 herd of models, 2024

    AI @ Meta Llama Team. The llama 3 herd of models, 2024

  48. [50]

    Rlhf workflow: From reward modeling to online rlhf

    Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863, 2024

  49. [51]

    Zephyr: Direct distillation of lm alignment

    Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, et al. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944, 2023

  50. [52]

    Model merging with svd to tie the knots.arXiv preprint arXiv:2410.19735, 2024

    George Stoica, Pratik Ramesh, Boglarka Ecsedi, Leshem Choshen, and Judy Hoffman. Model merging with svd to tie the knots.arXiv preprint arXiv:2410.19735, 2024

  51. [53]

    Task singular vectors: Reducing task interference in model merging

    Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, and Emanuele Rodola. Task singular vectors: Reducing task interference in model merging. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18695–18705, 2025

  52. [54]

    Helpsteer 2: Open-source dataset for training top-performing reward models.Advances in Neural Information Processing Systems, 37:1474–1501, 2024

    Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy Zhang, Makesh Narsimhan Sreedhar, and Oleksii Kuchaiev. Helpsteer 2: Open-source dataset for training top-performing reward models.Advances in Neural Information Processing Systems, 37:1474–1501, 2024

  53. [55]

    Ultrafeedback: Boosting language models with high-quality feedback

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. 2023

  54. [56]

    Alpacaeval: An automatic evaluator of instruction-following models, 2023

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models, 2023

  55. [57]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023

  56. [58]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods.arXiv preprint arXiv:2109.07958, 2021

  57. [59]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022

  58. [60]

    Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing.arXiv preprint arXiv:2406.08464, 2024

    Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, and Bill Yuchen Lin. Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing. arXiv preprint arXiv:2406.08464, 2024

  59. [61]

    Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

    Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models.arXiv preprint arXiv:2401.01335, 2024

  60. [62]

    R-Zero: Self-Evolving Reasoning LLM from Zero Data

    Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data. arXiv preprint arXiv:2508.05004, 2025

  61. [63]

    Beyond pass@ 1: Self-play with variational problem synthesis sustains rlvr.arXiv preprint arXiv:2508.14029, 2025

    Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, and Weizhu Chen. Beyond pass@ 1: Self-play with variational problem synthesis sustains rlvr.arXiv preprint arXiv:2508.14029, 2025

  62. [64]

    Star-1: Safer alignment of reasoning llms with 1k data

    Zijun Wang, Haoqin Tu, Yuhan Wang, Juncheng Wu, Yanqing Liu, Jieru Mei, Brian R Bartoldson, Bhavya Kailkhura, and Cihang Xie. Star-1: Safer alignment of reasoning llms with 1k data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 37988–37997, 2026

  63. [65]

    Rewards-in-context: Multi-objective alignment of foundation models with dynamic preference adjustment.arXiv preprint arXiv:2402.10207, 2024

    Rui Yang, Xiaoman Pan, Feng Luo, Shuang Qiu, Han Zhong, Dong Yu, and Jianshu Chen. Rewards-in-context: Multi-objective alignment of foundation models with dynamic preference adjustment.arXiv preprint arXiv:2402.10207, 2024

  64. [66]

    Arithmetic control of llms for diverse user preferences: Directional preference alignment with multi-objective rewards.arXiv preprint arXiv:2402.18571, 2024

    Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, and Tong Zhang. Arithmetic control of llms for diverse user preferences: Directional preference alignment with multi-objective rewards.arXiv preprint arXiv:2402.18571, 2024

  65. [67]

    Conditional language policy: A general framework for steerable multi-objective finetuning.arXiv preprint arXiv:2407.15762, 2024

    Kaiwen Wang, Rahul Kidambi, Ryan Sullivan, Alekh Agarwal, Christoph Dann, Andrea Michi, Marco Gelmi, Yunxuan Li, Raghav Gupta, Avinava Dubey, et al. Conditional language policy: A general framework for steerable multi-objective finetuning.arXiv preprint arXiv:2407.15762, 2024

  66. [68]

    Metaaligner: Towards generalizable multi-objective alignment of language models.Advances in Neural Information Processing Systems, 37:34453–34486, 2024

    Kailai Yang, Zhiwei Liu, Qianqian Xie, Jimin Huang, Tianlin Zhang, and Sophia Ananiadou. Metaaligner: Towards generalizable multi-objective alignment of language models.Advances in Neural Information Processing Systems, 37:34453–34486, 2024

  67. [69]

    Paretohqd: Fast offline multiobjective alignment of large language models using pareto high-quality data

    Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, and Yaochu Jin. Paretohqd: Fast offline multiobjective alignment of large language models using pareto high-quality data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 17454–17462, 2026

  68. [70]

    Steerable chatbots: Personalizing llms with preference-based activation steering.arXiv preprint arXiv:2505.04260, 2025

    Jessica Y Bo, Tianyu Xu, Ishan Chatterjee, Katrina Passarella-Ward, Achin Kulshrestha, and D Shin. Steerable chatbots: Personalizing llms with preference-based activation steering.arXiv preprint arXiv:2505.04260, 2025

  69. [71]

    Confronting reward model overoptimization with constrained rlhf.arXiv preprint arXiv:2310.04373, 2023

    Ted Moskovitz, Aaditya K Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D Dragan, and Stephen McAleer. Confronting reward model overoptimization with constrained rlhf.arXiv preprint arXiv:2310.04373, 2023

  70. [72]

    Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint

    Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint. InInternational Conference on Machine Learning, pages 54715–54754. PMLR, 2024

  71. [73]

    Pareto multi-objective alignment for language models.arXiv preprint arXiv:2508.07768, 2025

    Qiang He and Setareh Maghsudi. Pareto multi-objective alignment for language models.arXiv preprint arXiv:2508.07768, 2025

  72. [74]

    Evolutionary constrained multi-objective optimization: A review.Vicinagearth, 1(1):5, 2024

    Jing Liang, Hongyu Lin, Caitong Yue, Xuanxuan Ban, and Kunjie Yu. Evolutionary constrained multi-objective optimization: A review.Vicinagearth, 1(1):5, 2024

  73. [75]

    Analysis of real-world constrained multi-objective problems and performance comparison of multi-objective algorithms

    Yang Nan, Hisao Ishibuchi, Tianye Shu, and Ke Shang. Analysis of real-world constrained multi-objective problems and performance comparison of multi-objective algorithms. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 576–584, 2024

  74. [76]

    Constraints separation based evolutionary multitasking for constrained multi-objective optimization problems.IEEE/CAA Journal of Automatica Sinica, 11(8):1819–1835, 2024

    Kangjia Qiao, Jing Liang, Kunjie Yu, Xuanxuan Ban, Caitong Yue, Boyang Qu, and Ponnuthurai Nagaratnam Suganthan. Constraints separation based evolutionary multitasking for constrained multi-objective optimization problems. IEEE/CAA Journal of Automatica Sinica, 11(8):1819–1835, 2024

  75. [77]

    Map: Multi-human-value alignment palette.arXiv preprint arXiv:2410.19198, 2024

    Xinran Wang, Qi Le, Ammar Ahmed, Enmao Diao, Yi Zhou, Nathalie Baracaldo, Jie Ding, and Ali Anwar. Map: Multi-human-value alignment palette.arXiv preprint arXiv:2410.19198, 2024

  76. [78]

    Mitigating the safety alignment tax with null-space constrained policy optimization.arXiv preprint arXiv:2512.11391, 2025

    Yifan Niu, Han Xiao, Dongyi Liu, Nuo Chen, and Jia Li. Mitigating the safety alignment tax with null-space constrained policy optimization.arXiv preprint arXiv:2512.11391, 2025

  77. [79]

    Multi-value alignment for llms via value decorrelation and extrapolation

    Hefei Xu, Le Wu, Chen Cheng, and Hao Liu. Multi-value alignment for llms via value decorrelation and extrapolation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34133–34141, 2026

  78. [80]

    Alphadpo: Adaptive reward margin for direct preference optimization.arXiv preprint arXiv:2410.10148, 2024

    Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. Alphadpo: Adaptive reward margin for direct preference optimization.arXiv preprint arXiv:2410.10148, 2024

  79. [81]

    Latent preference coding: Aligning large language models via discrete latent codes. arXiv preprint arXiv:2505.04993, 2025

    Zhuocheng Gong, Jian Guan, Wei Wu, Huishuai Zhang, and Dongyan Zhao. Latent preference coding: Aligning large language models via discrete latent codes. arXiv preprint arXiv:2505.04993, 2025

  80. [82]

    Evolutionary optimization of model merging recipes.Nature Machine Intelligence, 7(2):195–204, 2025

    Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, and David Ha. Evolutionary optimization of model merging recipes.Nature Machine Intelligence, 7(2):195–204, 2025

Showing first 80 references.