pith. machine review for the scientific record.

arxiv: 2605.11679 · v2 · submitted 2026-05-12 · 💻 cs.AI

Recognition: unknown

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:08 UTC · model grok-4.3

classification 💻 cs.AI
keywords multi-objective alignment · LLM safety · helpfulness trade-off · prompt rewriting · reward dimensions · preference expansion · MORA

The pith

The safety-helpfulness conflict in LLMs stems from prompts that inherently limit achievable multi-dimensional rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that tensions between objectives such as helpfulness and harmlessness are not fundamental but arise because a given prompt restricts the range of rewards the model can attain across dimensions. By scaling rollouts and inspecting outputs, the authors conclude that rewriting prompts to embed multiple intents expands the attainable reward space. This matters to a sympathetic reader because it reframes alignment away from forced training compromises and toward input redesign that can raise performance on several metrics simultaneously. The proposed method, MORA, isolates single-reward prompts via pre-sampling, rewrites them to carry multiple intents, and shows concrete gains in both sequential and simultaneous alignment settings.

Core claim

The conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi-dimensional rewards. MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents, yielding single-preference gains of 5% to 12.4% in sequential alignment and a 4.6% average overall reward improvement in simultaneous alignment.

What carries the argument

MORA (Multi-Objective Reward Assimilation), the process of pre-sampling to identify single-reward prompts and then rewriting those prompts to embed multiple preference dimensions so the model can reach higher combined rewards.
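
As described, the mechanism has two stages: pre-sampling to flag prompts whose rollouts earn reward along only one dimension, then rewriting those prompts to fold in additional intents. A minimal Python sketch of that loop, under an assumed variance-based notion of "single-reward"; the helper callables, the threshold, and the dimension names are placeholders, not identifiers from the paper or its released code.

    from statistics import pvariance

    DIMENSIONS = ("helpful", "harmless", "truthful")

    def is_single_reward(prompt, sample_fn, score_fn, n_rollouts=16, var_threshold=0.05):
        """Pre-sampling: roll out the policy and flag prompts whose reward spread
        is concentrated in a single dimension (all other dimensions near-constant)."""
        responses = sample_fn(prompt, n_rollouts)
        per_dim = {d: [score_fn(d, prompt, r) for r in responses] for d in DIMENSIONS}
        spreads = sorted(pvariance(scores) for scores in per_dim.values())
        # heuristic: every dimension except the widest one is essentially flat
        return all(s < var_threshold for s in spreads[:-1])

    def mora_expand(prompts, sample_fn, score_fn, rewrite_fn):
        """Rewrite single-reward prompts so they carry multi-dimensional intents."""
        expanded = []
        for p in prompts:
            if is_single_reward(p, sample_fn, score_fn):
                p = rewrite_fn(p, DIMENSIONS)  # e.g. an LLM call with a fusion template
            expanded.append(p)
        return expanded

The callables would wrap the policy model, the per-dimension reward models, and the rewriting model; the sketch only fixes the control flow the review describes.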

If this is right

  • Sequential multi-preference alignment produces single-preference gains between 5% and 12.4%, with the largest lifts in harmlessness.
  • Simultaneous alignment across helpful, harmless, and truthful dimensions raises average overall reward by 4.6%.
  • The Pareto frontier for competing preferences is not fixed but can be moved outward by prompt-level expansion of reward diversity (see the dominance-check sketch after this list).
  • Aggressive optimization on one objective no longer forces large penalties on others once prompts allow simultaneous access to multiple reward dimensions.
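
One way to make the frontier claim above testable is to treat each aligned model as a point in reward space and check Pareto dominance directly. A self-contained sketch follows; the score tuples are invented placeholders, not numbers from the paper.

    def dominates(a, b):
        """True if a is at least as good as b everywhere and strictly better somewhere."""
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    def pareto_front(points):
        """Non-dominated subset of a list of reward tuples (higher is better)."""
        return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

    def moved_outward(baseline, candidate):
        """Outward shift: some candidate frontier point dominates a baseline frontier
        point, and no baseline frontier point dominates a candidate frontier point."""
        base_f, cand_f = pareto_front(baseline), pareto_front(candidate)
        gains = any(dominates(c, b) for c in cand_f for b in base_f)
        losses = any(dominates(b, c) for b in base_f for c in cand_f)
        return gains and not losses

    # placeholder (helpfulness, harmlessness) scores, not values from the paper
    baseline_runs = [(0.62, 0.71), (0.70, 0.60), (0.55, 0.78)]
    mora_runs = [(0.66, 0.74), (0.73, 0.65)]
    print(moved_outward(baseline_runs, mora_runs))  # True for these made-up points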

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Prompt rewriting may serve as a lightweight, training-free complement to existing data-selection or merging techniques for multi-objective alignment.
  • The same expansion logic could be tested on other conflicting pairs, such as creativity versus factual accuracy or brevity versus completeness.
  • If prompt content is the binding constraint, automated intent-augmented rewriting systems could become a standard preprocessing layer before any alignment training run.

Load-bearing premise

Rewriting original questions to incorporate multi-dimensional intents will reliably expand reward diversity without introducing new biases, reducing coherence, or creating unintended side effects in model outputs.

What would settle it

Measure whether outputs from the rewritten prompts achieve strictly higher combined scores across reward dimensions than outputs from the original prompts; if the gains vanish when the rewriting step is ablated, the central claim is falsified.
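
A concrete form of that test, assuming paired scoring of the same underlying questions under original and MORA-rewritten prompts and an equal-weight aggregation; the aggregation rule and the numbers below are assumptions, not the paper's protocol.

    def combined_reward(scores):
        # equal-weight aggregation over dimensions; an assumption, not the paper's rule
        return sum(scores.values()) / len(scores)

    def rewrite_gain(paired_evals):
        """paired_evals: list of (original_scores, rewritten_scores) dicts for the same
        underlying question. Returns the fraction of pairs where the rewritten prompt
        scores strictly higher on the combined reward, and the mean gain."""
        wins, gains = 0, []
        for original, rewritten in paired_evals:
            g = combined_reward(rewritten) - combined_reward(original)
            gains.append(g)
            wins += g > 0
        return wins / len(paired_evals), sum(gains) / len(gains)

    # placeholder scores over three dimensions, not values from the paper
    pairs = [
        ({"helpful": 0.8, "harmless": 0.3, "truthful": 0.6},
         {"helpful": 0.8, "harmless": 0.7, "truthful": 0.6}),
        ({"helpful": 0.5, "harmless": 0.9, "truthful": 0.5},
         {"helpful": 0.6, "harmless": 0.9, "truthful": 0.6}),
    ]
    win_rate, mean_gain = rewrite_gain(pairs)
    print(win_rate, round(mean_gain, 3))  # 1.0 0.1 on these made-up pairs

If the win rate and mean gain collapse toward zero when the rewriting step is ablated, the central claim does not survive.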

Figures

Figures reproduced from arXiv: 2605.11679 by An Zhang, Junhao Dong, Kaiwen Luo, Kun Wang, Liang Lin, ShiYing Huang, Yuer Li, Zhenhong Zhou, Zhigang Zeng.

Figure 1
Figure 1: Performance and conceptual overview of MORA.
Figure 2
Figure 2: Model Performance Profiles on Helpful vs. Safety Prompts. We show helpfulness score distributions (bars) and safety pass rates (lines) at Pass@N (N ∈ {16, 32, 64}). (a) On pure helpful prompts, the model establishes a static, high-performing profile. (b) On pure safety prompts, a severe alignment tax is observed: the model maintains safety by sacrificing helpfulness (over 30% are Score 1). Crucially, in…
Figure 3
Figure 3: Max-Margin Synthesis Pipeline via Self-Play. We address the safety-helpfulness dilemma through five steps: target mining, intent fusion, self-play & dual-feedback, max-margin selection, and DPO pairing. This constructs optimal preference data, achieving a new SOTA in both dimensions.
Figure 4
Figure 4: Two-objective sequential alignment results for helpfulness and harmlessness.
Figure 5
Figure 5: Data Scaling Effect on Safety-Helpfulness Trade-off. The performance trajectory shows that as the volume of additional MORA-synthesized data increases: (a) it triggers a rapid surge and stabilization in safety, and (b) simultaneously drives a steady breakthrough in helpfulness.
Figure 6
Figure 6: Reward score distributions across varying helpfulness levels for (a) original data on Pure …
Figure 7
Figure 7: The evaluation prompt for helpfulness.
Figure 8
Figure 8: The evaluation prompt for helpfulness.
Figure 9
Figure 9: The prompt template used for synthesizing multi-intent fusion data.
read the original abstract

In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier, failing to fundamentally resolve the inherent trade-off. In this work, we approach this problem from a novel perspective of multi-dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi-dimensional rewards. Based on this core observation, we propose MORA: Multi-Objective Reward Assimilation. Specifically, MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents. Extensive experiments demonstrate that: (1) in sequential alignment, MORA achieves single-preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multiple-preference alignment across helpful, harmless, and truthful dimensions. (2) In simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our codes are available at https://github.com/Shiying-Huang/MORA-MPA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript claims that the safety-helpfulness trade-off in LLM alignment arises because the original prompt inherently restricts the diversity of achievable multi-dimensional rewards. It proposes MORA (Multi-Objective Reward Assimilation), which identifies single-reward prompts via pre-sampling and rewrites them to embed multiple intents (helpfulness, harmlessness, truthfulness). This yields reported gains of 5–12.4% in sequential multi-preference alignment and 4.6% average overall reward improvement in simultaneous alignment.

Significance. If the attribution to dimensional expansion holds, the work offers a training-free, prompt-based intervention that could expand the Pareto frontier for multi-objective alignment in a practical way. The release of code is a positive for reproducibility, though the absence of a mechanistic derivation or quantitative characterization of the claimed prompt restriction limits deeper theoretical contribution.

major comments (3)
  1. [Abstract] Abstract and experimental results: the reported 5–12.4% sequential and 4.6% simultaneous gains are presented without baselines, statistical tests, data splits, or variance measures. This prevents evaluation of whether the improvements exceed what would be expected from generic prompt elaboration.
  2. [Method] Method and experiments: the central claim that gains stem specifically from multi-dimensional intent expansion rests on comparisons only to unmodified original prompts. No ablation is described that holds prompt length and elaboration constant while varying only the number of reward dimensions (e.g., single-objective rewrites that add detail while preserving focus on helpfulness alone).
  3. [Introduction] Core observation: the assertion that 'the prompt itself inherently restricts the achievable multi-dimensional rewards' is supported only by qualitative rollout analysis. No quantitative metric (e.g., reward variance or coverage before/after rewriting) or equation formalizing the restriction is provided, weakening the justification for the MORA intervention.
minor comments (1)
  1. [Method] The manuscript would benefit from explicit notation distinguishing the original prompt P from the MORA-rewritten prompt P' in the method description and any pseudocode.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important opportunities to strengthen the empirical rigor and theoretical grounding of our claims. We respond to each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental results: the reported 5–12.4% sequential and 4.6% simultaneous gains are presented without baselines, statistical tests, data splits, or variance measures. This prevents evaluation of whether the improvements exceed what would be expected from generic prompt elaboration.

    Authors: We agree that additional statistical detail and baselines are needed to substantiate the reported gains. In the revised manuscript we will add (i) explicit comparisons against generic prompt-elaboration baselines that increase length without introducing multi-dimensional intents, (ii) statistical significance tests (paired t-tests with p-values) across the reported metrics, (iii) a clear description of the data splits and evaluation protocol, and (iv) standard-deviation bars computed over five independent runs. These additions will allow readers to assess whether the observed improvements exceed those attributable to elaboration alone. revision: yes

  2. Referee: [Method] Method and experiments: the central claim that gains stem specifically from multi-dimensional intent expansion rests on comparisons only to unmodified original prompts. No ablation is described that holds prompt length and elaboration constant while varying only the number of reward dimensions (e.g., single-objective rewrites that add detail while preserving focus on helpfulness alone).

    Authors: We acknowledge that the current experimental design does not isolate the contribution of dimensional expansion from simple elaboration. We will add a controlled ablation in which single-objective rewrites are generated that preserve the same approximate token length and level of detail but focus exclusively on one dimension (e.g., helpfulness). Performance of these single-objective elaborations will be compared directly against the multi-dimensional MORA rewrites on the same downstream alignment tasks. This ablation will be reported in the revised experiments section. revision: yes

  3. Referee: [Introduction] Core observation: the assertion that 'the prompt itself inherently restricts the achievable multi-dimensional rewards' is supported only by qualitative rollout analysis. No quantitative metric (e.g., reward variance or coverage before/after rewriting) or equation formalizing the restriction is provided, weakening the justification for the MORA intervention.

    Authors: The core observation is currently supported by qualitative rollout examples. To address this limitation we will introduce two quantitative metrics, reward-dimension variance and Pareto-coverage ratio, computed before and after rewriting, and we will include a concise formal statement (an equation) characterizing the prompt-induced restriction on the attainable reward manifold. These additions will appear in the revised introduction and method sections, providing a more rigorous justification for the MORA intervention (a sketch of both metrics follows this rebuttal). revision: yes
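
The two metrics promised in the response above are straightforward to compute from per-dimension rollout scores. A sketch follows; the grid-based coverage definition is one plausible reading of "Pareto-coverage ratio", not the authors' formalization.

    from itertools import product
    from statistics import pvariance

    def reward_dimension_variance(rollout_scores):
        """rollout_scores: list of dicts mapping dimension -> score for one prompt.
        Near-zero variance in all but one dimension is the 'single-reward' signature."""
        dims = rollout_scores[0].keys()
        return {d: pvariance([s[d] for s in rollout_scores]) for d in dims}

    def pareto_coverage_ratio(rollout_scores, grid_steps=10):
        """Fraction of a uniform grid over [0, 1]^k that is weakly dominated by at
        least one rollout; larger values mean a wider attainable reward region."""
        dims = list(rollout_scores[0].keys())
        points = [tuple(s[d] for d in dims) for s in rollout_scores]
        axis = [i / (grid_steps - 1) for i in range(grid_steps)]
        cells = list(product(axis, repeat=len(dims)))
        covered = sum(any(all(p >= c for p, c in zip(pt, cell)) for pt in points)
                      for cell in cells)
        return covered / len(cells)

Comparing both quantities on the same prompt set before and after MORA rewriting would give the promised quantitative footing for the core observation.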

Circularity Check

0 steps flagged

No circularity: empirical observation drives proposal without self-referential reduction

full rationale

The paper's core claim rests on rollout analysis yielding the observation that prompts restrict multi-dimensional rewards, followed by the empirical intervention MORA (rewriting prompts to embed multiple intents). No equations, fitted parameters, or derivations are present that reduce claimed gains to inputs by construction. No self-citations are load-bearing for uniqueness or ansatz; the work is self-contained as an experimental method without renaming known results or smuggling assumptions via prior author work. This matches the default non-circular case for empirical alignment papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on standard assumptions from RLHF and multi-objective optimization literature plus the new MORA procedure; no free parameters or invented physical entities are introduced in the abstract.

axioms (1)
  • domain assumption LLM outputs can be meaningfully scored along independent reward dimensions such as helpfulness, harmlessness, and truthfulness
    Invoked when analyzing rollouts across different reward dimensions
invented entities (1)
  • MORA (Multi-Objective Reward Assimilation) no independent evidence
    purpose: Technique that isolates single-reward prompts and rewrites them to expand achievable multi-dimensional rewards
    New algorithmic procedure proposed in the paper

pith-pipeline@v0.9.0 · 5596 in / 1140 out tokens · 38702 ms · 2026-05-14T21:08:40.273167+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

102 extracted references · 55 canonical work pages · 15 internal anchors

  1. [1]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models.arXiv preprint arXiv:2303.18223, 1(2), 2023

  2. [2]

    A survey on evaluation of large language models.ACM transactions on intelligent systems and technology, 15(3):1–45, 2024

    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models.ACM transactions on intelligent systems and technology, 15(3):1–45, 2024

  3. [3]

    The llama 4 herd: Architecture, training, evaluation, and deployment notes.arXiv preprint arXiv:2601.11659, 2026

    Aaron Adcock, Aayushi Srivastava, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pande, Abhinav Pandey, Abhinav Sharma, Abhishek Kadian, Abhishek Kumawat, Adam Kelsey, et al. The llama 4 herd: Architecture, training, evaluation, and deployment notes. arXiv preprint arXiv:2601.11659, 2026

  4. [4]

    Qwen Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026

  5. [6]

    A survey on trustworthy llm agents: Threats and countermeasures

    Miao Yu, Fanci Meng, Xinyun Zhou, Shilong Wang, Junyuan Mao, Linsey Pan, Tianlong Chen, Kun Wang, Xinfeng Li, Yongfeng Zhang, et al. A survey on trustworthy llm agents: Threats and countermeasures. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 6216–6226, 2025

  6. [7]

    A comprehensive survey in llm (-agent) full stack safety: Data, training and deployment.arXiv preprint arXiv:2504.15585, 2025

    Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, et al. A comprehensive survey in llm (-agent) full stack safety: Data, training and deployment.arXiv preprint arXiv:2504.15585, 2025

  7. [8]

    From System 1 to System 2: A Survey of Reasoning Large Language Models

    Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning large language models.arXiv preprint arXiv:2502.17419, 2025

  8. [9]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

  9. [10]

    OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

    Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, et al. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework.arXiv preprint arXiv:2405.11143, 2024

  10. [11]

    Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719, 2024

    Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719, 2024

  11. [12]

    Towards friendly ai: A comprehensive review and new perspectives on human-ai alignment.arXiv preprint arXiv:2412.15114, 2024

    Qiyang Sun, Yupei Li, Emran Alturki, Sunil Munthumoduku Krishna Murthy, and Björn W Schuller. Towards friendly ai: A comprehensive review and new perspectives on human-ai alignment. arXiv preprint arXiv:2412.15114, 2024

  12. [13]

    Beyond Compromise: Pareto-Lenient Consensus for Efficient Multi-Preference LLM Alignment

    Renxuan Tan, Rongpeng Li, Zhifeng Zhao, and Honggang Zhang. Beyond compromise: Pareto-lenient consensus for efficient multi-preference llm alignment. arXiv preprint arXiv:2604.05965, 2026

  13. [14]

    Towards acyclic preference evaluation of language models via multiple evaluators

    Zhengyu Hu, Jieyu Zhang, Zhihan Xiong, Alexander Ratner, Kaize Ding, and Ranjay Krishna. Towards acyclic preference evaluation of language models via multiple evaluators. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 21903–21911, 2026

  14. [15]

    Adaptive helpfulness–harmlessness alignment with preference vectors

    Ren-Wei Liang, Chin Ting Hsu, Chan-Hung Yu, Saransh Agrawal, Shih-Cheng Huang, Chieh-Yen Lin, Shang-Tse Chen, Kuan-Hao Huang, and Shao-Hua Sun. Adaptive helpfulness–harmlessness alignment with preference vectors. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1...

  15. [16]

    Multi-preference lambda-weighted listwise dpo for dynamic preference alignment.arXiv preprint arXiv:2506.19780, 2025

    Yuhui Sun, Xiyao Wang, Zixi Li, and Jinman Zhao. Multi-preference lambda-weighted listwise dpo for dynamic preference alignment.arXiv preprint arXiv:2506.19780, 2025

  16. [17]

    Reward consistency: Improving multi-objective alignment from a data-centric perspective. arXiv preprint arXiv:2504.11337, 2025

    Zhihao Xu, Yongqi Tong, Xin Zhang, Jun Zhou, and Xiting Wang. Reward consistency: Improving multi-objective alignment from a data-centric perspective. arXiv preprint arXiv:2504.11337, 2025

  17. [19]

    arXiv preprint arXiv:2411.15124 (doi: 10.48550/arXiv.2411.15124)

  18. [20]

    Interpretable preferences via multi-objective reward modeling and mixture-of-experts.arXiv preprint arXiv:2406.12845, 2024

    Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts.arXiv preprint arXiv:2406.12845, 2024

  19. [21]

    Hummer: Towards limited competitive preference dataset.arXiv preprint arXiv:2405.11647, 2024

    Li Jiang, Yusen Wu, Junwu Xiong, Jingqing Ruan, Yichuan Ding, Qingpei Guo, Zujie Wen, Jun Zhou, and Xiaotie Deng. Hummer: Towards limited competitive preference dataset.arXiv preprint arXiv:2405.11647, 2024

  20. [22]

    PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference

    Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Juntao Dai, Boren Zheng, Tianyi Qiu, Jiayi Zhou, Kaile Wang, Boxuan Li, et al. Pku-saferlhf: Towards multi-level safety alignment for llms with human preference.arXiv preprint arXiv:2406.15513, 2024

  21. [23]

    Adam Dahlgren Lindström, Leila Methnani, Lea Krause, Petter Ericson, Íñigo Martínez de Rituerto de Troya, Dimitri Coelho Mollo, and Roel Dobbe. Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through reinforcement learning from human feedback. Ethics and Information Technology, 27(2):28, 2025

  22. [24]

    2d-curri-dpo: Two-dimensional curriculum learning for direct preference optimization.arXiv preprint arXiv:2504.07856, 2025

    Mengyang Li and Zhong Zhang. 2d-curri-dpo: Two-dimensional curriculum learning for direct preference optimization.arXiv preprint arXiv:2504.07856, 2025

  23. [25]

    Orthalign: Orthogonal subspace decomposition for non-interfering multi-objective alignment.arXiv preprint arXiv:2509.24610, 2025

    Liang Lin, Zhihao Xu, Junhao Dong, Jian Zhao, Yuchen Yuan, Guibin Zhang, Miao Yu, Yiming Zhang, Zhengtao Yao, Huahui Yi, et al. Orthalign: Orthogonal subspace decomposition for non-interfering multi-objective alignment.arXiv preprint arXiv:2509.24610, 2025

  24. [26]

    Mracl: Multi-reward space guided adaptive curriculum reinforcement learning for llms

    Wenxuan Liu, Liangyu Huo, Yi Jing, Xiyuan Zhang, and Jian Xie. Mracl: Multi-reward space guided adaptive curriculum reinforcement learning for llms. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 37663–37672, 2026

  25. [27]

    Personalized soups: Personalized large language model alignment via post-hoc parameter merging. arXiv preprint arXiv:2310.11564, 2023

    Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized soups: Personalized large language model alignment via post-hoc parameter merging. arXiv preprint arXiv:2310.11564, 2023

  26. [28]

    Mitigating the alignment tax of rlhf, 2024

    Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, et al. Mitigating the alignment tax of rlhf. arXiv preprint arXiv:2309.06256, 2023

  27. [29]

    Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023

    Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023

  28. [30]

    Mix data or merge models? balancing the helpfulness, honesty, and harmlessness of large language model via model merging. arXiv preprint arXiv:2502.06876, 2025

    Jinluan Yang, Dingnan Jin, Anke Tang, Li Shen, Didi Zhu, Zhengyu Chen, Ziyu Zhao, Daixin Wang, Qing Cui, Zhiqiang Zhang, et al. Mix data or merge models? balancing the helpfulness, honesty, and harmlessness of large language model via model merging. arXiv preprint arXiv:2502.06876, 2025

  29. [31]

    Stay unique, stay efficient: Preserving model personality in multi-task merging.arXiv preprint arXiv:2512.01461, 2025

    Kuangpu Guo, Yuhe Ding, Jian Liang, Zilei Wang, and Ran He. Stay unique, stay efficient: Preserving model personality in multi-task merging.arXiv preprint arXiv:2512.01461, 2025

  30. [32]

    Combining domain and alignment vectors provides better knowledge-safety trade-offs in llms

    Megh Thakkar, Quentin Fournier, Matthew Riemer, Pin-Yu Chen, Amal Zouaq, Payel Das, and Sarath Chandar. Combining domain and alignment vectors provides better knowledge-safety trade-offs in llms. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 268–277, 2025

  31. [33]

    The hidden dimensions of llm alignment: A multi-dimensional analysis of orthogonal safety directions. arXiv preprint arXiv:2502.09674, 2025

    Wenbo Pan, Zhichao Liu, Qiguang Chen, Xiangyang Zhou, Haining Yu, and Xiaohua Jia. The hidden dimensions of llm alignment: A multi-dimensional analysis of orthogonal safety directions.arXiv preprint arXiv:2502.09674, 2025

  32. [34]

    Lssf: Safety alignment for large language models through low-rank safety subspace fusion

    Guanghao Zhou, Panjia Qiu, Cen Chen, Hongyu Li, Jason Chu, Xin Zhang, and Jun Zhou. Lssf: Safety alignment for large language models through low-rank safety subspace fusion. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30621–30638, 2025

  33. [35]

    Advancing llm safe alignment with safety representation ranking.arXiv preprint arXiv:2505.15710, 2025

    Tianqi Du, Zeming Wei, Quan Chen, Chenheng Zhang, and Yisen Wang. Advancing llm safe alignment with safety representation ranking.arXiv preprint arXiv:2505.15710, 2025

  34. [36]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

  35. [37]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  36. [38]

    Panacea: Pareto alignment via preference adaptation for llms

    Yifan Zhong, Chengdong Ma, Xiaoyuan Zhang, Ziran Yang, Haojun Chen, Qingfu Zhang, Siyuan Qi, and Yaodong Yang. Panacea: Pareto alignment via preference adaptation for llms. Advances in Neural Information Processing Systems, 37:75522–75558, 2024

  37. [39]

    A comprehensive survey of direct preference optimization: Datasets, theories, variants, and applications

    Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, et al. A comprehensive survey of direct preference optimization: Datasets, theories, variants, and applications.arXiv preprint arXiv:2410.15595, 2024

  38. [40]

    Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization, 2024

    Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, and Yu Qiao. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization, 2024. URL https://arxiv.org/abs/2310.03708

  39. [41]

    Robust multi-objective preference alignment with online dpo

    Raghav Gupta, Ryan Sullivan, Yunxuan Li, Samrat Phatale, and Abhinav Rastogi. Robust multi-objective preference alignment with online dpo. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27321–27329, 2025

  40. [42]

    Sequential preference optimization: Multi-dimensional preference alignment with implicit reward modeling

    Xingzhou Lou, Junge Zhang, Jian Xie, Lifeng Liu, Dong Yan, and Kaiqi Huang. Sequential preference optimization: Multi-dimensional preference alignment with implicit reward modeling. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27509– 27517, 2025

  41. [43]

    Rs-dpo: A hybrid rejection sampling and direct preference optimization method for alignment of large language models.arXiv preprint arXiv:2402.10038, 2024

    Saeed Khaki, JinJin Li, Lan Ma, Liu Yang, and Prathap Ramachandra. Rs-dpo: A hybrid rejection sampling and direct preference optimization method for alignment of large language models. arXiv preprint arXiv:2402.10038, 2024

  42. [44]

    Controllable preference optimization: Toward controllable multi-objective alignment.arXiv preprint arXiv:2402.19085, 2024

    Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Zexu Sun, Bowen Sun, Huimin Chen, Ruobing Xie, Jie Zhou, Yankai Lin, et al. Controllable preference optimization: Toward controllable multi-objective alignment.arXiv preprint arXiv:2402.19085, 2024

  43. [45]

    Ada-rs: Adaptive rejection sampling for selective thinking.arXiv preprint arXiv:2602.19519, 2026

    Yirou Ge, Yixi Li, Alec Chiu, Shivani Shekhar, Zijie Pan, Avinash Thangali, Yun-Shiuan Chuang, Chaitanya Kulkarni, Uma Kona, Linsey Pang, et al. Ada-rs: Adaptive rejection sampling for selective thinking.arXiv preprint arXiv:2602.19519, 2026

  44. [46]

    Safe RLHF: Safe Reinforcement Learning from Human Feedback

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback.arXiv preprint arXiv:2310.12773, 2023

  45. [47]

    Helpsteer: Multi-attribute helpfulness dataset for steerlm.arXiv preprint arXiv:2311.09528, 2023

    Zhilin Wang, Yi Dong, Jiaqi Zeng, Virginia Adams, Makesh Narsimhan Sreedhar, Daniel Egert, Olivier Delalleau, Jane Polak Scowcroft, Neel Kant, Aidan Swope, et al. Helpsteer: Multi-attribute helpfulness dataset for steerlm.arXiv preprint arXiv:2311.09528, 2023

  46. [48]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  47. [49]

    The llama 3 herd of models, 2024

    AI @ Meta Llama Team. The llama 3 herd of models, 2024

  48. [50]

    Rlhf workflow: From reward modeling to online rlhf

    Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863, 2024

  49. [51]

    Zephyr: Direct distillation of lm alignment

    Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, et al. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944, 2023

  50. [52]

    Model merging with svd to tie the knots.arXiv preprint arXiv:2410.19735, 2024

    George Stoica, Pratik Ramesh, Boglarka Ecsedi, Leshem Choshen, and Judy Hoffman. Model merging with svd to tie the knots.arXiv preprint arXiv:2410.19735, 2024

  51. [53]

    Task singular vectors: Reducing task interference in model merging

    Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, and Emanuele Rodola. Task singular vectors: Reducing task interference in model merging. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18695–18705, 2025

  52. [54]

    Helpsteer 2: Open-source dataset for training top-performing reward models.Advances in Neural Information Processing Systems, 37:1474–1501, 2024

    Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy Zhang, Makesh Narsimhan Sreedhar, and Oleksii Kuchaiev. Helpsteer 2: Open-source dataset for training top-performing reward models.Advances in Neural Information Processing Systems, 37:1474–1501, 2024

  53. [55]

    Ultrafeedback: Boosting language models with high-quality feedback

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. 2023

  54. [56]

    Alpacaeval: An automatic evaluator of instruction-following models, 2023

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models, 2023

  55. [57]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023

  56. [58]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods.arXiv preprint arXiv:2109.07958, 2021

  57. [59]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022

  58. [60]

    Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing.arXiv preprint arXiv:2406.08464, 2024

    Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, and Bill Yuchen Lin. Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing. arXiv preprint arXiv:2406.08464, 2024

  59. [61]

    Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

    Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models.arXiv preprint arXiv:2401.01335, 2024

  60. [62]

    R-Zero: Self-Evolving Reasoning LLM from Zero Data

    Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data. arXiv preprint arXiv:2508.05004, 2025

  61. [63]

    Beyond pass@ 1: Self-play with variational problem synthesis sustains rlvr.arXiv preprint arXiv:2508.14029, 2025

    Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, and Weizhu Chen. Beyond pass@ 1: Self-play with variational problem synthesis sustains rlvr.arXiv preprint arXiv:2508.14029, 2025

  62. [64]

    Star-1: Safer alignment of reasoning llms with 1k data

    Zijun Wang, Haoqin Tu, Yuhan Wang, Juncheng Wu, Yanqing Liu, Jieru Mei, Brian R Bartoldson, Bhavya Kailkhura, and Cihang Xie. Star-1: Safer alignment of reasoning llms with 1k data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 37988–37997, 2026

  63. [65]

    Rewards-in-context: Multi-objective alignment of foundation models with dynamic preference adjustment.arXiv preprint arXiv:2402.10207, 2024

    Rui Yang, Xiaoman Pan, Feng Luo, Shuang Qiu, Han Zhong, Dong Yu, and Jianshu Chen. Rewards-in-context: Multi-objective alignment of foundation models with dynamic preference adjustment.arXiv preprint arXiv:2402.10207, 2024

  64. [66]

    Arithmetic control of llms for diverse user preferences: Directional preference alignment with multi-objective rewards.arXiv preprint arXiv:2402.18571, 2024

    Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, and Tong Zhang. Arithmetic control of llms for diverse user preferences: Directional preference alignment with multi-objective rewards.arXiv preprint arXiv:2402.18571, 2024

  65. [67]

    Conditional language policy: A general framework for steerable multi-objective finetuning.arXiv preprint arXiv:2407.15762, 2024

    Kaiwen Wang, Rahul Kidambi, Ryan Sullivan, Alekh Agarwal, Christoph Dann, Andrea Michi, Marco Gelmi, Yunxuan Li, Raghav Gupta, Avinava Dubey, et al. Conditional language policy: A general framework for steerable multi-objective finetuning.arXiv preprint arXiv:2407.15762, 2024

  66. [68]

    Metaaligner: Towards generalizable multi-objective alignment of language models.Advances in Neural Information Processing Systems, 37:34453–34486, 2024

    Kailai Yang, Zhiwei Liu, Qianqian Xie, Jimin Huang, Tianlin Zhang, and Sophia Ananiadou. Metaaligner: Towards generalizable multi-objective alignment of language models.Advances in Neural Information Processing Systems, 37:34453–34486, 2024

  67. [69]

    Paretohqd: Fast offline multiobjective alignment of large language models using pareto high-quality data

    Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, and Yaochu Jin. Paretohqd: Fast offline multiobjective alignment of large language models using pareto high-quality data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 17454–17462, 2026

  68. [70]

    Steerable chatbots: Personalizing llms with preference-based activation steering.arXiv preprint arXiv:2505.04260, 2025

    Jessica Y Bo, Tianyu Xu, Ishan Chatterjee, Katrina Passarella-Ward, Achin Kulshrestha, and D Shin. Steerable chatbots: Personalizing llms with preference-based activation steering.arXiv preprint arXiv:2505.04260, 2025

  69. [71]

    Confronting reward model overoptimization with constrained rlhf.arXiv preprint arXiv:2310.04373, 2023

    Ted Moskovitz, Aaditya K Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D Dragan, and Stephen McAleer. Confronting reward model overoptimization with constrained rlhf.arXiv preprint arXiv:2310.04373, 2023

  70. [72]

    Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint

    Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint. InInternational Conference on Machine Learning, pages 54715–54754. PMLR, 2024

  71. [73]

    Pareto multi-objective alignment for language models.arXiv preprint arXiv:2508.07768, 2025

    Qiang He and Setareh Maghsudi. Pareto multi-objective alignment for language models.arXiv preprint arXiv:2508.07768, 2025

  72. [74]

    Evolutionary constrained multi-objective optimization: A review.Vicinagearth, 1(1):5, 2024

    Jing Liang, Hongyu Lin, Caitong Yue, Xuanxuan Ban, and Kunjie Yu. Evolutionary constrained multi-objective optimization: A review.Vicinagearth, 1(1):5, 2024

  73. [75]

    Analysis of real-world constrained multi-objective problems and performance comparison of multi-objective algorithms

    Yang Nan, Hisao Ishibuchi, Tianye Shu, and Ke Shang. Analysis of real-world constrained multi-objective problems and performance comparison of multi-objective algorithms. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 576–584, 2024

  74. [76]

    Constraints separation based evolutionary multitasking for constrained multi-objective optimization problems.IEEE/CAA Journal of Automatica Sinica, 11(8):1819–1835, 2024

    Kangjia Qiao, Jing Liang, Kunjie Yu, Xuanxuan Ban, Caitong Yue, Boyang Qu, and Ponnuthurai Nagaratnam Suganthan. Constraints separation based evolutionary multitasking for constrained multi-objective optimization problems. IEEE/CAA Journal of Automatica Sinica, 11(8):1819–1835, 2024

  75. [77]

    Map: Multi-human-value alignment palette.arXiv preprint arXiv:2410.19198, 2024

    Xinran Wang, Qi Le, Ammar Ahmed, Enmao Diao, Yi Zhou, Nathalie Baracaldo, Jie Ding, and Ali Anwar. Map: Multi-human-value alignment palette.arXiv preprint arXiv:2410.19198, 2024

  76. [78]

    Mitigating the safety alignment tax with null-space constrained policy optimization.arXiv preprint arXiv:2512.11391, 2025

    Yifan Niu, Han Xiao, Dongyi Liu, Nuo Chen, and Jia Li. Mitigating the safety alignment tax with null-space constrained policy optimization.arXiv preprint arXiv:2512.11391, 2025

  77. [79]

    Multi-value alignment for llms via value decorrelation and extrapolation

    Hefei Xu, Le Wu, Chen Cheng, and Hao Liu. Multi-value alignment for llms via value decorrelation and extrapolation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34133–34141, 2026

  78. [80]

    Alphadpo: Adaptive reward margin for direct preference optimization.arXiv preprint arXiv:2410.10148, 2024

    Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. Alphadpo: Adaptive reward margin for direct preference optimization.arXiv preprint arXiv:2410.10148, 2024

  79. [81]

    Latent preference coding: Aligning large language models via discrete latent codes. arXiv preprint arXiv:2505.04993, 2025

    Zhuocheng Gong, Jian Guan, Wei Wu, Huishuai Zhang, and Dongyan Zhao. Latent preference coding: Aligning large language models via discrete latent codes. arXiv preprint arXiv:2505.04993, 2025

  80. [82]

    Evolutionary optimization of model merging recipes.Nature Machine Intelligence, 7(2):195–204, 2025

    Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, and David Ha. Evolutionary optimization of model merging recipes.Nature Machine Intelligence, 7(2):195–204, 2025

Showing first 80 references.