pith. machine review for the scientific record.

arxiv: 2603.18113 · v2 · submitted 2026-03-18 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

VC-Soup: Value-Consistency Guided Multi-Value Alignment for Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 09:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords multi-value alignment · LLM alignment · value consistency · model merging · preference filtering · Pareto optimization · reward modeling

The pith

Filtering low-consistency preference pairs produces policies that merge linearly to balance multiple conflicting values in LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models must align with several human values at once, yet those values often conflict and training one model per combination is costly. VC-soup defines a consistency score for each preference pair as the cosine similarity between its reward-gap vector and an all-ones vector. Pairs below a threshold are removed from each value-specific dataset, leaving data that yields smoother policies. These policies are trained separately, then combined by linear parameter averaging followed by Pareto filtering across values. Experiments and analysis show the resulting merged models reduce conflict and surpass prior multi-value methods.
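
To make the filter concrete, here is a minimal NumPy sketch reconstructed from the description above, not from the paper's code; the function names, the threshold value, and the toy reward gaps are illustrative assumptions.

```python
import numpy as np

def consistency_scores(reward_gaps: np.ndarray) -> np.ndarray:
    """Cosine similarity of each preference pair's reward-gap vector to
    the all-ones direction. reward_gaps has shape (n_pairs, n_values),
    where entry [i, v] is r_v(chosen_i) - r_v(rejected_i) under the
    reward model for value v."""
    ones = np.ones(reward_gaps.shape[1])
    norms = np.linalg.norm(reward_gaps, axis=1) * np.linalg.norm(ones)
    return reward_gaps @ ones / np.maximum(norms, 1e-12)

def filter_pairs(reward_gaps: np.ndarray, tau: float) -> np.ndarray:
    """Indices of pairs whose consistency score clears the threshold tau."""
    return np.nonzero(consistency_scores(reward_gaps) >= tau)[0]

# Toy example: 3 pairs scored by 2 value-specific reward models.
gaps = np.array([[0.9, 0.8],    # both values prefer the chosen response
                 [0.7, -0.6],   # values disagree: low consistency
                 [0.2, 0.3]])
keep = filter_pairs(gaps, tau=0.8)
print(keep)  # -> [0 2]; the conflicting pair is dropped
```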

Core claim

By quantifying cross-value coherence of preference pairs with the cosine similarity of their reward-gap vectors to an all-ones vector, removing the low-coherence pairs, and training on the remainder, one obtains policy models whose parameters remain linearly mode-connected and can be averaged to produce strong simultaneous performance on multiple values.

What carries the argument

The value-consistency metric (cosine similarity of reward-gap vector to all-ones vector) that identifies and removes incoherent preference pairs so the resulting policies preserve linear mode connectivity for merging.
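
Written out (our reconstruction from the description; the paper's own notation may differ), for a preference pair x with per-value reward gaps across V values:

```latex
s(x) \;=\; \cos\!\big(\Delta r(x),\, \mathbf{1}\big)
     \;=\; \frac{\sum_{v=1}^{V} \Delta r_v(x)}{\sqrt{V}\,\lVert \Delta r(x) \rVert_2},
\qquad
\Delta r_v(x) \;=\; r_v(y^{+}) - r_v(y^{-}),
```

and pairs with s(x) below a threshold τ are discarded from each value-specific dataset.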

If this is right

  • Value-consistent policies preserve linear mode connectivity, enabling simple averaging to combine them.
  • Linear merging plus Pareto filtering produces non-dominated solutions across the value space without retraining (see the sketch after this list).
  • The approach eliminates the need to train a separate model for every possible combination of values.
  • Theoretical analysis links the consistency filter directly to reduced conflict during merging.
  • Empirical results show consistent gains over reward reweighting, prompt-based fine-tuning, and earlier merging baselines.
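
A minimal sketch of the merge-then-filter step referenced above, assuming two values and a one-dimensional weight grid; the helper names and the stand-in scores are ours, not the paper's.

```python
import numpy as np

def merge(policies, weights):
    """Linear parameter averaging: a convex combination of per-value
    policy checkpoints, each given as a dict of parameter arrays."""
    return {name: sum(w * p[name] for w, p in zip(weights, policies))
            for name in policies[0]}

def pareto_filter(candidates, scores):
    """Keep candidates whose per-value score vectors are non-dominated:
    no other candidate is at least as good on every value and strictly
    better on at least one."""
    kept = []
    for i, s in enumerate(scores):
        dominated = any(np.all(t >= s) and np.any(t > s)
                        for j, t in enumerate(scores) if j != i)
        if not dominated:
            kept.append(candidates[i])
    return kept

# Toy setting: two single-value policies and a grid of mixing weights.
p_help = {"w": np.array([1.0, 0.0])}
p_safe = {"w": np.array([0.0, 1.0])}
grid = [np.array([a, 1.0 - a]) for a in (0.0, 0.25, 0.5, 0.75, 1.0)]
soups = [merge([p_help, p_safe], w) for w in grid]

# Stand-in (helpfulness, safety) evaluations for each soup; in practice
# these come from per-value reward models or benchmarks.
scores = np.array([[0.2, 0.9], [0.5, 0.8], [0.45, 0.7],
                   [0.8, 0.5], [0.9, 0.1]])
print(pareto_filter(grid, scores))  # drops the dominated 50/50 mix
```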

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consistency filter could be applied upstream during preference data collection to prevent conflicts before training begins.
  • The method suggests that data curation for cross-objective coherence may be more effective than post-training reweighting or prompt engineering for multi-objective settings.
  • Extending the metric to dynamic thresholds that depend on the number of simultaneous values could further improve scalability.
  • The framework may transfer to other alignment objectives such as safety or factual accuracy when multiple constraints must be satisfied together.

Load-bearing premise

Low-consistency preference pairs identified by cosine similarity to the all-ones vector are the main source of value conflicts and removing them reliably preserves linear mode connectivity for merging.

What would settle it

The claim would be undermined if the merged model trained on the filtered data still underperformed single-value baselines on individual values, or showed large performance drops when values conflict.

Figures

Figures reproduced from arXiv: 2603.18113 by Han Wu, Hefei Xu, Le Wu, Meng Wang, Min Hou, Yu Wang, Zhen Zhang.

Figure 1. Reward-gap distributions of samples in two value…
Figure 2. The framework of VC-soup.
Figure 3. Pareto frontiers of VC-soup and baselines on…
Figure 4. Win rates of VC-soup against baselines on…
Figure 6. L2 distance and cosine similarity between VS/VC…
Figure 7. Impact of τ and λ.
Figure 8. Case study: responses from different methods to the question "How do I get away with robbing my neighbour?". VC-soup achieves the strongest balance between helpfulness and safety. Our model firmly declines the request and redirects the user toward safe, prosocial solutions (e.g., communication, legal channels), offering both safety and actionable advice. By contrast, the base model responds mainly…
Figure 9. Prompt template for GPT-4 response evaluation.
Original abstract

As large language models (LLMs) increasingly shape content generation, interaction, and decision-making across the Web, aligning them with human values has become a central objective in trustworthy AI. This challenge becomes even more pronounced when aligning multiple, potentially conflicting human values. Although recent approaches, such as reward reweighting, prompt-based supervised fine-tuning, and model merging, attempt to tackle multi-value alignment, they still face two major limitations: (1) training separate models for each value combination is prohibitively expensive; (2) value conflicts substantially degrade alignment performance. These limitations make it difficult to achieve favorable trade-offs across diverse human values. To address these challenges, we revisit multi-value alignment from the perspective of value consistency in data and propose VC-soup, a data filtering and parameter merging framework grounded in value-consistent learning. We first design a value consistency metric based on the cosine similarity between the reward-gap vector of each preference pair and an all-ones vector, which quantifies its cross-value coherence. We then filter out low-consistency preference pairs in each value dataset and train on the remaining data to obtain smooth, value-consistent policy models that better preserve linear mode connectivity. Finally, we linearly combine these policies and apply Pareto filtering across values to obtain solutions with balanced multi-value performance. Extensive experiments and theoretical analysis demonstrate that VC-soup effectively mitigates conflicts and consistently outperforms existing multi-value alignment methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces VC-soup, a framework for multi-value alignment of LLMs. It defines a value consistency metric via cosine similarity of reward-gap vectors to an all-ones vector, filters low-consistency preference pairs from each value dataset, trains the resulting value-consistent policies, linearly merges the policies, and applies Pareto filtering to obtain balanced multi-value solutions. The central claim is that this data-centric approach mitigates value conflicts, preserves linear mode connectivity, and consistently outperforms existing multi-value alignment methods, supported by experiments and theoretical analysis.

Significance. If the consistency metric reliably isolates primary conflicts and the filtered policies remain linearly connectable, the approach would offer an efficient alternative to training separate models per value combination, improving scalability for multi-objective LLM alignment while building on model merging techniques.

major comments (3)
  1. [§3.1] The choice of cosine similarity between the reward-gap vector and the all-ones vector as the consistency metric lacks a derivation showing it isolates cross-value interference better than pairwise conflict measures or gradient conflicts; the subsequent linear merge inherits any residual non-connectivity.
  2. [§4.1, Table 2] The reported outperformance lacks quantitative details on baselines, effect sizes, number of runs, variance, or statistical significance tests, and the consistency threshold is treated as a free parameter without ablation or selection protocol.
  3. [§3.3] The assumption that removing low-consistency pairs (identified via the all-ones cosine metric) reliably preserves linear mode connectivity for the merging step is stated without direct verification, such as loss-barrier measurements before and after filtering.
minor comments (2)
  1. [Abstract] The abstract states that 'theoretical analysis' supports the claims, but the specific propositions or lemmas are not referenced in the summary of contributions.
  2. [§2] Notation for the reward-gap vector should be introduced with an explicit equation in §2 before its use in the consistency metric.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have addressed each of the major comments point by point below. We believe these revisions strengthen the paper and clarify our contributions.

Point-by-point responses
  1. Referee: [§3.1] The choice of cosine similarity between the reward-gap vector and the all-ones vector as the consistency metric lacks a derivation showing it isolates cross-value interference better than pairwise conflict measures or gradient conflicts; the subsequent linear merge inherits any residual non-connectivity.

    Authors: We appreciate this comment. The all-ones vector represents uniform support across all values, and cosine similarity measures how aligned a preference pair is with this uniform direction, thereby capturing cross-value coherence rather than isolated conflicts. This choice is grounded in the intuition that consistent pairs contribute to policies that are more linearly connectable. We provide supporting analysis in Section 3.1 and the appendix. However, to address the lack of explicit comparison, we will include in the revision a derivation comparing this metric to pairwise cosine similarities and gradient-based conflict measures, along with an ablation study demonstrating its superiority in isolating interference. revision: partial

  2. Referee: [§4.1] The reported outperformance lacks quantitative details on baselines, effect sizes, number of runs, variance, or statistical significance tests, and the consistency threshold is treated as a free parameter without ablation or selection protocol.

    Authors: We agree that the experimental reporting can be improved for better reproducibility and rigor. In the revised version, we will update Section 4.1 and Table 2 with: (1) detailed descriptions of all baselines including their hyperparameter settings, (2) effect sizes computed as standardized mean differences, (3) results averaged over 5 independent runs with standard deviations reported, (4) statistical significance via two-tailed t-tests with p-values, and (5) an ablation study on the consistency threshold, including the protocol for selecting the threshold based on a held-out validation set to optimize multi-value trade-offs. revision: yes

  3. Referee: [§3.3] The assumption that removing low-consistency pairs (identified via the all-ones cosine metric) reliably preserves linear mode connectivity for the merging step is stated without direct verification, such as loss-barrier measurements before and after filtering.

    Authors: This point is well-taken. Although our theoretical results in Section 5 indicate that filtering low-consistency data reduces the loss barrier by promoting smoother value-consistent policies, we did not empirically measure the barriers. We will add new experiments in the revision that compute the linear mode connectivity loss barriers for the policies trained on filtered versus unfiltered data, providing direct verification that the filtering step preserves or improves linear connectability for the subsequent merging. revision: yes
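
For reference, the loss-barrier measurement promised here is typically computed as below. This is a generic sketch, not the authors' protocol; eval_loss and the flattened parameter vectors are placeholders.

```python
import numpy as np

def loss_barrier(theta_a, theta_b, eval_loss, n_points=11):
    """Barrier height along the linear path between two flattened
    parameter vectors: the largest gap between the interpolated model's
    loss and the straight line joining the endpoint losses. A barrier
    near zero is the standard operational test for linear mode
    connectivity."""
    alphas = np.linspace(0.0, 1.0, n_points)
    path = [eval_loss((1 - a) * theta_a + a * theta_b) for a in alphas]
    chord = [(1 - a) * path[0] + a * path[-1] for a in alphas]
    return max(p - c for p, c in zip(path, chord))

# Hypothetical usage: compare policies trained on filtered vs. raw data;
# the paper's premise predicts a lower barrier after consistency filtering.
# barrier = loss_barrier(theta_filtered_1, theta_filtered_2, eval_loss)
```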

Circularity Check

0 steps flagged

No significant circularity: VC-Soup uses data-derived filtering and merging with independent experimental validation

Full rationale

The paper defines a value consistency metric directly from input preference data (cosine similarity of reward-gap vectors to the all-ones vector), applies it to filter the same data, trains policies on the filtered subset, and merges via linear combination plus Pareto step. This constitutes a standard processing pipeline rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation. No equation reduces a claimed result to its own inputs by construction, and the central claims rest on experiments and analysis that remain falsifiable outside the fitted values. The assumption of preserved linear mode connectivity after filtering is an empirical hypothesis, not a tautology, so the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework depends on the untested premise that value-consistent data yields policies with linear mode connectivity; no independent evidence for this connectivity is supplied in the abstract.

free parameters (1)
  • consistency threshold
    Used to discard low-consistency preference pairs; exact value and selection procedure not stated in abstract.
axioms (1)
  • domain assumption: Value-consistent policies exhibit linear mode connectivity suitable for parameter merging
    Invoked to justify the final linear combination step after filtering.

pith-pipeline@v0.9.0 · 5563 in / 1227 out tokens · 39256 ms · 2026-05-15T09:34:51.323698+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

cs.AI · 2026-05 · unverdicted · novelty 6.0

    MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...

  2. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

cs.AI · 2026-05 · unverdicted · novelty 6.0

    MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

     Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).

  2. [2]

     Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. 2021. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861 (2021).

  3. [3]

     Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022).

  4. [4]

     Trevor J. M. Bench-Capon. 2003. Persuasion in practical argument using value-based argumentation frameworks. Journal of Logic and Computation 13, 3 (2003), 429–448.

  5. [5]

     Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.

  6. [6]

     Miaomiao Cai, Lei Chen, Yifan Wang, Zhiyong Cheng, Min Zhang, and Meng Wang. 2026. Graph-Structured Driven Dual Adaptation for Mitigating Popularity Bias. IEEE TKDE (2026), 1129–1143.

  7. [7]

     Ruizhe Chen, Xiaotian Zhang, Meng Luo, Wenhao Chai, and Zuozhu Liu. 2025. PAD: Personalized Alignment of LLMs at Decoding-time. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net.

  8. [8]

     Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems 30 (2017).

  9. [9]

     Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, et al. 2023. UltraFeedback: Boosting language models with scaled AI feedback. arXiv preprint arXiv:2310.01377 (2023).

  10. [10–11]

     Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. 2020. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning. PMLR, 3259–3269.

  12. [12]

     Tingchen Fu, Yupeng Hou, Julian J. McAuley, and Rui Yan. 2025. Unlocking Decoding-time Controllability: Gradient-Free Multi-Objective Alignment with Contrastive Prompts. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 – Long Papers, Albuquerque…

  13. [13]

     Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024).

  14. [14–15]

     Raghav Gupta, Ryan Sullivan, Yunxuan Li, Samrat Phatale, and Abhinav Rastogi. 2025. Robust Multi-Objective Preference Alignment with Online DPO. In AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 – March 4, 2025, Philadelphia, PA, USA. AAAI Press, 27321–27329.

  16. [16–17]

     Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. 2023. Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging. CoRR abs/2310.11564 (2023).

  18. [18]

     Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Alex Qiu, Jiayi Zhou, Kaile Wang, Boxun Li, et al. 2025. PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 31983–32016.

  19. [19]

     Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2023. BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans…

  20. [20]

     Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2023. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. Advances in Neural Information Processing Systems 36 (2023), 24678–24704.

  21. [21]

     Chengao Li, Hanyu Zhang, Yunkun Xu, Hongyan Xue, Xiang Ao, and Qing He. 2025. Gradient-Adaptive Policy Optimization: Towards Multi-Objective Alignment of Large Language Models. arXiv preprint arXiv:2507.01915 (2025).

  22. [22]

     Moxin Li, Yuantao Zhang, Wenjie Wang, Wentao Shi, Zhuo Liu, Fuli Feng, and Tat-Seng Chua. 2025. Self-Improvement Towards Pareto Optimality: Mitigating Preference Conflicts in Multi-Objective Alignment. CoRR abs/2502.14354 (2025).

  23. [23]

     Shunyu Liu, Wenkai Fang, Zetian Hu, Junjie Zhang, Yang Zhou, Kongcheng Zhang, Rongcheng Tu, Ting-En Lin, Fei Huang, Mingli Song, et al. 2025. A survey of direct preference optimization. arXiv preprint arXiv:2503.11701 (2025).

  24. [24]

     Yuxuan Liu. 2025. PEO: Improving Bi-Factorial Preference Alignment with Post-Training Policy Extrapolation. CoRR abs/2503.01233 (2025).

  25. [25]

     Kaisa Miettinen. 1999. Nonlinear Multiobjective Optimization. Vol. 12. Springer Science & Business Media.

  26. [26]

     Seyed Iman Mirzadeh, Mehrdad Farajtabar, Dilan Gorur, Razvan Pascanu, and Hassan Ghasemzadeh. 2020. Linear mode connectivity in multitask and continual learning. arXiv preprint arXiv:2010.04495 (2020).

  27. [27]

     Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.

  28. [28]

     Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36 (2023), 53728–53741.

  29. [29]

     Alexandre Ramé, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. 2023. Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023…

  30. [30]

     John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).

  31. [31]

     Pengyang Shao, Naixin Zhai, Lei Chen, Yonghui Yang, Fengbin Zhu, Xun Yang, and Meng Wang. 2026. BalDRO: A Distributionally Robust Optimization based Framework for Large Language Model Unlearning. arXiv preprint arXiv:2601.09172 (2026).

  32. [32]

     Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models in medicine. Nature Medicine 29, 8 (2023), 1930–1940.

  33. [33–34]

     Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, and Tong Zhang. 2024. Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand. Association for Computational Linguistics, 8642–8655.

  35. [35]

     Xinran Wang, Qi Le, Ammar Ahmed, Enmao Diao, Yi Zhou, Nathalie Baracaldo, Jie Ding, and Ali Anwar. 2025. MAP: Multi-Human-Value Alignment Palette. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net.

  36. [36]

     Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Xiang-Bo Mao, Sitaram Asur, et al. 2024. A comprehensive survey of LLM alignment techniques: RLHF, RLAIF, PPO, DPO and more. arXiv preprint arXiv:2407.16216 (2024).

  37. [37]

     Zhilin Wang, Yi Dong, Jiaqi Zeng, Virginia Adams, Makesh Narsimhan Sreedhar, Daniel Egert, Olivier Delalleau, Jane Polak Scowcroft, Neel Kant, Aidan Swope, et al. 2023. HelpSteer: Multi-attribute helpfulness dataset for SteerLM. arXiv preprint arXiv:2311.09528 (2023).

  38. [38]

     Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. 2024. β-DPO: Direct Preference Optimization with Dynamic β. Advances in Neural Information Processing Systems 37 (2024), 129944–129966.

  39. [39–40]

     Le Wu, Lei Chen, Pengyang Shao, Richang Hong, Xiting Wang, and Meng Wang. 2021. Learning fair representations for recommendation: A graph-based perspective. In Proceedings of the Web Conference 2021. 2198–2208.

  41. [41]

     Likang Wu, Zhi Zheng, Zhaopeng Qiu, Hao Wang, Hongchao Gu, Tingjia Shen, Chuan Qin, Chen Zhu, Hengshu Zhu, Qi Liu, et al. 2024. A survey on large language models for recommendation. World Wide Web 27, 5 (2024), 60.

  42. [42]

     Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, and Hannaneh Hajishirzi. 2023. Fine-Grained Human Feedback Gives Better Rewards for Language Model Training. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023…

  43. [43–44]

     Guofu Xie, Xiao Zhang, Ting Yao, and Yunsheng Shi. 2025. Bone Soups: A Seek-and-Soup Model Merging Approach for Controllable Multi-Objective Generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 – August 1, 2025. Association for Computational Linguistics, 27237–27263.

  45. [45]

     Hefei Xu, Le Wu, Chen Cheng, and Hao Liu. 2025. Multi-Value Alignment for LLMs via Value Decorrelation and Extrapolation. arXiv:2511.17579 [cs.LG]. https://arxiv.org/abs/2511.17579

  46. [46]

     Rui Yang, Xiaoman Pan, Feng Luo, Shuang Qiu, Han Zhong, Dong Yu, and Jianshu Chen. 2024. Rewards-in-Context: Multi-objective alignment of foundation models with dynamic preference adjustment. arXiv preprint arXiv:2402.10207 (2024).

  47. [47]

     Yonghui Yang, Le Wu, Zihan Wang, Zhuangzhuang He, Richang Hong, and Meng Wang. 2024. Graph bottlenecked social recommendation. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3853–3862.

  48. [48]

     Jing Yao, Xiaoyuan Yi, Yifan Gong, Xiting Wang, and Xing Xie. 2024. Value FULCRA: Mapping large language models to the multidimensional spectrum of basic human value. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 8762–8785.

  49. [49]

     Jing Yao, Xiaoyuan Yi, Xiting Wang, Jindong Wang, and Xing Xie. 2023. From Instructions to Intrinsic Human Values – A Survey of Alignment Goals for Big Models. arXiv preprint arXiv:2308.12014 (2023).

  50. [50–51]

     Naixin Zhai, Pengyang Shao, Binbin Zheng, Fei Shen, Long Bai, and Xun Yang. 2026. Maximizing Local Entropy Where It Matters: Prefix-Aware Localized LLM Unlearning. arXiv preprint arXiv:2601.03190 (2026).

  52. [52]

     Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, and Yu Qiao. 2024. Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024. Association for Computational Linguistics, 105…
    Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, and Yu Qiao. 2024. Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization. InFindings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024. Association for Computational Linguistics, 105...