One Model for All: Multi-Objective Controllable Language Models
Pith reviewed 2026-05-10 20:30 UTC · model grok-4.3
The pith
A single LLM trained with multi-objective optimization can generate outputs for any point on the user preference Pareto front.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By introducing multi-objective optimization directly into RLHF and conditioning the policy network on a preference vector, a single LLM can be trained to output responses lying in any region of the Pareto front defined by the trade-offs among multiple reward objectives, achieving controllability over user preferences, improved hyper-volume of solutions, and generalization to unseen preferences.
What carries the argument
Multi-Objective Control (MOC), which applies multi-objective optimization at the policy level to create a preference-conditioned LLM policy network.
If this is right
- One 7B model fine-tuned on a single GPU can handle multiple simultaneous objectives instead of requiring separate models.
- Users can control output style by specifying preference weights at inference time without any retraining.
- The model produces a wider set of high-quality solutions measured by hyper-volume across the objective space.
- Performance holds for preference combinations not encountered in training data.
- Scalable personalization becomes possible without collecting large amounts of user-specific data.
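Hyper-volume, mentioned in the list above, can be made concrete with a small two-objective sketch. The function name `hypervolume_2d`, the choice of maximization, and the reference point are assumptions made for illustration; the paper's own evaluation setup may differ.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def hypervolume_2d(points: List[Point], ref: Point) -> float:
    """Area dominated by `points` (both objectives maximized), bounded below by `ref`.

    Illustrative sketch only: assumes exactly two objectives.
    """
    # Keep only points that improve on the reference point in both objectives.
    kept = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest first objective to the smallest.
    kept.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref[1]
    for x, y in kept:
        if y > best_y:
            # Add the new horizontal strip this point contributes.
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
collapsed = [(2.0, 2.0)]
print(hypervolume_2d(front, (0.0, 0.0)), hypervolume_2d(collapsed, (0.0, 0.0)))
```

A spread-out front scores higher than a single balanced point, which is why the metric rewards coverage of the trade-off curve rather than one high-average solution.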
Where Pith is reading between the lines
- The same conditioning trick could be tried with more than three objectives or with objectives that change mid-conversation.
- It may reduce the need for post-training alignment techniques that currently target only average preferences.
- Testing on larger models or different base architectures would show whether the policy-level approach continues to scale.
- Combining MOC with retrieval or tool-use methods could extend controllable generation to more complex tasks.
Load-bearing premise
That applying multi-objective optimization at the policy level during RLHF will let one model reliably cover diverse preference regions and generalize without losing output quality.
What would settle it
After training, test whether the model produces responses with the expected reward trade-offs for completely new preference vectors that were never shown during training, and check if quality drops compared to single-objective baselines.
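One way to operationalize this check is sketched below. It assumes that, for each held-out preference vector, the weight placed on one objective and the mean reward achieved on that objective have already been measured. The helper name `order_agreement` and the pairwise statistic are hypothetical choices, not the paper's metric.

```python
from itertools import combinations
from typing import Sequence


def order_agreement(weights: Sequence[float], rewards: Sequence[float]) -> float:
    """Fraction of pairs where a larger preference weight gives a strictly larger reward.

    Hypothetical diagnostic: 1.0 means rewards follow the requested ordering exactly.
    """
    if len(weights) != len(rewards):
        raise ValueError("weights and rewards must have the same length")
    # Only pairs with different weights say anything about controllability.
    pairs = [
        (i, j)
        for i, j in combinations(range(len(weights)), 2)
        if weights[i] != weights[j]
    ]
    if not pairs:
        raise ValueError("need at least two distinct preference weights")
    agree = sum(
        1
        for i, j in pairs
        if (weights[i] - weights[j]) * (rewards[i] - rewards[j]) > 0
    )
    return agree / len(pairs)


# Held-out weights on objective 1 and (made-up) measured rewards on objective 1.
print(order_agreement([0.1, 0.5, 0.9], [1.0, 2.0, 3.0]))
```

A score near 1.0 on vectors never seen in training would support the generalization claim; the quality comparison against single-objective baselines still has to be run separately.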
Original abstract
Aligning large language models (LLMs) with human preferences is critical for enhancing LLMs' safety, helpfulness, humor, faithfulness, etc. Current reinforcement learning from human feedback (RLHF) mainly focuses on a fixed reward learned from average human ratings, which may weaken the adaptability and controllability of varying preferences. However, creating personalized LLMs requires aligning LLMs with individual human preferences, which is non-trivial due to the scarce data per user and the diversity of user preferences in multi-objective trade-offs, varying from emphasizing empathy in certain contexts to demanding efficiency and precision in others. Can we train one LLM to produce personalized outputs across different user preferences on the Pareto front? In this paper, we introduce Multi-Objective Control (MOC), which trains a single LLM to directly generate responses in the preference-defined regions of the Pareto front. Our approach introduces multi-objective optimization (MOO) principles into RLHF to train an LLM as a preference-conditioned policy network. We improve the computational efficiency of MOC by applying MOO at the policy level, enabling us to fine-tune a 7B-parameter model on a single A6000 GPU. Extensive experiments demonstrate the advantages of MOC over baselines in three aspects: (i) controllability of LLM outputs w.r.t. user preferences on the trade-off among multiple rewards; (ii) quality and diversity of LLM outputs, measured by the hyper-volume of multiple solutions achieved; and (iii) generalization to unseen preferences. These results highlight MOC's potential for real-world applications requiring scalable and customizable LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Multi-Objective Control (MOC), a training procedure that incorporates multi-objective optimization principles into RLHF to produce a single preference-conditioned LLM policy capable of generating outputs in arbitrary regions of the Pareto front defined by multiple human preference objectives. The method is claimed to improve computational efficiency by operating at the policy level, allowing fine-tuning of a 7B model on one GPU, and is evaluated on controllability with respect to preference trade-offs, hypervolume-based quality/diversity, and generalization to unseen preferences.
Significance. If the central construction is sound, the result would be significant for scalable personalization of LLMs, as it offers a single-model alternative to per-user fine-tuning or ensembles while maintaining output quality across diverse objectives such as empathy versus efficiency. The reported efficiency gain for 7B-scale training is a practical strength.
major comments (2)
- [Abstract and method description] The central claim that a preference-conditioned policy covers the Pareto front without mode collapse or loss of controllability rests on the unstated assumption that the conditioning mechanism (preference vector or embedding) produces smooth interpolation and well-behaved policy gradients when the scalarized reward changes at inference time; no derivation or stability analysis is provided to support this.
- [Experiments] The reported gains in hypervolume and generalization to unseen preferences are load-bearing for the contribution, yet the abstract supplies no details on the preference embedding, joint optimization procedure, or baseline definitions, making it impossible to verify whether the policy avoids collapsing to high-average modes.
minor comments (1)
- The abstract would benefit from a concise statement of the exact conditioning input (e.g., concatenated weights, learned embedding) and the scalarization method used during training.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our paper. We address each major comment below, providing clarifications from the full manuscript and indicating planned revisions where appropriate.
- Referee: [Abstract and method description] The central claim that a preference-conditioned policy covers the Pareto front without mode collapse or loss of controllability rests on the unstated assumption that the conditioning mechanism (preference vector or embedding) produces smooth interpolation and well-behaved policy gradients when the scalarized reward changes at inference time; no derivation or stability analysis is provided to support this.
  Authors: We appreciate this observation on the theoretical foundations. Section 3 of the manuscript details the conditioning mechanism: the preference vector is mapped via a linear embedding layer and incorporated into the input or hidden states of the LLM policy. Training employs a multi-objective RL objective (PPO with scalarized rewards weighted by the preference vector), enabling the policy to interpolate across the front at inference by varying the vector. While no formal derivation of gradient stability is provided, the empirical results in Section 4 (controllability curves, hypervolume plots, and generalization tests) demonstrate smooth trade-offs without observed mode collapse. We will add a brief discussion paragraph in the method section on the empirical support for these assumptions and the practical stability observed.
  revision: partial
- Referee: [Experiments] The reported gains in hypervolume and generalization to unseen preferences are load-bearing for the contribution, yet the abstract supplies no details on the preference embedding, joint optimization procedure, or baseline definitions, making it impossible to verify whether the policy avoids collapsing to high-average modes.
  Authors: We agree the abstract's brevity omits these specifics. The full manuscript clarifies them in Sections 3 and 4: the preference embedding is a trainable linear projection of the vector into the model's embedding space; joint optimization samples preference vectors during training and scalarizes rewards accordingly within the RLHF loop; baselines include standard single-objective RLHF (averaged rewards) and a non-conditioned multi-task variant. Hypervolume is used precisely to quantify coverage of the Pareto front and guard against collapse to high-average modes, with additional plots showing per-preference performance. We will revise the abstract to briefly reference the preference-conditioned policy and hypervolume evaluation for improved transparency.
  revision: partial
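The training loop described in the simulated rebuttal (sample a preference vector, then scalarize the per-objective rewards with it) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the uniform-simplex sampling scheme, and the linear scalarization are all hypothetical, and the paper's actual procedure may differ.

```python
import random
from typing import List, Sequence


def sample_preference(num_objectives: int, rng: random.Random) -> List[float]:
    """Draw a preference vector uniformly from the probability simplex.

    Normalizing i.i.d. exponential draws gives a flat Dirichlet sample (assumed scheme).
    """
    draws = [rng.expovariate(1.0) for _ in range(num_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(rewards: Sequence[float], preference: Sequence[float]) -> float:
    """Preference-weighted sum of per-objective rewards."""
    if len(rewards) != len(preference):
        raise ValueError("rewards and preference must have the same length")
    return sum(w * r for w, r in zip(preference, rewards))


rng = random.Random(0)
preference = sample_preference(3, rng)   # e.g. weights for three reward models
rewards = [0.8, 0.1, 0.5]                # made-up reward-model scores for one response
print(preference, scalarize(rewards, preference))
```

In a full pipeline the sampled vector would also be fed to the policy as its conditioning input, so that the same vector determines both what the model is asked for and how the response is scored.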
Circularity Check
No significant circularity in the claimed derivation.
The paper introduces MOC as a methodological extension of standard RLHF by applying multi-objective optimization at the policy level to produce a single preference-conditioned LLM. No equations, self-definitions, or fitted inputs are shown that reduce the controllability or generalization claims to tautologies by construction. The approach is presented as building on external MOO and RLHF principles with empirical validation via experiments on hypervolume and unseen preferences, rather than relying on load-bearing self-citations or renamed ansatzes from prior author work. This is a normal non-circular outcome for a methods paper.
Reference graph
Works this paper leans on
-
[1]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609,
-
[2]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...
-
[3]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
doi: 10.48550/ARXIV.2204.05862. URL https://doi.org/10.48550/arXiv.2204.05862. Stephen P Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press,
-
[4]
URL https://inria.hal.science/inria-00389811. In this report, the problem of minimizing simultaneously n smooth and unconstrained criteria is considered. A descent direction common to all the criteria is identified, knowing all the gradients. An algorithm is defined in which the optimization process is carried out in two phases: one that is cooperative y...
-
[5]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
doi: 10.18653/V1/N19-1423. URL https://doi.org/10.18653/v1/n19-1423. Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. https://github.com/openai/baselines,
-
[6]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,
-
[7]
Florian Felten, Lucas N. Alegre, Ann Nowé, Ana L. C. Bazzan, El Ghazali Talbi, Grégoire Danoy, and Bruno C. da Silva. A toolkit for reliable benchmarking and research in multi-objective reinforcement learning. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023),
-
[8]
URL https://api.semanticscholar.org/CorpusID:277244364. Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Zexu Sun, Bowen Sun, Huimin Chen, Ruobing Xie, Jie Zhou, Yankai Lin, Zhiyuan Liu, and Maosong Sun. Controllable preference optimization: Toward controllable multi-objective alignment. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of...
-
[9]
URL https://doi.org/10.18653/v1/2024.emnlp-main.85
doi: 10.18653/V1/2024.EMNLP-MAIN.85. URL https://doi.org/10.18653/v1/2024.emnlp-main.85. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29,
-
[10]
URL https://openreview.net/forum?id=nZeVKeeFYf9. Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Workshop and Conference Proceedings, pp. 427–435. JMLR.org,
-
[11]
http://www.jstor.org/stable/2332226
ISSN 00063444. URL http://www.jstor.org/stable/2332226. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster),
-
[12]
Adam: A Method for Stochastic Optimization
URL http://arxiv.org/abs/1412.6980. Shuyue Stella Li, Avinandan Bose, Faeze Brahman, Simon Shaolei Du, Pang Wei Koh, Maryam Fazel, and Yulia Tsvetkov. Personalized reasoning: Just-in-time personalization and why llms fail at it. ArXiv, abs/2510.00177,
-
[13]
Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu
URL https://api.semanticscholar.org/CorpusID:281705946. Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Confe...
-
[14]
Bo Liu, Yihao Feng, Peter Stone, and Qiang Liu
URL https://proceedings.neurips.cc/paper/2021/hash/9d27fdf2477ffbff837d73ef7ae23db9-Abstract.html. Bo Liu, Yihao Feng, Peter Stone, and Qiang Liu. FAMO: fast adaptive multitask optimization. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Confer...
-
[15]
Pingchuan Ma, Tao Du, and Wojciech Matusik
URL http://papers.nips.cc/paper_files/paper/2023/hash/b2fe1ee8d936ac08dd26f2ff58986c8f-Abstract-Conference.html. Pingchuan Ma, Tao Du, and Wojciech Matusik. Efficient continuous pareto exploration in multi-task learning. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedi...
-
[16]
Debabrata Mahapatra and Vaibhav Rajan
URL http://proceedings.mlr.press/v119/ma20a.html. Debabrata Mahapatra and Vaibhav Rajan. Exact pareto optimal search for multi-task learning: Touring the pareto front. ArXiv, abs/2108.00597,
-
[17]
Dang Nguyen, Jiuhai Chen, and Tianyi Zhou
URL https://api.semanticscholar.org/CorpusID:236772107. Dang Nguyen, Jiuhai Chen, and Tianyi Zhou. Multi-objective linguistic control of large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 4336–4347, Bangkok, Thailand, August
-
[18]
Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.257. URL https://aclanthology.org/2024.findings-acl.257/. OpenAI. GPT-4 technical report. CoRR, abs/2303.08774,
-
[19]
doi: 10.48550/ARXIV.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike...
-
[20]
Alec Radford and Karthik Narasimhan
URL http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html. Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training
-
[21]
URL https://api.semanticscholar.org/CorpusID:49313245. Alexandre Ramé, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, ...
-
[23]
Proximal Policy Optimization Algorithms
URL http://arxiv.org/abs/1707.06347. Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, D...
-
[24]
Ruizhe Shi, Yifang Chen, Yushi Hu, Alisa Liu, Hannaneh Hajishirzi, Noah A
URL https://proceedings.neurips.cc/paper/2018/hash/432aca3a1e345e339f35a30c8f65edce-Abstract.html. Ruizhe Shi, Yifang Chen, Yushi Hu, Alisa Liu, Hannaneh Hajishirzi, Noah A. Smith, and Simon S. Du. Decoding-time language model alignment with multiple objectives. CoRR, abs/2406.18853,
-
[25]
Ruizhe Shi, Yifang Chen, Yushi Hu, Alisa Liu, Hannaneh Hajishirzi, Noah A
doi: 10.48550/ARXIV.2406.18853. URL https://doi.org/10.48550/arXiv.2406.18853. Seongho Son, William Bankes, Sangwoong Yoon, Shyam Sundhar Ramesh, Xiaohang Tang, and Ilija Bogunovic. Robust multi-objective controlled decoding of large language models. CoRR, abs/2503.08796,
-
[26]
Robust multi-objective controlled decoding of large language models
doi: 10.48550/ARXIV.2503.08796. URL https://doi.org/10.48550/arXiv.2503.08796. Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. Learning to summarize with human feedback. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (e...
-
[27]
Llama 2: Open Foundation and Fine-Tuned Chat Models
URL https://proceedings.neurips.cc/paper/2020/hash/1f89885d556929e98d3ef9b86448f951-Abstract.html. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,
-
[28]
Gomez, Lukasz Kaiser, and Illia Polosukhin
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on...
-
[29]
cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html. Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, and Shengyi Huang. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl,
-
[30]
Conditional language policy: A general framework for steerable multi-objective finetuning
Kaiwen Wang, Rahul Kidambi, Ryan Sullivan, Alekh Agarwal, Christoph Dann, Andrea Michi, Marco Gelmi, Yunxuan Li, Raghav Gupta, Kumar Avinava Dubey, Alexandre Rame, Johan Ferret, Geoffrey Cideron, Le Hou, Hongkun Yu, Amr Ahmed, Aranyak Mehta, Leonard Hussenot, Olivier Bachem, and Edouard Leurent. Conditional language policy: A general framework for steer...
-
[31]
doi: 10.18653/v1/2024.findings-emnlp.118
Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.118. URL https://aclanthology.org/2024.findings-emnlp.118/. Peiyao Xiao, Hao Ban, and Kaiyi Ji. Direction-oriented multi-objective learning: Simple and provable stochastic algorithms. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine ...
-
[32]
Jie Xu, Yunsheng Tian, Pingchuan Ma, Daniela Rus, Shinjiro Sueda, and Wojciech Matusik
URL http://papers.nips.cc/paper_files/paper/2023/hash/0e5b96f97c1813bb75f6c28532c2ecc7-Abstract-Conference.html. Jie Xu, Yunsheng Tian, Pingchuan Ma, Daniela Rus, Shinjiro Sueda, and Wojciech Matusik. Prediction-guided multi-objective reinforcement learning for continuous robot control. In Proceedings of the 37th International Conference on Machine Learnin...
-
[33]
URL http://proceedings.mlr.press/v119/xu20h.html. Kailai Yang, Zhiwei Liu, Qianqian Xie, Jimin Huang, Tianlin Zhang, and Sophia Ananiadou. Metaaligner: Towards generalizable multi-objective alignment of language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024a. Qwen An Yang, Baosong Yang, Beichen Zhang, Binyuan Hu...
-
[34]
URLhttps://openreview.net/forum?id=QLcBzRI3V3
OpenReview.net, 2024c. URL https://openreview.net/forum?id=QLcBzRI3V3. Runzhe Yang, Xingyuan Sun, and Karthik Narasimhan. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Infor...
-
[35]
Yijun Yang, Jing Jiang, Tianyi Zhou, Jie Ma, and Yuhui Shi
URL https://proceedings.neurips.cc/paper/2019/hash/4a46fbfca3f1465a27b210f4bdfe6ab3-Abstract.html. Yijun Yang, Jing Jiang, Tianyi Zhou, Jie Ma, and Yuhui Shi. Pareto policy pool for model-based offline reinforcement learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29,
-
[36]
Xiaoyuan Zhang, Xi Lin, and Qingfu Zhang
URL https://openreview.net/forum?id=OqcZu8JIIzS. Xiaoyuan Zhang, Xi Lin, and Qingfu Zhang. PMGDA: A preference-based multiple gradient descent algorithm. CoRR, abs/2402.09492,
-
[37]
Xiaoyuan Zhang, Xi Lin, and Qingfu Zhang
doi: 10.48550/ARXIV.2402.09492. URL https://doi.org/10.48550/arXiv.2402.09492. Yu Zhang, Wanli Jiang, and Zhengyu Yang. Moslim: align with diverse preferences in prompts through reward classification. CoRR, abs/2505.20336,
-
[38]
URL https://doi.org/10.48550/arXiv.2505.20336
doi: 10.48550/ARXIV.2505.20336. URL https://doi.org/10.48550/arXiv.2505.20336. Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, and Yu Qiao. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization. In Findings of the Association for Computational Linguistics ACL 2024, pp. 10586–10613,
-
[39]
Appendix Table of Contents
- A Proof of Theorem 1
- B Pareto Optimality and MOC’s Advantages
- C Approximated Normalized Vector Similarity
- D Pseudocode
- E Why RL Loss Functions Are Unsuitable for Preference Control
- F Further Discussion of Related Work
- G Details of the Illustrative Example
- H Details of Language Model Experiments
- I Kenda...
-
[40]
We recommend that the reader check Schulman et al. (2017) and von Werra et al. (2020) for more training details of PPO in the language model setting. The min-norm algorithm used in MOC is shown in Algorithm 2, based on Sener & Koltun (2018). Algorithm 2 gives $c^{(1)}$ and $c^{(2)} = 1 - c^{(1)}$. Algorithm 1: Multi Objective Control Algorithm (MOC) for Language Models Re...
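For two objectives, the min-norm problem in the style of Sener & Koltun (2018) has a closed-form solution, sketched below. This is an illustration under stated assumptions, not a transcription of the paper's Algorithm 2: the function name `min_norm_two`, the plain-list vectors, and the tie-breaking for identical inputs are all hypothetical, and the excerpt does not show which quantities MOC feeds into the solver.

```python
from typing import Sequence, Tuple


def min_norm_two(g1: Sequence[float], g2: Sequence[float]) -> Tuple[float, float]:
    """Return (c1, c2) with c2 = 1 - c1 minimizing ||c1*g1 + c2*g2||^2 for c1 in [0, 1]."""
    if len(g1) != len(g2):
        raise ValueError("g1 and g2 must have the same length")
    diff_sq = sum((a - b) ** 2 for a, b in zip(g1, g2))
    if diff_sq == 0.0:
        # Identical vectors: every convex combination has the same norm.
        return 0.5, 0.5
    # Setting the derivative of ||g2 + c1*(g1 - g2)||^2 to zero gives
    # c1 = (g2 - g1) . g2 / ||g1 - g2||^2, then clip to the feasible interval.
    c1 = sum((b - a) * b for a, b in zip(g1, g2)) / diff_sq
    c1 = min(1.0, max(0.0, c1))
    return c1, 1.0 - c1


print(min_norm_two([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors of equal length
```

When the two vectors conflict, the weights balance them; when one already dominates in length along the same direction, all weight goes to the shorter one.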
-
[41]
- KL regularization: 0.2
- Epochs: 1
- New value head N: two-layer feed-forward head
- Units of value head: decoder hidden size
- Activation of value head: ReLU
- ϕ in Equation (5): 0.1
- Learning rate: 1.41e-5
- Lambda for GAE: 0.95
- Gamma: 1
- Cliprange: 0.2
- Number of optimization epochs per batch: 4
- Target KL: 6
The hyper-volumes in Table 3 are computed with the existing package PyGMO. The r...
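To show what the GAE settings listed above control, here is a minimal sketch of generalized advantage estimation using gamma = 1 and lambda = 0.95 as defaults. The function name and the list-based interface are assumptions for illustration; PPO implementations such as TRL compute this on batched tensors.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Generalized advantage estimates for one trajectory.

    `values` must hold one more entry than `rewards` (the bootstrap value at the end).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Accumulate discounted TD errors from the last step backwards.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# A three-step trajectory with a single terminal reward and a zero value function.
print(gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]))
```

With gamma = 1 the only decay comes from lambda, so a terminal reward is credited to earlier tokens with weight 0.95 per step back.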