Recognition: no theorem link
xi-DPO: Direct Preference Optimization via Ratio Reward Margin
Pith reviewed 2026-05-13 00:49 UTC · model grok-4.3
The pith
Reformulating the reward as a ratio of chosen to rejected responses cancels the scaling parameter β and yields a bounded margin ξ that can be preset directly from the initial reward-gap distribution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The preference objective is first put through an equivalent transformation that shifts the target from maximizing the likelihood of reward gaps to minimizing the distance between reward gaps and optimal margins. The reward is then redefined as a ratio between the chosen and rejected responses, which makes the optimization problem independent of β. The result is a bounded, interpretable ratio reward margin ξ that explicitly represents the desired relative separation between chosen and rejected responses and can be determined from the initial reward-gap distribution without repeated trial-and-error tuning.
What carries the argument
The ratio reward margin ξ, obtained by expressing the reward as the ratio of chosen to rejected response scores; the ratio cancels the scaling effect of β and bounds the margin so it reads directly as a relative separation.
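A minimal sketch of the cancellation, assuming SimPO's length-normalized log-probability reward and reading the ratio as the quotient of the β-scaled chosen and rejected rewards; the paper's exact definitions may differ.

$$r_\theta(x, y) = \frac{1}{|y|}\log \pi_\theta(y \mid x), \qquad \text{SimPO margin: } \beta\, r_\theta(x, y_w) - \beta\, r_\theta(x, y_l) - \gamma$$

$$\text{ratio reward: } \frac{\beta\, r_\theta(x, y_w)}{\beta\, r_\theta(x, y_l)} = \frac{r_\theta(x, y_w)}{r_\theta(x, y_l)}$$

Any scale factor multiplying both rewards divides out, so the comparison against ξ no longer involves β, and the desired separation is expressed relative to the rejected reward rather than as an absolute gap that γ must match.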
If this is right
- Hyperparameter selection reduces to choosing a single ξ value from the starting reward-gap statistics instead of searching over pairs of β and γ (see the sketch after this list).
- The bounded nature of ξ keeps the optimization target stable across training steps even as the policy changes.
- The method applies directly to datasets with varying reward-gap structures without dataset-specific retuning.
- Performance remains comparable to SimPO while eliminating the implicit sample-filtering effect controlled by β.
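A hedged sketch of the first bullet: presetting ξ once from the initial reward-gap statistics of the starting policy. The length-normalized log-probability reward follows SimPO; the quantile rule, function names, and tokenization details below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch (not the paper's code): preset xi from the initial
# chosen/rejected reward ratios measured under the starting policy.
import torch

@torch.no_grad()
def avg_logprob(model, tokenizer, prompt, response):
    """Length-normalized log-probability of `response` given `prompt` (SimPO-style reward)."""
    full = tokenizer(prompt + response, return_tensors="pt").input_ids
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    logits = model(full).logits[:, :-1, :]                # next-token logits for each prefix
    logprobs = torch.log_softmax(logits, dim=-1)
    token_lp = logprobs.gather(-1, full[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].mean().item()     # average over response tokens only

def preset_xi(model, tokenizer, preference_pairs, quantile=0.5):
    """Choose xi as a quantile of the initial chosen-to-rejected reward ratios."""
    ratios = []
    for prompt, chosen, rejected in preference_pairs:
        r_w = avg_logprob(model, tokenizer, prompt, chosen)
        r_l = avg_logprob(model, tokenizer, prompt, rejected)
        ratios.append(r_w / r_l)                           # a shared beta scale would cancel here
    return torch.tensor(ratios).quantile(quantile).item()
```

One pass over the preference pairs before training then fixes ξ for the whole run, which is the sense in which the tuning search over β and γ collapses to a single statistic.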
Where Pith is reading between the lines
- The ratio reformulation could be applied to other direct preference methods that retain explicit scaling parameters, simplifying their tuning.
- If the initial reward gap distribution reliably predicts the final policy behavior, this would enable fully automated hyperparameter selection for preference tuning pipelines.
- Extending the ratio construction to settings with multiple rejected responses per chosen one might further reduce the need for margin adjustments in complex alignment tasks.
Load-bearing premise
That rewriting the reward as a ratio of chosen to rejected scores produces an optimization problem exactly equivalent to the original, whose margin ξ stays independent of β and can be preset from the initial gaps without degrading final performance.
What would settle it
Run ξ-DPO with ξ preset from the initial reward-gap distribution on a held-out preference dataset and compare final win rates or reward scores against a fully tuned SimPO baseline on the same data; a statistically significant gap would falsify the equivalence claim.
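A hedged sketch of how that comparison could be scored, assuming per-prompt judge verdicts on a shared held-out set and a two-sided sign test; the judging setup and significance level are assumptions, not details from the paper.

```python
# Hypothetical evaluation sketch: verdicts comparing xi-DPO (preset xi) against
# a fully tuned SimPO baseline, tested for a statistically significant gap.
from scipy.stats import binomtest

def win_rate_gap(verdicts, alpha=0.05):
    """verdicts: iterable of 'xi_dpo', 'simpo', or 'tie', one per held-out prompt."""
    xi_wins = sum(v == "xi_dpo" for v in verdicts)
    simpo_wins = sum(v == "simpo" for v in verdicts)
    decisive = xi_wins + simpo_wins
    test = binomtest(xi_wins, decisive, p=0.5, alternative="two-sided")
    return {
        "xi_dpo_win_rate": xi_wins / decisive,
        "p_value": test.pvalue,
        "significant_gap": test.pvalue < alpha,  # True would count against exact equivalence
    }
```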
Original abstract
Reference-free preference optimization has emerged as an efficient alternative to reinforcement learning from human feedback, with Simple Preference Optimization (SimPO) demonstrating strong performance by eliminating the explicit reference model through a simple objective. However, the joint tuning of the hyperparameters $\beta$ and $\gamma$ in SimPO remains a central challenge. We argue that this difficulty arises because the margin formulation in SimPO is not easily interpretable across datasets with different reward gap structures. To better understand this issue, we conduct a comprehensive analysis of SimPO and find that $\beta$ implicitly controls sample filtering, while the effect of $\gamma$ depends on the reward gap structure of the dataset. Motivated by these observations, we propose $\xi$-DPO: Direct preference optimization via ratio reward margin. We first reformulate the preference objective through an equivalent transformation, changing the optimization target from maximizing the likelihood of reward gaps to minimizing the distance between reward gaps and optimal margins. Then, we redefine the reward in a ratio form between the chosen and rejected, which effectively cancels the effect of $\beta$ and yields a bounded and interpretable margin. This margin is called the ratio reward margin and is denoted by $\xi$. Unlike the margin $\gamma$ in SimPO, $\xi$ explicitly represents the desired relative separation between chosen and rejected responses and can be determined from the initial reward gap distribution, avoiding repeated trial-and-error tuning. ....
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ξ-DPO, a reformulation of SimPO for reference-free direct preference optimization. It first analyzes how β implicitly controls sample filtering in SimPO while γ's effect depends on dataset reward-gap structure. The core contribution is an equivalent transformation of the objective from maximizing reward-gap likelihood to minimizing distance to optimal margins, followed by re-expressing the reward as the ratio of chosen to rejected log-probabilities; this is claimed to cancel β exactly, producing a bounded, interpretable margin ξ that can be preset once from the initial reward-gap distribution and thereby eliminates joint tuning of β and γ.
Significance. If the claimed equivalence is exact and the fixed-ξ schedule preserves the original optimum without performance loss, the work would meaningfully reduce hyperparameter sensitivity in a popular class of preference optimization methods, improving reproducibility across datasets with varying reward-gap statistics. The SimPO analysis itself supplies a useful diagnostic lens on existing margin-based objectives.
major comments (3)
- [Method section (reformulation paragraph)] The abstract and method section assert an 'equivalent transformation' that cancels β after the ratio redefinition, yet supply neither the intermediate algebraic steps nor a direct comparison of gradients between the original SimPO loss and the new distance-to-margin objective. Without these, it is impossible to confirm that the two problems share the same stationary points once the policy begins to update.
- [Definition of ξ and experimental section] ξ is defined from the initial reward-gap distribution of the very policy being optimized. Because the policy (and therefore the gap distribution) evolves during training, a fixed ξ preset at initialization risks becoming mismatched; the manuscript provides no theoretical argument or ablation showing that this mismatch does not alter the final optimum or degrade performance relative to a re-tuned γ.
- [Experiments] The central claim that ξ is 'tuning-free' and independent of β rests on the ratio reformulation. Experiments should therefore include a direct verification that the new objective produces identical or superior win rates to SimPO when both are given their respective best hyperparameters, plus a sensitivity plot of final performance versus the choice of initial-gap quantile used to set ξ.
minor comments (2)
- [Notation subsection] Notation for the ratio reward (chosen/rejected) versus the original log-probability reward should be introduced with an explicit side-by-side equation to avoid reader confusion.
- [SimPO analysis] The sentence claiming that 'β implicitly controls sample filtering' would be clearer if accompanied by the precise filtering threshold expression derived from the loss (a sketch of this kind of expression follows the list).
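For orientation, a hedged sketch of the kind of threshold expression the referee requests, assuming the standard SimPO loss; this derivation is not given in the manuscript.

$$\mathcal{L}_{\text{SimPO}} = -\log \sigma\big(\beta\,\Delta r - \gamma\big), \qquad \Delta r = \frac{1}{|y_w|}\log \pi_\theta(y_w \mid x) - \frac{1}{|y_l|}\log \pi_\theta(y_l \mid x)$$

$$\nabla_\theta \mathcal{L}_{\text{SimPO}} = -\,\beta\,\sigma\big(\gamma - \beta\,\Delta r\big)\,\nabla_\theta \Delta r$$

Pairs whose gap $\Delta r$ already exceeds roughly $\gamma/\beta$ receive an exponentially small gradient weight, so β sets the sharpness of a soft filter centered near $\Delta r \approx \gamma/\beta$; this is one concrete reading of 'β implicitly controls sample filtering'.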
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which highlight important aspects of clarity and validation. We respond point-by-point to the major comments below, committing to revisions that address the concerns while preserving the core contributions of the work.
Point-by-point responses
- Referee: [Method section (reformulation paragraph)] The abstract and method section assert an 'equivalent transformation' that cancels β after the ratio redefinition, yet supply neither the intermediate algebraic steps nor a direct comparison of gradients between the original SimPO loss and the new distance-to-margin objective. Without these, it is impossible to confirm that the two problems share the same stationary points once the policy begins to update.
Authors: We agree that the reformulation was presented concisely without full intermediate steps. In the revised manuscript, we will expand the method section to include the complete algebraic derivation: starting from the SimPO objective, applying the equivalent transformation to a distance-to-optimal-margin form, and then substituting the ratio-based reward definition to show exact cancellation of β. We will also derive the gradients of both the original SimPO loss and the new objective, demonstrating that they share identical stationary points under the ratio redefinition. This will make the equivalence verifiable. revision: yes
- Referee: [Definition of ξ and experimental section] ξ is defined from the initial reward-gap distribution of the very policy being optimized. Because the policy (and therefore the gap distribution) evolves during training, a fixed ξ preset at initialization risks becoming mismatched; the manuscript provides no theoretical argument or ablation showing that this mismatch does not alter the final optimum or degrade performance relative to a re-tuned γ.
Authors: We acknowledge the potential for distribution shift during training. The ratio reformulation yields a bounded, interpretable ξ that is independent of β, but the manuscript indeed lacks an explicit stability argument or ablation for the fixed initial ξ. In the revision, we will add a theoretical discussion explaining why the ratio margin remains effective despite evolution (based on its relative separation property), together with an ablation comparing fixed initial ξ against adaptive re-tuning of ξ or γ. This will quantify any performance impact. revision: partial
- Referee: [Experiments] The central claim that ξ is 'tuning-free' and independent of β rests on the ratio reformulation. Experiments should therefore include a direct verification that the new objective produces identical or superior win rates to SimPO when both are given their respective best hyperparameters, plus a sensitivity plot of final performance versus the choice of initial-gap quantile used to set ξ.
Authors: We agree that direct empirical verification would strengthen the tuning-free claim. While existing results show competitive performance, the revised experimental section will add: (i) a head-to-head comparison of win rates for ξ-DPO (using ξ preset from the initial distribution) versus SimPO with its optimally tuned β and γ on the same datasets, and (ii) a sensitivity plot of final win rates versus different initial-gap quantiles for setting ξ. These will confirm robustness and independence from β. revision: yes
Circularity Check
No significant circularity; reformulation presented as independent equivalent transformation.
full rationale
The paper's core derivation consists of two steps: (1) an explicit claim of equivalent transformation that shifts the objective from likelihood maximization on reward gaps to distance minimization to optimal margins, and (2) a ratio redefinition of reward that algebraically cancels β to produce a new bounded margin ξ. ξ is then preset once from the initial reward-gap distribution of the starting model. This preset is a conventional hyperparameter choice, not a fitted parameter that is later renamed as a prediction, nor a self-referential definition in which the final result is forced by the inputs. No equations are shown reducing the new objective to the original by construction in a tautological way, and no load-bearing self-citation or ansatz smuggling is present. The derivation therefore remains self-contained relative to the SimPO baseline.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The preference objective admits an equivalent transformation from maximizing likelihood of reward gaps to minimizing distance to optimal margins.
- domain assumption: Expressing the reward as the ratio of chosen to rejected scores cancels the effect of the β scaling term.
invented entities (1)
- ratio reward margin ξ (no independent evidence)