pith. machine review for the scientific record

arxiv: 2605.10981 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI

Recognition: no theorem link

ξ-DPO: Direct Preference Optimization via Ratio Reward Margin

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 00:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords preference optimization · direct preference optimization · reward margin · hyperparameter tuning · ratio reward · SimPO · language model alignment

The pith

Reformulating rewards as ratios of chosen to rejected responses cancels the scaling parameter β and produces a bounded margin ξ that can be preset directly from the initial reward gap distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that joint tuning of β and γ in SimPO is hard because the margin γ is not interpretable across datasets with different reward structures. By rewriting the objective as minimizing distance to optimal margins and redefining the reward as the ratio of chosen to rejected scores, β drops out of the formulation entirely. This creates a new margin ξ that directly encodes the desired relative separation between responses. A reader would care because the approach removes repeated trial-and-error searches for hyperparameters when moving to new datasets or tasks.
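
The cancellation is easiest to see symbolically. The following is a schematic reconstruction, not the paper's own derivation: it assumes SimPO's length-normalized implicit reward $r_\beta(x,y) = \frac{\beta}{|y|}\log\pi_\theta(y \mid x)$ and reads "ratio form" literally as the quotient of the chosen and rejected rewards:

$$\frac{r_\beta(x, y_w)}{r_\beta(x, y_l)} = \frac{(\beta/|y_w|)\,\log\pi_\theta(y_w \mid x)}{(\beta/|y_l|)\,\log\pi_\theta(y_l \mid x)} = \frac{|y_l|\,\log\pi_\theta(y_w \mid x)}{|y_w|\,\log\pi_\theta(y_l \mid x)},$$

so β divides out and the target separation ξ is imposed on the ratio directly. Since both average log-probabilities are negative, the quotient is positive and falls below one whenever the policy already assigns the chosen response the higher average log-probability, which is one way a bounded, interpretable range could arise; the paper's exact definition of ξ may differ.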

Core claim

By reformulating the preference objective through an equivalent transformation that shifts the target from maximizing the likelihood of reward gaps to minimizing the distance between reward gaps and optimal margins, and then redefining the reward in ratio form between chosen and rejected responses, the optimization problem becomes independent of β. This produces a bounded and interpretable ratio reward margin ξ that explicitly represents the desired relative separation between chosen and rejected responses and can be determined from the initial reward gap distribution without repeated trial-and-error tuning.

What carries the argument

The ratio reward margin ξ: expressing the reward as the ratio of chosen to rejected response scores cancels the scaling effect of β and bounds the margin, so it can be read directly as a relative separation.

If this is right

  • Hyperparameter selection reduces to choosing a single ξ value from the starting reward gap statistics instead of searching over pairs of β and γ (a minimal sketch follows this list).
  • The bounded nature of ξ makes the optimization target stable across training steps even as the policy changes.
  • The method applies directly to datasets with varying reward gap structures without needing dataset-specific retuning.
  • Performance remains comparable to SimPO while eliminating the implicit sample filtering effect controlled by β.
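
A minimal sketch of that first point, with heavy caveats: the length-normalized reward, the use of a quantile statistic, and the helper names (preset_xi, avg_reward) are illustrative assumptions, not the paper's recipe; the paper only states that ξ can be determined from the initial reward gap distribution.

import numpy as np

def avg_reward(sum_logp, length):
    # Length-normalized log-probability of a response (SimPO-style reward with beta omitted,
    # since beta cancels in the ratio). Both inputs come from scoring with the *initial* policy.
    return sum_logp / length

def preset_xi(chosen, rejected, quantile=0.5):
    # Preset the ratio reward margin xi from the starting model's reward gap statistics.
    # chosen / rejected: lists of (sum_logp, length) pairs for preferred / dispreferred responses.
    # Using the median of the initial chosen-to-rejected ratios is a hypothetical choice;
    # the paper does not say which statistic of the initial distribution it uses.
    r_w = np.array([avg_reward(lp, n) for lp, n in chosen])
    r_l = np.array([avg_reward(lp, n) for lp, n in rejected])
    ratios = r_w / r_l  # both averages are negative, so the ratio is positive
    return float(np.quantile(ratios, quantile))

# Usage: score the preference pairs once with the starting policy, fix xi, then train.
# xi = preset_xi(chosen_scores, rejected_scores, quantile=0.5)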

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The ratio reformulation could be applied to other direct preference methods that retain explicit scaling parameters to simplify their tuning.
  • If the initial reward gap distribution reliably predicts the final policy behavior, this would enable fully automated hyperparameter selection for preference tuning pipelines.
  • Extending the ratio construction to settings with multiple rejected responses per chosen one might further reduce the need for margin adjustments in complex alignment tasks.

Load-bearing premise

That rewriting the reward as a ratio of chosen to rejected scores produces an optimization problem exactly equivalent to the original, whose margin ξ stays independent of β and can be preset from initial gaps without degrading final performance.

What would settle it

Run ξ-DPO with ξ preset from the initial reward gap distribution on a held-out preference dataset and compare final win rates or reward scores against a fully tuned SimPO baseline on the same data; a statistically significant gap would falsify the claimed equivalence.
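
A minimal sketch of the decision rule that experiment implies, assuming per-prompt judge verdicts have already been collected; the judge, the dataset, and the use of a paired sign test with a normal approximation are choices made here for illustration, not the paper's protocol.

from math import sqrt, erfc

def win_rate_gap_test(verdicts):
    # verdicts: one entry per held-out prompt, +1 if the xi-DPO output is preferred,
    # -1 if the fully tuned SimPO output is preferred, 0 for a tie.
    wins = sum(1 for v in verdicts if v > 0)
    losses = sum(1 for v in verdicts if v < 0)
    n = wins + losses  # ties are dropped for the sign test
    if n == 0:
        return 0.0, 0.0, 1.0
    z = (wins - losses) / sqrt(n)            # normal approximation to the binomial
    p_two_sided = erfc(abs(z) / sqrt(2.0))   # small p plus a nontrivial gap counts against equivalence
    return wins / len(verdicts), losses / len(verdicts), p_two_sided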

Figures

Figures reproduced from arXiv: 2605.10981 by Qun Chen, Yuxuan Du, Zhengyuan Fan, Zhonghua Wu.

Figure 1. Comparison of reward curves. (a) shows the reward dynamics of AlphaDPO during train…
Figure 2. Distribution and sigmoid over Δr*. As β increases, the sigmoid becomes steeper, causing more samples in both the head region (left) and the tail region (right) to be filtered out during training. Previous studies [26, 19, 33] generally treat β as a scaling factor. We find that its effect on the sigmoid slope essentially shifts forward or delays the point at which the gradient approaches zero, thereby fil…
Figure 3. Density and cumulative distribution functions (CDF) of reward gaps (…
Figure 4. Reward curves of model training without LeakyReLU.
original abstract

Reference-free preference optimization has emerged as an efficient alternative to reinforcement learning from human feedback, with Simple Preference Optimization (SimPO) demonstrating strong performance by eliminating the explicit reference model through a simple objective. However, the joint tuning of the hyperparameters $\beta$ and $\gamma$ in SimPO remains a central challenge. We argue that this difficulty arises because the margin formulation in SimPO is not easily interpretable across datasets with different reward gap structures. To better understand this issue, we conduct a comprehensive analysis of SimPO and find that $\beta$ implicitly controls sample filtering, while the effect of $\gamma$ depends on the reward gap structure of the dataset. Motivated by these observations, we propose $\xi$-DPO: Direct preference optimization via ratio reward margin. We first reformulate the preference objective through an equivalent transformation, changing the optimization target from maximizing the likelihood of reward gaps to minimizing the distance between reward gaps and optimal margins. Then, we redefine the reward in a ratio form between the chosen and rejected, which effectively cancels the effect of $\beta$ and yields a bounded and interpretable margin. This margin is called the ratio reward margin and is denoted by $\xi$. Unlike the margin $\gamma$ in SimPO, $\xi$ explicitly represents the desired relative separation between chosen and rejected responses and can be determined from the initial reward gap distribution, avoiding repeated trial-and-error tuning. …

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes ξ-DPO, a reformulation of SimPO for reference-free direct preference optimization. It first analyzes how β implicitly controls sample filtering in SimPO while γ's effect depends on dataset reward-gap structure. The core contribution is an equivalent transformation of the objective from maximizing reward-gap likelihood to minimizing distance to optimal margins, followed by re-expressing the reward as the ratio of chosen to rejected log-probabilities; this is claimed to cancel β exactly, producing a bounded, interpretable margin ξ that can be preset once from the initial reward-gap distribution and thereby eliminates joint tuning of β and γ.

Significance. If the claimed equivalence is exact and the fixed-ξ schedule preserves the original optimum without performance loss, the work would meaningfully reduce hyperparameter sensitivity in a popular class of preference optimization methods, improving reproducibility across datasets with varying reward-gap statistics. The SimPO analysis itself supplies a useful diagnostic lens on existing margin-based objectives.

major comments (3)
  1. [Method section (reformulation paragraph)] The abstract and method section assert an 'equivalent transformation' that cancels β after the ratio redefinition, yet supply neither the intermediate algebraic steps nor a direct comparison of gradients between the original SimPO loss and the new distance-to-margin objective. Without these, it is impossible to confirm that the two problems share the same stationary points once the policy begins to update.
  2. [Definition of ξ and experimental section] ξ is defined from the initial reward-gap distribution of the very policy being optimized. Because the policy (and therefore the gap distribution) evolves during training, a fixed ξ preset at initialization risks becoming mismatched; the manuscript provides no theoretical argument or ablation showing that this mismatch does not alter the final optimum or degrade performance relative to a re-tuned γ.
  3. [Experiments] The central claim that ξ is 'tuning-free' and independent of β rests on the ratio reformulation. Experiments should therefore include a direct verification that the new objective produces identical or superior win rates to SimPO when both are given their respective best hyperparameters, plus a sensitivity plot of final performance versus the choice of initial-gap quantile used to set ξ.
minor comments (2)
  1. [Notation subsection] Notation for the ratio reward (chosen/rejected) versus the original log-probability reward should be introduced with an explicit side-by-side equation to avoid reader confusion.
  2. [SimPO analysis] The sentence claiming that 'β implicitly controls sample filtering' would be clearer if accompanied by the precise filtering threshold expression derived from the loss.
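
On the second minor comment: one candidate form of the threshold expression, reconstructed here from the standard SimPO objective rather than quoted from the paper. Writing the loss as $\mathcal{L} = -\log\sigma(\beta\,\Delta r^{*} - \gamma)$ with $\Delta r^{*} = \frac{1}{|y_w|}\log\pi_\theta(y_w \mid x) - \frac{1}{|y_l|}\log\pi_\theta(y_l \mid x)$, the gradient is

$$\nabla_\theta\,\mathcal{L} = -\,\beta\,\sigma\!\left(\gamma - \beta\,\Delta r^{*}\right)\nabla_\theta\,\Delta r^{*},$$

so each pair's effective weight decays through the sigmoid around $\Delta r^{*} \approx \gamma/\beta$, and a larger β sharpens that cutoff, consistent with the filtering behavior described under Figure 2. Whether the paper's own threshold expression takes exactly this form would need to be confirmed in the revision.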

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important aspects of clarity and validation. We respond point-by-point to the major comments below, committing to revisions that address the concerns while preserving the core contributions of the work.

point-by-point responses
  1. Referee: [Method section (reformulation paragraph)] The abstract and method section assert an 'equivalent transformation' that cancels β after the ratio redefinition, yet supply neither the intermediate algebraic steps nor a direct comparison of gradients between the original SimPO loss and the new distance-to-margin objective. Without these, it is impossible to confirm that the two problems share the same stationary points once the policy begins to update.

    Authors: We agree that the reformulation was presented concisely without full intermediate steps. In the revised manuscript, we will expand the method section to include the complete algebraic derivation: starting from the SimPO objective, applying the equivalent transformation to a distance-to-optimal-margin form, and then substituting the ratio-based reward definition to show exact cancellation of β. We will also derive the gradients of both the original SimPO loss and the new objective, demonstrating that they share identical stationary points under the ratio redefinition. This will make the equivalence verifiable. revision: yes

  2. Referee: [Definition of ξ and experimental section] ξ is defined from the initial reward-gap distribution of the very policy being optimized. Because the policy (and therefore the gap distribution) evolves during training, a fixed ξ preset at initialization risks becoming mismatched; the manuscript provides no theoretical argument or ablation showing that this mismatch does not alter the final optimum or degrade performance relative to a re-tuned γ.

    Authors: We acknowledge the potential for distribution shift during training. The ratio reformulation yields a bounded, interpretable ξ that is independent of β, but the manuscript indeed lacks an explicit stability argument or ablation for the fixed initial ξ. In the revision, we will add a theoretical discussion explaining why the ratio margin remains effective despite evolution (based on its relative separation property), together with an ablation comparing fixed initial ξ against adaptive re-tuning of ξ or γ. This will quantify any performance impact. revision: partial

  3. Referee: [Experiments] The central claim that ξ is 'tuning-free' and independent of β rests on the ratio reformulation. Experiments should therefore include a direct verification that the new objective produces identical or superior win rates to SimPO when both are given their respective best hyperparameters, plus a sensitivity plot of final performance versus the choice of initial-gap quantile used to set ξ.

    Authors: We agree that direct empirical verification would strengthen the tuning-free claim. While existing results show competitive performance, the revised experimental section will add: (i) a head-to-head comparison of win rates for ξ-DPO (using ξ preset from the initial distribution) versus SimPO with its optimally tuned β and γ on the same datasets, and (ii) a sensitivity plot of final win rates versus different initial-gap quantiles for setting ξ. These will confirm robustness and independence from β. revision: yes
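
One possible shape of the stability argument promised in the second response, offered as an editorial reading rather than the authors' claim: because ξ constrains a quotient, it is invariant to any common rescaling of the two rewards,

$$\frac{c\,r(x, y_w)}{c\,r(x, y_l)} = \frac{r(x, y_w)}{r(x, y_l)} \quad \text{for any } c \neq 0,$$

so a ξ preset at initialization remains meaningful to the extent that training moves the chosen and rejected rewards together multiplicatively; drift that changes the two rewards by different factors is exactly what the promised fixed-versus-adaptive ablation would need to probe.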

Circularity Check

0 steps flagged

No significant circularity; reformulation presented as independent equivalent transformation.

full rationale

The paper's core derivation consists of two steps: (1) an explicit claim of equivalent transformation that shifts the objective from likelihood maximization on reward gaps to distance minimization to optimal margins, and (2) a ratio redefinition of reward that algebraically cancels β to produce a new bounded margin ξ. ξ is then preset once from the initial reward-gap distribution of the starting model. This preset is a conventional hyperparameter choice, not a fitted parameter that is later renamed as a prediction, nor a self-referential definition in which the final result is forced by the inputs. No equations are shown reducing the new objective to the original by construction in a tautological way, and no load-bearing self-citation or ansatz smuggling is present. The derivation therefore remains self-contained relative to the SimPO baseline.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the mathematical equivalence of the rewritten objective and the scale-invariance property of the chosen-to-rejected ratio; both are standard domain assumptions in preference optimization rather than new axioms.

axioms (2)
  • domain assumption The preference objective admits an equivalent transformation from maximizing likelihood of reward gaps to minimizing distance to optimal margins.
    Invoked as the first reformulation step in the abstract.
  • domain assumption Expressing reward as the ratio of chosen to rejected scores cancels the effect of the β scaling term.
    Central to the claim that ξ is independent of β.
invented entities (1)
  • ratio reward margin ξ · no independent evidence
    purpose: Provide a bounded, interpretable target separation that replaces the γ margin and removes β dependence.
    Newly defined quantity introduced to solve the tuning problem identified in SimPO.

pith-pipeline@v0.9.0 · 5557 in / 1646 out tokens · 71311 ms · 2026-05-13T00:49:57.043435+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 3 internal anchors

  1. [1]

    Llama 3 model card

    AI@Meta. Llama 3 model card. 2024

  2. [2]

    Thinking fast and slow with deep learning and tree search

    Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 201...

  3. [3]

    A general theoretical paradigm to understand learning from human preferences

    Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pages 4447–4455. PMLR, 2024

  4. [4]

    Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

  5. [5]

    Pythia: A suite for analyzing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International conference on machine learning, pages 2397–2430. PMLR, 2023

  6. [6]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952

  7. [7]

    ODIN: disentangled reward mitigates hacking in RLHF

    Lichang Chen, Chen Zhu, Jiuhai Chen, Davit Soselia, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, and Bryan Catanzaro. ODIN: disentangled reward mitigates hacking in RLHF. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Confere...

  8. [8]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017

  9. [9]

    Enhancing chat language models by scaling high-quality instructional conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, Decem...

  10. [10]

    Model alignment as prospect theoretic optimization

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Model alignment as prospect theoretic optimization. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, J...

  11. [11]

    Towards neuron attributions in multi-modal large language models

    Junfeng Fang, Zac Bi, Ruipeng Wang, Houcheng Jiang, Yuan Gao, Kun Wang, An Zhang, Jie Shi, Xiang Wang, and Tat-Seng Chua. Towards neuron attributions in multi-modal large language models. Advances in Neural Information Processing Systems, 37:122867–122890, 2024

  12. [12]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, Proceedings of Machine Learning Research, pages 10835–...

  13. [13]

    Orpo: Monolithic preference optimization without reference model

    Jiwoo Hong, Noah Lee, and James Thorne. Orpo: Monolithic preference optimization without reference model. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages 11170–11189, 2024

  14. [14]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023

  15. [15]

    OpenAssistant conversations - democratizing large language model alignment

    Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Nguyen, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. OpenAssistant conversations - democratizing large language mode...

  16. [16]

    Omnisql: Synthesizing high-quality text-to-sql data at scale

    Haoyang Li, Shang Wu, Xiaokang Zhang, Xinmei Huang, Jing Zhang, Fuxin Jiang, Shuai Wang, Tieying Zhang, Jianjun Chen, Rui Shi, Hong Chen, and Cuiping Li. Omnisql: Synthesizing high-quality text-to-sql data at scale. Proc. VLDB Endow., 18(11):4695–4709, 2025

  17. [17]

    Alpacaeval: An automatic evaluator of instruction-following models

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 5 2023

  18. [18]

    Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct

    Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jian-Guang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Yansong Tang, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28...

  19. [19]

    Simpo: Simple preference optimization with a reference-free reward

    Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems, 37:124198–124235, 2024

  20. [20]

    Openai gpt-5 system card

    OpenAI. Openai gpt-5 system card. Technical report, OpenAI, 2025

  21. [22]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems , 35:27730–27744, 2022

  22. [23]

    What matters in data for dpo?

    Yu Pan, Zhongze Cai, Guanting Chen, Huaiyang Zhong, and Chonghuan Wang. What matters in data for dpo? arXiv preprint arXiv:2508.18312, 2025

  23. [24]

    Disentangling length from quality in direct preference optimization

    Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024 , volume ACL 2024 of Findings of ACL , pa...

  24. [25]

    Scaling laws for reward model overoptimization in direct alignment algorithms

    Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, W. Bradley Knox, Chelsea Finn, and Scott Niekum. Scaling laws for reward model overoptimization in direct alignment algorithms. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors, Advances in Neural Informati...

  25. [26]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  26. [27]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  27. [28]

    Stanford alpaca: An instruction-following llama model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  28. [29]

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin...

  29. [30]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025

  30. [31]

    RAT-SQL: relation-aware schema encoding and linking for text-to-sql parsers

    Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. RAT-SQL: relation-aware schema encoding and linking for text-to-sql parsers. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-...

  31. [32]

    Interpretable preferences via multi-objective reward modeling and mixture-of-experts

    Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12-16, 2024, Findings of ACL, pages 1...

  32. [33]

    Alphadpo: Adaptive reward margin for direct preference optimization, 2025

    Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. Alphadpo: Adaptive reward margin for direct preference optimization, 2025

  33. [34]

    β-dpo: Direct preference optimization with dynamic β

    Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. β-dpo: Direct preference optimization with dynamic β. Advances in Neural Information Processing Systems, 37:129944–129966, 2024

  34. [35]

    Simper: A minimalist approach to preference alignment without hyperparameters

    Teng Xiao, Yige Yuan, Zhengyu Chen, Mingxiao Li, Shangsong Liang, Zhaochun Ren, and Vasant G. Honavar. Simper: A minimalist approach to preference alignment without hyperparameters. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28,

  35. [36]

    OpenReview.net, 2025

  36. [37]

    Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation

    Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, edi...

  37. [38]

    Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task

    Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir R. Radev. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors,...

  38. [39]

    Sparc: Cross-domain semantic parsing in context

    Tao Yu, Rui Zhang, Michihiro Yasunaga, Yi Chern Tan, Xi Victoria Lin, Suyi Li, Heyang Er, Irene Li, Bo Pang, Tao Chen, Emily Ji, Shreya Dixit, David Proctor, Sungrok Shim, Jonathan Kraft, Vincent Zhang, Caiming Xiong, Richard Socher, and Dragomir R. Radev. Sparc: Cross-domain semantic parsing in context. In Anna Korhonen, David R. Traum, and Lluís Màrqu...

  39. [40]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023

  40. [41]

    Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning

    Victor Zhong, Caiming Xiong, and Richard Socher. Seq2sql: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103, 2017

  41. [42]

    LIMA: less is more for alignment

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. LIMA: less is more for alignment. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information...

  42. [43]

    Fine-Tuning Language Models from Human Preferences

    Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019