Multiplayer Nash Preference Optimization

Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, Yejin Choi

arxiv: 2509.23102 · v3 · submitted 2025-09-27 · 💻 cs.AI · cs.CL

Pith reviewed 2026-05-18 13:05 UTC · model grok-4.3

keywords: Multiplayer Nash Preference Optimization · Nash learning from human feedback · RLHF · LLM alignment · preference optimization · non-transitive preferences · multiplayer games · heterogeneous preferences

The pith

Multiplayer Nash Preference Optimization generalizes two-player Nash learning to n-player games for better handling of complex human preferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MNPO to fix the single-opponent limitation in current Nash-based alignment methods for large language models. It recasts the problem as an n-player game in which each policy competes against a population of opponents while staying close to a reference model. This keeps the equilibrium guarantees of two-player approaches but adds richer interactions that reflect the non-transitive and varied nature of real preferences. Experiments show MNPO beats prior NLHF baselines on instruction-following tasks, with clearest gains when annotators hold differing views.

Core claim

MNPO formulates alignment as an n-player game where each policy competes against a population of opponents and is regularized toward a reference model, inheriting equilibrium guarantees from two-player Nash learning while enabling richer competitive dynamics and improved coverage of diverse preference structures.

What carries the argument

The n-player Nash game with population opponents and reference regularization, which extends two-player NLHF to multiplayer interactions.
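
A minimal sketch of what such an objective could look like on a toy problem with three candidate responses. It assumes (this is not a formula quoted from the paper) that a player's payoff is its average win probability against the opponent population, minus a KL penalty toward the reference policy scaled by a coefficient `tau`. The preference matrix `PREF`, the opponent list, and all function names are hypothetical and chosen only for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def win_prob(p, q, pref):
    """Probability that a response drawn from p is preferred to one drawn from q.

    pref[a][b] is the assumed probability that response a beats response b.
    """
    return sum(p[a] * q[b] * pref[a][b]
               for a in range(len(p)) for b in range(len(q)))

def multiplayer_objective(policy, opponents, ref, pref, tau):
    """Assumed form: mean win rate over the population minus tau * KL to ref."""
    avg_win = sum(win_prob(policy, opp, pref) for opp in opponents) / len(opponents)
    return avg_win - tau * kl(policy, ref)

# Hypothetical cyclic (rock-paper-scissors style) preference matrix.
PREF = [[0.5, 0.9, 0.1],
        [0.1, 0.5, 0.9],
        [0.9, 0.1, 0.5]]
REF = [1 / 3, 1 / 3, 1 / 3]
OPPONENTS = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]

policy = [0.2, 0.2, 0.6]
score = multiplayer_objective(policy, OPPONENTS, REF, PREF, tau=0.1)
ref_score = multiplayer_objective(REF, OPPONENTS, REF, PREF, tau=0.1)
```

Every column of this toy `PREF` sums to 1.5, so the uniform reference policy wins half the time against any opponent and pays no KL penalty. Its objective is therefore 0.5, which gives a convenient baseline when comparing other policies.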

If this is right

Equilibrium guarantees established for two-player methods extend directly to the n-player setting.
Policies encounter richer competitive dynamics through simultaneous interactions with multiple opponents.
Diverse and non-transitive preference structures receive better coverage than single-opponent formulations allow.
Empirical gains appear consistently on instruction-following benchmarks under mixed-policy and heterogeneous-annotator conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The population-opponent structure may reduce single-source bias when preferences are collected from multiple independent groups.
Varying the size of the opponent population offers a direct knob for trading off computational cost against preference diversity.
The same n-player regularization pattern could apply to other multi-stakeholder optimization problems beyond language-model alignment.

Load-bearing premise

That modeling alignment as an n-player game with population opponents and reference regularization accurately captures real heterogeneous human preferences without introducing new biases or scalability issues.

What would settle it

An experiment in which MNPO shows no improvement or adds measurable bias over two-player methods when evaluated on preference data from highly diverse annotators with clear non-transitivity would falsify the central claim.

Original abstract

Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models with human preferences. However, reward-based methods grounded in the Bradley-Terry assumption struggle to capture the nontransitivity and heterogeneity of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While this perspective has inspired algorithms such as INPO, ONPO, and EGPO that offer strong theoretical and empirical guarantees, they remain fundamentally restricted to two-player interactions, introducing a single-opponent bias that fails to capture the full complexity of realistic preference structures. This work introduces Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an n-player game, where each policy competes against a population of opponents while being regularized toward a reference model. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Comprehensive empirical evaluation shows that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences. Code is available at: https://github.com/smiles724/MNPO
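
The abstract's point about the Bradley-Terry assumption can be made concrete with a small check. Under Bradley-Terry, P(a beats b) is a sigmoid of a reward difference, so the log-odds must add along any chain: logit P(a>b) + logit P(b>c) = logit P(a>c). The sketch below measures how far a preference matrix is from satisfying that identity. Both matrices and all function names are hypothetical illustrations, not data from the paper.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def bt_matrix(rewards):
    """Pairwise preference matrix induced by Bradley-Terry rewards."""
    n = len(rewards)
    return [[1.0 / (1.0 + math.exp(-(rewards[i] - rewards[j])))
             for j in range(n)] for i in range(n)]

def bt_consistency_gap(pref, a, b, c):
    """Violation of the additivity that any Bradley-Terry model must satisfy."""
    return logit(pref[a][b]) + logit(pref[b][c]) - logit(pref[a][c])

# Transitive preferences generated from scalar rewards.
TRANSITIVE = bt_matrix([1.0, 0.0, -1.0])

# Cyclic preferences: 0 beats 1, 1 beats 2, 2 beats 0.
CYCLIC = [[0.5, 0.9, 0.1],
          [0.1, 0.5, 0.9],
          [0.9, 0.1, 0.5]]

gap_transitive = bt_consistency_gap(TRANSITIVE, 0, 1, 2)
gap_cyclic = bt_consistency_gap(CYCLIC, 0, 1, 2)
```

The transitive matrix gives a gap of zero up to floating-point error, while the cyclic one gives a large positive gap, so no assignment of scalar rewards can reproduce it. That is the kind of structure a game-theoretic formulation is meant to handle.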

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Multiplayer Nash Preference Optimization (MNPO) as a generalization of two-player Nash Learning from Human Feedback (NLHF) to an n-player game. Each policy competes against a population of opponents and is regularized toward a reference model. The central claims are that MNPO inherits the equilibrium guarantees of two-player NLHF methods, enables richer competitive dynamics to better model non-transitive and heterogeneous preferences, and empirically outperforms existing NLHF baselines on instruction-following benchmarks under heterogeneous annotator conditions.

Significance. If the claimed inheritance of equilibrium guarantees holds via an exact reduction and the performance gains are robustly attributable to the multiplayer formulation, MNPO could meaningfully extend game-theoretic alignment approaches to capture more realistic preference structures. The public code release aids reproducibility and verification.

major comments (2)

[Method / Theoretical Analysis] The abstract and introduction assert that MNPO inherits the equilibrium guarantees of two-player NLHF methods. No explicit reduction is shown demonstrating that n=2 and population size=1 recovers the original NLHF objective and Nash equilibrium conditions exactly (without approximation artifacts from population sampling or reference regularization). This reduction is load-bearing for the central generalization claim.
[Experiments] The empirical evaluation claims consistent outperformance and improved coverage of diverse preference structures under heterogeneous annotator conditions. Details are needed on population sampling procedure, how mixed-policy evaluation isolates the effect of multiplayer dynamics, and controls for training compute or model scale to rule out confounds.

minor comments (2)

[Notation / Method] Clarify the precise mathematical formulation of the n-player objective and population opponent sampling in the method section with numbered equations for readability.
[Discussion] Add a brief discussion of potential scalability limitations or new biases introduced by population-based opponents when n grows.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications.

Point-by-point responses

Referee: [Method / Theoretical Analysis] The abstract and introduction assert that MNPO inherits the equilibrium guarantees of two-player NLHF methods. No explicit reduction is shown demonstrating that n=2 and population size=1 recovers the original NLHF objective and Nash equilibrium conditions exactly (without approximation artifacts from population sampling or reference regularization). This reduction is load-bearing for the central generalization claim.

Authors: We agree that an explicit reduction strengthens the central claim. In the revised manuscript we will add a new subsection (in Section 3) that formally derives the reduction: setting n=2 and opponent population size=1 causes the MNPO objective to recover the exact two-player NLHF objective and the corresponding Nash equilibrium conditions, with no residual sampling or regularization artifacts because the reference-model term is identical to that used in the original NLHF formulation. (revision: yes)
Referee: [Experiments] The empirical evaluation claims consistent outperformance and improved coverage of diverse preference structures under heterogeneous annotator conditions. Details are needed on population sampling procedure, how mixed-policy evaluation isolates the effect of multiplayer dynamics, and controls for training compute or model scale to rule out confounds.

Authors: We will expand the experimental details section to describe the population sampling procedure (uniform sampling from a fixed pool of policies trained on disjoint preference subsets) and to clarify that mixed-policy evaluation compares each method against the same mixture of opponents, thereby isolating the benefit of multiplayer dynamics from single-opponent baselines. All methods were trained for an identical number of gradient steps on models of the same scale; we will add a supplementary table listing the shared hyperparameters to rule out compute or scale confounds. (revision: yes)
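
The reduction discussed in the first exchange can be illustrated numerically on a toy problem. The sketch assumes a particular form for the multiplayer objective (mean win rate over the opponent population minus `tau` times a KL penalty toward the reference), which is an assumption made here for illustration and not a formula quoted from the paper. Under that assumption, a population of size one yields the same value as the two-player objective. Whether the paper's actual objective reduces exactly depends on its definition, which is the referee's point.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def win_prob(p, q, pref):
    """Probability that a response from p is preferred to a response from q."""
    return sum(p[a] * q[b] * pref[a][b]
               for a in range(len(p)) for b in range(len(q)))

def two_player_objective(policy, opponent, ref, pref, tau):
    """KL-regularized win rate against a single opponent."""
    return win_prob(policy, opponent, pref) - tau * kl(policy, ref)

def multiplayer_objective(policy, opponents, ref, pref, tau):
    """Assumed multiplayer form: mean win rate over a population of opponents."""
    avg_win = sum(win_prob(policy, o, pref) for o in opponents) / len(opponents)
    return avg_win - tau * kl(policy, ref)

# Hypothetical toy inputs.
PREF = [[0.5, 0.7, 0.4],
        [0.3, 0.5, 0.8],
        [0.6, 0.2, 0.5]]
REF = [0.4, 0.4, 0.2]
policy = [0.5, 0.3, 0.2]
opponent = [0.1, 0.6, 0.3]

# With a single opponent the two objectives coincide.
reduction_gap = abs(
    multiplayer_objective(policy, [opponent], REF, PREF, tau=0.2)
    - two_player_objective(policy, opponent, REF, PREF, tau=0.2)
)
```

A check like this only confirms the algebraic identity for the assumed form. It says nothing about approximation effects from sampling opponents during training.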

Circularity Check

0 steps flagged

MNPO introduces a distinct n-player formulation without reducing core claims to inputs by construction or self-referential definitions.

Full rationale

The paper's derivation formulates alignment as an n-player game where each policy competes against a population of opponents and is regularized to a reference model. It asserts inheritance of two-player NLHF equilibrium guarantees as a property of this generalization, but the provided abstract and context show no equations or steps where the multiplayer objective is defined in terms of the target result or where a fitted parameter is relabeled as a prediction. Prior NLHF work (INPO, ONPO, EGPO) is cited as external inspiration rather than a load-bearing self-citation chain, and the new elements—population opponents and richer competitive dynamics—add independent content for heterogeneous preferences. Empirical results on instruction-following benchmarks further anchor the claims outside any internal fit. No quoted reduction exhibits equivalence by construction, so the derivation remains self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework relies on standard game-theoretic assumptions about equilibrium existence and the effectiveness of regularization toward a reference policy; no new entities are introduced.

free parameters (1)

regularization coefficient
Controls strength of pull toward reference model; likely tuned during training as in standard RLHF.
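
To see what this coefficient does, consider a single player with a fixed per-response payoff vector `g`. Maximizing the expected payoff minus `tau` times KL(policy || ref) has the standard closed-form solution policy(y) ∝ ref(y) · exp(g(y) / tau). The sketch below evaluates that formula for a large and a small `tau`. The payoff values, the reference distribution, and the function names are hypothetical, and the paper's actual update rule may differ.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_best_response(payoff, ref, tau):
    """Maximizer of <policy, payoff> - tau * KL(policy || ref) over the simplex."""
    weights = [r * math.exp(g / tau) for g, r in zip(payoff, ref)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical per-response payoffs (e.g. mean win rate against a population).
payoff = [0.7, 0.5, 0.3]
ref = [0.2, 0.3, 0.5]

strong_reg = regularized_best_response(payoff, ref, tau=10.0)   # stays near ref
weak_reg = regularized_best_response(payoff, ref, tau=0.05)     # chases payoff
```

With `tau=10.0` the result barely moves from the reference, while `tau=0.05` puts most of its mass on the highest-payoff response. This is the trade-off the coefficient controls.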

axioms (1)

domain assumption: Nash equilibrium exists and is stable in the multiplayer preference game
Invoked when claiming inheritance of guarantees from two-player methods to the n-player setting.

pith-pipeline@v0.9.0 · 5815 in / 1060 out tokens · 49853 ms · 2026-05-18T13:05:38.723841+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an n-player game...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment
cs.CL 2026-05 unverdicted novelty 6.0

Introduces HRC model for game-theoretic decomposition of preferences into orthogonal transitive and cyclic components, paired with DSPPO for dynamic Nash-seeking alignment, reporting gains over BT and GPM baselines on...
Towards General Preference Alignment: Diffusion Models at Nash Equilibrium
cs.LG 2026-05 unverdicted novelty 5.0

Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 2 Pith papers · 22 internal anchors

[1]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Human alignment of large language models through online preference optimisation.arXiv preprint arXiv:2403.08635,

Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, et al. Human alignment of large language models through online preference optimisation.arXiv preprint arXiv:2403.08635,

work page arXiv
[3]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models.arXiv preprint arXiv:2401.01335,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

UltraFeedback: Boosting Language Models with Scaled AI Feedback

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. UltraFeedback: Boosting language models with high-quality feedback.arXiv preprint arXiv:2310.01377,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

RLHF Workflow: From Reward Modeling to Online RLHF

Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Robust preference optimization through reward model distillation.arXiv preprint arXiv:2405.19316,

Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, and Jonathan Berant. Robust preference optimization through reward model distillation. arXiv preprint arXiv:2405.19316,

work page arXiv
[11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2009
[13]

ORPO: Monolithic Preference Optimization without Reference Model

Jiwoo Hong, Noah Lee, and James Thorne. Orpo: Monolithic preference optimization without reference model.arXiv preprint arXiv:2403.07691,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

From live data to high-quality benchmarks: The arena-hard pipeline.Blog post.[Accessed 07-02-2025],

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From live data to high-quality benchmarks: The arena-hard pipeline.Blog post.[Accessed 07-02-2025],

work page 2025
[15]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods.arXiv preprint arXiv:2109.07958,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

The hidden link between rlhf and contrastive learning.arXiv preprint arXiv:2506.22578,

Xufei Lv, Haoyuan Sun, Xuefeng Bai, Min Zhang, Houde Liu, and Kehai Chen. The hidden link between rlhf and contrastive learning.arXiv preprint arXiv:2506.22578,

work page arXiv
[18]

Nash learning from human feedback

Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, et al. Nash learning from human feedback. arXiv preprint arXiv:2312.00886,

work page arXiv
[19]

Pre-dpo: Improving data utilization in direct preference optimization using a guiding reference model.arXiv preprint arXiv:2504.15843,

Junshu Pan, Wei Shen, Shulin Huang, Qiji Zhou, and Yue Zhang. Pre-dpo: Improving data utilization in direct preference optimization using a guiding reference model.arXiv preprint arXiv:2504.15843,

work page arXiv
[20]

Disentangling length from quality in direct preference optimization, 2024

Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization.arXiv preprint arXiv:2403.19159,

work page arXiv
[21]

Direct nash optimization: Teaching language models to self-improve with general preferences,

Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, and Tengyang Xie. Direct nash optimization: Teaching language models to self-improve with general preferences.arXiv preprint arXiv:2404.03715,

work page arXiv
[22]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games.arXiv preprint arXiv:2206.05825, 2022

Samuel Sokota, Ryan D’Orazio, J Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games.arXiv preprint arXiv:2206.05825,

work page arXiv
[25]

Reward-aware preference optimization: A unified mathematical framework for model alignment.arXiv preprint arXiv:2502.00203,

Shengyang Sun, Yian Zhang, Alexander Bukharin, David Mosallanezhad, Jiaqi Zeng, Soumye Singhal, Gerald Shen, Adithya Renduchintala, Tugrul Konuk, Yi Dong, et al. Reward-aware preference optimization: A unified mathematical framework for model alignment.arXiv preprint arXiv:2502.00203,

work page arXiv
[26]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Causal confusion and reward misidentification in preference-based reward learning

Jeremy Tien, Jerry Zhi-Yang He, Zackory Erickson, Anca D Dragan, and Daniel S Brown. Causal confusion and reward misidentification in preference-based reward learning. arXiv preprint arXiv:2204.06601,

work page arXiv
[29]

Beyond reverse kl: Generalizing direct preference optimization with diverse divergence constraints, 2023

Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, and Yuxin Chen. Beyond reverse kl: Generalizing direct preference optimization with diverse divergence constraints. arXiv preprint arXiv:2309.16240,

work page arXiv
[30]

Interpretable preferences via multi-objective reward modeling and mixture-of-experts

Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. In EMNLP, 2024a.
Mingzhi Wang, Chengdong Ma, Qizhi Chen, Linjian Meng, Yang Han, Jiancong Xiao, Zhaowei Zhang, Jing Huo, Weijie J Su, and Yaodong Yang. Magnetic preference optimization: Achieving last-itera...

work page arXiv
[31]

Self-play preference optimization for language model alignment

Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-play preference optimization for language model alignment. arXiv preprint arXiv:2405.00675,

work page arXiv 2024
[32]

Exploratory preference optimization: Harnessing implicit q*-approximation for sample-efficient rlhf.arXiv preprint arXiv:2405.21046,

Tengyang Xie, Dylan J Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, and Alexander Rakhlin. Exploratory preference optimization: Harnessing implicit q*-approximation for sample-efficient rlhf.arXiv preprint arXiv:2405.21046,

work page arXiv
[33]

Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint.arXiv preprint arXiv:2312.11456,

Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint. arXiv preprint arXiv:2312.11456,

work page arXiv
[34]

Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation.arXiv preprint arXiv:2401.08417, 2024

Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation.arXiv preprint arXiv:2401.08417,

work page arXiv
[35]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[36]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,

work page internal anchor Pith review Pith/arXiv arXiv
[37]

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, et al. Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks.arXiv preprint arXiv:2504.05118,

work page internal anchor Pith review Pith/arXiv arXiv
[38]

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence?arXiv preprint arXiv:1905.07830,

work page internal anchor Pith review Pith/arXiv arXiv 1905
[39]

Improving LLM general preference alignment via optimistic online mirror descent,

Yuheng Zhang, Dian Yu, Tao Ge, Linfeng Song, Zhichen Zeng, Haitao Mi, Nan Jiang, and Dong Yu. Improving llm general preference alignment via optimistic online mirror descent. arXiv preprint arXiv:2502.16852, 2025a.
Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, and Dong Yu. Iterative nash policy optimization: ...

work page arXiv
[40]

Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911,

work page internal anchor Pith review Pith/arXiv arXiv
[41]

Extragradient preference optimization (egpo): Beyond last-iterate convergence for nash learning from human feedback.arXiv preprint arXiv:2503.08942,

Runlong Zhou, Maryam Fazel, and Simon S Du. Extragradient preference optimization (egpo): Beyond last-iterate convergence for nash learning from human feedback.arXiv preprint arXiv:2503.08942,

work page arXiv
[42]

Wpo: Enhancing rlhf with weighted preference optimization

Wenxuan Zhou, Ravi Agrawal, Shujian Zhang, Sathish Reddy Indurthi, Sanqiang Zhao, Kaiqiang Song, Silei Xu, and Chenguang Zhu. Wpo: Enhancing rlhf with weighted preference optimization. arXiv preprint arXiv:2406.11827,

work page arXiv
[43]

Self-play methods like SPIN (Chen et al., 2024), SPPO (Wu et al., 2024), and INPO (Zhang et al., 2025b) use no-regret dynamics, while Pre-DPO (Pan et al.,

A. Additional Related Work on Game-Theoretic RLHF. Another perspective casts preference optimization as a game between the model and its opponents, where equilibrium-seeking methods provide stronger last-iterate guarantees. Self-play methods like SPIN (Chen et al., 2024), SPPO (Wu et al., 2024), and INPO (Zhang et al., 2025b) use no-r...

work page 2024
[44]

More recent methods, such as ONPO (Zhang et al., 2025a) and EGPO (Zhou et al., 2025), introduce optimism and extragradient techniques for stable convergence under noisy preferences

and MPO (Wang et al., 2024b) adapt mirror descent. More recent methods, such as ONPO (Zhang et al., 2025a) and EGPO (Zhou et al., 2025), introduce optimism and extragradient techniques for stable convergence under noisy preferences. These advances primarily focus on two-player games but highlight the importance of equilibrium views in preference optimizat...

work page 2025
[45]

INPO (Zhang et al., 2025b) is reproduced according to the settings described in the paper

is trained using the official Hugging Face DPO Trainer, while SimPO (Meng et al., 2024) (footnote 2) and SPPO (Wu et al., 2024) (footnote 3) follow their official GitHub implementations. INPO (Zhang et al., 2025b) is reproduced according to the settings described in the paper. Hyperparameters. For MNPO, we adopt hyperparameters consistent with SimPO and INPO, using a cosine learni...

work page 2024
[46]

minimal,

MT-Bench (footnote 6) is tested and implemented following its official GitHub repository. The LLM ...
Footnote 2: https://github.com/princeton-nlp/SimPO
Footnote 3: https://github.com/uclaml/SPPO
Footnote 4: https://huggingface.co/datasets/princeton-nlp/gemma2-ultrafeedback-armorm
Footnote 5: https://github.com/modelscope/evalscope
Footnote 6: https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge

work page 2025
[47]

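$\left[ \log \frac{\pi_\theta(y_t^+)}{\pi'_t(y_t^+)} - \log \frac{\pi_\theta(y_t^-)}{\pi'_t(y_t^-)} - \frac{\eta}{2} \right]^2$, where $\pi'_t = \arg\min_\pi \mathbb{E}_{(x, y_t^+, y_t^-) \sim D_t}$

$\mathbb{E}_{(x, y'_t, y''_t) \sim D_t} \; -\sigma\big(r_t(x, y'_t) - r_t(x, y''_t)\big) \log \sigma\Big( \eta \log \frac{\pi(y'_t|x)}{\pi_t(y'_t|x)} - \eta \log \frac{\pi(y''_t|x)}{\pi_t(y''_t|x)} \Big) - \sigma\big(r_t(x, y''_t) - r_t(x, y'_t)\big) \log \sigma\Big( \eta \log \frac{\pi(y''_t|x)}{\pi_t(y''_t|x)} - \eta \log \frac{\pi(y'_t|x)}{\pi_t(y'_t|x)} \Big)$ | ✗ | ✓ | Online | DNO-Prct (Rosset et al., 2024) | $\mathbb{E}_{(x, y_t^+, y_t^-) \sim D_t} \; -\log \sigma\Big( \tilde{\eta} \log \frac{\pi(y_t^+|x)}{\pi_t(y_t^+|x)} - \tilde{\eta} \log \frac{\pi(y_t^-|x)}{\pi_t(y_t^-|x)} \Big)$ | ✗ | ✓ | Online | SPIN (Chen et al., 2024) | $\mathbb{E}_{(x, y, y_t^-) \sim D_t}$ −ℓ(β log π_θ(y...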

work page 2024
[48]

Thus, the MNPO update inherits the regret guarantees of OMD: the average regret after T rounds scales as O(1/√T), ensuring convergence to equilibrium in the no-regret learning sense. Pairwise Ratio Dynamics. To avoid computing the intractable partition function, we analyze the pairwise log-ratio h_t(π, y, y′) = log [π(y|x) / π(y′|...

work page 1999

[1] [1]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862,

work page internal anchor Pith review Pith/arXiv arXiv

[2]

Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, et al. Human alignment of large language models through online preference optimisation. arXiv preprint arXiv:2403.08635.

[3]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

[4]

Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335.

[5]

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.

[6]

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. UltraFeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.

[7]

Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. RLHF workflow: From reward modeling to online RLHF. arXiv preprint arXiv:2405.07863.

[8]

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475.

[9]

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306.

[10]

Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, and Jonathan Berant. Robust preference optimization through reward model distillation. arXiv preprint arXiv:2405.19316.

[11]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

[12]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

[13]

Jiwoo Hong, Noah Lee, and James Thorne. ORPO: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691.

[14]

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From live data to high-quality benchmarks: The Arena-Hard pipeline. Blog post. [Accessed 07-02-2025].

[15]

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.

[16]

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

[17]

Xufei Lv, Haoyuan Sun, Xuefeng Bai, Min Zhang, Houde Liu, and Kehai Chen. The hidden link between RLHF and contrastive learning. arXiv preprint arXiv:2506.22578.

[18]

Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, et al. Nash learning from human feedback. arXiv preprint arXiv:2312.00886, 18.

[19]

Junshu Pan, Wei Shen, Shulin Huang, Qiji Zhou, and Yue Zhang. Pre-DPO: Improving data utilization in direct preference optimization using a guiding reference model. arXiv preprint arXiv:2504.15843.

[20]

Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. arXiv preprint arXiv:2403.19159.

[21]

Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, and Tengyang Xie. Direct Nash optimization: Teaching language models to self-improve with general preferences. arXiv preprint arXiv:2404.03715.

[22]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[23]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[24]

Samuel Sokota, Ryan D’Orazio, J Zico Kolter, Nicolas Loizou, Marc Lanctot, Ioannis Mitliagkas, Noam Brown, and Christian Kroer. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. arXiv preprint arXiv:2206.05825.

[25]

Shengyang Sun, Yian Zhang, Alexander Bukharin, David Mosallanezhad, Jiaqi Zeng, Soumye Singhal, Gerald Shen, Adithya Renduchintala, Tugrul Konuk, Yi Dong, et al. Reward-aware preference optimization: A unified mathematical framework for model alignment. arXiv preprint arXiv:2502.00203.

[26]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

[27]

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.

[28]

Jeremy Tien, Jerry Zhi-Yang He, Zackory Erickson, Anca D Dragan, and Daniel S Brown. Causal confusion and reward misidentification in preference-based reward learning. arXiv preprint arXiv:2204.06601.

[29]

Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, and Yuxin Chen. Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints. arXiv preprint arXiv:2309.16240.

[30]

Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. In EMNLP, 2024a.

Mingzhi Wang, Chengdong Ma, Qizhi Chen, Linjian Meng, Yang Han, Jiancong Xiao, Zhaowei Zhang, Jing Huo, Weijie J Su, and Yaodong Yang. Magnetic preference optimization: Achieving last-itera...

[31]

Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-play preference optimization for language model alignment. arXiv preprint arXiv:2405.00675.

[32]

Tengyang Xie, Dylan J Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, and Alexander Rakhlin. Exploratory preference optimization: Harnessing implicit Q*-approximation for sample-efficient RLHF. arXiv preprint arXiv:2405.21046.

[33]

Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint. arXiv preprint arXiv:2312.11456.

[34]

Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation. arXiv preprint arXiv:2401.08417.

[35]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[36]

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.

[37]

Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, et al. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118.

[38]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.

[39]

Yuheng Zhang, Dian Yu, Tao Ge, Linfeng Song, Zhichen Zeng, Haitao Mi, Nan Jiang, and Dong Yu. Improving LLM general preference alignment via optimistic online mirror descent. arXiv preprint arXiv:2502.16852, 2025a.

Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, and Dong Yu. Iterative Nash policy optimization: ...

[40]

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.

[41]

Runlong Zhou, Maryam Fazel, and Simon S Du. Extragradient preference optimization (EGPO): Beyond last-iterate convergence for Nash learning from human feedback. arXiv preprint arXiv:2503.08942.

[42]

Wenxuan Zhou, Ravi Agrawal, Shujian Zhang, Sathish Reddy Indurthi, Sanqiang Zhao, Kaiqiang Song, Silei Xu, and Chenguang Zhu. WPO: Enhancing RLHF with weighted preference optimization. arXiv preprint arXiv:2406.11827.

[43]

A ADDITIONAL RELATED WORK ON GAME-THEORETIC RLHF

Another perspective casts preference optimization as a game between the model and its opponents, where equilibrium-seeking methods provide stronger last-iterate guarantees. Self-play methods like SPIN (Chen et al., 2024), SPPO (Wu et al., 2024), and INPO (Zhang et al., 2025b) use no-regret dynamics, while Pre-DPO (Pan et al., ...

[44]

... and MPO (Wang et al., 2024b) adapt mirror descent. More recent methods, such as ONPO (Zhang et al., 2025a) and EGPO (Zhou et al., 2025), introduce optimism and extragradient techniques for stable convergence under noisy preferences. These advances primarily focus on two-player games but highlight the importance of equilibrium views in preference optimizat...

[45]

... is trained using the official Hugging Face DPO Trainer, while SimPO (Meng et al., 2024)² and SPPO (Wu et al., 2024)³ follow their official GitHub implementations. INPO (Zhang et al., 2025b) is reproduced according to the settings described in the paper.

Hyperparameters. For MNPO, we adopt hyperparameters consistent with SimPO and INPO, using a cosine learni...

[46]

MT-Bench⁶ is tested and implemented following its official GitHub repository. The LLM ...

² https://github.com/princeton-nlp/SimPO
³ https://github.com/uclaml/SPPO
⁴ https://huggingface.co/datasets/princeton-nlp/gemma2-ultrafeedback-armorm
⁵ https://github.com/modelscope/evalscope
⁶ https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge

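[47]

... $\log\frac{\pi_\theta(y_t^+)}{\pi'_t(y_t^+)} - \log\frac{\pi_\theta(y_t^-)}{\pi'_t(y_t^-)} - \frac{\eta}{2}\Big]^2$, where $\pi'_t = \arg\min_\pi \mathbb{E}_{(x, y_t^+, y_t^-) \sim D_t}$ ...

... $\mathbb{E}_{(x, y'_t, y''_t) \sim D_t}\Big[ -\sigma\big(r_t(x, y'_t) - r_t(x, y''_t)\big) \log \sigma\Big(\eta \log \frac{\pi(y'_t|x)}{\pi_t(y'_t|x)} - \eta \log \frac{\pi(y''_t|x)}{\pi_t(y''_t|x)}\Big) - \sigma\big(r_t(x, y''_t) - r_t(x, y'_t)\big) \log \sigma\Big(\eta \log \frac{\pi(y''_t|x)}{\pi_t(y''_t|x)} - \eta \log \frac{\pi(y'_t|x)}{\pi_t(y'_t|x)}\Big) \Big]$ ✗ ✓ Online

DNO-Prct (Rosset et al., 2024): $\mathbb{E}_{(x, y_t^+, y_t^-) \sim D_t}\Big[ -\log \sigma\Big(\tilde{\eta} \log \frac{\pi(y_t^+|x)}{\pi_t(y_t^+|x)} - \tilde{\eta} \log \frac{\pi(y_t^-|x)}{\pi_t(y_t^-|x)}\Big) \Big]$ ✗ ✓ Online

SPIN (Chen et al., 2024): E_{(x, y, y_t^-) ∼ D_t} −ℓ(β log π_θ(y...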

[48]

Thus, the MNPO update inherits the regret guarantees of OMD: the average regret after T rounds scales as O(1/√T), ensuring convergence to equilibrium in the no-regret learning sense.

Pairwise Ratio Dynamics. To avoid computing the intractable partition function, we analyze the pairwise log-ratio h_t(π, y, y′) = log π(y|x)/π(y′|...
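
To illustrate why a pairwise log-ratio avoids the partition function, here is a minimal Python sketch. It assumes a toy softmax policy over three hypothetical responses, and it assumes the log-ratio is the log of π(y|x) over π(y′|x), as the truncated excerpt suggests. The scores and the function names `softmax` and `pairwise_log_ratio` are illustrative and are not taken from the paper.

```python
import math


def softmax(scores):
    """Normalize unnormalized log-scores into probabilities (computes the partition function)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)  # the normalizer that the pairwise form never needs
    return [e / z for e in exps]


def pairwise_log_ratio(scores, i, j):
    """log pi(y_i|x) - log pi(y_j|x) for a policy proportional to exp(scores).

    The normalizer cancels in the ratio, so only the two scores are needed.
    """
    return scores[i] - scores[j]


# Hypothetical unnormalized log-scores for three candidate responses.
scores = [1.5, -0.3, 0.7]

# Route 1: normalize first, then take the difference of log-probabilities.
probs = softmax(scores)
direct = math.log(probs[0]) - math.log(probs[1])

# Route 2: skip normalization entirely.
shortcut = pairwise_log_ratio(scores, 0, 1)

print(direct, shortcut)
```

Both routes agree up to floating-point error, which is the property that makes a pairwise formulation attractive when the normalizer over all responses is intractable.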