Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

Adam Barla; Emanuele Nevali; Luca Viano; Volkan Cevher

arxiv: 2602.06239 · v3 · pith:RITIFKOXnew · submitted 2026-02-05 · 💻 cs.LG

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

Adam Barla , Emanuele Nevali , Luca Viano , Volkan Cevher This is my paper

Pith reviewed 2026-05-21 13:02 UTC · model grok-4.3

classification 💻 cs.LG

keywords preference optimizationdirect preference optimizationover-optimizationensemble methodsconcentrability coefficienttabular settingpessimistic aggregation

0 comments

The pith

PEPO uses pessimistic ensembles on disjoint data subsets to avoid over-optimization in DPO by depending only on single-policy concentrability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PEPO as a single-step DPO-like method that mitigates over-optimization in preference learning without knowing the data distribution or training an explicit reward model. It trains an ensemble of policies on disjoint data subsets and aggregates them with a worst-case construction that favors agreement across the models to enforce pessimism. In the tabular setting this yields sample complexity bounds that depend only on a single-policy concentrability coefficient, in contrast to the all-policy concentrability that appears in guarantees for standard DPO. A sympathetic reader would care because over-optimization is a known practical failure mode that makes learned policies degrade outside the training data, and PEPO keeps the training simplicity of DPO while supplying a theoretical fix.

Core claim

PEPO achieves sample complexity guarantees depending only on a single-policy concentrability coefficient by training an ensemble of preference-optimized policies on disjoint data subsets and aggregating them through a worst-case construction that favors agreement across models, thereby avoiding the all-policy concentrability which affects the guarantees of algorithms prone to over-optimization such as DPO.

What carries the argument

The pessimistic ensemble aggregation, which constructs a final policy from the worst-case agreement among models trained on disjoint data subsets to enforce pessimism without an explicit reward model.

If this is right

Sample complexity in the tabular setting depends only on single-policy concentrability rather than all-policy concentrability.
The method retains the single-step training simplicity of DPO while adding theoretical protection against over-optimization.
No knowledge of the data-generating distribution is required for the guarantees.
Practical performance matches or exceeds DPO while remaining computationally comparable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same disjoint-subset construction might reduce concentrability requirements in non-tabular or continuous settings if policy diversity can be maintained.
PEPO could be combined with other regularization techniques to further improve robustness in large language model alignment.
Empirical checks on real preference datasets would test whether the required diversity across disjoint subsets holds in practice.

Load-bearing premise

Training on disjoint data subsets produces sufficiently diverse policies whose worst-case aggregation still yields a performant and safe policy.

What would settle it

A tabular MDP in which the worst-case aggregation of the subset-trained policies produces a low-performing or unsafe policy even though each individual policy is good on its own subset.

Figures

Figures reproduced from arXiv: 2602.06239 by Adam Barla, Emanuele Nevali, Luca Viano, Volkan Cevher.

**Figure 2.** Figure 2: Experiment in 3 arms bandit setting for the situation of known and unknown πdata. experimental setting where the preference dataset is generated directly from πref (i.e., πdata = πref), as this would restore the conditions under which χ 2PO and SFT+DPO are theoretically sound. Known vs unknown πdata in a toy environment. For our toy experiment, we consider an empty prompt and 3 possible actions. The prefer… view at source ↗

**Figure 3.** Figure 3: Ablation for L in a bandit setting. A.2 Details for the experiments in the controlled setting For Figure 2a, we used πdata = πref = [0.04, 0.93, 0.03]. That is, the largest mass is on the central action, which is the optimal one. This makes the optimal action well-covered, i.e. small C ⋆ , therefore χ 2PO and PEPO can work reliably. At the same time, the all-policy concentrability is large because the two … view at source ↗

**Figure 4.** Figure 4: Result in the controlled setting without regularization, i.e. with [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗

**Figure 6.** Figure 6: Illustration of the tie probability (missing mass) resulting from a right shifting of the sigmoid [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗

read the original abstract

We introduce PEPO (Pessimistic Ensemble based Preference Optimization), a single-step Direct Preference Optimization (DPO)-like algorithm to mitigate the well-known over-optimization issue in preference learning without requiring the knowledge of the data-generating distribution or learning an explicit reward model. PEPO achieves pessimism via an ensemble of preference-optimized policies trained on disjoint data subsets and then aggregates them through a worst case construction that favors the agreement across models. In the tabular setting, PEPO achieves sample complexity guarantees depending only on a single-policy concentrability coefficient, thus avoiding the all-policy concentrability which affects the guarantees of algorithms prone to over-optimization, such as DPO. The theoretical findings are corroborated by a convincing practical performance, while retaining the simplicity and the practicality of DPO-style training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PEPO adds ensemble pessimism to DPO via disjoint subsets and worst-case aggregation, claiming single-policy concentrability in tabular settings while keeping training simple.

read the letter

The main thing to know is that this paper introduces PEPO, which splits preference data into disjoint subsets, runs DPO on each to get an ensemble, and aggregates them with a worst-case rule that favors agreement across models. In the tabular case they claim this yields sample complexity depending only on single-policy concentrability instead of the all-policy version that usually appears with plain DPO and over-optimization issues. They keep the single-step training style and avoid needing an explicit reward model or the data distribution upfront. The experiments are presented as supporting the practical side. What the work does reasonably well is extend pessimism techniques from offline RL into the DPO setting without adding heavy extra machinery. The idea of using disjoint subsets for diversity and then taking the worst-case is a straightforward way to inject caution, and it preserves the appeal of DPO-style methods. The soft spot is whether the disjoint training actually delivers enough policy diversity for the worst-case step to truly limit coverage to a single reference policy. If the independently trained policies share similar extrapolation biases, the aggregation could still allow over-optimization in low-coverage regions, which would weaken the concentrability reduction. The abstract is light on proof details, so the full derivations need checking to confirm the bounds hold without hidden fitting or stronger assumptions. This paper is mainly for researchers working on theoretical improvements to preference optimization and RLHF. A reader focused on stabilizing DPO or reducing concentrability requirements would get value from the construction and the tabular guarantees. It deserves a serious referee to verify the proofs and compare against other DPO variants.

Referee Report

2 major / 2 minor

Summary. The paper introduces PEPO (Pessimistic Ensemble based Preference Optimization), a DPO-style algorithm that trains an ensemble of policies on disjoint data subsets and aggregates them via a worst-case agreement construction to induce pessimism. The central theoretical claim is that, in the tabular setting, this yields sample-complexity guarantees depending only on a single-policy concentrability coefficient, thereby avoiding the all-policy concentrability that appears in analyses of standard DPO and other over-optimization-prone methods. The approach requires neither knowledge of the data-generating distribution nor an explicit reward model, and the authors report competitive empirical performance.

Significance. If the concentrability reduction is rigorously established, the result would be significant for preference optimization and RLHF: it supplies a concrete algorithmic mechanism (disjoint-subset ensembles plus worst-case aggregation) that converts a multi-policy coverage requirement into a single-policy one while preserving the simplicity of DPO-style training. Such a guarantee would be a useful reference point for designing safe preference learners that do not rely on explicit reward modeling or distribution knowledge.

major comments (2)

The reduction from all-policy to single-policy concentrability rests on the claim that worst-case aggregation over policies trained on disjoint subsets produces a policy whose effective coverage is controlled by one reference policy alone. The manuscript does not supply an explicit argument or assumption showing that disjoint subsets necessarily generate sufficient policy diversity, nor does it bound the extrapolation bias that could be shared across independently optimized policies. Without such justification, the worst-case operator may still select an over-optimized action whose concentrability coefficient is closer to the all-policy quantity than to the single-policy quantity (see the tabular analysis and the definition of the aggregated policy).
The sample-complexity statement is presented without the full derivation, explicit error bounds, or the precise assumptions placed on the data-generating process and on the reference policy. Because the concentrability reduction is the load-bearing step for the main claim, the absence of these details prevents verification that no hidden multi-policy coverage terms reappear in the final bound.

minor comments (2)

Notation for the worst-case aggregation operator and for the single-policy concentrability coefficient should be introduced earlier and used consistently throughout the theoretical section.
The empirical section would benefit from an ablation that isolates the contribution of the disjoint-subset training versus the worst-case aggregation step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address the major comments point by point below and will revise the manuscript to strengthen the theoretical presentation.

read point-by-point responses

Referee: The reduction from all-policy to single-policy concentrability rests on the claim that worst-case aggregation over policies trained on disjoint subsets produces a policy whose effective coverage is controlled by one reference policy alone. The manuscript does not supply an explicit argument or assumption showing that disjoint subsets necessarily generate sufficient policy diversity, nor does it bound the extrapolation bias that could be shared across independently optimized policies. Without such justification, the worst-case operator may still select an over-optimized action whose concentrability coefficient is closer to the all-policy quantity than to the single-policy quantity (see the tabular analysis and the definition of the aggregated policy).

Authors: We appreciate the referee highlighting the need for greater explicitness here. The current tabular analysis and definition of the aggregated policy implicitly rely on the fact that disjoint subsets yield independently optimized policies with distinct extrapolation behaviors, and that the worst-case aggregation selects the most conservative action among them. However, we agree that a self-contained argument is warranted. In the revision we will add a dedicated lemma that (i) shows how independent optimization on disjoint subsets induces policy diversity sufficient to cover the relevant action space under the reference policy, and (ii) bounds the shared extrapolation bias so that the effective concentrability of the aggregated policy is controlled by the single-policy coefficient alone. This will directly address the concern that the worst-case operator could inadvertently revert to an all-policy quantity. revision: yes
Referee: The sample-complexity statement is presented without the full derivation, explicit error bounds, or the precise assumptions placed on the data-generating process and on the reference policy. Because the concentrability reduction is the load-bearing step for the main claim, the absence of these details prevents verification that no hidden multi-policy coverage terms reappear in the final bound.

Authors: We acknowledge that the sample-complexity guarantee is currently stated at a high level. In the revised manuscript we will supply the complete derivation, including all intermediate error bounds, the precise assumptions on the data-generating process (i.i.d. sampling from the preference distribution), and the conditions on the reference policy. The expanded proof will explicitly track concentrability terms at each step and confirm that only the single-policy coefficient appears in the final bound, with no residual multi-policy coverage factors. revision: yes

Circularity Check

0 steps flagged

No significant circularity in PEPO's single-policy concentrability claim

full rationale

The paper introduces PEPO as a new ensemble-based algorithm that trains DPO-style policies on disjoint data subsets and aggregates via worst-case agreement to induce pessimism. The central theoretical claim is a sample-complexity bound in the tabular setting that depends only on a single-policy concentrability coefficient. No quoted equation or derivation step reduces this bound by construction to a fitted parameter, a self-citation chain, or an input that already encodes the target guarantee. The construction and its analysis appear self-contained against external benchmarks, with the concentrability reduction arising from the explicit algorithmic design rather than from re-labeling or tautological fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract relies on standard offline RL concentrability assumptions and on the unstated claim that disjoint-subset training produces useful diversity; no explicit free parameters or invented entities are named.

axioms (1)

domain assumption Disjoint data subsets yield policies whose worst-case aggregation preserves single-policy concentrability.
This premise is required for the sample-complexity claim to hold but is not derived in the abstract.

pith-pipeline@v0.9.0 · 5668 in / 1247 out tokens · 29396 ms · 2026-05-21T13:02:08.817489+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

PEPO achieves sample complexity guarantees depending only on a single-policy concentrability coefficient... avoiding the all-policy concentrability which affects... DPO
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

pessimism via an ensemble of preference-optimized policies trained on disjoint data subsets and then aggregates them through a worst case construction

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 5 internal anchors

[1]

Design considerations in offline preference-based rl

Alekh Agarwal, Christoph Dann, and Teodor V Marinov. Design considerations in offline preference-based rl. arXiv preprint arXiv:2502.06861,

work page arXiv
[2]

Xrpo: Pushing the limits of grpo with targeted exploration and exploitation.arXiv preprint arXiv:2510.06672,

Udbhav Bamba, Minghao Fang, Yifan Yu, Haizhong Zheng, and Fan Lai. Xrpo: Pushing the limits of grpo with targeted exploration and exploitation.arXiv preprint arXiv:2510.06672,

work page arXiv
[3]

Value-incentivized preference optimization: A unified approach to online and offline rlhf.arXiv preprint arXiv:2405.19320,

Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, and Bo Dai. Value-incentivized preference optimization: A unified approach to online and offline rlhf.arXiv preprint arXiv:2405.19320,

work page arXiv
[4]

On extending direct preference optimization to accommodate ties.arXiv preprint arXiv:2409.17431,

Jinghong Chen, Guangyu Yang, Weizhe Lin, Jingbiao Mei, and Bill Byrne. On extending direct preference optimization to accommodate ties.arXiv preprint arXiv:2409.17431,

work page arXiv
[5]

AvoidingO(eRmax)scaling in rlhf through preference-based exploration.arXiv preprint arXiv:2502.00666,

Mingyu Chen, Yiding Chen, Wen Sun, and Xuezhou Zhang. AvoidingO(eRmax)scaling in rlhf through preference-based exploration.arXiv preprint arXiv:2502.00666,

work page arXiv
[6]

UCB Exploration via Q-Ensembles

Richard Y Chen, Szymon Sidor, Pieter Abbeel, and John Schulman. Ucb exploration via q-ensembles.arXiv preprint arXiv:1706.01502,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Reward model ensembles help mitigate overoptimization.arXiv preprint arXiv:2310.02743,

Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward model ensembles help mitigate overoptimization.arXiv preprint arXiv:2310.02743,

work page arXiv
[8]

Active preference optimiza- tion for sample efficient rlhf.arXiv preprint arXiv:2402.10500,

Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, and Sayak Ray Chowdhury. Active preference optimiza- tion for sample efficient rlhf.arXiv preprint arXiv:2402.10500,

work page arXiv
[9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

URLhttps://arxiv.org/abs/2501.12948. Fei Ding, Baiqiao Wang, Zijian Zeng, and Youwei Wang. Multi-layer grpo: Enhancing reasoning and self-correction in large language models.arXiv preprint arXiv:2506.04746,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Robust preference optimization through reward model distillation.arXiv preprint arXiv:2405.19316,

Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, and Jonathan Berant. Robust preference optimization through reward model distillation.arXiv preprint arXiv:2405.19316,

work page arXiv
[11]

Learn your reference model for real good alignment

13 Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, and Daniil Gavrilov. Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656,

work page arXiv
[12]

Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, and Saravan Rajmohan

URLhttps://arxiv.org/abs/2402.04792. Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, and Saravan Rajmohan. Multi-preference optimization: Generalizing dpo via set-level contrasts.arXiv preprint arXiv:2412.04628,

work page arXiv
[13]

Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D Lee, Wen Sun, Akshay Krishnamurthy, and Dylan J Foster

URLhttps://openreview.net/forum?id=nZeVKeeFYf9. Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D Lee, Wen Sun, Akshay Krishnamurthy, and Dylan J Foster. Correcting the mythos of kl-regularization: Direct alignment without overoptimization via chi- squared preference optimization.arXiv preprint arXiv:2407.13399,

work page arXiv
[14]

Provable and practical: Efficient exploration in reinforcement learning via langevin monte carlo.arXiv preprint arXiv:2305.18246,

Haque Ishfaq, Qingfeng Lan, Pan Xu, A Rupam Mahmood, Doina Precup, Anima Anandkumar, and Kamyar Azizzadenesheli. Provable and practical: Efficient exploration in reinforcement learning via langevin monte carlo.arXiv preprint arXiv:2305.18246,

work page arXiv
[15]

More efficient randomized exploration for reinforcement learning via approximate sampling.arXiv preprint arXiv:2406.12241,

Haque Ishfaq, Yixin Tan, Yu Yang, Qingfeng Lan, Jianfeng Lu, A Rupam Mahmood, Doina Precup, and Pan Xu. More efficient randomized exploration for reinforcement learning via approximate sampling.arXiv preprint arXiv:2406.12241,

work page arXiv
[16]

Langevin soft actor-critic: Efficient exploration through uncertainty-driven critic learning.arXiv preprint arXiv:2501.17827,

Haque Ishfaq, Guangyuan Wang, Sami Nur Islam, and Doina Precup. Langevin soft actor-critic: Efficient exploration through uncertainty-driven critic learning.arXiv preprint arXiv:2501.17827,

work page arXiv
[17]

Self-play with adversarial critic: Provable and scalable offline alignment for language models.arXiv preprint arXiv:2406.04274,

Xiang Ji, Sanjeev Kulkarni, Mengdi Wang, and Tengyang Xie. Self-play with adversarial critic: Provable and scalable offline alignment for language models.arXiv preprint arXiv:2406.04274,

work page arXiv
[18]

Hashimoto

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models.https: //github.com/tatsu-lab/alpaca_eval, 5 2023a. Zihao Li, Zhuoran Yang, and Mengdi Wang. Reinforcement learning with human feedback: Learning dynamic choices ...

work page arXiv
[19]

Understanding R1-Zero-Like Training: A Critical Perspective

URLhttps://arxiv. org/abs/2503.20783. Simon Matrenok, Skander Moalla, and Caglar Gulcehre. Quantile reward policy optimization: Alignment with pointwise regression and exact partition functions.arXiv preprint arXiv:2507.08068,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Optimistically optimistic exploration for provably efficient infinite-horizon reinforcement and imitation learning.arXiv preprint arXiv:2502.13900,

14 Antoine Moulin, Gergely Neu, and Luca Viano. Optimistically optimistic exploration for provably efficient infinite-horizon reinforcement and imitation learning.arXiv preprint arXiv:2502.13900,

work page arXiv
[21]

Brendan O’Donoghue

URLhttps://arxiv.org/abs/2312.00886. Brendan O’Donoghue. Variational bayesian reinforcement learning with regret bounds.Advances in Neural Information Processing Systems, 34:28208–28221,

work page arXiv
[22]

Efficient exploration via epistemic-risk-seeking policy optimization.arXiv preprint arXiv:2302.09339,

Brendan O’Donoghue. Efficient exploration via epistemic-risk-seeking policy optimization.arXiv preprint arXiv:2302.09339,

work page arXiv
[23]

Disentangling length from quality in direct preference optimization

Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. InFindings of the Association for Computational Linguistics: ACL 2024, pages 4998–5017. Association for Computational Linguistics,

work page 2024
[24]

Andreas Schlaginhaufen, Reda Ouhamma, and Maryam Kamgarpour

URLhttps://proceedings.mlr.press/v119/rosenberg20a.html. Andreas Schlaginhaufen, Reda Ouhamma, and Maryam Kamgarpour. Efficient preference-based reinforcement learning: Randomized exploration meets experimental design.arXiv preprint arXiv:2506.09508,

work page arXiv
[25]

Online apprenticeship learning.arXiv:2102.06924,

Lior Shani, Tom Zahavy, and Shie Mannor. Online apprenticeship learning.arXiv:2102.06924,

work page arXiv
[26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

URLhttps://arxiv.org/abs/2402.03300. YudaSong, GokulSwamy, AartiSingh, JBagnell, andWenSun. Theimportanceofonlinedata: Understanding preference fine-tuning via coverage.Advances in Neural Information Processing Systems, 37:12243–12270,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Robust preference optimization via dynamic target margins.arXiv preprint arXiv:2506.03690,

Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, and Xiang Wang. Robust preference optimization via dynamic target margins.arXiv preprint arXiv:2506.03690,

work page arXiv
[28]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288,

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Representation-based exploration for language models: From test-time to post-training.arXiv preprint arXiv:2510.11686,

Jens Tuyls, Dylan J Foster, Akshay Krishnamurthy, and Jordan T Ash. Representation-based exploration for language models: From test-time to post-training.arXiv preprint arXiv:2510.11686,

work page arXiv
[30]

Luca Viano, Ruida Zhou, Yifan Sun, Mahdi Namazifar, Volkan Cevher, Shoham Sabach, and Mohammad Ghavamzadeh

URL https://openreview.net/forum?id=DChQpB4AJy. Luca Viano, Ruida Zhou, Yifan Sun, Mahdi Namazifar, Volkan Cevher, Shoham Sabach, and Mohammad Ghavamzadeh. Direct preference optimization with rating information: Practical algorithms and provable gains.arXiv preprint arXiv:2602.00603,

work page arXiv
[31]

Il-soar: Imitation learning with soft optimistic actor critic

Stefano Viel, Luca Viano, and Volkan Cevher. Il-soar: Imitation learning with soft optimistic actor critic. arXiv preprint arXiv:2502.19859,

work page arXiv
[32]

α-dpo: Adaptive reward margin is what direct preference optimization needs.arXiv preprint arXiv:2410.10148,

Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. α-dpo: Adaptive reward margin is what direct preference optimization needs.arXiv preprint arXiv:2410.10148,

work page arXiv
[33]

Exploratory preference optimization: Harnessing implicit q*-approximation for sample-efficient rlhf.arXiv preprint arXiv:2405.21046,

Tengyang Xie, Dylan J Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, and Alexander Rakhlin. Exploratory preference optimization: Harnessing implicit q*-approximation for sample-efficient rlhf.arXiv preprint arXiv:2405.21046,

work page arXiv
[34]

Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719,

Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weiling Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719,

work page arXiv
[35]

To believe or not to believe your llm.arXiv preprint arXiv:2406.02543,

Yasin Abbasi Yadkori, Ilja Kuzborskij, András György, and Csaba Szepesvári. To believe or not to believe your llm.arXiv preprint arXiv:2406.02543,

work page arXiv
[36]

Trajectory bellman residual minimization: A simple value-based method for llm reasoning.arXiv preprint arXiv:2505.15311,

Yurun Yuan, Fan Chen, Zeyu Jia, Alexander Rakhlin, and Tengyang Xie. Trajectory bellman residual minimization: A simple value-based method for llm reasoning.arXiv preprint arXiv:2505.15311,

work page arXiv
[37]

Token-level direct preference optimization

Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, and Jun Wang. Token-level direct preference optimization.arXiv preprint arXiv:2404.11999,

work page arXiv
[38]

Provable offline preference-based reinforcement learning.arXiv preprint arXiv:2305.14816,

Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D Lee, and Wen Sun. Provable offline preference-based reinforcement learning.arXiv preprint arXiv:2305.14816,

work page arXiv
[39]

In this case, we can not derive DPO-like algorithms but we compute the maximum likelihood estimator ˆrfor the reward function from the preference dataset and then consider three approaches: (i) RL outputs the action with highest estimated reward, (ii) a method that we nameχ2RL which outputs πout = argmax π∈Π ⟨π,ˆr⟩ −γDχ2(π, πref)where we set γ = 40and (ii...

work page 2000
[40]

The rejection sampling rule with the minimum in the numerator as definined in Algorithm 2 can be over conservative in practice and leads to a very high rejection rate

A.3 Other worst case aggregation rules for the ensemble. The rejection sampling rule with the minimum in the numerator as definined in Algorithm 2 can be over conservative in practice and leads to a very high rejection rate. One possible alternative to reduce the generation time of the model is to consider this alternative numerator, fout(x, a) = "PL ℓ=1 ...

work page 2025
[41]

binarizing

A.4 On the token level implementation ofPEPO As mentioned in the main text, we also introduce a version forPEPO which implements the pessimistic aggregation rule given by Theorem 3.2 at the token level rather than the trajectory level. This approach is faster at generation time because it does not reject generations, while Algorithm 2 does. In practice, t...

work page 2022
[42]

It is easy to notice that for any value along the horizontal axis the two values sum up to1

The standard sigmoid is plotted in blue and σ(−x) = 1 −σ (x)is plotted with the dotted blue line. It is easy to notice that for any value along the horizontal axis the two values sum up to1. In addition, we show with a solid red line the sigmoid shifted on the right, i.e.σ(x−γ )by some margin γ and in dotted red the reflected version, i.e.σ(−x + γ), which...

work page 2023
[43]

Achieving those two improvements simultaneously, i.e

In other words, Table 3 shows thatPEPO, DPO+SFT andχ2PO all applies beyond the easiest case of tabular setting with knownπdata butalong different axes. Achieving those two improvements simultaneously, i.e. designing an approach which applies for general policy classes without knowledge about πdata, is in our opinion the most important open question rising...

work page 2024
[44]

Examples of the first kind are the emerging family of algorithms for reasoning such as GRPO Shao et al

Unknownπ data PEPO(Ours)Open Question Optimism via EnsembleIn the context of LLM post-training there are applications in which at training time it is possible to generate sentences from the model which is currently learning and getting feedback for this response under the form of a reward signal or a preference bit. Examples of the first kind are the emer...

work page 2024
[45]

and Nash Learning from Human Feedback Munos et al. [2024]. While promising at an empirical level, they all miss an exploration mechanism that should be included to ensure that the learning model queries the reward or preference oracle for response which are uncertain, i.e. for which new information should really be acquired. Such lack of exploration has b...

work page 2024
[46]

∃x, a∈ X × As.t.−Pessimism(x, a)≥ − X b πdata(b|x) h (λ(x, a, b))2eRmax +λ(x, a, b)e Rmax/2 i# ≤ |X | |A|e −2L/7 Therefore, choosingL= 7 2 log |X ||A| δ ensures that P

The condition1 −p ℓ tie(x, a, b) ≤ N ℓ(x,a,b) N ℓ(x,a,b)+2 holds true setting the tie weightsλℓ(x, a, b)to a large enough value. In particular, we requireλℓ(x, a, b)to satisfy for allx, a, b∈ X × A × Athat N ℓ(x, a, b) N ℓ(x, a, b) + 2 ≥ exp β∆log ˜πℓ/π(x, b, a)/2 + exp β∆log ˜πℓ/π(x, a, b)/2 exp β∆log ˜πℓ/π(x, b, a)/2 + exp β∆log ˜πℓ/π(x, a, b)/2 +λ ℓ(x,...

work page 2025
[47]

Finally, to have the result holding for all x, a∈ X × A , ˜N(x, a,·) : A → [N]and k∈ [N], we conclude via a union bound. Then, choosingk = N(x, a) and ˜N(x, a, b)as the minimum count observed across the ensemble denoted asNmin(x, a, b), we have that X b πdata(b|x)pℓ tie(x, a)≤ 1 N(x, a) +γ NX n=1 4eRmax/21{Xn,A+ n =x,a} 2eRmax/2 +N min(x, a, A−n ) + 4|A|l...

work page 2020
[48]

Finally, the proof is concluded by invoking Theorem F.7 which gives that the Lipschitz constant ofσ−1 pess in the interval[−Rmax, Rmax]is of ordere Rmax, i.e.L=O eRmax

−1 + 4|X |log(|X |LN 3/δ) N ≤ 6|X |log(|X |LN 3/δ) N ,(26) therefore, plugging in (26) into (25), it holds that with probability1−δ X x ν0(x) X a πdata(a|x) X b πdata(b|x) ∆r⋆(x, a, b)− b∆(x, a, b) !2 ≤ eO L2R2 max |X | |A|2 Llog(1/δ) N ! . Finally, the proof is concluded by invoking Theorem F.7 which gives that the Lipschitz constant ofσ−1 pess in the in...

work page 2020
[49]

Combining these estimates, we obtain |f ′(y)| ≤2 " λ+ λ2+6 2a 2a + 1 a # = λ+ 2 a + λ2 + 6 2a2

Therefore, λ+ |D′(y)| 2 √ D(y) λy+ p D(y) ≤ λ+ λ2+6 2a 2a , 1 1−y ≤ 1 a . Combining these estimates, we obtain |f ′(y)| ≤2 " λ+ λ2+6 2a 2a + 1 a # = λ+ 2 a + λ2 + 6 2a2 . Sincea −1 =e 2Rmax + 1, the stated bound follows. 42 F.1 Importance Sampling Routine: Proof of Theorem 3.3 Proof.Let us denote fout(x, a) = min ℓ∈[L] ˜πℓ(a|x) exp(−Bptie(x, a)/β) At each...

work page 2024

[1] [1]

Design considerations in offline preference-based rl

Alekh Agarwal, Christoph Dann, and Teodor V Marinov. Design considerations in offline preference-based rl. arXiv preprint arXiv:2502.06861,

work page arXiv

[2] [2]

Xrpo: Pushing the limits of grpo with targeted exploration and exploitation.arXiv preprint arXiv:2510.06672,

Udbhav Bamba, Minghao Fang, Yifan Yu, Haizhong Zheng, and Fan Lai. Xrpo: Pushing the limits of grpo with targeted exploration and exploitation.arXiv preprint arXiv:2510.06672,

work page arXiv

[3] [3]

Value-incentivized preference optimization: A unified approach to online and offline rlhf.arXiv preprint arXiv:2405.19320,

Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, and Bo Dai. Value-incentivized preference optimization: A unified approach to online and offline rlhf.arXiv preprint arXiv:2405.19320,

work page arXiv

[4] [4]

On extending direct preference optimization to accommodate ties.arXiv preprint arXiv:2409.17431,

Jinghong Chen, Guangyu Yang, Weizhe Lin, Jingbiao Mei, and Bill Byrne. On extending direct preference optimization to accommodate ties.arXiv preprint arXiv:2409.17431,

work page arXiv

[5] [5]

AvoidingO(eRmax)scaling in rlhf through preference-based exploration.arXiv preprint arXiv:2502.00666,

Mingyu Chen, Yiding Chen, Wen Sun, and Xuezhou Zhang. AvoidingO(eRmax)scaling in rlhf through preference-based exploration.arXiv preprint arXiv:2502.00666,

work page arXiv

[6] [6]

UCB Exploration via Q-Ensembles

Richard Y Chen, Szymon Sidor, Pieter Abbeel, and John Schulman. Ucb exploration via q-ensembles.arXiv preprint arXiv:1706.01502,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Reward model ensembles help mitigate overoptimization.arXiv preprint arXiv:2310.02743,

Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward model ensembles help mitigate overoptimization.arXiv preprint arXiv:2310.02743,

work page arXiv

[8] [8]

Active preference optimiza- tion for sample efficient rlhf.arXiv preprint arXiv:2402.10500,

Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, and Sayak Ray Chowdhury. Active preference optimiza- tion for sample efficient rlhf.arXiv preprint arXiv:2402.10500,

work page arXiv

[9] [9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

URLhttps://arxiv.org/abs/2501.12948. Fei Ding, Baiqiao Wang, Zijian Zeng, and Youwei Wang. Multi-layer grpo: Enhancing reasoning and self-correction in large language models.arXiv preprint arXiv:2506.04746,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Robust preference optimization through reward model distillation.arXiv preprint arXiv:2405.19316,

Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, and Jonathan Berant. Robust preference optimization through reward model distillation.arXiv preprint arXiv:2405.19316,

work page arXiv

[11] [11]

Learn your reference model for real good alignment

13 Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, and Daniil Gavrilov. Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656,

work page arXiv

[12] [12]

Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, and Saravan Rajmohan

URLhttps://arxiv.org/abs/2402.04792. Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, and Saravan Rajmohan. Multi-preference optimization: Generalizing dpo via set-level contrasts.arXiv preprint arXiv:2412.04628,

work page arXiv

[13] [13]

Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D Lee, Wen Sun, Akshay Krishnamurthy, and Dylan J Foster

URLhttps://openreview.net/forum?id=nZeVKeeFYf9. Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D Lee, Wen Sun, Akshay Krishnamurthy, and Dylan J Foster. Correcting the mythos of kl-regularization: Direct alignment without overoptimization via chi- squared preference optimization.arXiv preprint arXiv:2407.13399,

work page arXiv

[14] [14]

Provable and practical: Efficient exploration in reinforcement learning via langevin monte carlo.arXiv preprint arXiv:2305.18246,

Haque Ishfaq, Qingfeng Lan, Pan Xu, A Rupam Mahmood, Doina Precup, Anima Anandkumar, and Kamyar Azizzadenesheli. Provable and practical: Efficient exploration in reinforcement learning via langevin monte carlo.arXiv preprint arXiv:2305.18246,

work page arXiv

[15] [15]

More efficient randomized exploration for reinforcement learning via approximate sampling.arXiv preprint arXiv:2406.12241,

Haque Ishfaq, Yixin Tan, Yu Yang, Qingfeng Lan, Jianfeng Lu, A Rupam Mahmood, Doina Precup, and Pan Xu. More efficient randomized exploration for reinforcement learning via approximate sampling.arXiv preprint arXiv:2406.12241,

work page arXiv

[16] [16]

Langevin soft actor-critic: Efficient exploration through uncertainty-driven critic learning.arXiv preprint arXiv:2501.17827,

Haque Ishfaq, Guangyuan Wang, Sami Nur Islam, and Doina Precup. Langevin soft actor-critic: Efficient exploration through uncertainty-driven critic learning.arXiv preprint arXiv:2501.17827,

work page arXiv

[17] [17]

Self-play with adversarial critic: Provable and scalable offline alignment for language models.arXiv preprint arXiv:2406.04274,

Xiang Ji, Sanjeev Kulkarni, Mengdi Wang, and Tengyang Xie. Self-play with adversarial critic: Provable and scalable offline alignment for language models.arXiv preprint arXiv:2406.04274,

work page arXiv

[18] [18]

Hashimoto

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models.https: //github.com/tatsu-lab/alpaca_eval, 5 2023a. Zihao Li, Zhuoran Yang, and Mengdi Wang. Reinforcement learning with human feedback: Learning dynamic choices ...

work page arXiv

[19] [19]

Understanding R1-Zero-Like Training: A Critical Perspective

URLhttps://arxiv. org/abs/2503.20783. Simon Matrenok, Skander Moalla, and Caglar Gulcehre. Quantile reward policy optimization: Alignment with pointwise regression and exact partition functions.arXiv preprint arXiv:2507.08068,

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Optimistically optimistic exploration for provably efficient infinite-horizon reinforcement and imitation learning.arXiv preprint arXiv:2502.13900,

14 Antoine Moulin, Gergely Neu, and Luca Viano. Optimistically optimistic exploration for provably efficient infinite-horizon reinforcement and imitation learning.arXiv preprint arXiv:2502.13900,

work page arXiv

[21] [21]

Brendan O’Donoghue

URLhttps://arxiv.org/abs/2312.00886. Brendan O’Donoghue. Variational bayesian reinforcement learning with regret bounds.Advances in Neural Information Processing Systems, 34:28208–28221,

work page arXiv

[22] [22]

Efficient exploration via epistemic-risk-seeking policy optimization.arXiv preprint arXiv:2302.09339,

Brendan O’Donoghue. Efficient exploration via epistemic-risk-seeking policy optimization.arXiv preprint arXiv:2302.09339,

work page arXiv

[23] [23]

Disentangling length from quality in direct preference optimization

Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. InFindings of the Association for Computational Linguistics: ACL 2024, pages 4998–5017. Association for Computational Linguistics,

work page 2024

[24] [24]

Andreas Schlaginhaufen, Reda Ouhamma, and Maryam Kamgarpour

URLhttps://proceedings.mlr.press/v119/rosenberg20a.html. Andreas Schlaginhaufen, Reda Ouhamma, and Maryam Kamgarpour. Efficient preference-based reinforcement learning: Randomized exploration meets experimental design.arXiv preprint arXiv:2506.09508,

work page arXiv

[25] [25]

Online apprenticeship learning.arXiv:2102.06924,

Lior Shani, Tom Zahavy, and Shie Mannor. Online apprenticeship learning.arXiv:2102.06924,

work page arXiv

[26] [26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

URLhttps://arxiv.org/abs/2402.03300. YudaSong, GokulSwamy, AartiSingh, JBagnell, andWenSun. Theimportanceofonlinedata: Understanding preference fine-tuning via coverage.Advances in Neural Information Processing Systems, 37:12243–12270,

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

Robust preference optimization via dynamic target margins.arXiv preprint arXiv:2506.03690,

Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, and Xiang Wang. Robust preference optimization via dynamic target margins.arXiv preprint arXiv:2506.03690,

work page arXiv

[28] [28]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288,

work page internal anchor Pith review Pith/arXiv arXiv

[29] [29]

Representation-based exploration for language models: From test-time to post-training.arXiv preprint arXiv:2510.11686,

Jens Tuyls, Dylan J Foster, Akshay Krishnamurthy, and Jordan T Ash. Representation-based exploration for language models: From test-time to post-training.arXiv preprint arXiv:2510.11686,

work page arXiv

[30] [30]

Luca Viano, Ruida Zhou, Yifan Sun, Mahdi Namazifar, Volkan Cevher, Shoham Sabach, and Mohammad Ghavamzadeh

URL https://openreview.net/forum?id=DChQpB4AJy. Luca Viano, Ruida Zhou, Yifan Sun, Mahdi Namazifar, Volkan Cevher, Shoham Sabach, and Mohammad Ghavamzadeh. Direct preference optimization with rating information: Practical algorithms and provable gains.arXiv preprint arXiv:2602.00603,

work page arXiv

[31] [31]

Il-soar: Imitation learning with soft optimistic actor critic

Stefano Viel, Luca Viano, and Volkan Cevher. Il-soar: Imitation learning with soft optimistic actor critic. arXiv preprint arXiv:2502.19859,

work page arXiv

[32] [32]

α-dpo: Adaptive reward margin is what direct preference optimization needs.arXiv preprint arXiv:2410.10148,

Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. α-dpo: Adaptive reward margin is what direct preference optimization needs.arXiv preprint arXiv:2410.10148,

work page arXiv

[33] [33]

Exploratory preference optimization: Harnessing implicit q*-approximation for sample-efficient rlhf.arXiv preprint arXiv:2405.21046,

Tengyang Xie, Dylan J Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, and Alexander Rakhlin. Exploratory preference optimization: Harnessing implicit q*-approximation for sample-efficient rlhf.arXiv preprint arXiv:2405.21046,

work page arXiv

[34] [34]

Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719,

Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weiling Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719,

work page arXiv

[35] [35]

To believe or not to believe your llm.arXiv preprint arXiv:2406.02543,

Yasin Abbasi Yadkori, Ilja Kuzborskij, András György, and Csaba Szepesvári. To believe or not to believe your llm.arXiv preprint arXiv:2406.02543,

work page arXiv

[36] [36]

Trajectory bellman residual minimization: A simple value-based method for llm reasoning.arXiv preprint arXiv:2505.15311,

Yurun Yuan, Fan Chen, Zeyu Jia, Alexander Rakhlin, and Tengyang Xie. Trajectory bellman residual minimization: A simple value-based method for llm reasoning.arXiv preprint arXiv:2505.15311,

work page arXiv

[37] [37]

Token-level direct preference optimization

Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, and Jun Wang. Token-level direct preference optimization.arXiv preprint arXiv:2404.11999,

work page arXiv

[38] [38]

Provable offline preference-based reinforcement learning.arXiv preprint arXiv:2305.14816,

Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D Lee, and Wen Sun. Provable offline preference-based reinforcement learning.arXiv preprint arXiv:2305.14816,

work page arXiv

[39] [39]

In this case, we can not derive DPO-like algorithms but we compute the maximum likelihood estimator ˆrfor the reward function from the preference dataset and then consider three approaches: (i) RL outputs the action with highest estimated reward, (ii) a method that we nameχ2RL which outputs πout = argmax π∈Π ⟨π,ˆr⟩ −γDχ2(π, πref)where we set γ = 40and (ii...

work page 2000

[40] [40]

The rejection sampling rule with the minimum in the numerator as definined in Algorithm 2 can be over conservative in practice and leads to a very high rejection rate

A.3 Other worst case aggregation rules for the ensemble. The rejection sampling rule with the minimum in the numerator as definined in Algorithm 2 can be over conservative in practice and leads to a very high rejection rate. One possible alternative to reduce the generation time of the model is to consider this alternative numerator, fout(x, a) = "PL ℓ=1 ...

work page 2025

[41] [41]

binarizing

A.4 On the token level implementation ofPEPO As mentioned in the main text, we also introduce a version forPEPO which implements the pessimistic aggregation rule given by Theorem 3.2 at the token level rather than the trajectory level. This approach is faster at generation time because it does not reject generations, while Algorithm 2 does. In practice, t...

work page 2022

[42] [42]

It is easy to notice that for any value along the horizontal axis the two values sum up to1

The standard sigmoid is plotted in blue and σ(−x) = 1 −σ (x)is plotted with the dotted blue line. It is easy to notice that for any value along the horizontal axis the two values sum up to1. In addition, we show with a solid red line the sigmoid shifted on the right, i.e.σ(x−γ )by some margin γ and in dotted red the reflected version, i.e.σ(−x + γ), which...

work page 2023

[43] [43]

Achieving those two improvements simultaneously, i.e

In other words, Table 3 shows thatPEPO, DPO+SFT andχ2PO all applies beyond the easiest case of tabular setting with knownπdata butalong different axes. Achieving those two improvements simultaneously, i.e. designing an approach which applies for general policy classes without knowledge about πdata, is in our opinion the most important open question rising...

work page 2024

[44] [44]

Examples of the first kind are the emerging family of algorithms for reasoning such as GRPO Shao et al

Unknownπ data PEPO(Ours)Open Question Optimism via EnsembleIn the context of LLM post-training there are applications in which at training time it is possible to generate sentences from the model which is currently learning and getting feedback for this response under the form of a reward signal or a preference bit. Examples of the first kind are the emer...

work page 2024

[45] [45]

and Nash Learning from Human Feedback Munos et al. [2024]. While promising at an empirical level, they all miss an exploration mechanism that should be included to ensure that the learning model queries the reward or preference oracle for response which are uncertain, i.e. for which new information should really be acquired. Such lack of exploration has b...

work page 2024

[46] [46]

∃x, a∈ X × As.t.−Pessimism(x, a)≥ − X b πdata(b|x) h (λ(x, a, b))2eRmax +λ(x, a, b)e Rmax/2 i# ≤ |X | |A|e −2L/7 Therefore, choosingL= 7 2 log |X ||A| δ ensures that P

The condition1 −p ℓ tie(x, a, b) ≤ N ℓ(x,a,b) N ℓ(x,a,b)+2 holds true setting the tie weightsλℓ(x, a, b)to a large enough value. In particular, we requireλℓ(x, a, b)to satisfy for allx, a, b∈ X × A × Athat N ℓ(x, a, b) N ℓ(x, a, b) + 2 ≥ exp β∆log ˜πℓ/π(x, b, a)/2 + exp β∆log ˜πℓ/π(x, a, b)/2 exp β∆log ˜πℓ/π(x, b, a)/2 + exp β∆log ˜πℓ/π(x, a, b)/2 +λ ℓ(x,...

work page 2025

[47] [47]

Finally, to have the result holding for all x, a∈ X × A , ˜N(x, a,·) : A → [N]and k∈ [N], we conclude via a union bound. Then, choosingk = N(x, a) and ˜N(x, a, b)as the minimum count observed across the ensemble denoted asNmin(x, a, b), we have that X b πdata(b|x)pℓ tie(x, a)≤ 1 N(x, a) +γ NX n=1 4eRmax/21{Xn,A+ n =x,a} 2eRmax/2 +N min(x, a, A−n ) + 4|A|l...

work page 2020

[48] [48]

Finally, the proof is concluded by invoking Theorem F.7 which gives that the Lipschitz constant ofσ−1 pess in the interval[−Rmax, Rmax]is of ordere Rmax, i.e.L=O eRmax

−1 + 4|X |log(|X |LN 3/δ) N ≤ 6|X |log(|X |LN 3/δ) N ,(26) therefore, plugging in (26) into (25), it holds that with probability1−δ X x ν0(x) X a πdata(a|x) X b πdata(b|x) ∆r⋆(x, a, b)− b∆(x, a, b) !2 ≤ eO L2R2 max |X | |A|2 Llog(1/δ) N ! . Finally, the proof is concluded by invoking Theorem F.7 which gives that the Lipschitz constant ofσ−1 pess in the in...

work page 2020

[49] [49]

Combining these estimates, we obtain |f ′(y)| ≤2 " λ+ λ2+6 2a 2a + 1 a # = λ+ 2 a + λ2 + 6 2a2

Therefore, λ+ |D′(y)| 2 √ D(y) λy+ p D(y) ≤ λ+ λ2+6 2a 2a , 1 1−y ≤ 1 a . Combining these estimates, we obtain |f ′(y)| ≤2 " λ+ λ2+6 2a 2a + 1 a # = λ+ 2 a + λ2 + 6 2a2 . Sincea −1 =e 2Rmax + 1, the stated bound follows. 42 F.1 Importance Sampling Routine: Proof of Theorem 3.3 Proof.Let us denote fout(x, a) = min ℓ∈[L] ˜πℓ(a|x) exp(−Bptie(x, a)/β) At each...

work page 2024