pith. sign in

arxiv: 2602.06239 · v3 · pith:RITIFKOXnew · submitted 2026-02-05 · 💻 cs.LG

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

Pith reviewed 2026-05-21 13:02 UTC · model grok-4.3

classification 💻 cs.LG
keywords preference optimizationdirect preference optimizationover-optimizationensemble methodsconcentrability coefficienttabular settingpessimistic aggregation
0
0 comments X

The pith

PEPO uses pessimistic ensembles on disjoint data subsets to avoid over-optimization in DPO by depending only on single-policy concentrability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PEPO as a single-step DPO-like method that mitigates over-optimization in preference learning without knowing the data distribution or training an explicit reward model. It trains an ensemble of policies on disjoint data subsets and aggregates them with a worst-case construction that favors agreement across the models to enforce pessimism. In the tabular setting this yields sample complexity bounds that depend only on a single-policy concentrability coefficient, in contrast to the all-policy concentrability that appears in guarantees for standard DPO. A sympathetic reader would care because over-optimization is a known practical failure mode that makes learned policies degrade outside the training data, and PEPO keeps the training simplicity of DPO while supplying a theoretical fix.

Core claim

PEPO achieves sample complexity guarantees depending only on a single-policy concentrability coefficient by training an ensemble of preference-optimized policies on disjoint data subsets and aggregating them through a worst-case construction that favors agreement across models, thereby avoiding the all-policy concentrability which affects the guarantees of algorithms prone to over-optimization such as DPO.

What carries the argument

The pessimistic ensemble aggregation, which constructs a final policy from the worst-case agreement among models trained on disjoint data subsets to enforce pessimism without an explicit reward model.

If this is right

  • Sample complexity in the tabular setting depends only on single-policy concentrability rather than all-policy concentrability.
  • The method retains the single-step training simplicity of DPO while adding theoretical protection against over-optimization.
  • No knowledge of the data-generating distribution is required for the guarantees.
  • Practical performance matches or exceeds DPO while remaining computationally comparable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same disjoint-subset construction might reduce concentrability requirements in non-tabular or continuous settings if policy diversity can be maintained.
  • PEPO could be combined with other regularization techniques to further improve robustness in large language model alignment.
  • Empirical checks on real preference datasets would test whether the required diversity across disjoint subsets holds in practice.

Load-bearing premise

Training on disjoint data subsets produces sufficiently diverse policies whose worst-case aggregation still yields a performant and safe policy.

What would settle it

A tabular MDP in which the worst-case aggregation of the subset-trained policies produces a low-performing or unsafe policy even though each individual policy is good on its own subset.

Figures

Figures reproduced from arXiv: 2602.06239 by Adam Barla, Emanuele Nevali, Luca Viano, Volkan Cevher.

Figure 1
Figure 1. Figure 1: Win rate (%) against the initial model on AlpacaEval across training epochs. Shaded regions [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Experiment in 3 arms bandit setting for the situation of known and unknown πdata. experimental setting where the preference dataset is generated directly from πref (i.e., πdata = πref), as this would restore the conditions under which χ 2PO and SFT+DPO are theoretically sound. Known vs unknown πdata in a toy environment. For our toy experiment, we consider an empty prompt and 3 possible actions. The prefer… view at source ↗
Figure 3
Figure 3. Figure 3: Ablation for L in a bandit setting. A.2 Details for the experiments in the controlled setting For Figure 2a, we used πdata = πref = [0.04, 0.93, 0.03]. That is, the largest mass is on the central action, which is the optimal one. This makes the optimal action well-covered, i.e. small C ⋆ , therefore χ 2PO and PEPO can work reliably. At the same time, the all-policy concentrability is large because the two … view at source ↗
Figure 4
Figure 4. Figure 4: Result in the controlled setting without regularization, i.e. with [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Illustration of the tie probability (missing mass) resulting from a right shifting of the sigmoid [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗
read the original abstract

We introduce PEPO (Pessimistic Ensemble based Preference Optimization), a single-step Direct Preference Optimization (DPO)-like algorithm to mitigate the well-known over-optimization issue in preference learning without requiring the knowledge of the data-generating distribution or learning an explicit reward model. PEPO achieves pessimism via an ensemble of preference-optimized policies trained on disjoint data subsets and then aggregates them through a worst case construction that favors the agreement across models. In the tabular setting, PEPO achieves sample complexity guarantees depending only on a single-policy concentrability coefficient, thus avoiding the all-policy concentrability which affects the guarantees of algorithms prone to over-optimization, such as DPO. The theoretical findings are corroborated by a convincing practical performance, while retaining the simplicity and the practicality of DPO-style training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PEPO (Pessimistic Ensemble based Preference Optimization), a DPO-style algorithm that trains an ensemble of policies on disjoint data subsets and aggregates them via a worst-case agreement construction to induce pessimism. The central theoretical claim is that, in the tabular setting, this yields sample-complexity guarantees depending only on a single-policy concentrability coefficient, thereby avoiding the all-policy concentrability that appears in analyses of standard DPO and other over-optimization-prone methods. The approach requires neither knowledge of the data-generating distribution nor an explicit reward model, and the authors report competitive empirical performance.

Significance. If the concentrability reduction is rigorously established, the result would be significant for preference optimization and RLHF: it supplies a concrete algorithmic mechanism (disjoint-subset ensembles plus worst-case aggregation) that converts a multi-policy coverage requirement into a single-policy one while preserving the simplicity of DPO-style training. Such a guarantee would be a useful reference point for designing safe preference learners that do not rely on explicit reward modeling or distribution knowledge.

major comments (2)
  1. The reduction from all-policy to single-policy concentrability rests on the claim that worst-case aggregation over policies trained on disjoint subsets produces a policy whose effective coverage is controlled by one reference policy alone. The manuscript does not supply an explicit argument or assumption showing that disjoint subsets necessarily generate sufficient policy diversity, nor does it bound the extrapolation bias that could be shared across independently optimized policies. Without such justification, the worst-case operator may still select an over-optimized action whose concentrability coefficient is closer to the all-policy quantity than to the single-policy quantity (see the tabular analysis and the definition of the aggregated policy).
  2. The sample-complexity statement is presented without the full derivation, explicit error bounds, or the precise assumptions placed on the data-generating process and on the reference policy. Because the concentrability reduction is the load-bearing step for the main claim, the absence of these details prevents verification that no hidden multi-policy coverage terms reappear in the final bound.
minor comments (2)
  1. Notation for the worst-case aggregation operator and for the single-policy concentrability coefficient should be introduced earlier and used consistently throughout the theoretical section.
  2. The empirical section would benefit from an ablation that isolates the contribution of the disjoint-subset training versus the worst-case aggregation step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address the major comments point by point below and will revise the manuscript to strengthen the theoretical presentation.

read point-by-point responses
  1. Referee: The reduction from all-policy to single-policy concentrability rests on the claim that worst-case aggregation over policies trained on disjoint subsets produces a policy whose effective coverage is controlled by one reference policy alone. The manuscript does not supply an explicit argument or assumption showing that disjoint subsets necessarily generate sufficient policy diversity, nor does it bound the extrapolation bias that could be shared across independently optimized policies. Without such justification, the worst-case operator may still select an over-optimized action whose concentrability coefficient is closer to the all-policy quantity than to the single-policy quantity (see the tabular analysis and the definition of the aggregated policy).

    Authors: We appreciate the referee highlighting the need for greater explicitness here. The current tabular analysis and definition of the aggregated policy implicitly rely on the fact that disjoint subsets yield independently optimized policies with distinct extrapolation behaviors, and that the worst-case aggregation selects the most conservative action among them. However, we agree that a self-contained argument is warranted. In the revision we will add a dedicated lemma that (i) shows how independent optimization on disjoint subsets induces policy diversity sufficient to cover the relevant action space under the reference policy, and (ii) bounds the shared extrapolation bias so that the effective concentrability of the aggregated policy is controlled by the single-policy coefficient alone. This will directly address the concern that the worst-case operator could inadvertently revert to an all-policy quantity. revision: yes

  2. Referee: The sample-complexity statement is presented without the full derivation, explicit error bounds, or the precise assumptions placed on the data-generating process and on the reference policy. Because the concentrability reduction is the load-bearing step for the main claim, the absence of these details prevents verification that no hidden multi-policy coverage terms reappear in the final bound.

    Authors: We acknowledge that the sample-complexity guarantee is currently stated at a high level. In the revised manuscript we will supply the complete derivation, including all intermediate error bounds, the precise assumptions on the data-generating process (i.i.d. sampling from the preference distribution), and the conditions on the reference policy. The expanded proof will explicitly track concentrability terms at each step and confirm that only the single-policy coefficient appears in the final bound, with no residual multi-policy coverage factors. revision: yes

Circularity Check

0 steps flagged

No significant circularity in PEPO's single-policy concentrability claim

full rationale

The paper introduces PEPO as a new ensemble-based algorithm that trains DPO-style policies on disjoint data subsets and aggregates via worst-case agreement to induce pessimism. The central theoretical claim is a sample-complexity bound in the tabular setting that depends only on a single-policy concentrability coefficient. No quoted equation or derivation step reduces this bound by construction to a fitted parameter, a self-citation chain, or an input that already encodes the target guarantee. The construction and its analysis appear self-contained against external benchmarks, with the concentrability reduction arising from the explicit algorithmic design rather than from re-labeling or tautological fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract relies on standard offline RL concentrability assumptions and on the unstated claim that disjoint-subset training produces useful diversity; no explicit free parameters or invented entities are named.

axioms (1)
  • domain assumption Disjoint data subsets yield policies whose worst-case aggregation preserves single-policy concentrability.
    This premise is required for the sample-complexity claim to hold but is not derived in the abstract.

pith-pipeline@v0.9.0 · 5668 in / 1247 out tokens · 29396 ms · 2026-05-21T13:02:08.817489+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 5 internal anchors

  1. [1]

    Design considerations in offline preference-based rl

    Alekh Agarwal, Christoph Dann, and Teodor V Marinov. Design considerations in offline preference-based rl. arXiv preprint arXiv:2502.06861,

  2. [2]

    Xrpo: Pushing the limits of grpo with targeted exploration and exploitation.arXiv preprint arXiv:2510.06672,

    Udbhav Bamba, Minghao Fang, Yifan Yu, Haizhong Zheng, and Fan Lai. Xrpo: Pushing the limits of grpo with targeted exploration and exploitation.arXiv preprint arXiv:2510.06672,

  3. [3]

    Value-incentivized preference optimization: A unified approach to online and offline rlhf.arXiv preprint arXiv:2405.19320,

    Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, and Bo Dai. Value-incentivized preference optimization: A unified approach to online and offline rlhf.arXiv preprint arXiv:2405.19320,

  4. [4]

    On extending direct preference optimization to accommodate ties.arXiv preprint arXiv:2409.17431,

    Jinghong Chen, Guangyu Yang, Weizhe Lin, Jingbiao Mei, and Bill Byrne. On extending direct preference optimization to accommodate ties.arXiv preprint arXiv:2409.17431,

  5. [5]

    AvoidingO(eRmax)scaling in rlhf through preference-based exploration.arXiv preprint arXiv:2502.00666,

    Mingyu Chen, Yiding Chen, Wen Sun, and Xuezhou Zhang. AvoidingO(eRmax)scaling in rlhf through preference-based exploration.arXiv preprint arXiv:2502.00666,

  6. [6]

    UCB Exploration via Q-Ensembles

    Richard Y Chen, Szymon Sidor, Pieter Abbeel, and John Schulman. Ucb exploration via q-ensembles.arXiv preprint arXiv:1706.01502,

  7. [7]

    Reward model ensembles help mitigate overoptimization.arXiv preprint arXiv:2310.02743,

    Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward model ensembles help mitigate overoptimization.arXiv preprint arXiv:2310.02743,

  8. [8]

    Active preference optimiza- tion for sample efficient rlhf.arXiv preprint arXiv:2402.10500,

    Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, and Sayak Ray Chowdhury. Active preference optimiza- tion for sample efficient rlhf.arXiv preprint arXiv:2402.10500,

  9. [9]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    URLhttps://arxiv.org/abs/2501.12948. Fei Ding, Baiqiao Wang, Zijian Zeng, and Youwei Wang. Multi-layer grpo: Enhancing reasoning and self-correction in large language models.arXiv preprint arXiv:2506.04746,

  10. [10]

    Robust preference optimization through reward model distillation.arXiv preprint arXiv:2405.19316,

    Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, and Jonathan Berant. Robust preference optimization through reward model distillation.arXiv preprint arXiv:2405.19316,

  11. [11]

    Learn your reference model for real good alignment

    13 Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, and Daniil Gavrilov. Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656,

  12. [12]

    Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, and Saravan Rajmohan

    URLhttps://arxiv.org/abs/2402.04792. Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, and Saravan Rajmohan. Multi-preference optimization: Generalizing dpo via set-level contrasts.arXiv preprint arXiv:2412.04628,

  13. [13]

    Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D Lee, Wen Sun, Akshay Krishnamurthy, and Dylan J Foster

    URLhttps://openreview.net/forum?id=nZeVKeeFYf9. Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D Lee, Wen Sun, Akshay Krishnamurthy, and Dylan J Foster. Correcting the mythos of kl-regularization: Direct alignment without overoptimization via chi- squared preference optimization.arXiv preprint arXiv:2407.13399,

  14. [14]

    Provable and practical: Efficient exploration in reinforcement learning via langevin monte carlo.arXiv preprint arXiv:2305.18246,

    Haque Ishfaq, Qingfeng Lan, Pan Xu, A Rupam Mahmood, Doina Precup, Anima Anandkumar, and Kamyar Azizzadenesheli. Provable and practical: Efficient exploration in reinforcement learning via langevin monte carlo.arXiv preprint arXiv:2305.18246,

  15. [15]

    More efficient randomized exploration for reinforcement learning via approximate sampling.arXiv preprint arXiv:2406.12241,

    Haque Ishfaq, Yixin Tan, Yu Yang, Qingfeng Lan, Jianfeng Lu, A Rupam Mahmood, Doina Precup, and Pan Xu. More efficient randomized exploration for reinforcement learning via approximate sampling.arXiv preprint arXiv:2406.12241,

  16. [16]

    Langevin soft actor-critic: Efficient exploration through uncertainty-driven critic learning.arXiv preprint arXiv:2501.17827,

    Haque Ishfaq, Guangyuan Wang, Sami Nur Islam, and Doina Precup. Langevin soft actor-critic: Efficient exploration through uncertainty-driven critic learning.arXiv preprint arXiv:2501.17827,

  17. [17]

    Self-play with adversarial critic: Provable and scalable offline alignment for language models.arXiv preprint arXiv:2406.04274,

    Xiang Ji, Sanjeev Kulkarni, Mengdi Wang, and Tengyang Xie. Self-play with adversarial critic: Provable and scalable offline alignment for language models.arXiv preprint arXiv:2406.04274,

  18. [18]

    Hashimoto

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models.https: //github.com/tatsu-lab/alpaca_eval, 5 2023a. Zihao Li, Zhuoran Yang, and Mengdi Wang. Reinforcement learning with human feedback: Learning dynamic choices ...

  19. [19]

    Understanding R1-Zero-Like Training: A Critical Perspective

    URLhttps://arxiv. org/abs/2503.20783. Simon Matrenok, Skander Moalla, and Caglar Gulcehre. Quantile reward policy optimization: Alignment with pointwise regression and exact partition functions.arXiv preprint arXiv:2507.08068,

  20. [20]

    Optimistically optimistic exploration for provably efficient infinite-horizon reinforcement and imitation learning.arXiv preprint arXiv:2502.13900,

    14 Antoine Moulin, Gergely Neu, and Luca Viano. Optimistically optimistic exploration for provably efficient infinite-horizon reinforcement and imitation learning.arXiv preprint arXiv:2502.13900,

  21. [21]

    Brendan O’Donoghue

    URLhttps://arxiv.org/abs/2312.00886. Brendan O’Donoghue. Variational bayesian reinforcement learning with regret bounds.Advances in Neural Information Processing Systems, 34:28208–28221,

  22. [22]

    Efficient exploration via epistemic-risk-seeking policy optimization.arXiv preprint arXiv:2302.09339,

    Brendan O’Donoghue. Efficient exploration via epistemic-risk-seeking policy optimization.arXiv preprint arXiv:2302.09339,

  23. [23]

    Disentangling length from quality in direct preference optimization

    Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. InFindings of the Association for Computational Linguistics: ACL 2024, pages 4998–5017. Association for Computational Linguistics,

  24. [24]

    Andreas Schlaginhaufen, Reda Ouhamma, and Maryam Kamgarpour

    URLhttps://proceedings.mlr.press/v119/rosenberg20a.html. Andreas Schlaginhaufen, Reda Ouhamma, and Maryam Kamgarpour. Efficient preference-based reinforcement learning: Randomized exploration meets experimental design.arXiv preprint arXiv:2506.09508,

  25. [25]

    Online apprenticeship learning.arXiv:2102.06924,

    Lior Shani, Tom Zahavy, and Shie Mannor. Online apprenticeship learning.arXiv:2102.06924,

  26. [26]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    URLhttps://arxiv.org/abs/2402.03300. YudaSong, GokulSwamy, AartiSingh, JBagnell, andWenSun. Theimportanceofonlinedata: Understanding preference fine-tuning via coverage.Advances in Neural Information Processing Systems, 37:12243–12270,

  27. [27]

    Robust preference optimization via dynamic target margins.arXiv preprint arXiv:2506.03690,

    Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, and Xiang Wang. Robust preference optimization via dynamic target margins.arXiv preprint arXiv:2506.03690,

  28. [28]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288,

  29. [29]

    Representation-based exploration for language models: From test-time to post-training.arXiv preprint arXiv:2510.11686,

    Jens Tuyls, Dylan J Foster, Akshay Krishnamurthy, and Jordan T Ash. Representation-based exploration for language models: From test-time to post-training.arXiv preprint arXiv:2510.11686,

  30. [30]

    Luca Viano, Ruida Zhou, Yifan Sun, Mahdi Namazifar, Volkan Cevher, Shoham Sabach, and Mohammad Ghavamzadeh

    URL https://openreview.net/forum?id=DChQpB4AJy. Luca Viano, Ruida Zhou, Yifan Sun, Mahdi Namazifar, Volkan Cevher, Shoham Sabach, and Mohammad Ghavamzadeh. Direct preference optimization with rating information: Practical algorithms and provable gains.arXiv preprint arXiv:2602.00603,

  31. [31]

    Il-soar: Imitation learning with soft optimistic actor critic

    Stefano Viel, Luca Viano, and Volkan Cevher. Il-soar: Imitation learning with soft optimistic actor critic. arXiv preprint arXiv:2502.19859,

  32. [32]

    α-dpo: Adaptive reward margin is what direct preference optimization needs.arXiv preprint arXiv:2410.10148,

    Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. α-dpo: Adaptive reward margin is what direct preference optimization needs.arXiv preprint arXiv:2410.10148,

  33. [33]

    Exploratory preference optimization: Harnessing implicit q*-approximation for sample-efficient rlhf.arXiv preprint arXiv:2405.21046,

    Tengyang Xie, Dylan J Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, and Alexander Rakhlin. Exploratory preference optimization: Harnessing implicit q*-approximation for sample-efficient rlhf.arXiv preprint arXiv:2405.21046,

  34. [34]

    Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719,

    Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weiling Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719,

  35. [35]

    To believe or not to believe your llm.arXiv preprint arXiv:2406.02543,

    Yasin Abbasi Yadkori, Ilja Kuzborskij, András György, and Csaba Szepesvári. To believe or not to believe your llm.arXiv preprint arXiv:2406.02543,

  36. [36]

    Trajectory bellman residual minimization: A simple value-based method for llm reasoning.arXiv preprint arXiv:2505.15311,

    Yurun Yuan, Fan Chen, Zeyu Jia, Alexander Rakhlin, and Tengyang Xie. Trajectory bellman residual minimization: A simple value-based method for llm reasoning.arXiv preprint arXiv:2505.15311,

  37. [37]

    Token-level direct preference optimization

    Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, and Jun Wang. Token-level direct preference optimization.arXiv preprint arXiv:2404.11999,

  38. [38]

    Provable offline preference-based reinforcement learning.arXiv preprint arXiv:2305.14816,

    Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D Lee, and Wen Sun. Provable offline preference-based reinforcement learning.arXiv preprint arXiv:2305.14816,

  39. [39]

    In this case, we can not derive DPO-like algorithms but we compute the maximum likelihood estimator ˆrfor the reward function from the preference dataset and then consider three approaches: (i) RL outputs the action with highest estimated reward, (ii) a method that we nameχ2RL which outputs πout = argmax π∈Π ⟨π,ˆr⟩ −γDχ2(π, πref)where we set γ = 40and (ii...

  40. [40]

    The rejection sampling rule with the minimum in the numerator as definined in Algorithm 2 can be over conservative in practice and leads to a very high rejection rate

    A.3 Other worst case aggregation rules for the ensemble. The rejection sampling rule with the minimum in the numerator as definined in Algorithm 2 can be over conservative in practice and leads to a very high rejection rate. One possible alternative to reduce the generation time of the model is to consider this alternative numerator, fout(x, a) = "PL ℓ=1 ...

  41. [41]

    binarizing

    A.4 On the token level implementation ofPEPO As mentioned in the main text, we also introduce a version forPEPO which implements the pessimistic aggregation rule given by Theorem 3.2 at the token level rather than the trajectory level. This approach is faster at generation time because it does not reject generations, while Algorithm 2 does. In practice, t...

  42. [42]

    It is easy to notice that for any value along the horizontal axis the two values sum up to1

    The standard sigmoid is plotted in blue and σ(−x) = 1 −σ (x)is plotted with the dotted blue line. It is easy to notice that for any value along the horizontal axis the two values sum up to1. In addition, we show with a solid red line the sigmoid shifted on the right, i.e.σ(x−γ )by some margin γ and in dotted red the reflected version, i.e.σ(−x + γ), which...

  43. [43]

    Achieving those two improvements simultaneously, i.e

    In other words, Table 3 shows thatPEPO, DPO+SFT andχ2PO all applies beyond the easiest case of tabular setting with knownπdata butalong different axes. Achieving those two improvements simultaneously, i.e. designing an approach which applies for general policy classes without knowledge about πdata, is in our opinion the most important open question rising...

  44. [44]

    Examples of the first kind are the emerging family of algorithms for reasoning such as GRPO Shao et al

    Unknownπ data PEPO(Ours)Open Question Optimism via EnsembleIn the context of LLM post-training there are applications in which at training time it is possible to generate sentences from the model which is currently learning and getting feedback for this response under the form of a reward signal or a preference bit. Examples of the first kind are the emer...

  45. [45]

    and Nash Learning from Human Feedback Munos et al. [2024]. While promising at an empirical level, they all miss an exploration mechanism that should be included to ensure that the learning model queries the reward or preference oracle for response which are uncertain, i.e. for which new information should really be acquired. Such lack of exploration has b...

  46. [46]

    ∃x, a∈ X × As.t.−Pessimism(x, a)≥ − X b πdata(b|x) h (λ(x, a, b))2eRmax +λ(x, a, b)e Rmax/2 i# ≤ |X | |A|e −2L/7 Therefore, choosingL= 7 2 log |X ||A| δ ensures that P

    The condition1 −p ℓ tie(x, a, b) ≤ N ℓ(x,a,b) N ℓ(x,a,b)+2 holds true setting the tie weightsλℓ(x, a, b)to a large enough value. In particular, we requireλℓ(x, a, b)to satisfy for allx, a, b∈ X × A × Athat N ℓ(x, a, b) N ℓ(x, a, b) + 2 ≥ exp β∆log ˜πℓ/π(x, b, a)/2 + exp β∆log ˜πℓ/π(x, a, b)/2 exp β∆log ˜πℓ/π(x, b, a)/2 + exp β∆log ˜πℓ/π(x, a, b)/2 +λ ℓ(x,...

  47. [47]

    Finally, to have the result holding for all x, a∈ X × A , ˜N(x, a,·) : A → [N]and k∈ [N], we conclude via a union bound. Then, choosingk = N(x, a) and ˜N(x, a, b)as the minimum count observed across the ensemble denoted asNmin(x, a, b), we have that X b πdata(b|x)pℓ tie(x, a)≤ 1 N(x, a) +γ NX n=1 4eRmax/21{Xn,A+ n =x,a} 2eRmax/2 +N min(x, a, A−n ) + 4|A|l...

  48. [48]

    Finally, the proof is concluded by invoking Theorem F.7 which gives that the Lipschitz constant ofσ−1 pess in the interval[−Rmax, Rmax]is of ordere Rmax, i.e.L=O eRmax

    −1 + 4|X |log(|X |LN 3/δ) N ≤ 6|X |log(|X |LN 3/δ) N ,(26) therefore, plugging in (26) into (25), it holds that with probability1−δ X x ν0(x) X a πdata(a|x) X b πdata(b|x) ∆r⋆(x, a, b)− b∆(x, a, b) !2 ≤ eO L2R2 max |X | |A|2 Llog(1/δ) N ! . Finally, the proof is concluded by invoking Theorem F.7 which gives that the Lipschitz constant ofσ−1 pess in the in...

  49. [49]

    Combining these estimates, we obtain |f ′(y)| ≤2 " λ+ λ2+6 2a 2a + 1 a # = λ+ 2 a + λ2 + 6 2a2

    Therefore, λ+ |D′(y)| 2 √ D(y) λy+ p D(y) ≤ λ+ λ2+6 2a 2a , 1 1−y ≤ 1 a . Combining these estimates, we obtain |f ′(y)| ≤2 " λ+ λ2+6 2a 2a + 1 a # = λ+ 2 a + λ2 + 6 2a2 . Sincea −1 =e 2Rmax + 1, the stated bound follows. 42 F.1 Importance Sampling Routine: Proof of Theorem 3.3 Proof.Let us denote fout(x, a) = min ℓ∈[L] ˜πℓ(a|x) exp(−Bptie(x, a)/β) At each...