Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution
Pith reviewed 2026-05-21 13:02 UTC · model grok-4.3
The pith
PEPO uses pessimistic ensembles on disjoint data subsets to avoid over-optimization in DPO by depending only on single-policy concentrability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PEPO achieves sample complexity guarantees depending only on a single-policy concentrability coefficient by training an ensemble of preference-optimized policies on disjoint data subsets and aggregating them through a worst-case construction that favors agreement across models, thereby avoiding the all-policy concentrability which affects the guarantees of algorithms prone to over-optimization such as DPO.
What carries the argument
The pessimistic ensemble aggregation, which constructs a final policy from the worst-case agreement among models trained on disjoint data subsets to enforce pessimism without an explicit reward model.
If this is right
- Sample complexity in the tabular setting depends only on single-policy concentrability rather than all-policy concentrability.
- The method retains the single-step training simplicity of DPO while adding theoretical protection against over-optimization.
- No knowledge of the data-generating distribution is required for the guarantees.
- Practical performance matches or exceeds DPO while remaining computationally comparable.
Where Pith is reading between the lines
- The same disjoint-subset construction might reduce concentrability requirements in non-tabular or continuous settings if policy diversity can be maintained.
- PEPO could be combined with other regularization techniques to further improve robustness in large language model alignment.
- Empirical checks on real preference datasets would test whether the required diversity across disjoint subsets holds in practice.
Load-bearing premise
Training on disjoint data subsets produces sufficiently diverse policies whose worst-case aggregation still yields a performant and safe policy.
What would settle it
A tabular MDP in which the worst-case aggregation of the subset-trained policies produces a low-performing or unsafe policy even though each individual policy is good on its own subset.
Figures
read the original abstract
We introduce PEPO (Pessimistic Ensemble based Preference Optimization), a single-step Direct Preference Optimization (DPO)-like algorithm to mitigate the well-known over-optimization issue in preference learning without requiring the knowledge of the data-generating distribution or learning an explicit reward model. PEPO achieves pessimism via an ensemble of preference-optimized policies trained on disjoint data subsets and then aggregates them through a worst case construction that favors the agreement across models. In the tabular setting, PEPO achieves sample complexity guarantees depending only on a single-policy concentrability coefficient, thus avoiding the all-policy concentrability which affects the guarantees of algorithms prone to over-optimization, such as DPO. The theoretical findings are corroborated by a convincing practical performance, while retaining the simplicity and the practicality of DPO-style training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PEPO (Pessimistic Ensemble based Preference Optimization), a DPO-style algorithm that trains an ensemble of policies on disjoint data subsets and aggregates them via a worst-case agreement construction to induce pessimism. The central theoretical claim is that, in the tabular setting, this yields sample-complexity guarantees depending only on a single-policy concentrability coefficient, thereby avoiding the all-policy concentrability that appears in analyses of standard DPO and other over-optimization-prone methods. The approach requires neither knowledge of the data-generating distribution nor an explicit reward model, and the authors report competitive empirical performance.
Significance. If the concentrability reduction is rigorously established, the result would be significant for preference optimization and RLHF: it supplies a concrete algorithmic mechanism (disjoint-subset ensembles plus worst-case aggregation) that converts a multi-policy coverage requirement into a single-policy one while preserving the simplicity of DPO-style training. Such a guarantee would be a useful reference point for designing safe preference learners that do not rely on explicit reward modeling or distribution knowledge.
major comments (2)
- The reduction from all-policy to single-policy concentrability rests on the claim that worst-case aggregation over policies trained on disjoint subsets produces a policy whose effective coverage is controlled by one reference policy alone. The manuscript does not supply an explicit argument or assumption showing that disjoint subsets necessarily generate sufficient policy diversity, nor does it bound the extrapolation bias that could be shared across independently optimized policies. Without such justification, the worst-case operator may still select an over-optimized action whose concentrability coefficient is closer to the all-policy quantity than to the single-policy quantity (see the tabular analysis and the definition of the aggregated policy).
- The sample-complexity statement is presented without the full derivation, explicit error bounds, or the precise assumptions placed on the data-generating process and on the reference policy. Because the concentrability reduction is the load-bearing step for the main claim, the absence of these details prevents verification that no hidden multi-policy coverage terms reappear in the final bound.
minor comments (2)
- Notation for the worst-case aggregation operator and for the single-policy concentrability coefficient should be introduced earlier and used consistently throughout the theoretical section.
- The empirical section would benefit from an ablation that isolates the contribution of the disjoint-subset training versus the worst-case aggregation step.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address the major comments point by point below and will revise the manuscript to strengthen the theoretical presentation.
read point-by-point responses
-
Referee: The reduction from all-policy to single-policy concentrability rests on the claim that worst-case aggregation over policies trained on disjoint subsets produces a policy whose effective coverage is controlled by one reference policy alone. The manuscript does not supply an explicit argument or assumption showing that disjoint subsets necessarily generate sufficient policy diversity, nor does it bound the extrapolation bias that could be shared across independently optimized policies. Without such justification, the worst-case operator may still select an over-optimized action whose concentrability coefficient is closer to the all-policy quantity than to the single-policy quantity (see the tabular analysis and the definition of the aggregated policy).
Authors: We appreciate the referee highlighting the need for greater explicitness here. The current tabular analysis and definition of the aggregated policy implicitly rely on the fact that disjoint subsets yield independently optimized policies with distinct extrapolation behaviors, and that the worst-case aggregation selects the most conservative action among them. However, we agree that a self-contained argument is warranted. In the revision we will add a dedicated lemma that (i) shows how independent optimization on disjoint subsets induces policy diversity sufficient to cover the relevant action space under the reference policy, and (ii) bounds the shared extrapolation bias so that the effective concentrability of the aggregated policy is controlled by the single-policy coefficient alone. This will directly address the concern that the worst-case operator could inadvertently revert to an all-policy quantity. revision: yes
-
Referee: The sample-complexity statement is presented without the full derivation, explicit error bounds, or the precise assumptions placed on the data-generating process and on the reference policy. Because the concentrability reduction is the load-bearing step for the main claim, the absence of these details prevents verification that no hidden multi-policy coverage terms reappear in the final bound.
Authors: We acknowledge that the sample-complexity guarantee is currently stated at a high level. In the revised manuscript we will supply the complete derivation, including all intermediate error bounds, the precise assumptions on the data-generating process (i.i.d. sampling from the preference distribution), and the conditions on the reference policy. The expanded proof will explicitly track concentrability terms at each step and confirm that only the single-policy coefficient appears in the final bound, with no residual multi-policy coverage factors. revision: yes
Circularity Check
No significant circularity in PEPO's single-policy concentrability claim
full rationale
The paper introduces PEPO as a new ensemble-based algorithm that trains DPO-style policies on disjoint data subsets and aggregates via worst-case agreement to induce pessimism. The central theoretical claim is a sample-complexity bound in the tabular setting that depends only on a single-policy concentrability coefficient. No quoted equation or derivation step reduces this bound by construction to a fitted parameter, a self-citation chain, or an input that already encodes the target guarantee. The construction and its analysis appear self-contained against external benchmarks, with the concentrability reduction arising from the explicit algorithmic design rather than from re-labeling or tautological fitting.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Disjoint data subsets yield policies whose worst-case aggregation preserves single-policy concentrability.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
PEPO achieves sample complexity guarantees depending only on a single-policy concentrability coefficient... avoiding the all-policy concentrability which affects... DPO
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
pessimism via an ensemble of preference-optimized policies trained on disjoint data subsets and then aggregates them through a worst case construction
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Design considerations in offline preference-based rl
Alekh Agarwal, Christoph Dann, and Teodor V Marinov. Design considerations in offline preference-based rl. arXiv preprint arXiv:2502.06861,
-
[2]
Udbhav Bamba, Minghao Fang, Yifan Yu, Haizhong Zheng, and Fan Lai. Xrpo: Pushing the limits of grpo with targeted exploration and exploitation.arXiv preprint arXiv:2510.06672,
-
[3]
Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, and Bo Dai. Value-incentivized preference optimization: A unified approach to online and offline rlhf.arXiv preprint arXiv:2405.19320,
-
[4]
On extending direct preference optimization to accommodate ties.arXiv preprint arXiv:2409.17431,
Jinghong Chen, Guangyu Yang, Weizhe Lin, Jingbiao Mei, and Bill Byrne. On extending direct preference optimization to accommodate ties.arXiv preprint arXiv:2409.17431,
-
[5]
Mingyu Chen, Yiding Chen, Wen Sun, and Xuezhou Zhang. AvoidingO(eRmax)scaling in rlhf through preference-based exploration.arXiv preprint arXiv:2502.00666,
-
[6]
UCB Exploration via Q-Ensembles
Richard Y Chen, Szymon Sidor, Pieter Abbeel, and John Schulman. Ucb exploration via q-ensembles.arXiv preprint arXiv:1706.01502,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Reward model ensembles help mitigate overoptimization.arXiv preprint arXiv:2310.02743,
Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward model ensembles help mitigate overoptimization.arXiv preprint arXiv:2310.02743,
-
[8]
Active preference optimiza- tion for sample efficient rlhf.arXiv preprint arXiv:2402.10500,
Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, and Sayak Ray Chowdhury. Active preference optimiza- tion for sample efficient rlhf.arXiv preprint arXiv:2402.10500,
-
[9]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
URLhttps://arxiv.org/abs/2501.12948. Fei Ding, Baiqiao Wang, Zijian Zeng, and Youwei Wang. Multi-layer grpo: Enhancing reasoning and self-correction in large language models.arXiv preprint arXiv:2506.04746,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Robust preference optimization through reward model distillation.arXiv preprint arXiv:2405.19316,
Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, and Jonathan Berant. Robust preference optimization through reward model distillation.arXiv preprint arXiv:2405.19316,
-
[11]
Learn your reference model for real good alignment
13 Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, and Daniil Gavrilov. Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656,
-
[12]
Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, and Saravan Rajmohan
URLhttps://arxiv.org/abs/2402.04792. Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, and Saravan Rajmohan. Multi-preference optimization: Generalizing dpo via set-level contrasts.arXiv preprint arXiv:2412.04628,
-
[13]
URLhttps://openreview.net/forum?id=nZeVKeeFYf9. Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D Lee, Wen Sun, Akshay Krishnamurthy, and Dylan J Foster. Correcting the mythos of kl-regularization: Direct alignment without overoptimization via chi- squared preference optimization.arXiv preprint arXiv:2407.13399,
-
[14]
Haque Ishfaq, Qingfeng Lan, Pan Xu, A Rupam Mahmood, Doina Precup, Anima Anandkumar, and Kamyar Azizzadenesheli. Provable and practical: Efficient exploration in reinforcement learning via langevin monte carlo.arXiv preprint arXiv:2305.18246,
-
[15]
Haque Ishfaq, Yixin Tan, Yu Yang, Qingfeng Lan, Jianfeng Lu, A Rupam Mahmood, Doina Precup, and Pan Xu. More efficient randomized exploration for reinforcement learning via approximate sampling.arXiv preprint arXiv:2406.12241,
-
[16]
Haque Ishfaq, Guangyuan Wang, Sami Nur Islam, and Doina Precup. Langevin soft actor-critic: Efficient exploration through uncertainty-driven critic learning.arXiv preprint arXiv:2501.17827,
-
[17]
Xiang Ji, Sanjeev Kulkarni, Mengdi Wang, and Tengyang Xie. Self-play with adversarial critic: Provable and scalable offline alignment for language models.arXiv preprint arXiv:2406.04274,
-
[18]
Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models.https: //github.com/tatsu-lab/alpaca_eval, 5 2023a. Zihao Li, Zhuoran Yang, and Mengdi Wang. Reinforcement learning with human feedback: Learning dynamic choices ...
-
[19]
Understanding R1-Zero-Like Training: A Critical Perspective
URLhttps://arxiv. org/abs/2503.20783. Simon Matrenok, Skander Moalla, and Caglar Gulcehre. Quantile reward policy optimization: Alignment with pointwise regression and exact partition functions.arXiv preprint arXiv:2507.08068,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
14 Antoine Moulin, Gergely Neu, and Luca Viano. Optimistically optimistic exploration for provably efficient infinite-horizon reinforcement and imitation learning.arXiv preprint arXiv:2502.13900,
-
[21]
URLhttps://arxiv.org/abs/2312.00886. Brendan O’Donoghue. Variational bayesian reinforcement learning with regret bounds.Advances in Neural Information Processing Systems, 34:28208–28221,
-
[22]
Brendan O’Donoghue. Efficient exploration via epistemic-risk-seeking policy optimization.arXiv preprint arXiv:2302.09339,
-
[23]
Disentangling length from quality in direct preference optimization
Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. InFindings of the Association for Computational Linguistics: ACL 2024, pages 4998–5017. Association for Computational Linguistics,
work page 2024
-
[24]
Andreas Schlaginhaufen, Reda Ouhamma, and Maryam Kamgarpour
URLhttps://proceedings.mlr.press/v119/rosenberg20a.html. Andreas Schlaginhaufen, Reda Ouhamma, and Maryam Kamgarpour. Efficient preference-based reinforcement learning: Randomized exploration meets experimental design.arXiv preprint arXiv:2506.09508,
-
[25]
Online apprenticeship learning.arXiv:2102.06924,
Lior Shani, Tom Zahavy, and Shie Mannor. Online apprenticeship learning.arXiv:2102.06924,
-
[26]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
URLhttps://arxiv.org/abs/2402.03300. YudaSong, GokulSwamy, AartiSingh, JBagnell, andWenSun. Theimportanceofonlinedata: Understanding preference fine-tuning via coverage.Advances in Neural Information Processing Systems, 37:12243–12270,
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
Robust preference optimization via dynamic target margins.arXiv preprint arXiv:2506.03690,
Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, and Xiang Wang. Robust preference optimization via dynamic target margins.arXiv preprint arXiv:2506.03690,
-
[28]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288,
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
Jens Tuyls, Dylan J Foster, Akshay Krishnamurthy, and Jordan T Ash. Representation-based exploration for language models: From test-time to post-training.arXiv preprint arXiv:2510.11686,
-
[30]
URL https://openreview.net/forum?id=DChQpB4AJy. Luca Viano, Ruida Zhou, Yifan Sun, Mahdi Namazifar, Volkan Cevher, Shoham Sabach, and Mohammad Ghavamzadeh. Direct preference optimization with rating information: Practical algorithms and provable gains.arXiv preprint arXiv:2602.00603,
-
[31]
Il-soar: Imitation learning with soft optimistic actor critic
Stefano Viel, Luca Viano, and Volkan Cevher. Il-soar: Imitation learning with soft optimistic actor critic. arXiv preprint arXiv:2502.19859,
-
[32]
Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. α-dpo: Adaptive reward margin is what direct preference optimization needs.arXiv preprint arXiv:2410.10148,
-
[33]
Tengyang Xie, Dylan J Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, and Alexander Rakhlin. Exploratory preference optimization: Harnessing implicit q*-approximation for sample-efficient rlhf.arXiv preprint arXiv:2405.21046,
-
[34]
Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719,
Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weiling Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. Is dpo superior to ppo for llm alignment? a comprehensive study.arXiv preprint arXiv:2404.10719,
-
[35]
To believe or not to believe your llm.arXiv preprint arXiv:2406.02543,
Yasin Abbasi Yadkori, Ilja Kuzborskij, András György, and Csaba Szepesvári. To believe or not to believe your llm.arXiv preprint arXiv:2406.02543,
-
[36]
Yurun Yuan, Fan Chen, Zeyu Jia, Alexander Rakhlin, and Tengyang Xie. Trajectory bellman residual minimization: A simple value-based method for llm reasoning.arXiv preprint arXiv:2505.15311,
-
[37]
Token-level direct preference optimization
Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, and Jun Wang. Token-level direct preference optimization.arXiv preprint arXiv:2404.11999,
-
[38]
Provable offline preference-based reinforcement learning.arXiv preprint arXiv:2305.14816,
Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D Lee, and Wen Sun. Provable offline preference-based reinforcement learning.arXiv preprint arXiv:2305.14816,
-
[39]
In this case, we can not derive DPO-like algorithms but we compute the maximum likelihood estimator ˆrfor the reward function from the preference dataset and then consider three approaches: (i) RL outputs the action with highest estimated reward, (ii) a method that we nameχ2RL which outputs πout = argmax π∈Π ⟨π,ˆr⟩ −γDχ2(π, πref)where we set γ = 40and (ii...
work page 2000
-
[40]
A.3 Other worst case aggregation rules for the ensemble. The rejection sampling rule with the minimum in the numerator as definined in Algorithm 2 can be over conservative in practice and leads to a very high rejection rate. One possible alternative to reduce the generation time of the model is to consider this alternative numerator, fout(x, a) = "PL ℓ=1 ...
work page 2025
-
[41]
A.4 On the token level implementation ofPEPO As mentioned in the main text, we also introduce a version forPEPO which implements the pessimistic aggregation rule given by Theorem 3.2 at the token level rather than the trajectory level. This approach is faster at generation time because it does not reject generations, while Algorithm 2 does. In practice, t...
work page 2022
-
[42]
It is easy to notice that for any value along the horizontal axis the two values sum up to1
The standard sigmoid is plotted in blue and σ(−x) = 1 −σ (x)is plotted with the dotted blue line. It is easy to notice that for any value along the horizontal axis the two values sum up to1. In addition, we show with a solid red line the sigmoid shifted on the right, i.e.σ(x−γ )by some margin γ and in dotted red the reflected version, i.e.σ(−x + γ), which...
work page 2023
-
[43]
Achieving those two improvements simultaneously, i.e
In other words, Table 3 shows thatPEPO, DPO+SFT andχ2PO all applies beyond the easiest case of tabular setting with knownπdata butalong different axes. Achieving those two improvements simultaneously, i.e. designing an approach which applies for general policy classes without knowledge about πdata, is in our opinion the most important open question rising...
work page 2024
-
[44]
Unknownπ data PEPO(Ours)Open Question Optimism via EnsembleIn the context of LLM post-training there are applications in which at training time it is possible to generate sentences from the model which is currently learning and getting feedback for this response under the form of a reward signal or a preference bit. Examples of the first kind are the emer...
work page 2024
-
[45]
and Nash Learning from Human Feedback Munos et al. [2024]. While promising at an empirical level, they all miss an exploration mechanism that should be included to ensure that the learning model queries the reward or preference oracle for response which are uncertain, i.e. for which new information should really be acquired. Such lack of exploration has b...
work page 2024
-
[46]
The condition1 −p ℓ tie(x, a, b) ≤ N ℓ(x,a,b) N ℓ(x,a,b)+2 holds true setting the tie weightsλℓ(x, a, b)to a large enough value. In particular, we requireλℓ(x, a, b)to satisfy for allx, a, b∈ X × A × Athat N ℓ(x, a, b) N ℓ(x, a, b) + 2 ≥ exp β∆log ˜πℓ/π(x, b, a)/2 + exp β∆log ˜πℓ/π(x, a, b)/2 exp β∆log ˜πℓ/π(x, b, a)/2 + exp β∆log ˜πℓ/π(x, a, b)/2 +λ ℓ(x,...
work page 2025
-
[47]
Finally, to have the result holding for all x, a∈ X × A , ˜N(x, a,·) : A → [N]and k∈ [N], we conclude via a union bound. Then, choosingk = N(x, a) and ˜N(x, a, b)as the minimum count observed across the ensemble denoted asNmin(x, a, b), we have that X b πdata(b|x)pℓ tie(x, a)≤ 1 N(x, a) +γ NX n=1 4eRmax/21{Xn,A+ n =x,a} 2eRmax/2 +N min(x, a, A−n ) + 4|A|l...
work page 2020
-
[48]
−1 + 4|X |log(|X |LN 3/δ) N ≤ 6|X |log(|X |LN 3/δ) N ,(26) therefore, plugging in (26) into (25), it holds that with probability1−δ X x ν0(x) X a πdata(a|x) X b πdata(b|x) ∆r⋆(x, a, b)− b∆(x, a, b) !2 ≤ eO L2R2 max |X | |A|2 Llog(1/δ) N ! . Finally, the proof is concluded by invoking Theorem F.7 which gives that the Lipschitz constant ofσ−1 pess in the in...
work page 2020
-
[49]
Combining these estimates, we obtain |f ′(y)| ≤2 " λ+ λ2+6 2a 2a + 1 a # = λ+ 2 a + λ2 + 6 2a2
Therefore, λ+ |D′(y)| 2 √ D(y) λy+ p D(y) ≤ λ+ λ2+6 2a 2a , 1 1−y ≤ 1 a . Combining these estimates, we obtain |f ′(y)| ≤2 " λ+ λ2+6 2a 2a + 1 a # = λ+ 2 a + λ2 + 6 2a2 . Sincea −1 =e 2Rmax + 1, the stated bound follows. 42 F.1 Importance Sampling Routine: Proof of Theorem 3.3 Proof.Let us denote fout(x, a) = min ℓ∈[L] ˜πℓ(a|x) exp(−Bptie(x, a)/β) At each...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.