Best Policy Learning from Trajectory Preference Feedback

Akhil Agnihotri; Deepak Ramachandran; Rahul Jain; Zheng Wen

arxiv: 2501.18873 · v4 · submitted 2025-01-31 · 💻 cs.LG

Pith reviewed 2026-05-23 04:53 UTC · model grok-4.3

classification 💻 cs.LG

keywords preference-based reinforcement learning · Bayesian regret · posterior sampling · best policy identification · trajectory preferences · RLHF · generative model alignment · online exploration

The pith

A posterior sampling algorithm provides the first Bayesian simple regret guarantees for identifying the best policy from trajectory preferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the problem of finding the best policy in preference-based reinforcement learning, where data comes from binary comparisons of trajectories rather than explicit rewards. It combines an offline dataset that may be biased with online exploration to learn policies for applications like aligning generative models. The authors propose an algorithm that maintains posteriors over reward models and dynamics to guide exploration, delivering regret bounds that were previously unavailable. The approach is reported to outperform prior methods in simulations and image generation tasks. A sympathetic reader would care because it offers a more direct and possibly more robust way to optimize policies without relying on flawed reward models.
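As a hedged illustration of what a "binary comparison of trajectories" could look like formally, the block below writes down a Bradley-Terry style preference likelihood. This is an assumption for exposition: the page does not state the paper's exact model, and the symbols $r_\theta$ (trajectory reward under parameter $\theta$), $\tau_0, \tau_1$ (the two trajectories), and $\beta$ (a rater-competence scale) are notation chosen here.

```latex
% Assumed Bradley-Terry preference model (illustrative, not quoted from the paper).
% \sigma is the logistic function; \beta \ge 0 scales how reliably the rater
% prefers the higher-reward trajectory.
\[
  \Pr\bigl(\tau_0 \succ \tau_1 \mid \theta\bigr)
  = \frac{\exp\bigl(\beta\, r_\theta(\tau_0)\bigr)}
         {\exp\bigl(\beta\, r_\theta(\tau_0)\bigr) + \exp\bigl(\beta\, r_\theta(\tau_1)\bigr)}
  = \sigma\Bigl(\beta\,\bigl[r_\theta(\tau_0) - r_\theta(\tau_1)\bigr]\Bigr),
  \qquad \sigma(x) = \frac{1}{1 + e^{-x}} .
\]
```

Under this assumed form, $\beta \to 0$ gives coin-flip feedback regardless of reward, while large $\beta$ approaches a rater who always prefers the higher-reward trajectory.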

Core claim

We study the best policy identification problem in PbRL and propose Posterior Sampling for Preference Learning (PSPL), a novel algorithm inspired by Top-Two Thompson Sampling that maintains posteriors over the reward model and dynamics. We provide the first Bayesian simple regret guarantees for PbRL and introduce an efficient approximation that outperforms existing baselines on simulation and image generation benchmarks.

What carries the argument

Posterior Sampling for Preference Learning (PSPL), which maintains and samples from posteriors over the reward model and dynamics to balance exploration and exploitation in the presence of preference feedback.
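To make the posterior-sampling idea concrete, here is a minimal toy sketch in Python. It is not PSPL and not the authors' code: it assumes a finite set of "arms" standing in for policies, ignores dynamics entirely, approximates the posterior with weighted prior samples (particles), and simulates the rater with a Bradley-Terry model whose scale `beta` is a hypothetical parameter. All function names are made up for illustration. The top-two step draws one posterior sample, takes its greedy arm, then resamples until a different greedy arm appears, and asks for a preference between the two.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    return -np.logaddexp(0.0, -x)

def normalize(log_w):
    # Turn unnormalized log-weights into probabilities.
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def top_two_arms(particles, weights, rng, max_tries=50):
    # First arm: greedy arm under one posterior sample.
    arm_a = int(np.argmax(particles[rng.choice(len(particles), p=weights)]))
    arm_b = arm_a
    # Second arm: resample until the greedy arm differs (or give up).
    for _ in range(max_tries):
        arm_b = int(np.argmax(particles[rng.choice(len(particles), p=weights)]))
        if arm_b != arm_a:
            break
    return arm_a, arm_b

def run_toy(num_arms=4, num_particles=500, rounds=300, beta=5.0, seed=0):
    rng = np.random.default_rng(seed)
    true_theta = rng.normal(size=num_arms)                  # truth drawn from the prior
    particles = rng.normal(size=(num_particles, num_arms))  # prior samples
    log_w = np.zeros(num_particles)                         # uniform initial weights
    for _ in range(rounds):
        weights = normalize(log_w)
        a, b = top_two_arms(particles, weights, rng)
        if a == b:
            continue  # posterior samples agree; this duel carries no information
        # Simulated rater: prefers arm a with Bradley-Terry probability.
        p_a = np.exp(log_sigmoid(beta * (true_theta[a] - true_theta[b])))
        a_wins = rng.random() < p_a
        # Bayesian update: reweight each particle by the likelihood of the outcome.
        diff = beta * (particles[:, a] - particles[:, b])
        log_w += log_sigmoid(diff) if a_wins else log_sigmoid(-diff)
    weights = normalize(log_w)
    best_guess = int(np.argmax(weights @ particles))        # greedy arm under posterior mean
    simple_regret = float(true_theta.max() - true_theta[best_guess])
    return best_guess, simple_regret, weights
```

Calling `run_toy()` returns the recommended arm, its realized simple regret against the simulated truth, and the final particle weights. The particle approximation is a convenience for this sketch; the page does not describe how the paper's efficient approximation is built.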

If this is right

Delivers Bayesian simple regret guarantees for PbRL where none existed before.
Allows effective use of potentially biased offline preference datasets combined with online pure exploration.
Provides an efficient approximation suitable for practical benchmarks like image generation.
Supports best policy identification in settings such as multi-turn interactions for generative models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the regret guarantees hold, it could enable more reliable policy optimization in human feedback settings without reward model mis-specification.
The approach might extend to other preference-based learning scenarios beyond RL, such as in ranking or recommendation systems.
Testing the method on larger scale generative models with real human raters would validate its robustness to out-of-distribution preferences.
Integrating this with existing RLHF pipelines could reduce instances of reward hacking in aligned AI systems.

Load-bearing premise

The approach assumes that sampling from posteriors over the reward model and dynamics remains computationally tractable and statistically reliable even with biased or out-of-distribution preference data.

What would settle it

Demonstrating that the proposed approximation fails to outperform baselines on the image generation benchmarks, or that the observed simple regret does not match the predicted Bayesian guarantees.
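For readers who want the quantity being tested written out, one common way to define Bayesian simple regret is sketched below. The notation is an assumption made here, not taken from the paper: $\theta$ and $\eta$ denote the unknown reward and dynamics parameters, $V^{\pi}_{\theta,\eta}$ the value of policy $\pi$ in that environment, $\pi^\star$ the optimal policy, and $\hat{\pi}_K$ the policy the learner outputs after $K$ online episodes.

```latex
% Assumed definition of Bayesian simple regret (illustrative notation).
% The expectation is over the prior on (\theta, \eta), the offline and online
% data, and any internal randomness of the algorithm.
\[
  \mathrm{SR}_K
  = \mathbb{E}\Bigl[\, V^{\pi^\star}_{\theta,\eta} - V^{\hat{\pi}_K}_{\theta,\eta} \Bigr],
  \qquad
  \pi^\star \in \arg\max_{\pi} V^{\pi}_{\theta,\eta} .
\]
```

Unlike cumulative regret, this quantity only scores the final recommended policy, so exploratory episodes along the way are not penalized.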

Figures

Figures reproduced from arXiv: 2501.18873 by Akhil Agnihotri, Deepak Ramachandran, Rahul Jain, Zheng Wen.

**Figure 1.** Comparison of PSPL with current state-of-the-art offline finetuning algorithms, DPO and IPO, in two benchmark environments. Online finetuning is necessary for BPI. See Appendix A for more details.

**Figure 2.** PSPL with varying N, β, and λ in benchmark environments. Shaded region around the mean line represents 1 standard deviation over 5 independent runs. [Plot residue: panels MountainCar and RiverSwim; x-axis K (log scale); y-axes Simple Regret and Cumulative Regret; curves LPbRL, DPS, PSPL.]

**Figure 3.** Simple and Cumulative Regret (÷10^3) vs K plots. PSPL is run with λ = 50, β = 10, N = 10^3. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]

**Figure 5.** Sensitivity to flawed expert policy with [PITH_FULL_IMAGE:figures/full_fig_p026_5.png]

**Figure 6.** Sample image generations with final image reward rpθp¨q over 5 independent runs. Images are enlarged for clarity. [PITH_FULL_IMAGE:figures/full_fig_p027_6.png]

Original abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful approach for aligning generative models, but its reliance on learned reward models makes it vulnerable to mis-specification and reward hacking. Preference-based Reinforcement Learning (PbRL) offers a more robust alternative by directly leveraging noisy binary comparisons over trajectories. We study the best policy identification problem in PbRL, motivated by post-training optimization of generative models, for example, during multi-turn interactions. Learning in this setting combines an offline preference dataset - potentially biased or out-of-distribution and collected from a rater of subpar 'competence' - with online pure exploration, making systematic online learning essential. To this end, we propose Posterior Sampling for Preference Learning ($\mathsf{PSPL}$), a novel algorithm inspired by Top-Two Thompson Sampling that maintains posteriors over the reward model and dynamics. We provide the first Bayesian simple regret guarantees for PbRL and introduce an efficient approximation that outperforms existing baselines on simulation and image generation benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PSPL claims the first Bayesian simple regret bounds for PbRL by extending Top-Two Thompson Sampling, but the guarantees look limited by the lack of robustness to the biased or OOD offline data the setting explicitly allows.

The main point is that this paper supplies the first Bayesian simple regret guarantees for preference-based RL and introduces PSPL, an algorithm that keeps posteriors over reward models and dynamics to combine possibly biased offline preferences with online exploration. It draws from Top-Two Thompson Sampling and reports an efficient approximation that beats baselines on simulation and image generation tasks. That framing around post-training of generative models and the empirical edge are the clearest strengths, as they target a real pain point with reward hacking in RLHF without needing a perfectly specified reward model. The approach feels like a direct extension of existing posterior sampling ideas into the PbRL best-policy setting.

The soft spot sits in the regret analysis. Standard Bayesian bounds assume the data-generating process aligns with the model and prior, yet the paper allows offline data from lower-competence raters or mismatched distributions. No misspecification-robust condition or adjusted bound appears to be stated, so the guarantees may only apply when the offline data happens to fit the learner's model class. The experiments would need checking for how they actually inject and handle that bias.

The work is aimed at researchers working on theoretical alternatives to RLHF and alignment methods that use trajectory preferences. A reader focused on regret bounds or posterior sampling in preference learning would get value from the new result and algorithm, provided the derivations check out. It deserves peer review so the proofs and benchmark details can be examined directly.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes Posterior Sampling for Preference Learning (PSPL), an algorithm for best policy identification in Preference-based RL (PbRL) that maintains and samples from posteriors over reward models and dynamics. It combines a possibly biased or out-of-distribution offline preference dataset with online pure exploration, claims the first Bayesian simple regret guarantees for PbRL, and introduces an efficient approximation that outperforms baselines on simulation and image-generation benchmarks.

Significance. If the Bayesian simple regret bounds are valid, the work would be significant for supplying the first such guarantees in PbRL while explicitly allowing offline data that may be biased or OOD, which aligns with practical RLHF settings. The empirical outperformance of the approximation on benchmarks provides additional practical value.

major comments (1)

[Theoretical regret analysis (the section containing the Bayesian simple regret bounds)] The Bayesian simple regret analysis (the section deriving the guarantees for PSPL) assumes the posterior over the reward model and dynamics remains well-calibrated and statistically valid when the offline preference dataset is biased or out-of-distribution. No explicit misspecification-robustness condition, prior-likelihood consistency requirement, or sensitivity analysis is supplied to justify this under the motivating rater-competence mismatch, which is load-bearing for the central claim that the bounds apply in the stated setting.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and constructive feedback on the theoretical analysis. We address the single major comment below.

Referee: The Bayesian simple regret analysis (the section deriving the guarantees for PSPL) assumes the posterior over the reward model and dynamics remains well-calibrated and statistically valid when the offline preference dataset is biased or out-of-distribution. No explicit misspecification-robustness condition, prior-likelihood consistency requirement, or sensitivity analysis is supplied to justify this under the motivating rater-competence mismatch, which is load-bearing for the central claim that the bounds apply in the stated setting.

Authors: We thank the referee for highlighting this point. The Bayesian simple regret guarantees for PSPL are derived under the standard assumption of correct model specification: the true reward model and dynamics are drawn from the prior, and both the offline preference dataset and online trajectories are generated according to this model. Under this assumption the posterior remains well-calibrated by construction, and the regret analysis follows from standard Bayesian arguments. The manuscript's introduction motivates the setting with possibly biased or OOD offline data collected from a rater of subpar competence to reflect practical RLHF use cases; however, the stated theoretical results require that the offline data be consistent with the model class. We agree that an explicit statement of this modeling assumption was missing from the paper. In the revision we will add a clear remark in the theoretical section noting the correct-specification requirement and explicitly stating that robustness to misspecification (such as rater-competence mismatch) is left for future work. No sensitivity analysis is included because the focus of the current analysis is the well-specified Bayesian case.

revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

The provided abstract and text introduce PSPL as a novel algorithm inspired by Top-Two Thompson Sampling and claim first Bayesian simple regret guarantees for PbRL, but contain no equations, proof sketches, or self-referential definitions that reduce any claimed prediction or guarantee to a fitted input or prior self-citation by construction. The central claims rest on maintaining posteriors over reward and dynamics with an offline dataset, without visible load-bearing steps that collapse to renaming or ansatz smuggling. This is the expected honest non-finding when no explicit reduction can be exhibited from the text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.
