Near-optimal Bayesian Solution For Unknown Discrete Markov Decision Process
Pith reviewed 2026-05-25 20:04 UTC · model grok-4.3
The pith
A polynomial-time Bayesian algorithm achieves near-optimal frequentist regret for unknown finite MDPs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We derive the first polynomial time Bayesian algorithm, BUCRL that achieves up to logarithm factors, a regret of the optimal order Õ(√(DSAT)). Importantly, our result holds with high probability for the worst-case (frequentist) regret and not the weaker notion of Bayesian regret.
What carries the argument
BUCRL, a Bayesian algorithm that maintains a posterior over unknown transition and reward distributions while using diameter-bounded planning to control exploration and deliver frequentist regret bounds.
If this is right
- BUCRL runs in time polynomial in S, A, T and D even though T and D are unknown.
- The same regret order holds for the frequentist worst-case regret, not merely in expectation over a prior.
- The algorithm outperforms prior methods in experiments across multiple environments.
- The analysis supplies a sharper upper bound on the KL divergence between any two Bernoulli distributions.
- Sharper elementary upper and lower bounds are obtained for the quantiles of Beta and binomial distributions.
Where Pith is reading between the lines
- Similar posterior-sampling techniques may be adapted to obtain frequentist optimality in other Bayesian RL settings without losing computational tractability.
- The new quantile and KL bounds can be reused to tighten analyses in any algorithm that relies on Beta or Bernoulli concentration.
- If an online estimator for diameter can be incorporated, the regret guarantee could extend to MDPs where D is not known to be finite in advance.
Load-bearing premise
The unknown MDP has a finite diameter D that bounds the expected shortest path between any pair of states.
What would settle it
An MDP with finite S and A where BUCRL's realized regret exceeds C √(DSAT) times polylog factors for some constant C, over a worst-case choice of transitions and rewards.
Figures
read the original abstract
We tackle the problem of acting in an unknown finite and discrete Markov Decision Process (MDP) for which the expected shortest path from any state to any other state is bounded by a finite number $D$. An MDP consists of $S$ states and $A$ possible actions per state. Upon choosing an action $a_t$ at state $s_t$, one receives a real value reward $r_t$, then one transits to a next state $s_{t+1}$. The reward $r_t$ is generated from a fixed reward distribution depending only on $(s_t, a_t)$ and similarly, the next state $s_{t+1}$ is generated from a fixed transition distribution depending only on $(s_t, a_t)$. The objective is to maximize the accumulated rewards after $T$ interactions. In this paper, we consider the case where the reward distributions, the transitions, $T$ and $D$ are all unknown. We derive the first polynomial time Bayesian algorithm, BUCRL{} that achieves up to logarithm factors, a regret (i.e the difference between the accumulated rewards of the optimal policy and our algorithm) of the optimal order $\tilde{\mathcal{O}}(\sqrt{DSAT})$. Importantly, our result holds with high probability for the worst-case (frequentist) regret and not the weaker notion of Bayesian regret. We perform experiments in a variety of environments that demonstrate the superiority of our algorithm over previous techniques. Our work also illustrates several results that will be of independent interest. In particular, we derive a sharper upper bound for the KL-divergence of Bernoulli random variables. We also derive sharper upper and lower bounds for Beta and Binomial quantiles. All the bound are very simple and only use elementary functions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents BUCRL, the first polynomial-time Bayesian algorithm for unknown finite discrete MDPs whose diameter is bounded by some finite (but unknown) D. The central claim is that BUCRL attains the near-optimal frequentist regret bound Õ(√(DSAT)) with high probability (not merely Bayesian regret), without knowledge of the horizon T or the diameter D. The paper also derives sharper elementary bounds on the KL divergence between Bernoulli random variables and on Beta and Binomial quantiles, and reports experiments showing superiority over prior methods.
Significance. If the regret claim holds, the result would be significant: it supplies the first poly-time Bayesian procedure whose worst-case frequentist performance matches the known information-theoretic rate up to logarithmic factors for bounded-diameter MDPs. The auxiliary bounds on KL divergence and quantiles are of independent interest and are presented in elementary closed form. The experimental comparisons provide supporting evidence but are secondary to the theoretical contribution.
major comments (2)
- [Abstract, §1] Abstract and §1: The frequentist high-probability regret claim of Õ(√(DSAT)) is load-bearing on the structural assumption that the unknown MDP has finite diameter D. The analysis must establish that the Bayesian posterior updates yield the stated rate uniformly over all MDPs whose diameter is at most D, without the algorithm ever receiving D or estimating it. No derivation, concentration argument, or error decomposition is supplied in the abstract to confirm this uniformity.
- [§1, related-work discussion] The claim that the result is the first polynomial-time Bayesian algorithm achieving this frequentist rate requires explicit comparison to prior work on posterior sampling or Bayesian exploration in MDPs; if the proof in the analysis section reduces to standard UCRL-style optimism after posterior sampling, the novelty of the Bayesian component needs to be clarified.
minor comments (2)
- [Abstract] Notation: the diameter D is introduced as unknown yet the regret expression contains D; a brief remark on how the algorithm remains well-defined when D is unavailable would improve clarity.
- [Experiments] The experiments section would benefit from reporting the number of independent runs and confidence intervals on the plotted regret curves.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for recognizing the potential significance of the result. We address each major comment below.
read point-by-point responses
-
Referee: [Abstract, §1] Abstract and §1: The frequentist high-probability regret claim of Õ(√(DSAT)) is load-bearing on the structural assumption that the unknown MDP has finite diameter D. The analysis must establish that the Bayesian posterior updates yield the stated rate uniformly over all MDPs whose diameter is at most D, without the algorithm ever receiving D or estimating it. No derivation, concentration argument, or error decomposition is supplied in the abstract to confirm this uniformity.
Authors: The abstract is a concise summary and does not contain full derivations; the complete analysis establishing uniformity appears in Sections 3–5. The proof shows that posterior sampling combined with the subsequent optimistic planning yields the Õ(√(DSAT)) high-probability frequentist bound uniformly over all MDPs with diameter at most the unknown D, without the algorithm receiving or estimating D. The diameter enters only the analysis (via the standard diameter-dependent concentration and covering arguments), not the algorithm itself. We can add one sentence to the abstract referencing this uniformity if the editor deems it necessary. revision: partial
-
Referee: [§1, related-work discussion] The claim that the result is the first polynomial-time Bayesian algorithm achieving this frequentist rate requires explicit comparison to prior work on posterior sampling or Bayesian exploration in MDPs; if the proof in the analysis section reduces to standard UCRL-style optimism after posterior sampling, the novelty of the Bayesian component needs to be clarified.
Authors: We will expand the related-work discussion in §1 to include direct comparisons with prior posterior-sampling and Bayesian exploration results for MDPs (e.g., Thompson sampling variants and Bayesian UCRL-style methods). The novelty is that BUCRL is the first polynomial-time Bayesian procedure whose analysis delivers a high-probability frequentist regret bound matching the information-theoretic rate up to logs, without knowledge of T or D. While the algorithm does draw a model from the posterior and then applies an optimistic planner, the Bayesian posterior is essential: it supplies the concentration that holds uniformly over the unknown bounded-diameter class without requiring the algorithm to know or estimate D, which prior frequentist methods needed explicitly. revision: yes
Circularity Check
No significant circularity detected; derivation is self-contained.
full rationale
The paper introduces the BUCRL algorithm as a new polynomial-time Bayesian procedure for finite MDPs with unknown but bounded diameter D, deriving a high-probability frequentist regret bound of Õ(√(DSAT)). It also presents independent auxiliary results on KL-divergence bounds for Bernoulli variables and sharper Beta/Binomial quantile bounds using elementary functions. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claim to prior author work are present. The diameter assumption is an explicit class restriction on the unknown MDP, not a circular input, and the regret analysis is positioned against external benchmarks without reducing to the algorithm's own fitted quantities.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The unknown MDP has finite S states, A actions, and finite diameter D (expected shortest path from any state to any other bounded by D).
Reference graph
Works this paper leans on
-
[1]
Agrawal, S. and Goyal, N. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39–1, 2012
work page 2012
-
[2]
Minimax Regret Bounds for Reinforcement Learning
Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[3]
Babu, G. J. and Rao, C. R. Joint asymptotic distribution of marginal quantiles and quantile functions in samples from a multivariate population. In Multivariate Statistics and Probability, pp. 15–23. Elsevier, 1989
work page 1989
-
[4]
Bartlett, P. L. and Tewari, A. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pp. 35–42. AUAI Press, 2009
work page 2009
-
[5]
Chiani, M., Dardari, D., and Simon, M. K. New exponential bounds and approximations for the computation of error probability in fading channels. IEEE Transactions on Wireless Communications, 2 (4):840–845, 2003
work page 2003
-
[6]
Dragomir, S. S., Scholz, M., and Sunde, J. Some upper bounds for relative entropy and applications. Computers & Mathematics with Applications, 39(9-10):91–100, 2000
work page 2000
-
[7]
Optimism in reinforcement learning and kullback-leibler diver- gence
Filippi, S., Cappé, O., and Garivier, A. Optimism in reinforcement learning and kullback-leibler diver- gence. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pp. 115–122. IEEE, 2010. 8
work page 2010
-
[8]
Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning
Fruit, R., Pirotta, M., Lazaric, A., and Ortner, R. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. arXiv preprint arXiv:1802.04020, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[9]
Gocgun, Y ., Bresnahan, B. W., Ghate, A., and Gunn, M. L. A markov decision process approach to multi-category patient scheduling in a diagnostic facility. Artificial intelligence in medicine , 53(2): 73–81, 2011
work page 2011
-
[10]
Near-optimal regret bounds for reinforcement learning
Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010
work page 2010
-
[11]
Large deviation inequalities for sums of indicator variables
Janson, S. Large deviation inequalities for sums of indicator variables.arXiv preprint arXiv:1609.00533, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[12]
Kaas, R. and Buhrman, J. M. Mean, median and mode in binomial distributions. Statistica Neerlandica, 34(1):13–18, 1980
work page 1980
-
[13]
On bayesian upper confidence bounds for bandit problems
Kaufmann, E., Garivier, A., and Paristech, T. On bayesian upper confidence bounds for bandit problems. In In AISTATS, 2012
work page 2012
-
[14]
Showing relevant ads via context multi-armed bandits
Lu, T., Pál, D., and Pál, M. Showing relevant ads via context multi-armed bandits. In Proceedings of AISTATS, 2009
work page 2009
-
[15]
Posterior Sampling for Reinforcement Learning Without Episodes
Osband, I. and Van Roy, B. Posterior sampling for reinforcement learning without episodes. arXiv preprint arXiv:1608.02731, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[16]
Osband, I. and Van Roy, B. Why is posterior sampling better than optimism for reinforcement learning? In Precup, D. and Teh, Y . W. (eds.),Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2701–2710, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL ...
work page 2017
-
[17]
(more) efficient reinforcement learning via posterior sampling
Osband, I., Russo, D., and Van Roy, B. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pp. 3003–3011, 2013
work page 2013
-
[18]
Learning unknown markov decision processes: A thompson sampling approach
Ouyang, Y ., Gagrani, M., Nayyar, A., and Jain, R. Learning unknown markov decision processes: A thompson sampling approach. In Advances in Neural Information Processing Systems, pp. 1333–1342, 2017
work page 2017
-
[19]
Psaraftis, H. N., Wen, M., and Kontovas, C. A. Dynamic vehicle routing problems: Three decades and counting. Networks, 67(1):3–31, 2016
work page 2016
-
[20]
Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014
work page 2014
-
[21]
Sturm’s theorem with endpoints
Pébay, P., Rojas, J., and C Thompson, D. Sturm’s theorem with endpoints. 07 2019
work page 2019
-
[22]
Contextual combinatorial bandit and its application on diversified online recommendation
Qin, L., Chen, S., and Zhu, X. Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining , pp. 461–469. SIAM, 2014
work page 2014
-
[23]
Improved inequalities for the poisson and binomial distribution and upper tail quantile functions
Short, M. Improved inequalities for the poisson and binomial distribution and upper tail quantile functions. ISRN Probability and Statistics, 2013, 2013. 9
work page 2013
-
[24]
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[25]
Mastering the game of go without human knowledge
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017
work page 2017
-
[26]
Thulin, M. et al. The cost of using exact confidence intervals for a binomial proportion. Electronic Journal of Statistics, 8(1):817–840, 2014
work page 2014
-
[27]
Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities
Tossou, A., Basu, D., and Dimitrakakis, C. Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities. arXiv e-prints, art. arXiv:1905.12425, May 2019
-
[28]
Fundamental problems of algorithmic algebra, volume 49
Yap, C.-K. Fundamental problems of algorithmic algebra, volume 49. Oxford University Press Oxford, 2000
work page 2000
-
[29]
Zubkov, A. M. and Serov, A. A. A complete proof of universal inequalities for the distribution function of the binomial law. Theory of Probability & Its Applications, 57(3):539–544, 2013. A Proofs A.1 Proof of Theorem 1 Our proof is a direct application of the generic proof provided in Section B.2 of Tossou et al.[27]. To use that generic proof we need to...
work page 2013
-
[30]
Alsoq −x = 1 −p −x = 1 − k n ≥ 0 so thatx ≤q
Now let’s observe that since sgn(k n −p) = 1, it means thatx ≥ 0. Alsoq −x = 1 −p −x = 1 − k n ≥ 0 so thatx ≤q. So the condition of Theorem 4 are satisfied and our goal becomes finding anx ≥ 0 such that: x2 2(pq +x(q −p)/3) ≥ Φ−1(1 −δ)2 2n Solving for this inequality leads to the upper bound part of the Theorem. Proof of the Lower bound IfQ(Binom(n,p ), 1−δ...
-
[31]
We will try to find this number
≤k. We will try to find this number. Let’s observe thatQ(Binom(n,p ), 1
-
[32]
So, k ≥Q(Binom(n,p ), 1 −δ) (13) ≥Q(Binom(n,p ), 1
is the (smallest) median of the binomial distribution and thus we have:Q(Binom(n,p ), 1 2 ≥ ⌊np⌋ [12]. So, k ≥Q(Binom(n,p ), 1 −δ) (13) ≥Q(Binom(n,p ), 1
-
[33]
Then our objective is to find ak ≥ ⌊np⌋ satisfying (11)
(14) ≥ ⌊np⌋ (15) As a result, we havek + 1 ≥ ⌊np⌋ + 1>np and sgn(k+1 n −p) = 1. Then our objective is to find ak ≥ ⌊np⌋ satisfying (11). Letx a number such that k+1 n =p +x. Applying the inverse Φ−1 to (11) and replacing k+1 n byx, our objective becomes finding anx ≥ 0 such that: D(p +x ∥p) ≤ Φ−1(1 −δ)2 2n 12 Our objective is equivalent to finding anx such t...
-
[34]
This completes the proof for this lemma
We can do that since even if we are looking for a lower bound, all the term previously upper bounded in Lemma 2 are multiplied by −. This completes the proof for this lemma. A.2 Useful Results Proposition 4 (Lower and Upper Bound on Bernoulli KL-Divergence). For any numberp andx such that 0<p< 1, 0 ≤x ≤q whereq = 1 −p, we have: x2 2(pq +x(q −p)/3) ≤D(p +x...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.