pith. sign in

arxiv: 1906.09114 · v2 · pith:QC3QNDPXnew · submitted 2019-06-20 · 💻 cs.LG · cs.AI· cs.GT· stat.ML

Near-optimal Bayesian Solution For Unknown Discrete Markov Decision Process

Pith reviewed 2026-05-25 20:04 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.GTstat.ML
keywords Markov decision processesBayesian reinforcement learningregret boundsfrequentist guaranteesexploration-exploitationpolynomial-time algorithmsconcentration inequalities
0
0 comments X

The pith

A polynomial-time Bayesian algorithm achieves near-optimal frequentist regret for unknown finite MDPs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BUCRL as the first polynomial-time Bayesian method for acting optimally in an unknown discrete MDP whose diameter is finite but unknown to the learner. It shows that BUCRL attains a regret of order Õ(√(DSAT)) with high probability against the worst-case optimal policy, matching known lower bounds up to logarithmic factors. The guarantee is for frequentist regret rather than the weaker Bayesian regret that averages over a prior. The result is supported by new elementary concentration bounds on Bernoulli KL divergence and on Beta and binomial quantiles.

Core claim

We derive the first polynomial time Bayesian algorithm, BUCRL that achieves up to logarithm factors, a regret of the optimal order Õ(√(DSAT)). Importantly, our result holds with high probability for the worst-case (frequentist) regret and not the weaker notion of Bayesian regret.

What carries the argument

BUCRL, a Bayesian algorithm that maintains a posterior over unknown transition and reward distributions while using diameter-bounded planning to control exploration and deliver frequentist regret bounds.

If this is right

  • BUCRL runs in time polynomial in S, A, T and D even though T and D are unknown.
  • The same regret order holds for the frequentist worst-case regret, not merely in expectation over a prior.
  • The algorithm outperforms prior methods in experiments across multiple environments.
  • The analysis supplies a sharper upper bound on the KL divergence between any two Bernoulli distributions.
  • Sharper elementary upper and lower bounds are obtained for the quantiles of Beta and binomial distributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar posterior-sampling techniques may be adapted to obtain frequentist optimality in other Bayesian RL settings without losing computational tractability.
  • The new quantile and KL bounds can be reused to tighten analyses in any algorithm that relies on Beta or Bernoulli concentration.
  • If an online estimator for diameter can be incorporated, the regret guarantee could extend to MDPs where D is not known to be finite in advance.

Load-bearing premise

The unknown MDP has a finite diameter D that bounds the expected shortest path between any pair of states.

What would settle it

An MDP with finite S and A where BUCRL's realized regret exceeds C √(DSAT) times polylog factors for some constant C, over a worst-case choice of transitions and rewards.

Figures

Figures reproduced from arXiv: 1906.09114 by Aristide Tossou, Christos Dimitrakakis, Debabrota Basu.

Figure 1
Figure 1. Figure 1: Time evolution of average regret for BUCRL, UCRL-V, TSDE, KL-UCRL, and UCRL2. [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
read the original abstract

We tackle the problem of acting in an unknown finite and discrete Markov Decision Process (MDP) for which the expected shortest path from any state to any other state is bounded by a finite number $D$. An MDP consists of $S$ states and $A$ possible actions per state. Upon choosing an action $a_t$ at state $s_t$, one receives a real value reward $r_t$, then one transits to a next state $s_{t+1}$. The reward $r_t$ is generated from a fixed reward distribution depending only on $(s_t, a_t)$ and similarly, the next state $s_{t+1}$ is generated from a fixed transition distribution depending only on $(s_t, a_t)$. The objective is to maximize the accumulated rewards after $T$ interactions. In this paper, we consider the case where the reward distributions, the transitions, $T$ and $D$ are all unknown. We derive the first polynomial time Bayesian algorithm, BUCRL{} that achieves up to logarithm factors, a regret (i.e the difference between the accumulated rewards of the optimal policy and our algorithm) of the optimal order $\tilde{\mathcal{O}}(\sqrt{DSAT})$. Importantly, our result holds with high probability for the worst-case (frequentist) regret and not the weaker notion of Bayesian regret. We perform experiments in a variety of environments that demonstrate the superiority of our algorithm over previous techniques. Our work also illustrates several results that will be of independent interest. In particular, we derive a sharper upper bound for the KL-divergence of Bernoulli random variables. We also derive sharper upper and lower bounds for Beta and Binomial quantiles. All the bound are very simple and only use elementary functions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents BUCRL, the first polynomial-time Bayesian algorithm for unknown finite discrete MDPs whose diameter is bounded by some finite (but unknown) D. The central claim is that BUCRL attains the near-optimal frequentist regret bound Õ(√(DSAT)) with high probability (not merely Bayesian regret), without knowledge of the horizon T or the diameter D. The paper also derives sharper elementary bounds on the KL divergence between Bernoulli random variables and on Beta and Binomial quantiles, and reports experiments showing superiority over prior methods.

Significance. If the regret claim holds, the result would be significant: it supplies the first poly-time Bayesian procedure whose worst-case frequentist performance matches the known information-theoretic rate up to logarithmic factors for bounded-diameter MDPs. The auxiliary bounds on KL divergence and quantiles are of independent interest and are presented in elementary closed form. The experimental comparisons provide supporting evidence but are secondary to the theoretical contribution.

major comments (2)
  1. [Abstract, §1] Abstract and §1: The frequentist high-probability regret claim of Õ(√(DSAT)) is load-bearing on the structural assumption that the unknown MDP has finite diameter D. The analysis must establish that the Bayesian posterior updates yield the stated rate uniformly over all MDPs whose diameter is at most D, without the algorithm ever receiving D or estimating it. No derivation, concentration argument, or error decomposition is supplied in the abstract to confirm this uniformity.
  2. [§1, related-work discussion] The claim that the result is the first polynomial-time Bayesian algorithm achieving this frequentist rate requires explicit comparison to prior work on posterior sampling or Bayesian exploration in MDPs; if the proof in the analysis section reduces to standard UCRL-style optimism after posterior sampling, the novelty of the Bayesian component needs to be clarified.
minor comments (2)
  1. [Abstract] Notation: the diameter D is introduced as unknown yet the regret expression contains D; a brief remark on how the algorithm remains well-defined when D is unavailable would improve clarity.
  2. [Experiments] The experiments section would benefit from reporting the number of independent runs and confidence intervals on the plotted regret curves.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and for recognizing the potential significance of the result. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract, §1] Abstract and §1: The frequentist high-probability regret claim of Õ(√(DSAT)) is load-bearing on the structural assumption that the unknown MDP has finite diameter D. The analysis must establish that the Bayesian posterior updates yield the stated rate uniformly over all MDPs whose diameter is at most D, without the algorithm ever receiving D or estimating it. No derivation, concentration argument, or error decomposition is supplied in the abstract to confirm this uniformity.

    Authors: The abstract is a concise summary and does not contain full derivations; the complete analysis establishing uniformity appears in Sections 3–5. The proof shows that posterior sampling combined with the subsequent optimistic planning yields the Õ(√(DSAT)) high-probability frequentist bound uniformly over all MDPs with diameter at most the unknown D, without the algorithm receiving or estimating D. The diameter enters only the analysis (via the standard diameter-dependent concentration and covering arguments), not the algorithm itself. We can add one sentence to the abstract referencing this uniformity if the editor deems it necessary. revision: partial

  2. Referee: [§1, related-work discussion] The claim that the result is the first polynomial-time Bayesian algorithm achieving this frequentist rate requires explicit comparison to prior work on posterior sampling or Bayesian exploration in MDPs; if the proof in the analysis section reduces to standard UCRL-style optimism after posterior sampling, the novelty of the Bayesian component needs to be clarified.

    Authors: We will expand the related-work discussion in §1 to include direct comparisons with prior posterior-sampling and Bayesian exploration results for MDPs (e.g., Thompson sampling variants and Bayesian UCRL-style methods). The novelty is that BUCRL is the first polynomial-time Bayesian procedure whose analysis delivers a high-probability frequentist regret bound matching the information-theoretic rate up to logs, without knowledge of T or D. While the algorithm does draw a model from the posterior and then applies an optimistic planner, the Bayesian posterior is essential: it supplies the concentration that holds uniformly over the unknown bounded-diameter class without requiring the algorithm to know or estimate D, which prior frequentist methods needed explicitly. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivation is self-contained.

full rationale

The paper introduces the BUCRL algorithm as a new polynomial-time Bayesian procedure for finite MDPs with unknown but bounded diameter D, deriving a high-probability frequentist regret bound of Õ(√(DSAT)). It also presents independent auxiliary results on KL-divergence bounds for Bernoulli variables and sharper Beta/Binomial quantile bounds using elementary functions. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claim to prior author work are present. The diameter assumption is an explicit class restriction on the unknown MDP, not a circular input, and the regret analysis is positioned against external benchmarks without reducing to the algorithm's own fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim rests on the domain assumption of bounded-diameter MDPs with unknown parameters; additional bounds use standard properties of Bernoulli, Beta, and Binomial distributions.

axioms (1)
  • domain assumption The unknown MDP has finite S states, A actions, and finite diameter D (expected shortest path from any state to any other bounded by D).
    Explicitly stated as the problem setting in the abstract.

pith-pipeline@v0.9.0 · 5867 in / 1170 out tokens · 29588 ms · 2026-05-25T20:04:43.453556+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 5 internal anchors

  1. [1]

    and Goyal, N

    Agrawal, S. and Goyal, N. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39–1, 2012

  2. [2]

    Minimax Regret Bounds for Reinforcement Learning

    Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449, 2017

  3. [3]

    Babu, G. J. and Rao, C. R. Joint asymptotic distribution of marginal quantiles and quantile functions in samples from a multivariate population. In Multivariate Statistics and Probability, pp. 15–23. Elsevier, 1989

  4. [4]

    Bartlett, P. L. and Tewari, A. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pp. 35–42. AUAI Press, 2009

  5. [5]

    Chiani, M., Dardari, D., and Simon, M. K. New exponential bounds and approximations for the computation of error probability in fading channels. IEEE Transactions on Wireless Communications, 2 (4):840–845, 2003

  6. [6]

    S., Scholz, M., and Sunde, J

    Dragomir, S. S., Scholz, M., and Sunde, J. Some upper bounds for relative entropy and applications. Computers & Mathematics with Applications, 39(9-10):91–100, 2000

  7. [7]

    Optimism in reinforcement learning and kullback-leibler diver- gence

    Filippi, S., Cappé, O., and Garivier, A. Optimism in reinforcement learning and kullback-leibler diver- gence. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pp. 115–122. IEEE, 2010. 8

  8. [8]

    Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning

    Fruit, R., Pirotta, M., Lazaric, A., and Ortner, R. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. arXiv preprint arXiv:1802.04020, 2018

  9. [9]

    W., Ghate, A., and Gunn, M

    Gocgun, Y ., Bresnahan, B. W., Ghate, A., and Gunn, M. L. A markov decision process approach to multi-category patient scheduling in a diagnostic facility. Artificial intelligence in medicine , 53(2): 73–81, 2011

  10. [10]

    Near-optimal regret bounds for reinforcement learning

    Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010

  11. [11]

    Large deviation inequalities for sums of indicator variables

    Janson, S. Large deviation inequalities for sums of indicator variables.arXiv preprint arXiv:1609.00533, 2016

  12. [12]

    and Buhrman, J

    Kaas, R. and Buhrman, J. M. Mean, median and mode in binomial distributions. Statistica Neerlandica, 34(1):13–18, 1980

  13. [13]

    On bayesian upper confidence bounds for bandit problems

    Kaufmann, E., Garivier, A., and Paristech, T. On bayesian upper confidence bounds for bandit problems. In In AISTATS, 2012

  14. [14]

    Showing relevant ads via context multi-armed bandits

    Lu, T., Pál, D., and Pál, M. Showing relevant ads via context multi-armed bandits. In Proceedings of AISTATS, 2009

  15. [15]

    Posterior Sampling for Reinforcement Learning Without Episodes

    Osband, I. and Van Roy, B. Posterior sampling for reinforcement learning without episodes. arXiv preprint arXiv:1608.02731, 2016

  16. [16]

    and Van Roy, B

    Osband, I. and Van Roy, B. Why is posterior sampling better than optimism for reinforcement learning? In Precup, D. and Teh, Y . W. (eds.),Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2701–2710, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL ...

  17. [17]

    (more) efficient reinforcement learning via posterior sampling

    Osband, I., Russo, D., and Van Roy, B. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pp. 3003–3011, 2013

  18. [18]

    Learning unknown markov decision processes: A thompson sampling approach

    Ouyang, Y ., Gagrani, M., Nayyar, A., and Jain, R. Learning unknown markov decision processes: A thompson sampling approach. In Advances in Neural Information Processing Systems, pp. 1333–1342, 2017

  19. [19]

    N., Wen, M., and Kontovas, C

    Psaraftis, H. N., Wen, M., and Kontovas, C. A. Dynamic vehicle routing problems: Three decades and counting. Networks, 67(1):3–31, 2016

  20. [20]

    Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014

  21. [21]

    Sturm’s theorem with endpoints

    Pébay, P., Rojas, J., and C Thompson, D. Sturm’s theorem with endpoints. 07 2019

  22. [22]

    Contextual combinatorial bandit and its application on diversified online recommendation

    Qin, L., Chen, S., and Zhu, X. Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining , pp. 461–469. SIAM, 2014

  23. [23]

    Improved inequalities for the poisson and binomial distribution and upper tail quantile functions

    Short, M. Improved inequalities for the poisson and binomial distribution and upper tail quantile functions. ISRN Probability and Statistics, 2013, 2013. 9

  24. [24]

    Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

    Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017

  25. [25]

    Mastering the game of go without human knowledge

    Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017

  26. [26]

    Thulin, M. et al. The cost of using exact confidence intervals for a binomial proportion. Electronic Journal of Statistics, 8(1):817–840, 2014

  27. [27]

    Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities

    Tossou, A., Basu, D., and Dimitrakakis, C. Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities. arXiv e-prints, art. arXiv:1905.12425, May 2019

  28. [28]

    Fundamental problems of algorithmic algebra, volume 49

    Yap, C.-K. Fundamental problems of algorithmic algebra, volume 49. Oxford University Press Oxford, 2000

  29. [29]

    Zubkov, A. M. and Serov, A. A. A complete proof of universal inequalities for the distribution function of the binomial law. Theory of Probability & Its Applications, 57(3):539–544, 2013. A Proofs A.1 Proof of Theorem 1 Our proof is a direct application of the generic proof provided in Section B.2 of Tossou et al.[27]. To use that generic proof we need to...

  30. [30]

    Alsoq −x = 1 −p −x = 1 − k n ≥ 0 so thatx ≤q

    Now let’s observe that since sgn(k n −p) = 1, it means thatx ≥ 0. Alsoq −x = 1 −p −x = 1 − k n ≥ 0 so thatx ≤q. So the condition of Theorem 4 are satisfied and our goal becomes finding anx ≥ 0 such that: x2 2(pq +x(q −p)/3) ≥ Φ−1(1 −δ)2 2n Solving for this inequality leads to the upper bound part of the Theorem. Proof of the Lower bound IfQ(Binom(n,p ), 1−δ...

  31. [31]

    We will try to find this number

    ≤k. We will try to find this number. Let’s observe thatQ(Binom(n,p ), 1

  32. [32]

    So, k ≥Q(Binom(n,p ), 1 −δ) (13) ≥Q(Binom(n,p ), 1

    is the (smallest) median of the binomial distribution and thus we have:Q(Binom(n,p ), 1 2 ≥ ⌊np⌋ [12]. So, k ≥Q(Binom(n,p ), 1 −δ) (13) ≥Q(Binom(n,p ), 1

  33. [33]

    Then our objective is to find ak ≥ ⌊np⌋ satisfying (11)

    (14) ≥ ⌊np⌋ (15) As a result, we havek + 1 ≥ ⌊np⌋ + 1>np and sgn(k+1 n −p) = 1. Then our objective is to find ak ≥ ⌊np⌋ satisfying (11). Letx a number such that k+1 n =p +x. Applying the inverse Φ−1 to (11) and replacing k+1 n byx, our objective becomes finding anx ≥ 0 such that: D(p +x ∥p) ≤ Φ−1(1 −δ)2 2n 12 Our objective is equivalent to finding anx such t...

  34. [34]

    This completes the proof for this lemma

    We can do that since even if we are looking for a lower bound, all the term previously upper bounded in Lemma 2 are multiplied by −. This completes the proof for this lemma. A.2 Useful Results Proposition 4 (Lower and Upper Bound on Bernoulli KL-Divergence). For any numberp andx such that 0<p< 1, 0 ≤x ≤q whereq = 1 −p, we have: x2 2(pq +x(q −p)/3) ≤D(p +x...