Near-optimal Bayesian Solution For Unknown Discrete Markov Decision Process

Aristide Tossou; Christos Dimitrakakis; Debabrota Basu

arxiv: 1906.09114 · v2 · pith:QC3QNDPXnew · submitted 2019-06-20 · 💻 cs.LG · cs.AI· cs.GT· stat.ML

Near-optimal Bayesian Solution For Unknown Discrete Markov Decision Process

Aristide Tossou , Christos Dimitrakakis , Debabrota Basu This is my paper

Pith reviewed 2026-05-25 20:04 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.GTstat.ML

keywords Markov decision processesBayesian reinforcement learningregret boundsfrequentist guaranteesexploration-exploitationpolynomial-time algorithmsconcentration inequalities

0 comments

The pith

A polynomial-time Bayesian algorithm achieves near-optimal frequentist regret for unknown finite MDPs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BUCRL as the first polynomial-time Bayesian method for acting optimally in an unknown discrete MDP whose diameter is finite but unknown to the learner. It shows that BUCRL attains a regret of order Õ(√(DSAT)) with high probability against the worst-case optimal policy, matching known lower bounds up to logarithmic factors. The guarantee is for frequentist regret rather than the weaker Bayesian regret that averages over a prior. The result is supported by new elementary concentration bounds on Bernoulli KL divergence and on Beta and binomial quantiles.

Core claim

We derive the first polynomial time Bayesian algorithm, BUCRL that achieves up to logarithm factors, a regret of the optimal order Õ(√(DSAT)). Importantly, our result holds with high probability for the worst-case (frequentist) regret and not the weaker notion of Bayesian regret.

What carries the argument

BUCRL, a Bayesian algorithm that maintains a posterior over unknown transition and reward distributions while using diameter-bounded planning to control exploration and deliver frequentist regret bounds.

If this is right

BUCRL runs in time polynomial in S, A, T and D even though T and D are unknown.
The same regret order holds for the frequentist worst-case regret, not merely in expectation over a prior.
The algorithm outperforms prior methods in experiments across multiple environments.
The analysis supplies a sharper upper bound on the KL divergence between any two Bernoulli distributions.
Sharper elementary upper and lower bounds are obtained for the quantiles of Beta and binomial distributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar posterior-sampling techniques may be adapted to obtain frequentist optimality in other Bayesian RL settings without losing computational tractability.
The new quantile and KL bounds can be reused to tighten analyses in any algorithm that relies on Beta or Bernoulli concentration.
If an online estimator for diameter can be incorporated, the regret guarantee could extend to MDPs where D is not known to be finite in advance.

Load-bearing premise

The unknown MDP has a finite diameter D that bounds the expected shortest path between any pair of states.

What would settle it

An MDP with finite S and A where BUCRL's realized regret exceeds C √(DSAT) times polylog factors for some constant C, over a worst-case choice of transitions and rewards.

Figures

Figures reproduced from arXiv: 1906.09114 by Aristide Tossou, Christos Dimitrakakis, Debabrota Basu.

read the original abstract

We tackle the problem of acting in an unknown finite and discrete Markov Decision Process (MDP) for which the expected shortest path from any state to any other state is bounded by a finite number $D$. An MDP consists of $S$ states and $A$ possible actions per state. Upon choosing an action $a_t$ at state $s_t$, one receives a real value reward $r_t$, then one transits to a next state $s_{t+1}$. The reward $r_t$ is generated from a fixed reward distribution depending only on $(s_t, a_t)$ and similarly, the next state $s_{t+1}$ is generated from a fixed transition distribution depending only on $(s_t, a_t)$. The objective is to maximize the accumulated rewards after $T$ interactions. In this paper, we consider the case where the reward distributions, the transitions, $T$ and $D$ are all unknown. We derive the first polynomial time Bayesian algorithm, BUCRL{} that achieves up to logarithm factors, a regret (i.e the difference between the accumulated rewards of the optimal policy and our algorithm) of the optimal order $\tilde{\mathcal{O}}(\sqrt{DSAT})$. Importantly, our result holds with high probability for the worst-case (frequentist) regret and not the weaker notion of Bayesian regret. We perform experiments in a variety of environments that demonstrate the superiority of our algorithm over previous techniques. Our work also illustrates several results that will be of independent interest. In particular, we derive a sharper upper bound for the KL-divergence of Bernoulli random variables. We also derive sharper upper and lower bounds for Beta and Binomial quantiles. All the bound are very simple and only use elementary functions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BUCRL is the first claimed poly-time Bayesian method with Õ(√(DSAT)) frequentist regret for bounded-diameter MDPs with unknown T and D, plus some elementary quantile and KL bounds.

read the letter

The central point is that this paper gives BUCRL, a Bayesian algorithm that runs in polynomial time and matches the optimal frequentist regret rate Õ(√(DSAT)) with high probability for finite MDPs whose diameter is at most some unknown D. It also supplies a few supporting bounds on Bernoulli KL divergence and on Beta and Binomial quantiles that use only elementary functions. Those bounds look like they could be reusable on their own. The experiments are said to show better performance than prior methods across several environments, which is at least consistent with the claim. The work sits in the line of UCRL-style frequentist analyses but moves the algorithm itself into the Bayesian setting while preserving the worst-case guarantee rather than settling for Bayesian regret. That distinction matters for the subfield. The main soft spot is that the abstract states the regret result and the new bounds without any derivation outline or error analysis, so independent verification of the central argument requires the full proofs. The bounded-diameter assumption is load-bearing: if the unknown MDP violates it, the rate does not hold and linear regret remains possible, and the algorithm never receives D as input. That is an explicit modeling choice rather than a hidden flaw, but it limits the scope. The citation pattern looks standard for the area. This paper is for people working on regret bounds and exploration in MDPs who want to see a Bayesian route to the same frequentist rate. It is coherent on its own terms and makes a concrete claim that, if the proofs check out, fills a stated gap, so it deserves a serious referee to examine the derivations and the quantile bounds.

Referee Report

2 major / 2 minor

Summary. The manuscript presents BUCRL, the first polynomial-time Bayesian algorithm for unknown finite discrete MDPs whose diameter is bounded by some finite (but unknown) D. The central claim is that BUCRL attains the near-optimal frequentist regret bound Õ(√(DSAT)) with high probability (not merely Bayesian regret), without knowledge of the horizon T or the diameter D. The paper also derives sharper elementary bounds on the KL divergence between Bernoulli random variables and on Beta and Binomial quantiles, and reports experiments showing superiority over prior methods.

Significance. If the regret claim holds, the result would be significant: it supplies the first poly-time Bayesian procedure whose worst-case frequentist performance matches the known information-theoretic rate up to logarithmic factors for bounded-diameter MDPs. The auxiliary bounds on KL divergence and quantiles are of independent interest and are presented in elementary closed form. The experimental comparisons provide supporting evidence but are secondary to the theoretical contribution.

major comments (2)

[Abstract, §1] Abstract and §1: The frequentist high-probability regret claim of Õ(√(DSAT)) is load-bearing on the structural assumption that the unknown MDP has finite diameter D. The analysis must establish that the Bayesian posterior updates yield the stated rate uniformly over all MDPs whose diameter is at most D, without the algorithm ever receiving D or estimating it. No derivation, concentration argument, or error decomposition is supplied in the abstract to confirm this uniformity.
[§1, related-work discussion] The claim that the result is the first polynomial-time Bayesian algorithm achieving this frequentist rate requires explicit comparison to prior work on posterior sampling or Bayesian exploration in MDPs; if the proof in the analysis section reduces to standard UCRL-style optimism after posterior sampling, the novelty of the Bayesian component needs to be clarified.

minor comments (2)

[Abstract] Notation: the diameter D is introduced as unknown yet the regret expression contains D; a brief remark on how the algorithm remains well-defined when D is unavailable would improve clarity.
[Experiments] The experiments section would benefit from reporting the number of independent runs and confidence intervals on the plotted regret curves.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and for recognizing the potential significance of the result. We address each major comment below.

read point-by-point responses

Referee: [Abstract, §1] Abstract and §1: The frequentist high-probability regret claim of Õ(√(DSAT)) is load-bearing on the structural assumption that the unknown MDP has finite diameter D. The analysis must establish that the Bayesian posterior updates yield the stated rate uniformly over all MDPs whose diameter is at most D, without the algorithm ever receiving D or estimating it. No derivation, concentration argument, or error decomposition is supplied in the abstract to confirm this uniformity.

Authors: The abstract is a concise summary and does not contain full derivations; the complete analysis establishing uniformity appears in Sections 3–5. The proof shows that posterior sampling combined with the subsequent optimistic planning yields the Õ(√(DSAT)) high-probability frequentist bound uniformly over all MDPs with diameter at most the unknown D, without the algorithm receiving or estimating D. The diameter enters only the analysis (via the standard diameter-dependent concentration and covering arguments), not the algorithm itself. We can add one sentence to the abstract referencing this uniformity if the editor deems it necessary. revision: partial
Referee: [§1, related-work discussion] The claim that the result is the first polynomial-time Bayesian algorithm achieving this frequentist rate requires explicit comparison to prior work on posterior sampling or Bayesian exploration in MDPs; if the proof in the analysis section reduces to standard UCRL-style optimism after posterior sampling, the novelty of the Bayesian component needs to be clarified.

Authors: We will expand the related-work discussion in §1 to include direct comparisons with prior posterior-sampling and Bayesian exploration results for MDPs (e.g., Thompson sampling variants and Bayesian UCRL-style methods). The novelty is that BUCRL is the first polynomial-time Bayesian procedure whose analysis delivers a high-probability frequentist regret bound matching the information-theoretic rate up to logs, without knowledge of T or D. While the algorithm does draw a model from the posterior and then applies an optimistic planner, the Bayesian posterior is essential: it supplies the concentration that holds uniformly over the unknown bounded-diameter class without requiring the algorithm to know or estimate D, which prior frequentist methods needed explicitly. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivation is self-contained.

full rationale

The paper introduces the BUCRL algorithm as a new polynomial-time Bayesian procedure for finite MDPs with unknown but bounded diameter D, deriving a high-probability frequentist regret bound of Õ(√(DSAT)). It also presents independent auxiliary results on KL-divergence bounds for Bernoulli variables and sharper Beta/Binomial quantile bounds using elementary functions. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claim to prior author work are present. The diameter assumption is an explicit class restriction on the unknown MDP, not a circular input, and the regret analysis is positioned against external benchmarks without reducing to the algorithm's own fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim rests on the domain assumption of bounded-diameter MDPs with unknown parameters; additional bounds use standard properties of Bernoulli, Beta, and Binomial distributions.

axioms (1)

domain assumption The unknown MDP has finite S states, A actions, and finite diameter D (expected shortest path from any state to any other bounded by D).
Explicitly stated as the problem setting in the abstract.

pith-pipeline@v0.9.0 · 5867 in / 1170 out tokens · 29588 ms · 2026-05-25T20:04:43.453556+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 5 internal anchors

[1]

and Goyal, N

Agrawal, S. and Goyal, N. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39–1, 2012

work page 2012
[2]

Minimax Regret Bounds for Reinforcement Learning

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[3]

Babu, G. J. and Rao, C. R. Joint asymptotic distribution of marginal quantiles and quantile functions in samples from a multivariate population. In Multivariate Statistics and Probability, pp. 15–23. Elsevier, 1989

work page 1989
[4]

Bartlett, P. L. and Tewari, A. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artiﬁcial Intelligence, UAI ’09, pp. 35–42. AUAI Press, 2009

work page 2009
[5]

Chiani, M., Dardari, D., and Simon, M. K. New exponential bounds and approximations for the computation of error probability in fading channels. IEEE Transactions on Wireless Communications, 2 (4):840–845, 2003

work page 2003
[6]

S., Scholz, M., and Sunde, J

Dragomir, S. S., Scholz, M., and Sunde, J. Some upper bounds for relative entropy and applications. Computers & Mathematics with Applications, 39(9-10):91–100, 2000

work page 2000
[7]

Optimism in reinforcement learning and kullback-leibler diver- gence

Filippi, S., Cappé, O., and Garivier, A. Optimism in reinforcement learning and kullback-leibler diver- gence. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pp. 115–122. IEEE, 2010. 8

work page 2010
[8]

Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning

Fruit, R., Pirotta, M., Lazaric, A., and Ortner, R. Efﬁcient bias-span-constrained exploration-exploitation in reinforcement learning. arXiv preprint arXiv:1802.04020, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[9]

W., Ghate, A., and Gunn, M

Gocgun, Y ., Bresnahan, B. W., Ghate, A., and Gunn, M. L. A markov decision process approach to multi-category patient scheduling in a diagnostic facility. Artiﬁcial intelligence in medicine , 53(2): 73–81, 2011

work page 2011
[10]

Near-optimal regret bounds for reinforcement learning

Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010

work page 2010
[11]

Large deviation inequalities for sums of indicator variables

Janson, S. Large deviation inequalities for sums of indicator variables.arXiv preprint arXiv:1609.00533, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[12]

and Buhrman, J

Kaas, R. and Buhrman, J. M. Mean, median and mode in binomial distributions. Statistica Neerlandica, 34(1):13–18, 1980

work page 1980
[13]

On bayesian upper conﬁdence bounds for bandit problems

Kaufmann, E., Garivier, A., and Paristech, T. On bayesian upper conﬁdence bounds for bandit problems. In In AISTATS, 2012

work page 2012
[14]

Showing relevant ads via context multi-armed bandits

Lu, T., Pál, D., and Pál, M. Showing relevant ads via context multi-armed bandits. In Proceedings of AISTATS, 2009

work page 2009
[15]

Posterior Sampling for Reinforcement Learning Without Episodes

Osband, I. and Van Roy, B. Posterior sampling for reinforcement learning without episodes. arXiv preprint arXiv:1608.02731, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[16]

and Van Roy, B

Osband, I. and Van Roy, B. Why is posterior sampling better than optimism for reinforcement learning? In Precup, D. and Teh, Y . W. (eds.),Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2701–2710, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL ...

work page 2017
[17]

(more) efﬁcient reinforcement learning via posterior sampling

Osband, I., Russo, D., and Van Roy, B. (more) efﬁcient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pp. 3003–3011, 2013

work page 2013
[18]

Learning unknown markov decision processes: A thompson sampling approach

Ouyang, Y ., Gagrani, M., Nayyar, A., and Jain, R. Learning unknown markov decision processes: A thompson sampling approach. In Advances in Neural Information Processing Systems, pp. 1333–1342, 2017

work page 2017
[19]

N., Wen, M., and Kontovas, C

Psaraftis, H. N., Wen, M., and Kontovas, C. A. Dynamic vehicle routing problems: Three decades and counting. Networks, 67(1):3–31, 2016

work page 2016
[20]

Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014

work page 2014
[21]

Sturm’s theorem with endpoints

Pébay, P., Rojas, J., and C Thompson, D. Sturm’s theorem with endpoints. 07 2019

work page 2019
[22]

Contextual combinatorial bandit and its application on diversiﬁed online recommendation

Qin, L., Chen, S., and Zhu, X. Contextual combinatorial bandit and its application on diversiﬁed online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining , pp. 461–469. SIAM, 2014

work page 2014
[23]

Improved inequalities for the poisson and binomial distribution and upper tail quantile functions

Short, M. Improved inequalities for the poisson and binomial distribution and upper tail quantile functions. ISRN Probability and Statistics, 2013, 2013. 9

work page 2013
[24]

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[25]

Mastering the game of go without human knowledge

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017

work page 2017
[26]

Thulin, M. et al. The cost of using exact conﬁdence intervals for a binomial proportion. Electronic Journal of Statistics, 8(1):817–840, 2014

work page 2014
[27]

Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities

Tossou, A., Basu, D., and Dimitrakakis, C. Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities. arXiv e-prints, art. arXiv:1905.12425, May 2019

work page arXiv 1905
[28]

Fundamental problems of algorithmic algebra, volume 49

Yap, C.-K. Fundamental problems of algorithmic algebra, volume 49. Oxford University Press Oxford, 2000

work page 2000
[29]

Zubkov, A. M. and Serov, A. A. A complete proof of universal inequalities for the distribution function of the binomial law. Theory of Probability & Its Applications, 57(3):539–544, 2013. A Proofs A.1 Proof of Theorem 1 Our proof is a direct application of the generic proof provided in Section B.2 of Tossou et al.[27]. To use that generic proof we need to...

work page 2013
[30]

Alsoq −x = 1 −p −x = 1 − k n ≥ 0 so thatx ≤q

Now let’s observe that since sgn(k n −p) = 1, it means thatx ≥ 0. Alsoq −x = 1 −p −x = 1 − k n ≥ 0 so thatx ≤q. So the condition of Theorem 4 are satisﬁed and our goal becomes ﬁnding anx ≥ 0 such that: x2 2(pq +x(q −p)/3) ≥ Φ−1(1 −δ)2 2n Solving for this inequality leads to the upper bound part of the Theorem. Proof of the Lower bound IfQ(Binom(n,p ), 1−δ...

work page
[31]

We will try to ﬁnd this number

≤k. We will try to ﬁnd this number. Let’s observe thatQ(Binom(n,p ), 1

work page
[32]

So, k ≥Q(Binom(n,p ), 1 −δ) (13) ≥Q(Binom(n,p ), 1

is the (smallest) median of the binomial distribution and thus we have:Q(Binom(n,p ), 1 2 ≥ ⌊np⌋ [12]. So, k ≥Q(Binom(n,p ), 1 −δ) (13) ≥Q(Binom(n,p ), 1

work page
[33]

Then our objective is to ﬁnd ak ≥ ⌊np⌋ satisfying (11)

(14) ≥ ⌊np⌋ (15) As a result, we havek + 1 ≥ ⌊np⌋ + 1>np and sgn(k+1 n −p) = 1. Then our objective is to ﬁnd ak ≥ ⌊np⌋ satisfying (11). Letx a number such that k+1 n =p +x. Applying the inverse Φ−1 to (11) and replacing k+1 n byx, our objective becomes ﬁnding anx ≥ 0 such that: D(p +x ∥p) ≤ Φ−1(1 −δ)2 2n 12 Our objective is equivalent to ﬁnding anx such t...

work page
[34]

This completes the proof for this lemma

We can do that since even if we are looking for a lower bound, all the term previously upper bounded in Lemma 2 are multiplied by −. This completes the proof for this lemma. A.2 Useful Results Proposition 4 (Lower and Upper Bound on Bernoulli KL-Divergence). For any numberp andx such that 0<p< 1, 0 ≤x ≤q whereq = 1 −p, we have: x2 2(pq +x(q −p)/3) ≤D(p +x...

work page

[1] [1]

and Goyal, N

Agrawal, S. and Goyal, N. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39–1, 2012

work page 2012

[2] [2]

Minimax Regret Bounds for Reinforcement Learning

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[3] [3]

Babu, G. J. and Rao, C. R. Joint asymptotic distribution of marginal quantiles and quantile functions in samples from a multivariate population. In Multivariate Statistics and Probability, pp. 15–23. Elsevier, 1989

work page 1989

[4] [4]

Bartlett, P. L. and Tewari, A. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artiﬁcial Intelligence, UAI ’09, pp. 35–42. AUAI Press, 2009

work page 2009

[5] [5]

Chiani, M., Dardari, D., and Simon, M. K. New exponential bounds and approximations for the computation of error probability in fading channels. IEEE Transactions on Wireless Communications, 2 (4):840–845, 2003

work page 2003

[6] [6]

S., Scholz, M., and Sunde, J

Dragomir, S. S., Scholz, M., and Sunde, J. Some upper bounds for relative entropy and applications. Computers & Mathematics with Applications, 39(9-10):91–100, 2000

work page 2000

[7] [7]

Optimism in reinforcement learning and kullback-leibler diver- gence

Filippi, S., Cappé, O., and Garivier, A. Optimism in reinforcement learning and kullback-leibler diver- gence. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pp. 115–122. IEEE, 2010. 8

work page 2010

[8] [8]

Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning

Fruit, R., Pirotta, M., Lazaric, A., and Ortner, R. Efﬁcient bias-span-constrained exploration-exploitation in reinforcement learning. arXiv preprint arXiv:1802.04020, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[9] [9]

W., Ghate, A., and Gunn, M

Gocgun, Y ., Bresnahan, B. W., Ghate, A., and Gunn, M. L. A markov decision process approach to multi-category patient scheduling in a diagnostic facility. Artiﬁcial intelligence in medicine , 53(2): 73–81, 2011

work page 2011

[10] [10]

Near-optimal regret bounds for reinforcement learning

Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010

work page 2010

[11] [11]

Large deviation inequalities for sums of indicator variables

Janson, S. Large deviation inequalities for sums of indicator variables.arXiv preprint arXiv:1609.00533, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[12] [12]

and Buhrman, J

Kaas, R. and Buhrman, J. M. Mean, median and mode in binomial distributions. Statistica Neerlandica, 34(1):13–18, 1980

work page 1980

[13] [13]

On bayesian upper conﬁdence bounds for bandit problems

Kaufmann, E., Garivier, A., and Paristech, T. On bayesian upper conﬁdence bounds for bandit problems. In In AISTATS, 2012

work page 2012

[14] [14]

Showing relevant ads via context multi-armed bandits

Lu, T., Pál, D., and Pál, M. Showing relevant ads via context multi-armed bandits. In Proceedings of AISTATS, 2009

work page 2009

[15] [15]

Posterior Sampling for Reinforcement Learning Without Episodes

Osband, I. and Van Roy, B. Posterior sampling for reinforcement learning without episodes. arXiv preprint arXiv:1608.02731, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[16] [16]

and Van Roy, B

Osband, I. and Van Roy, B. Why is posterior sampling better than optimism for reinforcement learning? In Precup, D. and Teh, Y . W. (eds.),Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2701–2710, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL ...

work page 2017

[17] [17]

(more) efﬁcient reinforcement learning via posterior sampling

Osband, I., Russo, D., and Van Roy, B. (more) efﬁcient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pp. 3003–3011, 2013

work page 2013

[18] [18]

Learning unknown markov decision processes: A thompson sampling approach

Ouyang, Y ., Gagrani, M., Nayyar, A., and Jain, R. Learning unknown markov decision processes: A thompson sampling approach. In Advances in Neural Information Processing Systems, pp. 1333–1342, 2017

work page 2017

[19] [19]

N., Wen, M., and Kontovas, C

Psaraftis, H. N., Wen, M., and Kontovas, C. A. Dynamic vehicle routing problems: Three decades and counting. Networks, 67(1):3–31, 2016

work page 2016

[20] [20]

Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014

work page 2014

[21] [21]

Sturm’s theorem with endpoints

Pébay, P., Rojas, J., and C Thompson, D. Sturm’s theorem with endpoints. 07 2019

work page 2019

[22] [22]

Contextual combinatorial bandit and its application on diversiﬁed online recommendation

Qin, L., Chen, S., and Zhu, X. Contextual combinatorial bandit and its application on diversiﬁed online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining , pp. 461–469. SIAM, 2014

work page 2014

[23] [23]

Improved inequalities for the poisson and binomial distribution and upper tail quantile functions

Short, M. Improved inequalities for the poisson and binomial distribution and upper tail quantile functions. ISRN Probability and Statistics, 2013, 2013. 9

work page 2013

[24] [24]

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[25] [25]

Mastering the game of go without human knowledge

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017

work page 2017

[26] [26]

Thulin, M. et al. The cost of using exact conﬁdence intervals for a binomial proportion. Electronic Journal of Statistics, 8(1):817–840, 2014

work page 2014

[27] [27]

Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities

Tossou, A., Basu, D., and Dimitrakakis, C. Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities. arXiv e-prints, art. arXiv:1905.12425, May 2019

work page arXiv 1905

[28] [28]

Fundamental problems of algorithmic algebra, volume 49

Yap, C.-K. Fundamental problems of algorithmic algebra, volume 49. Oxford University Press Oxford, 2000

work page 2000

[29] [29]

Zubkov, A. M. and Serov, A. A. A complete proof of universal inequalities for the distribution function of the binomial law. Theory of Probability & Its Applications, 57(3):539–544, 2013. A Proofs A.1 Proof of Theorem 1 Our proof is a direct application of the generic proof provided in Section B.2 of Tossou et al.[27]. To use that generic proof we need to...

work page 2013

[30] [30]

Alsoq −x = 1 −p −x = 1 − k n ≥ 0 so thatx ≤q

Now let’s observe that since sgn(k n −p) = 1, it means thatx ≥ 0. Alsoq −x = 1 −p −x = 1 − k n ≥ 0 so thatx ≤q. So the condition of Theorem 4 are satisﬁed and our goal becomes ﬁnding anx ≥ 0 such that: x2 2(pq +x(q −p)/3) ≥ Φ−1(1 −δ)2 2n Solving for this inequality leads to the upper bound part of the Theorem. Proof of the Lower bound IfQ(Binom(n,p ), 1−δ...

work page

[31] [31]

We will try to ﬁnd this number

≤k. We will try to ﬁnd this number. Let’s observe thatQ(Binom(n,p ), 1

work page

[32] [32]

So, k ≥Q(Binom(n,p ), 1 −δ) (13) ≥Q(Binom(n,p ), 1

is the (smallest) median of the binomial distribution and thus we have:Q(Binom(n,p ), 1 2 ≥ ⌊np⌋ [12]. So, k ≥Q(Binom(n,p ), 1 −δ) (13) ≥Q(Binom(n,p ), 1

work page

[33] [33]

Then our objective is to ﬁnd ak ≥ ⌊np⌋ satisfying (11)

(14) ≥ ⌊np⌋ (15) As a result, we havek + 1 ≥ ⌊np⌋ + 1>np and sgn(k+1 n −p) = 1. Then our objective is to ﬁnd ak ≥ ⌊np⌋ satisfying (11). Letx a number such that k+1 n =p +x. Applying the inverse Φ−1 to (11) and replacing k+1 n byx, our objective becomes ﬁnding anx ≥ 0 such that: D(p +x ∥p) ≤ Φ−1(1 −δ)2 2n 12 Our objective is equivalent to ﬁnding anx such t...

work page

[34] [34]

This completes the proof for this lemma

We can do that since even if we are looking for a lower bound, all the term previously upper bounded in Lemma 2 are multiplied by −. This completes the proof for this lemma. A.2 Useful Results Proposition 4 (Lower and Upper Bound on Bernoulli KL-Divergence). For any numberp andx such that 0<p< 1, 0 ≤x ≤q whereq = 1 −p, we have: x2 2(pq +x(q −p)/3) ≤D(p +x...

work page