pith. sign in

arxiv: 2506.00286 · v3 · pith:YJ6I6UYLnew · submitted 2025-05-30 · 💻 cs.LG · cs.AI· math.OC· stat.ML

Recursive Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

Pith reviewed 2026-05-22 02:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AImath.OCstat.ML
keywords risk-sensitive reinforcement learningentropic risk measuresample complexityPAC boundsgenerative modeldiscounted MDPsvalue iterationrisk-averse and risk-seeking
0
0 comments X

The pith

A model-based Q-value iteration algorithm learns optimal recursive entropic risk values and policies with sample complexity exponential in the risk parameter over one minus the discount factor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies risk-sensitive reinforcement learning where agents optimize recursive entropic risk measures in finite discounted Markov decision processes. It presents a model-based algorithm that draws samples from a generative model and performs value iteration under the risk measure. The resulting PAC bounds on the number of samples needed for accurate value and policy learning grow exponentially with the absolute value of the risk parameter divided by one minus the discount factor. Matching lower bounds prove that no algorithm can avoid this exponential cost in the worst case. These guarantees apply equally to risk-averse and risk-seeking regimes and are tight in the number of states and actions.

Core claim

The authors introduce the Model-Based ERM Q-Value Iteration algorithm and prove PAC-type upper bounds on its sample complexity for both approximating the optimal state-action value function and recovering an optimal policy. These upper bounds scale exponentially with |β|/(1-γ). They also prove matching lower bounds for value and policy learning that establish the exponential dependence on |β|/(1-γ) as information-theoretically necessary in the worst case. The bounds are tight in the number of states S and actions A.

What carries the argument

The Model-Based ERM Q-Value Iteration (MB-RS-QVI) algorithm, which uses samples from the generative model to estimate the transition and reward functions and then runs value iteration with the recursive entropic risk operator.

If this is right

  • Value learning requires a number of samples polynomial in S and A yet exponential in |β|/(1-γ).
  • Policy learning requires the same exponential scaling in |β|/(1-γ).
  • The exponential term in |β|/(1-γ) cannot be removed by any algorithm in the worst case.
  • The upper and lower bounds match up to constants in their dependence on S and A.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Environments with discount factors near one or extreme risk attitudes will demand dramatically larger data sets under this formulation.
  • The same exponential cost may appear for other recursive risk measures that satisfy similar contraction properties.
  • Practical implementations might still benefit from variance reduction techniques even if the worst-case scaling remains unchanged.

Load-bearing premise

A generative model is available that can produce independent next-state and reward samples for any chosen state-action pair.

What would settle it

An MDP and risk parameter β where every algorithm requires more than exponentially many samples in |β|/(1-γ) to learn the optimal value function to fixed accuracy would falsify the upper bound, while an algorithm achieving only polynomial dependence would falsify the lower bound.

read the original abstract

We study risk-sensitive reinforcement learning in finite discounted MDPs with recursive entropic risk measures (ERM), where the risk parameter $\beta \neq 0$ controls the agent's risk attitude: $\beta>0$ for risk-averse and $\beta<0$ for risk-seeking behavior. A generative model of the MDP is assumed to be available. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive ERM. We introduce a model-based algorithm, called Model-Based ERM $Q$-Value Iteration (MB-RS-QVI), and derive PAC-type bounds on its sample complexity for both value and policy learning. Both PAC bounds scale exponentially with $|\beta|/(1-\gamma)$, where $\gamma$ is the discount factor. We also establish corresponding lower bounds for both value and policy learning, showing that exponential dependence on $|\beta|/(1-\gamma)$ is unavoidable in the worst case. The bounds are tight in the number of states and actions ($S$ and $A$), providing the first rigorous sample complexity guarantees for recursive ERM across both risk-averse and risk-seeking regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript studies risk-sensitive RL in finite discounted MDPs under recursive entropic risk measures (ERM) with risk parameter β. Assuming access to a generative model, it proposes the model-based algorithm MB-RS-QVI that performs empirical estimation of the recursive ERM Bellman operator followed by value iteration. PAC-type upper bounds are derived for both value and policy learning that scale exponentially with |β|/(1-γ); matching information-theoretic lower bounds are constructed on a hard MDP family to show that the exponential dependence is unavoidable in the worst case. The bounds are stated to be tight in S and A.

Significance. If the derivations hold, the work supplies the first rigorous sample-complexity guarantees for recursive ERM across both risk-averse (β>0) and risk-seeking (β<0) regimes. The explicit exponential factor is traced to the Lipschitz constant of the ERM operator (bounded by exp(|β|R_max/(1-γ))), the upper-bound proof combines concentration of the entropic estimator with standard contraction arguments, and the lower bound is obtained by reduction to a carefully chosen MDP family. These elements together give a tight characterization of the sample cost induced by the risk parameter.

minor comments (3)
  1. [Abstract] The abstract and introduction state that the PAC bounds scale exponentially with |β|/(1-γ), but the precise dependence on the reward bound R_max and the number of samples per state-action pair is not summarized in the same sentence; adding a compact statement of the full scaling would improve readability.
  2. [Section 4] Notation for the empirical ERM operator (e.g., the definition of the sample-based fixed-point operator) is introduced in the algorithm section but referenced without a forward pointer in the proof of the contraction property; a single cross-reference would reduce reader effort.
  3. [Section 6] The lower-bound construction is described as forcing exponential samples to resolve the risk-adjusted value, yet the precise choice of the hard MDP family (number of states, reward structure) is only sketched; expanding the description of the reduction by one paragraph would make the information-theoretic argument easier to follow.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work on sample complexity bounds for recursive entropic risk measures and for recommending minor revision. The summary accurately captures the main contributions, including the PAC upper bounds, matching lower bounds, and the exponential dependence on |β|/(1-γ). We provide point-by-point responses below to the major comments raised.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The manuscript defines MB-RS-QVI via direct empirical estimation of the recursive ERM Bellman operator followed by value iteration. PAC upper bounds follow from standard concentration of the entropic risk estimator combined with contraction arguments whose Lipschitz factor is explicitly bounded by exp(|β|R_max/(1-γ)). Lower bounds are obtained by reduction to a hard MDP family that forces exponential samples to distinguish risk-adjusted values; this construction is independent of the upper-bound analysis and uses only information-theoretic arguments. No step reduces a claimed result to a fitted parameter, self-citation, or definitional identity. The exponential scaling is a direct consequence of the operator's properties rather than an input assumption.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard finite MDP assumptions and generative model access; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Finite state and action spaces in discounted MDPs
    Stated as finite discounted MDPs with generative model.

pith-pipeline@v0.9.0 · 5751 in / 1113 out tokens · 36014 ms · 2026-05-22T02:12:12.166580+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

    cs.LG 2026-05 unverdicted novelty 7.0

    Characterizes utility functions making recursive OCE objectives PAC-learnable and derives matching upper and lower PAC sample complexity bounds for value and policy learning, with improved tau dependence for CVaR.

  2. Tight Sample Complexity Bounds for Entropic Best Policy Identification

    cs.LG 2026-05 unverdicted novelty 7.0

    New concentration bounds and stopping rule close the exponential gap to match the lower bound for entropic best policy identification.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 2 Pith papers · 1 internal anchor

  1. [1]

    Model-based reinforcement learning with a generative model is minimax optimal

    Alekh Agarwal, Sham Kakade, and Lin F Yang. “Model-based reinforcement learning with a generative model is minimax optimal”. In: Conference on Learning Theory . PMLR. 2020, pp. 67–83

  2. [2]

    Entropic value-at-risk: A new coherent risk measure

    Amir Ahmadi-Javid. “Entropic value-at-risk: A new coherent risk measure”. In: Journal of Optimization Theory and Applications 155 (2012), pp. 1105–1123

  3. [3]

    Adaptive sampling for best policy identi- fication in Markov decision processes

    Aymen Al Marjani and Alexandre Proutiere. “Adaptive sampling for best policy identi- fication in Markov decision processes”. In:International Conference on Machine Learn- ing. PMLR. 2021, pp. 7459–7468

  4. [4]

    A note on a new class of recursive utilities in Markov decision processes

    Hubert Asienkiewicz and Anna Ja´ skiewicz. “A note on a new class of recursive utilities in Markov decision processes”. In: Applicationes Mathematicae 44 (2017), pp. 149–161. 11

  5. [5]

    Reinforcement learning with a near optimal rate of convergence

    Mohammad Gheshlaghi Azar et al. “Reinforcement learning with a near optimal rate of convergence”. In: (2011)

  6. [6]

    Markov decision processes with recursive risk measures

    Nicole B¨ auerle and Alexander Glauner. “Markov decision processes with recursive risk measures”. In: European Journal of Operational Research 296.3 (2022), pp. 953–966

  7. [7]

    Markov decision processes with average-value-at- risk criteria

    Nicole B¨ auerle and Jonathan Ott. “Markov decision processes with average-value-at- risk criteria”. In: Mathematical Methods of Operations Research 74 (2011), pp. 361– 379

  8. [8]

    Distributional reinforcement learning

    Marc G Bellemare, Will Dabney, and Mark Rowland. Distributional reinforcement learning. MIT Press, 2023

  9. [9]

    Risk-sensitive dynamic asset management

    Tomasz R Bielecki and Stanley R Pliska. “Risk-sensitive dynamic asset management”. In: Applied Mathematics and Optimization 39 (1999), pp. 337–360

  10. [10]

    Risk-averse policy optimization via risk-neutral policy optimiza- tion

    Lorenzo Bisi et al. “Risk-averse policy optimization via risk-neutral policy optimiza- tion”. In: Artificial Intelligence 311 (2022), p. 103765

  11. [11]

    Risk-sensitive optimal control for Markov decision processes with monotone cost

    Vivek S Borkar and Sean P Meyn. “Risk-sensitive optimal control for Markov decision processes with monotone cost”. In: Mathematics of Operations Research 27.1 (2002), pp. 192–209

  12. [12]

    Bayesian robust optimization for imitation learning

    Daniel Brown, Scott Niekum, and Marek Petrik. “Bayesian robust optimization for imitation learning”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 2479–2491

  13. [13]

    Provably Efficient Iterated CVaR Reinforcement Learning with Func- tion Approximation and Human Feedback

    Yu Chen et al. “Provably Efficient Iterated CVaR Reinforcement Learning with Func- tion Approximation and Human Feedback”. In: The Twelfth International Conference on Learning Representations. 2024

  14. [14]

    Finite-sample analysis of contractive stochastic approximation us- ing smooth convex envelopes

    Zaiwei Chen et al. “Finite-sample analysis of contractive stochastic approximation us- ing smooth convex envelopes”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 8223–8234

  15. [15]

    Algorithms for CVaR optimization in MDPs

    Yinlam Chow and Mohammad Ghavamzadeh. “Algorithms for CVaR optimization in MDPs”. In: Advances in neural information processing systems 27 (2014)

  16. [16]

    Percentile optimization for Markov decision processes with parameter uncertainty

    Erick Delage and Shie Mannor. “Percentile optimization for Markov decision processes with parameter uncertainty”. In: Operations research 58.1 (2010), pp. 203–213

  17. [17]

    Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model

    Zilong Deng, Simon Khan, and Shaofeng Zou. “Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model”. In:AISTATS. 2025

  18. [18]

    Provably Efficient Risk-Sensitive Rein- forcement Learning: Iterated CVaR and Worst Path

    Yihan Du, Siwei Wang, and Longbo Huang. “Provably Efficient Risk-Sensitive Rein- forcement Learning: Iterated CVaR and Worst Path”. In: The Eleventh International Conference on Learning Representations. 2023

  19. [19]

    Clinical data based optimal STI strategies for HIV: A reinforce- ment learning approach

    Damien Ernst et al. “Clinical data based optimal STI strategies for HIV: A reinforce- ment learning approach”. In: Proceedings of the 45th IEEE Conference on Decision and Control. IEEE. 2006, pp. 667–672

  20. [20]

    Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret

    Yingjie Fei et al. “Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 22384–22395

  21. [21]

    Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

    Mohammad Gheshlaghi Azar, R´ emi Munos, and Hilbert J Kappen. “Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model”. In: Machine learning 91 (2013), pp. 325–349

  22. [22]

    Entropic risk optimiza- tion in discounted MDPs

    Jia Lin Hau, Marek Petrik, and Mohammad Ghavamzadeh. “Entropic risk optimiza- tion in discounted MDPs”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 47–76. 12

  23. [23]

    Risk-sensitive Markov decision processes

    Ronald A Howard and James E Matheson. “Risk-sensitive Markov decision processes”. In: Management science 18.7 (1972), pp. 356–369

  24. [24]

    A tighter problem-dependent regret bound for risk- sensitive reinforcement learning

    Xiaoyan Hu and Ho-fung Leung. “A tighter problem-dependent regret bound for risk- sensitive reinforcement learning”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 5411–5437

  25. [25]

    A utility criterion for Markov decision processes

    Stratton C Jaquette. “A utility criterion for Markov decision processes”. In: Manage- ment Science 23.1 (1976), pp. 43–49

  26. [26]

    On the sample complexity of reinforcement learning

    Sham Machandranath Kakade. On the sample complexity of reinforcement learning . University of London, University College London (United Kingdom), 2003

  27. [27]

    Finite-sample convergence rates for Q-learning and indirect algorithms

    Michael Kearns and Satinder Singh. “Finite-sample convergence rates for Q-learning and indirect algorithms”. In: Advances in neural information processing systems 11 (1998)

  28. [28]

    Near-optimal reinforcement learning in polyno- mial time

    Michael Kearns and Satinder Singh. “Near-optimal reinforcement learning in polyno- mial time”. In: Machine learning 49 (2002), pp. 209–232

  29. [29]

    A revised ap- proach for risk-averse multi-armed bandits under CVaR criterion

    Najakorn Khajonchotpanya, Yilin Xue, and Napat Rujeerapaiboon. “A revised ap- proach for risk-averse multi-armed bandits under CVaR criterion”. In: Operations Re- search Letters 49.4 (2021), pp. 465–472

  30. [30]

    Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation

    Thanh Lam et al. “Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation”. In: The Eleventh International Conference on Learning Representations. 2022

  31. [31]

    Near-optimal PAC bounds for discounted MDPs

    Tor Lattimore and Marcus Hutter. “Near-optimal PAC bounds for discounted MDPs”. In: Theoretical Computer Science 558 (2014), pp. 125–143

  32. [32]

    Breaking the sample size barrier in model-based reinforcement learning with a generative model

    Gen Li et al. “Breaking the sample size barrier in model-based reinforcement learning with a generative model”. In: Advances in neural information processing systems 33 (2020), pp. 12861–12872

  33. [33]

    Settling the sample complexity of model-based offline reinforcement learning

    Gen Li et al. “Settling the sample complexity of model-based offline reinforcement learning”. In: The Annals of Statistics 52.1 (2024), pp. 233–260

  34. [34]

    Regret bounds for risk-sensitive reinforcement learn- ing with lipschitz dynamic risk measures

    Hao Liang and Zhiquan Luo. “Regret bounds for risk-sensitive reinforcement learn- ing with lipschitz dynamic risk measures”. In: International Conference on Artificial Intelligence and Statistics . PMLR. 2024, pp. 1774–1782

  35. [35]

    Robust risk-averse stochastic multi-armed bandits

    Odalric-Ambrym Maillard. “Robust risk-averse stochastic multi-armed bandits”. In: Algorithmic Learning Theory: 24th International Conference, ALT 2013, Singapore, October 6-9, 2013. Proceedings 24. Springer. 2013, pp. 218–233

  36. [36]

    Beyond average return in markov decision processes

    Alexandre Marthe, Aur´ elien Garivier, and Claire Vernade. “Beyond average return in markov decision processes”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 56488–56507

  37. [37]

    Efficient Risk-sensitive Planning via Entropic Risk Measures

    Alexandre Marthe et al. “Efficient Risk-sensitive Planning via Entropic Risk Mea- sures”. In: arXiv preprint arXiv:2502.20423 (2025)

  38. [38]

    A unified view of entropy-regularized Markov decision processes

    Gergely Neu, Anders Jonsson, and Vicen¸ c G´ omez. “A unified view of entropy-regularized markov decision processes”. In: arXiv preprint arXiv:1705.07798 (2017)

  39. [39]

    Sample complexity of robust reinforcement learning with a generative model

    Kishan Panaganti and Dileep Kalathil. “Sample complexity of robust reinforcement learning with a generative model”. In: International Conference on Artificial Intelli- gence and Statistics . PMLR. 2022, pp. 9582–9602

  40. [40]

    Puterman

    Martin L. Puterman. Markov decision processes: discrete stochastic dynamic program- ming. John Wiley & Sons, 2014. 13

  41. [41]

    Bridging offline reinforcement learning and imitation learn- ing: A tale of pessimism

    Paria Rashidinejad et al. “Bridging offline reinforcement learning and imitation learn- ing: A tale of pessimism”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 11702–11716

  42. [42]

    Risk-aversion in multi-armed ban- dits

    Amir Sani, Alessandro Lazaric, and R´ emi Munos. “Risk-aversion in multi-armed ban- dits”. In: Advances in neural information processing systems 25 (2012)

  43. [43]

    Robust portfolio asset allocation and risk measures

    Maria Grazia Scutella and Raffaella Recchia. “Robust portfolio asset allocation and risk measures”. In: Annals of Operations Research 204.1 (2013), pp. 145–169

  44. [44]

    Lectures on stochas- tic programming: Modeling and theory

    Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski. Lectures on stochas- tic programming: Modeling and theory . SIAM, 2021

  45. [45]

    The curious price of distributional robustness in reinforcement learning with a generative model

    Laixi Shi et al. “The curious price of distributional robustness in reinforcement learning with a generative model”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 79903–79917

  46. [46]

    Variance reduced value iteration and faster algorithms for solving Markov decision processes

    Aaron Sidford et al. “Variance reduced value iteration and faster algorithms for solving Markov decision processes”. In: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms . 2018, pp. 770–787

  47. [47]

    An analysis of model-based interval esti- mation for Markov decision processes

    Alexander L Strehl and Michael L Littman. “An analysis of model-based interval esti- mation for Markov decision processes”. In: Journal of Computer and System Sciences 74.8 (2008), pp. 1309–1331

  48. [48]

    Risk-averse Total-reward MDPs with ERM and EVaR

    Xihong Su, Marek Petrik, and Julien Grand-Cl´ ement. “Risk-averse Total-reward MDPs with ERM and EVaR”. In: Proceedings of the AAAI Conference on Artificial Intelli- gence. Vol. 39. 19. 2025, pp. 20646–20654

  49. [49]

    Reinforcement learning: An introduction

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction . Vol. 1. 1. MIT press Cambridge, 1998

  50. [50]

    Policy gradient for coherent risk measures

    Aviv Tamar et al. “Policy gradient for coherent risk measures”. In: Advances in neural information processing systems 28 (2015)

  51. [51]

    Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time

    Mengdi Wang. “Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time”. In: Mathematics of Operations Research 45.2 (2020), pp. 517–546

  52. [52]

    Inequalities for the L1 deviation of the empirical distribution

    Tsachy Weissman et al. “Inequalities for the L1 deviation of the empirical distribution”. In: Hewlett-Packard Labs, Tech. Rep (2003), p. 125

  53. [53]

    Almost horizon-free structure-aware best policy identification with a generative model

    Andrea Zanette, Mykel J Kochenderfer, and Emma Brunskill. “Almost horizon-free structure-aware best policy identification with a generative model”. In: Advances in Neural Information Processing Systems 32 (2019)

  54. [54]

    Provably efficient reinforcement learn- ing for discounted mdps with feature mapping

    Dongruo Zhou, Jiafan He, and Quanquan Gu. “Provably efficient reinforcement learn- ing for discounted mdps with feature mapping”. In: International Conference on Ma- chine Learning. PMLR. 2021, pp. 12793–12802

  55. [55]

    Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs

    Matthew Zurek and Yudong Chen. “Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs”. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems . 2024

  56. [56]

    The Plugin Approach for Average-Reward and Discounted MDPs: Optimal Sample Complexity Analysis

    Matthew Zurek and Yudong Chen. “The Plugin Approach for Average-Reward and Discounted MDPs: Optimal Sample Complexity Analysis”. In: 36th International Con- ference on Algorithmic Learning Theory . 2025. 14 A Technical Lemmas Recall that Qk is the Q-function output by the algorithm after k iterations, πk is the greedy policy with respect to Qk, and that π...

  57. [57]

    The likelihood ratio can be written as L1(W ) L2(W ) = (p + α)k(1 − p − α)t−k pk(1 − p)t−k = 1 + α p k 1 − α 1 − p t−k = 1 + α p k 1 − α 1 − p k 1−p p 1 − α 1 − p t− k p . We start by bounding the second factor using that log(1− x) ≥ −x − x2 + x3 for x ∈ [0, 1 5] (Lemma 6) and that exp( x) ≥ 1 + x for all x along with our assumption that α ≤ 1−p 5 : 1 − α...

  58. [58]

    Hence, L1(W ) L0(W ) = (1 − p − α)m(p + α)t−m (1 − p)mpt−m = 1 − α 1 − p m 1 + α p t−m = 1 − α 1 − p m 1 + α p m p 1−p 1 + α p t− m 1−p

    Define m = t − k, which is now the number of failed coin flips. Hence, L1(W ) L0(W ) = (1 − p − α)m(p + α)t−m (1 − p)mpt−m = 1 − α 1 − p m 1 + α p t−m = 1 − α 1 − p m 1 + α p m p 1−p 1 + α p t− m 1−p . Again, using exp(1 + x) ≥ x for all x ∈ R and using that log(1 + x) ≥ x − x2 for all x ≥ 0, we get that 1 + α p p 1−p ≥ exp p 1 − p α p − α2 p2 ≥ 1 + α 1 −...

  59. [59]

    We start by showing that with these parameter we have that Q∗ M1(z0 i ) − Q∗ M0(z0 i ) > 2ε, which we do by casing on the sign of β : 21 Case 1: β < 0

    Fix any ( ε, δ)-correct Q-algorithm U. We start by showing that with these parameter we have that Q∗ M1(z0 i ) − Q∗ M0(z0 i ) > 2ε, which we do by casing on the sign of β : 21 Case 1: β < 0. In this case p = e−|β| 1 1−γ . We then have Q∗ M1(z0 i ) − Q∗ M0(z0 i ) = γ |β| log (p + α)e|β| 1 1−γ + 1 − p − α pe|β| 1 1−γ + 1 − p = γ |β| log 1 + α(e|β| 1 1−γ − 1...

  60. [60]

    Now by theorem 5 we get that P1(B0) ≥ P1(B0 ∩ E) = E1[1E1B0] = E0 L1 L0 1E1B0 ≥ θ 4 E0[1E1B0] = θ 4 P0(E ∩ B0) ≥ θ 8 Solving for t in θ 8 > δ we find t < p(1 − p) 32α2 log( 1 8δ ) and since p(1 − p) α2 = γ2 |β|2 e−|β| 1 1−γ (1 − e−|β| 1 1−γ ) 64ε2 (e|β| 1 1−γ − 1)2 ≥ γ2 64ε2 e|β| 1 1−γ − 3 |β|2 we conclude that if the algorithm U tries the state-action pa...

  61. [61]

    Let θ = exp − 32α2t p(1−p)

    and using that the algorithm is ( ε, δ)-correct we have that P0(B) ≥ 1 − δ ≥ 3 4 where B = {πU(s0 i ) = a0} is the event that the algorithm outputs the action a0. Let θ = exp − 32α2t p(1−p) . Fix some t ∈ N and let k be the number of transitions to sG i in the t trials and ˜k = ( k if p ≥ 1 2 t − k if p < 1 2 Finally we define the event E as E = ˜pt − ˜k ...

  62. [62]

    From Theorem 5 we get that P1(B) ≥ P1(B ∩ E) = E1[1B1E] ≥ E0 L1(W ) L0(W )1E1B ≥ E0 θ 41E1B = θ 4 P0(E ∩ B) ≥ θ 8 (25) Now solving for θ 8 > δ we see that if t < ˜t(ε, δ) := 1 800 log( 1 8δ ) γ2 ε2 · e|β| 1 1−γ − 3 |β|2 (26) then P1(B) > δ and the event B is containing the event that the algorithm does not choose the optimal action al. Since this holds fo...