Recursive Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model
Pith reviewed 2026-05-22 02:12 UTC · model grok-4.3
The pith
A model-based Q-value iteration algorithm learns optimal recursive entropic risk values and policies with sample complexity exponential in the risk parameter over one minus the discount factor.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce the Model-Based ERM Q-Value Iteration algorithm and prove PAC-type upper bounds on its sample complexity for both approximating the optimal state-action value function and recovering an optimal policy. These upper bounds scale exponentially with |β|/(1-γ). They also prove matching lower bounds for value and policy learning that establish the exponential dependence on |β|/(1-γ) as information-theoretically necessary in the worst case. The bounds are tight in the number of states S and actions A.
What carries the argument
The Model-Based ERM Q-Value Iteration (MB-RS-QVI) algorithm, which uses samples from the generative model to estimate the transition and reward functions and then runs value iteration with the recursive entropic risk operator.
If this is right
- Value learning requires a number of samples polynomial in S and A yet exponential in |β|/(1-γ).
- Policy learning requires the same exponential scaling in |β|/(1-γ).
- The exponential term in |β|/(1-γ) cannot be removed by any algorithm in the worst case.
- The upper and lower bounds match up to constants in their dependence on S and A.
Where Pith is reading between the lines
- Environments with discount factors near one or extreme risk attitudes will demand dramatically larger data sets under this formulation.
- The same exponential cost may appear for other recursive risk measures that satisfy similar contraction properties.
- Practical implementations might still benefit from variance reduction techniques even if the worst-case scaling remains unchanged.
Load-bearing premise
A generative model is available that can produce independent next-state and reward samples for any chosen state-action pair.
What would settle it
An MDP and risk parameter β where every algorithm requires more than exponentially many samples in |β|/(1-γ) to learn the optimal value function to fixed accuracy would falsify the upper bound, while an algorithm achieving only polynomial dependence would falsify the lower bound.
read the original abstract
We study risk-sensitive reinforcement learning in finite discounted MDPs with recursive entropic risk measures (ERM), where the risk parameter $\beta \neq 0$ controls the agent's risk attitude: $\beta>0$ for risk-averse and $\beta<0$ for risk-seeking behavior. A generative model of the MDP is assumed to be available. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive ERM. We introduce a model-based algorithm, called Model-Based ERM $Q$-Value Iteration (MB-RS-QVI), and derive PAC-type bounds on its sample complexity for both value and policy learning. Both PAC bounds scale exponentially with $|\beta|/(1-\gamma)$, where $\gamma$ is the discount factor. We also establish corresponding lower bounds for both value and policy learning, showing that exponential dependence on $|\beta|/(1-\gamma)$ is unavoidable in the worst case. The bounds are tight in the number of states and actions ($S$ and $A$), providing the first rigorous sample complexity guarantees for recursive ERM across both risk-averse and risk-seeking regimes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies risk-sensitive RL in finite discounted MDPs under recursive entropic risk measures (ERM) with risk parameter β. Assuming access to a generative model, it proposes the model-based algorithm MB-RS-QVI that performs empirical estimation of the recursive ERM Bellman operator followed by value iteration. PAC-type upper bounds are derived for both value and policy learning that scale exponentially with |β|/(1-γ); matching information-theoretic lower bounds are constructed on a hard MDP family to show that the exponential dependence is unavoidable in the worst case. The bounds are stated to be tight in S and A.
Significance. If the derivations hold, the work supplies the first rigorous sample-complexity guarantees for recursive ERM across both risk-averse (β>0) and risk-seeking (β<0) regimes. The explicit exponential factor is traced to the Lipschitz constant of the ERM operator (bounded by exp(|β|R_max/(1-γ))), the upper-bound proof combines concentration of the entropic estimator with standard contraction arguments, and the lower bound is obtained by reduction to a carefully chosen MDP family. These elements together give a tight characterization of the sample cost induced by the risk parameter.
minor comments (3)
- [Abstract] The abstract and introduction state that the PAC bounds scale exponentially with |β|/(1-γ), but the precise dependence on the reward bound R_max and the number of samples per state-action pair is not summarized in the same sentence; adding a compact statement of the full scaling would improve readability.
- [Section 4] Notation for the empirical ERM operator (e.g., the definition of the sample-based fixed-point operator) is introduced in the algorithm section but referenced without a forward pointer in the proof of the contraction property; a single cross-reference would reduce reader effort.
- [Section 6] The lower-bound construction is described as forcing exponential samples to resolve the risk-adjusted value, yet the precise choice of the hard MDP family (number of states, reward structure) is only sketched; expanding the description of the reduction by one paragraph would make the information-theoretic argument easier to follow.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of our work on sample complexity bounds for recursive entropic risk measures and for recommending minor revision. The summary accurately captures the main contributions, including the PAC upper bounds, matching lower bounds, and the exponential dependence on |β|/(1-γ). We provide point-by-point responses below to the major comments raised.
Circularity Check
No significant circularity; derivation self-contained
full rationale
The manuscript defines MB-RS-QVI via direct empirical estimation of the recursive ERM Bellman operator followed by value iteration. PAC upper bounds follow from standard concentration of the entropic risk estimator combined with contraction arguments whose Lipschitz factor is explicitly bounded by exp(|β|R_max/(1-γ)). Lower bounds are obtained by reduction to a hard MDP family that forces exponential samples to distinguish risk-adjusted values; this construction is independent of the upper-bound analysis and uses only information-theoretic arguments. No step reduces a claimed result to a fitted parameter, self-citation, or definitional identity. The exponential scaling is a direct consequence of the operator's properties rather than an input assumption.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Finite state and action spaces in discounted MDPs
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The operators T, Tπ are γ-contractions... ∥T f1 − T f2∥ ≤ γ∥f1 − f2∥; simulation lemma yields ∥Q1−Q2∥ ≤ γ/(1−γ) e^{|β|/(1−γ)} |β| τ
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents
Characterizes utility functions making recursive OCE objectives PAC-learnable and derives matching upper and lower PAC sample complexity bounds for value and policy learning, with improved tau dependence for CVaR.
-
Tight Sample Complexity Bounds for Entropic Best Policy Identification
New concentration bounds and stopping rule close the exponential gap to match the lower bound for entropic best policy identification.
Reference graph
Works this paper leans on
-
[1]
Model-based reinforcement learning with a generative model is minimax optimal
Alekh Agarwal, Sham Kakade, and Lin F Yang. “Model-based reinforcement learning with a generative model is minimax optimal”. In: Conference on Learning Theory . PMLR. 2020, pp. 67–83
work page 2020
-
[2]
Entropic value-at-risk: A new coherent risk measure
Amir Ahmadi-Javid. “Entropic value-at-risk: A new coherent risk measure”. In: Journal of Optimization Theory and Applications 155 (2012), pp. 1105–1123
work page 2012
-
[3]
Adaptive sampling for best policy identi- fication in Markov decision processes
Aymen Al Marjani and Alexandre Proutiere. “Adaptive sampling for best policy identi- fication in Markov decision processes”. In:International Conference on Machine Learn- ing. PMLR. 2021, pp. 7459–7468
work page 2021
-
[4]
A note on a new class of recursive utilities in Markov decision processes
Hubert Asienkiewicz and Anna Ja´ skiewicz. “A note on a new class of recursive utilities in Markov decision processes”. In: Applicationes Mathematicae 44 (2017), pp. 149–161. 11
work page 2017
-
[5]
Reinforcement learning with a near optimal rate of convergence
Mohammad Gheshlaghi Azar et al. “Reinforcement learning with a near optimal rate of convergence”. In: (2011)
work page 2011
-
[6]
Markov decision processes with recursive risk measures
Nicole B¨ auerle and Alexander Glauner. “Markov decision processes with recursive risk measures”. In: European Journal of Operational Research 296.3 (2022), pp. 953–966
work page 2022
-
[7]
Markov decision processes with average-value-at- risk criteria
Nicole B¨ auerle and Jonathan Ott. “Markov decision processes with average-value-at- risk criteria”. In: Mathematical Methods of Operations Research 74 (2011), pp. 361– 379
work page 2011
-
[8]
Distributional reinforcement learning
Marc G Bellemare, Will Dabney, and Mark Rowland. Distributional reinforcement learning. MIT Press, 2023
work page 2023
-
[9]
Risk-sensitive dynamic asset management
Tomasz R Bielecki and Stanley R Pliska. “Risk-sensitive dynamic asset management”. In: Applied Mathematics and Optimization 39 (1999), pp. 337–360
work page 1999
-
[10]
Risk-averse policy optimization via risk-neutral policy optimiza- tion
Lorenzo Bisi et al. “Risk-averse policy optimization via risk-neutral policy optimiza- tion”. In: Artificial Intelligence 311 (2022), p. 103765
work page 2022
-
[11]
Risk-sensitive optimal control for Markov decision processes with monotone cost
Vivek S Borkar and Sean P Meyn. “Risk-sensitive optimal control for Markov decision processes with monotone cost”. In: Mathematics of Operations Research 27.1 (2002), pp. 192–209
work page 2002
-
[12]
Bayesian robust optimization for imitation learning
Daniel Brown, Scott Niekum, and Marek Petrik. “Bayesian robust optimization for imitation learning”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 2479–2491
work page 2020
-
[13]
Yu Chen et al. “Provably Efficient Iterated CVaR Reinforcement Learning with Func- tion Approximation and Human Feedback”. In: The Twelfth International Conference on Learning Representations. 2024
work page 2024
-
[14]
Finite-sample analysis of contractive stochastic approximation us- ing smooth convex envelopes
Zaiwei Chen et al. “Finite-sample analysis of contractive stochastic approximation us- ing smooth convex envelopes”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 8223–8234
work page 2020
-
[15]
Algorithms for CVaR optimization in MDPs
Yinlam Chow and Mohammad Ghavamzadeh. “Algorithms for CVaR optimization in MDPs”. In: Advances in neural information processing systems 27 (2014)
work page 2014
-
[16]
Percentile optimization for Markov decision processes with parameter uncertainty
Erick Delage and Shie Mannor. “Percentile optimization for Markov decision processes with parameter uncertainty”. In: Operations research 58.1 (2010), pp. 203–213
work page 2010
-
[17]
Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model
Zilong Deng, Simon Khan, and Shaofeng Zou. “Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model”. In:AISTATS. 2025
work page 2025
-
[18]
Provably Efficient Risk-Sensitive Rein- forcement Learning: Iterated CVaR and Worst Path
Yihan Du, Siwei Wang, and Longbo Huang. “Provably Efficient Risk-Sensitive Rein- forcement Learning: Iterated CVaR and Worst Path”. In: The Eleventh International Conference on Learning Representations. 2023
work page 2023
-
[19]
Clinical data based optimal STI strategies for HIV: A reinforce- ment learning approach
Damien Ernst et al. “Clinical data based optimal STI strategies for HIV: A reinforce- ment learning approach”. In: Proceedings of the 45th IEEE Conference on Decision and Control. IEEE. 2006, pp. 667–672
work page 2006
-
[20]
Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret
Yingjie Fei et al. “Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 22384–22395
work page 2020
-
[21]
Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model
Mohammad Gheshlaghi Azar, R´ emi Munos, and Hilbert J Kappen. “Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model”. In: Machine learning 91 (2013), pp. 325–349
work page 2013
-
[22]
Entropic risk optimiza- tion in discounted MDPs
Jia Lin Hau, Marek Petrik, and Mohammad Ghavamzadeh. “Entropic risk optimiza- tion in discounted MDPs”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 47–76. 12
work page 2023
-
[23]
Risk-sensitive Markov decision processes
Ronald A Howard and James E Matheson. “Risk-sensitive Markov decision processes”. In: Management science 18.7 (1972), pp. 356–369
work page 1972
-
[24]
A tighter problem-dependent regret bound for risk- sensitive reinforcement learning
Xiaoyan Hu and Ho-fung Leung. “A tighter problem-dependent regret bound for risk- sensitive reinforcement learning”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 5411–5437
work page 2023
-
[25]
A utility criterion for Markov decision processes
Stratton C Jaquette. “A utility criterion for Markov decision processes”. In: Manage- ment Science 23.1 (1976), pp. 43–49
work page 1976
-
[26]
On the sample complexity of reinforcement learning
Sham Machandranath Kakade. On the sample complexity of reinforcement learning . University of London, University College London (United Kingdom), 2003
work page 2003
-
[27]
Finite-sample convergence rates for Q-learning and indirect algorithms
Michael Kearns and Satinder Singh. “Finite-sample convergence rates for Q-learning and indirect algorithms”. In: Advances in neural information processing systems 11 (1998)
work page 1998
-
[28]
Near-optimal reinforcement learning in polyno- mial time
Michael Kearns and Satinder Singh. “Near-optimal reinforcement learning in polyno- mial time”. In: Machine learning 49 (2002), pp. 209–232
work page 2002
-
[29]
A revised ap- proach for risk-averse multi-armed bandits under CVaR criterion
Najakorn Khajonchotpanya, Yilin Xue, and Napat Rujeerapaiboon. “A revised ap- proach for risk-averse multi-armed bandits under CVaR criterion”. In: Operations Re- search Letters 49.4 (2021), pp. 465–472
work page 2021
-
[30]
Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation
Thanh Lam et al. “Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation”. In: The Eleventh International Conference on Learning Representations. 2022
work page 2022
-
[31]
Near-optimal PAC bounds for discounted MDPs
Tor Lattimore and Marcus Hutter. “Near-optimal PAC bounds for discounted MDPs”. In: Theoretical Computer Science 558 (2014), pp. 125–143
work page 2014
-
[32]
Breaking the sample size barrier in model-based reinforcement learning with a generative model
Gen Li et al. “Breaking the sample size barrier in model-based reinforcement learning with a generative model”. In: Advances in neural information processing systems 33 (2020), pp. 12861–12872
work page 2020
-
[33]
Settling the sample complexity of model-based offline reinforcement learning
Gen Li et al. “Settling the sample complexity of model-based offline reinforcement learning”. In: The Annals of Statistics 52.1 (2024), pp. 233–260
work page 2024
-
[34]
Regret bounds for risk-sensitive reinforcement learn- ing with lipschitz dynamic risk measures
Hao Liang and Zhiquan Luo. “Regret bounds for risk-sensitive reinforcement learn- ing with lipschitz dynamic risk measures”. In: International Conference on Artificial Intelligence and Statistics . PMLR. 2024, pp. 1774–1782
work page 2024
-
[35]
Robust risk-averse stochastic multi-armed bandits
Odalric-Ambrym Maillard. “Robust risk-averse stochastic multi-armed bandits”. In: Algorithmic Learning Theory: 24th International Conference, ALT 2013, Singapore, October 6-9, 2013. Proceedings 24. Springer. 2013, pp. 218–233
work page 2013
-
[36]
Beyond average return in markov decision processes
Alexandre Marthe, Aur´ elien Garivier, and Claire Vernade. “Beyond average return in markov decision processes”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 56488–56507
work page 2023
-
[37]
Efficient Risk-sensitive Planning via Entropic Risk Measures
Alexandre Marthe et al. “Efficient Risk-sensitive Planning via Entropic Risk Mea- sures”. In: arXiv preprint arXiv:2502.20423 (2025)
-
[38]
A unified view of entropy-regularized Markov decision processes
Gergely Neu, Anders Jonsson, and Vicen¸ c G´ omez. “A unified view of entropy-regularized markov decision processes”. In: arXiv preprint arXiv:1705.07798 (2017)
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[39]
Sample complexity of robust reinforcement learning with a generative model
Kishan Panaganti and Dileep Kalathil. “Sample complexity of robust reinforcement learning with a generative model”. In: International Conference on Artificial Intelli- gence and Statistics . PMLR. 2022, pp. 9582–9602
work page 2022
- [40]
-
[41]
Bridging offline reinforcement learning and imitation learn- ing: A tale of pessimism
Paria Rashidinejad et al. “Bridging offline reinforcement learning and imitation learn- ing: A tale of pessimism”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 11702–11716
work page 2021
-
[42]
Risk-aversion in multi-armed ban- dits
Amir Sani, Alessandro Lazaric, and R´ emi Munos. “Risk-aversion in multi-armed ban- dits”. In: Advances in neural information processing systems 25 (2012)
work page 2012
-
[43]
Robust portfolio asset allocation and risk measures
Maria Grazia Scutella and Raffaella Recchia. “Robust portfolio asset allocation and risk measures”. In: Annals of Operations Research 204.1 (2013), pp. 145–169
work page 2013
-
[44]
Lectures on stochas- tic programming: Modeling and theory
Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski. Lectures on stochas- tic programming: Modeling and theory . SIAM, 2021
work page 2021
-
[45]
The curious price of distributional robustness in reinforcement learning with a generative model
Laixi Shi et al. “The curious price of distributional robustness in reinforcement learning with a generative model”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 79903–79917
work page 2023
-
[46]
Variance reduced value iteration and faster algorithms for solving Markov decision processes
Aaron Sidford et al. “Variance reduced value iteration and faster algorithms for solving Markov decision processes”. In: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms . 2018, pp. 770–787
work page 2018
-
[47]
An analysis of model-based interval esti- mation for Markov decision processes
Alexander L Strehl and Michael L Littman. “An analysis of model-based interval esti- mation for Markov decision processes”. In: Journal of Computer and System Sciences 74.8 (2008), pp. 1309–1331
work page 2008
-
[48]
Risk-averse Total-reward MDPs with ERM and EVaR
Xihong Su, Marek Petrik, and Julien Grand-Cl´ ement. “Risk-averse Total-reward MDPs with ERM and EVaR”. In: Proceedings of the AAAI Conference on Artificial Intelli- gence. Vol. 39. 19. 2025, pp. 20646–20654
work page 2025
-
[49]
Reinforcement learning: An introduction
Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction . Vol. 1. 1. MIT press Cambridge, 1998
work page 1998
-
[50]
Policy gradient for coherent risk measures
Aviv Tamar et al. “Policy gradient for coherent risk measures”. In: Advances in neural information processing systems 28 (2015)
work page 2015
-
[51]
Mengdi Wang. “Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time”. In: Mathematics of Operations Research 45.2 (2020), pp. 517–546
work page 2020
-
[52]
Inequalities for the L1 deviation of the empirical distribution
Tsachy Weissman et al. “Inequalities for the L1 deviation of the empirical distribution”. In: Hewlett-Packard Labs, Tech. Rep (2003), p. 125
work page 2003
-
[53]
Almost horizon-free structure-aware best policy identification with a generative model
Andrea Zanette, Mykel J Kochenderfer, and Emma Brunskill. “Almost horizon-free structure-aware best policy identification with a generative model”. In: Advances in Neural Information Processing Systems 32 (2019)
work page 2019
-
[54]
Provably efficient reinforcement learn- ing for discounted mdps with feature mapping
Dongruo Zhou, Jiafan He, and Quanquan Gu. “Provably efficient reinforcement learn- ing for discounted mdps with feature mapping”. In: International Conference on Ma- chine Learning. PMLR. 2021, pp. 12793–12802
work page 2021
-
[55]
Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs
Matthew Zurek and Yudong Chen. “Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs”. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems . 2024
work page 2024
-
[56]
The Plugin Approach for Average-Reward and Discounted MDPs: Optimal Sample Complexity Analysis
Matthew Zurek and Yudong Chen. “The Plugin Approach for Average-Reward and Discounted MDPs: Optimal Sample Complexity Analysis”. In: 36th International Con- ference on Algorithmic Learning Theory . 2025. 14 A Technical Lemmas Recall that Qk is the Q-function output by the algorithm after k iterations, πk is the greedy policy with respect to Qk, and that π...
work page 2025
-
[57]
The likelihood ratio can be written as L1(W ) L2(W ) = (p + α)k(1 − p − α)t−k pk(1 − p)t−k = 1 + α p k 1 − α 1 − p t−k = 1 + α p k 1 − α 1 − p k 1−p p 1 − α 1 − p t− k p . We start by bounding the second factor using that log(1− x) ≥ −x − x2 + x3 for x ∈ [0, 1 5] (Lemma 6) and that exp( x) ≥ 1 + x for all x along with our assumption that α ≤ 1−p 5 : 1 − α...
-
[58]
Define m = t − k, which is now the number of failed coin flips. Hence, L1(W ) L0(W ) = (1 − p − α)m(p + α)t−m (1 − p)mpt−m = 1 − α 1 − p m 1 + α p t−m = 1 − α 1 − p m 1 + α p m p 1−p 1 + α p t− m 1−p . Again, using exp(1 + x) ≥ x for all x ∈ R and using that log(1 + x) ≥ x − x2 for all x ≥ 0, we get that 1 + α p p 1−p ≥ exp p 1 − p α p − α2 p2 ≥ 1 + α 1 −...
-
[59]
Fix any ( ε, δ)-correct Q-algorithm U. We start by showing that with these parameter we have that Q∗ M1(z0 i ) − Q∗ M0(z0 i ) > 2ε, which we do by casing on the sign of β : 21 Case 1: β < 0. In this case p = e−|β| 1 1−γ . We then have Q∗ M1(z0 i ) − Q∗ M0(z0 i ) = γ |β| log (p + α)e|β| 1 1−γ + 1 − p − α pe|β| 1 1−γ + 1 − p = γ |β| log 1 + α(e|β| 1 1−γ − 1...
-
[60]
Now by theorem 5 we get that P1(B0) ≥ P1(B0 ∩ E) = E1[1E1B0] = E0 L1 L0 1E1B0 ≥ θ 4 E0[1E1B0] = θ 4 P0(E ∩ B0) ≥ θ 8 Solving for t in θ 8 > δ we find t < p(1 − p) 32α2 log( 1 8δ ) and since p(1 − p) α2 = γ2 |β|2 e−|β| 1 1−γ (1 − e−|β| 1 1−γ ) 64ε2 (e|β| 1 1−γ − 1)2 ≥ γ2 64ε2 e|β| 1 1−γ − 3 |β|2 we conclude that if the algorithm U tries the state-action pa...
-
[61]
and using that the algorithm is ( ε, δ)-correct we have that P0(B) ≥ 1 − δ ≥ 3 4 where B = {πU(s0 i ) = a0} is the event that the algorithm outputs the action a0. Let θ = exp − 32α2t p(1−p) . Fix some t ∈ N and let k be the number of transitions to sG i in the t trials and ˜k = ( k if p ≥ 1 2 t − k if p < 1 2 Finally we define the event E as E = ˜pt − ˜k ...
-
[62]
From Theorem 5 we get that P1(B) ≥ P1(B ∩ E) = E1[1B1E] ≥ E0 L1(W ) L0(W )1E1B ≥ E0 θ 41E1B = θ 4 P0(E ∩ B) ≥ θ 8 (25) Now solving for θ 8 > δ we see that if t < ˜t(ε, δ) := 1 800 log( 1 8δ ) γ2 ε2 · e|β| 1 1−γ − 3 |β|2 (26) then P1(B) > δ and the event B is containing the event that the algorithm does not choose the optimal action al. Since this holds fo...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.