Recursive Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

Mohammad Sadegh Talebi; Oliver Mortensen

arxiv: 2506.00286 · v3 · pith:YJ6I6UYLnew · submitted 2025-05-30 · 💻 cs.LG · cs.AI· math.OC· stat.ML

Recursive Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

Oliver Mortensen , Mohammad Sadegh Talebi This is my paper

Pith reviewed 2026-05-22 02:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AImath.OCstat.ML

keywords risk-sensitive reinforcement learningentropic risk measuresample complexityPAC boundsgenerative modeldiscounted MDPsvalue iterationrisk-averse and risk-seeking

0 comments

The pith

A model-based Q-value iteration algorithm learns optimal recursive entropic risk values and policies with sample complexity exponential in the risk parameter over one minus the discount factor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies risk-sensitive reinforcement learning where agents optimize recursive entropic risk measures in finite discounted Markov decision processes. It presents a model-based algorithm that draws samples from a generative model and performs value iteration under the risk measure. The resulting PAC bounds on the number of samples needed for accurate value and policy learning grow exponentially with the absolute value of the risk parameter divided by one minus the discount factor. Matching lower bounds prove that no algorithm can avoid this exponential cost in the worst case. These guarantees apply equally to risk-averse and risk-seeking regimes and are tight in the number of states and actions.

Core claim

The authors introduce the Model-Based ERM Q-Value Iteration algorithm and prove PAC-type upper bounds on its sample complexity for both approximating the optimal state-action value function and recovering an optimal policy. These upper bounds scale exponentially with |β|/(1-γ). They also prove matching lower bounds for value and policy learning that establish the exponential dependence on |β|/(1-γ) as information-theoretically necessary in the worst case. The bounds are tight in the number of states S and actions A.

What carries the argument

The Model-Based ERM Q-Value Iteration (MB-RS-QVI) algorithm, which uses samples from the generative model to estimate the transition and reward functions and then runs value iteration with the recursive entropic risk operator.

If this is right

Value learning requires a number of samples polynomial in S and A yet exponential in |β|/(1-γ).
Policy learning requires the same exponential scaling in |β|/(1-γ).
The exponential term in |β|/(1-γ) cannot be removed by any algorithm in the worst case.
The upper and lower bounds match up to constants in their dependence on S and A.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Environments with discount factors near one or extreme risk attitudes will demand dramatically larger data sets under this formulation.
The same exponential cost may appear for other recursive risk measures that satisfy similar contraction properties.
Practical implementations might still benefit from variance reduction techniques even if the worst-case scaling remains unchanged.

Load-bearing premise

A generative model is available that can produce independent next-state and reward samples for any chosen state-action pair.

What would settle it

An MDP and risk parameter β where every algorithm requires more than exponentially many samples in |β|/(1-γ) to learn the optimal value function to fixed accuracy would falsify the upper bound, while an algorithm achieving only polynomial dependence would falsify the lower bound.

read the original abstract

We study risk-sensitive reinforcement learning in finite discounted MDPs with recursive entropic risk measures (ERM), where the risk parameter $\beta \neq 0$ controls the agent's risk attitude: $\beta>0$ for risk-averse and $\beta<0$ for risk-seeking behavior. A generative model of the MDP is assumed to be available. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive ERM. We introduce a model-based algorithm, called Model-Based ERM $Q$-Value Iteration (MB-RS-QVI), and derive PAC-type bounds on its sample complexity for both value and policy learning. Both PAC bounds scale exponentially with $|\beta|/(1-\gamma)$, where $\gamma$ is the discount factor. We also establish corresponding lower bounds for both value and policy learning, showing that exponential dependence on $|\beta|/(1-\gamma)$ is unavoidable in the worst case. The bounds are tight in the number of states and actions ($S$ and $A$), providing the first rigorous sample complexity guarantees for recursive ERM across both risk-averse and risk-seeking regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives the first matching PAC upper and lower sample complexity bounds for recursive entropic risk measures in discounted MDPs under a generative model, with unavoidable exponential scaling in |β|/(1-γ).

read the letter

Hey colleague, the main thing here is that the paper delivers the first PAC bounds with matching lower bounds for recursive entropic risk optimization in finite discounted MDPs with a generative model. The results cover both value and policy learning and apply to both risk-averse (β > 0) and risk-seeking (β < 0) regimes, with the sample complexity scaling exponentially in |β|/(1-γ). They introduce MB-RS-QVI, which estimates the recursive ERM Bellman operator from samples and runs value iteration. The upper bounds come from concentration on the entropic estimator plus contraction arguments. The lower bounds use a reduction to a hard MDP family that forces the exponential sample count to distinguish risk-adjusted values. The exponential term traces to the Lipschitz constant of the ERM operator, and the stress-test note indicates the fixed-point, sample allocation, and information-theoretic arguments line up without evident gaps. The bounds are also tight in S and A, which is a clean result. A soft spot is the generative model assumption, which lets you sample arbitrarily from any state-action pair. This is standard for these analyses but stronger than typical online interaction, so the results do not directly transfer to settings without that access. The exponential dependence is presented as inherent rather than an artifact, which is fair given the lower bounds. This work targets researchers in risk-sensitive RL who want finite-sample guarantees for non-expected-value objectives. A reader focused on theoretical sample efficiency for recursive risk measures would get concrete value from the matching bounds and the handling of both signs of β. I would send it for peer review. The claims are specific, the approach builds on standard MDP tools in a careful way, and the lower bounds give it more weight than upper-bound-only work.

Referee Report

0 major / 3 minor

Summary. The manuscript studies risk-sensitive RL in finite discounted MDPs under recursive entropic risk measures (ERM) with risk parameter β. Assuming access to a generative model, it proposes the model-based algorithm MB-RS-QVI that performs empirical estimation of the recursive ERM Bellman operator followed by value iteration. PAC-type upper bounds are derived for both value and policy learning that scale exponentially with |β|/(1-γ); matching information-theoretic lower bounds are constructed on a hard MDP family to show that the exponential dependence is unavoidable in the worst case. The bounds are stated to be tight in S and A.

Significance. If the derivations hold, the work supplies the first rigorous sample-complexity guarantees for recursive ERM across both risk-averse (β>0) and risk-seeking (β<0) regimes. The explicit exponential factor is traced to the Lipschitz constant of the ERM operator (bounded by exp(|β|R_max/(1-γ))), the upper-bound proof combines concentration of the entropic estimator with standard contraction arguments, and the lower bound is obtained by reduction to a carefully chosen MDP family. These elements together give a tight characterization of the sample cost induced by the risk parameter.

minor comments (3)

[Abstract] The abstract and introduction state that the PAC bounds scale exponentially with |β|/(1-γ), but the precise dependence on the reward bound R_max and the number of samples per state-action pair is not summarized in the same sentence; adding a compact statement of the full scaling would improve readability.
[Section 4] Notation for the empirical ERM operator (e.g., the definition of the sample-based fixed-point operator) is introduced in the algorithm section but referenced without a forward pointer in the proof of the contraction property; a single cross-reference would reduce reader effort.
[Section 6] The lower-bound construction is described as forcing exponential samples to resolve the risk-adjusted value, yet the precise choice of the hard MDP family (number of states, reward structure) is only sketched; expanding the description of the reduction by one paragraph would make the information-theoretic argument easier to follow.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work on sample complexity bounds for recursive entropic risk measures and for recommending minor revision. The summary accurately captures the main contributions, including the PAC upper bounds, matching lower bounds, and the exponential dependence on |β|/(1-γ). We provide point-by-point responses below to the major comments raised.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The manuscript defines MB-RS-QVI via direct empirical estimation of the recursive ERM Bellman operator followed by value iteration. PAC upper bounds follow from standard concentration of the entropic risk estimator combined with contraction arguments whose Lipschitz factor is explicitly bounded by exp(|β|R_max/(1-γ)). Lower bounds are obtained by reduction to a hard MDP family that forces exponential samples to distinguish risk-adjusted values; this construction is independent of the upper-bound analysis and uses only information-theoretic arguments. No step reduces a claimed result to a fitted parameter, self-citation, or definitional identity. The exponential scaling is a direct consequence of the operator's properties rather than an input assumption.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard finite MDP assumptions and generative model access; no new free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Finite state and action spaces in discounted MDPs
Stated as finite discounted MDPs with generative model.

pith-pipeline@v0.9.0 · 5751 in / 1113 out tokens · 36014 ms · 2026-05-22T02:12:12.166580+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The operators T, Tπ are γ-contractions... ∥T f1 − T f2∥ ≤ γ∥f1 − f2∥; simulation lemma yields ∥Q1−Q2∥ ≤ γ/(1−γ) e^{|β|/(1−γ)} |β| τ

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents
cs.LG 2026-05 unverdicted novelty 7.0

Characterizes utility functions making recursive OCE objectives PAC-learnable and derives matching upper and lower PAC sample complexity bounds for value and policy learning, with improved tau dependence for CVaR.
Tight Sample Complexity Bounds for Entropic Best Policy Identification
cs.LG 2026-05 unverdicted novelty 7.0

New concentration bounds and stopping rule close the exponential gap to match the lower bound for entropic best policy identification.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 2 Pith papers · 1 internal anchor

[1]

Model-based reinforcement learning with a generative model is minimax optimal

Alekh Agarwal, Sham Kakade, and Lin F Yang. “Model-based reinforcement learning with a generative model is minimax optimal”. In: Conference on Learning Theory . PMLR. 2020, pp. 67–83

work page 2020
[2]

Entropic value-at-risk: A new coherent risk measure

Amir Ahmadi-Javid. “Entropic value-at-risk: A new coherent risk measure”. In: Journal of Optimization Theory and Applications 155 (2012), pp. 1105–1123

work page 2012
[3]

Adaptive sampling for best policy identi- fication in Markov decision processes

Aymen Al Marjani and Alexandre Proutiere. “Adaptive sampling for best policy identi- fication in Markov decision processes”. In:International Conference on Machine Learn- ing. PMLR. 2021, pp. 7459–7468

work page 2021
[4]

A note on a new class of recursive utilities in Markov decision processes

Hubert Asienkiewicz and Anna Ja´ skiewicz. “A note on a new class of recursive utilities in Markov decision processes”. In: Applicationes Mathematicae 44 (2017), pp. 149–161. 11

work page 2017
[5]

Reinforcement learning with a near optimal rate of convergence

Mohammad Gheshlaghi Azar et al. “Reinforcement learning with a near optimal rate of convergence”. In: (2011)

work page 2011
[6]

Markov decision processes with recursive risk measures

Nicole B¨ auerle and Alexander Glauner. “Markov decision processes with recursive risk measures”. In: European Journal of Operational Research 296.3 (2022), pp. 953–966

work page 2022
[7]

Markov decision processes with average-value-at- risk criteria

Nicole B¨ auerle and Jonathan Ott. “Markov decision processes with average-value-at- risk criteria”. In: Mathematical Methods of Operations Research 74 (2011), pp. 361– 379

work page 2011
[8]

Distributional reinforcement learning

Marc G Bellemare, Will Dabney, and Mark Rowland. Distributional reinforcement learning. MIT Press, 2023

work page 2023
[9]

Risk-sensitive dynamic asset management

Tomasz R Bielecki and Stanley R Pliska. “Risk-sensitive dynamic asset management”. In: Applied Mathematics and Optimization 39 (1999), pp. 337–360

work page 1999
[10]

Risk-averse policy optimization via risk-neutral policy optimiza- tion

Lorenzo Bisi et al. “Risk-averse policy optimization via risk-neutral policy optimiza- tion”. In: Artificial Intelligence 311 (2022), p. 103765

work page 2022
[11]

Risk-sensitive optimal control for Markov decision processes with monotone cost

Vivek S Borkar and Sean P Meyn. “Risk-sensitive optimal control for Markov decision processes with monotone cost”. In: Mathematics of Operations Research 27.1 (2002), pp. 192–209

work page 2002
[12]

Bayesian robust optimization for imitation learning

Daniel Brown, Scott Niekum, and Marek Petrik. “Bayesian robust optimization for imitation learning”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 2479–2491

work page 2020
[13]

Provably Efficient Iterated CVaR Reinforcement Learning with Func- tion Approximation and Human Feedback

Yu Chen et al. “Provably Efficient Iterated CVaR Reinforcement Learning with Func- tion Approximation and Human Feedback”. In: The Twelfth International Conference on Learning Representations. 2024

work page 2024
[14]

Finite-sample analysis of contractive stochastic approximation us- ing smooth convex envelopes

Zaiwei Chen et al. “Finite-sample analysis of contractive stochastic approximation us- ing smooth convex envelopes”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 8223–8234

work page 2020
[15]

Algorithms for CVaR optimization in MDPs

Yinlam Chow and Mohammad Ghavamzadeh. “Algorithms for CVaR optimization in MDPs”. In: Advances in neural information processing systems 27 (2014)

work page 2014
[16]

Percentile optimization for Markov decision processes with parameter uncertainty

Erick Delage and Shie Mannor. “Percentile optimization for Markov decision processes with parameter uncertainty”. In: Operations research 58.1 (2010), pp. 203–213

work page 2010
[17]

Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model

Zilong Deng, Simon Khan, and Shaofeng Zou. “Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model”. In:AISTATS. 2025

work page 2025
[18]

Provably Efficient Risk-Sensitive Rein- forcement Learning: Iterated CVaR and Worst Path

Yihan Du, Siwei Wang, and Longbo Huang. “Provably Efficient Risk-Sensitive Rein- forcement Learning: Iterated CVaR and Worst Path”. In: The Eleventh International Conference on Learning Representations. 2023

work page 2023
[19]

Clinical data based optimal STI strategies for HIV: A reinforce- ment learning approach

Damien Ernst et al. “Clinical data based optimal STI strategies for HIV: A reinforce- ment learning approach”. In: Proceedings of the 45th IEEE Conference on Decision and Control. IEEE. 2006, pp. 667–672

work page 2006
[20]

Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret

Yingjie Fei et al. “Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 22384–22395

work page 2020
[21]

Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

Mohammad Gheshlaghi Azar, R´ emi Munos, and Hilbert J Kappen. “Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model”. In: Machine learning 91 (2013), pp. 325–349

work page 2013
[22]

Entropic risk optimiza- tion in discounted MDPs

Jia Lin Hau, Marek Petrik, and Mohammad Ghavamzadeh. “Entropic risk optimiza- tion in discounted MDPs”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 47–76. 12

work page 2023
[23]

Risk-sensitive Markov decision processes

Ronald A Howard and James E Matheson. “Risk-sensitive Markov decision processes”. In: Management science 18.7 (1972), pp. 356–369

work page 1972
[24]

A tighter problem-dependent regret bound for risk- sensitive reinforcement learning

Xiaoyan Hu and Ho-fung Leung. “A tighter problem-dependent regret bound for risk- sensitive reinforcement learning”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 5411–5437

work page 2023
[25]

A utility criterion for Markov decision processes

Stratton C Jaquette. “A utility criterion for Markov decision processes”. In: Manage- ment Science 23.1 (1976), pp. 43–49

work page 1976
[26]

On the sample complexity of reinforcement learning

Sham Machandranath Kakade. On the sample complexity of reinforcement learning . University of London, University College London (United Kingdom), 2003

work page 2003
[27]

Finite-sample convergence rates for Q-learning and indirect algorithms

Michael Kearns and Satinder Singh. “Finite-sample convergence rates for Q-learning and indirect algorithms”. In: Advances in neural information processing systems 11 (1998)

work page 1998
[28]

Near-optimal reinforcement learning in polyno- mial time

Michael Kearns and Satinder Singh. “Near-optimal reinforcement learning in polyno- mial time”. In: Machine learning 49 (2002), pp. 209–232

work page 2002
[29]

A revised ap- proach for risk-averse multi-armed bandits under CVaR criterion

Najakorn Khajonchotpanya, Yilin Xue, and Napat Rujeerapaiboon. “A revised ap- proach for risk-averse multi-armed bandits under CVaR criterion”. In: Operations Re- search Letters 49.4 (2021), pp. 465–472

work page 2021
[30]

Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation

Thanh Lam et al. “Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation”. In: The Eleventh International Conference on Learning Representations. 2022

work page 2022
[31]

Near-optimal PAC bounds for discounted MDPs

Tor Lattimore and Marcus Hutter. “Near-optimal PAC bounds for discounted MDPs”. In: Theoretical Computer Science 558 (2014), pp. 125–143

work page 2014
[32]

Breaking the sample size barrier in model-based reinforcement learning with a generative model

Gen Li et al. “Breaking the sample size barrier in model-based reinforcement learning with a generative model”. In: Advances in neural information processing systems 33 (2020), pp. 12861–12872

work page 2020
[33]

Settling the sample complexity of model-based offline reinforcement learning

Gen Li et al. “Settling the sample complexity of model-based offline reinforcement learning”. In: The Annals of Statistics 52.1 (2024), pp. 233–260

work page 2024
[34]

Regret bounds for risk-sensitive reinforcement learn- ing with lipschitz dynamic risk measures

Hao Liang and Zhiquan Luo. “Regret bounds for risk-sensitive reinforcement learn- ing with lipschitz dynamic risk measures”. In: International Conference on Artificial Intelligence and Statistics . PMLR. 2024, pp. 1774–1782

work page 2024
[35]

Robust risk-averse stochastic multi-armed bandits

Odalric-Ambrym Maillard. “Robust risk-averse stochastic multi-armed bandits”. In: Algorithmic Learning Theory: 24th International Conference, ALT 2013, Singapore, October 6-9, 2013. Proceedings 24. Springer. 2013, pp. 218–233

work page 2013
[36]

Beyond average return in markov decision processes

Alexandre Marthe, Aur´ elien Garivier, and Claire Vernade. “Beyond average return in markov decision processes”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 56488–56507

work page 2023
[37]

Efficient Risk-sensitive Planning via Entropic Risk Measures

Alexandre Marthe et al. “Efficient Risk-sensitive Planning via Entropic Risk Mea- sures”. In: arXiv preprint arXiv:2502.20423 (2025)

work page arXiv 2025
[38]

A unified view of entropy-regularized Markov decision processes

Gergely Neu, Anders Jonsson, and Vicen¸ c G´ omez. “A unified view of entropy-regularized markov decision processes”. In: arXiv preprint arXiv:1705.07798 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[39]

Sample complexity of robust reinforcement learning with a generative model

Kishan Panaganti and Dileep Kalathil. “Sample complexity of robust reinforcement learning with a generative model”. In: International Conference on Artificial Intelli- gence and Statistics . PMLR. 2022, pp. 9582–9602

work page 2022
[40]

Puterman

Martin L. Puterman. Markov decision processes: discrete stochastic dynamic program- ming. John Wiley & Sons, 2014. 13

work page 2014
[41]

Bridging offline reinforcement learning and imitation learn- ing: A tale of pessimism

Paria Rashidinejad et al. “Bridging offline reinforcement learning and imitation learn- ing: A tale of pessimism”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 11702–11716

work page 2021
[42]

Risk-aversion in multi-armed ban- dits

Amir Sani, Alessandro Lazaric, and R´ emi Munos. “Risk-aversion in multi-armed ban- dits”. In: Advances in neural information processing systems 25 (2012)

work page 2012
[43]

Robust portfolio asset allocation and risk measures

Maria Grazia Scutella and Raffaella Recchia. “Robust portfolio asset allocation and risk measures”. In: Annals of Operations Research 204.1 (2013), pp. 145–169

work page 2013
[44]

Lectures on stochas- tic programming: Modeling and theory

Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski. Lectures on stochas- tic programming: Modeling and theory . SIAM, 2021

work page 2021
[45]

The curious price of distributional robustness in reinforcement learning with a generative model

Laixi Shi et al. “The curious price of distributional robustness in reinforcement learning with a generative model”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 79903–79917

work page 2023
[46]

Variance reduced value iteration and faster algorithms for solving Markov decision processes

Aaron Sidford et al. “Variance reduced value iteration and faster algorithms for solving Markov decision processes”. In: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms . 2018, pp. 770–787

work page 2018
[47]

An analysis of model-based interval esti- mation for Markov decision processes

Alexander L Strehl and Michael L Littman. “An analysis of model-based interval esti- mation for Markov decision processes”. In: Journal of Computer and System Sciences 74.8 (2008), pp. 1309–1331

work page 2008
[48]

Risk-averse Total-reward MDPs with ERM and EVaR

Xihong Su, Marek Petrik, and Julien Grand-Cl´ ement. “Risk-averse Total-reward MDPs with ERM and EVaR”. In: Proceedings of the AAAI Conference on Artificial Intelli- gence. Vol. 39. 19. 2025, pp. 20646–20654

work page 2025
[49]

Reinforcement learning: An introduction

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction . Vol. 1. 1. MIT press Cambridge, 1998

work page 1998
[50]

Policy gradient for coherent risk measures

Aviv Tamar et al. “Policy gradient for coherent risk measures”. In: Advances in neural information processing systems 28 (2015)

work page 2015
[51]

Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time

Mengdi Wang. “Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time”. In: Mathematics of Operations Research 45.2 (2020), pp. 517–546

work page 2020
[52]

Inequalities for the L1 deviation of the empirical distribution

Tsachy Weissman et al. “Inequalities for the L1 deviation of the empirical distribution”. In: Hewlett-Packard Labs, Tech. Rep (2003), p. 125

work page 2003
[53]

Almost horizon-free structure-aware best policy identification with a generative model

Andrea Zanette, Mykel J Kochenderfer, and Emma Brunskill. “Almost horizon-free structure-aware best policy identification with a generative model”. In: Advances in Neural Information Processing Systems 32 (2019)

work page 2019
[54]

Provably efficient reinforcement learn- ing for discounted mdps with feature mapping

Dongruo Zhou, Jiafan He, and Quanquan Gu. “Provably efficient reinforcement learn- ing for discounted mdps with feature mapping”. In: International Conference on Ma- chine Learning. PMLR. 2021, pp. 12793–12802

work page 2021
[55]

Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs

Matthew Zurek and Yudong Chen. “Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs”. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems . 2024

work page 2024
[56]

The Plugin Approach for Average-Reward and Discounted MDPs: Optimal Sample Complexity Analysis

Matthew Zurek and Yudong Chen. “The Plugin Approach for Average-Reward and Discounted MDPs: Optimal Sample Complexity Analysis”. In: 36th International Con- ference on Algorithmic Learning Theory . 2025. 14 A Technical Lemmas Recall that Qk is the Q-function output by the algorithm after k iterations, πk is the greedy policy with respect to Qk, and that π...

work page 2025
[57]

The likelihood ratio can be written as L1(W ) L2(W ) = (p + α)k(1 − p − α)t−k pk(1 − p)t−k = 1 + α p k 1 − α 1 − p t−k = 1 + α p k 1 − α 1 − p k 1−p p 1 − α 1 − p t− k p . We start by bounding the second factor using that log(1− x) ≥ −x − x2 + x3 for x ∈ [0, 1 5] (Lemma 6) and that exp( x) ≥ 1 + x for all x along with our assumption that α ≤ 1−p 5 : 1 − α...

work page
[58]

Hence, L1(W ) L0(W ) = (1 − p − α)m(p + α)t−m (1 − p)mpt−m = 1 − α 1 − p m 1 + α p t−m = 1 − α 1 − p m 1 + α p m p 1−p 1 + α p t− m 1−p

Define m = t − k, which is now the number of failed coin flips. Hence, L1(W ) L0(W ) = (1 − p − α)m(p + α)t−m (1 − p)mpt−m = 1 − α 1 − p m 1 + α p t−m = 1 − α 1 − p m 1 + α p m p 1−p 1 + α p t− m 1−p . Again, using exp(1 + x) ≥ x for all x ∈ R and using that log(1 + x) ≥ x − x2 for all x ≥ 0, we get that 1 + α p p 1−p ≥ exp p 1 − p α p − α2 p2 ≥ 1 + α 1 −...

work page
[59]

We start by showing that with these parameter we have that Q∗ M1(z0 i ) − Q∗ M0(z0 i ) > 2ε, which we do by casing on the sign of β : 21 Case 1: β < 0

Fix any ( ε, δ)-correct Q-algorithm U. We start by showing that with these parameter we have that Q∗ M1(z0 i ) − Q∗ M0(z0 i ) > 2ε, which we do by casing on the sign of β : 21 Case 1: β < 0. In this case p = e−|β| 1 1−γ . We then have Q∗ M1(z0 i ) − Q∗ M0(z0 i ) = γ |β| log (p + α)e|β| 1 1−γ + 1 − p − α pe|β| 1 1−γ + 1 − p = γ |β| log 1 + α(e|β| 1 1−γ − 1...

work page
[60]

Now by theorem 5 we get that P1(B0) ≥ P1(B0 ∩ E) = E1[1E1B0] = E0 L1 L0 1E1B0 ≥ θ 4 E0[1E1B0] = θ 4 P0(E ∩ B0) ≥ θ 8 Solving for t in θ 8 > δ we find t < p(1 − p) 32α2 log( 1 8δ ) and since p(1 − p) α2 = γ2 |β|2 e−|β| 1 1−γ (1 − e−|β| 1 1−γ ) 64ε2 (e|β| 1 1−γ − 1)2 ≥ γ2 64ε2 e|β| 1 1−γ − 3 |β|2 we conclude that if the algorithm U tries the state-action pa...

work page
[61]

Let θ = exp − 32α2t p(1−p)

and using that the algorithm is ( ε, δ)-correct we have that P0(B) ≥ 1 − δ ≥ 3 4 where B = {πU(s0 i ) = a0} is the event that the algorithm outputs the action a0. Let θ = exp − 32α2t p(1−p) . Fix some t ∈ N and let k be the number of transitions to sG i in the t trials and ˜k = ( k if p ≥ 1 2 t − k if p < 1 2 Finally we define the event E as E = ˜pt − ˜k ...

work page
[62]

From Theorem 5 we get that P1(B) ≥ P1(B ∩ E) = E1[1B1E] ≥ E0 L1(W ) L0(W )1E1B ≥ E0 θ 41E1B = θ 4 P0(E ∩ B) ≥ θ 8 (25) Now solving for θ 8 > δ we see that if t < ˜t(ε, δ) := 1 800 log( 1 8δ ) γ2 ε2 · e|β| 1 1−γ − 3 |β|2 (26) then P1(B) > δ and the event B is containing the event that the algorithm does not choose the optimal action al. Since this holds fo...

work page

[1] [1]

Model-based reinforcement learning with a generative model is minimax optimal

Alekh Agarwal, Sham Kakade, and Lin F Yang. “Model-based reinforcement learning with a generative model is minimax optimal”. In: Conference on Learning Theory . PMLR. 2020, pp. 67–83

work page 2020

[2] [2]

Entropic value-at-risk: A new coherent risk measure

Amir Ahmadi-Javid. “Entropic value-at-risk: A new coherent risk measure”. In: Journal of Optimization Theory and Applications 155 (2012), pp. 1105–1123

work page 2012

[3] [3]

Adaptive sampling for best policy identi- fication in Markov decision processes

Aymen Al Marjani and Alexandre Proutiere. “Adaptive sampling for best policy identi- fication in Markov decision processes”. In:International Conference on Machine Learn- ing. PMLR. 2021, pp. 7459–7468

work page 2021

[4] [4]

A note on a new class of recursive utilities in Markov decision processes

Hubert Asienkiewicz and Anna Ja´ skiewicz. “A note on a new class of recursive utilities in Markov decision processes”. In: Applicationes Mathematicae 44 (2017), pp. 149–161. 11

work page 2017

[5] [5]

Reinforcement learning with a near optimal rate of convergence

Mohammad Gheshlaghi Azar et al. “Reinforcement learning with a near optimal rate of convergence”. In: (2011)

work page 2011

[6] [6]

Markov decision processes with recursive risk measures

Nicole B¨ auerle and Alexander Glauner. “Markov decision processes with recursive risk measures”. In: European Journal of Operational Research 296.3 (2022), pp. 953–966

work page 2022

[7] [7]

Markov decision processes with average-value-at- risk criteria

Nicole B¨ auerle and Jonathan Ott. “Markov decision processes with average-value-at- risk criteria”. In: Mathematical Methods of Operations Research 74 (2011), pp. 361– 379

work page 2011

[8] [8]

Distributional reinforcement learning

Marc G Bellemare, Will Dabney, and Mark Rowland. Distributional reinforcement learning. MIT Press, 2023

work page 2023

[9] [9]

Risk-sensitive dynamic asset management

Tomasz R Bielecki and Stanley R Pliska. “Risk-sensitive dynamic asset management”. In: Applied Mathematics and Optimization 39 (1999), pp. 337–360

work page 1999

[10] [10]

Risk-averse policy optimization via risk-neutral policy optimiza- tion

Lorenzo Bisi et al. “Risk-averse policy optimization via risk-neutral policy optimiza- tion”. In: Artificial Intelligence 311 (2022), p. 103765

work page 2022

[11] [11]

Risk-sensitive optimal control for Markov decision processes with monotone cost

Vivek S Borkar and Sean P Meyn. “Risk-sensitive optimal control for Markov decision processes with monotone cost”. In: Mathematics of Operations Research 27.1 (2002), pp. 192–209

work page 2002

[12] [12]

Bayesian robust optimization for imitation learning

Daniel Brown, Scott Niekum, and Marek Petrik. “Bayesian robust optimization for imitation learning”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 2479–2491

work page 2020

[13] [13]

Provably Efficient Iterated CVaR Reinforcement Learning with Func- tion Approximation and Human Feedback

Yu Chen et al. “Provably Efficient Iterated CVaR Reinforcement Learning with Func- tion Approximation and Human Feedback”. In: The Twelfth International Conference on Learning Representations. 2024

work page 2024

[14] [14]

Finite-sample analysis of contractive stochastic approximation us- ing smooth convex envelopes

Zaiwei Chen et al. “Finite-sample analysis of contractive stochastic approximation us- ing smooth convex envelopes”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 8223–8234

work page 2020

[15] [15]

Algorithms for CVaR optimization in MDPs

Yinlam Chow and Mohammad Ghavamzadeh. “Algorithms for CVaR optimization in MDPs”. In: Advances in neural information processing systems 27 (2014)

work page 2014

[16] [16]

Percentile optimization for Markov decision processes with parameter uncertainty

Erick Delage and Shie Mannor. “Percentile optimization for Markov decision processes with parameter uncertainty”. In: Operations research 58.1 (2010), pp. 203–213

work page 2010

[17] [17]

Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model

Zilong Deng, Simon Khan, and Shaofeng Zou. “Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model”. In:AISTATS. 2025

work page 2025

[18] [18]

Provably Efficient Risk-Sensitive Rein- forcement Learning: Iterated CVaR and Worst Path

Yihan Du, Siwei Wang, and Longbo Huang. “Provably Efficient Risk-Sensitive Rein- forcement Learning: Iterated CVaR and Worst Path”. In: The Eleventh International Conference on Learning Representations. 2023

work page 2023

[19] [19]

Clinical data based optimal STI strategies for HIV: A reinforce- ment learning approach

Damien Ernst et al. “Clinical data based optimal STI strategies for HIV: A reinforce- ment learning approach”. In: Proceedings of the 45th IEEE Conference on Decision and Control. IEEE. 2006, pp. 667–672

work page 2006

[20] [20]

Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret

Yingjie Fei et al. “Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 22384–22395

work page 2020

[21] [21]

Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

Mohammad Gheshlaghi Azar, R´ emi Munos, and Hilbert J Kappen. “Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model”. In: Machine learning 91 (2013), pp. 325–349

work page 2013

[22] [22]

Entropic risk optimiza- tion in discounted MDPs

Jia Lin Hau, Marek Petrik, and Mohammad Ghavamzadeh. “Entropic risk optimiza- tion in discounted MDPs”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 47–76. 12

work page 2023

[23] [23]

Risk-sensitive Markov decision processes

Ronald A Howard and James E Matheson. “Risk-sensitive Markov decision processes”. In: Management science 18.7 (1972), pp. 356–369

work page 1972

[24] [24]

A tighter problem-dependent regret bound for risk- sensitive reinforcement learning

Xiaoyan Hu and Ho-fung Leung. “A tighter problem-dependent regret bound for risk- sensitive reinforcement learning”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 5411–5437

work page 2023

[25] [25]

A utility criterion for Markov decision processes

Stratton C Jaquette. “A utility criterion for Markov decision processes”. In: Manage- ment Science 23.1 (1976), pp. 43–49

work page 1976

[26] [26]

On the sample complexity of reinforcement learning

Sham Machandranath Kakade. On the sample complexity of reinforcement learning . University of London, University College London (United Kingdom), 2003

work page 2003

[27] [27]

Finite-sample convergence rates for Q-learning and indirect algorithms

Michael Kearns and Satinder Singh. “Finite-sample convergence rates for Q-learning and indirect algorithms”. In: Advances in neural information processing systems 11 (1998)

work page 1998

[28] [28]

Near-optimal reinforcement learning in polyno- mial time

Michael Kearns and Satinder Singh. “Near-optimal reinforcement learning in polyno- mial time”. In: Machine learning 49 (2002), pp. 209–232

work page 2002

[29] [29]

A revised ap- proach for risk-averse multi-armed bandits under CVaR criterion

Najakorn Khajonchotpanya, Yilin Xue, and Napat Rujeerapaiboon. “A revised ap- proach for risk-averse multi-armed bandits under CVaR criterion”. In: Operations Re- search Letters 49.4 (2021), pp. 465–472

work page 2021

[30] [30]

Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation

Thanh Lam et al. “Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation”. In: The Eleventh International Conference on Learning Representations. 2022

work page 2022

[31] [31]

Near-optimal PAC bounds for discounted MDPs

Tor Lattimore and Marcus Hutter. “Near-optimal PAC bounds for discounted MDPs”. In: Theoretical Computer Science 558 (2014), pp. 125–143

work page 2014

[32] [32]

Breaking the sample size barrier in model-based reinforcement learning with a generative model

Gen Li et al. “Breaking the sample size barrier in model-based reinforcement learning with a generative model”. In: Advances in neural information processing systems 33 (2020), pp. 12861–12872

work page 2020

[33] [33]

Settling the sample complexity of model-based offline reinforcement learning

Gen Li et al. “Settling the sample complexity of model-based offline reinforcement learning”. In: The Annals of Statistics 52.1 (2024), pp. 233–260

work page 2024

[34] [34]

Regret bounds for risk-sensitive reinforcement learn- ing with lipschitz dynamic risk measures

Hao Liang and Zhiquan Luo. “Regret bounds for risk-sensitive reinforcement learn- ing with lipschitz dynamic risk measures”. In: International Conference on Artificial Intelligence and Statistics . PMLR. 2024, pp. 1774–1782

work page 2024

[35] [35]

Robust risk-averse stochastic multi-armed bandits

Odalric-Ambrym Maillard. “Robust risk-averse stochastic multi-armed bandits”. In: Algorithmic Learning Theory: 24th International Conference, ALT 2013, Singapore, October 6-9, 2013. Proceedings 24. Springer. 2013, pp. 218–233

work page 2013

[36] [36]

Beyond average return in markov decision processes

Alexandre Marthe, Aur´ elien Garivier, and Claire Vernade. “Beyond average return in markov decision processes”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 56488–56507

work page 2023

[37] [37]

Efficient Risk-sensitive Planning via Entropic Risk Measures

Alexandre Marthe et al. “Efficient Risk-sensitive Planning via Entropic Risk Mea- sures”. In: arXiv preprint arXiv:2502.20423 (2025)

work page arXiv 2025

[38] [38]

A unified view of entropy-regularized Markov decision processes

Gergely Neu, Anders Jonsson, and Vicen¸ c G´ omez. “A unified view of entropy-regularized markov decision processes”. In: arXiv preprint arXiv:1705.07798 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[39] [39]

Sample complexity of robust reinforcement learning with a generative model

Kishan Panaganti and Dileep Kalathil. “Sample complexity of robust reinforcement learning with a generative model”. In: International Conference on Artificial Intelli- gence and Statistics . PMLR. 2022, pp. 9582–9602

work page 2022

[40] [40]

Puterman

Martin L. Puterman. Markov decision processes: discrete stochastic dynamic program- ming. John Wiley & Sons, 2014. 13

work page 2014

[41] [41]

Bridging offline reinforcement learning and imitation learn- ing: A tale of pessimism

Paria Rashidinejad et al. “Bridging offline reinforcement learning and imitation learn- ing: A tale of pessimism”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 11702–11716

work page 2021

[42] [42]

Risk-aversion in multi-armed ban- dits

Amir Sani, Alessandro Lazaric, and R´ emi Munos. “Risk-aversion in multi-armed ban- dits”. In: Advances in neural information processing systems 25 (2012)

work page 2012

[43] [43]

Robust portfolio asset allocation and risk measures

Maria Grazia Scutella and Raffaella Recchia. “Robust portfolio asset allocation and risk measures”. In: Annals of Operations Research 204.1 (2013), pp. 145–169

work page 2013

[44] [44]

Lectures on stochas- tic programming: Modeling and theory

Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski. Lectures on stochas- tic programming: Modeling and theory . SIAM, 2021

work page 2021

[45] [45]

The curious price of distributional robustness in reinforcement learning with a generative model

Laixi Shi et al. “The curious price of distributional robustness in reinforcement learning with a generative model”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 79903–79917

work page 2023

[46] [46]

Variance reduced value iteration and faster algorithms for solving Markov decision processes

Aaron Sidford et al. “Variance reduced value iteration and faster algorithms for solving Markov decision processes”. In: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms . 2018, pp. 770–787

work page 2018

[47] [47]

An analysis of model-based interval esti- mation for Markov decision processes

Alexander L Strehl and Michael L Littman. “An analysis of model-based interval esti- mation for Markov decision processes”. In: Journal of Computer and System Sciences 74.8 (2008), pp. 1309–1331

work page 2008

[48] [48]

Risk-averse Total-reward MDPs with ERM and EVaR

Xihong Su, Marek Petrik, and Julien Grand-Cl´ ement. “Risk-averse Total-reward MDPs with ERM and EVaR”. In: Proceedings of the AAAI Conference on Artificial Intelli- gence. Vol. 39. 19. 2025, pp. 20646–20654

work page 2025

[49] [49]

Reinforcement learning: An introduction

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction . Vol. 1. 1. MIT press Cambridge, 1998

work page 1998

[50] [50]

Policy gradient for coherent risk measures

Aviv Tamar et al. “Policy gradient for coherent risk measures”. In: Advances in neural information processing systems 28 (2015)

work page 2015

[51] [51]

Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time

Mengdi Wang. “Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time”. In: Mathematics of Operations Research 45.2 (2020), pp. 517–546

work page 2020

[52] [52]

Inequalities for the L1 deviation of the empirical distribution

Tsachy Weissman et al. “Inequalities for the L1 deviation of the empirical distribution”. In: Hewlett-Packard Labs, Tech. Rep (2003), p. 125

work page 2003

[53] [53]

Almost horizon-free structure-aware best policy identification with a generative model

Andrea Zanette, Mykel J Kochenderfer, and Emma Brunskill. “Almost horizon-free structure-aware best policy identification with a generative model”. In: Advances in Neural Information Processing Systems 32 (2019)

work page 2019

[54] [54]

Provably efficient reinforcement learn- ing for discounted mdps with feature mapping

Dongruo Zhou, Jiafan He, and Quanquan Gu. “Provably efficient reinforcement learn- ing for discounted mdps with feature mapping”. In: International Conference on Ma- chine Learning. PMLR. 2021, pp. 12793–12802

work page 2021

[55] [55]

Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs

Matthew Zurek and Yudong Chen. “Span-Based Optimal Sample Complexity for Weakly Communicating and General Average Reward MDPs”. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems . 2024

work page 2024

[56] [56]

The Plugin Approach for Average-Reward and Discounted MDPs: Optimal Sample Complexity Analysis

Matthew Zurek and Yudong Chen. “The Plugin Approach for Average-Reward and Discounted MDPs: Optimal Sample Complexity Analysis”. In: 36th International Con- ference on Algorithmic Learning Theory . 2025. 14 A Technical Lemmas Recall that Qk is the Q-function output by the algorithm after k iterations, πk is the greedy policy with respect to Qk, and that π...

work page 2025

[57] [57]

The likelihood ratio can be written as L1(W ) L2(W ) = (p + α)k(1 − p − α)t−k pk(1 − p)t−k = 1 + α p k 1 − α 1 − p t−k = 1 + α p k 1 − α 1 − p k 1−p p 1 − α 1 − p t− k p . We start by bounding the second factor using that log(1− x) ≥ −x − x2 + x3 for x ∈ [0, 1 5] (Lemma 6) and that exp( x) ≥ 1 + x for all x along with our assumption that α ≤ 1−p 5 : 1 − α...

work page

[58] [58]

Hence, L1(W ) L0(W ) = (1 − p − α)m(p + α)t−m (1 − p)mpt−m = 1 − α 1 − p m 1 + α p t−m = 1 − α 1 − p m 1 + α p m p 1−p 1 + α p t− m 1−p

Define m = t − k, which is now the number of failed coin flips. Hence, L1(W ) L0(W ) = (1 − p − α)m(p + α)t−m (1 − p)mpt−m = 1 − α 1 − p m 1 + α p t−m = 1 − α 1 − p m 1 + α p m p 1−p 1 + α p t− m 1−p . Again, using exp(1 + x) ≥ x for all x ∈ R and using that log(1 + x) ≥ x − x2 for all x ≥ 0, we get that 1 + α p p 1−p ≥ exp p 1 − p α p − α2 p2 ≥ 1 + α 1 −...

work page

[59] [59]

We start by showing that with these parameter we have that Q∗ M1(z0 i ) − Q∗ M0(z0 i ) > 2ε, which we do by casing on the sign of β : 21 Case 1: β < 0

Fix any ( ε, δ)-correct Q-algorithm U. We start by showing that with these parameter we have that Q∗ M1(z0 i ) − Q∗ M0(z0 i ) > 2ε, which we do by casing on the sign of β : 21 Case 1: β < 0. In this case p = e−|β| 1 1−γ . We then have Q∗ M1(z0 i ) − Q∗ M0(z0 i ) = γ |β| log (p + α)e|β| 1 1−γ + 1 − p − α pe|β| 1 1−γ + 1 − p = γ |β| log 1 + α(e|β| 1 1−γ − 1...

work page

[60] [60]

Now by theorem 5 we get that P1(B0) ≥ P1(B0 ∩ E) = E1[1E1B0] = E0 L1 L0 1E1B0 ≥ θ 4 E0[1E1B0] = θ 4 P0(E ∩ B0) ≥ θ 8 Solving for t in θ 8 > δ we find t < p(1 − p) 32α2 log( 1 8δ ) and since p(1 − p) α2 = γ2 |β|2 e−|β| 1 1−γ (1 − e−|β| 1 1−γ ) 64ε2 (e|β| 1 1−γ − 1)2 ≥ γ2 64ε2 e|β| 1 1−γ − 3 |β|2 we conclude that if the algorithm U tries the state-action pa...

work page

[61] [61]

Let θ = exp − 32α2t p(1−p)

and using that the algorithm is ( ε, δ)-correct we have that P0(B) ≥ 1 − δ ≥ 3 4 where B = {πU(s0 i ) = a0} is the event that the algorithm outputs the action a0. Let θ = exp − 32α2t p(1−p) . Fix some t ∈ N and let k be the number of transitions to sG i in the t trials and ˜k = ( k if p ≥ 1 2 t − k if p < 1 2 Finally we define the event E as E = ˜pt − ˜k ...

work page

[62] [62]

From Theorem 5 we get that P1(B) ≥ P1(B ∩ E) = E1[1B1E] ≥ E0 L1(W ) L0(W )1E1B ≥ E0 θ 41E1B = θ 4 P0(E ∩ B) ≥ θ 8 (25) Now solving for θ 8 > δ we see that if t < ˜t(ε, δ) := 1 800 log( 1 8δ ) γ2 ε2 · e|β| 1 1−γ − 3 |β|2 (26) then P1(B) > δ and the event B is containing the event that the algorithm does not choose the optimal action al. Since this holds fo...

work page