On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

Mohammad Sadegh Talebi; Oliver Mortensen

arxiv: 2605.21763 · v1 · pith:NZT4L5TYnew · submitted 2026-05-20 · 💻 cs.LG · cs.SY· eess.SY· stat.ML

On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

Oliver Mortensen , Mohammad Sadegh Talebi This is my paper

Pith reviewed 2026-05-22 09:23 UTC · model grok-4.3

classification 💻 cs.LG cs.SYeess.SYstat.ML

keywords risk-sensitive reinforcement learningoptimized certainty equivalentPAC sample complexitydiscounted MDPsgenerative modelCVaRentropic riskvalue function learning

0 comments

The pith

Utility functions with full domain over the reals make optimized certainty equivalent objectives PAC-learnable in discounted MDPs, while restricted domains make them impossible to learn.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies risk-sensitive reinforcement learning in finite discounted MDPs where a generative model provides samples. It gives an exact characterization of which utility functions allow the optimized certainty equivalent objective to be learned to high accuracy with finite samples. This matters because risk-sensitive criteria let agents avoid rare but costly failures, yet only certain versions of them permit reliable learning from data. The authors analyze a model-based method to obtain explicit PAC sample complexity upper bounds and prove matching lower bounds that are tight in the number of state-action pairs.

Core claim

The paper claims that the OCE objective is PAC-learnable if and only if the utility function u has full domain equal to the real numbers; whenever the domain is a proper subset the problem is not PAC-learnable. For utilities that satisfy the domain condition a simple model-based planner yields PAC bounds whose leading term scales with the size of the state-action space. The same analysis produces lower bounds that match this dependence on SA and, for CVaR, improve the dependence on the risk level tau to 1 over tau squared.

What carries the argument

The optimized certainty equivalent (OCE) risk measure, which for a utility function u returns the optimized value of an expectation involving u applied to a random return.

If this is right

When the utility has full domain the model-based planner returns an epsilon-optimal value function after a number of samples that scales linearly with SA and polynomially with the effective horizon.
No algorithm can achieve PAC guarantees for any utility whose domain excludes some real numbers, regardless of sample budget.
For CVaR the sample complexity necessarily grows at least as 1 over tau squared, improving prior dependence on the risk parameter by a factor of 1 over tau.
The lower bounds remain tight even when the effective horizon is treated as a separate parameter.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The characterization suggests that entropic risk, which has full domain, is preferable to CVaR when sample efficiency is the primary concern.
The same domain condition may govern learnability in risk-sensitive settings that lack an explicit generative model but still admit concentration inequalities.
Extending the analysis to infinite state spaces or continuous actions would require new concentration arguments that preserve the domain-based separation.

Load-bearing premise

The analysis assumes a generative model that can produce independent next-state and reward samples for every state-action pair and that the risk-sensitive value function is obtained by recursively applying the OCE operator at every step.

What would settle it

A concrete demonstration that some utility function whose domain is strictly smaller than the reals still permits an algorithm to return an epsilon-optimal OCE value function with probability 1-delta after polynomially many generative samples would falsify the non-learnability claim.

Figures

Figures reproduced from arXiv: 2605.21763 by Mohammad Sadegh Talebi, Oliver Mortensen.

read the original abstract

We study risk-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available. We consider a family or risk measures called the optimized certainty equivalent (OCE), which includes important risk measures such as entropic risk, CVaR, and mean-variance. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive OCE. We provide an exact characterization of utility functions $u$ for which the corresponding OCE defines an objective that is PAC-learnable. We analyze a simple model-based approach and derive PAC sample complexity bounds. We establish that whenever $u$ does not have full domain $\text{dom}(u)\neq \mathbb{R}$, the corresponding problem is not PAC-learnable. Finally, we establish corresponding lower bounds for both value and policy learning, demonstrating tightness in the size $SA$ of state-action space, and for a more restricted class of utilities, we derive lower bounds that makes the dependence on the effective horizon $\frac{1}{1-\gamma}$ explicit. Specifically, for $\text{CVaR}_\tau$ we show that the correct dependence on $\tau$ is $\frac{1}{\tau^2}$, thus improving by a factor of $\frac{1}{\tau}$ over state-of-the-art although our bound has a suboptimal dependence on $\frac{1}{1-\gamma}$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies risk-sensitive RL in finite discounted MDPs with a generative model, focusing on recursive optimized certainty equivalents (OCE) for risk measures including entropic risk, CVaR, and mean-variance. It claims an exact characterization of utility functions u making the OCE objective PAC-learnable, derives PAC sample complexity upper bounds via a model-based algorithm for both value and policy learning, proves that dom(u) ≠ ℝ implies the problem is not PAC-learnable, and provides matching lower bounds (tight in SA, with explicit 1/(1-γ) dependence for restricted utilities and 1/τ² dependence for CVaR_τ).

Significance. If the characterization, bounds, and non-learnability result hold, the work is significant for precisely delineating when OCE-based risk-sensitive objectives are sample-efficiently learnable in RL. The matching lower bounds, the improvement over prior CVaR dependence on τ, and the negative result for limited-domain utilities would be valuable contributions to understanding limitations of risk measures.

major comments (2)

[Non-learnability result] Non-learnability section: The central claim that dom(u) ≠ ℝ renders the problem not PAC-learnable is load-bearing for the exact characterization. The manuscript must explicitly show that recursive application of OCE can produce effective return distributions with support outside dom(u) for some reachable policies in finite MDPs (even under bounded rewards and discounting), rendering the objective undefined; otherwise the non-learnability direction does not follow.
[Lower bounds] Lower bounds for CVaR: The claimed 1/τ² dependence (improving prior work by a factor of 1/τ) is presented as tight. The information-theoretic construction should be verified to correctly incorporate the recursive OCE nesting rather than reducing to a non-recursive or single-step risk measure.

minor comments (2)

[Abstract and lower bounds section] Clarify the precise class of utilities for which the explicit 1/(1-γ) lower bound is derived, as the abstract refers to 'a more restricted class' without naming it.
[Preliminaries] Notation consistency: Ensure dom(u) and the condition dom(u)≠ℝ are defined and used uniformly when discussing domain restrictions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the major comments point by point below, indicating planned revisions where appropriate.

read point-by-point responses

Referee: [Non-learnability result] Non-learnability section: The central claim that dom(u) ≠ ℝ renders the problem not PAC-learnable is load-bearing for the exact characterization. The manuscript must explicitly show that recursive application of OCE can produce effective return distributions with support outside dom(u) for some reachable policies in finite MDPs (even under bounded rewards and discounting), rendering the objective undefined; otherwise the non-learnability direction does not follow.

Authors: We agree that an explicit construction would strengthen the argument. In the revision we will add a concrete finite MDP example (two states, bounded rewards in [0,1], γ=0.9) showing a policy whose recursively computed return distribution has positive probability mass outside dom(u) for a utility with limited domain, even after discounting. This will be placed in the non-learnability section to directly support the claim. revision: yes
Referee: [Lower bounds] Lower bounds for CVaR: The claimed 1/τ² dependence (improving prior work by a factor of 1/τ) is presented as tight. The information-theoretic construction should be verified to correctly incorporate the recursive OCE nesting rather than reducing to a non-recursive or single-step risk measure.

Authors: Our information-theoretic lower bound is constructed for the fully recursive OCE objective: the hard MDP family uses a chain of states in which CVaR_τ is applied at every step, and the sample lower bound is derived from the propagation of tail estimation error through the nested operators. The 1/τ² term arises precisely because accurate estimation of the conditional tail must be maintained across recursion levels. We will add an explicit paragraph in the proof tracing the recursion to address the verification request. revision: partial

Circularity Check

0 steps flagged

No circularity: theoretical bounds derived independently from model-based analysis and information-theoretic arguments

full rationale

The paper derives an exact characterization of utility functions u making recursive OCE PAC-learnable in finite discounted MDPs, along with sample complexity upper bounds from a model-based algorithm and matching lower bounds. These results rest on explicit assumptions about the generative model, recursive application of OCE to value functions, and standard PAC analysis techniques. No step reduces a claimed prediction or first-principles result to a fitted parameter from the same data, a self-definitional construction, or a load-bearing self-citation chain. The non-learnability claim for dom(u) ≠ ℝ follows from showing the objective can be undefined for some policies, which is an independent argument rather than a renaming or smuggling of prior results. The derivation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The central claims rest on standard finite-MDP assumptions and the recursive definition of OCE; no new entities are postulated and no parameters are fitted to data.

axioms (3)

domain assumption The MDP has finite state and action spaces.
Required for the SA term in the sample-complexity bounds.
domain assumption A generative model that returns exact transition samples is available.
Enables the model-based algorithm analyzed in the abstract.
standard math The discount factor satisfies 0 ≤ γ < 1.
Defines the effective horizon appearing in the bounds.

pith-pipeline@v0.9.0 · 5805 in / 1575 out tokens · 58454 ms · 2026-05-22T09:23:53.042659+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We establish that whenever u does not have full domain dom(u)≠R, the corresponding problem is not PAC-learnable.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We provide an exact characterization of utility functions u for which the corresponding OCE defines an objective that is PAC-learnable.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 1 internal anchor

[1]

Model-based reinforcement learning with a gen- erative model is minimax optimal

Alekh Agarwal, Sham Kakade, and Lin F Yang. “Model-based reinforcement learning with a gen- erative model is minimax optimal”. In:Conference on Learning Theory. PMLR. 2020, pp. 67–83

work page 2020
[2]

Entropic value-at-risk: A new coherent risk measure

Amir Ahmadi-Javid. “Entropic value-at-risk: A new coherent risk measure”. In:Journal of Opti- mization Theory and Applications155 (2012), pp. 1105–1123

work page 2012
[3]

A note on a new class of recursive utilities in Markov decision processes

Hubert Asienkiewicz and Anna Ja´ skiewicz. “A note on a new class of recursive utilities in Markov decision processes”. In:Applicationes Mathematicae44 (2017), pp. 149–161

work page 2017
[4]

Markov decision processes with recursive risk measures

Nicole B¨ auerle and Alexander Glauner. “Markov decision processes with recursive risk measures”. In:European Journal of Operational Research296.3 (2022), pp. 953–966

work page 2022
[5]

Markov decision processes with risk-sensitive criteria: An overview

Nicole B¨ auerle and Anna Ja´ skiewicz. “Markov decision processes with risk-sensitive criteria: An overview”. In:Mathematical Methods of Operations Research99.1 (2024), pp. 141–178

work page 2024
[6]

Markov decision processes with average-value-at-risk criteria

Nicole B¨ auerle and Jonathan Ott. “Markov decision processes with average-value-at-risk criteria”. In:Mathematical Methods of Operations Research74 (2011), pp. 361–379

work page 2011
[7]

More risk-sensitive Markov decision processes

Nicole B¨ auerle and Ulrich Rieder. “More risk-sensitive Markov decision processes”. In:Mathematics of Operations Research39.1 (2014), pp. 105–120

work page 2014
[8]

An old-new concept of convex risk measures: The optimized certainty equivalent

Aharon Ben-Tal and Marc Teboulle. “An old-new concept of convex risk measures: The optimized certainty equivalent”. In:Mathematical Finance17.3 (2007), pp. 449–476

work page 2007
[9]

Risk-sensitive dynamic asset management

Tomasz R Bielecki and Stanley R Pliska. “Risk-sensitive dynamic asset management”. In:Applied Mathematics and Optimization39 (1999), pp. 337–360

work page 1999
[10]

Risk-averse policy optimization via risk-neutral policy optimization

Lorenzo Bisi et al. “Risk-averse policy optimization via risk-neutral policy optimization”. In:Arti- ficial Intelligence311 (2022), p. 103765

work page 2022
[11]

Q-learning for risk-sensitive control

Vivek S Borkar. “Q-learning for risk-sensitive control”. In:Mathematics of operations research27.2 (2002), pp. 294–311

work page 2002
[12]

Risk-sensitive optimal control for Markov decision processes with monotone cost

Vivek S Borkar and Sean P Meyn. “Risk-sensitive optimal control for Markov decision processes with monotone cost”. In:Mathematics of Operations Research27.1 (2002), pp. 192–209

work page 2002
[13]

Bayesian robust optimization for imitation learn- ing

Daniel Brown, Scott Niekum, and Marek Petrik. “Bayesian robust optimization for imitation learn- ing”. In:Advances in Neural Information Processing Systems33 (2020), pp. 2479–2491

work page 2020
[14]

Provably Efficient Iterated CVaR Reinforcement Learning with Function Approxi- mation and Human Feedback

Yu Chen et al. “Provably Efficient Iterated CVaR Reinforcement Learning with Function Approxi- mation and Human Feedback”. In:The Twelfth International Conference on Learning Representa- tions. 2024

work page 2024
[15]

Algorithms for CVaR optimization in MDPs

Yinlam Chow and Mohammad Ghavamzadeh. “Algorithms for CVaR optimization in MDPs”. In: Advances in neural information processing systems27 (2014). 12

work page 2014
[16]

Percentile optimization for Markov decision processes with param- eter uncertainty

Erick Delage and Shie Mannor. “Percentile optimization for Markov decision processes with param- eter uncertainty”. In:Operations research58.1 (2010), pp. 203–213

work page 2010
[17]

Cambridge University Press, 2002

Michael Alan Howarth Dempster.Risk management: value at risk and beyond. Cambridge University Press, 2002

work page 2002
[18]

Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model

Zilong Deng, Simon Khan, and Shaofeng Zou. “Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model”. In:The 28th International Conference on Artificial Intelligence and Statistics. 2025

work page 2025
[19]

Provably Efficient Risk-Sensitive Reinforcement Learn- ing: Iterated CVaR and Worst Path

Yihan Du, Siwei Wang, and Longbo Huang. “Provably Efficient Risk-Sensitive Reinforcement Learn- ing: Iterated CVaR and Worst Path”. In:The Eleventh International Conference on Learning Rep- resentations. 2023

work page 2023
[20]

Clinical data based optimal STI strategies for HIV: A reinforcement learning approach

Damien Ernst et al. “Clinical data based optimal STI strategies for HIV: A reinforcement learning approach”. In:Proceedings of the 45th IEEE Conference on Decision and Control. IEEE. 2006, pp. 667–672

work page 2006
[21]

Risk-sensitive reinforcement learning with function approximation: A debiasing approach

Yingjie Fei, Zhuoran Yang, and Zhaoran Wang. “Risk-sensitive reinforcement learning with function approximation: A debiasing approach”. In:International Conference on Machine Learning. PMLR. 2021, pp. 3198–3207

work page 2021
[22]

Exponential Bellman equation and improved regret bounds for risk-sensitive re- inforcement learning

Yingjie Fei et al. “Exponential Bellman equation and improved regret bounds for risk-sensitive re- inforcement learning”. In:Advances in neural information processing systems34 (2021), pp. 20436– 20446

work page 2021
[23]

Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in re- gret

Yingjie Fei et al. “Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in re- gret”. In:Advances in Neural Information Processing Systems33 (2020), pp. 22384–22395

work page 2020
[24]

Convex and coherent risk measures

Hans F¨ ollmer and Alexander Schied. “Convex and coherent risk measures”. In:Encyclopedia of Quantitative Finance(2010), pp. 355–363

work page 2010
[25]

Risk-Seeking Reinforcement Learning via Multi-Timescale EVaR Optimization

Deep Kumar Ganguly et al. “Risk-Seeking Reinforcement Learning via Multi-Timescale EVaR Optimization”. In:Transactions on Machine Learning Research()

work page
[26]

Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

Mohammad Gheshlaghi Azar, R´ emi Munos, and Hilbert J Kappen. “Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model”. In:Machine learning91 (2013), pp. 325–349

work page 2013
[27]

Entropic risk optimization in dis- counted MDPs

Jia Lin Hau, Marek Petrik, and Mohammad Ghavamzadeh. “Entropic risk optimization in dis- counted MDPs”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 47–76

work page 2023
[28]

On dynamic programming decompositions of static risk measures in Markov decision processes

Jia Lin Hau et al. “On dynamic programming decompositions of static risk measures in Markov decision processes”. In:Advances in Neural Information Processing Systems36 (2023), pp. 51734– 51757

work page 2023
[29]

Risk-sensitive Markov decision processes

Ronald A Howard and James E Matheson. “Risk-sensitive Markov decision processes”. In:Man- agement science18.7 (1972), pp. 356–369

work page 1972
[30]

A tighter problem-dependent regret bound for risk-sensitive rein- forcement learning

Xiaoyan Hu and Ho-fung Leung. “A tighter problem-dependent regret bound for risk-sensitive rein- forcement learning”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 5411–5437

work page 2023
[31]

Achieving mean–variance efficiency by continuous- time reinforcement learning

Yilie Huang, Yanwei Jia, and Xunyu Zhou. “Achieving mean–variance efficiency by continuous- time reinforcement learning”. In:Proceedings of the Third ACM International Conference on AI in Finance. 2022, pp. 377–385

work page 2022
[32]

A utility criterion for Markov decision processes

Stratton C Jaquette. “A utility criterion for Markov decision processes”. In:Management Science 23.1 (1976), pp. 43–49

work page 1976
[33]

Truncated variance reduced value iteration

Yujia Jin et al. “Truncated variance reduced value iteration”. In:Advances in Neural Information Processing Systems37 (2024), pp. 117481–117508. 13

work page 2024
[34]

University of London, University College London (United Kingdom), 2003

Sham Machandranath Kakade.On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom), 2003

work page 2003
[35]

Risk-aware high-level decisions for automated driving at occluded inter- sections with reinforcement learning

Danial Kamran et al. “Risk-aware high-level decisions for automated driving at occluded inter- sections with reinforcement learning”. In:2020 IEEE Intelligent Vehicles Symposium (IV). IEEE. 2020, pp. 1205–1212

work page 2020
[36]

Finite-sample convergence rates for Q-learning and indirect algorithms

Michael Kearns and Satinder Singh. “Finite-sample convergence rates for Q-learning and indirect algorithms”. In:Advances in neural information processing systems11 (1998)

work page 1998
[37]

A revised approach for risk- averse multi-armed bandits under CVaR criterion

Najakorn Khajonchotpanya, Yilin Xue, and Napat Rujeerapaiboon. “A revised approach for risk- averse multi-armed bandits under CVaR criterion”. In:Operations Research Letters49.4 (2021), pp. 465–472

work page 2021
[38]

Actor-critic algorithms for risk-sensitive MDPs

Prashanth La and Mohammad Ghavamzadeh. “Actor-critic algorithms for risk-sensitive MDPs”. In:Advances in neural information processing systems26 (2013)

work page 2013
[39]

Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation

Thanh Lam et al. “Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation”. In:The Eleventh International Conference on Learning Representations. 2022

work page 2022
[40]

Near-optimal PAC bounds for discounted MDPs

Tor Lattimore and Marcus Hutter. “Near-optimal PAC bounds for discounted MDPs”. In:Theo- retical Computer Science558 (2014), pp. 125–143

work page 2014
[41]

Risk-Averse Constrained Reinforcement Learning with Optimized Certainty Equivalents

Jane H Lee et al. “Risk-Averse Constrained Reinforcement Learning with Optimized Certainty Equivalents”. In:The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025

work page 2025
[42]

Optimal dynamic portfolio selection: Multiperiod mean-variance formulation

Duan Li and Wan-Lung Ng. “Optimal dynamic portfolio selection: Multiperiod mean-variance formulation”. In:Mathematical finance10.3 (2000), pp. 387–406

work page 2000
[43]

Breaking the sample size barrier in model-based reinforcement learning with a generative model

Gen Li et al. “Breaking the sample size barrier in model-based reinforcement learning with a generative model”. In:Advances in neural information processing systems33 (2020), pp. 12861– 12872

work page 2020
[44]

Breaking the sample size barrier in model-based reinforcement learning with a generative model

Gen Li et al. “Breaking the sample size barrier in model-based reinforcement learning with a generative model”. In:Operations Research72.1 (2024), pp. 203–221

work page 2024
[45]

Settling the sample complexity of model-based offline reinforcement learning

Gen Li et al. “Settling the sample complexity of model-based offline reinforcement learning”. In: The Annals of Statistics52.1 (2024), pp. 233–260

work page 2024
[46]

Regret bounds for risk-sensitive reinforcement learning with lips- chitz dynamic risk measures

Hao Liang and Zhiquan Luo. “Regret bounds for risk-sensitive reinforcement learning with lips- chitz dynamic risk measures”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2024, pp. 1774–1782

work page 2024
[47]

Robust risk-averse stochastic multi-armed bandits

Odalric-Ambrym Maillard. “Robust risk-averse stochastic multi-armed bandits”. In:Algorithmic Learning Theory: 24th International Conference, ALT 2013, Singapore, October 6-9, 2013. Pro- ceedings 24. Springer. 2013, pp. 218–233

work page 2013
[48]

Beyond average return in Markov de- cision processes

Alexandre Marthe, Aur´ elien Garivier, and Claire Vernade. “Beyond average return in Markov de- cision processes”. In:Advances in Neural Information Processing Systems36 (2023), pp. 56488– 56507

work page 2023
[49]

Efficient Risk-sensitive Planning via Entropic Risk Measures

Alexandre Marthe et al. “Efficient Risk-sensitive Planning via Entropic Risk Measures”. In:arXiv preprint arXiv:2502.20423(2025)

work page arXiv 2025
[50]

Risk-sensitive reinforcement learning

Oliver Mihatsch and Ralph Neuneier. “Risk-sensitive reinforcement learning”. In:Machine learning 49.2 (2002), pp. 267–290

work page 2002
[51]

A policy gradient algorithm for the risk-sensitive exponential cost mdp

Mehrdad Moharrami et al. “A policy gradient algorithm for the risk-sensitive exponential cost mdp”. In:Mathematics of operations research50.1 (2025), pp. 431–458

work page 2025
[52]

Recursive Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

Oliver Mortensen and Mohammad Sadegh Talebi. “Recursive Entropic Risk Optimization in Dis- counted MDPs: Sample Complexity Bounds with a Generative Model”. In:arXiv preprint arXiv:2506.00286 (2025). 14

work page internal anchor Pith review Pith/arXiv arXiv 2025
[53]

Risk-sensitive reinforcement learning via Entropic-VaR optimization

Xinyi Ni and Lifeng Lai. “Risk-sensitive reinforcement learning via Entropic-VaR optimization”. In:2022 56th Asilomar Conference on Signals, Systems, and Computers. IEEE. 2022, pp. 953–959

work page 2022
[54]

An approximate solution method for large risk- averse markov decision processes

Marek Petrik and Dharmashankar Subramanian. “An approximate solution method for large risk- averse markov decision processes”. In:Conference on Uncertainty in Artificial Intelligence. 2012

work page 2012
[55]

Bridging offline reinforcement learning and imitation learning: A tale of pessimism

Paria Rashidinejad et al. “Bridging offline reinforcement learning and imitation learning: A tale of pessimism”. In:Advances in Neural Information Processing Systems34 (2021), pp. 11702–11716

work page 2021
[56]

One risk to rule them all: A risk-sensitive perspec- tive on model-based offline reinforcement learning

Marc Rigter, Bruno Lacerda, and Nick Hawes. “One risk to rule them all: A risk-sensitive perspec- tive on model-based offline reinforcement learning”. In:Advances in neural information processing systems36 (2023), pp. 77520–77545

work page 2023
[57]

Risk-aversion in multi-armed bandits

Amir Sani, Alessandro Lazaric, and R´ emi Munos. “Risk-aversion in multi-armed bandits”. In:Ad- vances in neural information processing systems25 (2012)

work page 2012
[58]

Robust portfolio asset allocation and risk measures

Maria Grazia Scutella and Raffaella Recchia. “Robust portfolio asset allocation and risk measures”. In:Annals of Operations Research204.1 (2013), pp. 145–169

work page 2013
[59]

SIAM, 2021

Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski.Lectures on stochastic program- ming: Modeling and theory. SIAM, 2021

work page 2021
[60]

Variance reduced value iteration and faster algorithms for solving Markov de- cision processes

Aaron Sidford et al. “Variance reduced value iteration and faster algorithms for solving Markov de- cision processes”. In:Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. 2018, pp. 770–787

work page 2018
[61]

An upper bound on the loss from approximate optimal-value functions

Satinder P Singh and Richard C Yee. “An upper bound on the loss from approximate optimal-value functions”. In:Machine Learning16.3 (1994), pp. 227–233

work page 1994
[62]

Deep reinforcement learning for optimal portfolio allocation: A comparative study with mean-variance optimization

Srijan Sood et al. “Deep reinforcement learning for optimal portfolio allocation: A comparative study with mean-variance optimization”. In:FinPlan2023.2023 (2023), p. 21

work page arXiv 2023
[63]

An analysis of model-based interval estimation for Markov decision processes

Alexander L Strehl and Michael L Littman. “An analysis of model-based interval estimation for Markov decision processes”. In:Journal of Computer and System Sciences74.8 (2008), pp. 1309– 1331

work page 2008
[64]

Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction. Vol. 1. 1. MIT press Cambridge, 1998

work page 1998
[65]

Policy gradient for coherent risk measures

Aviv Tamar et al. “Policy gradient for coherent risk measures”. In:Advances in neural information processing systems28 (2015)

work page 2015
[66]

A Reductions Approach to Risk-Sensitive Reinforcement Learning with Op- timized Certainty Equivalents

Kaiwen Wang et al. “A Reductions Approach to Risk-Sensitive Reinforcement Learning with Op- timized Certainty Equivalents”. In:Forty-second International Conference on Machine Learning. 2025

work page 2025
[67]

Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time

Mengdi Wang. “Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time”. In:Mathematics of Operations Research45.2 (2020), pp. 517– 546

work page 2020
[68]

Regret bounds for Markov decision processes with re- cursive optimized certainty equivalents

Wenhao Xu, Xuefeng Gao, and Xuedong He. “Regret bounds for Markov decision processes with re- cursive optimized certainty equivalents”. In:International Conference on Machine Learning. PMLR. 2023, pp. 38400–38427

work page 2023
[69]

RA-PbRL: Provably efficient risk-aware preference-based reinforcement learning

Yujie Zhao et al. “RA-PbRL: Provably efficient risk-aware preference-based reinforcement learning”. In:Advances in Neural Information Processing Systems37 (2024), pp. 60835–60871. 15 A Risk Measures In this section, we briefly introduce risk measures. See, e.g., [24] for a good reference that like us model stochastic outcomes as rewards. We here collect s...

work page 2024
[70]

Furthermore we Theorem 8.Assumeγ≥ 1 2,δ < 1 4 andu∈U 1. If there existsp∈( 1 2 ,1)and∆>0such that for any a∈(0, 1−p 5 )and anyz i ∈Zit holds that Q∗ 1(z)−Q ∗ 0(z)≥γα∆ (16) then there exists¯ε(u, Hγ)and constantsc 1, c2 >0such that if the total number of samples is less than T≤ 1 c1 SA∆2p(1−p) ε2 log( SA c2δ ) (17) for any algorithmUand anyε∈(0,¯ε), there ...

work page
[71]

Now by Theorem 9 in [52], we get that P1(B0)≥P 1(B0 ∩ E) =E 1[1 E 1 B0] =E 0 L1 L0 1 E 1 B0 ≥ θ 4 E0[1 E 1 B0] = θ 4 P0(E ∩B 0)≥ θ 8 . Solving fortin θ 8 > δwe find t < p(1−p) 32α2 log( 1 8δ ) = ∆2p(1−p) 512ε2 log( 1 8δ ) Since alsoB 0 ⊂B c 1 we conclude that if the algorithmUtries the state-action pairz i less than eT(ε, δ) := ∆2p(1−p) 512ε2 log( 1 8δ ) ...

work page
[72]

Letθ= exp − 32α2t p(1−p)

and using that the algorithm is (ε, δ)-correct it follows thatP 0(B)≥1−δ≥ 3 4 whereB={π U(si) =a 0}is the event that the algorithm chooses the actiona 0. Letθ= exp − 32α2t p(1−p) . Fix somet∈Nand letkbe the number of transitions tos G inttrials. Finally, we define the eventEas E= pt−k≤ r 2p(1−p) log( 8 2θ ) .(20) From the Chernoff-Hoeffding bound and as s...

work page
[73]

Since this holds for all theAhypothesesH i l , l= 1,2,

From Theorem 9 in [52], we get that P1(B)≥P 1(B∩ E) =E 1[1 B1 E]≥E 0 L1(W) L0(W) 1 E 1 B ≥E 0 θ 4 1 E 1 B = θ 4 P0(E ∩B)≥ θ 8 .(21) Now solving for θ 8 > δ, we see that if t < eT(ε, δ) := ∆2p(1−p) 128ε2 log( 1 8δ ) (22) thenP 1(B)> δand the eventBis containing the event that the algorithm does not choose the optimal actiona l. Since this holds for all the...

work page

[1] [1]

Model-based reinforcement learning with a gen- erative model is minimax optimal

Alekh Agarwal, Sham Kakade, and Lin F Yang. “Model-based reinforcement learning with a gen- erative model is minimax optimal”. In:Conference on Learning Theory. PMLR. 2020, pp. 67–83

work page 2020

[2] [2]

Entropic value-at-risk: A new coherent risk measure

Amir Ahmadi-Javid. “Entropic value-at-risk: A new coherent risk measure”. In:Journal of Opti- mization Theory and Applications155 (2012), pp. 1105–1123

work page 2012

[3] [3]

A note on a new class of recursive utilities in Markov decision processes

Hubert Asienkiewicz and Anna Ja´ skiewicz. “A note on a new class of recursive utilities in Markov decision processes”. In:Applicationes Mathematicae44 (2017), pp. 149–161

work page 2017

[4] [4]

Markov decision processes with recursive risk measures

Nicole B¨ auerle and Alexander Glauner. “Markov decision processes with recursive risk measures”. In:European Journal of Operational Research296.3 (2022), pp. 953–966

work page 2022

[5] [5]

Markov decision processes with risk-sensitive criteria: An overview

Nicole B¨ auerle and Anna Ja´ skiewicz. “Markov decision processes with risk-sensitive criteria: An overview”. In:Mathematical Methods of Operations Research99.1 (2024), pp. 141–178

work page 2024

[6] [6]

Markov decision processes with average-value-at-risk criteria

Nicole B¨ auerle and Jonathan Ott. “Markov decision processes with average-value-at-risk criteria”. In:Mathematical Methods of Operations Research74 (2011), pp. 361–379

work page 2011

[7] [7]

More risk-sensitive Markov decision processes

Nicole B¨ auerle and Ulrich Rieder. “More risk-sensitive Markov decision processes”. In:Mathematics of Operations Research39.1 (2014), pp. 105–120

work page 2014

[8] [8]

An old-new concept of convex risk measures: The optimized certainty equivalent

Aharon Ben-Tal and Marc Teboulle. “An old-new concept of convex risk measures: The optimized certainty equivalent”. In:Mathematical Finance17.3 (2007), pp. 449–476

work page 2007

[9] [9]

Risk-sensitive dynamic asset management

Tomasz R Bielecki and Stanley R Pliska. “Risk-sensitive dynamic asset management”. In:Applied Mathematics and Optimization39 (1999), pp. 337–360

work page 1999

[10] [10]

Risk-averse policy optimization via risk-neutral policy optimization

Lorenzo Bisi et al. “Risk-averse policy optimization via risk-neutral policy optimization”. In:Arti- ficial Intelligence311 (2022), p. 103765

work page 2022

[11] [11]

Q-learning for risk-sensitive control

Vivek S Borkar. “Q-learning for risk-sensitive control”. In:Mathematics of operations research27.2 (2002), pp. 294–311

work page 2002

[12] [12]

Risk-sensitive optimal control for Markov decision processes with monotone cost

Vivek S Borkar and Sean P Meyn. “Risk-sensitive optimal control for Markov decision processes with monotone cost”. In:Mathematics of Operations Research27.1 (2002), pp. 192–209

work page 2002

[13] [13]

Bayesian robust optimization for imitation learn- ing

Daniel Brown, Scott Niekum, and Marek Petrik. “Bayesian robust optimization for imitation learn- ing”. In:Advances in Neural Information Processing Systems33 (2020), pp. 2479–2491

work page 2020

[14] [14]

Provably Efficient Iterated CVaR Reinforcement Learning with Function Approxi- mation and Human Feedback

Yu Chen et al. “Provably Efficient Iterated CVaR Reinforcement Learning with Function Approxi- mation and Human Feedback”. In:The Twelfth International Conference on Learning Representa- tions. 2024

work page 2024

[15] [15]

Algorithms for CVaR optimization in MDPs

Yinlam Chow and Mohammad Ghavamzadeh. “Algorithms for CVaR optimization in MDPs”. In: Advances in neural information processing systems27 (2014). 12

work page 2014

[16] [16]

Percentile optimization for Markov decision processes with param- eter uncertainty

Erick Delage and Shie Mannor. “Percentile optimization for Markov decision processes with param- eter uncertainty”. In:Operations research58.1 (2010), pp. 203–213

work page 2010

[17] [17]

Cambridge University Press, 2002

Michael Alan Howarth Dempster.Risk management: value at risk and beyond. Cambridge University Press, 2002

work page 2002

[18] [18]

Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model

Zilong Deng, Simon Khan, and Shaofeng Zou. “Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model”. In:The 28th International Conference on Artificial Intelligence and Statistics. 2025

work page 2025

[19] [19]

Provably Efficient Risk-Sensitive Reinforcement Learn- ing: Iterated CVaR and Worst Path

Yihan Du, Siwei Wang, and Longbo Huang. “Provably Efficient Risk-Sensitive Reinforcement Learn- ing: Iterated CVaR and Worst Path”. In:The Eleventh International Conference on Learning Rep- resentations. 2023

work page 2023

[20] [20]

Clinical data based optimal STI strategies for HIV: A reinforcement learning approach

Damien Ernst et al. “Clinical data based optimal STI strategies for HIV: A reinforcement learning approach”. In:Proceedings of the 45th IEEE Conference on Decision and Control. IEEE. 2006, pp. 667–672

work page 2006

[21] [21]

Risk-sensitive reinforcement learning with function approximation: A debiasing approach

Yingjie Fei, Zhuoran Yang, and Zhaoran Wang. “Risk-sensitive reinforcement learning with function approximation: A debiasing approach”. In:International Conference on Machine Learning. PMLR. 2021, pp. 3198–3207

work page 2021

[22] [22]

Exponential Bellman equation and improved regret bounds for risk-sensitive re- inforcement learning

Yingjie Fei et al. “Exponential Bellman equation and improved regret bounds for risk-sensitive re- inforcement learning”. In:Advances in neural information processing systems34 (2021), pp. 20436– 20446

work page 2021

[23] [23]

Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in re- gret

Yingjie Fei et al. “Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in re- gret”. In:Advances in Neural Information Processing Systems33 (2020), pp. 22384–22395

work page 2020

[24] [24]

Convex and coherent risk measures

Hans F¨ ollmer and Alexander Schied. “Convex and coherent risk measures”. In:Encyclopedia of Quantitative Finance(2010), pp. 355–363

work page 2010

[25] [25]

Risk-Seeking Reinforcement Learning via Multi-Timescale EVaR Optimization

Deep Kumar Ganguly et al. “Risk-Seeking Reinforcement Learning via Multi-Timescale EVaR Optimization”. In:Transactions on Machine Learning Research()

work page

[26] [26]

Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

Mohammad Gheshlaghi Azar, R´ emi Munos, and Hilbert J Kappen. “Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model”. In:Machine learning91 (2013), pp. 325–349

work page 2013

[27] [27]

Entropic risk optimization in dis- counted MDPs

Jia Lin Hau, Marek Petrik, and Mohammad Ghavamzadeh. “Entropic risk optimization in dis- counted MDPs”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 47–76

work page 2023

[28] [28]

On dynamic programming decompositions of static risk measures in Markov decision processes

Jia Lin Hau et al. “On dynamic programming decompositions of static risk measures in Markov decision processes”. In:Advances in Neural Information Processing Systems36 (2023), pp. 51734– 51757

work page 2023

[29] [29]

Risk-sensitive Markov decision processes

Ronald A Howard and James E Matheson. “Risk-sensitive Markov decision processes”. In:Man- agement science18.7 (1972), pp. 356–369

work page 1972

[30] [30]

A tighter problem-dependent regret bound for risk-sensitive rein- forcement learning

Xiaoyan Hu and Ho-fung Leung. “A tighter problem-dependent regret bound for risk-sensitive rein- forcement learning”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 5411–5437

work page 2023

[31] [31]

Achieving mean–variance efficiency by continuous- time reinforcement learning

Yilie Huang, Yanwei Jia, and Xunyu Zhou. “Achieving mean–variance efficiency by continuous- time reinforcement learning”. In:Proceedings of the Third ACM International Conference on AI in Finance. 2022, pp. 377–385

work page 2022

[32] [32]

A utility criterion for Markov decision processes

Stratton C Jaquette. “A utility criterion for Markov decision processes”. In:Management Science 23.1 (1976), pp. 43–49

work page 1976

[33] [33]

Truncated variance reduced value iteration

Yujia Jin et al. “Truncated variance reduced value iteration”. In:Advances in Neural Information Processing Systems37 (2024), pp. 117481–117508. 13

work page 2024

[34] [34]

University of London, University College London (United Kingdom), 2003

Sham Machandranath Kakade.On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom), 2003

work page 2003

[35] [35]

Risk-aware high-level decisions for automated driving at occluded inter- sections with reinforcement learning

Danial Kamran et al. “Risk-aware high-level decisions for automated driving at occluded inter- sections with reinforcement learning”. In:2020 IEEE Intelligent Vehicles Symposium (IV). IEEE. 2020, pp. 1205–1212

work page 2020

[36] [36]

Finite-sample convergence rates for Q-learning and indirect algorithms

Michael Kearns and Satinder Singh. “Finite-sample convergence rates for Q-learning and indirect algorithms”. In:Advances in neural information processing systems11 (1998)

work page 1998

[37] [37]

A revised approach for risk- averse multi-armed bandits under CVaR criterion

Najakorn Khajonchotpanya, Yilin Xue, and Napat Rujeerapaiboon. “A revised approach for risk- averse multi-armed bandits under CVaR criterion”. In:Operations Research Letters49.4 (2021), pp. 465–472

work page 2021

[38] [38]

Actor-critic algorithms for risk-sensitive MDPs

Prashanth La and Mohammad Ghavamzadeh. “Actor-critic algorithms for risk-sensitive MDPs”. In:Advances in neural information processing systems26 (2013)

work page 2013

[39] [39]

Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation

Thanh Lam et al. “Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation”. In:The Eleventh International Conference on Learning Representations. 2022

work page 2022

[40] [40]

Near-optimal PAC bounds for discounted MDPs

Tor Lattimore and Marcus Hutter. “Near-optimal PAC bounds for discounted MDPs”. In:Theo- retical Computer Science558 (2014), pp. 125–143

work page 2014

[41] [41]

Risk-Averse Constrained Reinforcement Learning with Optimized Certainty Equivalents

Jane H Lee et al. “Risk-Averse Constrained Reinforcement Learning with Optimized Certainty Equivalents”. In:The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025

work page 2025

[42] [42]

Optimal dynamic portfolio selection: Multiperiod mean-variance formulation

Duan Li and Wan-Lung Ng. “Optimal dynamic portfolio selection: Multiperiod mean-variance formulation”. In:Mathematical finance10.3 (2000), pp. 387–406

work page 2000

[43] [43]

Breaking the sample size barrier in model-based reinforcement learning with a generative model

Gen Li et al. “Breaking the sample size barrier in model-based reinforcement learning with a generative model”. In:Advances in neural information processing systems33 (2020), pp. 12861– 12872

work page 2020

[44] [44]

Breaking the sample size barrier in model-based reinforcement learning with a generative model

Gen Li et al. “Breaking the sample size barrier in model-based reinforcement learning with a generative model”. In:Operations Research72.1 (2024), pp. 203–221

work page 2024

[45] [45]

Settling the sample complexity of model-based offline reinforcement learning

Gen Li et al. “Settling the sample complexity of model-based offline reinforcement learning”. In: The Annals of Statistics52.1 (2024), pp. 233–260

work page 2024

[46] [46]

Regret bounds for risk-sensitive reinforcement learning with lips- chitz dynamic risk measures

Hao Liang and Zhiquan Luo. “Regret bounds for risk-sensitive reinforcement learning with lips- chitz dynamic risk measures”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2024, pp. 1774–1782

work page 2024

[47] [47]

Robust risk-averse stochastic multi-armed bandits

Odalric-Ambrym Maillard. “Robust risk-averse stochastic multi-armed bandits”. In:Algorithmic Learning Theory: 24th International Conference, ALT 2013, Singapore, October 6-9, 2013. Pro- ceedings 24. Springer. 2013, pp. 218–233

work page 2013

[48] [48]

Beyond average return in Markov de- cision processes

Alexandre Marthe, Aur´ elien Garivier, and Claire Vernade. “Beyond average return in Markov de- cision processes”. In:Advances in Neural Information Processing Systems36 (2023), pp. 56488– 56507

work page 2023

[49] [49]

Efficient Risk-sensitive Planning via Entropic Risk Measures

Alexandre Marthe et al. “Efficient Risk-sensitive Planning via Entropic Risk Measures”. In:arXiv preprint arXiv:2502.20423(2025)

work page arXiv 2025

[50] [50]

Risk-sensitive reinforcement learning

Oliver Mihatsch and Ralph Neuneier. “Risk-sensitive reinforcement learning”. In:Machine learning 49.2 (2002), pp. 267–290

work page 2002

[51] [51]

A policy gradient algorithm for the risk-sensitive exponential cost mdp

Mehrdad Moharrami et al. “A policy gradient algorithm for the risk-sensitive exponential cost mdp”. In:Mathematics of operations research50.1 (2025), pp. 431–458

work page 2025

[52] [52]

Recursive Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

Oliver Mortensen and Mohammad Sadegh Talebi. “Recursive Entropic Risk Optimization in Dis- counted MDPs: Sample Complexity Bounds with a Generative Model”. In:arXiv preprint arXiv:2506.00286 (2025). 14

work page internal anchor Pith review Pith/arXiv arXiv 2025

[53] [53]

Risk-sensitive reinforcement learning via Entropic-VaR optimization

Xinyi Ni and Lifeng Lai. “Risk-sensitive reinforcement learning via Entropic-VaR optimization”. In:2022 56th Asilomar Conference on Signals, Systems, and Computers. IEEE. 2022, pp. 953–959

work page 2022

[54] [54]

An approximate solution method for large risk- averse markov decision processes

Marek Petrik and Dharmashankar Subramanian. “An approximate solution method for large risk- averse markov decision processes”. In:Conference on Uncertainty in Artificial Intelligence. 2012

work page 2012

[55] [55]

Bridging offline reinforcement learning and imitation learning: A tale of pessimism

Paria Rashidinejad et al. “Bridging offline reinforcement learning and imitation learning: A tale of pessimism”. In:Advances in Neural Information Processing Systems34 (2021), pp. 11702–11716

work page 2021

[56] [56]

One risk to rule them all: A risk-sensitive perspec- tive on model-based offline reinforcement learning

Marc Rigter, Bruno Lacerda, and Nick Hawes. “One risk to rule them all: A risk-sensitive perspec- tive on model-based offline reinforcement learning”. In:Advances in neural information processing systems36 (2023), pp. 77520–77545

work page 2023

[57] [57]

Risk-aversion in multi-armed bandits

Amir Sani, Alessandro Lazaric, and R´ emi Munos. “Risk-aversion in multi-armed bandits”. In:Ad- vances in neural information processing systems25 (2012)

work page 2012

[58] [58]

Robust portfolio asset allocation and risk measures

Maria Grazia Scutella and Raffaella Recchia. “Robust portfolio asset allocation and risk measures”. In:Annals of Operations Research204.1 (2013), pp. 145–169

work page 2013

[59] [59]

SIAM, 2021

Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski.Lectures on stochastic program- ming: Modeling and theory. SIAM, 2021

work page 2021

[60] [60]

Variance reduced value iteration and faster algorithms for solving Markov de- cision processes

Aaron Sidford et al. “Variance reduced value iteration and faster algorithms for solving Markov de- cision processes”. In:Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. 2018, pp. 770–787

work page 2018

[61] [61]

An upper bound on the loss from approximate optimal-value functions

Satinder P Singh and Richard C Yee. “An upper bound on the loss from approximate optimal-value functions”. In:Machine Learning16.3 (1994), pp. 227–233

work page 1994

[62] [62]

Deep reinforcement learning for optimal portfolio allocation: A comparative study with mean-variance optimization

Srijan Sood et al. “Deep reinforcement learning for optimal portfolio allocation: A comparative study with mean-variance optimization”. In:FinPlan2023.2023 (2023), p. 21

work page arXiv 2023

[63] [63]

An analysis of model-based interval estimation for Markov decision processes

Alexander L Strehl and Michael L Littman. “An analysis of model-based interval estimation for Markov decision processes”. In:Journal of Computer and System Sciences74.8 (2008), pp. 1309– 1331

work page 2008

[64] [64]

Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction. Vol. 1. 1. MIT press Cambridge, 1998

work page 1998

[65] [65]

Policy gradient for coherent risk measures

Aviv Tamar et al. “Policy gradient for coherent risk measures”. In:Advances in neural information processing systems28 (2015)

work page 2015

[66] [66]

A Reductions Approach to Risk-Sensitive Reinforcement Learning with Op- timized Certainty Equivalents

Kaiwen Wang et al. “A Reductions Approach to Risk-Sensitive Reinforcement Learning with Op- timized Certainty Equivalents”. In:Forty-second International Conference on Machine Learning. 2025

work page 2025

[67] [67]

Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time

Mengdi Wang. “Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time”. In:Mathematics of Operations Research45.2 (2020), pp. 517– 546

work page 2020

[68] [68]

Regret bounds for Markov decision processes with re- cursive optimized certainty equivalents

Wenhao Xu, Xuefeng Gao, and Xuedong He. “Regret bounds for Markov decision processes with re- cursive optimized certainty equivalents”. In:International Conference on Machine Learning. PMLR. 2023, pp. 38400–38427

work page 2023

[69] [69]

RA-PbRL: Provably efficient risk-aware preference-based reinforcement learning

Yujie Zhao et al. “RA-PbRL: Provably efficient risk-aware preference-based reinforcement learning”. In:Advances in Neural Information Processing Systems37 (2024), pp. 60835–60871. 15 A Risk Measures In this section, we briefly introduce risk measures. See, e.g., [24] for a good reference that like us model stochastic outcomes as rewards. We here collect s...

work page 2024

[70] [70]

Furthermore we Theorem 8.Assumeγ≥ 1 2,δ < 1 4 andu∈U 1. If there existsp∈( 1 2 ,1)and∆>0such that for any a∈(0, 1−p 5 )and anyz i ∈Zit holds that Q∗ 1(z)−Q ∗ 0(z)≥γα∆ (16) then there exists¯ε(u, Hγ)and constantsc 1, c2 >0such that if the total number of samples is less than T≤ 1 c1 SA∆2p(1−p) ε2 log( SA c2δ ) (17) for any algorithmUand anyε∈(0,¯ε), there ...

work page

[71] [71]

Now by Theorem 9 in [52], we get that P1(B0)≥P 1(B0 ∩ E) =E 1[1 E 1 B0] =E 0 L1 L0 1 E 1 B0 ≥ θ 4 E0[1 E 1 B0] = θ 4 P0(E ∩B 0)≥ θ 8 . Solving fortin θ 8 > δwe find t < p(1−p) 32α2 log( 1 8δ ) = ∆2p(1−p) 512ε2 log( 1 8δ ) Since alsoB 0 ⊂B c 1 we conclude that if the algorithmUtries the state-action pairz i less than eT(ε, δ) := ∆2p(1−p) 512ε2 log( 1 8δ ) ...

work page

[72] [72]

Letθ= exp − 32α2t p(1−p)

and using that the algorithm is (ε, δ)-correct it follows thatP 0(B)≥1−δ≥ 3 4 whereB={π U(si) =a 0}is the event that the algorithm chooses the actiona 0. Letθ= exp − 32α2t p(1−p) . Fix somet∈Nand letkbe the number of transitions tos G inttrials. Finally, we define the eventEas E= pt−k≤ r 2p(1−p) log( 8 2θ ) .(20) From the Chernoff-Hoeffding bound and as s...

work page

[73] [73]

Since this holds for all theAhypothesesH i l , l= 1,2,

From Theorem 9 in [52], we get that P1(B)≥P 1(B∩ E) =E 1[1 B1 E]≥E 0 L1(W) L0(W) 1 E 1 B ≥E 0 θ 4 1 E 1 B = θ 4 P0(E ∩B)≥ θ 8 .(21) Now solving for θ 8 > δ, we see that if t < eT(ε, δ) := ∆2p(1−p) 128ε2 log( 1 8δ ) (22) thenP 1(B)> δand the eventBis containing the event that the algorithm does not choose the optimal actiona l. Since this holds for all the...

work page