pith. sign in

arxiv: 2605.21763 · v1 · pith:NZT4L5TYnew · submitted 2026-05-20 · 💻 cs.LG · cs.SY· eess.SY· stat.ML

On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

Pith reviewed 2026-05-22 09:23 UTC · model grok-4.3

classification 💻 cs.LG cs.SYeess.SYstat.ML
keywords risk-sensitive reinforcement learningoptimized certainty equivalentPAC sample complexitydiscounted MDPsgenerative modelCVaRentropic riskvalue function learning
0
0 comments X

The pith

Utility functions with full domain over the reals make optimized certainty equivalent objectives PAC-learnable in discounted MDPs, while restricted domains make them impossible to learn.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies risk-sensitive reinforcement learning in finite discounted MDPs where a generative model provides samples. It gives an exact characterization of which utility functions allow the optimized certainty equivalent objective to be learned to high accuracy with finite samples. This matters because risk-sensitive criteria let agents avoid rare but costly failures, yet only certain versions of them permit reliable learning from data. The authors analyze a model-based method to obtain explicit PAC sample complexity upper bounds and prove matching lower bounds that are tight in the number of state-action pairs.

Core claim

The paper claims that the OCE objective is PAC-learnable if and only if the utility function u has full domain equal to the real numbers; whenever the domain is a proper subset the problem is not PAC-learnable. For utilities that satisfy the domain condition a simple model-based planner yields PAC bounds whose leading term scales with the size of the state-action space. The same analysis produces lower bounds that match this dependence on SA and, for CVaR, improve the dependence on the risk level tau to 1 over tau squared.

What carries the argument

The optimized certainty equivalent (OCE) risk measure, which for a utility function u returns the optimized value of an expectation involving u applied to a random return.

If this is right

  • When the utility has full domain the model-based planner returns an epsilon-optimal value function after a number of samples that scales linearly with SA and polynomially with the effective horizon.
  • No algorithm can achieve PAC guarantees for any utility whose domain excludes some real numbers, regardless of sample budget.
  • For CVaR the sample complexity necessarily grows at least as 1 over tau squared, improving prior dependence on the risk parameter by a factor of 1 over tau.
  • The lower bounds remain tight even when the effective horizon is treated as a separate parameter.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The characterization suggests that entropic risk, which has full domain, is preferable to CVaR when sample efficiency is the primary concern.
  • The same domain condition may govern learnability in risk-sensitive settings that lack an explicit generative model but still admit concentration inequalities.
  • Extending the analysis to infinite state spaces or continuous actions would require new concentration arguments that preserve the domain-based separation.

Load-bearing premise

The analysis assumes a generative model that can produce independent next-state and reward samples for every state-action pair and that the risk-sensitive value function is obtained by recursively applying the OCE operator at every step.

What would settle it

A concrete demonstration that some utility function whose domain is strictly smaller than the reals still permits an algorithm to return an epsilon-optimal OCE value function with probability 1-delta after polynomially many generative samples would falsify the non-learnability claim.

Figures

Figures reproduced from arXiv: 2605.21763 by Mohammad Sadegh Talebi, Oliver Mortensen.

Figure 1
Figure 1. Figure 1: Lower bound building blocks for value learning (a) and policy learning (b) [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗
read the original abstract

We study risk-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available. We consider a family or risk measures called the optimized certainty equivalent (OCE), which includes important risk measures such as entropic risk, CVaR, and mean-variance. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive OCE. We provide an exact characterization of utility functions $u$ for which the corresponding OCE defines an objective that is PAC-learnable. We analyze a simple model-based approach and derive PAC sample complexity bounds. We establish that whenever $u$ does not have full domain $\text{dom}(u)\neq \mathbb{R}$, the corresponding problem is not PAC-learnable. Finally, we establish corresponding lower bounds for both value and policy learning, demonstrating tightness in the size $SA$ of state-action space, and for a more restricted class of utilities, we derive lower bounds that makes the dependence on the effective horizon $\frac{1}{1-\gamma}$ explicit. Specifically, for $\text{CVaR}_\tau$ we show that the correct dependence on $\tau$ is $\frac{1}{\tau^2}$, thus improving by a factor of $\frac{1}{\tau}$ over state-of-the-art although our bound has a suboptimal dependence on $\frac{1}{1-\gamma}$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies risk-sensitive RL in finite discounted MDPs with a generative model, focusing on recursive optimized certainty equivalents (OCE) for risk measures including entropic risk, CVaR, and mean-variance. It claims an exact characterization of utility functions u making the OCE objective PAC-learnable, derives PAC sample complexity upper bounds via a model-based algorithm for both value and policy learning, proves that dom(u) ≠ ℝ implies the problem is not PAC-learnable, and provides matching lower bounds (tight in SA, with explicit 1/(1-γ) dependence for restricted utilities and 1/τ² dependence for CVaR_τ).

Significance. If the characterization, bounds, and non-learnability result hold, the work is significant for precisely delineating when OCE-based risk-sensitive objectives are sample-efficiently learnable in RL. The matching lower bounds, the improvement over prior CVaR dependence on τ, and the negative result for limited-domain utilities would be valuable contributions to understanding limitations of risk measures.

major comments (2)
  1. [Non-learnability result] Non-learnability section: The central claim that dom(u) ≠ ℝ renders the problem not PAC-learnable is load-bearing for the exact characterization. The manuscript must explicitly show that recursive application of OCE can produce effective return distributions with support outside dom(u) for some reachable policies in finite MDPs (even under bounded rewards and discounting), rendering the objective undefined; otherwise the non-learnability direction does not follow.
  2. [Lower bounds] Lower bounds for CVaR: The claimed 1/τ² dependence (improving prior work by a factor of 1/τ) is presented as tight. The information-theoretic construction should be verified to correctly incorporate the recursive OCE nesting rather than reducing to a non-recursive or single-step risk measure.
minor comments (2)
  1. [Abstract and lower bounds section] Clarify the precise class of utilities for which the explicit 1/(1-γ) lower bound is derived, as the abstract refers to 'a more restricted class' without naming it.
  2. [Preliminaries] Notation consistency: Ensure dom(u) and the condition dom(u)≠ℝ are defined and used uniformly when discussing domain restrictions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the major comments point by point below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Non-learnability result] Non-learnability section: The central claim that dom(u) ≠ ℝ renders the problem not PAC-learnable is load-bearing for the exact characterization. The manuscript must explicitly show that recursive application of OCE can produce effective return distributions with support outside dom(u) for some reachable policies in finite MDPs (even under bounded rewards and discounting), rendering the objective undefined; otherwise the non-learnability direction does not follow.

    Authors: We agree that an explicit construction would strengthen the argument. In the revision we will add a concrete finite MDP example (two states, bounded rewards in [0,1], γ=0.9) showing a policy whose recursively computed return distribution has positive probability mass outside dom(u) for a utility with limited domain, even after discounting. This will be placed in the non-learnability section to directly support the claim. revision: yes

  2. Referee: [Lower bounds] Lower bounds for CVaR: The claimed 1/τ² dependence (improving prior work by a factor of 1/τ) is presented as tight. The information-theoretic construction should be verified to correctly incorporate the recursive OCE nesting rather than reducing to a non-recursive or single-step risk measure.

    Authors: Our information-theoretic lower bound is constructed for the fully recursive OCE objective: the hard MDP family uses a chain of states in which CVaR_τ is applied at every step, and the sample lower bound is derived from the propagation of tail estimation error through the nested operators. The 1/τ² term arises precisely because accurate estimation of the conditional tail must be maintained across recursion levels. We will add an explicit paragraph in the proof tracing the recursion to address the verification request. revision: partial

Circularity Check

0 steps flagged

No circularity: theoretical bounds derived independently from model-based analysis and information-theoretic arguments

full rationale

The paper derives an exact characterization of utility functions u making recursive OCE PAC-learnable in finite discounted MDPs, along with sample complexity upper bounds from a model-based algorithm and matching lower bounds. These results rest on explicit assumptions about the generative model, recursive application of OCE to value functions, and standard PAC analysis techniques. No step reduces a claimed prediction or first-principles result to a fitted parameter from the same data, a self-definitional construction, or a load-bearing self-citation chain. The non-learnability claim for dom(u) ≠ ℝ follows from showing the objective can be undefined for some policies, which is an independent argument rather than a renaming or smuggling of prior results. The derivation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The central claims rest on standard finite-MDP assumptions and the recursive definition of OCE; no new entities are postulated and no parameters are fitted to data.

axioms (3)
  • domain assumption The MDP has finite state and action spaces.
    Required for the SA term in the sample-complexity bounds.
  • domain assumption A generative model that returns exact transition samples is available.
    Enables the model-based algorithm analyzed in the abstract.
  • standard math The discount factor satisfies 0 ≤ γ < 1.
    Defines the effective horizon appearing in the bounds.

pith-pipeline@v0.9.0 · 5805 in / 1575 out tokens · 58454 ms · 2026-05-22T09:23:53.042659+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 1 internal anchor

  1. [1]

    Model-based reinforcement learning with a gen- erative model is minimax optimal

    Alekh Agarwal, Sham Kakade, and Lin F Yang. “Model-based reinforcement learning with a gen- erative model is minimax optimal”. In:Conference on Learning Theory. PMLR. 2020, pp. 67–83

  2. [2]

    Entropic value-at-risk: A new coherent risk measure

    Amir Ahmadi-Javid. “Entropic value-at-risk: A new coherent risk measure”. In:Journal of Opti- mization Theory and Applications155 (2012), pp. 1105–1123

  3. [3]

    A note on a new class of recursive utilities in Markov decision processes

    Hubert Asienkiewicz and Anna Ja´ skiewicz. “A note on a new class of recursive utilities in Markov decision processes”. In:Applicationes Mathematicae44 (2017), pp. 149–161

  4. [4]

    Markov decision processes with recursive risk measures

    Nicole B¨ auerle and Alexander Glauner. “Markov decision processes with recursive risk measures”. In:European Journal of Operational Research296.3 (2022), pp. 953–966

  5. [5]

    Markov decision processes with risk-sensitive criteria: An overview

    Nicole B¨ auerle and Anna Ja´ skiewicz. “Markov decision processes with risk-sensitive criteria: An overview”. In:Mathematical Methods of Operations Research99.1 (2024), pp. 141–178

  6. [6]

    Markov decision processes with average-value-at-risk criteria

    Nicole B¨ auerle and Jonathan Ott. “Markov decision processes with average-value-at-risk criteria”. In:Mathematical Methods of Operations Research74 (2011), pp. 361–379

  7. [7]

    More risk-sensitive Markov decision processes

    Nicole B¨ auerle and Ulrich Rieder. “More risk-sensitive Markov decision processes”. In:Mathematics of Operations Research39.1 (2014), pp. 105–120

  8. [8]

    An old-new concept of convex risk measures: The optimized certainty equivalent

    Aharon Ben-Tal and Marc Teboulle. “An old-new concept of convex risk measures: The optimized certainty equivalent”. In:Mathematical Finance17.3 (2007), pp. 449–476

  9. [9]

    Risk-sensitive dynamic asset management

    Tomasz R Bielecki and Stanley R Pliska. “Risk-sensitive dynamic asset management”. In:Applied Mathematics and Optimization39 (1999), pp. 337–360

  10. [10]

    Risk-averse policy optimization via risk-neutral policy optimization

    Lorenzo Bisi et al. “Risk-averse policy optimization via risk-neutral policy optimization”. In:Arti- ficial Intelligence311 (2022), p. 103765

  11. [11]

    Q-learning for risk-sensitive control

    Vivek S Borkar. “Q-learning for risk-sensitive control”. In:Mathematics of operations research27.2 (2002), pp. 294–311

  12. [12]

    Risk-sensitive optimal control for Markov decision processes with monotone cost

    Vivek S Borkar and Sean P Meyn. “Risk-sensitive optimal control for Markov decision processes with monotone cost”. In:Mathematics of Operations Research27.1 (2002), pp. 192–209

  13. [13]

    Bayesian robust optimization for imitation learn- ing

    Daniel Brown, Scott Niekum, and Marek Petrik. “Bayesian robust optimization for imitation learn- ing”. In:Advances in Neural Information Processing Systems33 (2020), pp. 2479–2491

  14. [14]

    Provably Efficient Iterated CVaR Reinforcement Learning with Function Approxi- mation and Human Feedback

    Yu Chen et al. “Provably Efficient Iterated CVaR Reinforcement Learning with Function Approxi- mation and Human Feedback”. In:The Twelfth International Conference on Learning Representa- tions. 2024

  15. [15]

    Algorithms for CVaR optimization in MDPs

    Yinlam Chow and Mohammad Ghavamzadeh. “Algorithms for CVaR optimization in MDPs”. In: Advances in neural information processing systems27 (2014). 12

  16. [16]

    Percentile optimization for Markov decision processes with param- eter uncertainty

    Erick Delage and Shie Mannor. “Percentile optimization for Markov decision processes with param- eter uncertainty”. In:Operations research58.1 (2010), pp. 203–213

  17. [17]

    Cambridge University Press, 2002

    Michael Alan Howarth Dempster.Risk management: value at risk and beyond. Cambridge University Press, 2002

  18. [18]

    Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model

    Zilong Deng, Simon Khan, and Shaofeng Zou. “Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model”. In:The 28th International Conference on Artificial Intelligence and Statistics. 2025

  19. [19]

    Provably Efficient Risk-Sensitive Reinforcement Learn- ing: Iterated CVaR and Worst Path

    Yihan Du, Siwei Wang, and Longbo Huang. “Provably Efficient Risk-Sensitive Reinforcement Learn- ing: Iterated CVaR and Worst Path”. In:The Eleventh International Conference on Learning Rep- resentations. 2023

  20. [20]

    Clinical data based optimal STI strategies for HIV: A reinforcement learning approach

    Damien Ernst et al. “Clinical data based optimal STI strategies for HIV: A reinforcement learning approach”. In:Proceedings of the 45th IEEE Conference on Decision and Control. IEEE. 2006, pp. 667–672

  21. [21]

    Risk-sensitive reinforcement learning with function approximation: A debiasing approach

    Yingjie Fei, Zhuoran Yang, and Zhaoran Wang. “Risk-sensitive reinforcement learning with function approximation: A debiasing approach”. In:International Conference on Machine Learning. PMLR. 2021, pp. 3198–3207

  22. [22]

    Exponential Bellman equation and improved regret bounds for risk-sensitive re- inforcement learning

    Yingjie Fei et al. “Exponential Bellman equation and improved regret bounds for risk-sensitive re- inforcement learning”. In:Advances in neural information processing systems34 (2021), pp. 20436– 20446

  23. [23]

    Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in re- gret

    Yingjie Fei et al. “Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in re- gret”. In:Advances in Neural Information Processing Systems33 (2020), pp. 22384–22395

  24. [24]

    Convex and coherent risk measures

    Hans F¨ ollmer and Alexander Schied. “Convex and coherent risk measures”. In:Encyclopedia of Quantitative Finance(2010), pp. 355–363

  25. [25]

    Risk-Seeking Reinforcement Learning via Multi-Timescale EVaR Optimization

    Deep Kumar Ganguly et al. “Risk-Seeking Reinforcement Learning via Multi-Timescale EVaR Optimization”. In:Transactions on Machine Learning Research()

  26. [26]

    Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

    Mohammad Gheshlaghi Azar, R´ emi Munos, and Hilbert J Kappen. “Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model”. In:Machine learning91 (2013), pp. 325–349

  27. [27]

    Entropic risk optimization in dis- counted MDPs

    Jia Lin Hau, Marek Petrik, and Mohammad Ghavamzadeh. “Entropic risk optimization in dis- counted MDPs”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 47–76

  28. [28]

    On dynamic programming decompositions of static risk measures in Markov decision processes

    Jia Lin Hau et al. “On dynamic programming decompositions of static risk measures in Markov decision processes”. In:Advances in Neural Information Processing Systems36 (2023), pp. 51734– 51757

  29. [29]

    Risk-sensitive Markov decision processes

    Ronald A Howard and James E Matheson. “Risk-sensitive Markov decision processes”. In:Man- agement science18.7 (1972), pp. 356–369

  30. [30]

    A tighter problem-dependent regret bound for risk-sensitive rein- forcement learning

    Xiaoyan Hu and Ho-fung Leung. “A tighter problem-dependent regret bound for risk-sensitive rein- forcement learning”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 5411–5437

  31. [31]

    Achieving mean–variance efficiency by continuous- time reinforcement learning

    Yilie Huang, Yanwei Jia, and Xunyu Zhou. “Achieving mean–variance efficiency by continuous- time reinforcement learning”. In:Proceedings of the Third ACM International Conference on AI in Finance. 2022, pp. 377–385

  32. [32]

    A utility criterion for Markov decision processes

    Stratton C Jaquette. “A utility criterion for Markov decision processes”. In:Management Science 23.1 (1976), pp. 43–49

  33. [33]

    Truncated variance reduced value iteration

    Yujia Jin et al. “Truncated variance reduced value iteration”. In:Advances in Neural Information Processing Systems37 (2024), pp. 117481–117508. 13

  34. [34]

    University of London, University College London (United Kingdom), 2003

    Sham Machandranath Kakade.On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom), 2003

  35. [35]

    Risk-aware high-level decisions for automated driving at occluded inter- sections with reinforcement learning

    Danial Kamran et al. “Risk-aware high-level decisions for automated driving at occluded inter- sections with reinforcement learning”. In:2020 IEEE Intelligent Vehicles Symposium (IV). IEEE. 2020, pp. 1205–1212

  36. [36]

    Finite-sample convergence rates for Q-learning and indirect algorithms

    Michael Kearns and Satinder Singh. “Finite-sample convergence rates for Q-learning and indirect algorithms”. In:Advances in neural information processing systems11 (1998)

  37. [37]

    A revised approach for risk- averse multi-armed bandits under CVaR criterion

    Najakorn Khajonchotpanya, Yilin Xue, and Napat Rujeerapaiboon. “A revised approach for risk- averse multi-armed bandits under CVaR criterion”. In:Operations Research Letters49.4 (2021), pp. 465–472

  38. [38]

    Actor-critic algorithms for risk-sensitive MDPs

    Prashanth La and Mohammad Ghavamzadeh. “Actor-critic algorithms for risk-sensitive MDPs”. In:Advances in neural information processing systems26 (2013)

  39. [39]

    Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation

    Thanh Lam et al. “Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation”. In:The Eleventh International Conference on Learning Representations. 2022

  40. [40]

    Near-optimal PAC bounds for discounted MDPs

    Tor Lattimore and Marcus Hutter. “Near-optimal PAC bounds for discounted MDPs”. In:Theo- retical Computer Science558 (2014), pp. 125–143

  41. [41]

    Risk-Averse Constrained Reinforcement Learning with Optimized Certainty Equivalents

    Jane H Lee et al. “Risk-Averse Constrained Reinforcement Learning with Optimized Certainty Equivalents”. In:The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025

  42. [42]

    Optimal dynamic portfolio selection: Multiperiod mean-variance formulation

    Duan Li and Wan-Lung Ng. “Optimal dynamic portfolio selection: Multiperiod mean-variance formulation”. In:Mathematical finance10.3 (2000), pp. 387–406

  43. [43]

    Breaking the sample size barrier in model-based reinforcement learning with a generative model

    Gen Li et al. “Breaking the sample size barrier in model-based reinforcement learning with a generative model”. In:Advances in neural information processing systems33 (2020), pp. 12861– 12872

  44. [44]

    Breaking the sample size barrier in model-based reinforcement learning with a generative model

    Gen Li et al. “Breaking the sample size barrier in model-based reinforcement learning with a generative model”. In:Operations Research72.1 (2024), pp. 203–221

  45. [45]

    Settling the sample complexity of model-based offline reinforcement learning

    Gen Li et al. “Settling the sample complexity of model-based offline reinforcement learning”. In: The Annals of Statistics52.1 (2024), pp. 233–260

  46. [46]

    Regret bounds for risk-sensitive reinforcement learning with lips- chitz dynamic risk measures

    Hao Liang and Zhiquan Luo. “Regret bounds for risk-sensitive reinforcement learning with lips- chitz dynamic risk measures”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2024, pp. 1774–1782

  47. [47]

    Robust risk-averse stochastic multi-armed bandits

    Odalric-Ambrym Maillard. “Robust risk-averse stochastic multi-armed bandits”. In:Algorithmic Learning Theory: 24th International Conference, ALT 2013, Singapore, October 6-9, 2013. Pro- ceedings 24. Springer. 2013, pp. 218–233

  48. [48]

    Beyond average return in Markov de- cision processes

    Alexandre Marthe, Aur´ elien Garivier, and Claire Vernade. “Beyond average return in Markov de- cision processes”. In:Advances in Neural Information Processing Systems36 (2023), pp. 56488– 56507

  49. [49]

    Efficient Risk-sensitive Planning via Entropic Risk Measures

    Alexandre Marthe et al. “Efficient Risk-sensitive Planning via Entropic Risk Measures”. In:arXiv preprint arXiv:2502.20423(2025)

  50. [50]

    Risk-sensitive reinforcement learning

    Oliver Mihatsch and Ralph Neuneier. “Risk-sensitive reinforcement learning”. In:Machine learning 49.2 (2002), pp. 267–290

  51. [51]

    A policy gradient algorithm for the risk-sensitive exponential cost mdp

    Mehrdad Moharrami et al. “A policy gradient algorithm for the risk-sensitive exponential cost mdp”. In:Mathematics of operations research50.1 (2025), pp. 431–458

  52. [52]

    Recursive Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

    Oliver Mortensen and Mohammad Sadegh Talebi. “Recursive Entropic Risk Optimization in Dis- counted MDPs: Sample Complexity Bounds with a Generative Model”. In:arXiv preprint arXiv:2506.00286 (2025). 14

  53. [53]

    Risk-sensitive reinforcement learning via Entropic-VaR optimization

    Xinyi Ni and Lifeng Lai. “Risk-sensitive reinforcement learning via Entropic-VaR optimization”. In:2022 56th Asilomar Conference on Signals, Systems, and Computers. IEEE. 2022, pp. 953–959

  54. [54]

    An approximate solution method for large risk- averse markov decision processes

    Marek Petrik and Dharmashankar Subramanian. “An approximate solution method for large risk- averse markov decision processes”. In:Conference on Uncertainty in Artificial Intelligence. 2012

  55. [55]

    Bridging offline reinforcement learning and imitation learning: A tale of pessimism

    Paria Rashidinejad et al. “Bridging offline reinforcement learning and imitation learning: A tale of pessimism”. In:Advances in Neural Information Processing Systems34 (2021), pp. 11702–11716

  56. [56]

    One risk to rule them all: A risk-sensitive perspec- tive on model-based offline reinforcement learning

    Marc Rigter, Bruno Lacerda, and Nick Hawes. “One risk to rule them all: A risk-sensitive perspec- tive on model-based offline reinforcement learning”. In:Advances in neural information processing systems36 (2023), pp. 77520–77545

  57. [57]

    Risk-aversion in multi-armed bandits

    Amir Sani, Alessandro Lazaric, and R´ emi Munos. “Risk-aversion in multi-armed bandits”. In:Ad- vances in neural information processing systems25 (2012)

  58. [58]

    Robust portfolio asset allocation and risk measures

    Maria Grazia Scutella and Raffaella Recchia. “Robust portfolio asset allocation and risk measures”. In:Annals of Operations Research204.1 (2013), pp. 145–169

  59. [59]

    SIAM, 2021

    Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski.Lectures on stochastic program- ming: Modeling and theory. SIAM, 2021

  60. [60]

    Variance reduced value iteration and faster algorithms for solving Markov de- cision processes

    Aaron Sidford et al. “Variance reduced value iteration and faster algorithms for solving Markov de- cision processes”. In:Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. 2018, pp. 770–787

  61. [61]

    An upper bound on the loss from approximate optimal-value functions

    Satinder P Singh and Richard C Yee. “An upper bound on the loss from approximate optimal-value functions”. In:Machine Learning16.3 (1994), pp. 227–233

  62. [62]

    Deep reinforcement learning for optimal portfolio allocation: A comparative study with mean-variance optimization

    Srijan Sood et al. “Deep reinforcement learning for optimal portfolio allocation: A comparative study with mean-variance optimization”. In:FinPlan2023.2023 (2023), p. 21

  63. [63]

    An analysis of model-based interval estimation for Markov decision processes

    Alexander L Strehl and Michael L Littman. “An analysis of model-based interval estimation for Markov decision processes”. In:Journal of Computer and System Sciences74.8 (2008), pp. 1309– 1331

  64. [64]

    Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction. Vol. 1. 1. MIT press Cambridge, 1998

  65. [65]

    Policy gradient for coherent risk measures

    Aviv Tamar et al. “Policy gradient for coherent risk measures”. In:Advances in neural information processing systems28 (2015)

  66. [66]

    A Reductions Approach to Risk-Sensitive Reinforcement Learning with Op- timized Certainty Equivalents

    Kaiwen Wang et al. “A Reductions Approach to Risk-Sensitive Reinforcement Learning with Op- timized Certainty Equivalents”. In:Forty-second International Conference on Machine Learning. 2025

  67. [67]

    Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time

    Mengdi Wang. “Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time”. In:Mathematics of Operations Research45.2 (2020), pp. 517– 546

  68. [68]

    Regret bounds for Markov decision processes with re- cursive optimized certainty equivalents

    Wenhao Xu, Xuefeng Gao, and Xuedong He. “Regret bounds for Markov decision processes with re- cursive optimized certainty equivalents”. In:International Conference on Machine Learning. PMLR. 2023, pp. 38400–38427

  69. [69]

    RA-PbRL: Provably efficient risk-aware preference-based reinforcement learning

    Yujie Zhao et al. “RA-PbRL: Provably efficient risk-aware preference-based reinforcement learning”. In:Advances in Neural Information Processing Systems37 (2024), pp. 60835–60871. 15 A Risk Measures In this section, we briefly introduce risk measures. See, e.g., [24] for a good reference that like us model stochastic outcomes as rewards. We here collect s...

  70. [70]

    Furthermore we Theorem 8.Assumeγ≥ 1 2,δ < 1 4 andu∈U 1. If there existsp∈( 1 2 ,1)and∆>0such that for any a∈(0, 1−p 5 )and anyz i ∈Zit holds that Q∗ 1(z)−Q ∗ 0(z)≥γα∆ (16) then there exists¯ε(u, Hγ)and constantsc 1, c2 >0such that if the total number of samples is less than T≤ 1 c1 SA∆2p(1−p) ε2 log( SA c2δ ) (17) for any algorithmUand anyε∈(0,¯ε), there ...

  71. [71]

    Now by Theorem 9 in [52], we get that P1(B0)≥P 1(B0 ∩ E) =E 1[1 E 1 B0] =E 0 L1 L0 1 E 1 B0 ≥ θ 4 E0[1 E 1 B0] = θ 4 P0(E ∩B 0)≥ θ 8 . Solving fortin θ 8 > δwe find t < p(1−p) 32α2 log( 1 8δ ) = ∆2p(1−p) 512ε2 log( 1 8δ ) Since alsoB 0 ⊂B c 1 we conclude that if the algorithmUtries the state-action pairz i less than eT(ε, δ) := ∆2p(1−p) 512ε2 log( 1 8δ ) ...

  72. [72]

    Letθ= exp − 32α2t p(1−p)

    and using that the algorithm is (ε, δ)-correct it follows thatP 0(B)≥1−δ≥ 3 4 whereB={π U(si) =a 0}is the event that the algorithm chooses the actiona 0. Letθ= exp − 32α2t p(1−p) . Fix somet∈Nand letkbe the number of transitions tos G inttrials. Finally, we define the eventEas E= pt−k≤ r 2p(1−p) log( 8 2θ ) .(20) From the Chernoff-Hoeffding bound and as s...

  73. [73]

    Since this holds for all theAhypothesesH i l , l= 1,2,

    From Theorem 9 in [52], we get that P1(B)≥P 1(B∩ E) =E 1[1 B1 E]≥E 0 L1(W) L0(W) 1 E 1 B ≥E 0 θ 4 1 E 1 B = θ 4 P0(E ∩B)≥ θ 8 .(21) Now solving for θ 8 > δ, we see that if t < eT(ε, δ) := ∆2p(1−p) 128ε2 log( 1 8δ ) (22) thenP 1(B)> δand the eventBis containing the event that the algorithm does not choose the optimal actiona l. Since this holds for all the...