On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents
Pith reviewed 2026-05-22 09:23 UTC · model grok-4.3
The pith
Utility functions with full domain over the reals make optimized certainty equivalent objectives PAC-learnable in discounted MDPs, while restricted domains make them impossible to learn.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that the OCE objective is PAC-learnable if and only if the utility function u has full domain equal to the real numbers; whenever the domain is a proper subset the problem is not PAC-learnable. For utilities that satisfy the domain condition a simple model-based planner yields PAC bounds whose leading term scales with the size of the state-action space. The same analysis produces lower bounds that match this dependence on SA and, for CVaR, improve the dependence on the risk level tau to 1 over tau squared.
What carries the argument
The optimized certainty equivalent (OCE) risk measure, which for a utility function u returns the optimized value of an expectation involving u applied to a random return.
If this is right
- When the utility has full domain the model-based planner returns an epsilon-optimal value function after a number of samples that scales linearly with SA and polynomially with the effective horizon.
- No algorithm can achieve PAC guarantees for any utility whose domain excludes some real numbers, regardless of sample budget.
- For CVaR the sample complexity necessarily grows at least as 1 over tau squared, improving prior dependence on the risk parameter by a factor of 1 over tau.
- The lower bounds remain tight even when the effective horizon is treated as a separate parameter.
Where Pith is reading between the lines
- The characterization suggests that entropic risk, which has full domain, is preferable to CVaR when sample efficiency is the primary concern.
- The same domain condition may govern learnability in risk-sensitive settings that lack an explicit generative model but still admit concentration inequalities.
- Extending the analysis to infinite state spaces or continuous actions would require new concentration arguments that preserve the domain-based separation.
Load-bearing premise
The analysis assumes a generative model that can produce independent next-state and reward samples for every state-action pair and that the risk-sensitive value function is obtained by recursively applying the OCE operator at every step.
What would settle it
A concrete demonstration that some utility function whose domain is strictly smaller than the reals still permits an algorithm to return an epsilon-optimal OCE value function with probability 1-delta after polynomially many generative samples would falsify the non-learnability claim.
Figures
read the original abstract
We study risk-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available. We consider a family or risk measures called the optimized certainty equivalent (OCE), which includes important risk measures such as entropic risk, CVaR, and mean-variance. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive OCE. We provide an exact characterization of utility functions $u$ for which the corresponding OCE defines an objective that is PAC-learnable. We analyze a simple model-based approach and derive PAC sample complexity bounds. We establish that whenever $u$ does not have full domain $\text{dom}(u)\neq \mathbb{R}$, the corresponding problem is not PAC-learnable. Finally, we establish corresponding lower bounds for both value and policy learning, demonstrating tightness in the size $SA$ of state-action space, and for a more restricted class of utilities, we derive lower bounds that makes the dependence on the effective horizon $\frac{1}{1-\gamma}$ explicit. Specifically, for $\text{CVaR}_\tau$ we show that the correct dependence on $\tau$ is $\frac{1}{\tau^2}$, thus improving by a factor of $\frac{1}{\tau}$ over state-of-the-art although our bound has a suboptimal dependence on $\frac{1}{1-\gamma}$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies risk-sensitive RL in finite discounted MDPs with a generative model, focusing on recursive optimized certainty equivalents (OCE) for risk measures including entropic risk, CVaR, and mean-variance. It claims an exact characterization of utility functions u making the OCE objective PAC-learnable, derives PAC sample complexity upper bounds via a model-based algorithm for both value and policy learning, proves that dom(u) ≠ ℝ implies the problem is not PAC-learnable, and provides matching lower bounds (tight in SA, with explicit 1/(1-γ) dependence for restricted utilities and 1/τ² dependence for CVaR_τ).
Significance. If the characterization, bounds, and non-learnability result hold, the work is significant for precisely delineating when OCE-based risk-sensitive objectives are sample-efficiently learnable in RL. The matching lower bounds, the improvement over prior CVaR dependence on τ, and the negative result for limited-domain utilities would be valuable contributions to understanding limitations of risk measures.
major comments (2)
- [Non-learnability result] Non-learnability section: The central claim that dom(u) ≠ ℝ renders the problem not PAC-learnable is load-bearing for the exact characterization. The manuscript must explicitly show that recursive application of OCE can produce effective return distributions with support outside dom(u) for some reachable policies in finite MDPs (even under bounded rewards and discounting), rendering the objective undefined; otherwise the non-learnability direction does not follow.
- [Lower bounds] Lower bounds for CVaR: The claimed 1/τ² dependence (improving prior work by a factor of 1/τ) is presented as tight. The information-theoretic construction should be verified to correctly incorporate the recursive OCE nesting rather than reducing to a non-recursive or single-step risk measure.
minor comments (2)
- [Abstract and lower bounds section] Clarify the precise class of utilities for which the explicit 1/(1-γ) lower bound is derived, as the abstract refers to 'a more restricted class' without naming it.
- [Preliminaries] Notation consistency: Ensure dom(u) and the condition dom(u)≠ℝ are defined and used uniformly when discussing domain restrictions.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address the major comments point by point below, indicating planned revisions where appropriate.
read point-by-point responses
-
Referee: [Non-learnability result] Non-learnability section: The central claim that dom(u) ≠ ℝ renders the problem not PAC-learnable is load-bearing for the exact characterization. The manuscript must explicitly show that recursive application of OCE can produce effective return distributions with support outside dom(u) for some reachable policies in finite MDPs (even under bounded rewards and discounting), rendering the objective undefined; otherwise the non-learnability direction does not follow.
Authors: We agree that an explicit construction would strengthen the argument. In the revision we will add a concrete finite MDP example (two states, bounded rewards in [0,1], γ=0.9) showing a policy whose recursively computed return distribution has positive probability mass outside dom(u) for a utility with limited domain, even after discounting. This will be placed in the non-learnability section to directly support the claim. revision: yes
-
Referee: [Lower bounds] Lower bounds for CVaR: The claimed 1/τ² dependence (improving prior work by a factor of 1/τ) is presented as tight. The information-theoretic construction should be verified to correctly incorporate the recursive OCE nesting rather than reducing to a non-recursive or single-step risk measure.
Authors: Our information-theoretic lower bound is constructed for the fully recursive OCE objective: the hard MDP family uses a chain of states in which CVaR_τ is applied at every step, and the sample lower bound is derived from the propagation of tail estimation error through the nested operators. The 1/τ² term arises precisely because accurate estimation of the conditional tail must be maintained across recursion levels. We will add an explicit paragraph in the proof tracing the recursion to address the verification request. revision: partial
Circularity Check
No circularity: theoretical bounds derived independently from model-based analysis and information-theoretic arguments
full rationale
The paper derives an exact characterization of utility functions u making recursive OCE PAC-learnable in finite discounted MDPs, along with sample complexity upper bounds from a model-based algorithm and matching lower bounds. These results rest on explicit assumptions about the generative model, recursive application of OCE to value functions, and standard PAC analysis techniques. No step reduces a claimed prediction or first-principles result to a fitted parameter from the same data, a self-definitional construction, or a load-bearing self-citation chain. The non-learnability claim for dom(u) ≠ ℝ follows from showing the objective can be undefined for some policies, which is an independent argument rather than a renaming or smuggling of prior results. The derivation is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (3)
- domain assumption The MDP has finite state and action spaces.
- domain assumption A generative model that returns exact transition samples is available.
- standard math The discount factor satisfies 0 ≤ γ < 1.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We establish that whenever u does not have full domain dom(u)≠R, the corresponding problem is not PAC-learnable.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We provide an exact characterization of utility functions u for which the corresponding OCE defines an objective that is PAC-learnable.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Model-based reinforcement learning with a gen- erative model is minimax optimal
Alekh Agarwal, Sham Kakade, and Lin F Yang. “Model-based reinforcement learning with a gen- erative model is minimax optimal”. In:Conference on Learning Theory. PMLR. 2020, pp. 67–83
work page 2020
-
[2]
Entropic value-at-risk: A new coherent risk measure
Amir Ahmadi-Javid. “Entropic value-at-risk: A new coherent risk measure”. In:Journal of Opti- mization Theory and Applications155 (2012), pp. 1105–1123
work page 2012
-
[3]
A note on a new class of recursive utilities in Markov decision processes
Hubert Asienkiewicz and Anna Ja´ skiewicz. “A note on a new class of recursive utilities in Markov decision processes”. In:Applicationes Mathematicae44 (2017), pp. 149–161
work page 2017
-
[4]
Markov decision processes with recursive risk measures
Nicole B¨ auerle and Alexander Glauner. “Markov decision processes with recursive risk measures”. In:European Journal of Operational Research296.3 (2022), pp. 953–966
work page 2022
-
[5]
Markov decision processes with risk-sensitive criteria: An overview
Nicole B¨ auerle and Anna Ja´ skiewicz. “Markov decision processes with risk-sensitive criteria: An overview”. In:Mathematical Methods of Operations Research99.1 (2024), pp. 141–178
work page 2024
-
[6]
Markov decision processes with average-value-at-risk criteria
Nicole B¨ auerle and Jonathan Ott. “Markov decision processes with average-value-at-risk criteria”. In:Mathematical Methods of Operations Research74 (2011), pp. 361–379
work page 2011
-
[7]
More risk-sensitive Markov decision processes
Nicole B¨ auerle and Ulrich Rieder. “More risk-sensitive Markov decision processes”. In:Mathematics of Operations Research39.1 (2014), pp. 105–120
work page 2014
-
[8]
An old-new concept of convex risk measures: The optimized certainty equivalent
Aharon Ben-Tal and Marc Teboulle. “An old-new concept of convex risk measures: The optimized certainty equivalent”. In:Mathematical Finance17.3 (2007), pp. 449–476
work page 2007
-
[9]
Risk-sensitive dynamic asset management
Tomasz R Bielecki and Stanley R Pliska. “Risk-sensitive dynamic asset management”. In:Applied Mathematics and Optimization39 (1999), pp. 337–360
work page 1999
-
[10]
Risk-averse policy optimization via risk-neutral policy optimization
Lorenzo Bisi et al. “Risk-averse policy optimization via risk-neutral policy optimization”. In:Arti- ficial Intelligence311 (2022), p. 103765
work page 2022
-
[11]
Q-learning for risk-sensitive control
Vivek S Borkar. “Q-learning for risk-sensitive control”. In:Mathematics of operations research27.2 (2002), pp. 294–311
work page 2002
-
[12]
Risk-sensitive optimal control for Markov decision processes with monotone cost
Vivek S Borkar and Sean P Meyn. “Risk-sensitive optimal control for Markov decision processes with monotone cost”. In:Mathematics of Operations Research27.1 (2002), pp. 192–209
work page 2002
-
[13]
Bayesian robust optimization for imitation learn- ing
Daniel Brown, Scott Niekum, and Marek Petrik. “Bayesian robust optimization for imitation learn- ing”. In:Advances in Neural Information Processing Systems33 (2020), pp. 2479–2491
work page 2020
-
[14]
Yu Chen et al. “Provably Efficient Iterated CVaR Reinforcement Learning with Function Approxi- mation and Human Feedback”. In:The Twelfth International Conference on Learning Representa- tions. 2024
work page 2024
-
[15]
Algorithms for CVaR optimization in MDPs
Yinlam Chow and Mohammad Ghavamzadeh. “Algorithms for CVaR optimization in MDPs”. In: Advances in neural information processing systems27 (2014). 12
work page 2014
-
[16]
Percentile optimization for Markov decision processes with param- eter uncertainty
Erick Delage and Shie Mannor. “Percentile optimization for Markov decision processes with param- eter uncertainty”. In:Operations research58.1 (2010), pp. 203–213
work page 2010
-
[17]
Cambridge University Press, 2002
Michael Alan Howarth Dempster.Risk management: value at risk and beyond. Cambridge University Press, 2002
work page 2002
-
[18]
Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model
Zilong Deng, Simon Khan, and Shaofeng Zou. “Near-Optimal Sample Complexity for Iterated CVaR Reinforcement Learning with a Generative Model”. In:The 28th International Conference on Artificial Intelligence and Statistics. 2025
work page 2025
-
[19]
Provably Efficient Risk-Sensitive Reinforcement Learn- ing: Iterated CVaR and Worst Path
Yihan Du, Siwei Wang, and Longbo Huang. “Provably Efficient Risk-Sensitive Reinforcement Learn- ing: Iterated CVaR and Worst Path”. In:The Eleventh International Conference on Learning Rep- resentations. 2023
work page 2023
-
[20]
Clinical data based optimal STI strategies for HIV: A reinforcement learning approach
Damien Ernst et al. “Clinical data based optimal STI strategies for HIV: A reinforcement learning approach”. In:Proceedings of the 45th IEEE Conference on Decision and Control. IEEE. 2006, pp. 667–672
work page 2006
-
[21]
Risk-sensitive reinforcement learning with function approximation: A debiasing approach
Yingjie Fei, Zhuoran Yang, and Zhaoran Wang. “Risk-sensitive reinforcement learning with function approximation: A debiasing approach”. In:International Conference on Machine Learning. PMLR. 2021, pp. 3198–3207
work page 2021
-
[22]
Exponential Bellman equation and improved regret bounds for risk-sensitive re- inforcement learning
Yingjie Fei et al. “Exponential Bellman equation and improved regret bounds for risk-sensitive re- inforcement learning”. In:Advances in neural information processing systems34 (2021), pp. 20436– 20446
work page 2021
-
[23]
Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in re- gret
Yingjie Fei et al. “Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in re- gret”. In:Advances in Neural Information Processing Systems33 (2020), pp. 22384–22395
work page 2020
-
[24]
Convex and coherent risk measures
Hans F¨ ollmer and Alexander Schied. “Convex and coherent risk measures”. In:Encyclopedia of Quantitative Finance(2010), pp. 355–363
work page 2010
-
[25]
Risk-Seeking Reinforcement Learning via Multi-Timescale EVaR Optimization
Deep Kumar Ganguly et al. “Risk-Seeking Reinforcement Learning via Multi-Timescale EVaR Optimization”. In:Transactions on Machine Learning Research()
-
[26]
Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model
Mohammad Gheshlaghi Azar, R´ emi Munos, and Hilbert J Kappen. “Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model”. In:Machine learning91 (2013), pp. 325–349
work page 2013
-
[27]
Entropic risk optimization in dis- counted MDPs
Jia Lin Hau, Marek Petrik, and Mohammad Ghavamzadeh. “Entropic risk optimization in dis- counted MDPs”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 47–76
work page 2023
-
[28]
On dynamic programming decompositions of static risk measures in Markov decision processes
Jia Lin Hau et al. “On dynamic programming decompositions of static risk measures in Markov decision processes”. In:Advances in Neural Information Processing Systems36 (2023), pp. 51734– 51757
work page 2023
-
[29]
Risk-sensitive Markov decision processes
Ronald A Howard and James E Matheson. “Risk-sensitive Markov decision processes”. In:Man- agement science18.7 (1972), pp. 356–369
work page 1972
-
[30]
A tighter problem-dependent regret bound for risk-sensitive rein- forcement learning
Xiaoyan Hu and Ho-fung Leung. “A tighter problem-dependent regret bound for risk-sensitive rein- forcement learning”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2023, pp. 5411–5437
work page 2023
-
[31]
Achieving mean–variance efficiency by continuous- time reinforcement learning
Yilie Huang, Yanwei Jia, and Xunyu Zhou. “Achieving mean–variance efficiency by continuous- time reinforcement learning”. In:Proceedings of the Third ACM International Conference on AI in Finance. 2022, pp. 377–385
work page 2022
-
[32]
A utility criterion for Markov decision processes
Stratton C Jaquette. “A utility criterion for Markov decision processes”. In:Management Science 23.1 (1976), pp. 43–49
work page 1976
-
[33]
Truncated variance reduced value iteration
Yujia Jin et al. “Truncated variance reduced value iteration”. In:Advances in Neural Information Processing Systems37 (2024), pp. 117481–117508. 13
work page 2024
-
[34]
University of London, University College London (United Kingdom), 2003
Sham Machandranath Kakade.On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom), 2003
work page 2003
-
[35]
Danial Kamran et al. “Risk-aware high-level decisions for automated driving at occluded inter- sections with reinforcement learning”. In:2020 IEEE Intelligent Vehicles Symposium (IV). IEEE. 2020, pp. 1205–1212
work page 2020
-
[36]
Finite-sample convergence rates for Q-learning and indirect algorithms
Michael Kearns and Satinder Singh. “Finite-sample convergence rates for Q-learning and indirect algorithms”. In:Advances in neural information processing systems11 (1998)
work page 1998
-
[37]
A revised approach for risk- averse multi-armed bandits under CVaR criterion
Najakorn Khajonchotpanya, Yilin Xue, and Napat Rujeerapaiboon. “A revised approach for risk- averse multi-armed bandits under CVaR criterion”. In:Operations Research Letters49.4 (2021), pp. 465–472
work page 2021
-
[38]
Actor-critic algorithms for risk-sensitive MDPs
Prashanth La and Mohammad Ghavamzadeh. “Actor-critic algorithms for risk-sensitive MDPs”. In:Advances in neural information processing systems26 (2013)
work page 2013
-
[39]
Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation
Thanh Lam et al. “Risk-aware reinforcement learning with coherent risk measures and non-linear function approximation”. In:The Eleventh International Conference on Learning Representations. 2022
work page 2022
-
[40]
Near-optimal PAC bounds for discounted MDPs
Tor Lattimore and Marcus Hutter. “Near-optimal PAC bounds for discounted MDPs”. In:Theo- retical Computer Science558 (2014), pp. 125–143
work page 2014
-
[41]
Risk-Averse Constrained Reinforcement Learning with Optimized Certainty Equivalents
Jane H Lee et al. “Risk-Averse Constrained Reinforcement Learning with Optimized Certainty Equivalents”. In:The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025
work page 2025
-
[42]
Optimal dynamic portfolio selection: Multiperiod mean-variance formulation
Duan Li and Wan-Lung Ng. “Optimal dynamic portfolio selection: Multiperiod mean-variance formulation”. In:Mathematical finance10.3 (2000), pp. 387–406
work page 2000
-
[43]
Breaking the sample size barrier in model-based reinforcement learning with a generative model
Gen Li et al. “Breaking the sample size barrier in model-based reinforcement learning with a generative model”. In:Advances in neural information processing systems33 (2020), pp. 12861– 12872
work page 2020
-
[44]
Breaking the sample size barrier in model-based reinforcement learning with a generative model
Gen Li et al. “Breaking the sample size barrier in model-based reinforcement learning with a generative model”. In:Operations Research72.1 (2024), pp. 203–221
work page 2024
-
[45]
Settling the sample complexity of model-based offline reinforcement learning
Gen Li et al. “Settling the sample complexity of model-based offline reinforcement learning”. In: The Annals of Statistics52.1 (2024), pp. 233–260
work page 2024
-
[46]
Regret bounds for risk-sensitive reinforcement learning with lips- chitz dynamic risk measures
Hao Liang and Zhiquan Luo. “Regret bounds for risk-sensitive reinforcement learning with lips- chitz dynamic risk measures”. In:International Conference on Artificial Intelligence and Statistics. PMLR. 2024, pp. 1774–1782
work page 2024
-
[47]
Robust risk-averse stochastic multi-armed bandits
Odalric-Ambrym Maillard. “Robust risk-averse stochastic multi-armed bandits”. In:Algorithmic Learning Theory: 24th International Conference, ALT 2013, Singapore, October 6-9, 2013. Pro- ceedings 24. Springer. 2013, pp. 218–233
work page 2013
-
[48]
Beyond average return in Markov de- cision processes
Alexandre Marthe, Aur´ elien Garivier, and Claire Vernade. “Beyond average return in Markov de- cision processes”. In:Advances in Neural Information Processing Systems36 (2023), pp. 56488– 56507
work page 2023
-
[49]
Efficient Risk-sensitive Planning via Entropic Risk Measures
Alexandre Marthe et al. “Efficient Risk-sensitive Planning via Entropic Risk Measures”. In:arXiv preprint arXiv:2502.20423(2025)
-
[50]
Risk-sensitive reinforcement learning
Oliver Mihatsch and Ralph Neuneier. “Risk-sensitive reinforcement learning”. In:Machine learning 49.2 (2002), pp. 267–290
work page 2002
-
[51]
A policy gradient algorithm for the risk-sensitive exponential cost mdp
Mehrdad Moharrami et al. “A policy gradient algorithm for the risk-sensitive exponential cost mdp”. In:Mathematics of operations research50.1 (2025), pp. 431–458
work page 2025
-
[52]
Oliver Mortensen and Mohammad Sadegh Talebi. “Recursive Entropic Risk Optimization in Dis- counted MDPs: Sample Complexity Bounds with a Generative Model”. In:arXiv preprint arXiv:2506.00286 (2025). 14
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[53]
Risk-sensitive reinforcement learning via Entropic-VaR optimization
Xinyi Ni and Lifeng Lai. “Risk-sensitive reinforcement learning via Entropic-VaR optimization”. In:2022 56th Asilomar Conference on Signals, Systems, and Computers. IEEE. 2022, pp. 953–959
work page 2022
-
[54]
An approximate solution method for large risk- averse markov decision processes
Marek Petrik and Dharmashankar Subramanian. “An approximate solution method for large risk- averse markov decision processes”. In:Conference on Uncertainty in Artificial Intelligence. 2012
work page 2012
-
[55]
Bridging offline reinforcement learning and imitation learning: A tale of pessimism
Paria Rashidinejad et al. “Bridging offline reinforcement learning and imitation learning: A tale of pessimism”. In:Advances in Neural Information Processing Systems34 (2021), pp. 11702–11716
work page 2021
-
[56]
Marc Rigter, Bruno Lacerda, and Nick Hawes. “One risk to rule them all: A risk-sensitive perspec- tive on model-based offline reinforcement learning”. In:Advances in neural information processing systems36 (2023), pp. 77520–77545
work page 2023
-
[57]
Risk-aversion in multi-armed bandits
Amir Sani, Alessandro Lazaric, and R´ emi Munos. “Risk-aversion in multi-armed bandits”. In:Ad- vances in neural information processing systems25 (2012)
work page 2012
-
[58]
Robust portfolio asset allocation and risk measures
Maria Grazia Scutella and Raffaella Recchia. “Robust portfolio asset allocation and risk measures”. In:Annals of Operations Research204.1 (2013), pp. 145–169
work page 2013
-
[59]
Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski.Lectures on stochastic program- ming: Modeling and theory. SIAM, 2021
work page 2021
-
[60]
Variance reduced value iteration and faster algorithms for solving Markov de- cision processes
Aaron Sidford et al. “Variance reduced value iteration and faster algorithms for solving Markov de- cision processes”. In:Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. 2018, pp. 770–787
work page 2018
-
[61]
An upper bound on the loss from approximate optimal-value functions
Satinder P Singh and Richard C Yee. “An upper bound on the loss from approximate optimal-value functions”. In:Machine Learning16.3 (1994), pp. 227–233
work page 1994
-
[62]
Srijan Sood et al. “Deep reinforcement learning for optimal portfolio allocation: A comparative study with mean-variance optimization”. In:FinPlan2023.2023 (2023), p. 21
-
[63]
An analysis of model-based interval estimation for Markov decision processes
Alexander L Strehl and Michael L Littman. “An analysis of model-based interval estimation for Markov decision processes”. In:Journal of Computer and System Sciences74.8 (2008), pp. 1309– 1331
work page 2008
-
[64]
Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction. Vol. 1. 1. MIT press Cambridge, 1998
work page 1998
-
[65]
Policy gradient for coherent risk measures
Aviv Tamar et al. “Policy gradient for coherent risk measures”. In:Advances in neural information processing systems28 (2015)
work page 2015
-
[66]
Kaiwen Wang et al. “A Reductions Approach to Risk-Sensitive Reinforcement Learning with Op- timized Certainty Equivalents”. In:Forty-second International Conference on Machine Learning. 2025
work page 2025
-
[67]
Mengdi Wang. “Randomized linear programming solves the Markov decision problem in nearly linear (sometimes sublinear) time”. In:Mathematics of Operations Research45.2 (2020), pp. 517– 546
work page 2020
-
[68]
Regret bounds for Markov decision processes with re- cursive optimized certainty equivalents
Wenhao Xu, Xuefeng Gao, and Xuedong He. “Regret bounds for Markov decision processes with re- cursive optimized certainty equivalents”. In:International Conference on Machine Learning. PMLR. 2023, pp. 38400–38427
work page 2023
-
[69]
RA-PbRL: Provably efficient risk-aware preference-based reinforcement learning
Yujie Zhao et al. “RA-PbRL: Provably efficient risk-aware preference-based reinforcement learning”. In:Advances in Neural Information Processing Systems37 (2024), pp. 60835–60871. 15 A Risk Measures In this section, we briefly introduce risk measures. See, e.g., [24] for a good reference that like us model stochastic outcomes as rewards. We here collect s...
work page 2024
-
[70]
Furthermore we Theorem 8.Assumeγ≥ 1 2,δ < 1 4 andu∈U 1. If there existsp∈( 1 2 ,1)and∆>0such that for any a∈(0, 1−p 5 )and anyz i ∈Zit holds that Q∗ 1(z)−Q ∗ 0(z)≥γα∆ (16) then there exists¯ε(u, Hγ)and constantsc 1, c2 >0such that if the total number of samples is less than T≤ 1 c1 SA∆2p(1−p) ε2 log( SA c2δ ) (17) for any algorithmUand anyε∈(0,¯ε), there ...
-
[71]
Now by Theorem 9 in [52], we get that P1(B0)≥P 1(B0 ∩ E) =E 1[1 E 1 B0] =E 0 L1 L0 1 E 1 B0 ≥ θ 4 E0[1 E 1 B0] = θ 4 P0(E ∩B 0)≥ θ 8 . Solving fortin θ 8 > δwe find t < p(1−p) 32α2 log( 1 8δ ) = ∆2p(1−p) 512ε2 log( 1 8δ ) Since alsoB 0 ⊂B c 1 we conclude that if the algorithmUtries the state-action pairz i less than eT(ε, δ) := ∆2p(1−p) 512ε2 log( 1 8δ ) ...
-
[72]
and using that the algorithm is (ε, δ)-correct it follows thatP 0(B)≥1−δ≥ 3 4 whereB={π U(si) =a 0}is the event that the algorithm chooses the actiona 0. Letθ= exp − 32α2t p(1−p) . Fix somet∈Nand letkbe the number of transitions tos G inttrials. Finally, we define the eventEas E= pt−k≤ r 2p(1−p) log( 8 2θ ) .(20) From the Chernoff-Hoeffding bound and as s...
-
[73]
Since this holds for all theAhypothesesH i l , l= 1,2,
From Theorem 9 in [52], we get that P1(B)≥P 1(B∩ E) =E 1[1 B1 E]≥E 0 L1(W) L0(W) 1 E 1 B ≥E 0 θ 4 1 E 1 B = θ 4 P0(E ∩B)≥ θ 8 .(21) Now solving for θ 8 > δ, we see that if t < eT(ε, δ) := ∆2p(1−p) 128ε2 log( 1 8δ ) (22) thenP 1(B)> δand the eventBis containing the event that the algorithm does not choose the optimal actiona l. Since this holds for all the...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.