On the Fundamental Limitations of Dual Static CVaR Decompositions in Markov Decision Processes

Audrey Durand; Mathieu Godbout

arxiv: 2507.14005 · v2 · submitted 2025-07-18 · 💻 cs.LG

On the Fundamental Limitations of Dual Static CVaR Decompositions in Markov Decision Processes

Mathieu Godbout , Audrey Durand This is my paper

Pith reviewed 2026-05-19 03:33 UTC · model grok-4.3

classification 💻 cs.LG

keywords CVaRMarkov decision processesrisk measurespolicy evaluationdual decompositiondynamic programmingconsistency constraints

0 comments

The pith

No single policy can be optimal for all risk levels under dual static CVaR in some MDPs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes static CVaR policy evaluation in MDPs as two separate minimization problems whose solutions agree only when risk-assignment consistency constraints are satisfied. An empty intersection of those constraints produces a nonzero CVaR evaluation gap that explains the failures of dual-based dynamic programming methods. The authors use this perspective to construct an explicit MDP in which the dual decomposition admits no single policy that is optimal across every initial risk level.

Core claim

Static CVaR evaluation decomposes into two distinct minimization problems; risk-assignment consistency constraints must hold for the problems to return matching values. When the constraints have empty intersection, a CVaR evaluation gap appears. Policies returned by dual CVaR dynamic programming inherit this gap, and the same constraint view proves that, in at least one MDP, no policy is simultaneously optimal for all initial risk levels.

What carries the argument

risk-assignment consistency constraints, which must be jointly satisfied by the two minimization problems for static CVaR evaluation to be consistent

If this is right

Dual-based dynamic programming returns policies whose CVaR evaluation gap is positive.
The size of the evaluation gap directly measures the discrepancy between the two minimization problems.
Uniform optimality over all initial risk levels is impossible for the dual formulation in certain MDPs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Alternative formulations that avoid splitting the CVaR objective into two separate problems may be required for reliable optimization.
The concrete MDP example supplies a minimal test case for checking whether any proposed fix restores uniform optimality.
The same inconsistency mechanism could appear in other risk measures whose dual representations rely on similar decompositions.

Load-bearing premise

The original CVaR objective is exactly and completely captured by the pair of dual minimization problems, so that agreement between them requires the risk-assignment consistency constraints.

What would settle it

Exhibit a single policy that is optimal for every initial risk level inside the MDP constructed by the paper.

Figures

Figures reproduced from arXiv: 2507.14005 by Audrey Durand, Mathieu Godbout.

**Figure 2.** Figure 2: Visual representation of the risk-assignment constraints on the MDP from Figure 1, with [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Evolution of the static CVaR evaluation of all possible policies on the MDP presented in [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: Relation between initial risk-level α and the corresponding risk-level Y(s1) for all three possible policies on the MDP presented in [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗

read the original abstract

It was recently shown that dynamic programming (DP) methods for finding static CVaR-optimal policies in Markov Decision Processes (MDPs) can fail when based on the dual formulation, yet the root cause of this failure remains unclear. We expand on these findings by shifting focus from policy optimization to the seemingly simpler task of policy evaluation. We show that evaluating the static CVaR of a given policy can be framed as two distinct minimization problems. We introduce a set of ``risk-assignment consistency constraints'' that must be satisfied for their solutions to match and we demonstrate that an empty intersection of these constraints is the source of previously observed evaluation errors. Quantifying the evaluation error as the \emph{CVaR evaluation gap}, we demonstrate that the issues observed when optimizing over the dual-based CVaR DP are explained by the returned policy having a non-zero CVaR evaluation gap. Finally, we leverage our proposed risk-assignment constraints perspective to prove that the search for a single, uniformly optimal policy on the dual CVaR decomposition is fundamentally limited, identifying an MDP where no single policy can be optimal across all initial risk levels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Dual CVaR decompositions run into a structural limit from empty risk-assignment constraint intersections, shown by a concrete MDP with no uniformly optimal policy.

read the letter

The one or two things to know: this paper claims that looking for a single uniformly optimal policy under dual CVaR decomposition is fundamentally limited, because they can find an MDP where no policy is best for every initial risk level. They trace this to risk-assignment consistency constraints whose intersection can be empty. They do well by moving the focus to policy evaluation and showing how two minimization problems for static CVaR only agree if those constraints are satisfied. Introducing the CVaR evaluation gap as a way to measure the mismatch is a nice way to connect the dots to why dual DP methods were failing before. The explicit construction of the MDP counterexample gives something tangible to check against. A soft spot worth checking is whether the two minimization problems they use really form a complete dual decomposition. If the original CVaR objective has more to it or other ways to decompose it, then the consistency constraints might not fully explain the limitation or the evaluation errors. The proof via empty intersection sounds reasonable from the description, but the details in the paper would need to confirm that the example MDP indeed has no uniformly optimal policy across risk levels. This paper is for folks in risk-sensitive reinforcement learning who work with CVaR and dual methods. A reader who wants to understand the structural problems in these approaches and see a new angle with constraints and a counterexample will find it worthwhile. It shows clear engagement with the literature and has an independent counterexample, so it deserves a serious referee. I recommend sending it out for peer review.

Referee Report

2 major / 2 minor

Summary. The paper claims that static CVaR policy evaluation in MDPs can be expressed as two distinct minimization problems whose solutions coincide only when a set of risk-assignment consistency constraints has non-empty intersection. An empty intersection produces a CVaR evaluation gap that accounts for observed failures in dual-based dynamic programming. The authors use this perspective to prove that no single policy can be uniformly optimal across all initial risk levels, supported by a concrete MDP counterexample.

Significance. If the decomposition is complete and the counterexample holds, the work supplies a precise, falsifiable explanation for why dual CVaR methods cannot yield uniformly optimal policies and shifts attention from optimization failures to an underlying evaluation inconsistency. The explicit construction of an MDP where the constraints intersect emptily is a concrete strength that grounds the fundamental-limitation claim.

major comments (2)

[Section introducing the risk-assignment consistency constraints] Section introducing the risk-assignment consistency constraints: the manuscript states that the two minimization problems constitute the dual decomposition of static CVaR and that their solutions must coincide precisely when the constraints are satisfied. A more explicit derivation from the original CVaR definition (showing why no additional dual variables or alternative decompositions are possible) would strengthen the claim that empty intersection is the sole source of the evaluation gap and the uniform-optimality limitation.
[MDP counterexample] MDP counterexample (the section presenting the concrete MDP): the claim that no single policy is optimal for all initial risk levels rests on demonstrating both an empty intersection and that the resulting optimal policies differ across risk levels. Explicit verification that the constructed transition and reward structure produces this emptiness, together with the numerical or symbolic values of the two minimization problems, is needed to confirm the example is not an artifact of the particular risk levels chosen.

minor comments (2)

[Abstract and introduction] The term 'CVaR evaluation gap' is used throughout but is not formally defined until after the constraints are introduced; a brief forward reference or one-sentence definition in the abstract and introduction would improve readability.
[Notation and definitions] Notation for the two minimization problems (e.g., the variables representing risk assignments) should be introduced with a short table or explicit mapping to the original CVaR formulation to avoid ambiguity when the constraints are stated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. The suggested clarifications will improve the rigor of our presentation regarding the dual decomposition and the counterexample. We respond to each major comment below.

read point-by-point responses

Referee: [Section introducing the risk-assignment consistency constraints] Section introducing the risk-assignment consistency constraints: the manuscript states that the two minimization problems constitute the dual decomposition of static CVaR and that their solutions must coincide precisely when the constraints are satisfied. A more explicit derivation from the original CVaR definition (showing why no additional dual variables or alternative decompositions are possible) would strengthen the claim that empty intersection is the sole source of the evaluation gap and the uniform-optimality limitation.

Authors: We agree that a more explicit derivation from the original CVaR definition would strengthen the presentation. In the revised manuscript we will add a dedicated derivation subsection that starts from the static CVaR definition, obtains the two minimization problems, and shows why the risk-assignment consistency constraints are the precise conditions for their solutions to coincide. The derivation will also clarify that the chosen decomposition is canonical for static CVaR and that alternative dual variables would alter the semantics of the original risk measure. revision: yes
Referee: [MDP counterexample] MDP counterexample (the section presenting the concrete MDP): the claim that no single policy is optimal for all initial risk levels rests on demonstrating both an empty intersection and that the resulting optimal policies differ across risk levels. Explicit verification that the constructed transition and reward structure produces this emptiness, together with the numerical or symbolic values of the two minimization problems, is needed to confirm the example is not an artifact of the particular risk levels chosen.

Authors: We thank the referee for highlighting the need for explicit verification. In the revised manuscript we will augment the counterexample section with the explicit computation of the risk-assignment consistency constraints for the given transition and reward structure, demonstrating that their intersection is empty. We will also report the numerical (or symbolic) values attained by each of the two minimization problems at representative risk levels, confirming that the resulting optimal policies differ and that the emptiness is independent of the specific risk levels selected. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation self-contained via definitions and counterexample MDP

full rationale

The paper frames static CVaR evaluation as two minimization problems, introduces risk-assignment consistency constraints whose empty intersection explains evaluation gaps, and constructs an explicit MDP to prove no single policy is uniformly optimal across risk levels. This chain rests on the definitions of static CVaR, standard MDP transition and reward structure, and direct verification in the counterexample; no equation reduces to a fitted parameter renamed as prediction, no load-bearing premise collapses to a self-citation, and the uniqueness claim is established by exhibiting a concrete MDP rather than by imported theorem or ansatz. The argument is therefore independent of its own outputs and does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The work relies on standard MDP assumptions (finite states/actions, stationary policies) and prior definitions of static CVaR; no new free parameters or invented entities beyond the introduced constraints are apparent from the abstract.

axioms (1)

domain assumption MDPs are finite-state, finite-action, and the static CVaR is defined via the dual formulation as in prior work.
Invoked to frame the two minimization problems and consistency constraints.

invented entities (2)

risk-assignment consistency constraints no independent evidence
purpose: To ensure the two minimization problems yield matching CVaR values.
New concept introduced to diagnose the evaluation gap.
CVaR evaluation gap no independent evidence
purpose: To quantify the discrepancy between the two minimization problems.
Introduced as a diagnostic quantity for the observed errors.

pith-pipeline@v0.9.0 · 5728 in / 1381 out tokens · 40946 ms · 2026-05-19T03:33:51.615654+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

[1]

Coherent measures of risk

Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent measures of risk. Mathematical Finance, 9 0 (3): 0 203--228, 1999

work page 1999
[2]

Path planning for automation of surgery robot based on probabilistic roadmap and reinforcement learning

Donghoon Baek, Minho Hwang, Hansoul Kim, and Dong-Soo Kwon. Path planning for automation of surgery robot based on probabilistic roadmap and reinforcement learning. In 2018 15th International Conference on Ubiquitous Robots (UR), pages 342--347, 2018

work page 2018
[3]

Minimum capital requirements for market risk

Basel Committee on Banking Supervision . Minimum capital requirements for market risk. In Basel III: International Regulatory Framework for Banks. Bank for International Settlements, 2019

work page 2019
[4]

Markov decision processes with average-value-at-risk criteria

Nicole B \"a uerle and Jonathan Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74 0 (3): 0 361--379, 2011

work page 2011
[5]

Filar, Yuanlie Lin, and Lieneke Spanjers

Kang Boda, Jerzy A. Filar, Yuanlie Lin, and Lieneke Spanjers. Stochastic target hitting time and the problem of early retirement. IEEE Transactions on Automatic Control, 49 0 (3): 0 409--419, 2004

work page 2004
[6]

Chapman, Jonathan Lacotte, Aviv Tamar, Donggun Lee, Kevin M

Margaret P. Chapman, Jonathan Lacotte, Aviv Tamar, Donggun Lee, Kevin M. Smith, Victoria Cheng, Jaime F. Fisac, Susmit Jha, Marco Pavone, and Claire J. Tomlin. A risk-sensitive finite-time reachability approach for safety of stochastic dynamic systems. In American Control Conference (ACC), pages 2958--2963, 2019

work page 2019
[7]

Chapman, Riccardo Bonalli, Kevin M

Margaret P. Chapman, Riccardo Bonalli, Kevin M. Smith, Insoon Yang, Marco Pavone, and Claire J. Tomlin. Risk-sensitive safety analysis using conditional value-at-risk. IEEE Transactions on Automatic Control, 67 0 (12): 0 6521--6536, 2021

work page 2021
[8]

Algorithms for CVaR optimization in MDPs

Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for CVaR optimization in MDPs . In Advances in Neural Information Processing Systems (NeurIPS), pages 3509--3517, 2014

work page 2014
[9]

Risk-sensitive and robust decision-making: A CVaR optimization approach

Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: A CVaR optimization approach. In Advances in Neural Information Processing Systems (NeurIPS), volume 28, 2015

work page 2015
[10]

Risk-constrained reinforcement learning with percentile risk criteria

Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18 0 (167): 0 1--51, 2018

work page 2018
[11]

CVaR optimization for MDPs : Existence and computation of optimal policies

Rui Ding and Eugene Feinberg. CVaR optimization for MDPs : Existence and computation of optimal policies. ACM SIGMETRICS Performance Evaluation Review, 50 0 (2): 0 39--41, 2022 a

work page 2022
[12]

Feinberg

Rui Ding and Eugene A. Feinberg. Sequential optimization of CVaR . arXiv preprint arXiv:2211.07288, 2022 b

work page arXiv 2022
[13]

Stochastic Finance: An Introduction in Discrete Time

Hans F \"o llmer and Alexander Schied. Stochastic Finance: An Introduction in Discrete Time. De Gruyter, 2016

work page 2016
[14]

Two steps to risk sensitivity

Christopher Gagne and Peter Dayan. Two steps to risk sensitivity. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 2361--2372, 2021

work page 2021
[15]

Guidelines for reinforcement learning in healthcare

Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale Doshi-Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare. Nature Medicine, 25 0 (1): 0 16--18, 2019

work page 2019
[16]

On dynamic programming decompositions of static risk measures in markov decision processes

Jia Lin Hau, Erick Delage, Mohammad Ghavamzadeh, and Marek Petrik. On dynamic programming decompositions of static risk measures in markov decision processes. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, pages 51734--51757, 2023

work page 2023
[17]

Schapire

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the International Conference on World Wide Web (WWW), pages 661--670, 2010

work page 2010
[18]

Brandeau

Xiaocheng Li, Huaiyang Zhong, and Margaret L. Brandeau. Quantile markov decision processes. Operations Research, 70 0 (3): 0 1428--1447, 2022

work page 2022
[19]

Bias and variance approximation in value function estimates

Shie Mannor, Duncan Simester, Peng Sun, and John N Tsitsiklis. Bias and variance approximation in value function estimates. Management Science, 53 0 (2): 0 308--322, 2007

work page 2007
[20]

Pflug and Alois Pichler

Georg Ch. Pflug and Alois Pichler. Time-consistent decisions and temporal decomposition of coherent risk functionals. Mathematics of Operations Research, 41 0 (2): 0 682--699, 2016

work page 2016
[21]

L. A. Prashanth, Michael C. Fu, et al. Risk-sensitive reinforcement learning via policy gradient search. Foundations and Trends in Machine Learning, 15 0 (5): 0 537--693, 2022

work page 2022
[22]

Puterman

Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014

work page 2014
[23]

Risk-averse bayes-adaptive reinforcement learning

Marc Rigter, Bruno Lacerda, and Nick Hawes. Risk-averse bayes-adaptive reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 1142--1154, 2021

work page 2021
[24]

Tyrrell Rockafellar and Stanislav Uryasev

R. Tyrrell Rockafellar and Stanislav Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2 0 (3): 0 21--42, 2000

work page 2000
[25]

Lectures on Stochastic Programming: Modeling and Theory

Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczy \'n ski. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2014

work page 2014
[26]

A general reinforcement learning algorithm that masters chess, shogi, and go through self-play

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362 0 (6419): 0 1140--1144, 2018

work page 2018
[27]

Risk-averse distributional reinforcement learning: A CVaR optimization approach

Silvestr Stanko and Karel Macek. Risk-averse distributional reinforcement learning: A CVaR optimization approach. In Proceedings of the International Joint Conference on Computational Intelligence (IJCCI), pages 412--423, 2019

work page 2019
[28]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, 2018

work page 2018
[29]

Algorithms for Reinforcement Learning

Csaba Szepesv \'a ri. Algorithms for Reinforcement Learning. Springer Nature, 2022

work page 2022
[30]

Optimizing the CVaR via sampling

Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the CVaR via sampling. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 29, 2015

work page 2015
[31]

Czarnecki, Micha \"e l Mathieu, Andrew Dudzik, Junyoung Chung, David H

Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Micha \"e l Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575 0 (7782): 0 350--354, 2019

work page 2019

[1] [1]

Coherent measures of risk

Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent measures of risk. Mathematical Finance, 9 0 (3): 0 203--228, 1999

work page 1999

[2] [2]

Path planning for automation of surgery robot based on probabilistic roadmap and reinforcement learning

Donghoon Baek, Minho Hwang, Hansoul Kim, and Dong-Soo Kwon. Path planning for automation of surgery robot based on probabilistic roadmap and reinforcement learning. In 2018 15th International Conference on Ubiquitous Robots (UR), pages 342--347, 2018

work page 2018

[3] [3]

Minimum capital requirements for market risk

Basel Committee on Banking Supervision . Minimum capital requirements for market risk. In Basel III: International Regulatory Framework for Banks. Bank for International Settlements, 2019

work page 2019

[4] [4]

Markov decision processes with average-value-at-risk criteria

Nicole B \"a uerle and Jonathan Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74 0 (3): 0 361--379, 2011

work page 2011

[5] [5]

Filar, Yuanlie Lin, and Lieneke Spanjers

Kang Boda, Jerzy A. Filar, Yuanlie Lin, and Lieneke Spanjers. Stochastic target hitting time and the problem of early retirement. IEEE Transactions on Automatic Control, 49 0 (3): 0 409--419, 2004

work page 2004

[6] [6]

Chapman, Jonathan Lacotte, Aviv Tamar, Donggun Lee, Kevin M

Margaret P. Chapman, Jonathan Lacotte, Aviv Tamar, Donggun Lee, Kevin M. Smith, Victoria Cheng, Jaime F. Fisac, Susmit Jha, Marco Pavone, and Claire J. Tomlin. A risk-sensitive finite-time reachability approach for safety of stochastic dynamic systems. In American Control Conference (ACC), pages 2958--2963, 2019

work page 2019

[7] [7]

Chapman, Riccardo Bonalli, Kevin M

Margaret P. Chapman, Riccardo Bonalli, Kevin M. Smith, Insoon Yang, Marco Pavone, and Claire J. Tomlin. Risk-sensitive safety analysis using conditional value-at-risk. IEEE Transactions on Automatic Control, 67 0 (12): 0 6521--6536, 2021

work page 2021

[8] [8]

Algorithms for CVaR optimization in MDPs

Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for CVaR optimization in MDPs . In Advances in Neural Information Processing Systems (NeurIPS), pages 3509--3517, 2014

work page 2014

[9] [9]

Risk-sensitive and robust decision-making: A CVaR optimization approach

Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: A CVaR optimization approach. In Advances in Neural Information Processing Systems (NeurIPS), volume 28, 2015

work page 2015

[10] [10]

Risk-constrained reinforcement learning with percentile risk criteria

Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18 0 (167): 0 1--51, 2018

work page 2018

[11] [11]

CVaR optimization for MDPs : Existence and computation of optimal policies

Rui Ding and Eugene Feinberg. CVaR optimization for MDPs : Existence and computation of optimal policies. ACM SIGMETRICS Performance Evaluation Review, 50 0 (2): 0 39--41, 2022 a

work page 2022

[12] [12]

Feinberg

Rui Ding and Eugene A. Feinberg. Sequential optimization of CVaR . arXiv preprint arXiv:2211.07288, 2022 b

work page arXiv 2022

[13] [13]

Stochastic Finance: An Introduction in Discrete Time

Hans F \"o llmer and Alexander Schied. Stochastic Finance: An Introduction in Discrete Time. De Gruyter, 2016

work page 2016

[14] [14]

Two steps to risk sensitivity

Christopher Gagne and Peter Dayan. Two steps to risk sensitivity. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 2361--2372, 2021

work page 2021

[15] [15]

Guidelines for reinforcement learning in healthcare

Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale Doshi-Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare. Nature Medicine, 25 0 (1): 0 16--18, 2019

work page 2019

[16] [16]

On dynamic programming decompositions of static risk measures in markov decision processes

Jia Lin Hau, Erick Delage, Mohammad Ghavamzadeh, and Marek Petrik. On dynamic programming decompositions of static risk measures in markov decision processes. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, pages 51734--51757, 2023

work page 2023

[17] [17]

Schapire

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the International Conference on World Wide Web (WWW), pages 661--670, 2010

work page 2010

[18] [18]

Brandeau

Xiaocheng Li, Huaiyang Zhong, and Margaret L. Brandeau. Quantile markov decision processes. Operations Research, 70 0 (3): 0 1428--1447, 2022

work page 2022

[19] [19]

Bias and variance approximation in value function estimates

Shie Mannor, Duncan Simester, Peng Sun, and John N Tsitsiklis. Bias and variance approximation in value function estimates. Management Science, 53 0 (2): 0 308--322, 2007

work page 2007

[20] [20]

Pflug and Alois Pichler

Georg Ch. Pflug and Alois Pichler. Time-consistent decisions and temporal decomposition of coherent risk functionals. Mathematics of Operations Research, 41 0 (2): 0 682--699, 2016

work page 2016

[21] [21]

L. A. Prashanth, Michael C. Fu, et al. Risk-sensitive reinforcement learning via policy gradient search. Foundations and Trends in Machine Learning, 15 0 (5): 0 537--693, 2022

work page 2022

[22] [22]

Puterman

Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014

work page 2014

[23] [23]

Risk-averse bayes-adaptive reinforcement learning

Marc Rigter, Bruno Lacerda, and Nick Hawes. Risk-averse bayes-adaptive reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 1142--1154, 2021

work page 2021

[24] [24]

Tyrrell Rockafellar and Stanislav Uryasev

R. Tyrrell Rockafellar and Stanislav Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2 0 (3): 0 21--42, 2000

work page 2000

[25] [25]

Lectures on Stochastic Programming: Modeling and Theory

Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczy \'n ski. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2014

work page 2014

[26] [26]

A general reinforcement learning algorithm that masters chess, shogi, and go through self-play

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362 0 (6419): 0 1140--1144, 2018

work page 2018

[27] [27]

Risk-averse distributional reinforcement learning: A CVaR optimization approach

Silvestr Stanko and Karel Macek. Risk-averse distributional reinforcement learning: A CVaR optimization approach. In Proceedings of the International Joint Conference on Computational Intelligence (IJCCI), pages 412--423, 2019

work page 2019

[28] [28]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, 2018

work page 2018

[29] [29]

Algorithms for Reinforcement Learning

Csaba Szepesv \'a ri. Algorithms for Reinforcement Learning. Springer Nature, 2022

work page 2022

[30] [30]

Optimizing the CVaR via sampling

Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the CVaR via sampling. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 29, 2015

work page 2015

[31] [31]

Czarnecki, Micha \"e l Mathieu, Andrew Dudzik, Junyoung Chung, David H

Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Micha \"e l Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575 0 (7782): 0 350--354, 2019

work page 2019