On the Fundamental Limitations of Dual Static CVaR Decompositions in Markov Decision Processes
Pith reviewed 2026-05-19 03:33 UTC · model grok-4.3
The pith
No single policy can be optimal for all risk levels under dual static CVaR in some MDPs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Static CVaR evaluation decomposes into two distinct minimization problems; risk-assignment consistency constraints must hold for the problems to return matching values. When the constraints have empty intersection, a CVaR evaluation gap appears. Policies returned by dual CVaR dynamic programming inherit this gap, and the same constraint view proves that, in at least one MDP, no policy is simultaneously optimal for all initial risk levels.
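For readers who want the static CVaR objective in concrete form, the following is a minimal sketch for a finite reward distribution. It assumes the lower-tail (reward) convention, which this page does not state, and the function names and toy numbers are illustrative. It computes CVaR once through the Rockafellar-Uryasev variational form and once through the dual form (a reweighting xi with 0 <= xi <= 1/alpha and E[xi] = 1). These are the textbook primal and dual forms for a single random variable, not the two policy-evaluation problems studied in the paper.

```python
import numpy as np

def cvar_primal(values, probs, alpha):
    """Lower-tail CVaR via max_t { t - E[(t - X)^+] / alpha } (Rockafellar-Uryasev form)."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    # The objective is concave and piecewise linear in t, so an atom attains the maximum.
    return max(t - np.dot(probs, np.maximum(t - values, 0.0)) / alpha for t in values)

def cvar_dual(values, probs, alpha):
    """Lower-tail CVaR via min E[xi * X] subject to 0 <= xi <= 1/alpha and E[xi] = 1."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    remaining, total = alpha, 0.0  # tail probability mass still to allocate
    for i in np.argsort(values):   # the optimal xi loads the worst outcomes first
        mass = min(probs[i], remaining)
        total += mass * values[i]
        remaining -= mass
        if remaining <= 0.0:
            break
    return total / alpha

# Hypothetical toy distribution (assumption, not taken from the paper).
values = [0.0, 1.0, 10.0]
probs = [0.1, 0.4, 0.5]
primal_value = cvar_primal(values, probs, 0.25)
dual_value = cvar_dual(values, probs, 0.25)
```

For a single random variable the two forms agree. The gap described in this summary concerns evaluation carried out through the dynamic-programming decomposition, which this sketch does not attempt.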
What carries the argument
The risk-assignment consistency constraints, which the two minimization problems must jointly satisfy for static CVaR evaluation to be consistent.
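A minimal sketch of what a risk assignment can look like, assuming the standard one-step CVaR decomposition from the risk-averse MDP literature, the lower-tail (reward) convention, and two successor branches with fixed return distributions. The reweighting xi assigns risk level alpha * xi_i to branch i. The grid search, the toy numbers, and all names are illustrative assumptions, and the paper's own consistency constraints are not reproduced here.

```python
import numpy as np

def tail_sum(values, probs, beta):
    """Return beta * CVaR_beta(X): the probability-weighted sum of the worst beta mass."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    remaining, total = beta, 0.0
    for i in np.argsort(values):  # lowest rewards first
        mass = min(probs[i], remaining)
        total += mass * values[i]
        remaining -= mass
        if remaining <= 0.0:
            break
    return total

def direct_cvar(branch_probs, branches, alpha):
    """CVaR of the mixture return, computed without any decomposition."""
    values = np.concatenate([np.asarray(v, dtype=float) for v, _ in branches])
    probs = np.concatenate([p * np.asarray(q, dtype=float)
                            for p, (_, q) in zip(branch_probs, branches)])
    return tail_sum(values, probs, alpha) / alpha

def decomposed_cvar(branch_probs, branches, alpha, grid=2001, tol=1e-12):
    """Grid-search min over xi of sum_i p_i * xi_i * CVaR_{alpha*xi_i}(Z_i), two branches."""
    p1, p2 = branch_probs
    upper = 1.0 / alpha
    best_value, best_xi = np.inf, None
    for xi1 in np.linspace(0.0, upper, grid):
        xi2 = (1.0 - p1 * xi1) / p2          # enforce E[xi] = 1
        if xi2 < -tol or xi2 > upper + tol:  # enforce 0 <= xi <= 1/alpha
            continue
        xi2 = min(max(xi2, 0.0), upper)
        # xi_i * CVaR_{alpha*xi_i}(Z_i) equals tail_sum(Z_i, alpha * xi_i) / alpha
        value = (p1 * tail_sum(*branches[0], alpha * xi1)
                 + p2 * tail_sum(*branches[1], alpha * xi2)) / alpha
        if value < best_value:
            best_value, best_xi = value, (xi1, xi2)
    return best_value, best_xi

# Hypothetical toy: two equally likely successor states with their return distributions.
branch_probs = (0.5, 0.5)
branches = [([-2.0, 4.0], [0.5, 0.5]), ([1.0, 2.0], [0.5, 0.5])]
alpha = 0.25
value, xi = decomposed_cvar(branch_probs, branches, alpha)
reference = direct_cvar(branch_probs, branches, alpha)
```

In this single-step toy the decomposed value matches the direct one, with all of the risk assigned to the first branch. According to the summary on this page, the difficulty arises when such assignments have to be mutually consistent across the two evaluation problems, which this sketch does not model.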
If this is right
- Dual-based dynamic programming can return policies whose CVaR evaluation gap is non-zero.
- The size of the evaluation gap directly measures the discrepancy between the two minimization problems.
- Uniform optimality over all initial risk levels is impossible for the dual formulation in certain MDPs.
Where Pith is reading between the lines
- Alternative formulations that avoid splitting the CVaR objective into two separate problems may be required for reliable optimization.
- The concrete MDP example supplies a minimal test case for checking whether any proposed fix restores uniform optimality.
- The same inconsistency mechanism could appear in other risk measures whose dual representations rely on similar decompositions.
Load-bearing premise
The original CVaR objective is exactly and completely captured by the pair of dual minimization problems, so that agreement between them requires the risk-assignment consistency constraints.
What would settle it
Exhibit a single policy that is optimal for every initial risk level inside the MDP constructed by the paper.
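A simplified sketch of how such a check could be organized, assuming each candidate policy is summarized by a finite distribution over returns and that CVaR uses the lower-tail (reward) convention. The candidate names, the return distributions, and the grid of risk levels are hypothetical and are not the MDP constructed by the paper. A finite grid can only refute uniform optimality among the listed candidates; it cannot prove it, and the paper's policy class is not reproduced here.

```python
import numpy as np

def lower_tail_cvar(values, probs, alpha):
    """Average of the worst alpha-fraction of a finite reward distribution."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    remaining, total = alpha, 0.0
    for i in np.argsort(values):  # lowest rewards first
        mass = min(probs[i], remaining)
        total += mass * values[i]
        remaining -= mass
        if remaining <= 0.0:
            break
    return total / alpha

def uniformly_optimal(candidates, alphas, tol=1e-9):
    """Return the candidates whose CVaR is maximal at every risk level in the grid."""
    winners = set(candidates)
    for alpha in alphas:
        scores = {name: lower_tail_cvar(v, p, alpha)
                  for name, (v, p) in candidates.items()}
        best = max(scores.values())
        # Keep only candidates that are (near-)optimal at this risk level.
        winners &= {name for name, score in scores.items() if score >= best - tol}
    return sorted(winners)

# Hypothetical return distributions: (values, probabilities) per candidate policy.
candidates = {
    "safe": ([1.0], [1.0]),
    "risky": ([0.0, 2.0], [0.2, 0.8]),
}
alphas = np.linspace(0.05, 1.0, 20)
survivors = uniformly_optimal(candidates, alphas)
```

In this toy, "safe" wins at small risk levels and "risky" wins near alpha = 1, so no candidate survives the whole grid. An empty result is evidence only about the listed candidates, not about all policies.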
Original abstract
It was recently shown that dynamic programming (DP) methods for finding static CVaR-optimal policies in Markov Decision Processes (MDPs) can fail when based on the dual formulation, yet the root cause of this failure remains unclear. We expand on these findings by shifting focus from policy optimization to the seemingly simpler task of policy evaluation. We show that evaluating the static CVaR of a given policy can be framed as two distinct minimization problems. We introduce a set of "risk-assignment consistency constraints" that must be satisfied for their solutions to match and we demonstrate that an empty intersection of these constraints is the source of previously observed evaluation errors. Quantifying the evaluation error as the CVaR evaluation gap, we demonstrate that the issues observed when optimizing over the dual-based CVaR DP are explained by the returned policy having a non-zero CVaR evaluation gap. Finally, we leverage our proposed risk-assignment constraints perspective to prove that the search for a single, uniformly optimal policy on the dual CVaR decomposition is fundamentally limited, identifying an MDP where no single policy can be optimal across all initial risk levels.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that static CVaR policy evaluation in MDPs can be expressed as two distinct minimization problems whose solutions coincide only when a set of risk-assignment consistency constraints has non-empty intersection. An empty intersection produces a CVaR evaluation gap that accounts for observed failures in dual-based dynamic programming. The authors use this perspective to prove that no single policy can be uniformly optimal across all initial risk levels, supported by a concrete MDP counterexample.
Significance. If the decomposition is complete and the counterexample holds, the work supplies a precise, falsifiable explanation for why dual CVaR methods cannot yield uniformly optimal policies and shifts attention from optimization failures to an underlying evaluation inconsistency. The explicit construction of an MDP in which the constraints have an empty intersection is a concrete strength that grounds the fundamental-limitation claim.
major comments (2)
- [Section introducing the risk-assignment consistency constraints] The manuscript states that the two minimization problems constitute the dual decomposition of static CVaR and that their solutions must coincide precisely when the constraints are satisfied. A more explicit derivation from the original CVaR definition (showing why no additional dual variables or alternative decompositions are possible) would strengthen the claim that empty intersection is the sole source of the evaluation gap and the uniform-optimality limitation.
- [MDP counterexample] The claim that no single policy is optimal for all initial risk levels rests on demonstrating both an empty intersection and that the resulting optimal policies differ across risk levels. Explicit verification that the constructed transition and reward structure produces this emptiness, together with the numerical or symbolic values of the two minimization problems, is needed to confirm the example is not an artifact of the particular risk levels chosen.
minor comments (2)
- [Abstract and introduction] The term 'CVaR evaluation gap' is used throughout but is not formally defined until after the constraints are introduced; a brief forward reference or one-sentence definition in the abstract and introduction would improve readability.
- [Notation and definitions] Notation for the two minimization problems (e.g., the variables representing risk assignments) should be introduced with a short table or explicit mapping to the original CVaR formulation to avoid ambiguity when the constraints are stated.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. The suggested clarifications will improve the rigor of our presentation regarding the dual decomposition and the counterexample. We respond to each major comment below.
Point-by-point responses
- Referee: [Section introducing the risk-assignment consistency constraints] The manuscript states that the two minimization problems constitute the dual decomposition of static CVaR and that their solutions must coincide precisely when the constraints are satisfied. A more explicit derivation from the original CVaR definition (showing why no additional dual variables or alternative decompositions are possible) would strengthen the claim that empty intersection is the sole source of the evaluation gap and the uniform-optimality limitation.
  Authors: We agree that a more explicit derivation from the original CVaR definition would strengthen the presentation. In the revised manuscript we will add a dedicated derivation subsection that starts from the static CVaR definition, obtains the two minimization problems, and shows why the risk-assignment consistency constraints are the precise conditions for their solutions to coincide. The derivation will also clarify that the chosen decomposition is canonical for static CVaR and that alternative dual variables would alter the semantics of the original risk measure.
  Revision: yes
- Referee: [MDP counterexample] The claim that no single policy is optimal for all initial risk levels rests on demonstrating both an empty intersection and that the resulting optimal policies differ across risk levels. Explicit verification that the constructed transition and reward structure produces this emptiness, together with the numerical or symbolic values of the two minimization problems, is needed to confirm the example is not an artifact of the particular risk levels chosen.
  Authors: We thank the referee for highlighting the need for explicit verification. In the revised manuscript we will augment the counterexample section with the explicit computation of the risk-assignment consistency constraints for the given transition and reward structure, demonstrating that their intersection is empty. We will also report the numerical (or symbolic) values attained by each of the two minimization problems at representative risk levels, confirming that the resulting optimal policies differ and that the emptiness is independent of the specific risk levels selected.
  Revision: yes
Circularity Check
No circularity: derivation self-contained via definitions and counterexample MDP
The paper frames static CVaR evaluation as two minimization problems, introduces risk-assignment consistency constraints whose empty intersection explains evaluation gaps, and constructs an explicit MDP to prove no single policy is uniformly optimal across risk levels. This chain rests on the definitions of static CVaR, standard MDP transition and reward structure, and direct verification in the counterexample; no equation reduces to a fitted parameter renamed as prediction, no load-bearing premise collapses to a self-citation, and the impossibility claim is established by exhibiting a concrete MDP rather than by an imported theorem or ansatz. The argument is therefore independent of its own outputs and does not exhibit any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: MDPs are finite-state and finite-action, and the static CVaR is defined via the dual formulation as in prior work.
invented entities (2)
- risk-assignment consistency constraints: no independent evidence
- CVaR evaluation gap: no independent evidence