pith. sign in

arxiv: 2507.14005 · v2 · submitted 2025-07-18 · 💻 cs.LG

On the Fundamental Limitations of Dual Static CVaR Decompositions in Markov Decision Processes

Pith reviewed 2026-05-19 03:33 UTC · model grok-4.3

classification 💻 cs.LG
keywords CVaRMarkov decision processesrisk measurespolicy evaluationdual decompositiondynamic programmingconsistency constraints
0
0 comments X

The pith

No single policy can be optimal for all risk levels under dual static CVaR in some MDPs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes static CVaR policy evaluation in MDPs as two separate minimization problems whose solutions agree only when risk-assignment consistency constraints are satisfied. An empty intersection of those constraints produces a nonzero CVaR evaluation gap that explains the failures of dual-based dynamic programming methods. The authors use this perspective to construct an explicit MDP in which the dual decomposition admits no single policy that is optimal across every initial risk level.

Core claim

Static CVaR evaluation decomposes into two distinct minimization problems; risk-assignment consistency constraints must hold for the problems to return matching values. When the constraints have empty intersection, a CVaR evaluation gap appears. Policies returned by dual CVaR dynamic programming inherit this gap, and the same constraint view proves that, in at least one MDP, no policy is simultaneously optimal for all initial risk levels.

What carries the argument

risk-assignment consistency constraints, which must be jointly satisfied by the two minimization problems for static CVaR evaluation to be consistent

If this is right

  • Dual-based dynamic programming returns policies whose CVaR evaluation gap is positive.
  • The size of the evaluation gap directly measures the discrepancy between the two minimization problems.
  • Uniform optimality over all initial risk levels is impossible for the dual formulation in certain MDPs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Alternative formulations that avoid splitting the CVaR objective into two separate problems may be required for reliable optimization.
  • The concrete MDP example supplies a minimal test case for checking whether any proposed fix restores uniform optimality.
  • The same inconsistency mechanism could appear in other risk measures whose dual representations rely on similar decompositions.

Load-bearing premise

The original CVaR objective is exactly and completely captured by the pair of dual minimization problems, so that agreement between them requires the risk-assignment consistency constraints.

What would settle it

Exhibit a single policy that is optimal for every initial risk level inside the MDP constructed by the paper.

Figures

Figures reproduced from arXiv: 2507.14005 by Audrey Durand, Mathieu Godbout.

Figure 1
Figure 1. Figure 1: Sample MDP from Hau et al. [2023]. Next state transition probabilities are in [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Visual representation of the risk-assignment constraints on the MDP from Figure 1, with [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Evolution of the static CVaR evaluation of all possible policies on the MDP presented in [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Relation between initial risk-level α and the corresponding risk-level Y(s1) for all three possible policies on the MDP presented in [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗
read the original abstract

It was recently shown that dynamic programming (DP) methods for finding static CVaR-optimal policies in Markov Decision Processes (MDPs) can fail when based on the dual formulation, yet the root cause of this failure remains unclear. We expand on these findings by shifting focus from policy optimization to the seemingly simpler task of policy evaluation. We show that evaluating the static CVaR of a given policy can be framed as two distinct minimization problems. We introduce a set of ``risk-assignment consistency constraints'' that must be satisfied for their solutions to match and we demonstrate that an empty intersection of these constraints is the source of previously observed evaluation errors. Quantifying the evaluation error as the \emph{CVaR evaluation gap}, we demonstrate that the issues observed when optimizing over the dual-based CVaR DP are explained by the returned policy having a non-zero CVaR evaluation gap. Finally, we leverage our proposed risk-assignment constraints perspective to prove that the search for a single, uniformly optimal policy on the dual CVaR decomposition is fundamentally limited, identifying an MDP where no single policy can be optimal across all initial risk levels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that static CVaR policy evaluation in MDPs can be expressed as two distinct minimization problems whose solutions coincide only when a set of risk-assignment consistency constraints has non-empty intersection. An empty intersection produces a CVaR evaluation gap that accounts for observed failures in dual-based dynamic programming. The authors use this perspective to prove that no single policy can be uniformly optimal across all initial risk levels, supported by a concrete MDP counterexample.

Significance. If the decomposition is complete and the counterexample holds, the work supplies a precise, falsifiable explanation for why dual CVaR methods cannot yield uniformly optimal policies and shifts attention from optimization failures to an underlying evaluation inconsistency. The explicit construction of an MDP where the constraints intersect emptily is a concrete strength that grounds the fundamental-limitation claim.

major comments (2)
  1. [Section introducing the risk-assignment consistency constraints] Section introducing the risk-assignment consistency constraints: the manuscript states that the two minimization problems constitute the dual decomposition of static CVaR and that their solutions must coincide precisely when the constraints are satisfied. A more explicit derivation from the original CVaR definition (showing why no additional dual variables or alternative decompositions are possible) would strengthen the claim that empty intersection is the sole source of the evaluation gap and the uniform-optimality limitation.
  2. [MDP counterexample] MDP counterexample (the section presenting the concrete MDP): the claim that no single policy is optimal for all initial risk levels rests on demonstrating both an empty intersection and that the resulting optimal policies differ across risk levels. Explicit verification that the constructed transition and reward structure produces this emptiness, together with the numerical or symbolic values of the two minimization problems, is needed to confirm the example is not an artifact of the particular risk levels chosen.
minor comments (2)
  1. [Abstract and introduction] The term 'CVaR evaluation gap' is used throughout but is not formally defined until after the constraints are introduced; a brief forward reference or one-sentence definition in the abstract and introduction would improve readability.
  2. [Notation and definitions] Notation for the two minimization problems (e.g., the variables representing risk assignments) should be introduced with a short table or explicit mapping to the original CVaR formulation to avoid ambiguity when the constraints are stated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. The suggested clarifications will improve the rigor of our presentation regarding the dual decomposition and the counterexample. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Section introducing the risk-assignment consistency constraints] Section introducing the risk-assignment consistency constraints: the manuscript states that the two minimization problems constitute the dual decomposition of static CVaR and that their solutions must coincide precisely when the constraints are satisfied. A more explicit derivation from the original CVaR definition (showing why no additional dual variables or alternative decompositions are possible) would strengthen the claim that empty intersection is the sole source of the evaluation gap and the uniform-optimality limitation.

    Authors: We agree that a more explicit derivation from the original CVaR definition would strengthen the presentation. In the revised manuscript we will add a dedicated derivation subsection that starts from the static CVaR definition, obtains the two minimization problems, and shows why the risk-assignment consistency constraints are the precise conditions for their solutions to coincide. The derivation will also clarify that the chosen decomposition is canonical for static CVaR and that alternative dual variables would alter the semantics of the original risk measure. revision: yes

  2. Referee: [MDP counterexample] MDP counterexample (the section presenting the concrete MDP): the claim that no single policy is optimal for all initial risk levels rests on demonstrating both an empty intersection and that the resulting optimal policies differ across risk levels. Explicit verification that the constructed transition and reward structure produces this emptiness, together with the numerical or symbolic values of the two minimization problems, is needed to confirm the example is not an artifact of the particular risk levels chosen.

    Authors: We thank the referee for highlighting the need for explicit verification. In the revised manuscript we will augment the counterexample section with the explicit computation of the risk-assignment consistency constraints for the given transition and reward structure, demonstrating that their intersection is empty. We will also report the numerical (or symbolic) values attained by each of the two minimization problems at representative risk levels, confirming that the resulting optimal policies differ and that the emptiness is independent of the specific risk levels selected. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation self-contained via definitions and counterexample MDP

full rationale

The paper frames static CVaR evaluation as two minimization problems, introduces risk-assignment consistency constraints whose empty intersection explains evaluation gaps, and constructs an explicit MDP to prove no single policy is uniformly optimal across risk levels. This chain rests on the definitions of static CVaR, standard MDP transition and reward structure, and direct verification in the counterexample; no equation reduces to a fitted parameter renamed as prediction, no load-bearing premise collapses to a self-citation, and the uniqueness claim is established by exhibiting a concrete MDP rather than by imported theorem or ansatz. The argument is therefore independent of its own outputs and does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The work relies on standard MDP assumptions (finite states/actions, stationary policies) and prior definitions of static CVaR; no new free parameters or invented entities beyond the introduced constraints are apparent from the abstract.

axioms (1)
  • domain assumption MDPs are finite-state, finite-action, and the static CVaR is defined via the dual formulation as in prior work.
    Invoked to frame the two minimization problems and consistency constraints.
invented entities (2)
  • risk-assignment consistency constraints no independent evidence
    purpose: To ensure the two minimization problems yield matching CVaR values.
    New concept introduced to diagnose the evaluation gap.
  • CVaR evaluation gap no independent evidence
    purpose: To quantify the discrepancy between the two minimization problems.
    Introduced as a diagnostic quantity for the observed errors.

pith-pipeline@v0.9.0 · 5728 in / 1381 out tokens · 40946 ms · 2026-05-19T03:33:51.615654+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

  1. [1]

    Coherent measures of risk

    Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent measures of risk. Mathematical Finance, 9 0 (3): 0 203--228, 1999

  2. [2]

    Path planning for automation of surgery robot based on probabilistic roadmap and reinforcement learning

    Donghoon Baek, Minho Hwang, Hansoul Kim, and Dong-Soo Kwon. Path planning for automation of surgery robot based on probabilistic roadmap and reinforcement learning. In 2018 15th International Conference on Ubiquitous Robots (UR), pages 342--347, 2018

  3. [3]

    Minimum capital requirements for market risk

    Basel Committee on Banking Supervision . Minimum capital requirements for market risk. In Basel III: International Regulatory Framework for Banks. Bank for International Settlements, 2019

  4. [4]

    Markov decision processes with average-value-at-risk criteria

    Nicole B \"a uerle and Jonathan Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74 0 (3): 0 361--379, 2011

  5. [5]

    Filar, Yuanlie Lin, and Lieneke Spanjers

    Kang Boda, Jerzy A. Filar, Yuanlie Lin, and Lieneke Spanjers. Stochastic target hitting time and the problem of early retirement. IEEE Transactions on Automatic Control, 49 0 (3): 0 409--419, 2004

  6. [6]

    Chapman, Jonathan Lacotte, Aviv Tamar, Donggun Lee, Kevin M

    Margaret P. Chapman, Jonathan Lacotte, Aviv Tamar, Donggun Lee, Kevin M. Smith, Victoria Cheng, Jaime F. Fisac, Susmit Jha, Marco Pavone, and Claire J. Tomlin. A risk-sensitive finite-time reachability approach for safety of stochastic dynamic systems. In American Control Conference (ACC), pages 2958--2963, 2019

  7. [7]

    Chapman, Riccardo Bonalli, Kevin M

    Margaret P. Chapman, Riccardo Bonalli, Kevin M. Smith, Insoon Yang, Marco Pavone, and Claire J. Tomlin. Risk-sensitive safety analysis using conditional value-at-risk. IEEE Transactions on Automatic Control, 67 0 (12): 0 6521--6536, 2021

  8. [8]

    Algorithms for CVaR optimization in MDPs

    Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for CVaR optimization in MDPs . In Advances in Neural Information Processing Systems (NeurIPS), pages 3509--3517, 2014

  9. [9]

    Risk-sensitive and robust decision-making: A CVaR optimization approach

    Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: A CVaR optimization approach. In Advances in Neural Information Processing Systems (NeurIPS), volume 28, 2015

  10. [10]

    Risk-constrained reinforcement learning with percentile risk criteria

    Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18 0 (167): 0 1--51, 2018

  11. [11]

    CVaR optimization for MDPs : Existence and computation of optimal policies

    Rui Ding and Eugene Feinberg. CVaR optimization for MDPs : Existence and computation of optimal policies. ACM SIGMETRICS Performance Evaluation Review, 50 0 (2): 0 39--41, 2022 a

  12. [12]

    Feinberg

    Rui Ding and Eugene A. Feinberg. Sequential optimization of CVaR . arXiv preprint arXiv:2211.07288, 2022 b

  13. [13]

    Stochastic Finance: An Introduction in Discrete Time

    Hans F \"o llmer and Alexander Schied. Stochastic Finance: An Introduction in Discrete Time. De Gruyter, 2016

  14. [14]

    Two steps to risk sensitivity

    Christopher Gagne and Peter Dayan. Two steps to risk sensitivity. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 2361--2372, 2021

  15. [15]

    Guidelines for reinforcement learning in healthcare

    Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale Doshi-Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare. Nature Medicine, 25 0 (1): 0 16--18, 2019

  16. [16]

    On dynamic programming decompositions of static risk measures in markov decision processes

    Jia Lin Hau, Erick Delage, Mohammad Ghavamzadeh, and Marek Petrik. On dynamic programming decompositions of static risk measures in markov decision processes. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, pages 51734--51757, 2023

  17. [17]

    Schapire

    Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the International Conference on World Wide Web (WWW), pages 661--670, 2010

  18. [18]

    Brandeau

    Xiaocheng Li, Huaiyang Zhong, and Margaret L. Brandeau. Quantile markov decision processes. Operations Research, 70 0 (3): 0 1428--1447, 2022

  19. [19]

    Bias and variance approximation in value function estimates

    Shie Mannor, Duncan Simester, Peng Sun, and John N Tsitsiklis. Bias and variance approximation in value function estimates. Management Science, 53 0 (2): 0 308--322, 2007

  20. [20]

    Pflug and Alois Pichler

    Georg Ch. Pflug and Alois Pichler. Time-consistent decisions and temporal decomposition of coherent risk functionals. Mathematics of Operations Research, 41 0 (2): 0 682--699, 2016

  21. [21]

    L. A. Prashanth, Michael C. Fu, et al. Risk-sensitive reinforcement learning via policy gradient search. Foundations and Trends in Machine Learning, 15 0 (5): 0 537--693, 2022

  22. [22]

    Puterman

    Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014

  23. [23]

    Risk-averse bayes-adaptive reinforcement learning

    Marc Rigter, Bruno Lacerda, and Nick Hawes. Risk-averse bayes-adaptive reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 34, pages 1142--1154, 2021

  24. [24]

    Tyrrell Rockafellar and Stanislav Uryasev

    R. Tyrrell Rockafellar and Stanislav Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2 0 (3): 0 21--42, 2000

  25. [25]

    Lectures on Stochastic Programming: Modeling and Theory

    Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczy \'n ski. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2014

  26. [26]

    A general reinforcement learning algorithm that masters chess, shogi, and go through self-play

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362 0 (6419): 0 1140--1144, 2018

  27. [27]

    Risk-averse distributional reinforcement learning: A CVaR optimization approach

    Silvestr Stanko and Karel Macek. Risk-averse distributional reinforcement learning: A CVaR optimization approach. In Proceedings of the International Joint Conference on Computational Intelligence (IJCCI), pages 412--423, 2019

  28. [28]

    Sutton and Andrew G

    Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, 2018

  29. [29]

    Algorithms for Reinforcement Learning

    Csaba Szepesv \'a ri. Algorithms for Reinforcement Learning. Springer Nature, 2022

  30. [30]

    Optimizing the CVaR via sampling

    Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the CVaR via sampling. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 29, 2015

  31. [31]

    Czarnecki, Micha \"e l Mathieu, Andrew Dudzik, Junyoung Chung, David H

    Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Micha \"e l Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575 0 (7782): 0 350--354, 2019