pith. sign in

arxiv: 2604.21177 · v1 · submitted 2026-04-23 · 🧮 math.OC

Revisiting Subgradient Dominance in Robust MDPs: Counterexamples, Hardness, and Sufficient Conditions

Pith reviewed 2026-05-09 21:59 UTC · model grok-4.3

classification 🧮 math.OC
keywords robust Markov decision processessubgradient dominanceprojected subgradient descentNP-hardnesslocal minimauncertainty setspolicy optimization
0
0 comments X

The pith

Robust MDPs with general compact uncertainty sets do not satisfy subgradient dominance, contrary to prior claims, and optimal policies can be NP-hard to find even when subgradients are computable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that claims of subgradient dominance holding for robust Markov decision processes under arbitrary compact uncertainty sets are incorrect. Counterexamples demonstrate that even uncertainty sets of cardinality two allow the objective to lack this property and admit strict suboptimal local minima. The authors prove NP-hardness of finding epsilon-optimal policies in two settings where subgradients remain efficiently computable: finite transition uncertainty sets and sa-rectangular finite transition sets paired with finite cost uncertainty. They also supply two sufficient conditions for when the dominance property does hold.

Core claim

Existing assertions that RMDPs satisfy subgradient dominance for general compact uncertainty sets are false. Even for uncertainty sets with only two elements, the RMDP objective fails to be subgradient-dominant and can possess suboptimal strict local minima. Finding an epsilon-optimal policy is NP-hard in cases with finite transition uncertainty sets and sa-rectangular finite transition uncertainty with finite cost uncertainty, despite efficient subgradient computation. Subgradient dominance holds when, for each policy, either the worst-case transition kernel or the worst-case action-value function is unique.

What carries the argument

The subgradient dominance property, under which the squared norm of any subgradient lower-bounds the optimality gap of a policy, allowing projected subgradient descent to converge to near-optimal policies in polynomial steps.

If this is right

  • Projected subgradient descent applied to general RMDPs can converge to policies whose value is bounded away from optimal.
  • Convergence guarantees previously derived from subgradient dominance no longer apply to broad classes of robust MDPs.
  • Solving for epsilon-optimal policies remains computationally intractable in the identified finite-uncertainty settings even though subgradients can be evaluated efficiently.
  • Only the two identified uniqueness conditions restore the dominance property and its associated polynomial convergence for PSD.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Algorithm designers for robust sequential decision problems should check uniqueness of worst-case kernels or values before relying on subgradient methods.
  • Alternative optimization techniques such as regularized objectives or global search may be needed when uniqueness fails.
  • Similar counterexamples and hardness results could apply to other robust optimization problems over sequential policies.

Load-bearing premise

That robust MDPs with arbitrary compact uncertainty sets always satisfy subgradient dominance, independent of whether the worst-case transition or value is unique for each policy.

What would settle it

A concrete RMDP example with a two-element uncertainty set whose objective has a strict local minimum whose value is strictly worse than the global optimum, or a polynomial-time algorithm solving an NP-hard instance of epsilon-optimal policy search under finite transition uncertainty.

Figures

Figures reproduced from arXiv: 2604.21177 by Alex Ayoub, Arnob Ghosh, Csaba Szepesv\'ari, Thang D. Chu, Toshinori Kitamura.

Figure 1
Figure 1. Figure 1: An RMDP example where PSD fails (left) and the corresponding robust total cost landscape (right). The policy π2, which always selects action a2, is a strictly suboptimal local minimum. Details are provided in Example 1. Example 1 (An RMDP where PSD fails) Consider a deterministic RMDP with three states(s1, s2, s+), two actions (a1, a2), and two transition kernels P = {P1, P2}, as illustrated in [PITH_FULL… view at source ↗
Figure 2
Figure 2. Figure 2: An illustration of the transition kernel [PITH_FULL_IMAGE:figures/full_fig_p023_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: An illustration of the transition and the m-th cost function for Cm = (xm1∨¬xm2∨xm3 ). Since the costs at other clause states sCi for i ̸= m are set to −(1 − γ) −1 , the worst-case transition kernel for cm is Pm, which transitions the agent to sCm and incurs zero cost. • Each variable state sxn is absorbing; that is, any action taken at sxn transitions to sxn with probability 1. • Transitions at the clause… view at source ↗
Figure 4
Figure 4. Figure 4: An s-rectangular RMDP example satisfying the subgradient dominance property but sat￾isfying neither Assumption 11 nor Assumption 12. Left: P1. Right: P2. Example 2: unique worst-case transition, non-unique worst-case action-value. Now consider a cost-robust MDP with two states {s1, s2} and two actions {a1, a2}. The initial state is s1. After taking any action at s1, the agent deterministically transitions … view at source ↗
read the original abstract

Projected subgradient descent (PSD) has gained popularity for solving robust Markov decision processes (RMDPs) because it applies to a broader class of uncertainty sets than traditional dynamic programming. Existing work claims that RMDPs with a general compact uncertainty set satisfy the subgradient dominance property, under which exact PSD converges to an $\varepsilon$-optimal policy in a polynomial number of updates (e.g., Wang et al., 2023). We show that these claims are incorrect. Even when the uncertainty set has cardinality two, the RMDP objective is not subgradient-dominant and can admit suboptimal strict local minima. Moreover, we prove that finding an $\varepsilon$-optimal policy can be NP-hard even in settings where subgradients are efficiently computable: (i) finite transition uncertainty sets and (ii) $sa$-rectangular finite transition uncertainty sets with finite cost uncertainty sets. Finally, we identify two conditions under which RMDPs do satisfy subgradient dominance: when, for each policy, either the worst-case transition kernel or the worst-case action-value function is unique.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper refutes prior claims that robust MDPs with general compact uncertainty sets satisfy the subgradient dominance property. It shows via explicit counterexamples (even for uncertainty sets of cardinality two) that the RMDP objective need not be subgradient-dominant and can admit suboptimal strict local minima. It further proves NP-hardness of computing an ε-optimal policy in two settings where subgradients remain efficiently computable (finite transition uncertainty sets; sa-rectangular finite transition sets with finite cost sets). Two sufficient conditions for subgradient dominance are identified: uniqueness of the worst-case transition kernel or of the worst-case action-value function for each policy.

Significance. If the counterexamples and reductions hold, the work corrects an important misconception in the literature on first-order methods for robust MDPs. It delineates precise limitations of projected subgradient descent and supplies verifiable sufficient conditions under which the method is guaranteed to succeed, thereby guiding algorithm design for robust sequential decision problems.

minor comments (2)
  1. [Abstract / Introduction] The abstract and introduction would benefit from a brief explicit statement of the two sufficient conditions (unique worst-case kernel vs. unique worst-case Q-function) rather than deferring the precise wording entirely to the later theorem.
  2. [Preliminaries] Notation for the uncertainty set U and the robust objective J(π) should be introduced once in a dedicated preliminary subsection to avoid repeated re-definition across sections.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work and for recommending acceptance. The referee correctly identifies the key contributions: explicit counterexamples showing that RMDPs with compact uncertainty sets (even of cardinality two) need not satisfy subgradient dominance and can have suboptimal strict local minima, NP-hardness results for computing ε-optimal policies in settings where subgradients are efficiently computable, and two verifiable sufficient conditions (uniqueness of the worst-case kernel or worst-case action-value function per policy) under which subgradient dominance holds.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper refutes prior claims on subgradient dominance for general compact uncertainty sets in RMDPs by providing explicit counterexamples (including for |U|=2) that demonstrate the objective is not subgradient-dominant and admits strict suboptimal local minima. It establishes NP-hardness of ε-optimal policy search via independent complexity reductions in settings with tractable subgradient evaluation (finite transition sets and sa-rectangular finite transition+cost sets). The two sufficient conditions (unique worst-case kernel or unique worst-case Q-function per policy) are stated directly from the problem definitions without any fitted parameters, self-referential derivations, or load-bearing self-citations. All steps are self-contained against external benchmarks and falsifiable via the constructions; no derivation reduces by construction to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard MDP assumptions (finite states/actions, compact uncertainty sets) and standard complexity theory; no free parameters are fitted to data and no new entities are postulated.

axioms (2)
  • domain assumption Finite state and action spaces with compact uncertainty sets
    Standard background for defining RMDPs and subgradients
  • standard math Standard NP-hardness reductions from known hard problems
    Used to establish computational hardness in the two settings

pith-pipeline@v0.9.0 · 5511 in / 1454 out tokens · 38801 ms · 2026-05-09T21:59:31.525755+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages

  1. [1]

    On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

    Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift . Journal of Machine Learning Research, 22 0 (98): 0 1--76, 2021

  2. [2]

    Constrained Markov Decision Processes , volume 7

    Eitan Altman. Constrained Markov Decision Processes , volume 7. CRC Press, 1999

  3. [3]

    A Unified Analysis of Descent Sequences in Weakly Convex Optimization, Including Convergence Rates for Bundle Methods

    Felipe Atenas, Claudia Sagastiz \'a bal, Paulo JS Silva, and Mikhail Solodov. A Unified Analysis of Descent Sequences in Weakly Convex Optimization, Including Convergence Rates for Bundle Methods . SIAM Journal on Optimization, 33 0 (1): 0 89--115, 2023

  4. [4]

    Solving Uncertain Markov Decision Processes

    J Andrew Bagnell, Andrew Y Ng, and Jeff G Schneider. Solving Uncertain Markov Decision Processes . Technical report, Carnegie Mellon University, 2001

  5. [5]

    On the Global Optimality of Policy Gradient Methods in General Utility Reinforcement Learning

    Anas Barakat, Souradip Chakraborty, Peihong Yu, Pratap Tokekar, and Amrit Singh Bedi. On the Global Optimality of Policy Gradient Methods in General Utility Reinforcement Learning . In Advances in Neural Information Processing Systems, 2025

  6. [6]

    Inverse Reinforcement Learning in Contextual MDPs

    Stav Belogolovsky, Philip Korsunsky, Shie Mannor, Chen Tessler, and Tom Zahavy. Inverse Reinforcement Learning in Contextual MDPs . Machine Learning, 110 0 (9): 0 2295--2334, 2021

  7. [7]

    Convex optimization theory , volume 1

    Dimitri Bertsekas. Convex optimization theory , volume 1. Athena Scientific, 2009

  8. [8]

    Your Policy Regularizer is Secretly an Adversary

    Rob Brekelmans, Tim Genewein, Jordi Grau-Moya, Gr \'e goire Del \'e tang, Markus Kunesch, Shane Legg, and Pedro Ortega. Your Policy Regularizer is Secretly an Adversary . Transactions on Machine Learning Research, 2022

  9. [9]

    Robust Reinforcement Learning with General Utility

    Ziyi Chen, Yan Wen, Zhengmian Hu, and Heng Huang. Robust Reinforcement Learning with General Utility . In Advances in Neural Information Processing Systems, 2025

  10. [10]

    Stochastic Model-Based Minimization of Weakly Convex Functions

    Damek Davis and Dmitriy Drusvyatskiy. Stochastic Model-Based Minimization of Weakly Convex Functions . SIAM Journal on Optimization, 29 0 (1): 0 207--239, 2019

  11. [11]

    Twice Regularized MDPs and the Equivalence Between Robustness and Regularization

    Esther Derman, Matthieu Geist, and Shie Mannor. Twice Regularized MDPs and the Equivalence Between Robustness and Regularization . In Advances in Neural Information Processing Systems, 2021

  12. [12]

    Efficiency of Minimizing Compositions of Convex Functions and Smooth Maps

    Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of Minimizing Compositions of Convex Functions and Smooth Maps . Mathematical Programming, 178: 0 503--558, 2019

  13. [13]

    Efficient Projections onto the L 1-ball for Learning in High Dimensions

    John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient Projections onto the L 1-ball for Learning in High Dimensions . In International Conference on Machine Learning, 2008

  14. [14]

    Diversity is All You Need: Learning Skills without a Reward Function

    Benjamin Eysenbach, Julian Ibarz, Abhishek Gupta, and Sergey Levine. Diversity is All You Need: Learning Skills without a Reward Function . In International Conference on Learning Representations, 2019

  15. [15]

    Stochastic Optimization under Hidden Convexity

    Ilyas Fatkhullin, Niao He, and Yifan Hu. Stochastic Optimization under Hidden Convexity . SIAM Journal on Optimization, 35 0 (4): 0 2544--2571, 2025

  16. [16]

    Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization

    Uri Gadot, Esther Derman, Navdeep Kumar, Maxence Mohamed Elfatihi, Kfir Levy, and Shie Mannor. Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization . In AAAI Conference on Artificial Intelligence, 2024

  17. [17]

    Robust Markov Decision Processes: Beyond Rectangularity

    Vineet Goyal and Julien Grand-Clement. Robust Markov Decision Processes: Beyond Rectangularity . Mathematics of Operations Research, 48 0 (1): 0 203--226, 2023

  18. [18]

    Scalable First-Order Methods for Robust Mdps

    Julien Grand-Cl \'e ment and Christian Kroer. Scalable First-Order Methods for Robust Mdps . In AAAI Conference on Artificial Intelligence, 2021

  19. [19]

    Provably Efficient Maximum Entropy Exploration

    Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably Efficient Maximum Entropy Exploration . In International Conference on Machine Learning, 2019

  20. [20]

    Generative Adversarial Imitation Learning

    Jonathan Ho and Stefano Ermon. Generative Adversarial Imitation Learning . In Advances in Neural Information Processing Systems, 2016

  21. [21]

    Regularized Policies are Reward Robust

    Hisham Husain, Kamil Ciosek, and Ryota Tomioka. Regularized Policies are Reward Robust . In International Conference on Artificial Intelligence and Statistics, 2021

  22. [22]

    Robust Dynamic Programming

    Garud N Iyengar. Robust Dynamic Programming . Mathematics of Operations Research, 30 0 (2): 0 257--280, 2005

  23. [23]

    Provably Efficient Reinforcement Learning with Linear Function Approximation

    Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably Efficient Reinforcement Learning with Linear Function Approximation . In Conference on Learning Theory, 2020

  24. [24]

    Approximately optimal approximate reinforcement learning

    Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002

  25. [25]

    Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form

    Toshinori Kitamura, Tadashi Kozuno, Wataru Kumagai, Kenta Hoshino, Yohei Hosoe, Kazumi Kasaura, Masashi Hamaya, Paavo Parmas, and Yutaka Matsuo. Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form . In International Conference on Learning Representations, 2025

  26. [26]

    Policy Gradient for Rectangular Robust Markov Decision Processes

    Navdeep Kumar, Esther Derman, Matthieu Geist, Kfir Y Levy, and Shie Mannor. Policy Gradient for Rectangular Robust Markov Decision Processes . In Advances in Neural Information Processing Systems, 2024

  27. [27]

    Policy gradient algorithms for robust mdps with non-rectangular uncertainty sets.arXiv preprint arXiv:2305.19004, 2023

    Mengmeng Li, Daniel Kuhn, and Tobias Sutter. Policy Gradient Algorithms for Robust MDPs with Non-Rectangular Uncertainty Sets . arXiv preprint arXiv:2305.19004, 2023

  28. [28]

    First-Order Policy Optimization for Robust Markov Decision Process

    Yan Li, Guanghui Lan, and Tuo Zhao. First-Order Policy Optimization for Robust Markov Decision Process . arXiv preprint arXiv:2209.10579, 2022

  29. [29]

    Rectified Robust Policy Optimization for Model-Uncertain Constrained Reinforcement Learning without Strong Duality

    Shaocong Ma, Ziyi Chen, Yi Zhou, and Heng Huang. Rectified Robust Policy Optimization for Model-Uncertain Constrained Reinforcement Learning without Strong Duality . Transactions on Machine Learning Research, 2025

  30. [30]

    Lightning Does Not Strike Twice: Robust MDPs with Coupled Uncertainty

    Shie Mannor, Ofir Mebel, and Huan Xu. Lightning Does Not Strike Twice: Robust MDPs with Coupled Uncertainty . In International Conference on Machine Learning, 2012

  31. [31]

    On the Global Convergence Rates of Softmax Policy Gradient Methods

    Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the Global Convergence Rates of Softmax Policy Gradient Methods . In International Conference on Machine Learning, 2020

  32. [32]

    Robust Control of Markov Decision Processes with Uncertain Transition Matrices

    Arnab Nilim and Laurent El Ghaoui. Robust Control of Markov Decision Processes with Uncertain Transition Matrices . Operations Research, 53 0 (5): 0 780--798, 2005

  33. [33]

    Sequential Decision-Making under Uncertainty: A Robust MDPs Review

    Wenfan Ou and Sheng Bi. Sequential Decision-Making under Uncertainty: A Robust MDPs Review . Annals of Operations Research, 353 0 (3): 0 1239--1285, 2025

  34. [34]

    Markov Decision Processes: Discrete Stochastic Dynamic Programming

    Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming . John Wiley & Sons, Inc., 1994

  35. [35]

    On the Moreau Envelope Properties of Weakly Convex Functions

    Marien Renaud, Arthur Leclaire, and Nicolas Papadakis. On the Moreau Envelope Properties of Weakly Convex Functions . arXiv preprint arXiv:2509.13960, 2025

  36. [36]

    Variational Analysis , volume 317

    R Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis , volume 317. Springer Science & Business Media, 2009

  37. [37]

    Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty

    Reazul Hasan Russel, Mouhacine Benosman, and Jeroen Van Baar. Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty . arXiv preprint arXiv:2010.04870, 2020

  38. [38]

    On General Minimax Theorems

    Maurice Sion. On General Minimax Theorems . Pacific Journal of Mathematics, 8 0 (1): 0 171 -- 176, 1958

  39. [39]

    Taguchi, G

    G. Taguchi, G. Taguchi, and Asian Productivity Organization. Introduction to Quality Engineering: Designing Quality Into Products and Processes . Asian Productivity Organization, 1986

  40. [40]

    Policy Gradient in Robust MDPs with Global Convergence Guarantee

    Qiuhao Wang, Chin Pang Ho, and Marek Petrik. Policy Gradient in Robust MDPs with Global Convergence Guarantee . In International Conference on Machine Learning, 2023

  41. [41]

    Policy Gradient for Robust Markov Decision Processes

    Qiuhao Wang, Shaohang Xu, Chin Pang Ho, and Marek Petrik. Policy Gradient for Robust Markov Decision Processes . arXiv preprint arXiv:2410.22114, 2024

  42. [42]

    Policy Gradient Method for Robust Reinforcement Learning

    Yue Wang and Shaofeng Zou. Policy Gradient Method for Robust Reinforcement Learning . In International Conference on Machine Learning, 2022

  43. [43]

    Robust Markov Decision Processes

    Wolfram Wiesemann, Daniel Kuhn, and Ber c Rustem. Robust Markov Decision Processes . Mathematics of Operations Research, 38 0 (1): 0 153--183, 2013

  44. [44]

    On the Convergence Rates of Policy Gradient Methods

    Lin Xiao. On the Convergence Rates of Policy Gradient Methods . Journal of Machine Learning Research, 23 0 (282): 0 1--36, 2022

  45. [45]

    Robust Markov Decision Processes without Model Estimation

    Wenhao Yang, Han Wang, Tadashi Kozuno, Scott M Jordan, and Zhihua Zhang. Robust Markov Decision Processes without Model Estimation . arXiv preprint arXiv:2302.01248, 2023

  46. [46]

    Reward is Enough for Convex MDPs

    Tom Zahavy, Brendan O'Donoghue, Guillaume Desjardins, and Satinder Singh. Reward is Enough for Convex MDPs . In Advances in Neural Information Processing Systems, 2021

  47. [47]

    Variational Policy Gradient Method for Reinforcement Learning with General Utilities

    Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, and Mengdi Wang. Variational Policy Gradient Method for Reinforcement Learning with General Utilities . In Advances in Neural Information Processing Systems, 2020

  48. [48]

    Penalized Robust MDPs and Risk-Sensitive MDPs: Equivalence, Policy Gradient, and Sample Complexity

    Runyu Zhang, Yang Hu, and Na Li. Penalized Robust MDPs and Risk-Sensitive MDPs: Equivalence, Policy Gradient, and Sample Complexity . In International Conference on Learning Representations, 2024