Revisiting Subgradient Dominance in Robust MDPs: Counterexamples, Hardness, and Sufficient Conditions
Pith reviewed 2026-05-09 21:59 UTC · model grok-4.3
The pith
Robust MDPs with general compact uncertainty sets do not satisfy subgradient dominance, contrary to prior claims, and optimal policies can be NP-hard to find even when subgradients are computable.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Existing assertions that RMDPs satisfy subgradient dominance for general compact uncertainty sets are false. Even for uncertainty sets with only two elements, the RMDP objective fails to be subgradient-dominant and can possess suboptimal strict local minima. Finding an epsilon-optimal policy is NP-hard in cases with finite transition uncertainty sets and sa-rectangular finite transition uncertainty with finite cost uncertainty, despite efficient subgradient computation. Subgradient dominance holds when, for each policy, either the worst-case transition kernel or the worst-case action-value function is unique.
What carries the argument
The subgradient dominance property, under which the squared norm of any subgradient lower-bounds the optimality gap of a policy, allowing projected subgradient descent to converge to near-optimal policies in polynomial steps.
If this is right
- Projected subgradient descent applied to general RMDPs can converge to policies whose value is bounded away from optimal.
- Convergence guarantees previously derived from subgradient dominance no longer apply to broad classes of robust MDPs.
- Solving for epsilon-optimal policies remains computationally intractable in the identified finite-uncertainty settings even though subgradients can be evaluated efficiently.
- Only the two identified uniqueness conditions restore the dominance property and its associated polynomial convergence for PSD.
Where Pith is reading between the lines
- Algorithm designers for robust sequential decision problems should check uniqueness of worst-case kernels or values before relying on subgradient methods.
- Alternative optimization techniques such as regularized objectives or global search may be needed when uniqueness fails.
- Similar counterexamples and hardness results could apply to other robust optimization problems over sequential policies.
Load-bearing premise
That robust MDPs with arbitrary compact uncertainty sets always satisfy subgradient dominance, independent of whether the worst-case transition or value is unique for each policy.
What would settle it
A concrete RMDP example with a two-element uncertainty set whose objective has a strict local minimum whose value is strictly worse than the global optimum, or a polynomial-time algorithm solving an NP-hard instance of epsilon-optimal policy search under finite transition uncertainty.
Figures
read the original abstract
Projected subgradient descent (PSD) has gained popularity for solving robust Markov decision processes (RMDPs) because it applies to a broader class of uncertainty sets than traditional dynamic programming. Existing work claims that RMDPs with a general compact uncertainty set satisfy the subgradient dominance property, under which exact PSD converges to an $\varepsilon$-optimal policy in a polynomial number of updates (e.g., Wang et al., 2023). We show that these claims are incorrect. Even when the uncertainty set has cardinality two, the RMDP objective is not subgradient-dominant and can admit suboptimal strict local minima. Moreover, we prove that finding an $\varepsilon$-optimal policy can be NP-hard even in settings where subgradients are efficiently computable: (i) finite transition uncertainty sets and (ii) $sa$-rectangular finite transition uncertainty sets with finite cost uncertainty sets. Finally, we identify two conditions under which RMDPs do satisfy subgradient dominance: when, for each policy, either the worst-case transition kernel or the worst-case action-value function is unique.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper refutes prior claims that robust MDPs with general compact uncertainty sets satisfy the subgradient dominance property. It shows via explicit counterexamples (even for uncertainty sets of cardinality two) that the RMDP objective need not be subgradient-dominant and can admit suboptimal strict local minima. It further proves NP-hardness of computing an ε-optimal policy in two settings where subgradients remain efficiently computable (finite transition uncertainty sets; sa-rectangular finite transition sets with finite cost sets). Two sufficient conditions for subgradient dominance are identified: uniqueness of the worst-case transition kernel or of the worst-case action-value function for each policy.
Significance. If the counterexamples and reductions hold, the work corrects an important misconception in the literature on first-order methods for robust MDPs. It delineates precise limitations of projected subgradient descent and supplies verifiable sufficient conditions under which the method is guaranteed to succeed, thereby guiding algorithm design for robust sequential decision problems.
minor comments (2)
- [Abstract / Introduction] The abstract and introduction would benefit from a brief explicit statement of the two sufficient conditions (unique worst-case kernel vs. unique worst-case Q-function) rather than deferring the precise wording entirely to the later theorem.
- [Preliminaries] Notation for the uncertainty set U and the robust objective J(π) should be introduced once in a dedicated preliminary subsection to avoid repeated re-definition across sections.
Simulated Author's Rebuttal
We thank the referee for their positive summary of our work and for recommending acceptance. The referee correctly identifies the key contributions: explicit counterexamples showing that RMDPs with compact uncertainty sets (even of cardinality two) need not satisfy subgradient dominance and can have suboptimal strict local minima, NP-hardness results for computing ε-optimal policies in settings where subgradients are efficiently computable, and two verifiable sufficient conditions (uniqueness of the worst-case kernel or worst-case action-value function per policy) under which subgradient dominance holds.
Circularity Check
No significant circularity detected
full rationale
The paper refutes prior claims on subgradient dominance for general compact uncertainty sets in RMDPs by providing explicit counterexamples (including for |U|=2) that demonstrate the objective is not subgradient-dominant and admits strict suboptimal local minima. It establishes NP-hardness of ε-optimal policy search via independent complexity reductions in settings with tractable subgradient evaluation (finite transition sets and sa-rectangular finite transition+cost sets). The two sufficient conditions (unique worst-case kernel or unique worst-case Q-function per policy) are stated directly from the problem definitions without any fitted parameters, self-referential derivations, or load-bearing self-citations. All steps are self-contained against external benchmarks and falsifiable via the constructions; no derivation reduces by construction to its inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Finite state and action spaces with compact uncertainty sets
- standard math Standard NP-hardness reductions from known hard problems
Reference graph
Works this paper leans on
-
[1]
On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift
Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift . Journal of Machine Learning Research, 22 0 (98): 0 1--76, 2021
work page 2021
-
[2]
Constrained Markov Decision Processes , volume 7
Eitan Altman. Constrained Markov Decision Processes , volume 7. CRC Press, 1999
work page 1999
-
[3]
Felipe Atenas, Claudia Sagastiz \'a bal, Paulo JS Silva, and Mikhail Solodov. A Unified Analysis of Descent Sequences in Weakly Convex Optimization, Including Convergence Rates for Bundle Methods . SIAM Journal on Optimization, 33 0 (1): 0 89--115, 2023
work page 2023
-
[4]
Solving Uncertain Markov Decision Processes
J Andrew Bagnell, Andrew Y Ng, and Jeff G Schneider. Solving Uncertain Markov Decision Processes . Technical report, Carnegie Mellon University, 2001
work page 2001
-
[5]
On the Global Optimality of Policy Gradient Methods in General Utility Reinforcement Learning
Anas Barakat, Souradip Chakraborty, Peihong Yu, Pratap Tokekar, and Amrit Singh Bedi. On the Global Optimality of Policy Gradient Methods in General Utility Reinforcement Learning . In Advances in Neural Information Processing Systems, 2025
work page 2025
-
[6]
Inverse Reinforcement Learning in Contextual MDPs
Stav Belogolovsky, Philip Korsunsky, Shie Mannor, Chen Tessler, and Tom Zahavy. Inverse Reinforcement Learning in Contextual MDPs . Machine Learning, 110 0 (9): 0 2295--2334, 2021
work page 2021
-
[7]
Convex optimization theory , volume 1
Dimitri Bertsekas. Convex optimization theory , volume 1. Athena Scientific, 2009
work page 2009
-
[8]
Your Policy Regularizer is Secretly an Adversary
Rob Brekelmans, Tim Genewein, Jordi Grau-Moya, Gr \'e goire Del \'e tang, Markus Kunesch, Shane Legg, and Pedro Ortega. Your Policy Regularizer is Secretly an Adversary . Transactions on Machine Learning Research, 2022
work page 2022
-
[9]
Robust Reinforcement Learning with General Utility
Ziyi Chen, Yan Wen, Zhengmian Hu, and Heng Huang. Robust Reinforcement Learning with General Utility . In Advances in Neural Information Processing Systems, 2025
work page 2025
-
[10]
Stochastic Model-Based Minimization of Weakly Convex Functions
Damek Davis and Dmitriy Drusvyatskiy. Stochastic Model-Based Minimization of Weakly Convex Functions . SIAM Journal on Optimization, 29 0 (1): 0 207--239, 2019
work page 2019
-
[11]
Twice Regularized MDPs and the Equivalence Between Robustness and Regularization
Esther Derman, Matthieu Geist, and Shie Mannor. Twice Regularized MDPs and the Equivalence Between Robustness and Regularization . In Advances in Neural Information Processing Systems, 2021
work page 2021
-
[12]
Efficiency of Minimizing Compositions of Convex Functions and Smooth Maps
Dmitriy Drusvyatskiy and Courtney Paquette. Efficiency of Minimizing Compositions of Convex Functions and Smooth Maps . Mathematical Programming, 178: 0 503--558, 2019
work page 2019
-
[13]
Efficient Projections onto the L 1-ball for Learning in High Dimensions
John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient Projections onto the L 1-ball for Learning in High Dimensions . In International Conference on Machine Learning, 2008
work page 2008
-
[14]
Diversity is All You Need: Learning Skills without a Reward Function
Benjamin Eysenbach, Julian Ibarz, Abhishek Gupta, and Sergey Levine. Diversity is All You Need: Learning Skills without a Reward Function . In International Conference on Learning Representations, 2019
work page 2019
-
[15]
Stochastic Optimization under Hidden Convexity
Ilyas Fatkhullin, Niao He, and Yifan Hu. Stochastic Optimization under Hidden Convexity . SIAM Journal on Optimization, 35 0 (4): 0 2544--2571, 2025
work page 2025
-
[16]
Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization
Uri Gadot, Esther Derman, Navdeep Kumar, Maxence Mohamed Elfatihi, Kfir Levy, and Shie Mannor. Solving Non-Rectangular Reward-Robust MDPs via Frequency Regularization . In AAAI Conference on Artificial Intelligence, 2024
work page 2024
-
[17]
Robust Markov Decision Processes: Beyond Rectangularity
Vineet Goyal and Julien Grand-Clement. Robust Markov Decision Processes: Beyond Rectangularity . Mathematics of Operations Research, 48 0 (1): 0 203--226, 2023
work page 2023
-
[18]
Scalable First-Order Methods for Robust Mdps
Julien Grand-Cl \'e ment and Christian Kroer. Scalable First-Order Methods for Robust Mdps . In AAAI Conference on Artificial Intelligence, 2021
work page 2021
-
[19]
Provably Efficient Maximum Entropy Exploration
Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably Efficient Maximum Entropy Exploration . In International Conference on Machine Learning, 2019
work page 2019
-
[20]
Generative Adversarial Imitation Learning
Jonathan Ho and Stefano Ermon. Generative Adversarial Imitation Learning . In Advances in Neural Information Processing Systems, 2016
work page 2016
-
[21]
Regularized Policies are Reward Robust
Hisham Husain, Kamil Ciosek, and Ryota Tomioka. Regularized Policies are Reward Robust . In International Conference on Artificial Intelligence and Statistics, 2021
work page 2021
-
[22]
Garud N Iyengar. Robust Dynamic Programming . Mathematics of Operations Research, 30 0 (2): 0 257--280, 2005
work page 2005
-
[23]
Provably Efficient Reinforcement Learning with Linear Function Approximation
Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably Efficient Reinforcement Learning with Linear Function Approximation . In Conference on Learning Theory, 2020
work page 2020
-
[24]
Approximately optimal approximate reinforcement learning
Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002
work page 2002
-
[25]
Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form
Toshinori Kitamura, Tadashi Kozuno, Wataru Kumagai, Kenta Hoshino, Yohei Hosoe, Kazumi Kasaura, Masashi Hamaya, Paavo Parmas, and Yutaka Matsuo. Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form . In International Conference on Learning Representations, 2025
work page 2025
-
[26]
Policy Gradient for Rectangular Robust Markov Decision Processes
Navdeep Kumar, Esther Derman, Matthieu Geist, Kfir Y Levy, and Shie Mannor. Policy Gradient for Rectangular Robust Markov Decision Processes . In Advances in Neural Information Processing Systems, 2024
work page 2024
-
[27]
Mengmeng Li, Daniel Kuhn, and Tobias Sutter. Policy Gradient Algorithms for Robust MDPs with Non-Rectangular Uncertainty Sets . arXiv preprint arXiv:2305.19004, 2023
-
[28]
First-Order Policy Optimization for Robust Markov Decision Process
Yan Li, Guanghui Lan, and Tuo Zhao. First-Order Policy Optimization for Robust Markov Decision Process . arXiv preprint arXiv:2209.10579, 2022
-
[29]
Shaocong Ma, Ziyi Chen, Yi Zhou, and Heng Huang. Rectified Robust Policy Optimization for Model-Uncertain Constrained Reinforcement Learning without Strong Duality . Transactions on Machine Learning Research, 2025
work page 2025
-
[30]
Lightning Does Not Strike Twice: Robust MDPs with Coupled Uncertainty
Shie Mannor, Ofir Mebel, and Huan Xu. Lightning Does Not Strike Twice: Robust MDPs with Coupled Uncertainty . In International Conference on Machine Learning, 2012
work page 2012
-
[31]
On the Global Convergence Rates of Softmax Policy Gradient Methods
Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the Global Convergence Rates of Softmax Policy Gradient Methods . In International Conference on Machine Learning, 2020
work page 2020
-
[32]
Robust Control of Markov Decision Processes with Uncertain Transition Matrices
Arnab Nilim and Laurent El Ghaoui. Robust Control of Markov Decision Processes with Uncertain Transition Matrices . Operations Research, 53 0 (5): 0 780--798, 2005
work page 2005
-
[33]
Sequential Decision-Making under Uncertainty: A Robust MDPs Review
Wenfan Ou and Sheng Bi. Sequential Decision-Making under Uncertainty: A Robust MDPs Review . Annals of Operations Research, 353 0 (3): 0 1239--1285, 2025
work page 2025
-
[34]
Markov Decision Processes: Discrete Stochastic Dynamic Programming
Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming . John Wiley & Sons, Inc., 1994
work page 1994
-
[35]
On the Moreau Envelope Properties of Weakly Convex Functions
Marien Renaud, Arthur Leclaire, and Nicolas Papadakis. On the Moreau Envelope Properties of Weakly Convex Functions . arXiv preprint arXiv:2509.13960, 2025
-
[36]
Variational Analysis , volume 317
R Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis , volume 317. Springer Science & Business Media, 2009
work page 2009
-
[37]
Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty
Reazul Hasan Russel, Mouhacine Benosman, and Jeroen Van Baar. Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty . arXiv preprint arXiv:2010.04870, 2020
-
[38]
Maurice Sion. On General Minimax Theorems . Pacific Journal of Mathematics, 8 0 (1): 0 171 -- 176, 1958
work page 1958
-
[39]
G. Taguchi, G. Taguchi, and Asian Productivity Organization. Introduction to Quality Engineering: Designing Quality Into Products and Processes . Asian Productivity Organization, 1986
work page 1986
-
[40]
Policy Gradient in Robust MDPs with Global Convergence Guarantee
Qiuhao Wang, Chin Pang Ho, and Marek Petrik. Policy Gradient in Robust MDPs with Global Convergence Guarantee . In International Conference on Machine Learning, 2023
work page 2023
-
[41]
Policy Gradient for Robust Markov Decision Processes
Qiuhao Wang, Shaohang Xu, Chin Pang Ho, and Marek Petrik. Policy Gradient for Robust Markov Decision Processes . arXiv preprint arXiv:2410.22114, 2024
-
[42]
Policy Gradient Method for Robust Reinforcement Learning
Yue Wang and Shaofeng Zou. Policy Gradient Method for Robust Reinforcement Learning . In International Conference on Machine Learning, 2022
work page 2022
-
[43]
Robust Markov Decision Processes
Wolfram Wiesemann, Daniel Kuhn, and Ber c Rustem. Robust Markov Decision Processes . Mathematics of Operations Research, 38 0 (1): 0 153--183, 2013
work page 2013
-
[44]
On the Convergence Rates of Policy Gradient Methods
Lin Xiao. On the Convergence Rates of Policy Gradient Methods . Journal of Machine Learning Research, 23 0 (282): 0 1--36, 2022
work page 2022
-
[45]
Robust Markov Decision Processes without Model Estimation
Wenhao Yang, Han Wang, Tadashi Kozuno, Scott M Jordan, and Zhihua Zhang. Robust Markov Decision Processes without Model Estimation . arXiv preprint arXiv:2302.01248, 2023
-
[46]
Reward is Enough for Convex MDPs
Tom Zahavy, Brendan O'Donoghue, Guillaume Desjardins, and Satinder Singh. Reward is Enough for Convex MDPs . In Advances in Neural Information Processing Systems, 2021
work page 2021
-
[47]
Variational Policy Gradient Method for Reinforcement Learning with General Utilities
Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, and Mengdi Wang. Variational Policy Gradient Method for Reinforcement Learning with General Utilities . In Advances in Neural Information Processing Systems, 2020
work page 2020
-
[48]
Penalized Robust MDPs and Risk-Sensitive MDPs: Equivalence, Policy Gradient, and Sample Complexity
Runyu Zhang, Yang Hu, and Na Li. Penalized Robust MDPs and Risk-Sensitive MDPs: Equivalence, Policy Gradient, and Sample Complexity . In International Conference on Learning Representations, 2024
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.